Wednesday, February 5, 2025

What Are the Main Parts of a Diffusion Model?

Diffusion Model Overview

Take a look at the main components of our diffusion architecture.


The main part is a 3D U-Net, which is well suited to video because video consists of frames that change over time. This U-Net isn’t just a stack of simple layers; it also uses attention, which helps the model focus on the important parts of the video. Temporal attention looks at how frames relate to each other over time, while spatial attention focuses on the different regions within each frame. These attention layers, along with specialized blocks, help the model learn from the video data. To make the generated video match the text prompt we provide, we also condition the model on text embeddings.
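Below is a minimal PyTorch sketch of how spatial and temporal attention could be combined inside such a 3D U-Net block, assuming feature maps shaped (batch, channels, frames, height, width). The class name and layer choices are assumptions made for illustration, not the exact layers of the referenced implementation.

import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    # Illustrative spatial + temporal self-attention for video feature maps
    # shaped (batch, channels, frames, height, width). Names and layer sizes
    # are assumptions for this sketch, not the referenced code.
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(channels)
        self.norm_t = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, h, w = x.shape

        # Spatial attention: each frame attends over its own h*w positions.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * f, h * w, c)
        s_n = self.norm_s(s)
        s = s + self.spatial_attn(s_n, s_n, s_n)[0]

        # Temporal attention: each spatial position attends across the f frames.
        t = s.reshape(b, f, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, f, c)
        t_n = self.norm_t(t)
        t = t + self.temporal_attn(t_n, t_n, t_n)[0]

        # Restore the original (batch, channels, frames, height, width) layout.
        return t.reshape(b, h * w, f, c).permute(0, 3, 2, 1).reshape(b, c, f, h, w)

The spatial pass attends within each frame, and the temporal pass attends across frames at each spatial position, which is the factorized attention pattern described above.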

The model works using a process called diffusion. Think of it like this: first, we add noise to the training videos until they become pure random noise. Then the model learns to undo that noise. To generate a video, it starts with pure noise and gradually removes it using the U-Net, with the text embeddings we provided acting as a guide. The key steps are adding noise and then removing it. The text prompt is converted into embeddings using BERT and passed to the U-Net, which is what enables generating videos from text. By repeating this denoising step many times, we end up with a video that matches the text we gave, letting us make videos from words.
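To make the noising and training steps concrete, here is a rough PyTorch sketch, assuming a hypothetical unet callable that takes the noisy video, the timestep, and the text embedding. The linear beta schedule, function names, and shapes are assumptions for illustration rather than the exact code of the referenced project.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

T = 1000                                       # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)          # hypothetical linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(video, t):
    # Forward diffusion: blend the clean video with Gaussian noise at step t.
    # video is assumed to be shaped (batch, channels, frames, height, width).
    noise = torch.randn_like(video)
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    return a.sqrt() * video + (1.0 - a).sqrt() * noise, noise

def encode_prompt(prompt):
    # Convert the text prompt into embeddings with a pretrained BERT encoder.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encoder = BertModel.from_pretrained("bert-base-uncased")
    tokens = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return encoder(**tokens).last_hidden_state   # (batch, seq_len, 768)

def training_step(unet, video, prompt):
    # One training step: the 3D U-Net (hypothetical callable) predicts the noise
    # that was added, and we penalize the difference with an MSE loss.
    t = torch.randint(0, T, (video.shape[0],))
    noisy, noise = add_noise(video, t)
    predicted_noise = unet(noisy, t, encode_prompt(prompt))
    return F.mse_loss(predicted_noise, noise)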

Instead of working from the original, more complex diagram, let’s look at a simpler, easier-to-follow architecture diagram that matches what we will be coding.




Let’s walk through the flow of the architecture we will be coding:


Input Video: The process begins with a video that we want to learn from or use as a starting point.

UNet3D Encoding: The video goes through the UNet3D’s encoder, which downsamples the video and extracts important features from it.

UNet3D Bottleneck Processing: The shrunken video features are then processed in the UNet3D’s bottleneck, the part of the network where the feature maps have the smallest spatial dimensions.

UNet3D Decoding: Next, the processed features are sent through the UNet3D’s decoder, which enlarges the features back to a video, adding the learned structure.

Text Conditioning: The provided text prompt, used to guide the generation, is converted into a text embedding, which provides input to the UNet3D at various points, allowing the video to be generated accordingly.

Diffusion Process: During training, noise is added to the video and the model learns to remove it. During generation, we start with noise, and the model uses the UNet3D to gradually remove it, producing the video (see the sampling sketch after this list).

Output Video: Finally, we get the output video generated by the model, based on the input video or noise and the text prompt.
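To tie the flow together, here is a hedged sketch of the generation loop, reusing the hypothetical unet and encode_prompt pieces sketched earlier. The DDPM-style update and the default shapes are illustrative, not the exact code of the referenced repository.

import torch

@torch.no_grad()
def generate_video(unet, text_emb, shape=(1, 3, 16, 64, 64), T=1000):
    # Reverse diffusion: start from pure noise and denoise step by step,
    # guided by the text embedding. Shapes and schedule are illustrative.
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                            # output starts as pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        predicted_noise = unet(x, t_batch, text_emb)  # U-Net predicts the noise, guided by text

        # DDPM-style update: subtract the predicted noise, then, except at the
        # final step, re-inject a small amount of fresh noise.
        coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * predicted_noise) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                          # the generated video tensor

After training, you would call something like generate_video(unet, encode_prompt("a prompt of your choice")) and then rescale and save the resulting frames as a video file; the exact post-processing depends on how the data was normalized.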

References:

https://levelup.gitconnected.com/building-a-tiny-text-to-video-model-from-scratch-using-python-f31bdab12fbb

https://github.com/FareedKhan-dev/text2video-from-scratch


