OpenAI’s Sora is powered by Diffusion Transformer (DiT): What is it?


In February, OpenAI stunned the world with its latest AI model, Sora. The new offering from the Sam Altman-led company can take prompts in natural language and generate minute-long videos in high definition. The model, which arrived after Runway’s Gen-2 and Google’s Lumiere, showcased video-generation capabilities striking enough that some believe it could transform filmmaking in the future.

At present, two kinds of models are fuelling AI innovation: transformers and diffusion models. These are architectures that have redefined the landscape of machine learning (a subset of AI) applications. Transformer-based models have radically changed how machine learning systems classify and generate text. Diffusion models, on the other hand, have become the architecture of choice for image-generating AI.

Diffusion models take their name from the physical process of diffusion, the spreading of particles from a denser region to a less dense one. In machine learning, they work by gradually adding noise to data and then learning to reverse that corruption. Sora is not a large language model (LLM); it is a diffusion transformer model. In this article, we will look at what a diffusion transformer is and how it differs from other AI models.
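To make the idea concrete, here is a minimal sketch in Python of the “forward” half of diffusion: repeatedly blending data with Gaussian noise according to a noise schedule. The schedule and shapes below are illustrative placeholders, not anything taken from Sora; the point is only that a model trained to undo this corruption can later generate data starting from pure noise.

```python
# A minimal sketch of forward diffusion: mixing data with Gaussian noise.
# Schedule values and shapes are illustrative, not Sora's actual settings.
import numpy as np

def add_noise(x0, t, alphas_cumprod):
    """Return a noisy version of x0 at diffusion step t (DDPM-style noising)."""
    rng = np.random.default_rng()
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]                                 # how much signal survives at step t
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise

# Example: a linear beta schedule over 1000 steps (a common, illustrative choice).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = np.random.default_rng(0).standard_normal((4, 4))        # stand-in "image"
x_noisy, eps = add_noise(x0, t=500, alphas_cumprod=alphas_cumprod)
```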

What is a Diffusion Transformer?

Diffusion Transformer, also written as DiT, is a class of diffusion models based on the transformer architecture. DiT was developed in 2023 by William Peebles, then at UC Berkeley and now a research scientist at OpenAI, and Saining Xie at New York University. It aims to improve the performance of diffusion models by replacing the commonly used U-Net backbone (an architecture employed in diffusion models for iterative image denoising) with a transformer.
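As a rough illustration, the sketch below shows what a single transformer block operating on latent patch tokens might look like, with a conditioning vector modulating the normalisation layers in the spirit of the adaLN design described in the DiT paper. The dimensions and the block itself are simplified assumptions for illustration, not Sora’s actual architecture.

```python
# A simplified, hypothetical DiT-style block: a transformer block over patch
# tokens whose normalisation is scaled/shifted by a conditioning vector.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # The conditioning vector (timestep + prompt) produces per-block scales, shifts and gates.
        self.to_modulation = nn.Linear(dim, 6 * dim)

    def forward(self, tokens, cond):
        # tokens: (batch, num_patches, dim) latent patch tokens; cond: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.to_modulation(cond).chunk(6, dim=-1)
        h = self.norm1(tokens) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + gate1.unsqueeze(1) * attn_out
        h = self.norm2(tokens) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        tokens = tokens + gate2.unsqueeze(1) * self.mlp(h)
        return tokens
```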


Let’s simplify this: imagine you have a big jigsaw puzzle to solve, but you don’t know what the whole picture looks like. So you figure it out one piece at a time. DiT is a special way of solving this puzzle. Usually, a U-Net is used to solve it, but DiT uses a transformer instead. You can think of U-Net as one way of organising and understanding the puzzle pieces; it just may not be the best tool every time. In simple words, DiT is a new and improved tool for solving big puzzles, such as understanding complicated pictures or data.

When it comes to Sora, DiT combines the concept of diffusion for predicting videos with the strength of transformers for next-level scaling. This can be broken down into two questions: what happens inside Sora after you give it a prompt, and how does it employ the diffusion transformer?

How does it all translate into videos?

Based on a LinkedIn post by Professor Tom Yeh of the University of Colorado Boulder, here we attempt to simplify the journey from prompt to video. Imagine you enter the prompt ‘Sora is sky’. Sora splits a related video (from its dataset) into small parts called patches, similar to breaking it into smaller puzzle pieces. Each patch is then compressed into a simpler, lower-dimensional version, much like a summary, which helps the model understand the video better.
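This “breaking into puzzle pieces” step can be sketched as follows: a compressed (latent) video tensor is cut into fixed-size patches and each patch is flattened into a token the transformer can work on. The tensor shapes and patch size below are invented for the example, not Sora’s real values.

```python
# Illustrative patchifying: cut a latent video into patches and flatten each into a token.
import torch

def patchify(latent, patch=2):
    """latent: (frames, channels, height, width) -> (num_patches, patch_dim) tokens."""
    f, c, h, w = latent.shape
    # Split each frame into non-overlapping patch x patch squares.
    x = latent.reshape(f, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)                      # (f, h/p, w/p, c, p, p)
    tokens = x.reshape(f * (h // patch) * (w // patch), c * patch * patch)
    return tokens

latent = torch.randn(16, 4, 32, 32)       # a made-up 16-frame compressed video
tokens = patchify(latent)                 # -> (4096, 16) patch tokens
```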


In the next step, some random elements (noise) are added to the summarised parts. Then comes the conditioning stage, where the prompt ‘Sora is sky’ is turned into numbers (embeddings) and mixed with the noisy patches. This is what lets the model adjust the video based on the prompt. In the next stage, the model uses an attention function to focus on different parts of the video and figure out what is important.
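A hedged sketch of this conditioning step: the diffusion timestep is turned into a sinusoidal embedding and combined with a numerical embedding of the prompt into a single conditioning vector, which is what steers the transformer blocks. The text embedding here is a random stand-in; a real system would use a learned text encoder.

```python
# Building a conditioning vector from a timestep embedding and a (placeholder) text embedding.
import math
import torch

def timestep_embedding(t, dim=512):
    """Standard sinusoidal embedding of the diffusion step t."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])

# Stand-in for the encoded prompt "Sora is sky"; in practice a learned text encoder produces this.
text_embedding = torch.randn(512)
cond = timestep_embedding(torch.tensor(500.0)) + text_embedding   # combined conditioning vector
```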

Later, in the attention pooling stage, the model focuses on the important parts of the video, guided by the prompt and the added noise. Using all this information, it tries to guess what the noise looks like in different parts of the video. The model then gathers the key details, combines them, and makes predictions about what should come next. When a guess isn’t perfect, Sora learns from its mistakes and tries to do better. Finally, in the last stage, Sora reveals the finished video without all the extra noise, leaving it smooth and clear.
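The “guessing the noise and learning from mistakes” part corresponds, in standard diffusion training, to predicting the added noise and minimising the squared error against it. The snippet below is a generic sketch of that objective, with `model` as a placeholder network, not Sora’s training code; at generation time the same prediction is what lets the noise be stripped away step by step.

```python
# Generic diffusion training step: corrupt the clean tokens, ask the model to
# predict the noise, and penalise the squared error of that guess.
import torch
import torch.nn.functional as F

def training_step(model, clean_tokens, cond, alphas_cumprod, t):
    noise = torch.randn_like(clean_tokens)
    a_bar = alphas_cumprod[t]
    noisy = (a_bar ** 0.5) * clean_tokens + ((1 - a_bar) ** 0.5) * noise
    pred_noise = model(noisy, cond)              # the network's guess at the added noise
    loss = F.mse_loss(pred_noise, noise)         # "learning from its mistakes"
    return loss
```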

In simple words, DiT helps Sora understand text prompts and make cool videos by breaking them down into smaller parts, adding a bit of randomness, and then cleaning things up based on the text.

Advantages of DiT

DiT deploys transformers in a latent diffusion process, in which noise is gradually transformed into the target image by reversing the diffusion process under the guidance of a transformer network. The concept of diffusion timesteps is a key aspect of DiT. To simplify: DiT makes pictures by using transformers to change a noisy picture, bit by bit, into the one you want. Think of it as cleaning up a blurry image step by step. The diffusion timesteps act like checkpoints: at each checkpoint, DiT looks at what the picture currently looks like and decides how to make it better. In simple words, it is like the different stages of cooking, where you add different spices at different times.
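These checkpoints can be pictured as a sampling loop: starting from pure noise, the model is applied once per timestep, and each application strips away a little more noise. Below is a generic DDPM-style loop for illustration; `model` and its signature are placeholders, and Sora’s actual sampler is not public.

```python
# Illustrative reverse-diffusion loop: walk the timesteps backwards from pure noise,
# removing the predicted noise a little at a time.
import torch

@torch.no_grad()
def sample(model, cond, shape, betas):
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                 # start from pure noise
    for t in reversed(range(len(betas))):                  # visit the checkpoints in reverse
        pred_noise = model(x, cond, t)                     # placeholder network call
        a, a_bar = alphas[t], alphas_cumprod[t]
        # Standard DDPM mean update: subtract the predicted noise, rescale.
        x = (x - (1 - a) / (1 - a_bar).sqrt() * pred_noise) / a.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(shape)   # re-inject a little noise
    return x
```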

When it comes to scalability, DiT can handle larger inputs without sacrificing performance, which requires efficient resource usage while maintaining sample quality. In natural language tasks, for example, input size can vary widely, and a scalable DiT should handle that variation without performance loss. As data volumes grow, DiT’s ability to scale will be key.
