Saturday, February 24, 2024

How Does OpenAI Sora Work?

We'll dig into Sora as far as we can, but a detailed discussion isn't possible. First, OpenAI is, ironically, not open about how its technology works: the proprietary details and secret sauce that set Sora apart from its competitors haven't been disclosed. Second, I'm not a computer scientist, and you might not be one either, so we can only grasp the general workings of this technology.

The good news is there's an excellent (paid) breakdown of Sora by Mike Young on Medium, based on a technical report from OpenAI that he simplifies for us regular humans. While both documents are worth a read, we can extract the most crucial facts here.

Sora builds on the lessons that companies like OpenAI learned while creating technologies like ChatGPT and DALL-E. Its innovation is in how it's trained on video samples: each video is broken down into "patches," analogous to the "tokens" that ChatGPT's underlying model is trained on. Since all these patches are the same size, differences in clip length, aspect ratio, and resolution aren't an issue for Sora.
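To make the patch idea a bit more concrete, here is a minimal sketch in Python with NumPy. It is not OpenAI's code: the function name `video_to_patches` and the patch sizes are made up for illustration, and it cuts up raw pixels, whereas OpenAI's technical report describes compressing the video into a smaller latent representation first. The point is only that clips of different shapes all turn into lists of equally sized patches.

```python
import numpy as np

def video_to_patches(video, pt=2, ph=4, pw=4):
    """Split a video array of shape (frames, height, width, channels)
    into flattened spacetime patches of identical size.

    pt, ph, pw are hypothetical patch sizes along time, height, and width.
    """
    t, h, w, c = video.shape
    # Crop so every dimension divides evenly (a simplification for this sketch).
    t, h, w = t - t % pt, h - h % ph, w - w % pw
    video = video[:t, :h, :w]
    # Carve the video into a grid of (pt x ph x pw) blocks.
    blocks = video.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    # Bring the three grid axes to the front, then flatten each block.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    return blocks.reshape(-1, pt * ph * pw * c)

# Two clips with different lengths and aspect ratios.
wide_clip = np.zeros((8, 16, 32, 3))
tall_clip = np.zeros((4, 32, 16, 3))

wide_patches = video_to_patches(wide_clip)
tall_patches = video_to_patches(tall_clip)

print(wide_patches.shape)  # (128, 96)
print(tall_patches.shape)  # (64, 96)
```

The two clips produce a different number of patches (128 versus 64), but every patch has the same length (96 values), which is what lets one model handle both.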

Sora combines the same large-scale transformer approach that underpins GPT with the diffusion method used by AI image generators. During training, it looks at noisy, partially corrupted patch tokens from a video and tries to predict what the clean, noise-free tokens would look like. By comparing its prediction to the ground truth, the model learns the "language" of video. That's why the examples on the Sora website look so authentic.
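The "add noise, predict the clean version, compare to the ground truth" loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the noise-mixing formula is a common textbook choice rather than anything OpenAI has confirmed for Sora, and `identity_model` is a stand-in that does no denoising at all, where the real system would use a large transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(clean, noise_level, rng):
    """Blend clean patches with Gaussian noise.

    noise_level runs from 0 (untouched) to 1 (pure noise). This mixing
    rule is an assumption made for the sketch.
    """
    noise = rng.standard_normal(clean.shape)
    return np.sqrt(1.0 - noise_level) * clean + np.sqrt(noise_level) * noise

def denoising_loss(predict_clean, clean, noise_level, rng):
    """Corrupt the patches, ask the model for the clean version,
    and score the guess against the ground truth (mean squared error)."""
    noisy = add_noise(clean, noise_level, rng)
    predicted = predict_clean(noisy, noise_level)
    return float(np.mean((predicted - clean) ** 2))

# Stand-in "video": 128 patches of 96 values each.
clean_patches = rng.standard_normal((128, 96))

# Placeholder model that just returns its noisy input unchanged.
def identity_model(noisy, noise_level):
    return noisy

loss_low = denoising_loss(identity_model, clean_patches, 0.1, rng)
loss_high = denoising_loss(identity_model, clean_patches, 0.9, rng)
print(loss_low < loss_high)  # True: more noise means a worse score
```

Training would repeatedly adjust the model to push this loss down across many videos and noise levels, which is how it gradually gets better at guessing what clean video looks like.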

Beyond this, the video frames Sora is trained on are accompanied by very detailed text descriptions, which is largely why it can shape the generated video according to text prompts.

Sora's ability to accurately simulate physics in videos appears to be an emergent property, arising purely from training on millions of videos containing movement governed by real-world physics. Sora also exhibits excellent object persistence: even when objects leave the frame or are obscured by something else in it, they persist and return seamlessly.

