I meant newer as in "latest half of past year" really. Yes it got more convenient. No it didn't get better, in terms of quality and being less obviously AI, from what I have seen. Which is what I meant.
Last half year?
GPT 1, June 2018
GPT 2, February 2019 (8 months)
GPT 3, May 2020 (15 months)
GPT 3.5, November 2022 (30 months)
GPT 4, March 2023 (4 months)
Now (11 months)
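If you want to sanity-check those gaps, here's a quick Python sketch that computes them from the public announcement dates (the exact days are approximate, and "now" is pegged to the Sora announcement):

```python
from datetime import date

# Approximate public announcement dates for each release listed above.
releases = [
    ("GPT 1", date(2018, 6, 11)),
    ("GPT 2", date(2019, 2, 14)),
    ("GPT 3", date(2020, 5, 28)),
    ("GPT 3.5", date(2022, 11, 30)),
    ("GPT 4", date(2023, 3, 14)),
    ("Now (Sora announcement)", date(2024, 2, 15)),
]

# Print the rough month gap between each consecutive pair.
for (prev_name, prev), (cur_name, cur) in zip(releases, releases[1:]):
    months = (cur.year - prev.year) * 12 + (cur.month - prev.month)
    print(f"{prev_name} -> {cur_name}: ~{months} months")
```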
What a strange metric for plateauing. If we used that, then LLMs would have plateaued in 2019, 2020, 2021, 2022, 2023 and 2024. Now, if you said "AI text generation plateaued in 2019", that would be obviously wrong, because in fact it has continued to develop at a very significant and rapid rate every year since 2018 (aside from arguably 2021, when OpenAI didn't release a new model).
The same is true for text-to-image generation. If you stick an unreasonably short timeframe on it (the last 6 months (E: you actually seem to be saying the last 8 months, with "last half of last year", but that is still way too short a time period)), then sure, there haven't been many fundamental advances. Not none (it has been able to understand and put text in images since DALL-E 3, 4 months ago), but DALL-E 3 isn't a massive leap or anything.
However, if you widen the window to a much more reasonable year instead, then it very much has advanced. Over that timespan both the average quality and the maximum quality have improved. In addition, it is now smarter and has in fact reduced the obvious "this is an AI" tells (hands, text), which also means that yes, it is indeed harder to tell whether an image is AI generated.
Now, obviously between a year ago and now it hasn't gained the ability to trick people who follow or are fluent in the technology, and it still has obvious tells, but there's a pretty huge difference between that and plateauing.
Of course, with the events of a few days ago it seems pretty clear that Sora has pushed image generation far further than what existed beforehand, so the idea of image generation having plateaued is obviously wrong. I have little doubt that if someone claims image/video generation has plateaued 8 months from now, because nothing more advanced than Sora exists by then, that will be proven wrong as well given more time.
One of their videos has been discovered to be 95% source material with some fuzzing. This is hype.
Sauce?
---
Is the Sora AI creating those from actual scratch (well, from its training), or is it doing video2video (I mean each frame of an existing video processed by an AI in the desired/prompted style) like the guys from Corridor Digital did with "Rock, Paper, Scissors" a year ago?
https://www.youtube.com/watch?v=GVT3WUa-48Y
All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.
It can do both, but the ones presented on the main page were text to video.
https://openai.com/research/video-generation-models-as-world-simulators
I do advise people to check out the paper if they are interested in how it works, because it gives quite a bit of detail about both that and what Sora can do in general.
Video generation is way trickier to make usable. Why? Mistakes in the output are way harder to fix. Generated text is trivial to edit (both manually and with automated tools); images are somewhat trickier and require more work, but are absolutely doable. Fixing video requires a lot of effort which may be beyond practical.
It can do video editing no problem. In fact, for smaller things I suspect it's even easier for it, given that there is already a solid world there to base things on and it doesn't have to come up with one on its own.
I do agree that video generation is way harder though.
The first reason is simply compute. A 10 second video at 60 fps has 600 frames, which (if done naively) requires 600 times the compute of a single image generation. Longer videos also require the AI to have a longer "memory" to keep everything consistent. There are almost certainly fancy tricks done here to make things cheaper, but it's still got to be hella expensive computation-wise.
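Just to make that naive math explicit (purely illustrative; it assumes 60 fps and treats one image generation as the unit of cost, ignoring whatever cross-frame sharing Sora actually does):

```python
# Back-of-the-envelope cost of naive frame-by-frame generation.
# Numbers are illustrative assumptions, not anything OpenAI has published.
fps = 60
duration_s = 10
cost_per_image = 1.0                  # cost of generating one image, arbitrary unit

frames = fps * duration_s             # 600 frames for a 10 second clip
naive_cost = frames * cost_per_image  # ~600x the cost of a single image
print(f"{frames} frames -> ~{naive_cost:.0f}x single-image compute")
```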
The second reason is that it not only needs to understand three dimensions, but also needs to maintain continuity across the whole video by keeping a consistent model of the 3D environment.
Thirdly it needs to understand time and how things move through time.
Finally it also needs to understand physics and the physics of every object within the environment to avoid obviously impossible stuff happening.
Sora has demonstrated understanding of all of these issues, although there is obviously some way to go (as shown by the more blatant error videos they posted and by the errors visible even in the good videos).
When I earlier had a look at the Sora examples (on the main link given the other day), various revealing errors were... revealing.
Take the dalmatian at the 'ground' floor window (it wasn't that, much as the cat never got fed treats by the man in the bed, and the rabbit-squirrel never looked up at the fantasy tree): it was clearly a reskinned cat video. A cat making some windowsill-to-windowsill movement (not something even asked for in the prompt text), reskinned with the body of the desired breed of dog (but still moving like a cat) and rendered over the sort-of-desired background (windows of the appropriate types, if not positions), where the notable folded-out shutter absolutely does not impede even the cat-footed dog's movement across it.
Good catch.
As you say, AI in general has proven completely willing to just rip stuff off if it thinks it's what it wants, even if, as in this case, what it wants isn't exactly what it's been asked for.
Sora is a diffusion model [21,22,23,24,25]; given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches.
I am quite a bit more skeptical, though, that the algorithm is similar to morphing, even if in some (many? most? nearly all?) cases the end result is similar in that it draws heavily on some video as a framework, because AFAIK that simply isn't how diffusion in general works at all.
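For anyone curious what "trained to predict the original clean patches" actually looks like, here is a minimal toy sketch of one diffusion training step in PyTorch. Everything here (the function name, the cosine-style noise schedule, predicting the clean patches directly rather than the noise) is my own illustrative assumption about how such models are trained in general, not Sora's actual implementation:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, cond, num_steps=1000):
    # Pick a random noise level (timestep) for each example in the batch.
    t = torch.randint(0, num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)

    # Toy cosine-style noise schedule (made up for illustration).
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2) ** 2
    alpha_bar = alpha_bar.view(-1, *([1] * (x0.dim() - 1)))

    # Corrupt the clean patches with noise.
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise

    # The model sees noisy patches + timestep + conditioning and is trained
    # to recover the original clean patches.
    pred_x0 = model(x_t, t, cond)
    return F.mse_loss(pred_x0, x0)
```

The relevant point for the morphing question: nothing in that loop copies and warps a specific source clip at generation time, so any "heavy borrowing" would have to show up as memorization through the learned weights rather than as an explicit morphing step.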
---
Sora is undoubtedly very expensive and probably requires some of those fancy $20,000+ graphics cards, so I wouldn't be surprised if it cost, say, $10+ per minute to get it to generate a video.
Due to this and the usage restrictions it will have (aka the AI being unwilling to model anything improper/real people/politics, plus big brother OpenAI spying on you), it will probably take quite some time after release for videos to really begin to circulate on the internet.
But in the end, even at $50 per minute it's still way cheaper and faster than, say, hiring your own drone to follow your car down the road or hiring a video firm to make a commercial for you, so companies are totally going to use it even right out of the box.