A leaker on Discord claims to have access to a new image model from OpenAI. The images he has shared show significant progress, especially in rendering text and following prompts.
The leaker first came forward on a Discord channel in May, claiming to be part of an alpha test of a new AI image model from OpenAI. At the time, he backed up the claim with images generated specifically for the channel.
In mid-July, he reappeared and showed more examples that he claimed to have generated using a “closed alpha” test version of what may or may not be DALL-E 3. The model is currently accessible to about 400 people, according to the leaker.
The leaker says he was invited via email and claims to have been involved in testing DALL-E and DALL-E 2. According to him, the test version of the new image model is uncensored and can therefore generate scenes of violence and nudity as well as copyrighted material such as company logos.
The images show the typical DALL-E mark in the lower right corner, but this could easily be faked. In any case, the new generations surpass the current capabilities of models like Midjourney and Stable Diffusion XL in terms of detail and text rendering.
According to the tester, the results are also “significantly” better than those of Google Parti, which was already far ahead of DALL-E 2 when Google presented it about a year ago. For comparison, the leaker tested prompts from the Parti paper. However, Midjourney is said to still be ahead in photorealistic generations.
Better font and prompt precision
The leaker’s demonstrations show that the potential DALL-E 3 model is much better at rendering text, for example when the prompt includes a phrase that should appear in the image, as the following example shows.
While errors still creep into the words, overall the new model shows a better understanding of language. Interestingly, in the example above, the model writes “afraid” even though the prompt says “afriad,” probably a spelling error that the model corrected. This could also mean that the text rendered in the image is not a 1:1 copy of the text in the prompt.
The new model’s improved language understanding enables it to accurately render even complex image compositions with many abstract details, such as the following cheese-animal scene or the wombat chilling on a beach chair.
The example of the cheese animals is particularly impressive because in many models there is a so-called concept spillover, i.e. the image model mixes different content concepts. The potential DALL-E 3 model clearly separates the concepts of the cheese animal and the real animal.
The following Midjourney example with the same prompt illustrates the concept spillover. Here, the cheese has not become a cow, but the image shows three dogs instead of one, and one of them has horns that look like they could be made of cheese.
DALL-E 2 goes all in on cheese, not even trying to put a real animal in the picture, just sticking to a concept.
If you search for the user “Kaamalauppias”, you can find some more potential DALL-E 3 generations in this Discord channel.
OpenAI and others tinker with next-generation image AI
DALL-E 2 was quickly overtaken by Midjourney and Stable Diffusion after its launch, and then got lost in the hype surrounding ChatGPT and GPT-4. Of course, this does not mean that OpenAI has stopped working on image AI systems.
The first sign of this was the introduction of the Bing Image Creator, which according to Microsoft uses a “better version” of DALL-E 2. Details are not known, and even with this “DALL-E 2.5,” the results of the Image Creator are not on the level of Midjourney or Stable Diffusion XL.
Since the introduction of DALL-E 2, a lot has happened in the field of image models in general, and companies like Meta have introduced new architectures that can generate images and text within images more efficiently and with higher accuracy.
In particular, Meta’s latest image model CM3leon, at least based on the selected examples, seems to match prompts with a level of detail similar to that of the potential DALL-E 3 generations shown above. Furthermore, CM3leon has been trained exclusively on licensed material.
Earlier this year, Google unveiled Muse, a high-speed AI image model that can also follow prompts more accurately than previous models and generate text.
In April, the OpenAI research team unveiled a new architecture called “Consistency Models,” which generates images much faster than classic diffusion models like DALL-E 2 while maintaining high quality, a possible prelude to video generation.
So, significant advances in AI image models have been made, but they haven’t made it into a product yet. DALL-E 3 may soon change that.