OpenAI’s DALL-E 3 could raise the bar in image detail and prompt precision



Summary

A leaker on Discord claims to have access to a new image model from OpenAI. It shows significant progress, especially in rendering text and following prompts.

The leaker first came forward on a Discord channel in May, claiming to be part of an alpha test of a new AI image model from OpenAI. At the time, he shared images generated specifically for the channel, which he said came from the new model.

In mid-July, he reappeared and showed more examples that he claimed to have generated using a “closed alpha” test version of what may or may not be DALL-E 3. The model is currently accessible to about 400 people, according to the leaker.

The leaker was invited via email and claims to have been involved in testing DALL-E and DALL-E 2. According to him, the test version of the new image model is uncensored and can therefore produce scenes of violence and nudity, or copyrighted material such as company logos.


Subway would probably not be happy with this generation, and with so much blood and religion, OpenAI is likely to censor images like this one in the final DALL-E 3. | Image: Kaamalauppias, Discord

The images show the typical DALL-E watermark in the lower right corner, but it could easily be faked. In any case, the new generations surpass the current capabilities of models like Midjourney and SD XL in terms of detail and text rendering.

According to the tester, the results are also “significantly” better than those of Google’s Parti, which was already far ahead of DALL-E 2 when Google presented it about a year ago. For comparison, the leaker tested prompts from the Parti paper. Midjourney, however, is still said to be ahead in photorealistic generations.

Better text rendering and prompt precision

The leaker’s demonstrations show that the potential DALL-E 3 model handles text much better, for example when a prompt includes a phrase that should appear verbatim in the image, as the following example shows.

Typos are part of the original prompt: “an image of an angel holding the sun and moon. above the angel, it says, “BE NOT AFRIAD” in the background is the entire universe. fantasy art, 8k reoslution, beautiful, emotional.” | Image: via Discord

While errors still creep into the words, overall the new model shows a better understanding of language. Interestingly, in the example above, the model writes “afraid” even though the prompt says “afriad,” probably a spelling error that the model corrected. This also suggests that the model does not reproduce prompt text 1:1 in the image.

The new model’s improved language understanding enables it to accurately render even complex image compositions with many abstract details, such as the following cheese-animal scene or the chilled wombat on a beach chair.


There are more potential DALL-E 3 generations in this Discord channel.

OpenAI and others tinker with next-generation image AI

DALL-E 2 was quickly overtaken by Midjourney and Stable Diffusion after its launch, and then got lost in the hype surrounding ChatGPT and GPT-4. Of course, this does not mean that OpenAI has stopped working on image AI systems.

The first sign of this was the introduction of the Bing Image Creator, which according to Microsoft uses a “better version” of DALL-E 2. Details are not known, and even as a de facto “DALL-E 2.5,” the Image Creator’s results are not on the level of Midjourney or Stable Diffusion XL.

Since the introduction of DALL-E 2, a lot has happened in the field of image models, and companies like Meta have introduced new architectures that can generate images and text more efficiently and accurately.

In particular, Meta’s latest image model, CM3leon, seems to match prompts with a similar level of detail as the potential DALL-E 3 generations shown above, at least based on the selected examples. Furthermore, CM3leon was trained exclusively on licensed material.

Earlier this year, Google unveiled Muse, a high-speed AI image model that can also follow prompts more accurately than previous models and generate text.

In April, the OpenAI research team unveiled a new architecture called “Consistency Models,” which generates images much faster than classic diffusion models like DALL-E 2 while maintaining high quality, a possible prelude to video generation.
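The speed claim is easiest to see in the sampling loop. Below is a minimal, hypothetical sketch of the difference: a diffusion model refines noise over many network evaluations, while a consistency model is trained to map a noisy sample straight to a clean image in a single evaluation. The `consistency_fn` here is a toy stand-in for a trained network, not OpenAI’s implementation.

```python
import numpy as np

def consistency_fn(x_t: np.ndarray, t: float) -> np.ndarray:
    # Toy placeholder for a trained network that maps a noisy sample
    # at noise level t directly to a clean estimate. A real consistency
    # model learns this mapping with a neural network.
    return x_t / (1.0 + t)

def diffusion_sample(shape, steps=50, rng=None):
    # Classic diffusion: start from pure noise and denoise iteratively,
    # one small step per iteration -> `steps` network evaluations.
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)
    for i in range(steps, 0, -1):
        t = i / steps
        x0_est = consistency_fn(x, t)   # predict the clean image
        x = x + (x0_est - x) / i        # move a fraction toward it
    return x

def consistency_sample(shape, rng=None):
    # Consistency model: one evaluation maps noise straight to data.
    rng = rng or np.random.default_rng(0)
    x_T = rng.standard_normal(shape)
    return consistency_fn(x_T, t=1.0)

if __name__ == "__main__":
    img = consistency_sample((64, 64, 3))  # 1 network evaluation
    ref = diffusion_sample((64, 64, 3))    # 50 network evaluations
    print(img.shape, ref.shape)
```

With 50 sampling steps, the diffusion loop calls the network 50 times per image, while the consistency sampler calls it once, which is where the speedup comes from.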

So significant advances have been made in AI image models, but they haven’t yet made it into a product. DALL-E 3 may soon change that.
