You often read that AI will take our jobs, sometimes jokingly and sometimes fearfully, but how smart is AI really? First, we need to distinguish between LLMs (Large Language Models) and image-generating AI, which I will simply call generative AI in this text. LLMs include AI such as ChatGPT, which is often (but not always) connected to the internet, which is why many think of it as a glorified Google search. I don’t work a lot with this type of AI, and I am not qualified to judge its intelligence.
Then we have generative AI such as Stable Diffusion, PixArt and Flux. These types of AI can run completely offline and their purpose is to generate images from text, but they also need a kind of language model in order to understand the text prompts that are given to them. The best known of these text encoders is CLIP (Contrastive Language–Image Pre-training).
CLIP
I’m in no way an expert on this subject, but I can give a short overview of the different CLIP models and how they are used.
The original CLIP model that was used for Stable Diffusion 1 and 1.5 is CLIP ViT L 14, which OpenAI trained on a data set of roughly 400 million image-text pairs. When using this CLIP model to generate images with Stable Diffusion 1.5, you mostly listed the different things you wanted to appear in the image. The size of this CLIP model is 1.7GB, and a prompt might have looked something like this:
Black cat, roof, city, sky scrapers
And the result would be this.
With SDXL came CLIP ViT G 14, which was trained on a data set containing 2 billion image-text pairs and focused more on natural language than on lists of keywords. Most often, CLIP ViT L 14 and CLIP ViT G 14 are used together when generating images with SDXL. The size of this CLIP model is 5.47GB, and because it is used together with the other CLIP model, a prompt could contain both listed keywords and a description in natural language.
Fluffy, fat, cyberpunk, sky scrapers, high quality, realistic, photography. A fluffy fat black cat chasing a rat over the roofs in a cyberpunk city
Resulting in an image like this.
The T-5 XXL model
One of the big news items when Stable Diffusion 3 was released was a third text encoder based on Google’s T-5 XXL model, which has a stunning 11 billion parameters. The T-5 XXL encoder focuses even more on natural language, and Stable Diffusion 3 combines it with both CLIP ViT L 14 and CLIP ViT G 14.
This made the prompt adherence excellent, meaning that the output image closely resembles the text that you put in the prompt. That the quality of the images from SD 3 was lower than expected had other causes.
So even generative AI uses language models. To put it all in context, though, GPT-3, the model family that the original ChatGPT was built on, has 175 billion parameters. So even the T-5 XXL model at its 11 billion parameters is small in comparison, but then again, the two models have different purposes.
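To make the size comparison a bit more concrete, here is a minimal Python sketch of the arithmetic. It uses the two parameter counts quoted above; the 2 bytes per parameter (16-bit storage) is my own assumption for illustration, and the function name is made up. Real files can differ, for example when only part of a model is shipped or when a different precision is used.

```python
# Rough size estimate: number of parameters x bytes per parameter.
# The parameter counts are the figures quoted in the text. Storing each
# parameter in 2 bytes (16-bit) is an assumption made for illustration.

BYTES_PER_PARAM_16BIT = 2


def approx_size_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_16BIT) -> float:
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


t5_xxl_params = 11e9   # T-5 XXL, as quoted above
gpt3_params = 175e9    # GPT-3, as quoted above

t5_size = approx_size_gb(t5_xxl_params)
gpt3_size = approx_size_gb(gpt3_params)
ratio = gpt3_params / t5_xxl_params

print(f"T-5 XXL: roughly {t5_size:.0f} GB of weights")
print(f"GPT-3:   roughly {gpt3_size:.0f} GB of weights")
print(f"GPT-3 has about {ratio:.1f} times as many parameters")
```

Under these assumptions the larger model has about 16 times as many parameters, which gives a feeling for the gap between the two.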
So how smart is AI then?
The latest open source generative AI is Flux from Black Forest Labs, and it uses a dual text encoder setup combining T-5 XXL and CLIP ViT L 14. I’m in no way qualified to say why, but Flux is even better at prompt adherence than SD 3. Maybe it is because they skipped the CLIP ViT G 14 model altogether and let the T-5 XXL model handle all the natural language, but I honestly don’t know.
Yesterday I found out something that hints that Flux is a lot smarter than I, at least, thought it was. And to be fully transparent, I did not come up with this myself but found a reference to it on the Facebook page of a Chinese user.
The images below are generated in one go, using only a text prompt. No additional editing has been done.
The model used to generate the comics was the full Flux bf16 model (almost 24 GB in size), and the prompt used to generate these four-panel comic strips is the following.
A four-panel comic, colored, manga,
Panel 1 (Top Left):
Scene: A classroom with a chalkboard in the background.
Character: A young female student with long brown hair wearing a pink dress, sitting at a desk.
Action: The student is looking at a math problem on a piece of paper, scratching her head in confusion.
Text: No dialogue, just a thought bubble with a question mark ("?"),
Panel 2 (Top Right):
Scene: The same classroom.
Character: The same female student, now looking determined.
Action: The student is holding a pencil and writing numbers on the paper.
Text: The female student's thought bubble says, "I think I got it!",
Panel 3 (Bottom Left):
Scene: The same female student walks up to the teachers desk, handing in her paper.
Character: The student smiling confidently, and the teacher, an older woman with glasses, taking the paper.
Action: The teacher is looking at the student's paper with a neutral expression.
Text: No dialogue or thought bubbles.,
Panel 4 (Bottom Right):
Scene: The classroom, showing the same female student at her desk with the graded paper in front of her.
Character: The girl looking happy with a big "A+" on her paper.
Action: The girl is raising her fists in excitement.
Text: The female student thinks, "Yes! I made it!",
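Because the prompt follows such a regular pattern, it can also be put together by a small script. The following is only a sketch of how I imagine that could look, not how the original prompt was written. The names `Panel` and `build_comic_prompt` are made up for this example, and only the first two panels are filled in to keep it short.

```python
from dataclasses import dataclass


@dataclass
class Panel:
    """One comic panel, using the same four fields as the prompt above."""
    position: str
    scene: str
    character: str
    action: str
    text: str


def build_comic_prompt(style: str, panels: list[Panel]) -> str:
    """Join a style line and the panel descriptions into one plain-text prompt."""
    lines = [style]
    for number, panel in enumerate(panels, start=1):
        lines.append(f"Panel {number} ({panel.position}):")
        lines.append(f"Scene: {panel.scene}")
        lines.append(f"Character: {panel.character}")
        lines.append(f"Action: {panel.action}")
        # The original prompt ends every panel with a trailing comma.
        lines.append(f"Text: {panel.text},")
    return "\n".join(lines)


panels = [
    Panel(
        position="Top Left",
        scene="A classroom with a chalkboard in the background.",
        character="A young female student with long brown hair wearing a pink dress, sitting at a desk.",
        action="The student is looking at a math problem on a piece of paper, scratching her head in confusion.",
        text='No dialogue, just a thought bubble with a question mark ("?")',
    ),
    Panel(
        position="Top Right",
        scene="The same classroom.",
        character="The same female student, now looking determined.",
        action="The student is holding a pencil and writing numbers on the paper.",
        text='The female student\'s thought bubble says, "I think I got it!"',
    ),
]

prompt = build_comic_prompt("A four-panel comic, colored, manga,", panels)
print(prompt)
```

The output is still ordinary text. The script only keeps the panel structure consistent, and all of the actual interpretation is still left to the model.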
Now, you might argue that the quality of the images in the comic strips isn’t that high, and maybe you are correct about that. But if you consider how accurately the image matches the text prompt, you will notice several things.
- The model understands that it should split the image into four panels
- The model can differentiate between left and right and top and bottom
- The model can refer back to earlier instructions (e.g. “the same classroom”, “the same female student”)
- The model can differentiate between a “scene”, a “character”, “text” and an “action”
- The model even understands to whom a text bubble belongs, as well as a statement that a panel has “No dialogue or thought bubbles”
The instructions in the text prompt are fairly complicated, especially as they are not written in a programming language but only in natural language.
Remember that this is all done with a natural language model which has 11 billion parameters (T-5 XXL), with the support of a much smaller CLIP model (CLIP ViT L 14), and that none of it is connected to the internet.
This is very impressive to anyone who has worked with computers, maybe especially if you have some insight into programming and understand how exact you have to be when writing commands for a program to work.
As I said at the beginning of this text, I don’t work much with LLMs (such as ChatGPT), but given how impressive Flux has proven to be, I can only imagine how smart a model with 175 billion parameters that is also connected to, and can fetch information directly from, the internet must be.