Introduction
Imagine trying to describe a masterpiece painting to an artist, but being limited to just a few words. Frustrating, right? That’s exactly the challenge we face with AI image generation. While models like Stable Diffusion have revolutionized digital art creation, they come with a peculiar limitation: their CLIP text encoder can only process 77 tokens (roughly 50-60 words) at once. But what if we could break free from these constraints?
In this article, we’ll explore a solution that allows us to use detailed, lengthy prompts for AI image generation, and we’ll see exactly how different approaches affect the final results.
Why do that? CLIP is the text encoder that ships natively with diffusers pipelines, and it can yield good results on very limited resources. The problem is prompt length: 77 tokens. I tested these CLIP embedding operations on 32 GB of RAM and 6 GB of VRAM – a 1070Ti, which is quite old. Let’s see how the code is constructed.
Deep Dive into the Building Blocks
Stable Diffusion: The AI Artist
Stable Diffusion isn’t just any AI model – it’s a sophisticated latent diffusion model that works by:
- Converting images into a compressed latent space
- Learning to reverse the noise addition process
- Using guidance from CLIP to ensure the generated image matches your description
In our implementation, we’re using the “dreamlike-art/dreamlike-diffusion-1.0” model, which specializes in creating artistic, dreamlike images with enhanced color vibrancy and composition.
model_id = "dreamlike-art/dreamlike-diffusion-1.0"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
CLIP: The Bilingual Translator
CLIP (Contrastive Language-Image Pre-training) is fascinating because it creates a shared space where both text and images can be represented. Here’s how it works:
- Tokenization: Converts text into tokens
tokenizer = CLIPTokenizer.from_pretrained(clip_model_name)
inputs = tokenizer(
    prompt, return_tensors="pt", padding=True, truncation=True, max_length=77
)
- Encoding: Transforms tokens into embeddings
text_encoder = CLIPTextModel.from_pretrained(clip_model_name)
embeddings = text_encoder(inputs.input_ids).last_hidden_state
- Multi-modal Understanding: Creates a bridge between text and image features
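To make that shared space concrete, here is a small illustrative snippet (not part of the article’s code, just a sketch of how you might poke at CLIP directly) that scores an image against two candidate captions using the same openai/clip-vit-large-patch14 checkpoint we rely on later; the image path is hypothetical:
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model_name = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(clip_model_name)
processor = CLIPProcessor.from_pretrained(clip_model_name)

image = Image.open("some_image.png")  # any image you have on disk (hypothetical path)
texts = ["a lush jungle full of dinosaurs", "a snowy mountain at night"]

# Both modalities land in the same embedding space, so they can be compared directly
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # probability-like score per caption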
The Token Challenge
The 77-token limit isn’t arbitrary – CLIP’s text encoder only has positional embeddings for 77 positions, so a single forward pass simply cannot represent a longer prompt. Here’s what happens when we hit this limit:
- Standard approach: Truncates everything after 77 tokens
- Our solution: Process in chunks and intelligently combine them
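You can see the difference directly with the tokenizer (an illustrative check, not code from the repo):
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
long_prompt = "a lush, vibrant Jurassic jungle with towering ferns and grazing sauropods " * 20

truncated = tokenizer(long_prompt, truncation=True, max_length=77, return_tensors="pt")
full = tokenizer(long_prompt, truncation=False, return_tensors="pt")

print(truncated.input_ids.shape[-1])  # 77: everything after that is silently dropped
print(full.input_ids.shape[-1])       # full length, which we can split into 77-token chunks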
Breaking Through Limits: Our Approach
The Chunking Strategy
We’ve developed two methods to handle long prompts:
- Concatenation Method
def process_chunks_concatenate(chunks, encoder):
    chunk_embeddings = []
    for chunk in chunks:
        embedding = encoder(chunk).last_hidden_state
        chunk_embeddings.append(embedding)
    return torch.cat(chunk_embeddings, dim=1)
- Averaging Method
def process_chunks_average(chunks, encoder):
    chunk_embeddings = []
    for chunk in chunks:
        embedding = encoder(chunk).last_hidden_state
        chunk_embeddings.append(embedding)
    return torch.mean(torch.stack(chunk_embeddings), dim=0)
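The two methods also produce differently shaped embeddings: with n chunks of 77 tokens, concatenation yields a tensor of shape (1, 77 * n, 768) for this CLIP variant, while averaging collapses everything back to (1, 77, 768). A quick sanity check, assuming chunk_embeddings already holds the encoded chunks:
# chunk_embeddings: list of n tensors, each of shape (1, 77, 768)
concatenated = torch.cat(chunk_embeddings, dim=1)            # (1, 77 * n, 768)
averaged = torch.mean(torch.stack(chunk_embeddings), dim=0)  # (1, 77, 768)
print(concatenated.shape, averaged.shape)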
Detailed Analysis of Results
Test Case 1: Short Prompt
DINOS_SHORT = """
A lush, vibrant Jurassic jungle sprawls before us, its dense foliage and towering canopy creating a tapestry of greens that stretches as far as the eye can see. The air is thick with humidity, and the scent of blooming flowers and decaying vegetation hangs heavy over the landscape.
"""
Test Case 2: Long Prompt with Concatenation
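Here the full DINOS prompt (reproduced in the Complete Implementation below) is chunked, encoded, and concatenated before being handed to the pipeline via dream_model_long:
# Uses dream_model_long from the Complete Implementation section below
_ = dream_model_long("test_long_prompt_concat", DINOS, width=1024, height=1024)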
Test Case 3: Long Prompt with Averaging
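Same prompt, same helper, but with the chunk embeddings averaged instead of concatenated:
_ = dream_model_long(
    "test_long_prompt_average",
    DINOS,
    aggregation="average",
    width=1024,
    height=1024,
)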
Performance Insights
Memory Usage
# Memory optimization for long prompts
@torch.no_grad()  # Reduces memory usage during inference
def process_embeddings(prompt):
    return tokenize_extended_clip(prompt)
Processing Time Comparison
- Short Prompt: ~6 seconds
- Concatenated Long Prompt: ~8-9 seconds
- Averaged Long Prompt: ~7-8 seconds
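To reproduce these numbers on your own hardware, a simple stopwatch around the helpers is enough (illustrative snippet, not part of the repo):
import time

start = time.perf_counter()
_ = dream_model_short("timing_test", DINOS_SHORT, width=1024, height=1024)
# Note: this times the whole helper, including pipeline loading
print(f"short prompt: {time.perf_counter() - start:.1f} s")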
Advanced Tips and Tricks
Optimizing Your Prompts
- Structured Description Format:
prompt = f"""
Setting: {environment_description}
Subject: {main_subject_description}
Lighting: {lighting_details}
Style: {artistic_style}
"""
- Balancing Detail Distribution:
- Front-load important elements
- Use consistent descriptive style
- Include technical specifications last
Mind that these prompts are deliberately not very good; I wanted to see the differences between approaches.
Fine-tuning Generation Parameters
def optimize_generation(
    prompt,
    guidance_scale=7.5,  # Controls prompt adherence
    num_inference_steps=50,  # Affects detail level
    width=1024,
    height=1024,
):
    return pipe(
        prompt,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        width=width,
        height=height,
    ).images[0]
Real-world Applications
1. Professional Photography Direction
photo_prompt = """
Location: Outdoor autumn forest
Subject: Professional model in burgundy dress
Pose: Natural, looking towards sunlight
Lighting: Golden hour backlighting
Camera: Low angle, shallow depth of field
Mood: Warm, romantic, ethereal
"""
2. Character Concept Art
character_prompt = """
Character: Young wizard apprentice
Clothing: Flowing blue robes with silver trim
Accessories: Ancient spellbook, crystal wand
Expression: Determined, focused
Environment: Magical library interior
Lighting: Soft magical glow from floating orbs
Style: Semi-realistic fantasy illustration
"""
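Either of these structured prompts can be passed to the optimize_generation helper from the previous section, assuming pipe is already loaded as in the earlier snippet; the output filename below is just an example:
# Assumes pipe and optimize_generation from the earlier sections are in scope
image = optimize_generation(photo_prompt)
image.save("images/autumn_portrait.png")  # hypothetical output path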
Future Enhancements
- Dynamic Chunk Sizing
def adaptive_chunking(prompt, max_tokens=77):
    # Future implementation for smart chunk sizing
    pass
- Weighted Averaging
def weighted_average_embeddings(embeddings, weights):
    # Future implementation for importance-based averaging
    pass
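As a rough sketch of where the weighted variant could go (my own assumption, not code from the repo), the per-chunk weights would simply scale each embedding before the sum:
def weighted_average_embeddings(embeddings, weights):
    # embeddings: list of chunk embeddings, each of shape (1, 77, 768)
    # weights: one importance score per chunk, e.g. larger for earlier chunks
    stacked = torch.stack(embeddings)  # (n_chunks, 1, 77, 768)
    w = torch.tensor(weights, dtype=stacked.dtype, device=stacked.device)
    w = w / w.sum()  # normalize so the weights sum to 1
    return (stacked * w.view(-1, 1, 1, 1)).sum(dim=0)  # back to (1, 77, 768)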
Complete Implementation
The full source code is available at https://github.com/sq5rix/llama in the file multi_dream.py. Read the README for detailed implementation tips.
Conclusion
Our exploration into extending CLIP’s capabilities has shown that we can indeed break free from the 77-token limitation while maintaining image quality. The choice between concatenation and averaging depends on your specific needs:
- Use concatenation when precise detail control is crucial
- Use averaging when you want more natural, balanced results
- Consider short prompts for focused, single-concept images
- Use a better model and a stronger text encoder, such as T5, when you need more capacity
Will the gap between human creativity and AI capability grow smaller?
Sources:
Learning Transferable Visual Models From Natural Language Supervision
Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition
The code in the repo is constantly changing. It can burn your computer 🙂
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer
from utils import count_words
# MODEL = "CompVis/stable-diffusion-v1-4"
MODEL = "dreamlike-art/dreamlike-diffusion-1.0"
CLIP = "openai/clip-vit-large-patch14"
DINOS_SHORT = """
A lush, vibrant Jurassic jungle sprawls before us, its dense foliage and towering canopy creating a tapestry of greens that stretches as far as the eye can see. The air is thick with humidity, and the scent of blooming flowers and decaying vegetation hangs heavy over the landscape.
"""
DINOS = """
A lush, vibrant Jurassic jungle sprawls before us, its dense foliage and towering canopy creating a tapestry of greens that stretches as far as the eye can see. The air is thick with humidity, and the scent of blooming flowers and decaying vegetation hangs heavy over the landscape.
To our left, a herd of massive Apatosaurs grazes on the lush undergrowth, their long necks bent as they reach for the treetops. Their scaly skin glistens in the dappled sunlight filtering through the canopy above, and their gentle lowing echoes through the jungle. Nearby, a smaller group of Camptosaurs feeds on the tender shoots of ferns and cycads, their more compact bodies weaving between the Apatosaur's larger forms.
Deeper in the jungle, a trio of Allosaurs stalks its prey, their sharp eyes scanning the underbrush for any sign of movement. These apex predators are built for speed and stealth, their sleek, muscular bodies capable of reaching incredible velocities as they pursue their unsuspecting quarry. A lone Olorotitan wanders through the jungle, its massive body and distinctive crest marking it out from other hadrosaurs.
In a sun-dappled clearing, a pair of Stegosaurs basks in the warmth, their plates glistening with dew and their spiky tails swishing lazily behind them. Nearby, a lone Ceratosaur patrols the edge of the jungle, its distinctive horns and crested head making it a formidable sight to behold.
As we venture deeper into the jungle, the sounds of distant roaring grow louder. A group of massive Tyrannosaurs moves through the undergrowth, their sharp eyes fixed intently on some unseen target. The air seems to vibrate with tension as they stalk their prey, their massive feet barely making a sound as they move.
In the distance, a flock of Pteranodons soars overhead, their wings beating in unison as they ride the thermals above the jungle. A lone Oviraptor stalks its prey through the underbrush, its sharp eyes scanning for any sign of movement.
The light begins to fade as the sun dips below the horizon, casting long shadows across the jungle floor. The air cools, and the sounds of the jungle begin to change, as nocturnal creatures stir from their daytime slumber. The scent of blooming flowers gives way to the musky aroma of nocturnal predators, and the jungle transforms into a world of mystery and danger.
The camera's eye pans across this vibrant, teeming ecosystem, taking in the intricate web of life that exists within the Jurassic jungle. We see the delicate balance between predator and prey, the adaptability of species to their environment, and the sheer diversity of life that thrives in this ancient world.
"""
def tokenize_extended_clip(prompt, aggregation="concatenate"):
    # Initialize CLIP tokenizer and text encoder
    clip_model_name = CLIP
    tokenizer = CLIPTokenizer.from_pretrained(clip_model_name)
    text_encoder = CLIPTextModel.from_pretrained(clip_model_name).to("cuda")
    # Tokenize the prompt without truncation so nothing is dropped,
    # then split it into chunks of at most 77 tokens each
    inputs = tokenizer(prompt, return_tensors="pt", truncation=False)
    input_ids = inputs.input_ids.squeeze()
    chunk_size = 77
    input_chunks = [
        input_ids[i : i + chunk_size] for i in range(0, len(input_ids), chunk_size)
    ]
    # Encode each chunk independently
    chunk_embeddings = []
    for chunk in input_chunks:
        # Pad the final chunk to 77 tokens so all chunks share the same shape
        if len(chunk) < chunk_size:
            padding = torch.full(
                (chunk_size - len(chunk),), tokenizer.pad_token_id, dtype=chunk.dtype
            )
            chunk = torch.cat([chunk, padding])
        chunk = chunk.unsqueeze(0).to("cuda")  # Add batch dimension and move to GPU
        with torch.no_grad():
            embedding = text_encoder(chunk).last_hidden_state
        chunk_embeddings.append(embedding)
    # Aggregate the embeddings based on the specified method
    if aggregation == "average":
        combined_embedding = torch.mean(torch.stack(chunk_embeddings), dim=0)
    elif aggregation == "concatenate":
        combined_embedding = torch.cat(chunk_embeddings, dim=1)
    else:
        raise ValueError("Aggregation method must be 'average' or 'concatenate'")
    return combined_embedding
def dream_model_long(
    title,
    prompt,
    width=512,
    height=512,
    num_inference_steps=50,
    aggregation="concatenate",
):
    model_id = "dreamlike-art/dreamlike-diffusion-1.0"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    combined_embedding = tokenize_extended_clip(prompt, aggregation)
    image = pipe(
        prompt_embeds=combined_embedding,
        num_inference_steps=num_inference_steps,
        width=width,
        height=height,
        guidance_scale=7.5,
    ).images[0]
    image.save(f"images/{title}.png")
    return f"images/{title}.png"
def dream_model_short(
    title,
    prompt,
    width=512,
    height=512,
    num_inference_steps=50,
):
    model_id = "dreamlike-art/dreamlike-diffusion-1.0"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    image = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        width=width,
        height=height,
        guidance_scale=7.5,
    ).images[0]
    image.save(f"images/{title}.png")
    return f"images/{title}.png"
print("short prompt len: ", count_words(DINOS_SHORT))
print("long prompt len: ", count_words(DINOS))
_ = dream_model_short("test_cut_prompt", DINOS_SHORT, width=1024, height=1024)
_ = dream_model_long("test_long_prompt_concat", DINOS, width=1024, height=1024)
_ = dream_model_long(
    "test_long_prompt_average",
    DINOS,
    aggregation="average",
    width=1024,
    height=1024,
)