
Breaking the Limits: Supercharging AI Image Generation with Extended Prompts

Introduction

Imagine trying to describe a masterpiece painting to an artist, but being limited to just a few words. Frustrating, right? That’s exactly the challenge we face with AI image generation. While models like Stable Diffusion have revolutionized digital art creation, they come with a peculiar limitation: they can only process about 77 tokens (roughly 50-60 words) at once. But what if we could break free from these constraints?

In this article, we’ll explore a solution that allows us to use detailed, lengthy prompts for AI image generation, and we’ll see exactly how different approaches affect the final results.

Why bother? CLIP is the text encoder that ships natively with diffusers pipelines, and it can yield good results on very limited resources. The catch is the prompt length limit – 77 tokens. I tested these CLIP embedding operations on 32 GB of RAM and 6 GB of VRAM – a 1070Ti, which is quite old. Let’s see how the code is constructed.

Deep Dive into the Building Blocks

Stable Diffusion: The AI Artist

Stable Diffusion isn’t just any AI model – it’s a sophisticated latent diffusion model that works by:

  1. Converting images into a compressed latent space
  2. Learning to reverse the noise addition process
  3. Using guidance from CLIP to ensure the generated image matches your description

In our implementation, we’re using the “dreamlike-art/dreamlike-diffusion-1.0” model, which specializes in creating artistic, dreamlike images with enhanced color vibrancy and composition.

model_id = "dreamlike-art/dreamlike-diffusion-1.0" pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) pipe = pipe.to("cuda")

CLIP: The Bilingual Translator

CLIP (Contrastive Language-Image Pre-training) is fascinating because it creates a shared space where both text and images can be represented. Here’s how it works:

  1. Tokenization: Converts text into tokens

tokenizer = CLIPTokenizer.from_pretrained(clip_model_name)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=77)

  2. Encoding: Transforms tokens into embeddings

text_encoder = CLIPTextModel.from_pretrained(clip_model_name)
embeddings = text_encoder(inputs.input_ids).last_hidden_state

  3. Multi-modal Understanding: Creates a bridge between text and image features
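
To make that shared space concrete, here is a minimal sketch (not from the article’s repo) that scores an image against a couple of captions with the same CLIP checkpoint used later; the image path and captions are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Same checkpoint as the text encoder used later in this article
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

captions = ["a lush jurassic jungle", "a city street at night"]  # illustrative captions
image = Image.open("some_image.png")  # placeholder path

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = clip(**inputs).logits_per_image  # image-to-text similarity scores
print(logits.softmax(dim=-1))  # higher probability = closer match in the shared space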

The Token Challenge

The 77-token limit isn’t arbitrary – it comes from CLIP’s architecture design. Here’s what happens when we hit this limit:

  • Standard approach: Truncates everything after 77 tokens
  • Our solution: Process the prompt in chunks and intelligently combine them (see the sketch below)
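
A quick way to see the difference is to count chunks. This small sketch (not part of the repo) uses DINOS, the long dinosaur prompt from the full listing at the end of the article:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Standard approach: everything beyond 77 tokens is simply thrown away
truncated_ids = tokenizer(DINOS, truncation=True, max_length=77).input_ids

# Chunking approach: keep every token and split into 77-token pieces
all_ids = tokenizer(DINOS, truncation=False).input_ids
chunks = [all_ids[i : i + 77] for i in range(0, len(all_ids), 77)]

print(len(truncated_ids))  # 77
print(len(chunks))         # several chunks, one per 77 tokens of the prompt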

Breaking Through Limits: Our Approach

The Chunking Strategy

We’ve developed two methods to handle long prompts:

  1. Concatenation Method

def process_chunks_concatenate(chunks, encoder):
    chunk_embeddings = []
    for chunk in chunks:
        embedding = encoder(chunk).last_hidden_state
        chunk_embeddings.append(embedding)
    return torch.cat(chunk_embeddings, dim=1)

  2. Averaging Method

def process_chunks_average(chunks, encoder):
    chunk_embeddings = []
    for chunk in chunks:
        embedding = encoder(chunk).last_hidden_state
        chunk_embeddings.append(embedding)
    return torch.mean(torch.stack(chunk_embeddings), dim=0)
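
The practical difference shows up in the tensor shapes: with CLIP ViT-L/14 (hidden size 768) and N chunks, concatenation keeps every token and returns a [1, 77*N, 768] tensor, while averaging collapses everything back to a single [1, 77, 768] tensor. A hedged usage sketch, assuming input_chunks and text_encoder are prepared as in the full listing at the end of the article:

# Assumes: input_chunks are 77-token id tensors, text_encoder is the CLIP text model on "cuda"
batched_chunks = [chunk.unsqueeze(0).to("cuda") for chunk in input_chunks]
concat_embedding = process_chunks_concatenate(batched_chunks, text_encoder)
avg_embedding = process_chunks_average(batched_chunks, text_encoder)
print(concat_embedding.shape)  # torch.Size([1, 77 * num_chunks, 768])
print(avg_embedding.shape)     # torch.Size([1, 77, 768])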

Detailed Analysis of Results

Test Case 1: Short Prompt

DINOS_SHORT = """
A lush, vibrant Jurassic jungle sprawls before us, its dense foliage and towering canopy creating a tapestry of greens that stretches as far as the eye can see. The air is thick with humidity, and the scent of blooming flowers and decaying vegetation hangs heavy over the landscape.
"""

Test Case 2: Long Prompt with Concatenation

Image generated from the full long prompt, using concatenated CLIP chunk embeddings.

Test Case 3: Long Prompt with Averaging

Averaging the CLIP chunk embeddings yields a muddled blend of the prompt’s ideas and loses clarity.

Performance Insights

Memory Usage

# Memory optimization for long prompts
@torch.no_grad()  # Reduces memory usage during inference
def process_embeddings(chunks):
    return tokenize_extended_clip(chunks)
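
Beyond torch.no_grad(), diffusers has its own memory helpers that should help on a 6 GB card like the 1070Ti used here. A small, optional setup sketch – attention slicing trades a little speed for a lower peak memory footprint:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "dreamlike-art/dreamlike-diffusion-1.0", torch_dtype=torch.float16
)
pipe.enable_attention_slicing()  # compute attention in slices to reduce peak VRAM
pipe = pipe.to("cuda")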

Processing Time Comparison

  • Short Prompt: ~6 seconds
  • Concatenated Long Prompt: ~8-9 seconds
  • Averaged Long Prompt: ~7-8 seconds
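
One way to reproduce such measurements is a simple timing harness around the helpers from the full listing below (each helper reloads the pipeline, so these timings include model loading):

import time

start = time.perf_counter()
dream_model_short("timing_short", DINOS_SHORT, width=1024, height=1024)
print("short prompt:", round(time.perf_counter() - start, 1), "s")

start = time.perf_counter()
dream_model_long("timing_concat", DINOS, width=1024, height=1024)
print("concatenated:", round(time.perf_counter() - start, 1), "s")

start = time.perf_counter()
dream_model_long("timing_average", DINOS, aggregation="average", width=1024, height=1024)
print("averaged:", round(time.perf_counter() - start, 1), "s")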

Advanced Tips and Tricks

Optimizing Your Prompts

  1. Structured Description Format (an example of filling this template follows after this list):

prompt = f"""
Setting: {environment_description}
Subject: {main_subject_description}
Lighting: {lighting_details}
Style: {artistic_style}
"""

  2. Balancing Detail Distribution:
  • Front-load important elements
  • Use a consistent descriptive style
  • Include technical specifications last
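
For instance, the template above could be filled with purely illustrative values like these:

# Illustrative values for the structured template above
environment_description = "misty pine forest at dawn"
main_subject_description = "a lone hiker in a red jacket"
lighting_details = "soft, low-angle morning light"
artistic_style = "cinematic, muted colour palette"

prompt = f"""
Setting: {environment_description}
Subject: {main_subject_description}
Lighting: {lighting_details}
Style: {artistic_style}
"""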

Mind that the prompts in this article are deliberately not very polished – I wanted to see the differences between the approaches, not chase the prettiest image.

Fine-tuning Generation Parameters

def optimize_generation(
    prompt,
    guidance_scale=7.5,        # Controls prompt adherence
    num_inference_steps=50,    # Affects detail level
    width=1024,
    height=1024,
):
    return pipe(
        prompt,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        width=width,
        height=height,
    ).images[0]
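
A possible call, assuming pipe is the fp16 pipeline created earlier and that an images/ directory exists as in the full listing; the parameter values are just a starting point:

# Example call; pipe is assumed to be the fp16 pipeline created earlier
image = optimize_generation(
    DINOS_SHORT,
    guidance_scale=8.0,       # slightly stronger prompt adherence than the default
    num_inference_steps=60,   # more steps for extra detail, at the cost of time
)
image.save("images/optimized_short_prompt.png")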

Real-world Applications

1. Professional Photography Direction

photo_prompt = """ Location: Outdoor autumn forest Subject: Professional model in burgundy dress Pose: Natural, looking towards sunlight Lighting: Golden hour backlighting Camera: Low angle, shallow depth of field Mood: Warm, romantic, ethereal """

2. Character Concept Art

character_prompt = """ Character: Young wizard apprentice Clothing: Flowing blue robes with silver trim Accessories: Ancient spellbook, crystal wand Expression: Determined, focused Environment: Magical library interior Lighting: Soft magical glow from floating orbs Style: Semi-realistic fantasy illustration """

Future Enhancements

  1. Dynamic Chunk Sizing

def adaptive_chunking(prompt, max_tokens=77):
    # Future implementation for smart chunk sizing
    pass

  2. Weighted Averaging (a possible sketch follows below)

def weighted_average_embeddings(embeddings, weights):
    # Future implementation for importance-based averaging
    pass
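
As a purely illustrative sketch of where the weighted-averaging idea could go (not part of the current repo), the weights could be normalised and broadcast across the stacked chunk embeddings:

import torch

def weighted_average_embeddings(embeddings, weights):
    # embeddings: list of [1, 77, 768] chunk embeddings; weights: one float per chunk
    w = torch.tensor(weights, dtype=embeddings[0].dtype, device=embeddings[0].device)
    w = w / w.sum()                    # normalise so the weights sum to 1
    stacked = torch.stack(embeddings)  # [num_chunks, 1, 77, 768]
    return (stacked * w.view(-1, 1, 1, 1)).sum(dim=0)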

Complete Implementation

The full source code is available at https://github.com/sq5rix/llama in the file multi_dream.py. Read the README for detailed implementation tips.

Conclusion

Our exploration into extending CLIP’s capabilities has shown that we can indeed break free from the 77-token limitation while maintaining image quality. The choice between concatenation and averaging depends on your specific needs:

  • Use concatenation when precise detail control is crucial
  • Use averaging when you want more natural, balanced results
  • Consider short prompts for focused, single-concept images
  • Move to a stronger model with a larger text encoder, such as T5, when you outgrow CLIP

Will the gap between human creativity and AI capability grow smaller?

Sources:

Basic paper: Learning Transferable Visual Models From Natural Language Supervision

Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition


The code in the repo is constantly changing. It can burn your computer 🙂

import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer
from utils import count_words
# MODEL = "CompVis/stable-diffusion-v1-4"
MODEL = "dreamlike-art/dreamlike-diffusion-1.0"
CLIP = "openai/clip-vit-large-patch14"
DINOS_SHORT = """
A lush, vibrant Jurassic jungle sprawls before us, its dense foliage and towering canopy creating a tapestry of greens that stretches as far as the eye can see. The air is thick with humidity, and the scent of blooming flowers and decaying vegetation hangs heavy over the landscape.
"""
DINOS = """
A lush, vibrant Jurassic jungle sprawls before us, its dense foliage and towering canopy creating a tapestry of greens that stretches as far as the eye can see. The air is thick with humidity, and the scent of blooming flowers and decaying vegetation hangs heavy over the landscape.
To our left, a herd of massive Apatosaurs grazes on the lush undergrowth, their long necks bent as they reach for the treetops. Their scaly skin glistens in the dappled sunlight filtering through the canopy above, and their gentle lowing echoes through the jungle. Nearby, a smaller group of Camptosaurs feeds on the tender shoots of ferns and cycads, their more compact bodies weaving between the Apatosaur's larger forms.
Deeper in the jungle, a trio of Allosaurs stalks its prey, their sharp eyes scanning the underbrush for any sign of movement. These apex predators are built for speed and stealth, their sleek, muscular bodies capable of reaching incredible velocities as they pursue their unsuspecting quarry. A lone Olorotitan wanders through the jungle, its massive body and distinctive crest marking it out from other hadrosaurs.
In a sun-dappled clearing, a pair of Stegosaurs basks in the warmth, their plates glistening with dew and their spiky tails swishing lazily behind them. Nearby, a lone Ceratosaur patrols the edge of the jungle, its distinctive horns and crested head making it a formidable sight to behold.
As we venture deeper into the jungle, the sounds of distant roaring grow louder. A group of massive Tyrannosaurs moves through the undergrowth, their sharp eyes fixed intently on some unseen target. The air seems to vibrate with tension as they stalk their prey, their massive feet barely making a sound as they move.
In the distance, a flock of Pteranodons soars overhead, their wings beating in unison as they ride the thermals above the jungle. A lone Oviraptor stalks its prey through the underbrush, its sharp eyes scanning for any sign of movement.
The light begins to fade as the sun dips below the horizon, casting long shadows across the jungle floor. The air cools, and the sounds of the jungle begin to change, as nocturnal creatures stir from their daytime slumber. The scent of blooming flowers gives way to the musky aroma of nocturnal predators, and the jungle transforms into a world of mystery and danger.
The camera's eye pans across this vibrant, teeming ecosystem, taking in the intricate web of life that exists within the Jurassic jungle. We see the delicate balance between predator and prey, the adaptability of species to their environment, and the sheer diversity of life that thrives in this ancient world.
"""
def tokenize_extended_clip(prompt, aggregation="concatenate"):
    # Initialize CLIP tokenizer and text encoder
    clip_model_name = CLIP
    tokenizer = CLIPTokenizer.from_pretrained(clip_model_name)
    text_encoder = CLIPTextModel.from_pretrained(clip_model_name).to("cuda")
    # Tokenize the full prompt without truncation, then split it into chunks of 77 tokens each
    inputs = tokenizer(prompt, return_tensors="pt", truncation=False)
    input_ids = inputs.input_ids.squeeze()
    chunk_size = 77
    input_chunks = [
        input_ids[i : i + chunk_size] for i in range(0, len(input_ids), chunk_size)
    ]
    # Pad the last chunk to a full 77 tokens so the averaging path can stack equal-length chunks
    if len(input_chunks[-1]) < chunk_size:
        pad = torch.full(
            (chunk_size - len(input_chunks[-1]),),
            tokenizer.pad_token_id,
            dtype=input_chunks[-1].dtype,
        )
        input_chunks[-1] = torch.cat([input_chunks[-1], pad])
    # Encode each chunk independently
    chunk_embeddings = []
    for chunk in input_chunks:
        chunk = chunk.unsqueeze(0).to("cuda")  # Add batch dimension and move to GPU
        with torch.no_grad():
            embedding = text_encoder(chunk).last_hidden_state
        chunk_embeddings.append(embedding)
    # Aggregate the embeddings based on the specified method
    if aggregation == "average":
        combined_embedding = torch.mean(torch.stack(chunk_embeddings), dim=0)
    elif aggregation == "concatenate":
        combined_embedding = torch.cat(chunk_embeddings, dim=1)
    else:
        raise ValueError("Aggregation method must be 'average' or 'concatenate'")
    return combined_embedding
def dream_model_long(
    title,
    prompt,
    width=512,
    height=512,
    num_inference_steps=50,
    aggregation="concatenate",
):
    model_id = "dreamlike-art/dreamlike-diffusion-1.0"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    combined_embedding = tokenize_extended_clip(prompt, aggregation)
    # Classifier-free guidance needs an unconditional embedding of the same length,
    # otherwise the pipeline would try to encode an empty prompt longer than 77 tokens itself
    uncond_embedding = tokenize_extended_clip("", aggregation)
    if uncond_embedding.shape[1] != combined_embedding.shape[1]:
        repeats = combined_embedding.shape[1] // uncond_embedding.shape[1]
        uncond_embedding = uncond_embedding.repeat(1, repeats, 1)
    image = pipe(
        prompt_embeds=combined_embedding,
        negative_prompt_embeds=uncond_embedding,
        num_inference_steps=num_inference_steps,
        width=width,
        height=height,
        guidance_scale=7.5,
    ).images[0]
    image.save(f"images/{title}.png")
    return f"images/{title}.png"
def dream_model_short(
    title,
    prompt,
    width=512,
    height=512,
    num_inference_steps=50,
):
    model_id = "dreamlike-art/dreamlike-diffusion-1.0"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    image = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        width=width,
        height=height,
        guidance_scale=7.5,
    ).images[0]
    image.save(f"images/{title}.png")
    return f"images/{title}.png"
print("short prompt len: ", count_words(DINOS_SHORT))
print("long prompt len: ", count_words(DINOS))
_ = dream_model_short("test_cut_prompt", DINOS_SHORT, width=1024, height=1024)
_ = dream_model_long("test_long_prompt_concat", DINOS, width=1024, height=1024)
_ = dream_model_long(
    "test_long_prompt_average",
    DINOS,
    aggregation="average",
    width=1024,
    height=1024,
)
