General Introduction
CogView4 is an open-source text-to-image model developed by the KEG Lab at Tsinghua University (THUDM) that converts text descriptions into high-quality images. It supports bilingual prompt input and is especially good at understanding Chinese prompts and generating images containing Chinese characters, making it well suited to advertisement design, short-video creation, and similar scenarios. As the first open-source model that can render Chinese characters in images, CogView4 excels at complex semantic alignment and instruction following. It is built on the GLM-4-9B text encoder, accepts prompts of any length, and can generate images at resolutions up to 2048x2048. The project is hosted on GitHub with detailed code and documentation, and has attracted wide attention and participation from developers and creators.
The newest CogView4 model went live on March 13 on the official Zhipu Qingyan website.
Online demo: https://huggingface.co/spaces/THUDM-HF-SPACE/CogView4
Feature List
- Bilingual prompt-to-image generation: Supports both Chinese and English descriptions and accurately interprets them to generate matching images; Chinese scenes perform particularly well.
- Chinese text in images: Generates clear Chinese text inside images, suitable for posters, advertisements, and other creative work that needs written content.
- Arbitrary resolution output: Supports image sizes from low resolutions up to 2048x2048 to meet a wide variety of needs.
- Long prompt support: Accepts text input of any length and can process up to 1024 tokens, making it easy to describe complex scenes.
- Complex semantic alignment: Accurately captures the details in the prompt and generates high-quality images that match its semantics.
- Open-source model customization: Full code and pre-trained weights are provided so developers can extend or fine-tune the model to their needs.
Usage Guide
Installation process
CogView4 is a Python-based open source project that requires a locally configured environment to run. Here are the detailed installation steps:
1. Environment preparation
- Operating system: Windows, Linux, or macOS.
- Hardware: An NVIDIA GPU (at least 16GB of video memory) is recommended to accelerate inference; a CPU works but is slower.
- Software dependencies (a quick check snippet follows this list):
- Python 3.8 or higher
- PyTorch (GPU build recommended, torch>=2.0)
- Git (for cloning the repository)
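Before continuing, the environment can be sanity-checked from Python (a minimal sketch, not part of the project):
import sys
import torch

# Verify interpreter and PyTorch versions meet the stated minimums
print("Python:", sys.version.split()[0])            # needs 3.8+
print("PyTorch:", torch.__version__)                # needs 2.0+
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # 16GB or more of video memory is recommended
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB")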
2. Clone the repository
Open a terminal and enter the following command to download the CogView4 project source code:
git clone https://github.com/THUDM/CogView4.git
cd CogView4
3. Install dependencies
The project provides a requirements.txt file; run the following command to install the required libraries:
pip install -r requirements.txt
For GPU acceleration, make sure the correct PyTorch build is installed; refer to the official PyTorch site for the exact installation command, for example:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
4. Download the pre-trained model
The CogView4-6B weights can be downloaded manually from Hugging Face or the official link: visit THUDM's GitHub page, find the model download address (e.g. THUDM/CogView4-6B), and extract the files into a checkpoints folder in the project root. Alternatively, download automatically from code:
from diffusers import CogView4Pipeline
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B")
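Alternatively, the weights can be fetched into the checkpoints folder with huggingface_hub (a minimal sketch; the local_dir path is just an example):
from huggingface_hub import snapshot_download

# Download the full model repository into the project's checkpoints folder
snapshot_download(repo_id="THUDM/CogView4-6B", local_dir="checkpoints/CogView4-6B")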
5. Configure the environment
If video memory is limited, enable the memory optimization options (e.g. enable_model_cpu_offload), as described in the usage instructions below.
How to use CogView4
After installation, users can call CogView4 to generate images via Python script. Below is the detailed procedure:
1. Basic image generation
Create a Python file (e.g. generate.py) and enter the following code:
from diffusers import CogView4Pipeline
import torch

# Load the model in bfloat16 to halve weight memory versus float32
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Optimize video memory usage (enable_model_cpu_offload manages GPU placement, so .to("cuda") is not needed)
pipe.enable_model_cpu_offload()  # Offload idle submodules to the CPU
pipe.vae.enable_slicing()        # Decode the VAE in slices
pipe.vae.enable_tiling()         # Decode the VAE in tiles

# Input prompt
prompt = "A red sports car parked on a sunny seaside highway with azure waves in the background"

image = pipe(
    prompt=prompt,
    guidance_scale=3.5,       # How closely the image follows the prompt
    num_images_per_prompt=1,  # Number of images to generate
    num_inference_steps=50,   # Number of denoising steps; affects quality
    width=1024,               # Image width in pixels
    height=1024,              # Image height in pixels
).images[0]

# Save the image
image.save("output.png")
Run the script:
python generate.py
The script generates a 1024x1024 image and saves it as output.png.
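For reproducible outputs, a seeded generator can be passed to the pipeline; this follows the standard diffusers convention, which CogView4Pipeline is assumed to share:
# Fix the random seed so repeated runs produce the same image
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(prompt=prompt, generator=generator, width=1024, height=1024).images[0]
image.save("output_seeded.png")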
2. Generating images with Chinese text
CogView4 supports generating Chinese text in images, for example:
prompt = "An advertising poster that says 'Welcome to experience CogView4' with a blue sky and white clouds in the background"
image = pipe(prompt=prompt, width=1024, height=1024).images[0]
image.save("poster.png")
After running, the words "Welcome to experience CogView4" appear clearly in the image, making it suitable for promotional materials.
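If the rendered text comes out blurry or the scene picks up unwanted artifacts, a negative prompt can steer the model away from them (assuming CogView4Pipeline accepts the standard diffusers negative_prompt argument):
# Suppress common failure modes with a negative prompt
image = pipe(
    prompt=prompt,
    negative_prompt="blurry, distorted text, low quality",
    width=1024,
    height=1024,
).images[0]
image.save("poster_clean.png")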
3. Adjusting the resolution
CogView4 supports output at any resolution, e.g. generating 2048x2048 images:
image = pipe(prompt=prompt, width=2048, height=2048).images[0]
image.save("high_res.png")
Note: Higher resolutions require more video memory; a GPU with 24GB or more is recommended.
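Non-square sizes work the same way, for example a wide banner (illustrative dimensions; memory use grows with total pixel count):
# Generate a 2:1 landscape banner
image = pipe(prompt=prompt, width=2048, height=1024).images[0]
image.save("banner.png")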
4. Handling very long prompts
CogView4 can handle complex descriptions, for example:
prompt = "A bustling ancient Chinese bazaar with stalls filled with ceramics and silks, mountains and sunset in the distance, and people shopping in traditional Han Chinese clothing"
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("market.png")
The model supports up to 1024 tokens, fully parsing long text to generate richly detailed images.
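To check how many tokens a prompt consumes before generating, the pipeline's tokenizer can be queried (assuming it is exposed as pipe.tokenizer, as in most diffusers pipelines):
# Count prompt tokens against the 1024-token limit
token_count = len(pipe.tokenizer(prompt).input_ids)
print(f"Prompt uses {token_count} of 1024 tokens")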
5. Optimizing performance
If video memory is insufficient, or you want to trade speed for quality, adjust these parameters (a combined sketch follows this list):
- Lower torch_dtype to torch.float16 to reduce memory use.
- Raise num_inference_steps to improve quality (default 50, recommended 50-100).
- Use pipe.enable_model_cpu_offload() to move part of the computation to the CPU.
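Putting these options together, a low-memory configuration might look like this sketch (exact savings depend on hardware):
from diffusers import CogView4Pipeline
import torch

# Load weights in half precision to reduce memory
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.float16)
# Keep idle submodules on the CPU instead of the GPU
pipe.enable_model_cpu_offload()
# Decode the VAE in slices and tiles to cap peak memory
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()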
Feature Highlights
Generate bilingual images
CogView4's bilingual support is its biggest draw. For example, enter a mixed-language prompt:
prompt = "A futuristic city with neon lights and flying cars, with a sign that says 'City of the Future'"
image = pipe(prompt=prompt).images[0]
image.save("future_city.png")
The resulting image combines the futuristic city described in the prompt with the rendered "City of the Future" sign text, demonstrating strong semantic understanding.
High-quality detail control
By adjusting guidance_scale (range 1-10, default 3.5), you can control how closely the image fits the prompt. The higher the value, the more closely the details follow the prompt, though possibly at the cost of creativity:
image = pipe(prompt=prompt, guidance_scale=7.0).images[0]
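To pick a good value for a particular prompt, one option is to sweep several settings and compare the results (illustrative values):
# Render the same prompt at several guidance scales for comparison
for scale in (2.0, 3.5, 5.0, 7.0):
    image = pipe(prompt=prompt, guidance_scale=scale).images[0]
    image.save(f"guidance_{scale}.png")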
Batch Generation
Generate multiple images at once:
images = pipe(prompt=prompt, num_images_per_prompt=3).images
for i, img in enumerate(images):
    img.save(f"output_{i}.png")
Caveats
- Video memory requirements: Approximately 16GB of VRAM is needed to generate 1024x1024 images, and 24GB+ for 2048x2048.
- Inference time: 50 inference steps take about 1-2 minutes, depending on hardware (see the timing snippet after this list).
- Community support: If you run into problems, ask for help on the GitHub Issues page or consult the official README.
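To measure actual inference time on your own hardware, wrap the call in a timer (a simple sketch using the standard library):
import time

# Time a single 50-step generation
start = time.perf_counter()
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
print(f"Generation took {time.perf_counter() - start:.1f} s")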
With these steps, users can quickly get started with CogView4, generate high-quality images and apply them to creative projects!