General Introduction
Lumina-mGPT-2.0 is an open-source project jointly developed by Shanghai AI Laboratory, The Chinese University of Hong Kong (CUHK), and other organizations. It is hosted on GitHub and maintained by the Alpha-VLLM team. It is a standalone autoregressive model trained from scratch, whose core function is generating diverse, high-quality images from text. Released on April 3, 2025, the tool supports not only basic text-to-image generation but also tasks such as image pair generation, subject-driven generation, multi-round image editing, and controlled generation.
Function List
- Supports generating high-quality images up to 768x768 resolution from text input.
- Can generate image pairs suitable for comparison or matching tasks.
- Provides subject-driven generation, producing images based on a specified subject or theme.
- Supports multiple rounds of image editing, allowing users to adjust the generated results step by step.
- Includes Controlled Generation feature for precise adjustment of image details.
- Provides fine-tuned code so that users can optimize the model according to their needs.
- Supports accelerated inference to reduce image generation time.
Usage Guide
Installation Process
To use Lumina-mGPT-2.0 locally, you need to build the runtime environment first. Below are the detailed steps:
- Download the Project Code
Open a terminal and enter the following command to clone the code repository:
git clone https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.git
Then go to the project directory:
cd Lumina-mGPT-2.0
- Create a Virtual Environment
Create a separate Python 3.10 environment with Conda to avoid dependency conflicts:
conda create -n lumina_mgpt_2 python=3.10 -y
Activate the environment:
conda activate lumina_mgpt_2
- Install Dependencies
Install the Python libraries the project needs:
pip install -r requirements.txt
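Before installing the Flash Attention wheel in the next step, it is worth confirming that your local Python, PyTorch, and CUDA versions match the tags in the wheel filename (cp310, torch2.3, cu12). This is an optional sanity check, assuming requirements.txt has already installed PyTorch:
python -c "import sys, torch; print(sys.version.split()[0], torch.__version__, torch.version.cuda)"
If the versions differ, choose a matching wheel from the flash-attention releases page instead.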
Next, install the Flash Attention module (used to accelerate attention computation):
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation
Finally, install the project as a local package:
pip install -e .
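At this point you can optionally verify that the key packages import and that a GPU is visible; this is a quick sanity check rather than an official project step:
python -c "import torch, flash_attn; print('CUDA available:', torch.cuda.is_available())"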
- Download MoVQGAN Weights
The project depends on the MoVQGAN model weights. Create the directory and download them:
mkdir -p lumina_mgpt/movqgan/270M
wget -O lumina_mgpt/movqgan/270M/movqgan_270M.ckpt https://huggingface.co/ai-forever/MoVQGAN/resolve/main/movqgan_270M.ckpt
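To confirm the download completed, check that the checkpoint file exists and has a reasonable size:
ls -lh lumina_mgpt/movqgan/270M/movqgan_270M.ckpt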
- Test the Installation
Run the following command to check that the environment works:
python generate_examples/generate.py --model_path Alpha-VLLM/Lumina-mGPT-2.0 --save_path save_samples/
If no errors are reported, the installation was successful.
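The test command writes its output to the save_samples/ directory; listing that directory afterwards confirms that images were actually produced:
ls -lh save_samples/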
How to Use the Main Features
The main function of Lumina-mGPT-2.0 is to generate images from text. The detailed steps are as follows:
- Basic Image Generation
Run the generation script in the terminal with a text description. For example, to generate an image of "City skyline at night with bright lights":
python generate_examples/generate.py --model_path Alpha-VLLM/Lumina-mGPT-2.0 --save_path save_samples/ --cfg 4.0 --top_k 4096 --temperature 1.0 --width 768 --height 768 --prompt "City skyline at night with bright lights."
Parameter Description:
- --model_path: the model path.
- --save_path: the directory where generated images are saved.
- --cfg: text-image correlation (guidance strength), default 4.0; the larger the value, the more closely the image follows the description.
- --top_k: controls generation diversity, default 4096.
- --temperature: controls randomness, default 1.0.
- --width and --height: set the resolution, up to a maximum of 768x768.
- --prompt: the text description, in English or Chinese.
The generated images are saved in the save_samples folder; an example command that varies these parameters is shown below.
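As an illustration of how these parameters combine (the prompt and values below are only examples, not recommended settings), a variation of the earlier command might look like this:
python generate_examples/generate.py --model_path Alpha-VLLM/Lumina-mGPT-2.0 --save_path save_samples/ --cfg 7.0 --top_k 2048 --temperature 0.8 --width 768 --height 768 --prompt "A watercolor painting of a mountain lake at sunrise."
Higher --cfg values keep the image closer to the prompt, while a lower --temperature makes the output less random; suitable values depend on the prompt.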
- Accelerated Generation
To generate images faster, you can use two acceleration options:
- Add --speculative_jacobi: enables speculative Jacobi decoding to reduce generation time.
- Add --quant: enables model quantization to reduce GPU memory usage.
Example command:
python generate_examples/generate.py --model_path Alpha-VLLM/Lumina-mGPT-2.0 --save_path save_samples/ --cfg 4.0 --top_k 4096 --temperature 1.0 --width 768 --height 768 --speculative_jacobi --quant
Official test data (on an A100 GPU):
- Standard generation: 694 seconds, using 80 GB of GPU memory.
- With speculative Jacobi decoding: 324 seconds, 79.2 GB of GPU memory.
- With speculative decoding and quantization: 304 seconds, 33.8 GB of GPU memory.
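To see how much GPU memory your own run uses, you can monitor it from a second terminal with nvidia-smi (a standard NVIDIA utility, not part of this project):
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv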
- Multi-Round Editing and Controlled Generation
The model supports adjusting an image over multiple rounds: for example, generate an image first, then modify some of its details with a new description. For the specific steps, refer to the scripts in the generate_examples folder or the official documentation in <project root>/README.md.
- Fine-Tuning the Model
To optimize the model with your own data, refer to the <project root>/TRAIN.md documentation. It provides detailed fine-tuning steps, including data preparation and training commands.
Workflow
- Follow the steps to install the environment and dependencies.
- Download MoVQGAN Weights.
- Enter a text description and run the generate command.
- Check the results, adjust parameters or perform multiple rounds of editing.
If you run into problems, check the documentation on GitHub or the community discussions. The overall process is straightforward and suitable for both beginners and professional users.
Application Scenarios
- Creative Design
Designers enter "interior of a future space station" to generate concept art for project inspiration.
- Academic Research
Researchers use it to test the image generation capabilities of autoregressive models, or fine-tune the model for experiments.
- Content Creation
Bloggers enter "spring garden" to generate images that enhance the visual appeal of an article.
- Personalization
Users generate theme-specific images, such as "an advertising poster with the company logo", through multiple rounds of editing.
QA
- What hardware is required?
A high-performance GPU such as the A100 is recommended, with at least 40 GB of GPU memory. It can run on a CPU, but very slowly. A quick way to check your GPU is shown after this list.
- Can the generated images be used commercially?
The project is released under the Apache 2.0 license, so commercial use is permitted, subject to the license terms.
- Why does generation take so long?
Generating a 768x768 image with default settings takes several minutes. You can speed it up with --speculative_jacobi and --quant.
- Does it support Chinese descriptions?
Yes, but English descriptions may be more accurate because the model's training data is predominantly English.
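To check whether your GPU meets the memory recommendation above, you can query its name and total memory with nvidia-smi (a standard NVIDIA utility, not something provided by this project):
nvidia-smi --query-gpu=name,memory.total --format=csv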