General Introduction
Infinity is a groundbreaking high-resolution image generation framework developed by the FoundationVision team. The project breaks through the limitations of traditional image generation models through an innovative bit-level visual autoregressive modeling approach.The core feature of Infinity is the use of unlimited vocabulary of disambiguators and classifiers, together with the bit-level self-correction mechanism, which is capable of generating ultra-high-quality realism images. The project is fully open-source and provides a choice of model sizes from 2B to 20B parameter scales, supporting image generation at resolutions up to 1024x1024. As a cutting-edge research project, Infinity not only pushes forward the technological progress in the field of computer vision, but also provides new solutions for image generation tasks.
Function List
- 2B parametric model supports high quality image generation up to 1024x1024 resolution
- Provides a visual lexicon with unlimited vocabulary to support finer image feature extraction
- Realization of bit-level self-correction mechanism to improve the quality and accuracy of generated images
- Supports flexible selection of multiple model sizes (125M, 1B, 2B, 20B parameters)
- Provide an interactive inference interface to facilitate user experiments on image generation
- Integrated with a complete training and evaluation framework
- Supports multi-dimensional evaluation of model performance (GenEval, DPG, HPSv2.1 and other metrics)
- Provides an online demo platform that allows users to experience image generation directly
Using Help
1. Environmental configuration
1.1 Basic requirements:
- Python environment
- PyTorch >= 2.5.1 (requires FlexAttention support)
- Install other dependencies via pip:
pip3 install -r requirements.txt
2. Use of models
2.1 Quick start:
- Download the pre-trained model from HuggingFace: infinity_2b_reg.pth
- Download Visual Segmenter: infinity_vae_d32_reg.pth
- Interactive image generation using interactive_infer.ipynb
2.2 Training configuration:
# Starting training with a single command
bash scripts/train.sh
# Training commands for different model sizes
# 125M model (256x256 resolution)
torchrun --nproc_per_node=8 train.py --model=layer12c4 --pn 0.06M
# 2B model (1024x1024 resolution)
torchrun --nproc_per_node=8 train.py --model=2bc8 --pn 1M
2.3 Data preparation:
- The training data needs to be prepared in JSONL format
- Each data item contains: image path, long and short text description, image aspect ratio and other information
- Sample datasets are provided by the project for reference
2.4 Model Evaluation:
- Support for multiple assessment indicators:
- ImageReward: assessing human preference scores for generating images
- HPS v2.1: Evaluation metrics based on 798K manual rankings
- GenEval: Evaluating text-to-image alignment
- FID: Assessing the quality and diversity of generated images
2.5 Online presentation:
- Visit the official demo platform: https://opensource.bytedance.com/gmpt/t2i/invite
- Enter a text description to generate a corresponding high-quality image
- Supports adjustment of multiple image resolutions and generation parameters
3. Advanced functions
3.1 Bit-level self-correcting mechanisms:
- Automatic recognition and correction of errors in the generation process
- Improve the quality and accuracy of generated images
3.2 Model extensions:
- Supports flexible scaling of model size
- Multiple models available from 125M to 20B parameters
- Adapts to different hardware environments and application requirements
4. Cautions
- Ensure hardware resources meet model requirements
- Large-scale models require sufficient GPU memory
- Recommended for training with HPC equipment
- Regular backup training checkpoints
- Note the adherence to the MIT open source protocol