General Introduction
MiniMind-V is an open-source project, hosted on GitHub, designed to help users train a lightweight vision-language model (VLM) with only 26 million parameters in under an hour. It builds on the MiniMind language model, adding a visual encoder and a feature projection module to support joint image-and-text processing. The project provides complete code from dataset cleaning to model inference, with a training cost as low as about 1.3 RMB on a single GPU (e.g., an NVIDIA 3090). MiniMind-V emphasizes simplicity and ease of use, adding fewer than 50 lines of code on top of MiniMind, which makes it a suitable tool for developers who want to experiment with and learn how vision-language models are built.
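The core idea can be summarized in a few lines: the CLIP encoder turns an image into a sequence of feature vectors, and a projection module maps those vectors into the language model's embedding space so they can be consumed alongside ordinary text tokens. The snippet below is a minimal PyTorch sketch of that idea, not the project's actual code; the dimensions (768 for CLIP ViT-B/16 features, 512 for the small MiniMind configuration) come from the numbers quoted elsewhere in this guide, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    """Illustrative sketch: map CLIP patch features into the LM embedding space."""
    def __init__(self, clip_dim: int = 768, lm_dim: int = 512):
        super().__init__()
        # A simple linear projection; the real project may use a different module.
        self.proj = nn.Linear(clip_dim, lm_dim)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, 196, clip_dim) -> (batch, 196, lm_dim)
        return self.proj(clip_features)

# The projected vectors are spliced into the language model's input embeddings
# at the positions reserved by the image placeholders.
dummy_clip_output = torch.randn(1, 196, 768)          # stand-in for CLIP features
print(VisionProjection()(dummy_clip_output).shape)    # torch.Size([1, 196, 512])
```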
Feature List
- Provides complete training code for a 26-million-parameter vision-language model, supporting fast training on a single GPU.
- Uses the CLIP visual encoder to process 224x224-pixel images into 196 visual tokens.
- Supports single- and multi-image input, combined with text, for dialogue, image description, or Q&A.
- Includes full-process scripts for dataset cleaning, pre-training, and supervised fine-tuning (SFT).
- Provides a native PyTorch implementation with multi-GPU acceleration and broad compatibility.
- Offers model weights for download on both Hugging Face and ModelScope.
- Provides a web interface and command-line inference for easy testing of the model.
- Supports wandb for recording loss and performance during training.
Usage Guide
Working with MiniMind-V involves environment configuration, data preparation, model training, and testing. Each step is described in detail below to help users get started quickly.
Environment Configuration
MiniMind-V requires a Python environment and GPU support. Here are the installation steps:
- Clone the code
  Run the following commands in a terminal to download the project code:
  ```bash
  git clone https://github.com/jingyaogong/minimind-v
  cd minimind-v
  ```
- Install dependencies
  The project provides a `requirements.txt` file listing the required libraries. Run:
  ```bash
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
  ```
  Python 3.9 or above is recommended. Make sure PyTorch supports CUDA (if you have a GPU), which can be verified with:
  ```python
  import torch
  print(torch.cuda.is_available())
  ```
  If the output is `True`, the GPU is available.
- Download the CLIP model
  MiniMind-V uses the CLIP model (`clip-vit-base-patch16`) as its visual encoder. Run the following command to download it into `./model/vision_model`:
  ```bash
  git clone https://huggingface.co/openai/clip-vit-base-patch16 ./model/vision_model
  ```
  It can also be downloaded from ModelScope:
  ```bash
  git clone https://www.modelscope.cn/models/openai-mirror/clip-vit-base-patch16 ./model/vision_model
  ```
  A quick sanity check of the downloaded encoder is sketched after this list.
- Download the base language model weights
  MiniMind-V builds on the MiniMind language model, so its weights must be downloaded to the `./out` directory. Example:
  ```bash
  wget https://huggingface.co/jingyaogong/MiniMind2-V-PyTorch/blob/main/lm_512.pth -P ./out
  ```
  Alternatively, download `lm_768.pth`, depending on the model configuration you plan to use.
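As a quick sanity check of the downloaded encoder, the snippet below loads it with the Hugging Face transformers library and confirms that a 224x224 input yields 196 patch tokens (ViT-B/16 splits a 224x224 image into 14 x 14 = 196 patches of 16x16 pixels, plus one CLS token, which is dropped here). This is an optional, illustrative check that assumes transformers is installed; it is not part of the project's scripts.

```python
import torch
from transformers import CLIPVisionModel

# Load the vision tower from the locally cloned CLIP checkpoint.
vision = CLIPVisionModel.from_pretrained("./model/vision_model")

# A random tensor with the shape of a preprocessed 224x224 RGB image.
pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = vision(pixel_values=pixel_values)

# last_hidden_state: (1, 197, 768) = 1 CLS token + 196 patch tokens.
patch_tokens = out.last_hidden_state[:, 1:, :]
print(patch_tokens.shape)  # torch.Size([1, 196, 768])
```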
Data Preparation
MiniMind-V uses about 570,000 pre-training samples and 300,000 instruction fine-tuning samples, requiring roughly 5 GB of storage. The procedure is as follows:
- Create the dataset directory
  In the project root, create the `./dataset` folder:
  ```bash
  mkdir dataset
  ```
- Download the dataset
  Download the dataset, which contains `*.jsonl` Q&A data and `*images` image archives, from Hugging Face or ModelScope:
  - Hugging Face: https://huggingface.co/datasets/jingyaogong/minimind-v_dataset
  - ModelScope: https://www.modelscope.cn/datasets/gongjy/minimind-v_dataset
  After downloading, unzip the image archives into `./dataset`:
  ```bash
  unzip pretrain_images.zip -d ./dataset
  unzip sft_images.zip -d ./dataset
  ```
- Verify the dataset
  Make sure `./dataset` contains the following files (a programmatic check is sketched after this list):
  - `pretrain_vlm_data.jsonl`: pre-training data, approximately 570,000 entries.
  - `sft_vlm_data.jsonl`: single-image fine-tuning data, approximately 300,000 entries.
  - `sft_vlm_data_multi.jsonl`: multi-image fine-tuning data, approximately 13,600 entries.
  - Image folders: the image files used for pre-training and fine-tuning.
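The short script below is one way to verify this layout programmatically. It is a convenience sketch rather than part of the project, and it assumes the file names and approximate counts listed above.

```python
import os

# Expected files from the steps above; counts are the approximate totals quoted in this guide.
expected = {
    "pretrain_vlm_data.jsonl": 570_000,
    "sft_vlm_data.jsonl": 300_000,
    "sft_vlm_data_multi.jsonl": 13_600,
}

for name, approx in expected.items():
    path = os.path.join("./dataset", name)
    if not os.path.exists(path):
        print(f"MISSING: {path}")
        continue
    with open(path, "r", encoding="utf-8") as f:
        count = sum(1 for _ in f)
    print(f"{name}: {count} entries (expected roughly {approx:,})")
```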
Model Training
MiniMind-V training is split into pre-training and supervised fine-tuning, and supports single- or multi-GPU acceleration.
- Configure parameters
  Edit `./model/LMConfig.py` to set the model parameters. Examples:
  - Small model: `dim=512`, `n_layers=8`
  - Medium model: `dim=768`, `n_layers=16`
  These parameters determine the model size and performance.
- Pre-training
  Run the pre-training script to learn image-description capability:
  ```bash
  python train_pretrain_vlm.py --epochs 4
  ```
  The output weights are saved as `./out/pretrain_vlm_512.pth` (or `768.pth`, depending on the configuration). This stage freezes the CLIP model and trains only the projection layer and the last layer of the language model (a parameter-freezing sketch follows this list). A single NVIDIA 3090 completes one epoch in about 1 hour.
- Supervised fine-tuning (SFT)
  Fine-tune from the pre-trained weights to optimize conversational ability:
  ```bash
  python train_sft_vlm.py --epochs 4
  ```
  The output weights are saved as `./out/sft_vlm_512.pth`. This step trains the projection layer and all parameters of the language model.
- Multi-GPU training (optional)
  If you have N GPUs, use the following command to accelerate training:
  ```bash
  torchrun --nproc_per_node N train_pretrain_vlm.py --epochs 4
  ```
  Replace `train_pretrain_vlm.py` with another training script (e.g., `train_sft_vlm.py`) as needed.
- Monitor training
  Training loss can be recorded with wandb:
  ```bash
  python train_pretrain_vlm.py --epochs 4 --use_wandb
  ```
  View real-time metrics on the wandb website.
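The two stages mainly differ in which parameters are trainable: pre-training freezes CLIP and updates only the projection layer plus the last layer of the language model, while SFT trains the projection layer and the full language model. The PyTorch sketch below illustrates that freezing pattern in principle; the attribute names (`vision_encoder`, `vision_proj`, `llm`) and the dummy model are placeholders, not the project's actual classes.

```python
import torch.nn as nn

class DummyLLM(nn.Module):
    """Stand-in for the MiniMind language model (8 layers, dim=512)."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(8)])

class DummyVLM(nn.Module):
    """Stand-in VLM using the hypothetical attribute names assumed below."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 768)   # stands in for the frozen CLIP tower
        self.vision_proj = nn.Linear(768, 512)      # feature projection layer
        self.llm = DummyLLM()

def configure_trainable(model: nn.Module, stage: str) -> None:
    """Freeze/unfreeze parameters per training stage, as described in this guide."""
    for p in model.vision_encoder.parameters():     # CLIP stays frozen
        p.requires_grad = False
    for p in model.vision_proj.parameters():        # projection layer always trains
        p.requires_grad = True
    if stage == "pretrain":
        for p in model.llm.parameters():            # freeze the LM ...
            p.requires_grad = False
        for p in model.llm.layers[-1].parameters(): # ... except its last layer
            p.requires_grad = True
    elif stage == "sft":
        for p in model.llm.parameters():            # SFT trains the full LM
            p.requires_grad = True

model = DummyVLM()
configure_trainable(model, "pretrain")
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```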
Testing the Model
Once training is complete, the model's image-dialogue capability can be tested.
- Command-line inference
  Run the following command to load the model:
  ```bash
  python eval_vlm.py --load 1 --model_mode 1
  ```
  - `--load 1`: load the Transformers-format model from Hugging Face; `--load 0`: load PyTorch weights from `./out`.
  - `--model_mode 1`: test the fine-tuned model; `--model_mode 0`: test the pre-trained model.
- Web interface testing
  Launch the web interface:
  ```bash
  python web_demo_vlm.py
  ```
  Visit http://localhost:8000, upload an image, and enter text to test.
- Input format
  MiniMind-V uses 196 `@@@` placeholders to represent an image. Example:
  `@@@...@@@\n这张图片是什么内容?` ("What is in this image?")
  Multi-image input example:
  `@@@...@@@\n第一张图是什么?\n@@@...@@@\n第二张图是什么?` ("What is the first image? What is the second image?")
- Download pre-trained weights
  If you do not want to train, you can download the official weights directly:
  - PyTorch format: https://huggingface.co/jingyaogong/MiniMind2-V-PyTorch
  - Transformers format: https://huggingface.co/collections/jingyaogong/minimind-v-67000833fb60b3a2e1f3597d
Notes
- 24 GB of GPU memory is recommended (e.g., an RTX 3090). If memory is insufficient, reduce the batch size (`batch_size`).
- Make sure the dataset paths are correct: the `*.jsonl` files and the image files must be placed in `./dataset`.
- Freezing the CLIP model during training reduces compute requirements.
- Multi-image dialogue is of limited quality; test single-image scenarios first.
Application Scenarios
- AI algorithm learning
  MiniMind-V provides concise vision-language modeling code that helps students understand cross-modal modeling principles. Users can modify the code to experiment with different parameters or datasets.
- Rapid prototyping
  Developers can prototype image-dialogue applications on top of MiniMind-V. It is lightweight and efficient, making it suitable for low-compute devices such as PCs or embedded systems.
- Education and training
  Universities can use MiniMind-V in AI courses to demonstrate the full model-training workflow. The code is clearly commented and well suited to classroom practice.
- Low-cost experimentation
  Training costs are low, so teams on a limited budget can evaluate multimodal models without high-performance servers.
FAQ
- What image sizes does MiniMind-V support?
  By default it processes 224x224-pixel images, a limit set by the CLIP model. Dataset images may be compressed to 128x128 to save space. Higher-resolution CLIP models may be tried in the future.
- How long does training take?
  On a single NVIDIA 3090, one epoch of pre-training takes about 1 hour; fine-tuning is somewhat faster. The exact time depends on the hardware and the amount of data.
- Can I fine-tune without pre-training?
  Yes. Download the official pre-trained weights and run `train_sft_vlm.py` to fine-tune.
- What languages are supported?
  Mainly Chinese and English; quality depends on the dataset. Other languages can be added by fine-tuning.
- How well does multi-image dialogue work?
  Multi-image dialogue is currently limited; single-image scenarios are recommended first. It may improve in the future with larger models and datasets.