MIDI-3D: An open source tool to quickly generate multi-object 3D scenes from a single image

General Introduction

MIDI-3D is an open-source project developed by the VAST-AI-Research team that quickly generates 3D scenes containing multiple objects from a single image, aimed at developers, researchers, and creators. The tool is based on a multi-instance diffusion model, combining artificial intelligence and 3D modeling to generate multiple high-quality 3D objects simultaneously while preserving their spatial relationships. MIDI-3D was presented at CVPR 2025, and the code, model weights, and an online demo are all openly available. It supports both realistic and cartoon-style image inputs, with generation times as short as 40 seconds, and it outputs files in .glb format that can be edited in other software. The project aims to simplify the creation of 3D scenes and make it easier for more people to produce digital assets.


Function List

  • Generates 3D scenes containing multiple objects from a single image, supporting both realistic and cartoon styles.
  • Provides image segmentation that automatically identifies and labels the objects in a picture.
  • Generates multiple separable 3D instances simultaneously and automatically combines them into a complete scene.
  • Supports both command-line operation and an interactive web demo.
  • Automatically downloads pre-trained model weights locally for quick startup.
  • Exports 3D models in .glb format for further editing or import into other software.
  • Efficient generation, with no need for object-by-object modeling or lengthy optimization.

 

Using Help

The use of MIDI-3D is divided into two parts: installation and operation. Below are detailed steps to help you get started from scratch.

Installation process

  1. Prepare the hardware and software environment
    You will need a computer with a CUDA-capable NVIDIA GPU, as MIDI-3D relies on GPU acceleration; at least 6GB of video memory is recommended. Make sure Python 3.10 or higher is installed.
  2. Creating a virtual environment (optional)
    To avoid conflicts, you can create a new Conda environment:
conda create -n midi python=3.10
conda activate midi
  3. Installing PyTorch
    Install PyTorch according to your GPU's CUDA version. For example, with CUDA 11.8 the command is:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

If your CUDA version differs, go to https://pytorch.org/get-started/locally/ and select the corresponding command. A quick check that the install can see your GPU is shown below.
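
A minimal sanity check, assuming the install above succeeded:

# Confirm PyTorch was built with CUDA and can see the GPU
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")

If this prints False or reports well under 6GB of VRAM, revisit the driver and PyTorch installation before continuing.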

  4. Download the project code
    Clone the MIDI-3D repository by running the following commands in a terminal:
git clone https://github.com/VAST-AI-Research/MIDI-3D.git
cd MIDI-3D
  5. Install the dependencies
    The project provides a requirements.txt file; run the following command to install all dependencies:
pip install -r requirements.txt
  6. Get the model weights
    When you run a script, MIDI-3D automatically downloads the pre-trained model from https://huggingface.co/VAST-AI/MIDI-3D and saves it to the pretrained_weights/MIDI-3D folder. If your network is unstable, you can also download the weights manually to that path, for example as sketched below.
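
One way to do the manual download is with the huggingface_hub library (a sketch; install it with pip install huggingface_hub if your environment does not already have it):

# Fetch the MIDI-3D weights into the folder the scripts expect
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VAST-AI/MIDI-3D",
    local_dir="pretrained_weights/MIDI-3D",
)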

Workflow

MIDI-3D can be used in two ways: from the command line or through an interactive demo. Below are the specific steps.

Command-line operation

  1. Generate a segmentation map
    MIDI-3D requires an image and a corresponding segmentation map (with the object regions labeled). The segmentation map can be generated with the included Grounded SAM script. For example, if you have an image 04_rgb.png, run:
python -m scripts.grounding_sam --image assets/example_data/Cartoon-Style/04_rgb.png --labels "lamp sofa table dog" --output ./segmentation.png
  • --image specifies the input image path.
  • --labels lists the names of the objects in the picture, separated by spaces.
  • --output specifies where the segmentation map is saved.
    When run, this generates the segmentation.png file, which you can sanity-check as sketched below.
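    A quick way to confirm the map contains roughly one region per object (a sketch; the exact label encoding may differ):

# Count distinct values/colours in the segmentation map
import numpy as np
from PIL import Image

seg = np.array(Image.open("segmentation.png"))
if seg.ndim == 2:
    labels = np.unique(seg)
else:
    labels = np.unique(seg.reshape(-1, seg.shape[-1]), axis=0)
print(len(labels), "distinct regions (including background)")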
  2. Generate the 3D scene
    To generate a 3D scene from the image and segmentation map, run the following command:
python -m scripts.inference_midi --rgb assets/example_data/Cartoon-Style/00_rgb.png --seg assets/example_data/Cartoon-Style/00_seg.png --output-dir "./output"
  • --rgb is the path to the original image.
  • --seg is the path to the segmentation map.
  • --output-dir is the output folder path.
    The generated 3D scene is saved as an output.glb file; generation usually takes 40 seconds to 1 minute. If an object is near the edge of the image, it is recommended to add the --do-image-padding flag, e.g.:
python -m scripts.inference_midi --rgb 00_rgb.png --seg 00_seg.png --output-dir "./output" --do-image-padding
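
To process several scenes in one go, a small wrapper script can loop over image/segmentation pairs. This is only a sketch: it assumes the *_rgb.png / *_seg.png naming convention used by the bundled example data, so adjust the folder and patterns to your own files.

# Hypothetical batch runner: one inference_midi call per image/seg pair
import subprocess
from pathlib import Path

data_dir = Path("assets/example_data/Cartoon-Style")
for rgb in sorted(data_dir.glob("*_rgb.png")):
    seg = rgb.with_name(rgb.name.replace("_rgb", "_seg"))
    if not seg.exists():
        continue  # skip images without a matching segmentation map
    subprocess.run(
        ["python", "-m", "scripts.inference_midi",
         "--rgb", str(rgb), "--seg", str(seg),
         "--output-dir", f"./output/{rgb.stem}"],
        check=True,
    )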

Interactive Demo

  1. Start the demo
    Run the following command to start the Gradio interface:
python gradio_demo.py

The browser will open automatically and display the interface. You can also visit the online demo at https://huggingface.co/spaces/VAST-AI/MIDI-3D.

  2. Upload the image and segment it
    Click "Input Image" in the interface to upload an image, then use the mouse to select the object regions. The system automatically generates a segmentation map, which is displayed under "Segmentation Result".
  3. Generate the 3D scene
    Click "Run Segmentation" to confirm the segmentation map, adjust the parameters (such as the random seed), and then click the Generate button. After a short wait, the interface displays the 3D model, which you can download as a .glb file.

Feature details

  • Image segmentation
    Grounded SAM is the pre-processing tool for MIDI-3D; it automatically recognizes objects in an image and generates a segmentation map. You can enter object names (e.g. "lamp sofa") or select regions manually in the interactive interface. It supports multi-object scenes with high segmentation accuracy.
  • Multi-object 3D generation
    MIDI-3D uses a multi-instance diffusion model to generate multiple 3D objects at the same time while preserving their spatial relationships. For example, a picture of a living room can yield 3D models of the sofa, table, and lamp that directly compose the complete scene. This is faster than traditional object-by-object generation.
  • Model output
    The generated .glb files are compatible with Blender, Unity, and other software. You can import the files and adjust materials and lighting or add animations to suit different needs; you can also inspect them programmatically, as sketched below.
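
Because each object is a separate instance inside the scene, a script can list the per-object meshes before you open the file in other software. A minimal sketch using the third-party trimesh library (pip install trimesh), assuming the scene was written to ./output/output.glb:

# List the per-object meshes inside the generated .glb scene
import trimesh

scene = trimesh.load("output/output.glb", force="scene")
for name, mesh in scene.geometry.items():
    print(name, f"{len(mesh.vertices)} vertices, {len(mesh.faces)} faces")
print("Scene bounds:", scene.bounds)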

Supplementary resources

  • Instructional Videos
    An official how-to video is provided (available at https://github.com/VAST-AI-Research/MIDI-3D), demonstrating in detail the process from uploading an image to generating a 3D scene.
  • References
    To get the technical details, you can read the paper: https://arxiv.org/abs/2412.03558.

Frequently Asked Questions

  • If generation fails, check that your GPU is supported and has enough memory, and make sure the segmentation map is correct.
  • If object details are missing, try using a higher resolution image.

 

Application scenarios

  1. Game development
    Developers can use MIDI-3D to generate 3D scenes from sketches. For example, a picture of a forest can be quickly turned into a 3D model of the trees and terrain and imported into Unity for use.
  2. Academic research
    Researchers can use it to test the effectiveness of multi-instance diffusion models. Although the model is trained only on synthetic data, it adapts well to real and cartoon images.
  3. Digital art
    Artists can generate 3D animated scenes from cartoon pictures to quickly produce creative works and save modeling time.

 

Q&A

  1. What image types does MIDI-3D support?
    It supports .png and .jpg formats. Clear images are recommended for better results.
  2. What hardware configuration is required?
    An NVIDIA GPU with at least 6GB of video memory and a CUDA environment are required; running on the CPU alone is not sufficient.
  3. Is the generated model commercially available?
    Yes. The project uses the MIT license, and the generated .glb files are free for commercial use, subject to the license terms.