
SongGen: A Single-Stage Autoregressive Transformer for Automatic Song Generation

General Introduction

SongGen is an open-source, single-stage autoregressive Transformer model for text-to-song generation: given text input, it generates songs containing both vocals and accompaniment. SongGen provides fine-grained control over a wide range of musical attributes, including lyrics, instrument descriptions, musical style, mood, and timbre. In addition, users can optionally supply a three-second reference audio clip for voice cloning. SongGen supports two output modes: mixed mode directly generates a single track combining vocals and accompaniment, while dual-track mode generates separate vocal and accompaniment tracks for downstream applications. The project also provides an automated data preprocessing pipeline with effective quality control, designed to facilitate community engagement and future research.


Feature List

  • Generates songs with vocals and accompaniment from text input
  • Supports fine-grained control over lyrics, instrument descriptions, musical style, mood, and timbre
  • Supports voice cloning from a three-second reference audio clip
  • Offers mixed-mode and dual-track-mode outputs
  • Automated data preprocessing pipeline
  • Open-sourced model weights, training code, annotated data, and processing pipeline

 

Using Help


Installation Process

  1. Clone the project repository:
   git clone https://github.com/LiuZH-19/SongGen.git
   cd SongGen
  2. Create and activate a new Conda environment:
   conda create -n songgen python=3.9.18
   conda activate songgen
  3. Install PyTorch with CUDA 11.8 support:
   conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
  4. Install SongGen (inference only):
   pip install .
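
Optionally, verify the environment before moving on. The following quick check is not part of the official instructions, just a common sanity test:

   # Sanity check (not from the official docs): confirm PyTorch sees the GPU
   import torch
   print(torch.__version__)          # installed PyTorch version
   print(torch.cuda.is_available())  # should print True on a CUDA-capable machine
   print(torch.version.cuda)         # should report the CUDA build, e.g. 11.8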

Download Checkpoints

Please download the pre-trained model checkpoints for both xcodec and SongGen.
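
If the checkpoints are hosted on Hugging Face, one way to fetch them is with huggingface_hub. The repository IDs below are placeholders, so substitute the actual xcodec and SongGen repositories linked from the project README:

   # Sketch only: download checkpoints with huggingface_hub.
   # The repo IDs are placeholders -- use the ones from the SongGen README.
   from huggingface_hub import snapshot_download
   xcodec_ckpt_path = snapshot_download(repo_id="<xcodec-repo-id>")
   songgen_ckpt_path = snapshot_download(repo_id="<songgen-repo-id>")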

Running Inference

Mixed Mode

  1. Import the necessary libraries:
   import torch
   import os
   from songgen import (
       VoiceBpeTokenizer,
       SongGenMixedForConditionalGeneration,
       SongGenProcessor,
   )
   import soundfile as sf
  2. Load the pre-trained model:
   ckpt_path = "..."  # path to the pre-trained model checkpoint
   device = "cuda:0" if torch.cuda.is_available() else "cpu"
   model = SongGenMixedForConditionalGeneration.from_pretrained(
       ckpt_path,
       attn_implementation='sdpa'
   ).to(device)
   processor = SongGenProcessor(ckpt_path, device)
  3. Define the input text and lyrics:
   lyrics = "..."  # lyrics text
   text = "..."  # music description text
   ref_voice_path = 'path/to/your/reference_audio.wav'  # reference audio path (optional)
   separate = True  # whether to separate the vocal track from the reference audio
  4. Generate the song:
   model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=separate)
   generation = model.generate(**model_inputs, do_sample=True)
   audio_arr = generation.cpu().numpy().squeeze()
   sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)
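
For illustration only (these strings are not official examples), a prompt exercising the fine-grained controls might look like this, with the music description covering instrumentation, style, mood, and timbre, and the lyrics given as plain text:

   # Illustrative inputs only -- not from the official examples
   lyrics = "Counting stars on a midnight train, chasing echoes through the rain"
   text = "A melancholic indie pop song with soft female vocals, gentle piano, and warm synth pads at a slow tempo"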

Dual-Track Mode

  1. Import the necessary libraries:
   import torch
   import os
   from songgen import (
       VoiceBpeTokenizer,
       SongGenDualTrackForConditionalGeneration,
       SongGenProcessor,
   )
   import soundfile as sf
  2. Load the pre-trained model:
   ckpt_path = "..."  # path to the pre-trained model checkpoint
   device = "cuda:0" if torch.cuda.is_available() else "cpu"
   model = SongGenDualTrackForConditionalGeneration.from_pretrained(
       ckpt_path,
       attn_implementation='sdpa'
   ).to(device)
   processor = SongGenProcessor(ckpt_path, device)
  3. Define the input text and lyrics:
   lyrics = "..."  # lyrics text
   text = "..."  # music description text
   ref_voice_path = 'path/to/your/reference_audio.wav'  # reference audio path (optional)
   separate = True  # whether to separate the vocal track from the reference audio
  4. Generate the song:
   model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=separate)
   generation = model.generate(**model_inputs, do_sample=True)
   vocal_array = generation.vocal_sequences[0, :generation.vocal_audios_length[0]].cpu().numpy()
   acc_array = generation.acc_sequences[0, :generation.acc_audios_length[0]].cpu().numpy()
   # Trim both tracks to the same length, then mix them by summing
   min_len = min(vocal_array.shape[0], acc_array.shape[0])
   vocal_array = vocal_array[:min_len]
   acc_array = acc_array[:min_len]
   audio_arr = vocal_array + acc_array
   sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)
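
Because dual-track mode returns the vocals and accompaniment separately, you can also keep the two stems as individual files for downstream editing (the file names below are arbitrary):

   # Optional: save the two stems separately for later mixing or editing
   sf.write("songgen_vocal.wav", vocal_array, model.config.sampling_rate)
   sf.write("songgen_acc.wav", acc_array, model.config.sampling_rate)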

Feature Details

  • Text-to-Song Generation: Input lyrics and a music description, and the model generates the corresponding song audio.
  • Fine-Grained Control: Through the text description, the user can control attributes of the generated song such as instrumentation, style, mood, and timbre (see the illustrative prompt after the mixed-mode example above).
  • Voice Cloning: Given a three-second reference audio clip, the model can mimic that voice when generating the song.
  • Output Mode Selection: Choose mixed mode or dual-track mode as needed, for flexible use in different scenarios.
  • Data Preprocessing Pipeline: Automated data preprocessing and quality control help ensure high-quality generation results.