
Audio-Reasoner: a large-scale language model supporting audio deep reasoning

General Introduction

Audio-Reasoner is an open-source project developed by a team at Tsinghua University and hosted on GitHub, focused on building large language models that support deep reasoning over audio. The model is built on Qwen2-Audio-Instruct and introduces structured chain-of-thought (CoT) techniques to enable complex reasoning and multimodal understanding of audio content. The project includes the Audio-Reasoner-7B model and the upcoming CoTA dataset (1.2 million high-quality samples), which improved performance on the MMAU-mini and AIR-Bench-Chat benchmarks by 25.42% and 14.57% respectively, reaching leading levels. Audio-Reasoner supports sound, music, speech, and other audio types, making it a practical tool for researchers and developers working on audio analysis, content understanding, and similar scenarios.


Feature List

  • Audio deep reasoning: analyzes audio and produces detailed reasoning processes and results using structured chain-of-thought.
  • Multimodal task support: combines audio and text inputs for cross-modal comprehension and reasoning tasks.
  • Multiple audio types: supports recognition and analysis of sound, music, speech, and other audio types.
  • High-performance pre-trained model: provides the Audio-Reasoner-7B model, which excels in a number of benchmarks.
  • CoTA dataset: contains 1.2 million samples to support structured reasoning training and capability enhancement of models.
  • Inference code and demos: provides complete inference code and demo examples for easy testing and development.
  • Open-source roadmap: the data synthesis pipeline and training code will be released in the future to facilitate community collaboration.

 

Usage Guide

Installation process

Installing Audio-Reasoner requires configuring a Python environment and downloading the model weights. The detailed steps below ensure a successful setup:

1. Cloning a GitHub repository

First clone the Audio-Reasoner project locally. Open a terminal and run the following command:

git clone https://github.com/xzf-thu/Audio-Reasoner.git
cd Audio-Reasoner

This downloads the project files and changes into the project directory.

2. Create and activate a virtual environment

To avoid dependency conflicts, it is recommended that you create a separate Python environment using Conda:

conda create -n Audio-Reasoner python=3.10
conda activate Audio-Reasoner

This command creates and activates a Python 3.10-based environment called "Audio-Reasoner".

3. Installation of dependency packages

The project provides a requirements.txt file containing the necessary dependencies. Install them as follows:

pip install -r requirements.txt
pip install transformers==4.48.0

Note: transformers version 4.48.0 is required to ensure stable model performance. Install the other dependencies first, then pin the transformers version, to avoid conflicts.

4. Download model weights

The Audio-Reasoner-7B model is published on HuggingFace and must be downloaded and its path configured manually:

  • Visit Audio-Reasoner-7B on HuggingFace and download the model files.
  • Set the last_model_checkpoint variable in the code to the downloaded checkpoint path, for example:
last_model_checkpoint = "/path/to/Audio-Reasoner-7B"
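Alternatively, the weights can be fetched programmatically with the huggingface_hub library. A minimal sketch, assuming a hypothetical repository id (check the project's HuggingFace page for the real one):

from huggingface_hub import snapshot_download

# Download all files in the model repository to a local directory.
# The repo_id below is a placeholder, not confirmed by the project.
last_model_checkpoint = snapshot_download(
    repo_id="<org>/Audio-Reasoner-7B",
    local_dir="./Audio-Reasoner-7B",
)

The returned path can then be used directly as the last_model_checkpoint value in inference.py.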

How to use

Once installation is complete, users can run Audio-Reasoner from code to handle audio tasks. Detailed instructions follow:

Quick Start: Run the sample code

The project provides a quick start example to help users test model functionality:

  1. Preparing Audio Files
    By default, it uses the project's own assets/test.wav file, or you can replace it with your own WAV-formatted audio. Make sure the path is correct.
  2. Edit the audio path and question in the code
    Open inference.py or use the following code to set the audio path and question:

    audiopath = "assets/test.wav"
    prompt = "What is the rhythmic feel and beat of this audio?"
    audioreasoner_gen(audiopath, prompt)
    
  3. Run the program
    Execute the following in the terminal:

    conda activate Audio-Reasoner
    cd Audio-Reasoner
    python inference.py
    

    The model will output structured reasoning results, including the thinking stages (plan, describe, reason, summarize) and the final answer.

Core functionality: Audio Deep Reasoning

At the core of Audio-Reasoner is chain-of-thought audio reasoning. The workflow is as follows:

  1. Input audio and question
    • Call the audioreasoner_gen function, passing in the audio path and a specific question. Example:
      audiopath = "your_audio.wav"
      prompt = "Is there a bird call in the audio?"
      audioreasoner_gen(audiopath, prompt)
      
  2. View inference output
    The model returns a detailed reasoning process, for example:

    
    <THINK>
    Plan: Check the sound characteristics in the audio to identify whether bird calls are present.
    Describe: The audio contains natural ambient sounds, possibly wind and animal calls.
    Reason: Analyze high-frequency sound features to match bird call patterns.
    Summarize: Bird calls may be present in the audio.
    </THINK>
    Final answer: Yes, there are bird calls in the audio.
  3. Adjust output parameters (optional)
    If a longer or more flexible answer is needed, modify the RequestConfig parameters:

    request_config = RequestConfig(max_tokens=4096, temperature=0.5, stream=True)
    

Local testing of preset samples

The program has built-in test audio and questions for quick verification:

conda activate Audio-Reasoner
cd Audio-Reasoner
python inference.py

After running, the terminal displays the analysis results for assets/test.wav, making it a good first test of the setup.

Featured Function: Multimodal Understanding

Audio-Reasoner supports joint analysis of audio and text. Example:

prompt = "Does the mood of this music match the 'sad' description?"
audioreasoner_gen("sad_music.wav", prompt)

The model will combine audio features and text semantics to output inference results.

Precautions and Frequently Asked Questions

  • Audio format: WAV format is recommended, with a 16 kHz sampling rate and a mono channel; see the conversion sketch after this list.
  • Slow inference: if inference is slow, check that the GPU is enabled (requires a CUDA build of PyTorch).
  • Model not responding: verify that the model path is correct and that all dependencies are installed.
  • Dependency conflicts: if installation fails, try creating a fresh environment and installing the dependencies strictly in order.
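If your audio is not already in the recommended format, it can be converted with librosa and soundfile. A minimal sketch with placeholder filenames (these libraries are common audio tools, not stated project requirements):

import librosa
import soundfile as sf

# Load any common audio format, resampling to 16 kHz mono on the fly.
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)

# Write a WAV file in the format Audio-Reasoner expects.
sf.write("your_audio.wav", audio, sr)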

Advanced Use

  • Customized reasoning logic: modify the system prompt to adjust the model's reasoning style.
  • Batch processing: set max_batch_size to a higher value (e.g., 128) to run inference on multiple audio files simultaneously; a simple sequential alternative is sketched after this list.
  • Combining with the CoTA dataset: once released, the CoTA dataset can be used for further training or fine-tuning of the model.
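For simple batch jobs, multiple files can also be processed one after another with the audioreasoner_gen function shown earlier. A minimal sequential sketch with placeholder filenames (true simultaneous inference relies on the max_batch_size setting mentioned above):

# Assumes audioreasoner_gen from inference.py is available in this scope.
audio_files = ["clip1.wav", "clip2.wav", "clip3.wav"]
prompt = "What sounds can be heard in this audio?"

for audiopath in audio_files:
    # One structured reasoning pass per clip.
    audioreasoner_gen(audiopath, prompt)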