General Introduction
Aana SDK is an open-source framework developed by Mobius Labs, named after the Malayalam word ആന (elephant). It helps developers rapidly deploy and manage multimodal AI models that process a wide range of data, including text, images, audio, and video. Built on the Ray distributed computing framework, the Aana SDK is designed for reliability, scalability, and efficiency. Developers can use it to build applications ranging from standalone tools to clustered services, such as video transcription, image description, or smart chat tools.
Feature List
- Supports multimodal data: can process text, images, audio and video simultaneously.
- Model Deployment and Scaling: Machine learning models can be deployed on a single machine or on a cluster.
- Auto-generated APIs: Automatically create and validate APIs based on defined endpoints.
- Real-time streaming output: supports streaming results for real-time applications and large language models.
- Predefined data types: Built-in support for common data types such as image, video, etc.
- Background task queue: endpoint tasks run automatically in the background without additional configuration.
- Multiple model integrations: Whisper, vLLM, Hugging Face Transformers, and more are supported.
- Documentation Auto-Generation: Automatically generate application documentation based on endpoints.
Usage Guide
Installation process
The Aana SDK can be installed in two ways, from PyPI or from GitHub. The steps are as follows:
- Preparing the environment
- Requires Python 3.8 or later.
- It is recommended to install PyTorch (>= 2.1) manually, choosing the version appropriate for your system (see https://pytorch.org/get-started/locally/); otherwise the default installation may not fully utilize the GPU. An example command is shown after this list.
- If you are using a GPU, it is recommended to install the Flash Attention library to improve performance (see https://github.com/Dao-AILab/flash-attention).
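For example, a GPU build of PyTorch can be installed with pip before installing the SDK. The command below is only a sketch and assumes CUDA 12.1 wheels; use the selector at https://pytorch.org/get-started/locally/ to get the exact command for your platform.
pip install "torch>=2.1" --index-url https://download.pytorch.org/whl/cu121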
- Installation via PyPI
- Run the following command to install the core dependencies:
pip install aana
- For full functionality, install all additional dependencies:
pip install aana[all]
- Other optional extras include vllm (language models), asr (speech recognition), transformers (Hugging Face Transformers models), and more; install them as needed (see the example below).
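For instance, assuming the extra names listed above, the speech-recognition and language-model extras can be installed together (the quotes keep the shell from expanding the brackets):
pip install "aana[asr,vllm]"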
- Installation via GitHub
- Clone the repository:
git clone https://github.com/mobiusml/aana_sdk.git
cd aana_sdk
- Install using Poetry (Poetry >= 2.0 recommended, see https://python-poetry.org/docs/#installation):
poetry install --extras all
- For development environments, also install the test dependencies:
poetry install --extras all --with dev,tests
- Verify Installation
- Import the package:
python -c "import aana; print(aana.__version__)"
If a version number is printed, the installation succeeded.
How to use
At the heart of the Aana SDK are Deployments and Endpoints: deployments load the models, and endpoints define the application's functionality. The following walks through a video transcription example:
- Create a new application
  - Create a new folder (e.g. my_app) and create app.py inside it.
  - The template repository https://github.com/mobiusml/aana_app_template can be used as a quick start.
- Configuration Deployment
- exist
app.py
Loading Whisper Model:from aana.sdk import AanaSDK from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType app = AanaSDK(name="video_app") app.register_deployment( "whisper", WhisperDeployment.options( num_replicas=1, ray_actor_options={"num_gpus": 0.25}, # 若无GPU可删除此行 user_config=WhisperConfig( model_size=WhisperModelSize.MEDIUM, compute_type=WhisperComputeType.FLOAT16 ).model_dump(mode="json") ) )
- exist
- Define the endpoint
  - Add a transcription endpoint:

```python
from aana.core.models.video import VideoInput

@app.aana_endpoint(name="transcribe_video")
async def transcribe_video(self, video: VideoInput):
    audio = await self.download(video.url)  # download the video and extract the audio
    transcription = await self.whisper.transcribe(audio)  # transcribe the audio
    return {"transcription": transcription}
```
- Run the application
  - Run in the terminal:
python app.py serve
  - Or use the Aana CLI:
aana deploy app:app --host 127.0.0.1 --port 8000
  - When the application starts, it is available at http://127.0.0.1:8000 by default.
- Test the endpoint
  - Send a request with cURL:
curl -X POST http://127.0.0.1:8000/transcribe_video -F body='{"url":"https://www.youtube.com/watch?v=VhJFyyukAzA"}'
  - Or test via the Swagger UI at http://127.0.0.1:8000/docs. A Python example follows below.
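The same request can also be sent from Python. This is a minimal sketch that assumes the endpoint accepts the JSON payload in a form field named body, exactly as in the cURL call above:

```python
# Minimal sketch: call the transcribe_video endpoint from Python.
# Assumes the payload format shown in the cURL example above (JSON in a "body" form field).
import json

import requests

payload = {"url": "https://www.youtube.com/watch?v=VhJFyyukAzA"}
response = requests.post(
    "http://127.0.0.1:8000/transcribe_video",
    data={"body": json.dumps(payload)},
)
print(response.json())
```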
Feature Highlights
- Multimodal processing
In addition to speech transcription, image models (e.g. BLIP-2) can be integrated to generate frame descriptions, for example: captions = await self.blip2.generate_captions(video.frames). A combined sketch is shown below.
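As a sketch only: the hypothetical endpoint below combines transcription and captioning in the style of the earlier examples. The blip2 deployment and its generate_captions method mirror the snippet above and are assumptions, not a confirmed Aana SDK API.

```python
# Hypothetical combined endpoint, mirroring the style of the examples above.
# The `blip2` deployment and `generate_captions` call are assumptions, not a confirmed API.
from aana.core.models.video import VideoInput

@app.aana_endpoint(name="describe_video")
async def describe_video(self, video: VideoInput):
    audio = await self.download(video.url)                         # download and extract audio
    transcription = await self.whisper.transcribe(audio)           # speech transcription
    captions = await self.blip2.generate_captions(video.frames)    # frame descriptions
    return {"transcription": transcription, "captions": captions}
```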
- Streaming output
Supports returning results in real time, for example:

```python
@app.aana_endpoint(name="chat", streaming=True)
async def chat(self, question: str):
    async for chunk in self.llm.generate_stream(question):
        yield chunk
```
- Cluster extension
To deploy on a Ray cluster, simply point app.connect() at the cluster address (a sketch follows below).
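As a sketch only, assuming connect() accepts a Ray cluster address (the parameter name and URI format here are assumptions; check the Aana SDK documentation for the exact signature):

```python
# Hypothetical: attach the app to an existing Ray cluster instead of starting a local instance.
# The `address` parameter name and the ray:// URI format are assumptions; consult the SDK docs.
app.connect(address="ray://<head-node-ip>:10001")
```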
Additional tools
- Ray Dashboard: after the application is running, open http://127.0.0.1:8265 to view cluster status and logs.
- Docker deployment: see https://mobiusml.github.io/aana_sdk/pages/docker/.
Application Scenarios
- Video content organization: generate subtitles and summaries for instructional videos for easy archiving and search.
- Intelligent question answering (Q&A): a user uploads a video and asks questions, and the system answers based on the audio and video content.
- Enterprise data analytics: extract key information from meeting recordings and videos to generate reports.
FAQ
- Is a GPU required?
No, the SDK also runs on CPU, but a GPU (40 GB of VRAM recommended) significantly improves efficiency.
- How do I handle installation errors?
Check that the Python version and dependencies match, and add --log-level DEBUG to view detailed logs (see the example below).
- Which models are supported?
vLLM and Whisper are built in, and more Hugging Face models can be integrated through Transformers.
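For the installation-errors question above, a minimal example of the debug flag (its exact placement may differ by CLI version, so treat this as an assumption):
python app.py serve --log-level DEBUG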