General Introduction
Aana SDK is an open-source framework developed by Mobius Labs, named after the Malayalam word ആന (elephant). It helps developers rapidly deploy and manage multimodal AI models that process a wide range of data, including text, images, audio, and video. Built on the Ray distributed computing framework, the Aana SDK is designed for reliability, scalability, and efficiency. Developers can use it to build applications ranging from standalone tools to clustered services, such as video transcription, image description, or smart chat tools.
Feature List
- Supports multimodal data: can process text, images, audio and video simultaneously.
- Model Deployment and Scaling: Machine learning models can be deployed on a single machine or on a cluster.
- Auto-generated APIs: Automatically create and validate APIs based on defined endpoints.
- Real-time streaming output: supports streaming results for real-time applications and large language models.
- Predefined data types: Built-in support for common data types such as image, video, etc.
- Background task queue: endpoint tasks run automatically in the background without additional configuration.
- Multiple model integrations: Whisper, vLLM, Hugging Face Transformers, and more are supported.
- Documentation Auto-Generation: Automatically generate application documentation based on endpoints.
Usage Guide
Installation process
The Aana SDK can be installed in two ways, from PyPI or from GitHub. The steps are as follows:
- Preparing the environment
- Requires Python 3.8 or later.
- It is recommended to install PyTorch (>= 2.1) manually, choosing the version appropriate for your system (see https://pytorch.org/get-started/locally/); otherwise the default installation may not fully utilize the GPU. An example command is shown after this list.
- If you are using a GPU, it is recommended to install the Flash Attention library to improve performance (see https://github.com/Dao-AILab/flash-attention).
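For example, a GPU build of PyTorch can be installed with pip before installing the SDK. The command below is only a sketch and assumes CUDA 12.1 wheels; use the selector at https://pytorch.org/get-started/locally/ to get the exact command for your platform.
pip install "torch>=2.1" --index-url https://download.pytorch.org/whl/cu121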
- Installation via PyPI
- Run the following command to install the core dependencies:
pip install aana
- For full functionality, install all additional dependencies:
pip install aana[all]
- Other optional extras include vllm (language models), asr (speech recognition), transformers (Hugging Face Transformers models), and more; install them as needed (see the example below).
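For instance, assuming the extra names listed above, the speech-recognition and language-model extras can be installed together (the quotes keep the shell from expanding the brackets):
pip install "aana[asr,vllm]"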
- Installation via GitHub
- Clone the repository:
git clone https://github.com/mobiusml/aana_sdk.git
cd aana_sdk
- Install using Poetry (Poetry >= 2.0 recommended, see https://python-poetry.org/docs/#installation):
poetry install --extras all
- For development environments, also install the test dependencies:
poetry install --extras all --with dev,tests
- Verify Installation
- Import the package:
python -c "import aana; print(aana.__version__)"
If a version number is printed, the installation succeeded.
How to use
At the heart of the Aana SDK are Deployments and Endpoints: deployments load the models, and endpoints define the application's functionality. The following walks through a video transcription example:
- Create a new application
  - Create a new folder (e.g. my_app) and create app.py inside it.
  - The template repository https://github.com/mobiusml/aana_app_template can be used as a quick start.
- Configuration Deployment
- exist
app.py
Loading Whisper Model:from aana.sdk import AanaSDK from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType app = AanaSDK(name="video_app") app.register_deployment( "whisper", WhisperDeployment.options( num_replicas=1, ray_actor_options={"num_gpus": 0.25}, # 若无GPU可删除此行 user_config=WhisperConfig( model_size=WhisperModelSize.MEDIUM, compute_type=WhisperComputeType.FLOAT16 ).model_dump(mode="json") ) )
- exist
- Define the endpoint
  - Add a transcription endpoint:

```python
from aana.core.models.video import VideoInput

@app.aana_endpoint(name="transcribe_video")
async def transcribe_video(self, video: VideoInput):
    audio = await self.download(video.url)  # download the video and extract the audio
    transcription = await self.whisper.transcribe(audio)  # transcribe the audio
    return {"transcription": transcription}
```
- Run the application
  - Run in the terminal:
python app.py serve
  - Or use the Aana CLI:
aana deploy app:app --host 127.0.0.1 --port 8000
  - When the application starts, it is available at http://127.0.0.1:8000 by default.
- Test the endpoint
  - Send a request with cURL:
curl -X POST http://127.0.0.1:8000/transcribe_video -F body='{"url":"https://www.youtube.com/watch?v=VhJFyyukAzA"}'
  - Or test via the Swagger UI at http://127.0.0.1:8000/docs. A Python example follows below.
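The same request can also be sent from Python. This is a minimal sketch that assumes the endpoint accepts the JSON payload in a form field named body, exactly as in the cURL call above:

```python
# Minimal sketch: call the transcribe_video endpoint from Python.
# Assumes the payload format shown in the cURL example above (JSON in a "body" form field).
import json

import requests

payload = {"url": "https://www.youtube.com/watch?v=VhJFyyukAzA"}
response = requests.post(
    "http://127.0.0.1:8000/transcribe_video",
    data={"body": json.dumps(payload)},
)
print(response.json())
```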
Feature Highlights
- Multimodal processing
In addition to speech transcription, image models (e.g. BLIP-2) can be integrated to generate frame descriptions, for example: captions = await self.blip2.generate_captions(video.frames). A combined sketch is shown below.
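As a sketch only: the hypothetical endpoint below combines transcription and captioning in the style of the earlier examples. The blip2 deployment and its generate_captions method mirror the snippet above and are assumptions, not a confirmed Aana SDK API.

```python
# Hypothetical combined endpoint, mirroring the style of the examples above.
# The `blip2` deployment and `generate_captions` call are assumptions, not a confirmed API.
from aana.core.models.video import VideoInput

@app.aana_endpoint(name="describe_video")
async def describe_video(self, video: VideoInput):
    audio = await self.download(video.url)                         # download and extract audio
    transcription = await self.whisper.transcribe(audio)           # speech transcription
    captions = await self.blip2.generate_captions(video.frames)    # frame descriptions
    return {"transcription": transcription, "captions": captions}
```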
- Streaming output
Supports returning results in real time, for example:

```python
@app.aana_endpoint(name="chat", streaming=True)
async def chat(self, question: str):
    async for chunk in self.llm.generate_stream(question):
        yield chunk
```
- Cluster extension
To deploy on a Ray cluster, simply point app.connect() at the cluster address (a sketch follows below).
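As a sketch only, assuming connect() accepts a Ray cluster address (the parameter name and URI format here are assumptions; check the Aana SDK documentation for the exact signature):

```python
# Hypothetical: attach the app to an existing Ray cluster instead of starting a local instance.
# The `address` parameter name and the ray:// URI format are assumptions; consult the SDK docs.
app.connect(address="ray://<head-node-ip>:10001")
```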
Additional tools
- Ray Dashboard: after the application is running, open http://127.0.0.1:8265 to view cluster status and logs.
- Docker deployment: see https://mobiusml.github.io/aana_sdk/pages/docker/.
Application Scenarios
- Video content organization: generate subtitles and summaries for instructional videos for easy archiving and search.
- Intelligent question answering (Q&A): a user uploads a video and asks questions, and the system answers based on the audio and video content.
- Enterprise data analytics: extract key information from meeting recordings and videos to generate reports.
FAQ
- Is a GPU required?
No, the SDK also runs on CPU, but a GPU (40 GB of VRAM recommended) significantly improves efficiency.
- How do I handle installation errors?
Check that the Python version and dependencies match, and add --log-level DEBUG to view detailed logs (see the example below).
- Which models are supported?
vLLM and Whisper are built in, and more Hugging Face models can be integrated through Transformers.
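For the installation-errors question above, a minimal example of the debug flag (its exact placement may differ by CLI version, so treat this as an assumption):
python app.py serve --log-level DEBUG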