SkyPilot: an open source framework for efficiently running AI and batch tasks in any cloud

General Introduction

SkyPilot is an open source framework initiated by the Sky Computing Lab at the University of California, Berkeley, designed to help users run AI and batch processing tasks efficiently on any cloud infrastructure. It supports more than 14 clouds (including AWS, GCP, and Azure) as well as Kubernetes, and provides a unified execution interface, significant cost savings, and high GPU availability. Tasks are described in a simple YAML file or through the Python API, and SkyPilot deploys them to the cloud without code changes, automatically selects suitable resources, and handles fault tolerance. Since its release in 2022, the project has attracted over 1 million downloads and 7k+ stars on GitHub, and is widely used in machine learning model training, hyperparameter tuning, and bioinformatics.

Function List

Multi-cloud task deployment: Runs tasks on 14+ clouds (such as AWS, GCP, and Azure) and on Kubernetes, avoiding lock-in to a single cloud vendor.
Cost optimization: Automatically selects the cheapest cloud, region, and instance type (on-demand or Spot), with reported cost savings of 3-6.5x.
GPU/TPU support: Schedules and manages GPU (e.g., NVIDIA A100) and TPU resources to improve compute efficiency.
Automatic fault recovery: Automatically switches to another cloud or region when capacity is short or an instance is preempted, so that tasks can continue.
Code and data synchronization: Automatically synchronizes local code and data to the cloud cluster to simplify deployment.
Task queue management: Supports large numbers of concurrent tasks (e.g., 2000+ jobs) for scenarios such as hyperparameter tuning.
Service deployment (SkyServe): Provides HTTPS support and load balancing policies for deploying AI model services.
Reserved capacity support: Integrates reserved resources such as AWS Capacity Blocks and GCP Dynamic Workload Scheduler.
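
The cost optimization and automatic fault recovery items above can be pictured as a cheapest-first search with failover. The following is only a conceptual sketch and not SkyPilot's actual optimizer: the `Candidate` type, the `provision` callback, and every price are hypothetical placeholders made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical (cloud, region, pricing model) option for one task."""
    cloud: str
    region: str
    spot: bool
    hourly_price: float  # placeholder number, not a real quote


def launch_cheapest(
    candidates: Iterable[Candidate],
    provision: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try candidates from cheapest to most expensive.

    `provision` stands in for a real provisioning call and returns False
    when the candidate has no capacity. The first candidate that succeeds
    is returned; None means every candidate failed.
    """
    for cand in sorted(candidates, key=lambda c: c.hourly_price):
        if provision(cand):
            return cand
    return None


# Illustrative data only: the prices below are invented.
options = [
    Candidate("aws", "us-east-1", spot=False, hourly_price=4.10),
    Candidate("gcp", "us-central1", spot=True, hourly_price=1.20),
    Candidate("azure", "eastus", spot=True, hourly_price=1.50),
]

# Pretend the cheapest option (the GCP Spot entry) is out of capacity.
chosen = launch_cheapest(options, lambda c: c.cloud != "gcp")
print(chosen)
```

In this toy run the search skips the unavailable GCP entry and settles on the next cheapest one, which is the behavior the feature list describes at a high level.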

Usage Guide

SkyPilot's installation and usage process is straightforward and suitable for developers, researchers and enterprise users. Below is a detailed description of how to install, configure and operate SkyPilot's core features.

Installation process

Environment preparation
Ensure that Python 3.8 or later and the pip package manager are installed locally. A virtual environment is recommended to avoid dependency conflicts:
```
python3 -m venv skypilot_env
source skypilot_env/bin/activate
```

Installing SkyPilot
Install the latest version with pip:
```
pip install -U "skypilot[all]"
```
Here, `[all]` installs the dependencies for every supported cloud. If only a specific cloud (e.g. AWS) is required, use `skypilot[aws]` instead.
Cloud Authentication
Configure the credentials for the target cloud. Take AWS as an example:
- Install the AWS CLI: `pip install awscli`
- Configure the credentials: run `aws configure` and enter your Access Key and Secret Key when prompted.
  For other clouds (e.g. GCP, Azure), refer to the official documentation to complete a similar configuration.
Verify Installation
Run the following command to check if the installation was successful:
```
sky check
```
The output displays the status of the enabled cloud services.

Core feature workflow

1. Creating a task configuration file

SkyPilot uses YAML files to define tasks. To train a PyTorch model, for example, create a train.yaml:

resources:
  accelerators: A100:1   # use one A100 GPU
num_nodes: 1             # single node
workdir: ~/my_project    # sync the local project directory
setup: |                 # install dependencies
  pip install torch torchvision
run: |                   # run the task
  python main.py --epochs 10

resources: Specifies the compute resources, such as GPU type and count.
workdir: The local code directory, which is synchronized to the cloud automatically.
setup: Preparation commands that run before the task.
run: The command that executes the task.
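
The same task can also be described through SkyPilot's Python API instead of YAML. The following is a minimal sketch that mirrors the four fields above. It assumes the `skypilot` package is installed and that `sky.Task` and `sky.Resources` accept these keyword arguments, which may differ between releases; the function name `build_train_task` is made up for this example.

```python
def build_train_task():
    """Build a task equivalent to train.yaml (assumes SkyPilot is installed)."""
    import sky  # imported lazily so the file can be read without SkyPilot

    task = sky.Task(
        workdir="~/my_project",                 # synced to the cluster
        setup="pip install torch torchvision",  # runs before the task
        run="python main.py --epochs 10",       # the task command
        num_nodes=1,                            # single node
    )
    # Request one A100 GPU, matching `accelerators: A100:1` in the YAML.
    task.set_resources(sky.Resources(accelerators="A100:1"))
    return task
```

The returned task would then be handed to `sky.launch` together with a cluster name. Check the documentation for your installed version before relying on this, since the exact launch call and its return value may vary across releases.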

2. Launching the task

Run in the terminal:

sky launch -c my-cluster train.yaml

-c my-cluster: Names the cluster so that it can be managed later.
SkyPilot automatically selects the optimal cloud resources and starts the task.

3. Viewing task status

Check the cluster status:

sky status

This lists your clusters along with their resources and current status.

4. Cost optimization and Spot instances

Use Spot instances to reduce costs:

sky launch -c my-spot-cluster train.yaml --use-spot

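Note that automatic recovery from Spot preemptions applies to managed jobs (sky jobs launch); a cluster started with sky launch --use-spot is not relaunched automatically if it is preempted.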
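
When launches are driven from a script, it can help to assemble the command in one place. The helper below is a hypothetical wrapper (the name `sky_launch_cmd` is invented here) that uses only the `-c`, `--use-spot`, and `--cloud` flags appearing in this article. It builds the argument list without running anything, so the command can be inspected before it is executed.

```python
import shlex
from typing import List, Optional


def sky_launch_cmd(
    cluster: str,
    task_yaml: str,
    use_spot: bool = False,
    cloud: Optional[str] = None,
) -> List[str]:
    """Return the argv for a `sky launch` call (nothing is executed)."""
    cmd = ["sky", "launch", "-c", cluster, task_yaml]
    if use_spot:
        cmd.append("--use-spot")
    if cloud is not None:
        cmd += ["--cloud", cloud]
    return cmd


cmd = sky_launch_cmd("my-spot-cluster", "train.yaml", use_spot=True)
print(shlex.join(cmd))  # sky launch -c my-spot-cluster train.yaml --use-spot

# To actually launch (requires SkyPilot and cloud credentials):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems if a cluster name or path ever contains spaces.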

5. Deploying services (SkyServe)

As an example of deploying a LLaMA model, create a serve.yaml:

resources:
  accelerators: A100:1
  ports: 8080              # service port
service:
  readiness_probe: /       # assumed health-check path; adjust to what serve.py exposes
  replicas: 2              # 2 replicas
run: |
  python serve.py --model llama

Start the service:

sky serve up serve.yaml -n llama-service

Once the replicas are ready, the service can be reached through the generated endpoint (served over HTTPS when TLS is configured for the service).

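6. Large-scale concurrent tasks

To run a large number of tasks (e.g. 2000+ for hyperparameter tuning), submit each one as a managed job:

sky jobs launch -n my-job train.yaml

Repeat the command once per configuration, for example from a script. SkyPilot allocates resources and manages the job queue.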
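
One way to fan a hyperparameter sweep out into many jobs is to generate one `sky jobs launch` command per configuration. The sketch below covers only that generation step. It assumes the `-n` (job name), `--env` (environment variable passed to the task), and `-y` (skip confirmation) options of `sky jobs launch`, and that train.yaml reads the hypothetical `LR` and `BATCH_SIZE` variables; the grid values are placeholders.

```python
import itertools
from typing import Dict, Iterator, List, Sequence


def sweep_commands(
    task_yaml: str,
    grid: Dict[str, Sequence[object]],
    name_prefix: str = "sweep",
) -> Iterator[List[str]]:
    """Yield one `sky jobs launch` argv per point in the grid."""
    keys = list(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    for i, values in enumerate(combos):
        # Each job gets a unique name so it can be told apart in the queue.
        cmd = ["sky", "jobs", "launch", "-n", f"{name_prefix}-{i}", "-y"]
        for key, value in zip(keys, values):
            cmd += ["--env", f"{key}={value}"]
        cmd.append(task_yaml)
        yield cmd


# Placeholder grid: 3 learning rates x 2 batch sizes.
grid = {"LR": [1e-4, 3e-4, 1e-3], "BATCH_SIZE": [32, 64]}
commands = list(sweep_commands("train.yaml", grid))
print(len(commands))  # 6
print(" ".join(commands[0]))
```

In practice each generated command would be executed with `subprocess.run`, most likely in a detached mode so that submitting one job does not block on its log stream.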

Featured Functions

Multi-Cloud Switching

If AWS does not have A100 available, you can specify another cloud:

sky launch -c my-cluster train.yaml --cloud gcp

Or omit --cloud and let SkyPilot automatically select the cheapest available cloud:

sky launch -c my-cluster train.yaml

GPU scheduling

View available GPU resources:

sky show-gpus

Adjust the accelerators parameter under resources to match your needs.

Logging and Debugging

View task logs in real time:

sky logs my-cluster

Enable debug mode:

export SKYPILOT_DEBUG=1
sky launch -c my-cluster train.yaml

Caveats

Credential security: Ensure that cloud credential files (e.g. under ~/.aws) have restrictive permissions to avoid leakage.
Resource release: Manually shut down the cluster after the task is completed:
```
sky down my-cluster
```
Documentation: For more advanced usage, refer to the official SkyPilot documentation.
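
Because a forgotten cluster keeps accruing charges, scripted workflows may want to make the resource release step automatic. The following is a hedged sketch of one way to do that with a context manager. The `cluster` helper and the `run` parameter are hypothetical names, and the `-y` flag is assumed to skip the confirmation prompt for both `sky launch` and `sky down`.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], None]


def _run(cmd: List[str]) -> None:
    """Default runner: execute the command and raise on failure."""
    subprocess.run(cmd, check=True)


@contextmanager
def cluster(name: str, task_yaml: str, run: Runner = _run) -> Iterator[str]:
    """Launch a cluster and always attempt to tear it down afterwards."""
    try:
        run(["sky", "launch", "-y", "-c", name, task_yaml])
        yield name
    finally:
        # Runs even if the launch or the body raises, so a half-created
        # cluster is not left behind.
        run(["sky", "down", "-y", name])


# Dry run with a recording runner instead of the real CLI.
issued: List[List[str]] = []
with cluster("my-cluster", "train.yaml", run=issued.append):
    pass  # e.g. stream logs or download results here
print(issued[-1])  # ['sky', 'down', '-y', 'my-cluster']
```

Injecting the runner keeps the sketch testable without cloud credentials; with the default runner the same `with` block would call the real CLI.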

With these steps, users can quickly get started with SkyPilot and run AI and batch tasks efficiently.
