General Introduction
SkyPilot is an open-source framework started by the Sky Computing Lab at the University of California, Berkeley, for running AI and batch workloads efficiently on any cloud infrastructure. It supports more than 14 clouds (including AWS, GCP, and Azure) as well as Kubernetes, and offers a unified execution interface, significant cost savings, and high GPU availability. Tasks are described in a simple YAML file or through the Python API, so users can deploy them to the cloud without code changes while SkyPilot selects suitable resources automatically and provides fault tolerance. Since its release in 2022, the project has attracted over 1 million downloads and 7k+ stars on GitHub, and it is widely used for machine learning model training, hyperparameter tuning, and bioinformatics.
Function List
- Multi-cloud task deployment: Runs tasks on 14+ clouds such as AWS, GCP, and Azure, as well as on Kubernetes, avoiding lock-in to a single cloud vendor.
- Cost optimization: Automatically selects the cheapest cloud, region, or instance type (on-demand or Spot), for cost savings in the range of 3-6.5x.
- GPU/TPU Support: Seamlessly schedules and manages GPU (e.g., NVIDIA A100) and TPU resources to improve compute efficiency.
- Automatic fault recovery: Automatically switches to other clouds or regions when capacity is low or an instance is preempted, so that tasks keep running.
- Code and data synchronization: Automatically synchronizes local code and data to the cloud cluster to simplify deployment.
- Task queue management: Supports massively concurrent tasks (e.g., 2000+ jobs) for scenarios such as hyperparameter tuning.
- Service deployment (SkyServe): Provides HTTPS support and load balancing policies for easy deployment of AI model services.
- Reserved instance support: Integrates reserved capacity such as AWS Capacity Blocks and GCP Dynamic Workload Scheduler.
Usage Guide
SkyPilot's installation and usage process is straightforward and suitable for developers, researchers and enterprise users. Below is a detailed description of how to install, configure and operate SkyPilot's core features.
Installation process
- Environment preparation
Ensure that Python 3.8 or later and the pip package manager are installed locally. A virtual environment is recommended to avoid dependency conflicts:
python3 -m venv skypilot_env
source skypilot_env/bin/activate
- Installing SkyPilot
Install the latest version with pip:
pip install -U "skypilot[all]"
The [all] extra installs the dependencies for all supported clouds. If only a specific cloud (e.g. AWS) is required, use skypilot[aws] instead.
- Cloud authentication
Configure the credentials for the target cloud. Taking AWS as an example:
  - Install the AWS CLI:
    pip install awscli
  - Configure the credentials:
    aws configure
    Enter the Access Key and Secret Key when prompted.
For other clouds (e.g. GCP, Azure), refer to the official documentation to complete a similar configuration.
- Verify installation
Run the following command to check whether the installation was successful:
sky check
The output shows the status of each enabled cloud service.
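The verification step can also be scripted, for example in a setup check that runs before any task is submitted. The following is a minimal sketch using only the Python standard library; the helper names `sky_cli_path` and `run_sky_check` are hypothetical and not part of SkyPilot.

```python
import shutil
import subprocess


def sky_cli_path(executable="sky"):
    """Return the full path of the CLI if it is on PATH, else None."""
    return shutil.which(executable)


def run_sky_check(executable="sky"):
    """Run `sky check`; return its exit code, or None if the CLI is missing."""
    path = sky_cli_path(executable)
    if path is None:
        return None
    completed = subprocess.run([path, "check"], check=False)
    return completed.returncode


# Only looks the executable up on PATH; no cloud API is contacted here.
print("sky CLI found" if sky_cli_path() else "sky CLI not found on PATH")
```

Calling `run_sky_check()` afterwards runs the same `sky check` command as above and returns its exit code.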
Core workflow
1. Creating a task file
SkyPilot uses YAML files to define tasks. For example, to train a PyTorch model, create train.yaml:
resources:
  accelerators: A100:1  # use 1 A100 GPU
num_nodes: 1  # single node
workdir: ~/my_project  # local project directory to synchronize
setup: |  # install dependencies
  pip install torch torchvision
run: |  # run the task
  python main.py --epochs 10
- resources: Specifies compute resources, such as GPU type and count.
- workdir: Local code directory, synchronized to the cloud automatically.
- setup: Preparation commands that run before the task.
- run: The command that executes the task.
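When only a few values change between runs, the task file can be generated from a script instead of edited by hand. The sketch below fills the fields described above into a text template using only the Python standard library; `render_task` and its parameters are hypothetical names chosen for illustration, not part of SkyPilot.

```python
from string import Template

# Template mirroring the train.yaml task file; $-placeholders are filled in.
TASK_TEMPLATE = Template("""\
resources:
  accelerators: $accelerators
num_nodes: $num_nodes
workdir: $workdir
setup: |
  pip install torch torchvision
run: |
  python main.py --epochs $epochs
""")


def render_task(accelerators="A100:1", num_nodes=1,
                workdir="~/my_project", epochs=10):
    """Return the task file contents as a string."""
    return TASK_TEMPLATE.substitute(
        accelerators=accelerators,
        num_nodes=num_nodes,
        workdir=workdir,
        epochs=epochs,
    )


# Print a variant of the task that trains for 20 epochs.
print(render_task(epochs=20))
```

Writing the returned string to train.yaml gives a file that can be passed to the SkyPilot commands unchanged.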
2. Launching the task
Run in the terminal:
sky launch -c my-cluster train.yaml
- -c my-cluster: Names the cluster so that it is easy to manage later.
- SkyPilot automatically selects the optimal cloud resources and starts the task.
3. Viewing task status
Check the cluster status:
sky status
This lists all clusters together with their resources and current status.
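4. Cost optimization and Spot instances
Use Spot instances to reduce costs:
sky launch -c my-spot-cluster train.yaml --use-spot
Spot instances can be preempted at any time. When the task is submitted as a managed job (sky jobs launch), SkyPilot detects the preemption and resumes the task automatically.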
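For a resumed task to continue rather than start over, the training script itself has to record its progress. The following is a minimal sketch of that idea, assuming main.py stores the next epoch in a small JSON file; the file name and the helpers `load_start_epoch`, `save_progress`, and `train` are hypothetical, and in practice the file would need to live on storage that survives the instance.

```python
import json
import tempfile
from pathlib import Path


def load_start_epoch(ckpt_path):
    """Return the next epoch to run, or 0 if no progress file exists."""
    path = Path(ckpt_path)
    if not path.exists():
        return 0
    return json.loads(path.read_text())["next_epoch"]


def save_progress(ckpt_path, next_epoch):
    """Write progress to a temp file, then rename it over the old one."""
    path = Path(ckpt_path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps({"next_epoch": next_epoch}))
    tmp.replace(path)


def train(ckpt_path, total_epochs):
    """Run the remaining epochs and return the list of epochs executed."""
    executed = []
    for epoch in range(load_start_epoch(ckpt_path), total_epochs):
        # Placeholder for one epoch of real training work.
        executed.append(epoch)
        save_progress(ckpt_path, epoch + 1)
    return executed


# Simulate an interrupted run followed by a resumed one.
with tempfile.TemporaryDirectory() as workdir:
    ckpt = Path(workdir) / "progress.json"
    first = train(ckpt, total_epochs=3)   # fresh start
    second = train(ckpt, total_epochs=5)  # continues after saved progress
    print(first, second)  # prints: [0, 1, 2] [3, 4]
```

The rename step keeps the previous progress file intact if the instance disappears in the middle of a write.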
5. Deploying services (SkyServe)
To deploy a LLaMA model, for example, create serve.yaml:
resources:
  accelerators: A100:1
  ports: 8080  # service port
service:
  replicas: 2  # 2 replicas
  readiness_probe: /  # health-check path (assumed here; adjust to your server)
run: |
  python serve.py --model llama
Start the service:
sky serve up serve.yaml -n llama-service
The service is available by accessing the generated HTTPS endpoint.
6. Massively concurrent tasks
To run 2000+ tasks (e.g. for hyperparameter tuning), submit each one as a managed job, typically from a loop or script:
sky jobs launch -n my-job train.yaml
SkyPilot queues the jobs, allocates resources for each of them, and manages the queue.
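One way to submit a large batch is to generate one `sky jobs launch` command per hyperparameter combination. The sketch below only builds and prints the commands (a dry run). It assumes that train.yaml reads the hypothetical environment variables `LR` and `BATCH_SIZE`, and that jobs are named with `-n` and parameterized with `--env`; `sweep_commands` is an illustrative helper, not part of SkyPilot.

```python
import itertools
import shlex


def sweep_commands(task_file, grid, name_prefix="sweep"):
    """Build one `sky jobs launch` command per hyperparameter combination."""
    keys = sorted(grid)
    combos = itertools.product(*(grid[key] for key in keys))
    commands = []
    for index, values in enumerate(combos):
        # -n names the job; -y skips the confirmation prompt.
        cmd = ["sky", "jobs", "launch", "-n", f"{name_prefix}-{index}", "-y"]
        for key, value in zip(keys, values):
            cmd += ["--env", f"{key}={value}"]
        cmd.append(task_file)
        commands.append(cmd)
    return commands


# Hypothetical 2 x 2 grid; a real sweep would list more values per key.
grid = {"LR": [0.1, 0.01], "BATCH_SIZE": [32, 64]}
commands = sweep_commands("train.yaml", grid)
for cmd in commands:
    print(shlex.join(cmd))  # dry run: print instead of executing
```

Passing each list to `subprocess.run` would submit the jobs for real; printing first makes it easy to review the batch before spending money.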
Featured Functions
Multi-Cloud Switching
If AWS has no A100 capacity available, you can specify another cloud:
sky launch -c my-cluster train.yaml --cloud gcp
Or omit the --cloud option to let SkyPilot select the cheapest available cloud automatically:
sky launch -c my-cluster train.yaml
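When launches are scripted, the target cloud can be kept as an optional setting. The sketch below assembles the argument list for `sky launch` using only the `-c`, `--cloud`, and `--use-spot` options shown in this guide; `build_launch_command` is a hypothetical helper. Leaving `cloud` unset leaves the choice of cloud to SkyPilot.

```python
import shlex


def build_launch_command(cluster, task_file, cloud=None, use_spot=False):
    """Return the argument list for a `sky launch` call."""
    cmd = ["sky", "launch", "-c", cluster]
    if cloud is not None:
        cmd += ["--cloud", cloud]  # pin the task to one cloud
    if use_spot:
        cmd.append("--use-spot")
    cmd.append(task_file)
    return cmd


# Dry run: print the commands instead of executing them.
pinned = build_launch_command("my-cluster", "train.yaml", cloud="gcp")
unpinned = build_launch_command("my-cluster", "train.yaml")
print(shlex.join(pinned))
print(shlex.join(unpinned))
```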
GPU scheduling
View available GPU resources:
sky show-gpus
To match changing requirements, adjust the accelerators parameter under resources in the task file.
Logging and Debugging
View task logs in real time:
sky logs my-cluster
Enable debug mode:
export SKYPILOT_DEBUG=1
sky launch -c my-cluster train.yaml
Notes
- Credential security: Ensure that the permissions on cloud credential files (e.g. ~/.aws) are set correctly to avoid leaks.
- Resource release: Manually shut down the cluster after the task is completed:
sky down my-cluster
- Documentation support: For more advanced usage, refer to the official SkyPilot documentation.
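Forgetting the resource release step is a common source of unexpected cost, so a script can pair every launch with a teardown. The following is a minimal sketch with a hypothetical helper `run_then_teardown`; it assumes the `-y` option to skip confirmation prompts, and takes the command runner as a parameter so the flow can be tried as a dry run.

```python
import subprocess


def run_then_teardown(cluster, task_file, runner=subprocess.run):
    """Launch a task and always attempt `sky down`, even if the launch fails."""
    launch_cmd = ["sky", "launch", "-c", cluster, "-y", task_file]
    down_cmd = ["sky", "down", "-y", cluster]
    try:
        return runner(launch_cmd, check=True)
    finally:
        # Runs on success and on failure, so the cluster is not left behind.
        runner(down_cmd, check=False)


def dry_run(cmd, check=False):
    """Stand-in runner that prints the command instead of executing it."""
    print("would run:", " ".join(cmd))


run_then_teardown("my-cluster", "train.yaml", runner=dry_run)
```

With the default `runner`, the same function executes both commands through `subprocess.run`.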
With these steps, users can quickly get started with SkyPilot and run AI and batch tasks efficiently.