
Evo2: An Open Source BioAI Tool to Support Genome Modeling and Design

General Introduction

Evo 2 is an open source project focused on genome modeling and design, developed by the Arc Institute, a non-profit research organization based in Palo Alto, California, in collaboration with partners such as NVIDIA. The project builds biological foundation models that process DNA, RNA, and protein sequences with state-of-the-art deep learning, supporting both predictive and generative tasks in the life sciences. Evo 2 is trained on more than 9 trillion nucleotides of diverse genomic data, scales up to 40 billion parameters, and supports context lengths of up to 1 million bases. Its code, training data, and model weights are fully open source and hosted on GitHub, with the goal of accelerating bioengineering and medical research. Researchers and developers alike can use the tool to explore the genome and design new biological sequences.


Function List

  • Genome modeling across all domains of life: enables prediction and design on bacterial, archaeal, and eukaryotic genomes.
  • Long-sequence handling: processes DNA sequences of up to 1 million bases for ultra-long-context analysis tasks.
  • DNA generation and optimization: generates new DNA sequences, with coding-region annotations, from input sequences or species prompts.
  • Zero-shot variant effect prediction: predicts the biological impact of a genetic variant without additional training, e.g., analyzing the effect of BRCA1 variants.
  • Open source datasets and models: provides pre-trained models and the OpenGenome2 dataset to support secondary development and research.
  • Multi-GPU parallel computing: automatically allocates multiple GPUs through the Vortex framework to improve large-scale computing efficiency.
  • NVIDIA BioNeMo integration: connects to NVIDIA's biocomputing platform to broaden application scenarios.
  • Visualization and interpretation tools: together with Goodfire's interpretability visualizer, reveals the biological features and patterns the model has learned.

 

Usage Guide

Installation process

To use Evo 2 locally, certain computing resources and environment configurations are required. Below are the detailed installation steps:

1. Environment preparation

  • Operating system: Linux (e.g., Ubuntu) or macOS is recommended; Windows users need to install WSL2.
  • Hardware requirements: at least one NVIDIA GPU (multiple GPUs recommended for the 40B model) with at least 16 GB of video memory (e.g., A100 or RTX 3090).
  • Software dependencies: make sure Git, Python 3.8+, PyTorch (with CUDA support), and pip are installed; a quick check is sketched below.
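
Before proceeding, it can help to confirm that Python and PyTorch can see your GPU. The following is a minimal check of my own, not part of the official Evo 2 instructions; it assumes PyTorch is already installed:

import sys
import torch

# Confirm the Python version and CUDA-capable GPUs before installing Evo 2.
assert sys.version_info >= (3, 8), "Python 3.8+ is required"
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)  # reports GPU name and total memory
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")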

2. Cloning the code repository

Open a terminal and run the following command to get the Evo 2 source code:

git clone --recurse-submodules git@github.com:ArcInstitute/evo2.git
cd evo2

Note: the --recurse-submodules flag ensures that all submodules are downloaded as well.

3. Installation of dependencies

Run the following in the project root directory:

pip install .

If you run into problems, try installing via Vortex instead (see the GitHub README for details). Once installation is complete, run the test to verify it:

python -m evo2.test

If the output reports no errors, the installation was successful.

4. Downloading pre-trained models

Evo 2 is available in several model versions (e.g., 1B, 7B, 40B parameters) and can be downloaded from Hugging Face or GitHub Releases. Example:

wget https://huggingface.co/arcinstitute/evo2_7b/resolve/main/evo2_7b_base.pt

Place the model files in a local directory for subsequent loading.
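
If you prefer the Hugging Face client to wget, a download sketch like the following should also work; it assumes the huggingface_hub package is installed, and the repository and file names are taken from the URL above:

from huggingface_hub import hf_hub_download

# Downloads the checkpoint into the local Hugging Face cache and returns its path.
ckpt_path = hf_hub_download(repo_id="arcinstitute/evo2_7b", filename="evo2_7b_base.pt")
print(ckpt_path)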

How to use

Once installed, the core functions of Evo 2 can be invoked via Python scripts. Below is a detailed flow of how the main functions work:

Function 1: Generation of DNA sequences

Evo 2 can generate a continuation sequence from an input DNA fragment. The procedure is as follows:

  1. Load the model:
    from evo2 import Evo2
    model = Evo2('evo2_7b')  # use the 7B-parameter model

  2. Enter a prompt and generate:
    prompt = ["ACGT"]  # initial DNA sequence as input
    output = model.generate(prompt_seqs=prompt, n_tokens=400, temperature=1.0, top_k=4)
    print(output.sequences[0])  # print the generated 400-nucleotide continuation

  3. Interpret the results: the generated sequence can be used in downstream biological analysis; the temperature parameter controls randomness and top_k limits the sampling pool, as the sketch after this list illustrates.
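
The following sketch simply reuses the generate() call above with different settings to show the effect of the sampling parameters; the specific temperature values are arbitrary examples:

# Compare continuations of the same prompt at several temperatures (illustrative values).
prompt = ["ACGT"]
for temp in (0.5, 1.0, 1.3):
    out = model.generate(prompt_seqs=prompt, n_tokens=100, temperature=temp, top_k=4)
    print(f"temperature={temp}: {out.sequences[0][:60]}...")

Lower temperatures make sampling more deterministic, while higher values produce more varied continuations.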

Function 2: Zero-shot variant effect prediction

Taking the BRCA1 gene as an example, predict the biological impact of variants:

  1. Prepare the data: store the reference and variant sequences in lists.
  2. Run the scoring:
    ref_seqs = ["ATCG..."]  # reference sequences
    var_seqs = ["ATGG..."]  # variant sequences
    ref_scores = model.score_sequences(ref_seqs)
    var_scores = model.score_sequences(var_seqs)
    print(f"Reference likelihood: {ref_scores}, Variant likelihood: {var_scores}")

  3. Analysis: compare the difference in scores to assess the variant's potential impact on function, as sketched below.
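
One common way to read these scores is as a delta, the variant log-likelihood minus the reference log-likelihood, where a strongly negative delta suggests the variant is disruptive. The sketch below is illustrative only: the sequence is a placeholder rather than real BRCA1 data, and it assumes score_sequences() returns one score per input sequence:

# Illustrative delta-score comparison (placeholder sequence, not real BRCA1 coordinates).
ref_seq = "ATCGGCTAAGCTAGGCTACCTA"           # hypothetical reference window
var_seq = ref_seq[:10] + "T" + ref_seq[11:]  # single-base substitution at position 10

ref_score = model.score_sequences([ref_seq])[0]
var_score = model.score_sequences([var_seq])[0]
delta = var_score - ref_score
print(f"delta log-likelihood: {delta:.4f}")
print("likely disruptive" if delta < 0 else "likely tolerated")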

Function 3: Long Sequence Processing

For very long sequences, Evo 2 supports chunked loading and computation:

  1. Load the large model:
    model = Evo2('evo2_40b')  # requires multi-GPU support

  2. Process the long sequence:
    long_seq = "ATCG" * 250_000  # a sequence of 1 million bases
    output = model.generate([long_seq], n_tokens=1000)
    print(output.sequences[0])

  3. Caveat: forward passes over very long sequences can currently be slow; optimizing the hardware configuration or using teacher prompting is recommended. A windowed-scoring workaround is sketched after this list.
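
If a full forward pass over a million-base sequence is too slow on your hardware, one workaround is to score it in overlapping windows and aggregate the results. This is only a sketch of that idea, not an Evo 2 API; the window and stride sizes are arbitrary:

# Score a long sequence in overlapping windows (workaround sketch, not an Evo 2 API).
def score_in_windows(model, seq, window=8192, stride=4096):
    starts = range(0, max(len(seq) - window, 0) + 1, stride)
    windows = [seq[i:i + window] for i in starts]
    scores = model.score_sequences(windows)
    return sum(scores) / len(scores)  # mean window score as a rough summary

long_seq = "ATCG" * 250_000  # ~1 million bases
print(score_in_windows(model, long_seq))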

Function 4: Datasets and secondary development

  • Getting the dataset: download the OpenGenome2 dataset (in FASTA or JSONL format) from Hugging Face; a minimal loader is sketched below.
  • Customized training: modify the model architecture or fine-tune parameters with the Savanna framework for specific research needs.
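
The record layout of OpenGenome2 is not described here, so the loader below is only a sketch: it assumes one JSON object per line with a "sequence" field (a hypothetical field name) and collects sequences for downstream scoring or fine-tuning; adjust the field name and file path to the actual data:

import json

def load_sequences(jsonl_path, limit=1000):
    # Read up to `limit` sequences from a JSONL file, one JSON object per line.
    seqs = []
    with open(jsonl_path) as fh:
        for line in fh:
            record = json.loads(line)
            seqs.append(record["sequence"])  # hypothetical field name
            if len(seqs) >= limit:
                break
    return seqs

seqs = load_sequences("opengenome2_sample.jsonl")  # hypothetical local file
print(len(seqs), "sequences loaded")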

Tips and precautions

  • Multi-GPU configuration: if you are using the 40B model, make sure Vortex recognizes the multiple GPUs correctly, and use nvidia-smi to check resource allocation.
  • Performance optimization: when processing long sequences, lowering the temperature value can help reduce the computational burden.
  • Community support: questions can be posted to the GitHub Issues board, where the Arc Institute team and the community can help.

With these steps you can get up and running with Evo 2 and work efficiently, whether you are generating DNA sequences or analyzing genetic variants.
