
Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining

General Introduction

Crawl4LLM is an open-source project jointly developed by Tsinghua University and Carnegie Mellon University, focused on improving the efficiency of web crawling for large language model (LLM) pre-training. By intelligently selecting high-quality web pages, it greatly reduces wasted crawling: the authors claim that a workload that would normally require crawling 100 web pages can be cut to 21 while maintaining pre-training performance. The project is hosted on GitHub with detailed code and configuration documentation for developers and researchers. Its core is a data selection algorithm that estimates each web page's value for model training and prioritizes crawling the most useful content; the project has already attracted attention in academia and the developer community.


Function List

  • Intelligent Data Selection: filters high-value content based on each web page's estimated influence on LLM pre-training.
  • Multiple crawling modes: supports the Crawl4LLM mode, random crawling, and other modes to flexibly meet different needs.
  • Efficient crawling engine: uses multi-threading and optimized configurations to significantly increase crawling speed.
  • Data extraction and storage: saves the crawled web page IDs and text content as files that can be used for model training.
  • Support for large-scale datasets: compatible with datasets such as ClueWeb22, suitable for academic research and industrial applications.
  • Configuration customization: crawling parameters such as the number of threads and the maximum number of documents are adjusted via YAML files.

 

Usage Guide

Installation process

Crawl4LLM runs in a Python environment. The detailed installation steps are as follows:

  1. Environment preparation
    • Make sure Python 3.10 or later is installed on your system.
    • Create a virtual environment to avoid dependency conflicts:
      python -m venv crawl4llm_env
      source crawl4llm_env/bin/activate # Linux/Mac
      crawl4llm_env\Scripts\activate # Windows
      
  2. Clone the project
    • Download the source code from GitHub:
      git clone https://github.com/cxcscmu/Crawl4LLM.git
      cd Crawl4LLM
      
  3. Install dependencies
    • Run the following command to install the required libraries:
      pip install -r requirements.txt
      
    • Note: the requirements file lists all the Python packages the crawler needs; make sure you have a stable network connection while installing.
  4. Download the classifier
    • The project uses the DCLM fastText classifier to evaluate web page quality; the model file must be downloaded manually into the fasttext_scorers/ directory.
    • Example path: fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin
    • The model can be obtained from the official resources or related links. A short check script is sketched after this list.
  5. Prepare the dataset
    • If you use the ClueWeb22 dataset, you need to request access and store it on an SSD (the path is specified in the configuration).
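Once the classifier file is in place, it can be worth confirming that it loads and produces scores before starting a crawl. The snippet below is a minimal sketch, not part of the project: it assumes the fasttext Python package is installed and that the model sits at the example path above; the label names depend on how the classifier was trained, so it simply prints whatever the model returns.

    import fasttext  # pip install fasttext

    # Path from the installation step above; adjust if you stored the model elsewhere.
    model_path = "fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin"
    model = fasttext.load_model(model_path)

    # Score one sample sentence; predict() returns (labels, probabilities).
    labels, probs = model.predict("A short sample paragraph of web text.", k=2)
    print(labels, probs)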

How to use

Crawl4LLM is operated mainly from the command line, in three steps: configuration, crawling, and data extraction. The detailed workflow is as follows:

1. Configure the crawling task

  • In the configs/ directory, create a YAML file (e.g. my_config.yaml) with contents like the following example:
    cw22_root_path: "/path/to/clueweb22_a"
    seed_docs_file: "seed.txt"
    output_dir: "crawl_results/my_crawl"
    num_selected_docs_per_iter: 10000
    num_workers: 16
    max_num_docs: 20000000
    selection_method: "dclm_fasttext_score"
    order: "desc"
    wandb: false
    rating_methods:
      - type: "length"
      - type: "fasttext_score"
        rater_name: "dclm_fasttext_score"
        model_path: "fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin"
  • Parameter description (a short sanity-check script is sketched after this list):
    • cw22_root_path: path to the ClueWeb22 dataset.
    • seed_docs_file: list of initial seed documents.
    • num_workers: number of threads; adjust to your machine's performance.
    • max_num_docs: maximum number of documents to crawl.
    • selection_method: data selection method; dclm_fasttext_score is recommended.
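Before launching a long crawl, it can help to sanity-check the configuration file. The following sketch is not part of the project; it only assumes that the file is valid YAML with the keys shown above, and uses PyYAML to verify that the dataset path and classifier model actually exist:

    import os
    import yaml  # pip install pyyaml

    # Load the example config from above.
    with open("configs/my_config.yaml") as f:
        cfg = yaml.safe_load(f)

    # Check the paths referenced by the configuration before starting a long crawl.
    assert os.path.isdir(cfg["cw22_root_path"]), "ClueWeb22 path not found"
    for method in cfg.get("rating_methods", []):
        if "model_path" in method:
            assert os.path.isfile(method["model_path"]), "classifier model file missing"

    print("workers:", cfg["num_workers"], "| max docs:", cfg["max_num_docs"])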

2. Running the crawler

  • Execute the crawl command:
    python crawl.py crawl --config configs/my_config.yaml
    
  • When the crawl is complete, the selected document IDs are saved to files under the path specified by output_dir (a quick way to count them is sketched below).
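For a quick check that the crawl produced output, the document ID files can be counted. The snippet below is illustrative only; it assumes the output files are plain text with one document ID per line, and the exact file layout under output_dir may differ.

    import glob
    import os

    # Assumption: plain-text files with one document ID per line under the output directory.
    total = 0
    for path in glob.glob("crawl_results/my_crawl/*"):
        if os.path.isfile(path):
            with open(path) as f:
                total += sum(1 for line in f if line.strip())
    print(f"{total} document IDs collected")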

3. Extracting document content

  • Use the following command to convert the crawled document IDs into text:
    python fetch_docs.py --input_dir crawl_results/my_crawl --output_dir crawl_texts --num_workers 16
    
  • The output is a set of text files that can be used directly for subsequent model training (a quick corpus check is sketched below).
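Before feeding the extracted text into a training pipeline, a quick look at corpus size can catch obvious problems such as empty output. The sketch below is illustrative and assumes the extraction step wrote UTF-8 plain-text files under crawl_texts/:

    import os

    # Report how many text files were produced and how much text they contain.
    n_files, n_chars = 0, 0
    for name in os.listdir("crawl_texts"):
        path = os.path.join("crawl_texts", name)
        if os.path.isfile(path):
            n_files += 1
            with open(path, encoding="utf-8", errors="ignore") as f:
                n_chars += len(f.read())
    print(f"{n_files} files, {n_chars / 1e6:.1f}M characters extracted")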

4. Viewing Individual Documents

  • If you need to check a specific document and its links, you can run:
    python access_data.py /path/to/clueweb22
    

Feature Highlights

  • Intelligent Web Page Selection
    • The core of Crawl4LLM is its data filtering capability. It uses a fastText classifier (together with a length-based rater) to evaluate web content and prioritizes the pages that are most useful for model training. Users do not need to filter manually; the system performs the selection automatically. The general idea is illustrated by the sketch after this list.
    • How to do it: set selection_method to dclm_fasttext_score in the YAML configuration and make sure the model path is correct.
  • Multi-threaded acceleration
    • Adjust the number of threads with the num_workers parameter. For example, on a 16-core CPU it can be set to 16 to fully utilize computing resources.
    • Note: too many threads may exhaust memory; test against your machine's configuration.
  • Support for large-scale crawling
    • The project is designed for very large datasets such as ClueWeb22 and is suitable for research scenarios that require processing billions of web pages.
    • Suggested practice: store the data on an SSD to ensure I/O performance, and set max_num_docs to the target number of documents (e.g., 20 million).
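To make the selection idea concrete, the sketch below shows a score-ordered crawl frontier in miniature: candidate pages are ranked by a quality score and the best ones are expanded first, so low-value pages are never fetched. This is a simplified illustration rather than the project's actual implementation; score() and get_links() are hypothetical callbacks standing in for the configured rating methods and the link data.

    import heapq
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Candidate:
        neg_score: float                    # negated so heapq pops the best page first
        doc_id: str = field(compare=False)  # compared only by score

    def prioritized_crawl(seed_ids, get_links, score, max_docs):
        """Toy score-prioritized crawl; score() and get_links() are hypothetical callbacks."""
        frontier = [Candidate(-score(d), d) for d in seed_ids]
        heapq.heapify(frontier)
        visited, selected = set(seed_ids), []
        while frontier and len(selected) < max_docs:
            best = heapq.heappop(frontier)
            selected.append(best.doc_id)
            for linked in get_links(best.doc_id):  # expand outlinks of the best page only
                if linked not in visited:
                    visited.add(linked)
                    heapq.heappush(frontier, Candidate(-score(linked), linked))
        return selected

In Crawl4LLM itself, the scoring comes from the configured rating_methods (such as the DCLM fastText classifier), and the candidate pool is the ClueWeb22 data specified in the configuration.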

Tips for use

  • Debugging and logging: enable wandb: true to record the crawling process and analyze crawling efficiency.
  • Optimize storage: Crawl results are large and it is recommended to reserve enough disk space (e.g. hundreds of GB).
  • Extended functionality: In combination with the DCLM framework, the extracted text can be used directly for large model pre-training.

With these steps, users can quickly get started with Crawl4LLM for efficient web data crawling and an optimized pre-training workflow.
