AI Personal Learning
and practical guidance

Crawl4AI: open source asynchronous web crawler tool to extract structured data without LLM

General Introduction

Crawl4AI is an open source asynchronous web crawler tool designed for large-scale language models (LLMs) and artificial intelligence (AI) applications. It simplifies the web crawling and data extraction process , supports efficient web crawling and provides LLM-friendly output formats such as JSON, cleaned HTML and Markdown.Crawl4AI supports crawling multiple URLs at the same time , completely free and open source , suitable for a variety of data crawling needs.

Official Help Documentation


Online Experience

 

Function List

  • Asynchronous architecture: efficient processing of multiple web pages, fast data crawling
  • Multiple Output Formats: Support JSON, HTML, Markdown
  • Multi-URL crawling: crawl multiple web pages at the same time
  • Media tag extraction: extract image, audio and video tags
  • Link extraction: extract all external and internal links
  • Metadata extraction: extracting metadata from pages
  • Custom hooks: support for authentication, request headers and page modifications
  • User agent customization: custom user agents
  • Page Screenshot: Screenshot of the crawl page
  • Execute custom JavaScript: Execute multiple custom JavaScripts before crawling
  • Proxy Support: Enhancing Privacy and Access
  • Session management: handling complex multi-page crawling scenarios

 

Using Help

Installation process

Crawl4AI offers flexible installation options for a variety of usage scenarios. You can install it as a Python package or use Docker.

Installation with pip

  1. Basic Installation
    pip install crawl4ai
    

    This will install the asynchronous version of Crawl4AI by default, using Playwright for web crawling.

  2. Manual installation of Playwright (if required)
    playwright install
    

    or

    python -m playwright install chromium
    

Installing with Docker

  1. Pulling a Docker image
    docker pull unclecode/crawl4ai
    
  2. Running a Docker Container
    docker run -it unclecode/crawl4ai
    

Guidelines for use

  1. Basic use
    from crawl4ai import AsyncWebCrawler
    
    crawler = AsyncWebCrawler()
    results = crawler.crawl([)"https://example.com"])
    print(results)
    
  2. Customized settings
    from crawl4ai import AsyncWebCrawler
    
    crawler = AsyncWebCrawler(
        user_agent="CustomUserAgent",
        headers={"Authorization": "Bearer token"},
        custom_js=["console.log('Hello, world!')"]
    )
    results = crawler.crawl(["https://example.com"])
    print(results)
    
  3. Extracting specific data
    from crawl4ai import AsyncWebCrawler
    
    crawler = AsyncWebCrawler()
    results = crawler.crawl([)"https://example.com"], extract_media=True, extract_links=True)
    print(results)
    
  4. Session Management
    from crawl4ai import AsyncWebCrawler
    
    crawler = AsyncWebCrawler()
    session = crawler.create_session()
    session_results = session.crawl([)"https://example.com"])
    print(session_results)
    

Crawl4AI offers a rich set of features and flexible configuration options for a variety of web crawling and data crawling needs. With a detailed installation and usage guide, users can easily get started and take full advantage of the tool's powerful features.

May not be reproduced without permission:Chief AI Sharing Circle " Crawl4AI: open source asynchronous web crawler tool to extract structured data without LLM

Chief AI Sharing Circle

Chief AI Sharing Circle specializes in AI learning, providing comprehensive AI learning content, AI tools and hands-on guidance. Our goal is to help users master AI technology and explore the unlimited potential of AI together through high-quality content and practical experience sharing. Whether you are an AI beginner or a senior expert, this is the ideal place for you to gain knowledge, improve your skills and realize innovation.

Contact Us
en_USEnglish