Crawl4AI: an open source asynchronous web crawler for extracting structured data without an LLM

Latest AI Resources | Updated 9 months ago | AI Sharing Circle

General Introduction

Crawl4AI is an open source asynchronous web crawler designed for large language model (LLM) and artificial intelligence (AI) applications. It simplifies web crawling and data extraction, and provides LLM-friendly output formats such as JSON, cleaned HTML, and Markdown. Crawl4AI can crawl multiple URLs at the same time and is completely free and open source, which makes it suitable for a wide range of data collection needs.

Function List

Asynchronous architecture: processes multiple web pages efficiently for fast data crawling
Multiple output formats: supports JSON, HTML, and Markdown
Multi-URL crawling: crawls multiple web pages at the same time
Media tag extraction: extracts image, audio, and video tags
Link extraction: extracts all external and internal links
Metadata extraction: extracts metadata from pages
Custom hooks: support for authentication, request headers, and page modifications
User agent customization: sets a custom user agent
Page screenshots: captures a screenshot of the crawled page
Custom JavaScript execution: runs one or more custom scripts before crawling
Proxy support: enhances privacy and access
Session management: handles complex multi-page crawling scenarios
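
To make the metadata item in the list concrete, here is a minimal sketch of pulling a page title and `<meta>` tags out of HTML using only Python's standard library. It illustrates the idea and is not Crawl4AI's implementation; the `MetadataParser` class, the `extract_metadata` helper, and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use "property" instead of "name".
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.metadata[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html: str) -> dict:
    """Parse an HTML string and return the collected metadata."""
    parser = MetadataParser()
    parser.feed(html)
    return parser.metadata


# Hypothetical sample page used only for this illustration.
sample = """<html><head><title>Example Domain</title>
<meta name="description" content="A sample page">
<meta property="og:type" content="website">
</head><body></body></html>"""

print(extract_metadata(sample))
```

A real crawler also has to cope with malformed markup and pages rendered by JavaScript, which is one reason Crawl4AI drives a browser rather than parsing raw HTML alone.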

Usage Help

Installation process

Crawl4AI offers flexible installation options for a variety of usage scenarios. You can install it as a Python package or use Docker.

Installation with pip

Basic Installation
```
pip install crawl4ai
```
This will install the asynchronous version of Crawl4AI by default, using Playwright for web crawling.

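Manual installation of Playwright (if required)

If the browser binaries were not downloaded automatically, install them with either of the following commands. The first installs all of Playwright's default browsers; the second installs only Chromium.
```
playwright install
```
```
python -m playwright install chromium
```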
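
A quick way to check whether the packages ended up in the current environment is to ask Python's import system. The sketch below uses only the standard library; the helper name `is_installed` is made up for this example, and it only confirms that the modules can be found, not that Playwright's browser binaries were downloaded.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("crawl4ai", "playwright"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, rerun the pip command in the same virtual environment that the script uses.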

Installing with Docker

Pulling a Docker image
```
docker pull unclecode/crawl4ai
```
Running a Docker Container
```
docker run -it unclecode/crawl4ai
```

Guidelines for use

Basic use

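```
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The crawler's methods are coroutines, so they run inside an event loop.
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=["https://example.com"])
        for result in results:
            print(result.markdown)

asyncio.run(main())
```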
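
The benefit of the asynchronous design is that several pages can be in flight at once. The sketch below illustrates that pattern with `asyncio.gather` and a stand-in coroutine; `fake_fetch`, `crawl_all`, and the simulated delay are assumptions made for this example, and no Crawl4AI code or network access is involved.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.1) -> str:
    """Stand-in for a page fetch: wait without blocking, then return a result."""
    await asyncio.sleep(delay)
    return f"fetched {url}"


async def crawl_all(urls: list[str]) -> list[str]:
    # gather() schedules every coroutine at once and returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
start = time.perf_counter()
results = asyncio.run(crawl_all(urls))
elapsed = time.perf_counter() - start

print(results)
print(f"{elapsed:.2f}s for {len(urls)} pages")
```

Because the waits overlap, the total time stays close to the longest single delay rather than the sum of all delays.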

Customized settings

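```
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Assumes a recent release in which browser options live in BrowserConfig
    # and per-run options (such as injected JavaScript) in CrawlerRunConfig.
    browser_config = BrowserConfig(
        user_agent="CustomUserAgent",
        headers={"Authorization": "Bearer token"},
    )
    run_config = CrawlerRunConfig(js_code=["console.log('Hello, world!')"])
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.markdown)

asyncio.run(main())
```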
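
At the HTTP level, a custom user agent and custom headers are simply fields attached to each outgoing request. The following sketch shows that with the standard library's `urllib.request.Request`; it builds the request without sending it, and the token value is a placeholder. It says nothing about how Crawl4AI applies these settings internally.

```python
from urllib.request import Request

request = Request(
    "https://example.com",
    headers={
        "User-Agent": "CustomUserAgent",
        "Authorization": "Bearer token",  # placeholder credential
    },
)

# urllib normalizes header names with str.capitalize(), so "User-Agent"
# is stored as "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```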

Extracting specific data

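```
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.media)  # image, video, and audio tags found on the page
        print(result.links)  # internal and external links

asyncio.run(main())
```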
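
To illustrate what link and media extraction involve, the sketch below classifies links as internal or external by comparing hosts and collects the `src` of media tags. It uses only the standard library and is not Crawl4AI's implementation; the `LinkAndMediaParser` class, its dictionary keys, and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndMediaParser(HTMLParser):
    """Split links into internal/external and collect media tag sources."""

    MEDIA_TAGS = {"img": "images", "video": "videos", "audio": "audios"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.links = {"internal": [], "external": []}
        self.media = {"images": [], "videos": [], "audios": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative hrefs against the page URL before comparing hosts.
            url = urljoin(self.base_url, attrs["href"])
            kind = "internal" if urlparse(url).netloc == self.base_host else "external"
            self.links[kind].append(url)
        elif tag in self.MEDIA_TAGS and attrs.get("src"):
            self.media[self.MEDIA_TAGS[tag]].append(urljoin(self.base_url, attrs["src"]))


# Hypothetical page fragment used only for this illustration.
page = """<a href="/about">About</a>
<a href="https://other.example.org/">Elsewhere</a>
<img src="logo.png"><video src="/clip.mp4"></video>"""

parser = LinkAndMediaParser("https://example.com/index.html")
parser.feed(page)
print(parser.links)
print(parser.media)
```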

Session Management

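```
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Assumes a recent release: runs that share a session_id reuse the same browser page.
    config = CrawlerRunConfig(session_id="demo_session")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)

asyncio.run(main())
```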
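
The idea behind session management is that state from one step (cookies, the open page) is still there for the next step. The toy model below shows that idea with plain Python objects; `FakePage` and `SessionRegistry` are invented for this illustration and do not correspond to Crawl4AI classes.

```python
class FakePage:
    """Stand-in for a browser page that remembers state between visits."""

    def __init__(self):
        self.cookies = {}
        self.history = []

    def visit(self, url: str) -> None:
        self.history.append(url)


class SessionRegistry:
    """Hand out one page per session_id so later calls see earlier state."""

    def __init__(self):
        self._pages = {}

    def get_page(self, session_id: str) -> FakePage:
        # Create the page on first use, then keep returning the same object.
        if session_id not in self._pages:
            self._pages[session_id] = FakePage()
        return self._pages[session_id]

    def close(self, session_id: str) -> None:
        self._pages.pop(session_id, None)


registry = SessionRegistry()

# Step 1: "log in" and keep the cookie on the session's page.
page = registry.get_page("demo_session")
page.visit("https://example.com/login")
page.cookies["auth"] = "placeholder"

# Step 2: a later call with the same session_id sees the cookie.
same_page = registry.get_page("demo_session")
same_page.visit("https://example.com/account")

print(same_page.cookies)
print(same_page.history)
registry.close("demo_session")
```

Closing a session when the multi-page flow is finished releases its page, so a later call with the same identifier starts from a clean state.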

Crawl4AI offers a rich set of features and flexible configuration options for a variety of web crawling and data extraction needs. With its detailed installation and usage guide, users can get started quickly and take full advantage of the tool's features.

Article copyright belongs to AI Sharing Circle; please do not reproduce without permission.