General Introduction
Crawlee is an open source web crawling and browser automation library developed by Apify for the Node.js environment. It supports JavaScript and TypeScript and works with tools such as Puppeteer, Playwright, Cheerio, JSDOM, and others to provide powerful data extraction and automation features. Crawlee lets you build reliable crawlers that extract the data needed for AI, LLMs, RAG, or GPTs, and download HTML, PDF, JPG, PNG, and other files. It is designed to make crawlers behave more like human users, get past modern anti-bot protections, and it supports proxy rotation and session management, making it suitable for a wide range of complex web scraping tasks.
Crawlee for Python is open to early adopters!
Feature List
- Single interface for HTTP and headless browser crawling
- Persistent URL crawling queues (breadth-first and depth-first)
- Pluggable data and file storage
- Automatically scales to system resources
- Integrated proxy rotation and session management
- Lifecycle can be customized with hooks
- CLI tool for bootstrapping projects
- Configurable routing, error handling and retry mechanisms
- Dockerfiles provided for deployment
- Written in TypeScript with generic support
- HTTP2 support and automatic generation of browser-style request headers
- Integrated fast HTML parser (Cheerio and JSDOM)
- JSON API Crawling Support
- JavaScript rendering and screenshot support
- Headless and Headed Mode Support
- Automatic generation of human-style fingerprints
- Unified interface for Playwright and Puppeteer
- Supports Chrome, Firefox, WebKit, and more
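As a small illustration of the last two items above, the sketch below (a minimal example, assuming Playwright and its browsers are installed) runs the same crawler code in Firefox instead of the default Chromium:
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Same crawler class, different browser: the launcher is provided by Playwright itself.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox, // swap in webkit or chromium the same way
    },
    async requestHandler({ request, page, log }) {
        log.info(`Visited ${request.url}, title: '${await page.title()}'`);
    },
});

await crawler.run(['https://crawlee.dev']);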
Usage
Installation
Crawlee requires Node.js 16 or higher. You can quickly install it and create a sample crawler with the following commands:
npx crawlee create my-crawler
cd my-crawler
npm start
If you prefer to install manually, you can use the following command:
npm install crawlee playwright
Then import and use it in your project:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
    // headless: false, // Uncomment to see the browser window
});

await crawler.run(['https://crawlee.dev']);
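Because Crawlee exposes a single interface for HTTP and headless browser crawling, the same program can be written without a browser at all. A minimal sketch (assuming the pages you need are static HTML) swaps PlaywrightCrawler for CheerioCrawler:
import { CheerioCrawler, Dataset } from 'crawlee';

// CheerioCrawler fetches pages over plain HTTP and parses them with Cheerio,
// so no browser is launched; the handler receives a jQuery-like `$` object.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);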
Workflow
- Creating a Crawler Project: Quickly create a new crawler project using the Crawlee CLI, which installs all necessary dependencies and adds sample code.
- Configuring the Crawler: Configure the crawler's request processing logic, data storage location, proxy settings, etc. in your project.
- Running the crawler: Run the crawler from the command line; Crawlee automatically processes the requests, scrapes the data, and stores the results.
- Data storage: By default, Crawlee stores results in the ./storage folder under the current working directory; this location can be overridden via configuration (for example the CRAWLEE_STORAGE_DIR environment variable).
- Extending functionality: Add custom hooks, error-handling mechanisms, and retry policies as needed to keep the crawler stable and reliable (see the sketch below).
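These extension points map to crawler options. The following sketch (illustrative values only; maxRequestRetries, failedRequestHandler, and preNavigationHooks are standard Crawlee options) sets a retry limit, a handler for requests that have exhausted their retries, and a pre-navigation hook:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3, // retry each failed request up to 3 times before giving up
    async requestHandler({ request, page }) {
        await Dataset.pushData({ url: request.loadedUrl, title: await page.title() });
    },
    // Called once a request has exhausted all of its retries.
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
    // Hooks run right before navigation; useful for tweaking the page or request.
    preNavigationHooks: [
        async ({ page }) => {
            await page.setViewportSize({ width: 1280, height: 720 });
        },
    ],
});

await crawler.run(['https://crawlee.dev']);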
Proxy and session management
Crawlee integrates proxy rotation and session management to reduce the risk of the crawler being blocked by the target website. You can configure the proxy list and session parameters so that rotation and management happen automatically.
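A minimal sketch of wiring this up (the proxy URLs below are placeholders; proxyConfiguration, useSessionPool, and persistCookiesPerSession are standard Crawlee options):
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Crawlee rotates through the listed proxies automatically.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000', // placeholder proxy URLs
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true, // maintain a pool of sessions (rotating identities)
    persistCookiesPerSession: true, // keep cookies tied to each session
    async requestHandler({ request, page, log }) {
        log.info(`Fetched ${request.url}, title: '${await page.title()}'`);
    },
});

await crawler.run(['https://crawlee.dev']);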
Deployment
Crawlee provides Dockerfiles for easy deployment of crawlers to the cloud or other environments. Docker containers can be built and run with the following commands:
docker build -t my-crawler .
docker run my-crawler