
Crawlee: Building Reliable Web Crawlers and Browser Automation Tools with Node.js

General Introduction

Crawlee is an open-source web crawling and browser automation library developed by Apify for the Node.js environment. It supports JavaScript and TypeScript and works with tools such as Puppeteer, Playwright, Cheerio, and JSDOM to provide powerful data extraction and automation features. Crawlee lets you build reliable crawlers that extract the data needed for AI, LLM, RAG, or GPT applications, and download HTML, PDF, JPG, PNG, and other files. It is designed to make crawlers behave more like human users, evade modern anti-bot mechanisms, and support proxy rotation and session management, making it suitable for a wide range of complex web scraping tasks.

Crawlee for Python is open to early adopters!


Feature List

  • Single interface for HTTP and headless browser crawling
  • Persistent URL crawling queues (breadth-first and depth-first)
  • Pluggable storage for data and files
  • Automatic scaling to available system resources
  • Integrated proxy rotation and session management
  • Customizable lifecycle via hooks
  • CLI tool for bootstrapping projects
  • Configurable routing, error handling, and retry mechanisms
  • Dockerfiles provided for deployment
  • Written in TypeScript with generics support
  • HTTP/2 support and automatic generation of browser-like request headers
  • Integrated fast HTML parsers (Cheerio and JSDOM)
  • Support for crawling JSON APIs
  • JavaScript rendering and screenshot support
  • Headless and headed mode support
  • Automatic generation of human-like browser fingerprints
  • Unified interface for Playwright and Puppeteer
  • Support for Chrome, Firefox, WebKit, and more

 

Usage Guide

Installation

Crawlee requires Node.js 16 or higher. You can quickly scaffold and run a sample crawler with the following commands:

npx crawlee create my-crawler
cd my-crawler
npm start

If you prefer to install manually, you can use the following command:

npm install crawlee playwright

Then import and use it in your project:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
    // headless: false, // Uncomment to see the browser window
});

await crawler.run(['https://crawlee.dev']);

Workflow

  1. Create a crawler project: use the Crawlee CLI to scaffold a new project; it installs all necessary dependencies and adds sample code.
  2. Configure the crawler: set up the request-handling logic, data storage location, proxy settings, and so on in your project.
  3. Run the crawler: launch it from the command line; Crawlee automatically processes requests, extracts data, and stores the results.
  4. Data storage: by default, Crawlee stores data in the ./storage folder in the current working directory; this location can be overridden via configuration.
  5. Extend functionality: add custom hooks, error-handling mechanisms, and retry policies as needed to keep the crawler stable and reliable.

Proxy and session management

Crawlee integrates proxy rotation and session management to reduce the chance of the crawler being blocked by the target website. You can configure the proxy list and session parameters to enable automatic rotation and management.

Deployment

Crawlee provides Dockerfiles to make it easy to deploy crawlers to the cloud or other environments. Build and run the Docker container with the following commands:

docker build -t my-crawler .
docker run my-crawler
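For reference, a minimal Dockerfile along the lines of the ones generated by Crawlee's project templates, assuming a Playwright-based project on Apify's public `apify/actor-node-playwright-chrome` base image (the image tag and exact flags here are illustrative):

```dockerfile
# Base image with Node.js, Playwright, and Chrome preinstalled.
FROM apify/actor-node-playwright-chrome:20

# Install production dependencies first to make better use of layer caching.
COPY --chown=myuser package*.json ./
RUN npm install --omit=dev

# Copy the rest of the project source.
COPY --chown=myuser . ./

CMD npm start --silent
```

Projects scaffolded with `npx crawlee create` already include a ready-made Dockerfile, so in practice you rarely need to write one from scratch.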

May not be reproduced without permission: Chief AI Sharing Circle, "Crawlee: Building Reliable Web Crawlers and Browser Automation Tools with Node.js"
