General Introduction
Crawlee is an open source web crawling and browser automation library developed by Apify for the Node.js environment. It supports JavaScript and TypeScript and works with tools such as Puppeteer, Playwright, Cheerio, JSDOM, and others to provide powerful data extraction and automation features. Crawlee lets you build reliable crawlers that extract the data needed for AI, LLMs, RAG, or GPTs, and download HTML, PDF, JPG, PNG, and other files. It is designed to make crawlers behave more like human users, get past modern anti-bot protections, and it supports proxy rotation and session management, making it suitable for a wide range of complex web scraping tasks.
Crawlee for Python is open to early adopters!
Feature List
- Single interface for HTTP and headless browser crawling
- Persistent URL crawling queues (breadth-first and depth-first)
- Pluggable data and file storage
- Automatically scales to system resources
- Integrated proxy rotation and session management
- Lifecycle can be customized with hooks
- CLI tool for bootstrapping projects
- Configurable routing, error handling and retry mechanisms
- Dockerfiles provided for deployment
- Written in TypeScript with generic support
- HTTP2 support and automatic generation of browser-style request headers
- Integrated fast HTML parser (Cheerio and JSDOM)
- JSON API Crawling Support
- JavaScript rendering and screenshot support
- Headless and Headed Mode Support
- Automatic generation of human-style fingerprints
- Unified interface for Playwright and Puppeteer
- Supports Chrome, Firefox, WebKit, and more
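As a small illustration of the last two items above, the sketch below (a minimal example, assuming Playwright and its browsers are installed) runs the same crawler code in Firefox instead of the default Chromium:
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Same crawler class, different browser: the launcher is provided by Playwright itself.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox, // swap in webkit or chromium the same way
    },
    async requestHandler({ request, page, log }) {
        log.info(`Visited ${request.url}, title: '${await page.title()}'`);
    },
});

await crawler.run(['https://crawlee.dev']);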
Usage
Installation
Crawlee requires Node.js 16 or higher. You can quickly install it and create a sample crawler with the following commands:
npx crawlee create my-crawler
cd my-crawler
npm start
If you prefer to install manually, you can use the following command:
npm install crawlee playwright
Then import and use it in your project:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
    // headless: false, // Uncomment to see the browser window
});

await crawler.run(['https://crawlee.dev']);
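Because Crawlee exposes a single interface for HTTP and headless browser crawling, the same program can be written without a browser at all. A minimal sketch (assuming the pages you need are static HTML) swaps PlaywrightCrawler for CheerioCrawler:
import { CheerioCrawler, Dataset } from 'crawlee';

// CheerioCrawler fetches pages over plain HTTP and parses them with Cheerio,
// so no browser is launched; the handler receives a jQuery-like `$` object.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);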
Workflow
- Creating a Crawler Project: Quickly create a new crawler project using the Crawlee CLI, which installs all necessary dependencies and adds sample code.
- Configuring the Crawler: Configure the crawler's request processing logic, data storage location, proxy settings, etc. in your project.
- Running the crawler: Run the crawler from the command line; Crawlee automatically processes the requests, scrapes the data, and stores the results.
- Data storage: By default, Crawlee stores results in the ./storage folder under the current working directory; this location can be overridden via configuration (for example the CRAWLEE_STORAGE_DIR environment variable).
- Extending functionality: Add custom hooks, error-handling mechanisms, and retry policies as needed to keep the crawler stable and reliable (see the sketch below).
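These extension points map to crawler options. The following sketch (illustrative values only; maxRequestRetries, failedRequestHandler, and preNavigationHooks are standard Crawlee options) sets a retry limit, a handler for requests that have exhausted their retries, and a pre-navigation hook:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3, // retry each failed request up to 3 times before giving up
    async requestHandler({ request, page }) {
        await Dataset.pushData({ url: request.loadedUrl, title: await page.title() });
    },
    // Called once a request has exhausted all of its retries.
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
    // Hooks run right before navigation; useful for tweaking the page or request.
    preNavigationHooks: [
        async ({ page }) => {
            await page.setViewportSize({ width: 1280, height: 720 });
        },
    ],
});

await crawler.run(['https://crawlee.dev']);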
Proxy and session management
Crawlee integrates proxy rotation and session management to reduce the risk of the crawler being blocked by the target website. You can configure the proxy list and session parameters so that rotation and management happen automatically.
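A minimal sketch of wiring this up (the proxy URLs below are placeholders; proxyConfiguration, useSessionPool, and persistCookiesPerSession are standard Crawlee options):
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Crawlee rotates through the listed proxies automatically.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000', // placeholder proxy URLs
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true, // maintain a pool of sessions (rotating identities)
    persistCookiesPerSession: true, // keep cookies tied to each session
    async requestHandler({ request, page, log }) {
        log.info(`Fetched ${request.url}, title: '${await page.title()}'`);
    },
});

await crawler.run(['https://crawlee.dev']);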
Deployment
Crawlee provides Dockerfiles for easy deployment of crawlers to the cloud or other environments. Docker containers can be built and run with the following commands:
docker build -t my-crawler .
docker run my-crawler