
par_scrape: a crawler tool to intelligently extract data from web pages

General Introduction

par_scrape is an open-source, Python-based web crawler tool published on GitHub by developer Paul Robello and designed to help users intelligently extract data from web pages. It integrates two browser automation technologies, Selenium and Playwright, and combines them with AI processing to handle everything from simple static pages to complex dynamic websites. Whether you are extracting prices, titles, or other structured information, par_scrape completes the task quickly once you specify the fields, and outputs the results as Markdown, JSON, or CSV. The project targets developers, data analysts, and anyone who wants to automate collecting information from the web; it is easy to install, flexible, and popular in the open-source community.


Function List

  • Intelligent Data Extraction: Analyzes web content with AI models from providers such as OpenAI or Anthropic to accurately extract user-specified fields.
  • Dual Crawler Support: Supports both Selenium and Playwright, adapting to the needs of different website architectures.
  • Multiple Output Formats: Results can be exported to Markdown, JSON, CSV, or Excel for easy further processing.
  • Custom Field Extraction: Users can specify the fields to extract, such as title, description, and price, to meet individual needs.
  • Parallel Crawling: Supports multi-threaded crawling to improve the efficiency of large-scale data collection.
  • Waiting Mechanisms: Provides several ways to wait for page loading (e.g., pause, selector waiting) to ensure dynamic content is captured successfully.
  • AI Model Selection: Supports multiple AI providers (e.g., OpenAI, Anthropic, XAI) for flexible adaptation to different tasks.
  • Cache Optimization: Built-in prompt caching reduces the cost of repeated requests and improves efficiency.
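
The Using Help section below walks through these options in detail. As a quick preview, a single command can combine field selection, model choice, parallel requests, and output format; the sketch below only uses flags documented later in this article, and the URL is a placeholder:

par_scrape --url "https://example.com" -f "Title" -f "Price" --model gpt-4o-mini --scrape-max-parallel 5 --display-output csv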

 

Using Help

Installation process

To use par_scrape, first complete the following installation steps to make sure your environment is ready:

1. Environmental preparation

  • Python version: Make sure Python 3.11 or later is installed on your system; check with python --version (a quick verification of all the tools follows this list).
  • Git: Used to clone the code from GitHub. If it is not installed, install it with sudo apt install git (Linux) or download it from the official website.
  • uv: It is recommended to use uv to manage dependencies. The installation command is:
    • Linux/Mac: curl -LsSf https://astral.sh/uv/install.sh | sh
    • Windows: powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
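
As mentioned above, you can quickly confirm that all three tools are ready. These are standard version checks, not part of par_scrape itself:

python --version
git --version
uv --version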

2. Cloning projects

Clone the par_scrape project to your local machine by entering the following commands in the terminal:

git clone https://github.com/paulrobello/par_scrape.git
cd par_scrape

3. Installation of dependencies

Use uv to install the project dependencies:

uv sync

Or install it directly from PyPI:

uv tool install par_scrape
# or use pipx
pipx install par_scrape
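
After installation, you can usually confirm the command is available by printing its help text (a standard CLI convention; the exact output depends on the installed version):

par_scrape --help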

4. Installation of Playwright (optional)

If you choose Playwright as your crawler, you also need to install it and download a browser:

uv tool install playwright
playwright install chromium

5. Configure the API key

par_scrape supports multiple AI providers; you need to configure the corresponding key as an environment variable. Edit the ~/.par_scrape.env file and add the following (choose the ones you need):

OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
XAI_API_KEY=your_xai_key

Or set the environment variable before running the command:

export OPENAI_API_KEY=your_openai_key

Usage

Once the installation is complete, you can run par_scrape from the command line. Here is the detailed procedure:

Basic Usage Example

Suppose you want to extract the title, description and price from the OpenAI pricing page:

par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --model gpt-4o-mini --display-output md
  • --url: The address of the target web page.
  • -f: Specifies a field to extract; can be used multiple times.
  • --model: Selects the AI model (e.g. gpt-4o-mini).
  • --display-output: Output format (md, json, csv, etc.).

Featured Functions

  1. Switching Crawler Tools
    Playwright is used by default; if you want to use Selenium, you can add a parameter:

    par_scrape --url "https://example.com" -f "Title" --scraper selenium
    
  2. Parallel crawling
    Set the maximum number of parallel requests to improve efficiency:

    par_scrape --url "https://example.com" -f "Data" --scrape-max-parallel 5
    
  3. Dynamic Page Waiting
    For dynamically loaded content, wait types and selectors can be set:

    par_scrape --url "https://example.com" -f "Content" --wait-type selector --wait-selector ".dynamic-content"
    

    Supported wait types include none, pause, sleep, idle, selector, and text. (A sleep-based variant is sketched after this list.)

  4. Customizing the output path
    Saves the results to the specified folder:

    par_scrape --url "https://example.com" -f "Title" --output-folder ./my_data
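
For pages that simply need a fixed delay, the sleep wait type can be combined with the --sleep-time option mentioned later in this article; this is a sketch using only flags documented here, with a placeholder URL:

par_scrape --url "https://example.com" -f "Content" --wait-type sleep --sleep-time 5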
    

Operation process details

Take crawling the OpenAI pricing page as an example:

  1. Identify the target: Go to https://openai.com/api/pricing/ and confirm that you need to extract "Model", "Pricing Input", and "Pricing Output".
  2. Run the command:
    par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output" --model gpt-4o-mini --display-output json
    
  3. View the results: After the command finishes, the terminal displays the data in JSON format or saves it to the default output file (an illustrative sketch of the shape follows this list).
  4. Adjust parameters: If the data is incomplete, try adding --retries 5 (number of retries) or adjusting --sleep-time 5 (wait time).
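
The exact output depends on the live page, the model, and the par_scrape version, but the extracted data is generally one record per row containing the fields you requested. A plausible shape, with placeholder values rather than real results, might look like this:

[
  {
    "Model": "example-model",
    "Pricing Input": "placeholder input price",
    "Pricing Output": "placeholder output price"
  }
]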

Caveats

  • API key: Make sure the key is valid; otherwise the AI extraction feature is unavailable.
  • Website restrictions: Some sites have anti-crawling mechanisms; it is recommended to use --headless (headless mode) or to lower the crawling frequency.
  • Cache usage: If you crawl the same page multiple times, enable --prompt-cache to reduce cost (a combined example follows this list).
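
Putting these hints together, a repeated crawl of a site with anti-crawling measures might look like the sketch below; it only combines flags already mentioned in this article, and the target URL is a placeholder:

par_scrape --url "https://example.com" -f "Title" --headless --retries 5 --prompt-cache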

With the above steps, users can quickly get started with par_scrape and easily accomplish web page data extraction tasks.
