
Scraperr: self-hosted web data scraping tool

General Introduction

Scraperr is a self-hosted web data scraping tool that lets users specify XPath selectors to extract data from web pages. Users submit a URL together with the elements to scrape, and the results are displayed in a table that can be downloaded as a CSV file. Scraperr supports user login for managing scraping tasks, and provides log viewing and task statistics.
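By way of illustration (these selectors are generic XPath examples, not anything specific to Scraperr), typical "elements" submitted alongside a URL look like:

    //h1/text()                        (the text of every <h1> heading)
    //a/@href                          (the href attribute of every link)
    //span[@class="price"]/text()      (the text inside <span class="price"> elements)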


Function List

  • Submit and queue URLs for web scraping
  • Add and manage scraping elements using XPath
  • Scrape all pages under the same domain
  • Add custom JSON headers to outgoing requests (see the example after this list)
  • Display the results of scraped data
  • Download a CSV file containing the results
  • Rerun scraping tasks
  • View the status of queued tasks
  • Favorite tasks and view favorited tasks
  • User login/registration to organize tasks
  • View application logs
  • View task statistics
  • AI integration that allows scraped results to be included in conversation context
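
Custom request headers are supplied as JSON; a typical header object simply maps header names to values. The names and values below are placeholders to show the shape of such an object, not headers Scraperr requires:

    {
      "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)",
      "Cookie": "session=abc123",
      "Authorization": "Bearer <your-token>"
    }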

 

Usage Guide

Installation Process

  1. Clone the repository:
    git clone https://github.com/jaypyles/scraperr.git
    
  2. Set environment variables and labels: in the docker-compose.yml file, configure the environment variables and Traefik labels for each service, for example:
    scraperr:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.scraperr.rule=Host(`localhost`)"
        - "traefik.http.routers.scraperr.entrypoints=web"
    scraperr_api:
      environment:
        - LOG_LEVEL=INFO
        - MONGODB_URI=mongodb://root:example@webscrape-mongo:27017
        - SECRET_KEY=your_secret_key
        - ALGORITHM=HS256
        - access_token_expire_minutes=600
    
  3. Start the service:
    docker-compose up -d
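
The MONGODB_URI above points at a host called webscrape-mongo, so the compose file also needs a MongoDB service with matching credentials. The repository's docker-compose.yml defines its own services; the snippet below is only a minimal sketch of what such a service looks like when using the official mongo image:

    webscrape-mongo:
      image: mongo
      environment:
        - MONGO_INITDB_ROOT_USERNAME=root
        - MONGO_INITDB_ROOT_PASSWORD=example

Once the stack is up, docker-compose ps and docker-compose logs -f scraperr_api are quick ways to confirm that the services started; with the Traefik rule above, the web UI should be reachable at http://localhost.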
    

Usage Process

  1. Submit a URL for scraping:
    • After logging in to Scraperr, go to the scraping tasks page.
    • Enter the URL to be scraped and the corresponding XPath elements (a tip for testing selectors follows this list).
    • After the task is submitted, the system automatically queues it and starts scraping.
  2. Manage scraping elements:
    • On the scraping task page, you can add, edit, or delete XPath elements.
    • Scraping all pages under the same domain is supported.
  3. View scraping results:
    • Once scraping is complete, the results are displayed in a table.
    • You can download a CSV file containing the results or choose to rerun the task.
  4. Task management:
    • You can view the status of queued tasks, as well as favorite tasks and view your favorites.
    • A task statistics view displays statistics for the tasks that have been run.
  5. View logs:
    • On the application logs page, you can view system logs with detailed information about scraping tasks.
  6. AI integration:
    • Scraped results can be included in conversation context; Ollama and OpenAI are currently supported.
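
Before queuing a task, it can help to verify XPath expressions locally; this step is optional and independent of Scraperr (the URL below is a placeholder). With curl and xmllint, for example:

    curl -s https://example.com | xmllint --html --xpath "//h1/text()" - 2>/dev/null

Most browser developer consoles also accept $x("//h1") for quick interactive checks.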