
GPT Crawler: Crawling Knowledge Base Documents from Websites to Create Customized GPTs

General Introduction

GPT Crawler is an open-source tool that generates knowledge files by crawling the content of a specified website; these files can then be used to create customized GPTs. The project focuses on crawling and organizing web page content, and supports both running as an API service and local deployment. Users can flexibly configure the crawler to match different website structures and needs, producing knowledge base documents tailored to their own use.

Feature List

  • Crawl website content and generate knowledge files
  • Run via an API service or local deployment
  • Flexibly configure crawler parameters
  • Generate knowledge files for building customized GPTs
  • Exclude unwanted file types (such as images and videos) from the crawl

Usage Guide

Installation process

  1. Clone the repository: First, make sure you have Node.js >= 16 installed, then run the following command to clone the repository:
    git clone https://github.com/BuilderIO/gpt-crawler
    
  2. Install dependencies: Go to the project directory and install the dependencies:
    cd gpt-crawler
    npm install
    
  3. Configure the crawler: Open the config.ts file and edit the url, match, and selector properties to match your needs. For example, to crawl the Builder.io docs, you can use the following configuration:
    export const defaultConfig: Config = {
      url: "https://www.builder.io/c/docs/developers",
      match: "https://www.builder.io/c/docs/**",
      selector: ".docs-builder-container",
      maxPagesToCrawl: 50,
      outputFileName: "output.json",
    };
    
  4. Run the crawler: Once configuration is complete, run the following command to start the crawl (an example of handling the generated output is sketched below):
    npm run start
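
When the crawl finishes, the knowledge file (output.json by default) is written to the project root. The snippet below is a minimal sketch of inspecting that file with Node.js/TypeScript; the record fields (title, url, html) are an assumption, so check your own output before relying on them.

    import { readFileSync } from "node:fs";

    // Assumed shape of each record in output.json; field names may differ
    // between versions, so inspect your own output first.
    interface CrawledPage {
      title: string;
      url: string;
      html: string;
    }

    const pages: CrawledPage[] = JSON.parse(readFileSync("output.json", "utf8"));

    // Print a short summary of what was crawled.
    for (const page of pages) {
      console.log(`${page.title} - ${page.url} (${page.html.length} characters)`);
    }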
    

 

Instructions for use

  1. Configuration file: In the config.ts file you can set parameters such as the crawler's start URL, match pattern, selector, and maximum number of pages to crawl. Make sure these parameters match the structure of the target site.
  2. Run the crawler: Start the crawler with the npm run start command; it crawls the website and generates the knowledge file according to the parameters in the configuration file.
  3. Create a GPT: Once the crawl is complete, the generated knowledge file is saved in the project directory. You can upload this file to OpenAI or another GPT service to create a customized GPT.
  4. API usage: If you want to run the crawler as an API service, refer to the API documentation in the project to configure and start the server; a minimal sketch of calling such an endpoint is shown below.
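
The snippet below is a minimal sketch only, assuming Node 18+ (built-in fetch); the port, the /crawl endpoint path, and the request body shape are assumptions to verify against the project's API documentation.

    // Minimal sketch of posting a crawl configuration to a locally running
    // GPT Crawler API service. Port and endpoint path are assumptions.
    const response = await fetch("http://localhost:3000/crawl", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        url: "https://www.builder.io/c/docs/developers",
        match: "https://www.builder.io/c/docs/**",
        selector: ".docs-builder-container",
        maxPagesToCrawl: 50,
        outputFileName: "output.json",
      }),
    });

    console.log(await response.json());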

 

Notes

  • Resource exclusion: In the config.ts file, the resourceExclusions property lets you exclude unwanted resource types, such as images and videos, from the crawl.
  • File size limit: The maxFileSize and maxTokens properties limit the maximum size of the generated file and its maximum number of tokens; a configuration sketch combining these options is shown below.
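
The configuration below is a minimal sketch combining these options. The property names (resourceExclusions, maxFileSize, maxTokens) come from the notes above, but the example values and units are assumptions; check config.ts and the project README for the exact types.

    export const defaultConfig: Config = {
      url: "https://www.builder.io/c/docs/developers",
      match: "https://www.builder.io/c/docs/**",
      selector: ".docs-builder-container",
      maxPagesToCrawl: 50,
      outputFileName: "output.json",
      // Skip binary resources such as images and videos during the crawl
      // (example extensions only; adjust the list to your needs).
      resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "mp4", "webm"],
      // Cap the size of the generated file (assumed to be in MB) and its
      // token count; verify the exact units in the project documentation.
      maxFileSize: 1,
      maxTokens: 5000,
    };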