General Introduction
GPT Crawler is an open-source tool that generates knowledge files by crawling the content of a specified website; these files can then be used to create custom GPT models. The project is mainly used to crawl and organize web page information, and it supports both running through an API and local deployment. Users can flexibly configure the crawler to adapt to different website structures and needs, generating knowledge base files suited to their own use.
Function List
- Crawls website content to generate knowledge files
- Supports running via API or local deployment
- Flexible configuration of crawler parameters
- Generates customized GPT models
- Supports excluding multiple file formats
Usage Guide
Installation Process
- Clone the repository: First, make sure you have Node.js >= 16 installed, then run the following command to clone the repository:
git clone https://github.com/BuilderIO/gpt-crawler
- Install dependencies: Go to the project directory and install the dependencies:
cd gpt-crawler
npm install
- Configure the crawler: Open the config.ts file and edit the url and selector properties to match your needs. For example, to crawl the Builder.io documentation, you can use the following configuration:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
- Running the crawler: After the configuration is complete, run the following command to start the crawler:
npm run start
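The knowledge file is written to the path given by outputFileName. The exact structure may vary between versions of the tool, but each crawled page is typically saved as an object containing the page title, the URL, and the text extracted from the configured selector. The TypeScript sketch below is an illustrative assumption of that shape, not an official schema:

```typescript
// Assumed shape of one entry in output.json (illustrative only;
// inspect the file generated by your version of gpt-crawler).
type CrawledPage = {
  title: string; // <title> of the crawled page
  url: string;   // URL that was crawled
  html: string;  // text extracted from the configured selector
};

// The output file is assumed to be an array of such entries.
type KnowledgeFile = CrawledPage[];
```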
Instructions for Use
- Configuration file: In the config.ts file, you can set parameters such as the crawler's start URL, match pattern, selector, and maximum number of pages to crawl. Make sure these parameters match the structure of the target site.
- Running the crawler: The npm run start command starts the crawler, which crawls the website content and generates knowledge files based on the parameters in the configuration file.
- Generating a GPT model: Once the crawl is complete, the generated knowledge files are saved in the project directory. You can upload these files to OpenAI or another GPT service to create a customized GPT model.
- API usage: If you want to run the crawler via an API, refer to the API documentation in the project to configure and start the API service; a hedged example request is sketched after this list.
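As a minimal sketch of that workflow, assuming the API server is started locally (for example with npm run start:server) and exposes a POST /crawl endpoint on port 3000 that accepts the same fields as config.ts, a request could look like the following; verify the script name, port, and route against the project's README before relying on them:

```typescript
// Hedged sketch: the start:server script, port 3000, and the POST /crawl
// route are assumptions; check the project's README for your version.
const response = await fetch("http://localhost:3000/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://www.builder.io/c/docs/developers",
    match: "https://www.builder.io/c/docs/**",
    selector: ".docs-builder-container",
    maxPagesToCrawl: 50,
    outputFileName: "output.json",
  }),
});

// Node 18+ provides fetch globally; run this in an ESM module so
// top-level await is allowed.
console.log(response.status, await response.text());
```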
Notes
- Resource exclusion: In the config.ts file, you can use the resourceExclusions property to exclude unwanted file types such as images and videos.
- File size limits: You can set the maxFileSize and maxTokens properties to limit the maximum size of the generated file and the maximum number of tokens, as illustrated in the sketch below.
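As a sketch of how these options fit into the configuration, the example below extends the earlier config.ts with exclusions and size limits. The option names come from the project's configuration, but the specific values and the import path are illustrative assumptions:

```typescript
import { Config } from "./src/config"; // import path assumed from the project layout

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Skip static assets so they never end up in the knowledge file
  resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "css", "js", "ico", "woff", "woff2", "mp4", "pdf"],
  maxFileSize: 1,  // assumed to be in megabytes; check the project docs
  maxTokens: 5000, // upper bound on tokens written to the output file
};
```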