
GPT-Crawler: Automatically Crawling Website Content to Generate Knowledge Base Documents

General Introduction

GPT-Crawler is an open-source tool developed by the BuilderIO team and hosted on GitHub. Given one or more website URLs, it crawls page content and generates a structured knowledge file (output.json) that can be used to create custom GPTs or AI assistants. Users configure crawling rules such as the starting URL and a content selector, and the tool extracts the text and organizes it into files. It is easy to use, supports local runs, Docker container deployment, and API calls, and is well suited for developers who want to quickly build a proprietary AI assistant from website content. It has gained traction in the tech community thanks to its efficiency and open-source nature.


Feature List

  • Crawls website content from one or more URLs and generates an output.json document.
  • Supports custom crawling rules, including starting URLs, link matching patterns, and CSS selectors.
  • Handles dynamic web pages, using a headless browser to crawl client-side rendered content.
  • Provides an API interface to start crawling tasks via POST requests.
  • Supports limits on the maximum number of pages (maxPagesToCrawl), output file size (maxFileSize), and token count (maxTokens).
  • The generated file can be uploaded directly to OpenAI to create custom GPTs or AI assistants.
  • Runs in a Docker container for easy deployment in different environments.
  • Specific resource types (e.g. images, videos) can be excluded to improve crawling efficiency.
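
For reference, the options listed above can be summarized as a TypeScript shape. This is an illustrative sketch based only on the field names mentioned in this article; the authoritative type is defined in the project's own src/config.ts.

// Illustrative only: approximates the options referenced in this article.
// Check src/config.ts in the BuilderIO/gpt-crawler repo for the real type.
interface CrawlerConfigSketch {
  url: string;                   // starting URL to crawl
  match: string | string[];      // link pattern(s) with wildcard support; array form is assumed
  selector?: string;             // CSS selector for the content to extract
  maxPagesToCrawl?: number;      // stop after this many pages
  outputFileName?: string;       // e.g. "output.json"
  maxFileSize?: number;          // split output files larger than this (assumed megabytes)
  maxTokens?: number;            // split output exceeding this token count
  resourceExclusions?: string[]; // resource types to skip, e.g. ["png", "jpg"]
}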

 

Usage Guide

Installation and operation (local mode)

GPT-Crawler is built on Node.js, which must be installed before it can run. Detailed steps:

  1. Check the environment
    Make sure Node.js (version 16 or higher) and npm are installed. Run the following commands to confirm:
node -v
npm -v

If they are missing, download and install Node.js from the Node.js website.

  2. Clone the project
    Enter the following command in a terminal to download the project locally:
git clone https://github.com/BuilderIO/gpt-crawler.git
  3. Enter the project directory
    Once the download is complete, go into the project folder:
cd gpt-crawler
  4. Install dependencies
    Run the following command to install the required packages:
npm install
  5. Configure the crawler
    Open the config.ts file and modify the crawl parameters. For example, to crawl the Builder.io documentation:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
  • url: The starting crawl address.
  • match: Link matching pattern, with wildcard support.
  • selector: CSS selector used to extract content.
  • maxPagesToCrawl: Maximum number of pages to crawl.
  • outputFileName: The name of the output file. (An extended sketch with the optional size limits is shown after these steps.)
  6. Run the crawler
    Once configuration is complete, run the following command to start the crawl:
npm start

When it finishes, an output.json file is generated in the project root directory.
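
If you also want to cap the output size or skip static assets, the optional fields mentioned in this article (maxFileSize, maxTokens, resourceExclusions) can be added to the same config. A minimal sketch assuming these fields keep the names used here; the values and units are illustrative, so verify them against src/config.ts before relying on them:

// config.ts (sketch): the example above plus the optional limits named in this article.
// Field names and units are assumptions; confirm them in the repo before use.
import { Config } from "./src/config"; // adjust the path to match the repo's own config.ts

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxFileSize: 1,          // assumed unit: megabytes per output file
  maxTokens: 500000,       // split output that exceeds this many tokens
  resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "mp4"], // skip media assets
};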

Alternative modes of operation

Using Docker containers

  1. Ensure that Docker is installed (download it from the Docker website).
  2. Go into the containerapp folder and edit config.ts.
  3. Run the following command to build and start the container:
docker build -t gpt-crawler .
docker run -v $(pwd)/data:/app/data gpt-crawler
  4. The output file is generated in the data folder.

Running with the API

  1. After installing the dependencies, start the API service:
npm run start:server
  2. The service listens at http://localhost:3000 by default.
  3. Send a POST request to /crawl, for example:
curl -X POST http://localhost:3000/crawl -H "Content-Type: application/json" -d '{"url":"https://example.com","match":"https://example.com/**","selector":"body","maxPagesToCrawl":10,"outputFileName":"output.json"}'
  4. Visit /api-docs to view the API documentation (based on Swagger).
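
The same request can be issued from code instead of curl. A minimal sketch, assuming Node 18+ (for the built-in fetch), the API server started with npm run start:server, and the same JSON body as the curl example; the response format depends on the server:

// start-crawl.ts: trigger the /crawl endpoint shown above from a script.
async function startCrawl(): Promise<void> {
  const response = await fetch("http://localhost:3000/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: "https://example.com",
      match: "https://example.com/**",
      selector: "body",
      maxPagesToCrawl: 10,
      outputFileName: "output.json",
    }),
  });
  if (!response.ok) {
    throw new Error(`Crawl request failed: ${response.status}`);
  }
  console.log(await response.text()); // inspect whatever the server returns
}

startCrawl().catch(console.error);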

Upload to OpenAI

  1. Create a custom GPT
  • Open ChatGPT.
  • Click your name in the lower left corner and select "My GPTs".
  • Click "Create a GPT" > "Configure" > "Knowledge".
  • Upload the output.json file.
  • If the file is too large, set maxFileSize or maxTokens in config.ts to split it.
  2. Create a custom assistant
  • Open the OpenAI platform.
  • Click "+ Create" > "Upload".
  • Upload the output.json file (a programmatic upload sketch follows below).
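
The UI steps above are the documented path. For a scripted alternative, a hedged sketch using the official openai Node SDK (v4+) might look like the following; the purpose value and how the file is then attached to a GPT or assistant should be checked against the current OpenAI documentation:

// upload-knowledge.ts: sketch only; verify against the current OpenAI API docs.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function uploadKnowledge(): Promise<void> {
  const file = await openai.files.create({
    file: fs.createReadStream("output.json"),
    purpose: "assistants", // upload as knowledge for an assistant
  });
  console.log(`Uploaded file id: ${file.id}`);
  // Attaching the file to an assistant (e.g. via the dashboard or a vector
  // store) depends on the current Assistants API version.
}

uploadKnowledge().catch(console.error);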

Feature Details

  • Crawl content
    After specifying url and selector, the tool extracts the page text. For example, .docs-builder-container grabs only the content of that region.
  • Generate files
    The output file format is:
[{"title": "Page title", "url": "https://example.com/page", "html": "Extracted text"}, ...]
  • Optimize the output
    Use resourceExclusions to exclude irrelevant resources (e.g. png, jpg) and reduce file size.
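
To sanity-check the crawl before uploading, the file can be loaded and a few entries inspected. A minimal sketch assuming the array-of-objects format shown above:

// inspect-output.ts: quick look at what the crawler produced.
import fs from "node:fs";

interface CrawledPage {
  title: string;
  url: string;
  html: string; // extracted text, per the format shown above
}

const pages: CrawledPage[] = JSON.parse(fs.readFileSync("output.json", "utf8"));

console.log(`Crawled ${pages.length} pages`);
for (const page of pages.slice(0, 5)) {
  console.log(`${page.title} | ${page.url} (${page.html.length} chars)`);
}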

Caveats

  • An OpenAI paid account is required to create custom GPTs.
  • Crawling dynamic pages relies on a headless browser, so make sure all dependencies are installed correctly.
  • If the file is too large, adjust the configuration to split it before uploading.

 

Application Scenarios

  1. Technical Support Assistant
    Crawl product documentation websites to generate AI assistants that help users answer technical questions.
  2. Content organization tool
    Crawl articles from blogs or news sites to build a knowledge base or Q&A assistant.
  3. Education and training assistant
    Crawl online course pages to generate learning assistants that provide course-related answers.

 

FAQ

  1. Can multiple websites be crawled?
    Yes. Just set multiple URLs and match rules in config.ts (see the sketch below).
  2. What if the file is too large to upload?
    Set maxFileSize or maxTokens to split the output into multiple smaller files.
  3. Are Chinese websites supported?
    Yes. As long as the site content can be parsed by the headless browser, it can be crawled normally.
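
A hedged sketch of what a multi-pattern setup might look like, assuming match accepts an array of patterns (check the Config type in the repo); if the tool only takes a single starting url per run, crawl each site in a separate run instead. The second pattern below is hypothetical:

// config.ts (sketch): matching several documentation sections in one crawl.
import { Config } from "./src/config"; // adjust the path to match the repo

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: [
    "https://www.builder.io/c/docs/**",
    "https://www.builder.io/c/blueprints/**", // hypothetical second section
  ],
  selector: ".docs-builder-container",
  maxPagesToCrawl: 100,
  outputFileName: "output.json",
};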