
llms.txt Generator: Rapidly crawls website content and generates LLM training text datasets.

General Introduction

llmstxt-generator is a web content extraction and consolidation tool for preparing text datasets for Large Language Model (LLM) training and inference. Developed by Mendable AI, it uses web crawling provided by @firecrawl_dev and GPT-4o-mini for text processing. It automatically crawls a given website and consolidates the content into a single standardized text file. The tool offers both a web interface and an API, which makes it well suited to AI researchers and developers who need to collect text data from websites in bulk.


Feature List

  • Automatically crawls all relevant pages of the target website
  • Provides two text output formats: a standard version (llms.txt) and a full version (llms-full.txt)
  • Supports operation through a web interface
  • Provides a RESTful API for programmatic calls
  • Applies special handling to GitHub repository content
  • Extracts and processes web content intelligently
  • Accepts a custom Firecrawl API key to raise crawl limits
  • Includes a web caching mechanism to make data capture more efficient
  • Supports conversion between multiple content formats (e.g. Markdown)

 

Usage Guide

1. Web interface usage

  1. Visit the official website: https://llmstxt.firecrawl.dev
  2. Enter the URL of the target website in the input box.
  3. Click the "Generate" button to start generating the text.
  4. Wait for processing to complete, then retrieve the generated text file.

2. API usage

Basic API calls:

GET https://llmstxt.firecrawl.dev/[YOUR_URL_HERE]
  • To get the standard version (llms.txt): request the URL above as is.
  • To get the full version (llms-full.txt): add "/full" to the end of the URL.
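
The two request forms above can be sketched in Python. This is a minimal illustration, not the service's official client: the helper names `build_llmstxt_url` and `fetch_llmstxt` are hypothetical, and the code assumes the target URL is appended to the endpoint as is, without percent-encoding, as the pattern above suggests.

```python
from urllib.request import Request, urlopen

BASE_URL = "https://llmstxt.firecrawl.dev"


def build_llmstxt_url(target: str, full: bool = False) -> str:
    """Return the generator endpoint for `target` (hypothetical helper)."""
    # Assumption: the target URL is appended unencoded, per the documented pattern.
    url = f"{BASE_URL}/{target.strip().rstrip('/')}"
    # The full version (llms-full.txt) is requested by appending "/full".
    return f"{url}/full" if full else url


def fetch_llmstxt(target: str, full: bool = False, timeout: float = 300.0) -> str:
    """Send the GET request and return the generated text."""
    # The User-Agent value is an arbitrary placeholder chosen for this sketch.
    request = Request(build_llmstxt_url(target, full),
                      headers={"User-Agent": "llmstxt-example"})
    # Generation can take minutes, so the default timeout is generous.
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_llmstxt("https://example.com")` would request the standard version, and passing `full=True` would request the full version.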

Using a custom API key:

If you need a higher crawl limit, you can supply your own Firecrawl API key:

GET https://llmstxt.firecrawl.dev/[YOUR_URL_HERE]?FIRECRAWL_API_KEY=YOUR_API_KEY
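
As a rough sketch, the key can be attached as a query parameter with Python's standard library. The helper name `build_url_with_key` is hypothetical, and placing "/full" before the query string is an assumption, since the source only shows the key on the standard URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://llmstxt.firecrawl.dev"


def build_url_with_key(target: str, api_key: str, full: bool = False) -> str:
    """Build a request URL that carries a custom Firecrawl API key (hypothetical helper)."""
    # Assumption: the target URL is appended unencoded, as in the documented pattern.
    path = f"{BASE_URL}/{target.strip().rstrip('/')}"
    if full:
        # Assumption: "/full" comes before the query string.
        path += "/full"
    # urlencode escapes any characters in the key that are unsafe in a query string.
    return f"{path}?{urlencode({'FIRECRAWL_API_KEY': api_key})}"
```

Because the key travels in the URL, it can end up in shell history and server logs, so it is worth reading it from an environment variable rather than hard-coding it.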

3. Local deployment

If you need to run it in a local environment, please follow the steps below:

  1. Environment configuration:
    Create a .env file and set the following required parameters:
FIRECRAWL_API_KEY=your Firecrawl API key
SUPABASE_URL=your Supabase URL
SUPABASE_KEY=your Supabase key
OPENAI_API_KEY=your OpenAI API key
  2. Installation and startup:
npm install
npm run dev
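
Before starting the dev server, it can help to confirm that the .env file defines all four parameters. The following is a small stand-alone sketch of such a check; it is not part of the project, the function names are hypothetical, and it handles only simple KEY=value lines.

```python
REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "SUPABASE_URL", "SUPABASE_KEY", "OPENAI_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> list[str]:
    """Return the required parameters that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_keys(open(".env").read())` returns an empty list when every required parameter has a value.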

4. Usage notes

  • Processing time: generation may take a few minutes because the site has to be crawled and the content processed by the LLM.
  • Free limit: without a custom API key, up to 10 pages are crawled per site.
  • Higher limit: with a custom Firecrawl API key, up to 100 pages are crawled.
  • Caching: results are cached, so repeated requests for the same URL within 3 days return the cached content directly.
  • GitHub repository support: GitHub repository URLs receive special processing to extract repository-related content.
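
The page limits and the cache window above can be captured as client-side constants, which is handy when planning a batch of requests. This sketch only restates the documented numbers and does not reproduce the service's internal logic. The names are hypothetical, and treating exactly 3 days as already expired is an assumption.

```python
from datetime import datetime, timedelta

FREE_PAGE_LIMIT = 10             # pages per site without a custom API key
KEYED_PAGE_LIMIT = 100           # pages per site with a custom Firecrawl API key
CACHE_TTL = timedelta(days=3)    # documented cache window


def page_limit(has_custom_key: bool) -> int:
    """Return the documented crawl limit for the given setup."""
    return KEYED_PAGE_LIMIT if has_custom_key else FREE_PAGE_LIMIT


def likely_cached(last_request: datetime, now: datetime) -> bool:
    """Estimate whether a repeated request would be served from the cache."""
    # Assumption: the window is exclusive, so exactly 3 days counts as expired.
    return now - last_request < CACHE_TTL
```

A site with more pages than `page_limit(...)` returns will produce a partial dataset, so it is worth comparing the limit against the size of the site before relying on the output.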

May not be reproduced without permission: Chief AI Sharing Circle » llms.txt Generator: Rapidly crawls website content and generates LLM training text datasets.
