wdoc: retrieving content and summarizing knowledge from massive, multi-source documents


General Introduction

wdoc is a RAG (Retrieval-Augmented Generation) system for querying and summarizing large, heterogeneous collections of documents. It can retrieve from many source types, including PDFs, web pages, YouTube videos, and audio files, which makes it well suited to researchers, students, and professionals who work with large volumes of information. The system uses the LangChain library for document processing, supports a wide range of LLM (Large Language Model) providers, and aims for high-precision retrieval and summarization. wdoc is under active development, and user feedback and feature requests are welcome.


Function List

Multi-file type support: More than 15 file types are supported, including PDFs, web pages, YouTube videos, and audio files.
High-precision retrieval and summarization: Documents are retrieved and summarized accurately through embedding-based search and semantic batching.
Multi-LLM support: Multiple LLM providers are supported, including local models and private models with additional security layers.
Advanced RAG functions: A weaker LLM filters out irrelevant documents, then a stronger LLM produces precise answers, which are merged through semantic clustering and sorting.
Easy to extend: wdoc is not only a command-line tool but also a library that can be used from other Python projects.
Detailed documentation and help: Extensive documentation and help text make it easy to get started quickly.
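
The filter-then-answer idea described under "Advanced RAG functions" can be illustrated with a minimal Python sketch. This is not wdoc's code: the function names are hypothetical, the "weak" and "strong" models are replaced by toy stand-ins so the example runs offline, and the merge step simply removes duplicates instead of performing semantic clustering.

```python
from typing import Callable, List


def two_stage_answer(
    question: str,
    documents: List[str],
    is_relevant: Callable[[str, str], bool],
    answer_from: Callable[[str, str], str],
    combine: Callable[[List[str]], str],
) -> str:
    """Filter documents with a cheap check, then answer from the survivors."""
    # Stage 1: the cheap ("weak") model discards documents judged irrelevant.
    kept = [doc for doc in documents if is_relevant(question, doc)]
    if not kept:
        return "No relevant document found."
    # Stage 2: the expensive ("strong") model answers once per kept document.
    partial_answers = [answer_from(question, doc) for doc in kept]
    # Stage 3: the partial answers are merged into a single reply.
    return combine(partial_answers)


# Toy stand-ins (assumptions for illustration only, no LLM is called).
def keyword_filter(question: str, doc: str) -> bool:
    words = {w.lower().strip("?.,") for w in question.split()}
    return any(w in doc.lower() for w in words if len(w) > 3)


def echo_answer(question: str, doc: str) -> str:
    # Pretend the first sentence of the document is the answer.
    return doc.split(".")[0].strip()


def join_unique(answers: List[str]) -> str:
    # Drop exact duplicates while preserving order.
    return " / ".join(dict.fromkeys(answers))


docs = [
    "Embeddings map text to vectors. They enable semantic search.",
    "The weather was sunny all week.",
    "Semantic search ranks text by vector similarity. It needs embeddings.",
]
result = two_stage_answer(
    "How does semantic search work?", docs, keyword_filter, echo_answer, join_unique
)
print(result)
```

The point of the split is cost: the cheap check runs on every document, while the expensive step only runs on the documents that pass it.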

Usage Guide

Installation

wdoc currently requires Python 3.11. Make sure the correct Python version is installed, then follow the steps below:

Use pip to install:
```
pip install -U wdoc
```

Or install a specific git branch:

pip install git+https://github.com/thiswillbeyourgithub/wdoc@dev

It is recommended to install pdftotext and fasttext support:
```
# Quoting the extras keeps shells such as zsh from treating the brackets as glob patterns
pip install -U "wdoc[pdftotext]" "wdoc[fasttext]"
```

Usage

Add the required API key as an environment variable:
```
export OPENAI_API_KEY="your-api-key"
```

Start wdoc:

wdoc --task=query --path=path/to/your/document

Main Workflows

Document Search

Use wdoc to query the contents of a document:

wdoc --task=query --path=path/to/your/document --filetype=pdf --query="your query"

This command loads the PDF file at the specified path, searches it according to the query, and returns the relevant content.
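
When wdoc needs to be driven from a script, one option is to assemble the same command line programmatically. The sketch below uses only the flags shown in this article (`--task`, `--path`, `--filetype`, `--query`); the helper names `build_wdoc_command` and `run_wdoc` and the example path are hypothetical, and `run_wdoc` assumes that wdoc is installed and that the command exits on its own.

```python
import shlex
import subprocess
from typing import List, Optional


def build_wdoc_command(
    task: str,
    path: str,
    filetype: Optional[str] = None,
    query: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one wdoc invocation."""
    command = ["wdoc", f"--task={task}", f"--path={path}"]
    if filetype is not None:
        command.append(f"--filetype={filetype}")
    if query is not None:
        command.append(f"--query={query}")
    return command


def run_wdoc(command: List[str]) -> str:
    """Run wdoc and return what it printed to standard output."""
    # check=True raises CalledProcessError if wdoc exits with an error code.
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


command = build_wdoc_command(
    "query", "papers/report.pdf", filetype="pdf", query="What are the main findings?"
)
# Print the shell-quoted form for inspection; nothing is executed here.
print(shlex.join(command))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the query contains spaces or punctuation.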

Document Summary

Summarize the document using wdoc:

wdoc --task=summarize --path=path/to/your/document --filetype=pdf

This command summarizes the PDF file at the specified path and returns a detailed summary of its content.

Combined tasks

You can also combine query and summarization tasks:

wdoc --task=summarize_then_query --path=path/to/your/document --filetype=pdf

This command first summarizes the document and then lets you ask follow-up questions about it.

Advanced Features

wdoc supports a variety of advanced features such as:

Multi-file type support: Load multiple file types at once through recursive paths, files containing links, and similar inputs.
Advanced RAG functions: Improve retrieval accuracy with techniques such as multi-query search and semantic batching.
Local and private LLM support: Keep data secure so that it is not leaked to external providers.
Detailed documentation and help: Run wdoc --help for more information on how to use it.
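
To make "multi-query search" concrete, here is a small self-contained Python sketch of the general idea: the same question is phrased several ways, each phrasing is compared with every document, and a document keeps its best score. It is an illustration only, not wdoc's implementation; the bag-of-words `embed` function is a toy stand-in for a real embedding model, and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a real system uses a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def multi_query_search(queries: List[str], documents: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by their best similarity over all query phrasings."""
    doc_vectors = [embed(d) for d in documents]
    best: Dict[int, float] = {}
    for q in queries:
        query_vector = embed(q)
        for i, doc_vector in enumerate(doc_vectors):
            best[i] = max(best.get(i, 0.0), cosine(query_vector, doc_vector))
    ranked = sorted(best, key=lambda i: best[i], reverse=True)
    # Keep the top_k documents, dropping any with no overlap at all.
    return [documents[i] for i in ranked[:top_k] if best[i] > 0.0]


documents = [
    "vitamin d supports bone health",
    "the stock market fell on monday",
    "sunlight exposure raises cholecalciferol levels",
]
queries = ["vitamin d benefits", "cholecalciferol effects"]
print(multi_query_search(queries, documents))
```

With only the first phrasing, the third document would be missed because it never mentions "vitamin d"; the second phrasing is what brings it into the results.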
