
wdoc: retrieve content and summarize knowledge from massive, multi-source documents

General Introduction

wdoc is a RAG (Retrieval-Augmented Generation) system for processing and analyzing large, heterogeneous document collections. It can retrieve from many document types, including PDFs, web pages, YouTube videos, and audio files, which makes it well suited to researchers, students, and professionals who work with large volumes of source material. The system uses the LangChain library for document processing, supports a wide range of LLM (Large Language Model) providers, and aims for high-precision retrieval and summarization. wdoc is under active development, and user feedback and feature requests are welcome.



 

Feature List

  • Multi-file type support: Supports more than 15 file types, including PDFs, web pages, YouTube videos, and audio files.
  • High-precision retrieval and summarization: Retrieves and summarizes documents using embedding-based search and semantic batch processing.
  • Multi-LLM support: Works with multiple LLM providers, including local models and private models with additional security layers.
  • Advanced RAG features: A weaker LLM filters out irrelevant documents; a stronger LLM then produces precise answers, which are merged through semantic clustering and sorting.
  • Easy to extend: wdoc is a library as well as a tool, so it can be used from other Python projects.
  • Detailed documentation and help: Extensive documentation and help text make it quick to get started.
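The retrieve, filter, answer, and merge flow described in the list above can be sketched in a few lines of Python. This is only a toy illustration and not wdoc's code: the bag-of-words `embed` stands in for a real embedding model, `weak_filter` and `strong_answer` are hypothetical stand-ins for the weaker and stronger LLM calls, and the merge step only drops exact duplicates instead of clustering semantically.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=3):
    """Stage 1: rank chunks by similarity to the query and keep the top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def weak_filter(query, chunk):
    """Stage 2 (hypothetical weak-LLM stand-in): keep a chunk only if it
    shares a word longer than three characters with the query."""
    q_words = {w for w in embed(query) if len(w) > 3}
    return bool(q_words & set(embed(chunk)))


def strong_answer(query, chunk):
    """Stage 3 (hypothetical strong-LLM stand-in): 'answer' by quoting the chunk."""
    return f"Based on: {chunk}"


def answer(query, chunks, top_k=3):
    candidates = retrieve(query, chunks, top_k)
    relevant = [c for c in candidates if weak_filter(query, c)]
    partial = [strong_answer(query, c) for c in relevant]
    # Stage 4: merge. Only exact duplicates are dropped here, order preserved;
    # a real system would cluster and sort the partial answers semantically.
    return list(dict.fromkeys(partial))


chunks = [
    "Python 3.11 is required to run the tool.",
    "The cat sat on the mat.",
    "Install the tool with pip.",
]
result = answer("Which Python version is required?", chunks, top_k=2)
print(result)
```

The point of the two-model split is cost: the cheap filter runs on every retrieved chunk, while the expensive model only sees the chunks that survive it.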

 

User Guide

Installation

wdoc currently requires Python 3.11. Make sure you have the correct Python version, then follow the steps below to install it:

  1. Install with pip:
    pip install -U wdoc
  2. Alternatively, install a specific git branch (here, dev):
    pip install git+https://github.com/thiswillbeyourgithub/wdoc@dev
    
  3. Installing pdftotext and fasttext support is recommended:
    pip install -U "wdoc[pdftotext]" "wdoc[fasttext]"
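Before installing, it can help to confirm which interpreter pip will use. The check below is a minimal sketch; it assumes "requires Python version 3.11" means exactly the 3.11 series, which the text does not spell out, and `is_supported` is a name made up for this example.

```python
import sys


def is_supported(version_info=None):
    """Return True for Python 3.11, the version the text says wdoc requires.

    Assumption: only the 3.11 series counts; newer or older minors do not.
    """
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) == (3, 11)


print("Running Python", sys.version.split()[0], "| supported:", is_supported())
```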
    

Basic Usage

  1. Set the required API key as an environment variable:
    export OPENAI_API_KEY="your-api-key"
    
  2. Start wdoc:
    wdoc --task=query --path=path/to/your/documents
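A script that wraps wdoc can check that the key is set before launching it. The helper below is a sketch under assumptions: `missing_keys` is a hypothetical name, and `OPENAI_API_KEY` is the only variable checked because it is the only one this article mentions; other providers will need other variables.

```python
import os


def missing_keys(required=("OPENAI_API_KEY",), env=None):
    """Return the names in `required` that are unset or blank in `env`.

    `env` defaults to the process environment; pass a dict to test.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


absent = missing_keys()
if absent:
    print("Set these environment variables first:", ", ".join(absent))
```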
    

Typical Workflows

Document Search

Use wdoc to query the contents of a document:

wdoc --task=query --path=path/to/your/documents --filetype=pdf --query="your question"

This command loads the PDF file at the specified path, searches it for content matching the query, and returns the relevant passages.

Document Summary

Summarize a document with wdoc:

wdoc --task=summarize --path=path/to/your/documents --filetype=pdf

This command summarizes the PDF file at the specified path and returns a detailed summary of its content.

Combined tasks

You can also combine the summarization and query tasks:

wdoc --task=summarize_then_query --path=path/to/your/documents --filetype=pdf

This command first summarizes the document and then lets you ask follow-up questions about the summary.
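When calling wdoc from a script, building the argument list in code avoids shell-quoting mistakes with paths or queries that contain spaces. The sketch below uses only the flags and task names that appear in the commands above; `build_command` is a hypothetical helper, not part of wdoc, and the file name is made up.

```python
import shlex

# Task names taken from the commands shown in this article.
TASKS = {"query", "summarize", "summarize_then_query"}


def build_command(task, path, filetype=None, query=None):
    """Return the argv list for a wdoc call using only the flags shown above."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    argv = ["wdoc", f"--task={task}", f"--path={path}"]
    if filetype is not None:
        argv.append(f"--filetype={filetype}")
    if query is not None:
        argv.append(f"--query={query}")
    return argv


argv = build_command(
    "query", "notes/my paper.pdf", filetype="pdf", query="What is the main result?"
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

With wdoc installed, the same list can be passed to `subprocess.run(argv, check=True)` without going through a shell.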

Advanced Features

wdoc supports a variety of advanced features such as:

  • Multi-file type support: Load multiple file types at once via recursive paths, files of links, and so on.
  • Advanced RAG features: Improve retrieval accuracy with techniques such as multi-query search and semantic batch processing.
  • Local and private LLM support: Keep data secure so that it is not leaked to external providers.
  • Detailed documentation and help: Run wdoc --help for more information on usage.
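To make "recursive paths" concrete, the sketch below walks a directory tree and groups the files it finds by extension, which is the kind of inventory a multi-file loader needs before picking a parser for each file. It illustrates the idea only and is not how wdoc itself discovers files; `group_by_extension` and the demo file names are invented for this example.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_extension(root):
    """Walk `root` recursively and group file paths by lowercase extension."""
    groups = defaultdict(list)
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            # Files without a suffix are collected under "(none)".
            groups[p.suffix.lower().lstrip(".") or "(none)"].append(p)
    return dict(groups)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.PDF", "sub/b.pdf", "sub/notes.txt"):
        (root / name).write_text("demo")
    counts = {ext: len(paths) for ext, paths in group_by_extension(root).items()}

print(counts)
```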