General Introduction
HtmlRAG is an open source project focused on improving how retrieval-augmented generation (RAG) systems process HTML documents. The project argues that keeping HTML formatting in a RAG pipeline is more effective than converting retrieved pages to plain text. It covers a complete data processing flow, from query rewriting and HTML document crawling to an evaluation system for answer generation. It supports several mainstream datasets, including ASQA, HotpotQA, and NQ, and provides detailed evaluation metrics to measure system performance. The project provides not only a source code implementation but also preprocessed datasets and evaluation tools, enabling researchers to easily reproduce the results and build further improvements.
Paper address: https://arxiv.org/pdf/2411.02959
- Lossless HTML cleanup: this cleanup step removes only completely irrelevant content and compresses redundant structure, preserving all semantic information in the original HTML. The cleaned, compressed HTML is suitable for RAG systems that use long-context LLMs and are unwilling to lose any information before generation.
- Two-step block-tree-based HTML pruning: block-tree-based HTML pruning consists of two steps, both performed on the block tree structure. The first pruning step uses an embedding model to compute scores for blocks, while the second step uses a path-generative model. The first step operates on the result of the lossless HTML cleanup, and the second step operates on the result of the first pruning step.
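For illustration, the first (embedding-based) pruning step can be sketched roughly as below. This is a minimal sketch under assumptions, not the project's actual API: the helper score_blocks is hypothetical, the blocks are assumed to be the plain-text contents of block-tree nodes, and the model name BAAI/bge-base-en is borrowed from the evaluation example later in this document.
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings

def score_blocks(question: str, blocks: list[str], top_k: int = 5) -> list[tuple[str, float]]:
    # Embed the question and every block, then rank blocks by cosine similarity
    embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en")
    q_vec = np.array(embeddings.embed_query(question))
    block_vecs = np.array(embeddings.embed_documents(blocks))
    scores = block_vecs @ q_vec / (
        np.linalg.norm(block_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8
    )
    ranked = sorted(zip(blocks, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]  # keep only the highest-scoring blocks for step two
Blocks that survive this step are handed to the path-generative model used in the second pruning step, an example prompt for which is shown below.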
Example prompt:
**HTML**: "{HTML}"
**Question**: **{Question}**
Your task is to identify the text fragment from the HTML document that is most relevant to the given question. The text fragment can be a direct paraphrase of a fact or can be used as supporting evidence to infer that fact. The overall length of the text fragment should be more than 20 words and less than 300 words. You need to provide the path to the text fragment in the HTML document.
**Example output**: <html1><body><div2><p>Some key information...
**Output**: <html1><body><div2><p>Shinsuke Nakamura won the Men's Battle Royal at the historic 2018 Battle Royal...
Function List
- Query processing and rewriting, with support for decomposing complex questions into subqueries
- HTML document crawling and processing system, preserving document structure information
- Multi-dataset support (ASQA, HotpotQA, NQ, TriviaQA, MuSiQue, ELI5)
- Complete assessment framework, including multiple dimensions such as correctness and relevance of answers
- Custom dataset processing support, allowing users to work with their own data
- Integrated Bing search function for retrieving relevant web pages
- Detailed scoring system with a variety of assessment metrics, including ROUGE scores
- Support for integration with the LangChain framework
- Results visualization and analysis tools
Using Help
1. Environment configuration
First you need to clone the project repository and install the necessary dependencies:
git clone https://github.com/plageon/HtmlRAG
cd HtmlRAG
pip install -r requirements.txt
2. Dataset preparation
The project supports two ways of using data:
- Use a pre-processed dataset:
- The dataset is stored in the html_data folder
- The full test dataset can be downloaded from huggingface: HtmlRAG-test
- Supported dataset formats include ASQA, HotpotQA, NQ, etc.
- Use custom data:
- Prepare the query file in .jsonl format, with each line containing the following fields:
{ "id": "unique_id", "question": "query_text", "answers": ["answer_text_1", "answer_text_2"] }
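A minimal sketch for writing such a query file is shown below; the file name custom_queries.jsonl and the example entry are purely illustrative.
import json

queries = [
    {
        "id": "example-001",
        "question": "When was the Eiffel Tower completed?",
        "answers": ["1889"],
    },
]
with open("custom_queries.jsonl", "w", encoding="utf-8") as f:
    for query in queries:
        # One JSON object per line, as required by the .jsonl format
        f.write(json.dumps(query, ensure_ascii=False) + "\n")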
3. Usage flow for the main functions
Query processing
- Query Rewrite:
- The system automatically breaks down complex questions into multiple subqueries
- The rewrite result is stored in the rewrite_method_result field of the json object
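As a hedged illustration, a rewritten record might look like the sketch below. Only the rewrite_method_result field name comes from the project description; the inner sub-query structure shown here is an assumption.
import json

# Hypothetical record after query rewriting; the nested structure is assumed
record = {
    "id": "example-001",
    "question": "Who directed the film that won Best Picture at the 1998 Academy Awards?",
    "rewrite_method_result": {
        "sub_queries": [
            "Which film won Best Picture at the 1998 Academy Awards?",
            "Who directed that film?",
        ]
    },
}
print(json.dumps(record, indent=2, ensure_ascii=False))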
HTML document processing
- Page Crawl:
- The system uses Bing to search for relevant URLs
- Automatic crawling of static HTML documents
- Retain information about the HTML structure of the document
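A minimal crawling sketch, assuming the requests and BeautifulSoup libraries; the project's own crawler may differ, and the URL below is just a placeholder.
import requests
from bs4 import BeautifulSoup

def fetch_html(url: str, timeout: int = 10) -> str:
    # Download a static page and return the raw HTML so its structure is preserved
    resp = requests.get(url, timeout=timeout, headers={"User-Agent": "HtmlRAG-demo"})
    resp.raise_for_status()
    return resp.text

html = fetch_html("https://example.com")
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string if soup.title else "no <title> found")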
Evaluation system
- Evaluation metric configuration:
- Assessment of correctness of answers
- Assessment of answer relevance
- ROUGE score calculation
- Customized assessment indicator settings
- Use of assessment tools:
from ragas.metrics import AnswerRelevancy
from langchain.embeddings import HuggingFaceEmbeddings
# Initialize the evaluator
embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-base-en')
answer_relevancy = AnswerRelevancy(embeddings=embeddings)
# loading the model
answer_relevancy.init_model()
# Run the evaluation
results = answer_relevancy.score(dataset)
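For the ROUGE scores mentioned above, one option is the rouge-score package; this is an assumption for illustration, and the project may use a different implementation.
from rouge_score import rouge_scorer

# Compare a generated answer against a reference answer
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The Eiffel Tower was completed in 1889.",
    prediction="The Eiffel Tower was finished in 1889.",
)
print(scores["rougeL"].fmeasure)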
4. Analysis and optimization of results
- Analysis of assessment results using the visualization tools provided
- Adjustment of system parameters based on assessment indicators
- Optimizing Query Rewriting Strategies and Document Processing Methods
5. Cautions
- Ensure enough storage space to handle HTML data
- Compliance with API usage limits and rate limits
- Regularly update dependency packages for the latest features
- Pay attention to the correctness of the data format
- GPUs are recommended to accelerate large-scale data processing
Reference note on HTML document cleaning
HTML cleaning rules (Clean Patterns)
import bs4

def clean_html(html: str) -> str:
    # 1. Parse the HTML with BeautifulSoup
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # 2. Call simplify_html to simplify the document
    html = simplify_html(soup)
    # 3. Call clean_xml to clean up XML markup
    html = clean_xml(html)
    return html
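simplify_html and clean_xml are the project's own helpers and are not reproduced here. As a rough, hypothetical sketch of the kind of simplification involved (based on the cleaning rules listed in the next section, not the actual implementation):
import bs4

def simplify_html_sketch(soup: bs4.BeautifulSoup) -> str:
    # Drop script/style tags and HTML comments
    for tag in soup(["script", "style"]):
        tag.decompose()
    for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
        comment.extract()
    # Strip attributes so only structural tags and text remain
    for tag in soup.find_all(True):
        tag.attrs = {}
    return str(soup)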
1. HTML document processing prompts (the following content is illustrative and not project code)
# HTML cleaning prompt
CLEAN_HTML_PROMPT = """
Task: clean HTML document to retain valid information
Input: original HTML document
Requirements:
1. Remove:
- JavaScript code (<script> tags)
- CSS style information (<style> tags)
- HTML comments
- Empty tags
- Redundant attributes
2. Keep:
- Main content tags (<title>, <p>, <div>, <h1>-<h6>, etc.)
- Text content
- Structural relationships
Output: cleaned HTML text
"""
# HTML chunking prompt
HTML_BLOCK_PROMPT = """
Task: split HTML document into semantically complete blocks
Input:
- Cleaned HTML document
- Maximum block size: {max_words} words
Requirements:
1. maintain the hierarchical structure of the HTML tags
2. ensure semantic integrity of each block
3. document the block's hierarchical path
4. control the size of each block does not exceed the limit
Output: list of blocks in JSON format, containing:
{
"blocks": [
{
"content": "Block content",
"path": "HTML path",
"depth": "Hierarchy depth"
}
]
}
"""
2. Prompts related to block tree construction
# Block tree node scoring prompt
BLOCK_SCORING_PROMPT = """
TASK: Evaluate the relevance of an HTML block to a question
Input:
- Question: {question}
- HTML block: {block_content}
- Block path: {block_path}
Requirements:
1. calculate the semantic relevance score (0-1)
2. consider the following factors:
- Text similarity
- Structural importance
- Contextual relevance
Output:
{
"score": float, # relevance score
"reason": "reason for score"
}
"""
# Block tree pruning prompt
TREE_PRUNING_PROMPT = """
Task: decide whether to keep the HTML block
Input:
- Question: {question}
- Current block: {current_block}
- Parent block: {parent_block}
- List of child blocks: {child_blocks}
Requirements:
1. analyze the importance of the block:
- Relevance to the question
- Relationship to parent and child blocks
2. Generate retention/deletion decisions
Output:
{
"action": "keep/remove",
"confidence": float, # Decision confidence level
"reason": "reason for decision"
}
"""
3. Knowledge retrieval-related prompts
# Relevance calculation prompt
RELEVANCE_PROMPT = """
TASK: Calculate the relevance of a document fragment to a question
Input:
- Question: {question}
- Document fragment: {text_chunk}
Evaluation criteria:
1. direct relevance: the extent to which the content directly answers the question
2. Indirect relevance: the extent to which context or additional information is provided.
3. completeness of information: whether the answer is complete
Output:
{
"relevance_score": float, # Relevance score 0-1
"reason": "Reason for scoring".
"key_matches": ["key match 1", "key match 2", ...]
}
"""
# Answer generation prompt
ANSWER_GENERATION_PROMPT = """
TASK: Generate an answer based on a relevant document
Input:
- User question: {question}
- Related documents: {relevant_docs}
- Document scores: {doc_scores}
Requirements:
1. synthesize information from highly relevant documents
2. maintain accuracy of answers
3. ensure that answers are complete and coherent
4. cite sources where necessary
Output format:
{
"answer": "Generated answer", "sources": ["Documentation sources used"], ["Sources used"], {
"sources": ["Document sources used"], "confidence": float # Answer confidence.
"confidence": float # Answer Confidence
}
"""
4. Evaluation-related prompts
# Answer evaluation prompt
EVALUATION_PROMPT = """
TASK: Evaluate the quality of generated answers
Input:
- Original question: {question}
- Generated answer: {generated_answer}
- Reference answer: {reference_answer}
Evaluate the dimensions:
1. correctness: whether the information is accurate
2. completeness: whether the question is answered in full
3. relevance: how relevant the content is to the question
4. coherence: whether the presentation is clear and consistent
Output:
{
"scores": {
"correctness": float, "completeness": float, {
"completeness": float, "relevance": float, "coherence": {
"relevance": float, {
"coherence": float
}, "feedback": "Detailed comments".
"feedback": "Detailed assessment comments"
}
"""
# Error analysis prompt
ERROR_ANALYSIS_PROMPT = """
TASK: Analyze potential errors in answers
Input:
- Generated answer: {answer}
- Reference: {reference_docs}
Analyze the key points:
1. factual accuracy checking
2. logical reasoning verification
3. source verification
Output:
{
"errors": [
{
"type": "Error type",
"description": "Description of the error",
"correction": "Suggested correction"
}
], "reliability_score".
"reliability_score": float # Reliability Score
}
"""