General Introduction
magic-html is a Python library that simplifies extracting the main body content from HTML. Whether the page is structurally complex or simple, the library aims to give users a convenient, efficient interface. It supports multimodal extraction and several layout extractors (articles, forums, and WeChat articles), and it can also extract and convert LaTeX formulas.
Feature List
- Extracts the main body content of an HTML page
- Supports multimodal extraction
- Supports article, forum, and WeChat article layouts
- Supports LaTeX formula extraction and conversion
- Customizable output in plain text or Markdown format
Usage Guide
Installation
To install magic-html, use the pip command:
pip install magic-html
Usage
Once installed, the library can be used as follows:
from magic_html import GeneralExtractor
# Initialize the extractor
extractor = GeneralExtractor()
# Sample HTML content
html = """
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""
# Extract the data
data = extractor.extract(html)
print(data)
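The snippet above embeds the HTML inline. In practice the markup often comes from a saved page, so the following is a minimal sketch of loading it into a string first. The helper name load_html, the file name in the usage note, and the fallback encoding list are illustrative assumptions, not part of magic-html.

```python
from pathlib import Path


def load_html(path, encodings=("utf-8", "gb18030")):
    """Read a saved page and return it as a str, trying each encoding in turn.

    Hypothetical helper for illustration; not part of magic-html.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters rather than fail.
    return raw.decode("utf-8", errors="replace")
```

The returned string can then be passed to the extractor in place of the inline literal, for example extractor.extract(load_html("page.html")), where page.html is a placeholder file name.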
Workflow
- Initialize the extractor: import GeneralExtractor from the magic_html package and create an instance.
- Prepare the HTML content: have the HTML to extract from available as a string.
- Call the extraction method: use the extract method to extract the body content. A different HTML type, such as article, forum, or WeChat article, can be specified as needed.
- Output the result: the extraction result can be in plain text or Markdown format, depending on the user's needs.
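The workflow mentions choosing an HTML type, but the article does not show how that choice is passed. The sketch below wraps the call in a small helper. The keyword names base_url and html_type, and the layout values "article", "forum", and "weixin", are assumptions taken from the project's README rather than from this article, so check them against the installed version. The helper name extract_body is hypothetical.

```python
from typing import Any

# Layout names assumed from the project's README; verify before relying on them.
KNOWN_LAYOUTS = ("article", "forum", "weixin")


def extract_body(extractor: Any, html: str, base_url: str = "",
                 layout: str = "article") -> Any:
    """Run one extraction after validating the layout name.

    `extractor` is any object with an `extract` method, for example a
    magic_html.GeneralExtractor instance.
    """
    if layout not in KNOWN_LAYOUTS:
        raise ValueError(
            f"unknown layout {layout!r}; expected one of {KNOWN_LAYOUTS}"
        )
    # `base_url` and `html_type` are assumed keyword names (see lead-in).
    return extractor.extract(html, base_url=base_url, html_type=layout)
```

With the library installed, a forum page could then be handled with extract_body(GeneralExtractor(), html, "https://example.com/", "forum"). Validating the layout up front turns a mistyped name into an immediate error instead of a silently mis-extracted page.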