magic-html: extract main body content from HTML, output as plain text or markdown

Latest AI Resources · updated 11 months ago · AI Sharing Circle

General Introduction

magic-html is a Python library for extracting the main body content from HTML. It aims to offer a convenient, efficient interface for both complex HTML structures and simple web pages. It supports multimodal extraction and several layout extractors (article, forum, and WeChat article), as well as extraction and conversion of LaTeX formulas.

Features

Extracts the main body content from HTML
Supports multimodal extraction
Supports article, forum, and WeChat article layouts
Supports LaTeX formula extraction and conversion
Customizable output in plain text or markdown format

Usage Guide

Installation

To install magic-html, use the pip command:

pip install magic-html
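
To confirm that the install landed in the interpreter you intend to use, a small check with the standard library can help. This is only a sketch: installed_version is a hypothetical helper name, and it assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the name passed to pip.
    version = installed_version("magic-html")
    print(version or "magic-html is not installed in this environment")
```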

Usage

Once installed, it can be used with the following code:

from magic_html import GeneralExtractor

# Initialize the extractor
extractor = GeneralExtractor()

# Sample HTML content
html = """
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""

# Extract the data
data = extractor.extract(html)
print(data)
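
The example above uses an inline HTML string. For a live page, the HTML has to be downloaded first. The following is a minimal sketch using only the Python standard library; fetch_html is a hypothetical helper that is not part of magic-html, and the User-Agent value and UTF-8 fallback are assumptions that may need adjusting for a given site.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it to a string."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header; fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server declared a charset that Python does not know.
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # A data: URL keeps the demo offline; http(s) URLs go through the same call.
    html = fetch_html("data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E")
    print(html)
```

The returned string can then be passed to extractor.extract(html) exactly as in the example above.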

Workflow

Initialize the extractor: Import GeneralExtractor from the magic_html package and create an instance.
Prepare the HTML content: Provide the HTML from which content should be extracted, as a string.
Call the extraction method: Use the extract method to extract the body content. Different HTML types, such as articles, forums, or WeChat posts, can be specified as needed.
Output the result: The extraction result can be in plain text or markdown format, depending on the user's needs.
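
The steps above can be wrapped in one small function. This is a sketch under stated assumptions, not the library's documented interface: extract_body and EchoExtractor are hypothetical names, and the html_type keyword used to select a layout is an assumption that should be checked against the magic-html documentation. The stand-in extractor is there only so the sketch runs without magic-html installed.

```python
from typing import Any, Optional, Protocol


class Extractor(Protocol):
    """Anything with an extract() method, such as magic_html.GeneralExtractor."""

    def extract(self, html: str, **kwargs: Any) -> Any: ...


def extract_body(extractor: Extractor, html: str, html_type: Optional[str] = None) -> Any:
    """Validate the HTML string, then hand it to the extractor."""
    if not html or not html.strip():
        raise ValueError("html must be a non-empty string")
    # Assumption: the layout is selected with an html_type keyword argument.
    kwargs = {"html_type": html_type} if html_type else {}
    return extractor.extract(html, **kwargs)


class EchoExtractor:
    """Hypothetical stand-in that reports what it was called with."""

    def extract(self, html: str, **kwargs: Any) -> dict:
        return {"length": len(html), "options": kwargs}


if __name__ == "__main__":
    # With magic-html installed, pass GeneralExtractor() instead of EchoExtractor().
    print(extract_body(EchoExtractor(), "<p>Hello</p>", html_type="forum"))
```

Passing the extractor in as an argument keeps the flow easy to test and lets a single extractor instance be reused across many pages.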
