
magic-html: extract main body content from HTML and output plain text or markdown

General Introduction

magic-html is a Python library that simplifies extracting the main body region from HTML. Whether the input is a complex HTML structure or a simple web page, the library aims to give users a convenient and efficient interface. It supports multimodal extraction, provides multiple layout extractors (articles, forums, and WeChat articles), and supports LaTeX formula extraction and conversion.

Function List

  • Extracts the main body region of an HTML page
  • Supports multimodal extraction
  • Supports article, forum, and WeChat article layouts
  • Supports LaTeX formula extraction and conversion
  • Supports custom output as plain text or markdown

 

Usage Guide

Installation

To install magic-html, use the pip command:


pip install magic-html
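
Note that the pip package name uses a hyphen (magic-html) while the import name uses an underscore (magic_html). A quick way to confirm the package is visible to the current interpreter is sketched below; `is_installed` is a hypothetical helper written for this illustration, not part of the library.

```python
import importlib.util


def is_installed(module_name: str = "magic_html") -> bool:
    """Return True if the named top-level module can be found for import."""
    # find_spec returns None when no such top-level module is found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed():
        print("magic_html is available")
    else:
        print("magic_html not found; check which environment pip installed into")
```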

Usage

Once installed, the library can be used as follows:

from magic_html import GeneralExtractor

# Initialize the extractor
extractor = GeneralExtractor()

# Sample HTML content
html = """
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""

# Extract the data
data = extractor.extract(html)
print(data)
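
The call above relies on the default article layout. For forum or WeChat pages, the project's README (not this article) passes extra keyword arguments to extract. The sketch below assumes those arguments are `base_url` and `html_type`, with the values "forum" and "weixin"; check them against your installed version. `extract_by_layout` is a hypothetical wrapper, and its `extractor` parameter exists only so the helper can be exercised without the library.

```python
from typing import Any, Optional

# Layout names assumed from the magic-html README; verify before relying on them.
ALLOWED_LAYOUTS = ("article", "forum", "weixin")


def extract_by_layout(html: str, base_url: str = "", layout: str = "article",
                      extractor: Optional[Any] = None) -> Any:
    """Call extractor.extract with the keyword arguments for the chosen layout."""
    if layout not in ALLOWED_LAYOUTS:
        raise ValueError(f"unsupported layout: {layout!r}")
    if extractor is None:
        # Imported lazily so this helper can be defined without the library.
        from magic_html import GeneralExtractor
        extractor = GeneralExtractor()
    kwargs = {"base_url": base_url}
    if layout != "article":
        # Assumption: the article layout is the default when html_type is omitted.
        kwargs["html_type"] = layout
    return extractor.extract(html, **kwargs)


# Example call (requires magic-html to be installed):
# data = extract_by_layout(html, base_url="http://example.com/", layout="forum")
```

Passing the layout explicitly keeps the choice visible at the call site, which helps when one pipeline handles several kinds of pages.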

Workflow

  1. Initialize the extractor: import GeneralExtractor from the magic_html package and create an instance.
  2. Prepare the HTML content: supply the HTML to extract from as a string.
  3. Call the extraction method: use the extract method to extract the body content. The layout type (article, forum, or WeChat article) can be specified as needed.
  4. Output the result: the extraction result can be plain text or markdown, depending on the user's needs.
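
Step 4 leaves open how the plain-text form is produced. If the extraction result contains the body region as an HTML fragment (an assumption; inspect what extract returns in your version), it can be flattened with the standard library alone. The sketch below is one way to do that; `html_to_text` and the set of block tags are illustrative choices, not part of magic-html.

```python
from html.parser import HTMLParser

# Closing one of these tags ends a line of text (a small, illustrative set).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}


class _TextCollector(HTMLParser):
    """Collect text nodes, inserting line breaks at block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Note: script and style contents are not filtered out here.
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Convert an HTML fragment to plain text, one block element per line."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    raw_lines = "".join(collector.parts).split("\n")
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<div><h1>Example Domain</h1><p>More information...</p></div>"))
```

For markdown output, a dedicated HTML-to-markdown converter is a better fit than extending this sketch, since links, lists, and formulas need tag-specific handling.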

Typical Example

Below is a complete example showing how to extract the body content from a simple HTML page:

from magic_html import GeneralExtractor

# Initialize the extractor
extractor = GeneralExtractor()

# Sample HTML content
html = """
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""

# Extract the data
data = extractor.extract(html)
print(data)
May not be reproduced without permission: Chief AI Sharing Circle » magic-html: extract body data from HTML URL, output plain text/markdown