
magic-html: extract body data from HTML URL, output plain text/markdown

General Introduction

magic-html is a Python library designed to simplify extracting the main body content from HTML. Whether the input is a complex HTML structure or a simple web page, the library aims to provide a convenient and efficient interface. It supports multimodal extraction and multiple layout extractors, including articles, forums, and WeChat articles, and it also supports LaTeX formula extraction and conversion.

Function List

  • Extracts the main body content from HTML
  • Supports multimodal extraction
  • Supports article, forum, and WeChat article layouts
  • Supports LaTeX formula extraction and conversion
  • Outputs plain text or markdown, as configured


Usage Guide


To install magic-html, use the pip command:

pip install magic-html


Once installed, it can be used with the following code:

from magic_html import GeneralExtractor

# Initialize the extractor
extractor = GeneralExtractor()

# Example HTML content
html = """
<html>
<head>
    <title>Example Domain</title>
</head>
<body>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</body>
</html>
"""

# Extract the body content and print the result
data = extractor.extract(html)
print(data)

Functional Operation Flow

  1. Initialize the extractor: Import GeneralExtractor from the magic_html package and create an instance.
  2. Prepare the HTML content: Provide the HTML from which the content should be extracted, as a string.
  3. Call the extraction method: Use the extract method to extract the body content. Different HTML types can be specified as needed, such as articles, forums, or WeChat posts.
  4. Output the result: The extraction result can be in plain text or markdown format, depending on the user's needs.
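The steps above can be sketched as one small helper. This is a minimal sketch, not the library's documented interface: the helper name `extract_body` and its `layout` parameter are hypothetical, and the keyword `html_type` (with values such as "forum" or "weixin") is an assumption about how magic-html selects a layout, so check it against the version you have installed.

```python
def extract_body(extractor, html, layout=None):
    """Run one extraction, optionally selecting a layout type.

    `extractor` is any object with an `extract(html, **kwargs)` method,
    for example a magic_html.GeneralExtractor instance.
    `layout` is forwarded as the keyword `html_type`; that keyword name
    is an assumption, not something confirmed by the text above.
    """
    kwargs = {}
    if layout is not None:
        kwargs["html_type"] = layout  # assumed keyword name
    return extractor.extract(html, **kwargs)


# Usage with magic-html installed (not executed here):
#   from magic_html import GeneralExtractor
#   data = extract_body(GeneralExtractor(), html)                  # default layout
#   data = extract_body(GeneralExtractor(), html, layout="forum")  # assumed value
```

Passing the extractor in as an argument keeps the helper independent of the library, so the same instance can be reused across many pages.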

Typical Example

Below is a complete example showing how to extract the body content from a simple HTML page:
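The original example is not preserved here, so what follows is a hedged reconstruction rather than the author's code. It assumes that `extract` returns a dict holding the extracted body under an "html" key, and falls back to `str(data)` if it does not. The tag-stripping helper `html_to_text` is a hypothetical stand-in built on the standard library, since the library's own plain-text or markdown conversion call is not shown in this article.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text and collapse runs of whitespace.
        if not self._skip_depth and data.strip():
            self.parts.append(" ".join(data.split()))


def html_to_text(fragment):
    """Reduce an HTML fragment to plain text, one text node per line."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return "\n".join(collector.parts)


def run_example():
    """Extract the body of a small page; requires magic-html to be installed."""
    from magic_html import GeneralExtractor  # lazy import: html_to_text works without it

    html = """
    <html>
    <head><title>Example Domain</title></head>
    <body>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents.</p>
    </body>
    </html>
    """
    data = GeneralExtractor().extract(html)
    # Assumption: the result is a dict with the body under "html".
    body = data["html"] if isinstance(data, dict) and "html" in data else str(data)
    print(html_to_text(body))
```

Call `run_example()` in an environment where magic-html is installed. `html_to_text` can be used on its own with any HTML fragment.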

May not be reproduced without permission: Chief AI Sharing Circle » magic-html: extract body data from HTML URL, output plain text/markdown
