magic-html: Extract Main-Body Content from HTML Pages, with Plain-Text or Markdown Output

Overview

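magic-html is a Python library designed to simplify extracting main-body content from HTML. Whether you are working with complex HTML structures or simple web pages, it aims to provide a convenient and efficient interface. It supports multimodal extraction and extractors for multiple layouts, such as articles, forums, and WeChat posts, and it also supports extracting and converting LaTeX formulas.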

Features

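Extracts main-body content from HTML
Supports multimodal extraction
Supports article, forum, and WeChat post layouts
Supports LaTeX formula extraction and conversion
Custom output as plain text or Markdown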

Usage Guide

Installation

Install magic-html with pip:

pip install magic-html
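
As a quick sanity check after installation, the sketch below uses only the Python standard library to test whether a module can be located. The module name magic_html is taken from the import statement used in this article; the helper name is_installed is hypothetical and introduced here for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "magic_html" is the import name used in this article's examples.
    print("magic_html available:", is_installed("magic_html"))
```

The check only confirms that Python can find the module. It does not import it or verify its version.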

Usage

Once installed, the library can be used as follows:

from magic_html import GeneralExtractor

# Initialize the extractor
extractor = GeneralExtractor()

# Sample HTML content
html = """
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""

# Extract the data
data = extractor.extract(html)
print(data)

Workflow

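Initialize the extractor: import the magic-html library and create an extractor instance.
Prepare the HTML content: have the HTML you want to extract from ready, for example as a string.
Call the extraction method: use the extract method to extract the main-body content. If needed, you can specify the HTML type, such as article, forum, or WeChat post.
Output: the extraction result can be delivered as plain text or Markdown, depending on your needs.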
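
The steps above can be sketched as a small helper. This article only demonstrates extractor.extract(html); the keyword names base_url and html_type, and a value such as "forum", are assumptions about the library's interface and should be checked against the installed version. The helper extract_page and the demo function are hypothetical, and extract_page accepts any object that exposes a compatible extract method.

```python
from typing import Any, Optional


def extract_page(extractor: Any, html: str,
                 base_url: Optional[str] = None,
                 html_type: Optional[str] = None) -> Any:
    """Run one extraction, forwarding only the options the caller set.

    `base_url` and `html_type` are ASSUMED keyword names; the article
    itself only demonstrates `extractor.extract(html)`.
    """
    options = {}
    if base_url is not None:
        options["base_url"] = base_url
    if html_type is not None:
        options["html_type"] = html_type
    return extractor.extract(html, **options)


def demo() -> None:
    """Walk through the workflow with the real library (needs magic-html)."""
    from magic_html import GeneralExtractor  # imported lazily on purpose

    # Initialize the extractor.
    extractor = GeneralExtractor()
    # Prepare the HTML content as a string.
    html = "<html><body><div><h1>Title</h1><p>Body text.</p></div></body></html>"
    # Call the extraction method (no assumed options are passed here).
    data = extract_page(extractor, html)
    # Inspect the output.
    print(data)
```

Because optional arguments are forwarded only when set, calling extract_page(extractor, html) is equivalent to the extractor.extract(html) call shown earlier. Running demo() requires magic-html to be installed.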
