
E2M: Convert multiple file formats to Markdown for easy document formatting unification

General Introduction

E2M (Everything to Markdown) is an open-source Python library that converts a wide range of file formats to Markdown. Supported inputs include doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. E2M uses a parser-converter architecture: each format has a dedicated parser, which extracts text and images from the file, and a converter, which turns the extracted content into Markdown. Flexible configuration options make the output usable as data for retrieval-augmented generation (RAG) and for model training or fine-tuning. The project's goal is high-quality data conversion that simplifies unifying document formats.
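
To make the two-stage design concrete, here is a minimal sketch of a parse-then-convert pipeline. It does not use E2M itself: `ParsedData`, `PlainTextParser`, `HeadingConverter`, and `to_markdown` are hypothetical names invented for illustration, and the "conversion" is deliberately trivial.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParsedData:
    """Intermediate result handed from a parser to a converter (hypothetical)."""
    text: str
    images: list[str] = field(default_factory=list)


class PlainTextParser:
    """Hypothetical parser: reads a UTF-8 text file and extracts its text."""

    def parse(self, path: str) -> ParsedData:
        return ParsedData(text=Path(path).read_text(encoding="utf-8"))


class HeadingConverter:
    """Hypothetical converter: first non-empty line becomes a Markdown heading."""

    def convert(self, data: ParsedData) -> str:
        lines = [line.strip() for line in data.text.splitlines() if line.strip()]
        if not lines:
            return ""
        title, body = lines[0], lines[1:]
        # Separate the heading and each paragraph with a blank line.
        return "\n\n".join([f"# {title}", *body])


def to_markdown(path: str, parser, converter) -> str:
    """Run the two stages in order: parse first, then convert."""
    return converter.convert(parser.parse(path))
```

Keeping the stages separate means a parser for one format can, in principle, be paired with different converters without changing either side.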


Features

  • File parsing: Parses multiple file types and extracts both text and image data.
  • Format conversion: Converts the parsed data to Markdown.
  • Multiple parsers and converters: Parsers and converters support different engines and strategies.
  • Open source and flexible configuration: The source code is open, and the configuration options can be customized by the user.
  • API service: Provides an API service for easy integration into other applications.

 

Usage Guide

Installation process

  1. Create the environment::

       conda create -n e2m python=3.10
       conda activate e2m

  2. Update pip::

       pip install --upgrade pip

  3. Install E2M:
    • Install via git (recommended)::

        pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple

    • Install via pip::

        pip install --upgrade wisup_e2m

    • Manual installation::

        git clone https://github.com/wisupai/e2m.git
        cd e2m
        pip install poetry
        poetry build
        pip install dist/wisup_e2m-0.1.63-py3-none-any.whl
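
After installing, it can help to confirm that the package is visible in the active environment. The sketch below uses only the standard library; it assumes the distribution is registered under the name `wisup_e2m`, as the pip command above suggests, and the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "wisup_e2m"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"wisup_e2m {found} is installed")
    else:
        print("wisup_e2m is not installed in this environment")
```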

Usage

  1. Start the API service::

       gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

  2. View the API documentation: Open your browser and visit http://127.0.0.1:8000/docs to see the API documentation and usage examples.
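
Before sending files to the service, a script may want to check that it is actually up. The following is a small sketch using only the standard library; it assumes the service listens on 127.0.0.1:8000 and serves its documentation at `/docs`, as described above. The function name `docs_available` is hypothetical.

```python
import urllib.error
import urllib.request


def docs_available(base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/docs answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("API docs reachable" if docs_available() else "API service not reachable")
```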

Main workflow

  1. File parsing and conversion:
    • Parse the contents of the file using a parser::

        from wisup_e2m.parsers import PdfParser

        # Extract the content of the PDF
        parser = PdfParser()
        text_data = parser.parse('example.pdf')

    • Use a converter to convert the parsed content to Markdown::

        from wisup_e2m.converters import TextConverter

        # Convert the parsed content to Markdown
        converter = TextConverter()
        markdown_data = converter.convert(text_data)

  2. Customized configuration:
    • Edit the configuration file config.yaml to adjust the parser and converter parameters as needed::

        parsers:
          pdf:
            engine: 'unstructured'
        converters:
          text:
            engine: 'litellm'

  3. Integration into other applications:
    • Use the API service to integrate E2M into other applications by sending HTTP requests for file parsing and conversion::

        import requests

        # Upload the file to the running E2M API service
        with open('example.pdf', 'rb') as f:
            response = requests.post('http://127.0.0.1:8000/convert', files={'file': f})
        markdown_data = response.text
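
When many documents need converting, the single-file steps above can be wrapped in a loop. The sketch below is one possible way to do that and is not part of E2M: `convert_directory` and `SUPPORTED` are hypothetical names, and the actual conversion is passed in as a function, so it could wrap either the library calls or the HTTP request shown above.

```python
from pathlib import Path
from typing import Callable

# File extensions taken from the formats listed in the introduction.
SUPPORTED = {".doc", ".docx", ".epub", ".html", ".htm",
             ".pdf", ".ppt", ".pptx", ".mp3", ".m4a"}


def convert_directory(src_dir: str, convert_fn: Callable[[Path], str]) -> list[Path]:
    """Convert every supported file in src_dir and write the result beside it.

    convert_fn receives a file path and must return Markdown text.
    The output keeps the original name and appends ".md" (for example
    report.pdf -> report.pdf.md), so report.pdf and report.docx do not collide.
    """
    written = []
    # sorted() snapshots the listing first, so new .md files are not revisited.
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        out_path = path.with_name(path.name + ".md")
        out_path.write_text(convert_fn(path), encoding="utf-8")
        written.append(out_path)
    return written
```

Injecting `convert_fn` keeps the loop independent of how the conversion is performed, which also makes it easy to try out with a stand-in function before pointing it at a running service.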