
E2M: Convert multiple file formats to Markdown for easy document formatting unification

General Introduction

E2M (Everything to Markdown) is an open-source Python library that converts a wide range of file formats to Markdown. Supported inputs include doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. E2M uses a parser-converter architecture: each format has a dedicated parser and converter, where the Parser extracts text and images from the file and the Converter turns the extracted content into Markdown. The library offers flexible configuration options and is aimed at preparing data for retrieval-augmented generation (RAG) and for model training or fine-tuning. Its goal is to provide high-quality conversion that simplifies unifying document formats.
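
The two-stage design described above can be illustrated with a minimal sketch. Everything in it (``ParsedData``, ``PlainTextParser``, ``HeadingConverter``, ``to_markdown``) is a hypothetical stand-in written for this illustration, not E2M's actual classes; it only shows how a parser and a converter can be kept separate and composed.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class ParsedData:
    """Intermediate result handed from a parser to a converter (hypothetical type)."""
    text: str
    images: List[bytes] = field(default_factory=list)


class Parser(Protocol):
    def parse(self, source: str) -> ParsedData: ...


class Converter(Protocol):
    def convert(self, data: ParsedData) -> str: ...


class PlainTextParser:
    """Toy parser: treats the input string itself as the document content."""

    def parse(self, source: str) -> ParsedData:
        return ParsedData(text=source.strip())


class HeadingConverter:
    """Toy converter: the first line becomes a Markdown heading, the rest is body text."""

    def convert(self, data: ParsedData) -> str:
        title, _, body = data.text.partition("\n")
        body = body.strip()
        return f"# {title}\n\n{body}\n" if body else f"# {title}\n"


def to_markdown(source: str, parser: Parser, converter: Converter) -> str:
    """Run the two stages in order: parse first, then convert."""
    return converter.convert(parser.parse(source))
```

Because the two stages only meet at the intermediate data object, a parser for one format can be paired with any converter, which is presumably what makes per-format engines swappable.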


Features

  • File parsing: parses multiple file types, extracting both text and image data.
  • Format conversion: converts the parsed data into Markdown.
  • Multiple parsers and converters: parsers and converters support different engines and strategies.
  • Open source and configurable: the source code is open and the configuration options can be customized by the user.
  • API service: provides an API service for easy integration into other applications.
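
Since each supported format has its own parser, a caller first has to decide which format an input belongs to. The sketch below shows one way this could be done; ``detect_format`` and ``SUPPORTED`` are hypothetical names for this illustration, the format list is copied from the introduction, and the logic is an assumption rather than E2M's own dispatch code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File formats named in the introduction (URLs are handled separately below).
SUPPORTED = {"doc", "docx", "epub", "html", "htm", "pdf", "ppt", "pptx", "mp3", "m4a"}


def detect_format(source: str) -> str:
    """Return a format key for `source`, or raise ValueError if it is unsupported."""
    # Web addresses are treated as their own input type.
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    # Otherwise fall back to the file extension, ignoring case.
    ext = PurePosixPath(source).suffix.lower().lstrip(".")
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported format: {source!r}")
    return ext
```

The returned key could then be used to look up the matching parser, for example in a dictionary keyed by format.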

 

Usage Guide

Installation process

  1. Create the environment::

       conda create -n e2m python=3.10
       conda activate e2m

  2. Update pip::

       pip install --upgrade pip

  3. Install E2M using one of the following methods:

     • Install via git (recommended)::

         pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple

     • Install via pip::

         pip install --upgrade wisup_e2m

     • Manual installation::

         git clone https://github.com/wisupai/e2m.git
         cd e2m
         pip install poetry
         poetry build
         pip install dist/wisup_e2m-0.1.63-py3-none-any.whl
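
After installing, it can be useful to confirm that the package is visible in the active environment. The helper below is a small sketch using only the standard library; ``installed_version`` is a hypothetical name, and it assumes the distribution and the importable module are both called ``wisup_e2m``, as the pip command and import statements in this article suggest.

```python
from importlib import metadata, util


def installed_version(dist_name: str, module_name: str):
    """Return the installed version string, or None if the package is missing."""
    # find_spec returns None when a top-level module cannot be imported.
    if util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if E2M is installed in this environment, otherwise None.
print(installed_version("wisup_e2m", "wisup_e2m"))
```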

Usage

  1. Start the API service::

       gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

  2. View the API documentation: open your browser and visit http://127.0.0.1:8000/docs to see the API documentation and usage examples.
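
When the service is started from a script, the workers may need a moment before they accept requests. The following sketch polls the documentation URL mentioned above until it responds; ``docs_url`` and ``wait_until_up`` are hypothetical helpers, and the retry count and delay are arbitrary assumptions.

```python
import time
import urllib.request


def docs_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the address of the API documentation page."""
    return f"http://{host}:{port}/docs"


def wait_until_up(url: str, attempts: int = 10, delay: float = 0.5) -> bool:
    """Return True once `url` answers with HTTP 200, False if all attempts fail."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: the service is not ready yet.
            pass
        time.sleep(delay)
    return False
```

A typical use would be ``wait_until_up(docs_url())`` right after launching gunicorn, before sending the first conversion request.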

Main workflow

  1. File parsing and conversion:

     • Parse the contents of the file using a parser::

         from wisup_e2m.parsers import PdfParser

         parser = PdfParser()
         text_data = parser.parse('example.pdf')

     • Use a converter to convert the parsed content to Markdown::

         from wisup_e2m.converters import TextConverter

         converter = TextConverter()
         markdown_data = converter.convert(text_data)

  2. Customized configuration:

     • Edit the configuration file config.yaml and adjust the parser and converter parameters as needed::

         parsers:
           pdf:
             engine: 'unstructured'
         converters:
           text:
             engine: 'litellm'

  3. Integration into other applications:

     • Use the API service to integrate E2M into other applications by sending HTTP requests for file parsing and conversion::

         import requests

         # Open the file in a context manager so the handle is closed after the upload.
         with open('example.pdf', 'rb') as f:
             response = requests.post('http://127.0.0.1:8000/convert', files={'file': f})
         markdown_data = response.text
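
In an application, the HTTP call is usually wrapped in a function that sets a timeout and checks the response status. The sketch below does this; ``convert_via_api`` is a hypothetical helper, the ``/convert`` endpoint and the ``file`` field name are taken from the example above, and the response body is assumed to be the Markdown text. The ``post`` parameter exists only so that a stub can be substituted when no server is running.

```python
def convert_via_api(path, base_url="http://127.0.0.1:8000", post=None, timeout=120.0):
    """Upload `path` to the conversion endpoint and return the response body."""
    if post is None:
        # requests is a third-party package; import it lazily so a stub can be injected.
        import requests
        post = requests.post
    with open(path, "rb") as fh:
        response = post(f"{base_url}/convert", files={"file": fh}, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of silently returning an error page.
    response.raise_for_status()
    return response.text
```

With the service running, ``markdown = convert_via_api("example.pdf")`` would perform the same request as the snippet above, but fail loudly if the server returns an error.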

May not be reproduced without permission: Chief AI Sharing Circle » E2M: Convert multiple file formats to Markdown for easy document formatting unification
