E2M: Convert multiple file formats to Markdown for easy document formatting unification

Published by AI Sharing Circle in Latest AI Resources, 8 months ago

General Introduction

E2M (Everything to Markdown) is an open source Python library that converts a wide range of file formats to Markdown. Supported inputs include doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. E2M uses a parser-converter architecture: each format has a dedicated parser, which extracts text and images from the file, and a converter, which turns the extracted content into Markdown. Both stages are configurable, which makes the output suitable as input data for retrieval-augmented generation (RAG) and for model training or fine-tuning. The goal is high-quality conversion that simplifies unifying documents into a single format.
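
The parser/converter split described above can be sketched roughly as follows. This is a toy illustration, not E2M's actual code: `ParsedData`, `PlainTextParser`, `HeadingConverter`, `PIPELINES`, and `to_markdown` are hypothetical names, and the only "format" handled is plain text so that the example runs without extra dependencies.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParsedData:
    """Hypothetical container for what a parser extracts."""
    text: str
    images: list = field(default_factory=list)


class PlainTextParser:
    """Stage 1: extract raw content from a file (toy parser for .txt)."""

    def parse(self, path):
        return ParsedData(text=Path(path).read_text(encoding="utf-8"))


class HeadingConverter:
    """Stage 2: turn extracted text into Markdown (first line becomes a title)."""

    def convert(self, data):
        lines = data.text.strip().splitlines()
        if not lines:
            return ""
        title, body = lines[0], lines[1:]
        return "\n".join([f"# {title}", "", *body]).rstrip() + "\n"


# Hypothetical registry: one parser/converter pair per file extension.
PIPELINES = {".txt": (PlainTextParser(), HeadingConverter())}


def to_markdown(path):
    """Pick the pipeline for the file's extension, then parse and convert."""
    suffix = Path(path).suffix.lower()
    if suffix not in PIPELINES:
        raise ValueError(f"no parser registered for {suffix!r}")
    parser, converter = PIPELINES[suffix]
    return converter.convert(parser.parse(path))
```

Keeping extraction and Markdown generation in separate objects means either stage can be swapped (for example, a different engine for the same format) without touching the other.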

Features

- File parsing: Supports parsing of multiple file types, including text and image data.
- Format conversion: Converts the parsed data into Markdown format.
- Multiple parsers and converters: Parsers and converters support different engines and strategies.
- Open source and flexible configuration: The source code is open, and configuration options can be customized by the user.
- API service: Provides an API service for easy integration into other applications.

User Guide

Installation process

Creating the Environment::

   conda create -n e2m python=3.10
   conda activate e2m

Update pip::

   pip install --upgrade pip

Installation of E2M:

- Install via git (recommended)::

     pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple

- Installation via pip::

     pip install --upgrade wisup_e2m

- Manual installation::

     git clone https://github.com/wisupai/e2m.git
     cd e2m
     pip install poetry
     poetry build
     pip install dist/wisup_e2m-0.1.63-py3-none-any.whl
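
After any of the installation routes above, a quick way to confirm that the package landed in the active environment is to query its metadata. The sketch below uses only the standard library; the helper name `installed_version` is illustrative, and the distribution name `wisup_e2m` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(dist_name="wisup_e2m"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"wisup_e2m {version}" if version else "wisup_e2m is not installed")
```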

Usage

Starting the API service::

   gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

Access the API documentation: open your browser and visit http://127.0.0.1:8000/docs to view the API documentation and usage examples.
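
When the service is started from a script, it can be handy to check that the documentation page answers before sending real requests. The following is a minimal sketch using only the standard library; `docs_url` and `is_reachable` are hypothetical helper names, and the host and port simply mirror the `--bind` value used above.

```python
import urllib.request


def docs_url(host="127.0.0.1", port=8000):
    """Build the documentation URL for a given host and port."""
    return f"http://{host}:{port}/docs"


def is_reachable(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False


if __name__ == "__main__":
    print("API docs reachable:", is_reachable(docs_url()))
```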

Main workflow

File parsing and conversion:

Parse the contents of the file using a parser::

   from wisup_e2m.parsers import PdfParser

   parser = PdfParser()
   # Extract the content of the PDF
   text_data = parser.parse('example.pdf')

Use a converter to convert the parsed content to Markdown format::

   from wisup_e2m.converters import TextConverter

   converter = TextConverter()
   # Turn the parsed content into Markdown
   markdown_data = converter.convert(text_data)
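
The two steps above can be wrapped into one helper that also writes the result to disk. This is a sketch under assumptions: `convert_file` is a hypothetical name, it only relies on the `parse(path)` and `convert(data)` call shapes shown above, and it assumes the converter's return value can be turned into text with `str()`.

```python
from pathlib import Path


def convert_file(parser, converter, source, target=None):
    """Parse `source`, convert the result, and write it out as a .md file."""
    source = Path(source)
    # Default target: same name as the input, with a .md extension.
    target = Path(target) if target else source.with_suffix(".md")
    parsed = parser.parse(str(source))      # stage 1: extract content
    markdown = converter.convert(parsed)    # stage 2: produce Markdown
    target.write_text(str(markdown), encoding="utf-8")
    return target
```

With the classes from the snippets above, a call might look like `convert_file(PdfParser(), TextConverter(), 'example.pdf')`, which would write `example.md` next to the input.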

Customized Configuration:

- Modify the configuration file config.yaml, adjusting the parser and converter parameters as needed:

```
parsers:
  pdf:
    engine: 'unstructured'
converters:
  text:
    engine: 'litellm'
```
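
To illustrate how such a configuration might be consumed, the sketch below looks up an engine name with a fallback. It is an assumption about usage, not E2M's own loading code: `engine_for` is a hypothetical helper, and the dictionary is written out by hand to mirror the nested structure that a YAML loader such as PyYAML's `yaml.safe_load` would return.

```python
# Hand-written equivalent of the YAML configuration once it has been loaded.
config = {
    "parsers": {"pdf": {"engine": "unstructured"}},
    "converters": {"text": {"engine": "litellm"}},
}


def engine_for(config, section, name, default=None):
    """Look up config[section][name]['engine'], tolerating missing keys."""
    return config.get(section, {}).get(name, {}).get("engine", default)


pdf_engine = engine_for(config, "parsers", "pdf")
text_engine = engine_for(config, "converters", "text")
```
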

Integration into other applications:

- Integrate E2M into other applications by sending HTTP requests to the API service for file parsing and conversion::

     import requests

     # Upload the file to the running API service
     with open('example.pdf', 'rb') as f:
         response = requests.post('http://127.0.0.1:8000/convert', files={'file': f})
     markdown_data = response.text
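
For use inside a larger application, the request can be wrapped in a small function with a timeout and an explicit status check. This is a sketch under assumptions: `convert_via_api` is a hypothetical helper, the `/convert` endpoint and the `file` field name are taken from the example above, and the `post` parameter exists only so that the HTTP call can be replaced in tests.

```python
def convert_via_api(path, base_url="http://127.0.0.1:8000", post=None, timeout=60):
    """Upload a file to the conversion endpoint and return the response body."""
    if post is None:
        import requests  # imported lazily so a stub can be injected instead
        post = requests.post
    with open(path, "rb") as fh:
        response = post(f"{base_url}/convert", files={"file": fh}, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of treating an error page as Markdown.
    response.raise_for_status()
    return response.text
```

Calling `convert_via_api('example.pdf')` against a running service would return whatever text the endpoint sends back; the exact response format depends on the API and is not specified here.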