E2M: Convert Multiple File Formats to Markdown to Unify Document Formatting
General Introduction
E2M (Everything to Markdown) is an open source Python library that converts a wide range of file formats to Markdown. Supported types include doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. E2M uses a parser-converter architecture: each format has a dedicated parser and converter, where the parser extracts text and images from the file and the converter turns the extracted content into Markdown. The library offers flexible configuration options and is aimed at preparing data for retrieval-augmented generation (RAG) and for model training or fine-tuning. Its goal is to provide high-quality data conversion that simplifies unifying document formats.
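
The two-stage parse-then-convert idea can be illustrated with a small toy sketch. This is not E2M's own code: the names `ParsedData`, `ToyHtmlParser`, and `ToyMarkdownConverter` are hypothetical and use only the Python standard library, and the sketch handles just a few HTML tags to show how a parser's output feeds a converter.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ParsedData:
    """Hypothetical intermediate result passed from parser to converter."""
    blocks: list = field(default_factory=list)  # (tag, text) pairs
    images: list = field(default_factory=list)  # image sources found in the input


class ToyHtmlParser(HTMLParser):
    """Stage 1: extract headings, paragraphs, and image references."""

    def __init__(self):
        super().__init__()
        self.data = ParsedData()
        self._tag = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "p"):
            self._tag = tag
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.data.images.append(src)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, text):
        if self._tag and text.strip():
            self.data.blocks.append((self._tag, text.strip()))

    def parse(self, html):
        self.feed(html)
        self.close()
        return self.data


class ToyMarkdownConverter:
    """Stage 2: render the parsed blocks and images as Markdown."""
    PREFIX = {"h1": "# ", "h2": "## ", "p": ""}

    def convert(self, data):
        lines = [self.PREFIX[tag] + text for tag, text in data.blocks]
        lines += [f"![]({src})" for src in data.images]
        return "\n\n".join(lines)


html = "<h1>Title</h1><p>Hello world.</p><img src='fig.png'>"
parsed = ToyHtmlParser().parse(html)
markdown = ToyMarkdownConverter().convert(parsed)
print(markdown)
```

Keeping the intermediate result separate from both stages is what lets a different parser or converter engine be swapped in without touching the other stage.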

Function List
- File parsing: Parses multiple file types and extracts both text and image data.
- Format conversion: Converts the parsed data into Markdown.
- Multiple parsers and converters: Parsers and converters support different engines and strategies.
- Open source and flexible configuration: The source code is open, and the configuration options can be customized by the user.
- API service: Provides an API service for easy integration into other applications.
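
Because each format has its own parser, a caller first needs to know which format an input belongs to. The following is a minimal sketch of that step, assuming the format is decided by URL scheme or file extension; the function `detect_format` and the `SUPPORTED_EXTENSIONS` set are hypothetical helpers built from the format list above, not part of E2M.

```python
from pathlib import Path
from urllib.parse import urlparse

# File extensions taken from the supported-format list in this article.
SUPPORTED_EXTENSIONS = {
    "doc", "docx", "epub", "html", "htm", "pdf", "ppt", "pptx", "mp3", "m4a",
}


def detect_format(source: str) -> str:
    """Return the format key for a file path or URL (hypothetical helper)."""
    # Web addresses are treated as their own format, "url".
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    ext = Path(source).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported input: {source!r}")
    return ext


print(detect_format("report.PDF"))
print(detect_format("https://example.com/page"))
```

The returned key could then be used to look up the matching parser, for example in a dictionary that maps format keys to parser classes.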
Using Help
Installation process
- Create the environment:
   conda create -n e2m python=3.10
   conda activate e2m
- Update pip:
   pip install --upgrade pip
- Install E2M:
  - Install via git (recommended):
     pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple
  - Install via pip:
     pip install --upgrade wisup_e2m
  - Manual installation:
     git clone https://github.com/wisupai/e2m.git
     cd e2m
     pip install poetry
     poetry build
     pip install dist/wisup_e2m-0.1.63-py3-none-any.whl
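
After installing, it can be useful to confirm that the package is visible in the active environment. The sketch below uses the standard library's `importlib.metadata`; it assumes the installed distribution is registered under the same name as the pip package, `wisup_e2m`, and the helper `installed_version` is a hypothetical name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name matches the pip package name used above.
e2m_version = installed_version("wisup_e2m")
print(e2m_version or "wisup_e2m is not installed in this environment")
```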
Usage
- Start the API service:
   gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
- Access the API documentation: Open your browser and visit http://127.0.0.1:8000/docs to view the API documentation and usage examples.
Main function operation flow
- File parsing and conversion:
  - Parse the contents of the file using a parser:
     from wisup_e2m.parsers import PdfParser
     parser = PdfParser()
     text_data = parser.parse('example.pdf')
  - Use a converter to convert the parsed content to Markdown:
     from wisup_e2m.converters import TextConverter
     converter = TextConverter()
     markdown_data = converter.convert(text_data)
- Custom configuration:
  - Modify the configuration file config.yaml and adjust the parser and converter parameters as needed:
     parsers:
       pdf:
         engine: 'unstructured'
     converters:
       text:
         engine: 'litellm'
- Integration into other applications:
  - Use the API service to integrate E2M into other applications by sending HTTP requests for file parsing and conversion:
     import requests
     # Open the file in a context manager so the handle is closed after the upload
     with open('example.pdf', 'rb') as f:
         response = requests.post('http://127.0.0.1:8000/convert', files={'file': f})
     markdown_data = response.text
© Copyright notice
Copyright of this article belongs to AI Sharing Circle. Please do not reproduce without permission.