General Introduction
E2M (Everything to Markdown) is an open source Python library that converts many file formats to Markdown. Supported inputs include doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. E2M uses a parser-converter architecture and offers flexible configuration options for preparing data for retrieval-augmented generation (RAG) and for model training or fine-tuning. Its goal is to provide high-quality data conversion that simplifies unifying document formats. Each format has a dedicated parser and converter: the parser extracts text and images from the file, and the converter turns the extracted content into Markdown.
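The parser and converter stages described above can be illustrated with a minimal sketch. The names `ParsedData`, `PlainTextParser`, and `HeadingConverter` are hypothetical and are not part of E2M's API; they only show how a parsing stage that extracts content can feed a separate stage that emits Markdown.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedData:
    """Hypothetical container for what a parser extracts from a file."""
    text: str
    images: list[str] = field(default_factory=list)


class PlainTextParser:
    """Hypothetical parser: extracts raw text (no images) from a string."""

    def parse(self, raw: str) -> ParsedData:
        return ParsedData(text=raw.strip())


class HeadingConverter:
    """Hypothetical converter: renders the first line as a Markdown title."""

    def convert(self, data: ParsedData) -> str:
        lines = data.text.splitlines()
        if not lines:
            return ""
        title, body = lines[0], lines[1:]
        return "\n".join([f"# {title}", "", *body]).rstrip() + "\n"


def to_markdown(raw: str) -> str:
    # Stage 1: parse. Stage 2: convert. Each stage can be swapped independently.
    parsed = PlainTextParser().parse(raw)
    return HeadingConverter().convert(parsed)
```

Keeping the two stages separate means a different parser (for example, one per file format) can reuse the same converter, which is the design the paragraph above describes.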
Function List
- File parsing: Parses multiple file types and extracts both text and image data.
- Format conversion: Converts the parsed data into Markdown.
- Multiple parsers and converters: Parsers and converters that support different engines and strategies.
- Open source and flexible configuration: Provides open source code and flexible configuration options that can be customized by the user.
- API Services: Provides API services for easy integration into other applications.
Using Help
Installation process
- Create the environment:
conda create -n e2m python=3.10
conda activate e2m
- Update pip:
pip install --upgrade pip
- Install E2M:
- Install via git (recommended):
pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple
- Install via pip:
pip install --upgrade wisup_e2m
- Manual installation:
git clone https://github.com/wisupai/e2m.git
cd e2m
pip install poetry
poetry build
pip install dist/wisup_e2m-0.1.63-py3-none-any.whl
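After any of the installation routes above, it can be useful to confirm that the package is visible to the active environment. The following is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `wisup_e2m` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "wisup_e2m") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"wisup_e2m version: {version}" if version else "wisup_e2m is not installed")
```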
Usage
- Start the API service:
gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
- Access the API documentation: open your browser and visit
http://127.0.0.1:8000/docs
to view the API documentation and usage examples.
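Before sending requests from a script, one might check that the service started above is actually reachable. The sketch below uses only the standard library; the helper names `docs_url` and `service_is_up` are hypothetical, and the default host and port simply mirror the address given above.

```python
import urllib.error
import urllib.request


def docs_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the URL of the API documentation page."""
    return f"http://{host}:{port}/docs"


def service_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    url = docs_url()
    print(f"{url} reachable: {service_is_up(url)}")
```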
Main function operation flow
- File parsing and conversion:
- Parse the contents of the file with a parser:
from wisup_e2m.parsers import PdfParser
parser = PdfParser()
text_data = parser.parse('example.pdf')
- Convert the parsed content to Markdown with a converter:
from wisup_e2m.converters import TextConverter
converter = TextConverter()
markdown_data = converter.convert(text_data)
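The two steps above can be chained over a whole folder of files. The following is a sketch under stated assumptions: `convert_directory` is a hypothetical helper, not an E2M function, and it takes the parse and convert steps as plain callables so that it does not depend on any particular E2M signature. With E2M installed, one would presumably pass `PdfParser().parse` and `TextConverter().convert`, assuming those calls behave as in the snippets above.

```python
from pathlib import Path
from typing import Any, Callable


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    parse: Callable[[str], Any],
    convert: Callable[[Any], str],
    pattern: str = "*.pdf",
) -> list[Path]:
    """Parse and convert every matching file, writing one .md file per input."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for source in sorted(src_dir.glob(pattern)):
        parsed = parse(str(source))   # step 1: extract content from the file
        markdown = convert(parsed)    # step 2: turn the content into Markdown
        target = out_dir / f"{source.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```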
- Custom configuration:
- Edit the configuration file config.yaml and adjust the parser and converter parameters as needed:
parsers:
  pdf:
    engine: 'unstructured'
converters:
  engine: 'unstructured'
  text:
    engine: 'litellm'
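To make the configuration keys above concrete, the sketch below looks up a nested setting by its dotted path. The helper `get_setting` is hypothetical, and the dictionary only mirrors the keys listed above; how E2M itself reads config.yaml is not specified here, so treat this as an illustration rather than the library's behavior. In practice the dictionary would come from a YAML loader such as PyYAML's `yaml.safe_load`.

```python
from typing import Any


def get_setting(config: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk a nested dict along a dotted path such as 'parsers.pdf.engine'."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Mirrors the keys shown in the configuration example above (assumed layout).
config = {
    "parsers": {"pdf": {"engine": "unstructured"}},
    "converters": {"engine": "unstructured", "text": {"engine": "litellm"}},
}

pdf_engine = get_setting(config, "parsers.pdf.engine")
text_engine = get_setting(config, "converters.text.engine")
```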
- Integration into other applications:
- Use the API service to integrate E2M into other applications by sending HTTP requests for file parsing and conversion:
import requests

# Upload the file to the service; the context manager closes it afterwards.
with open('example.pdf', 'rb') as f:
    response = requests.post('http://127.0.0.1:8000/convert', files={'file': f})
markdown_data = response.text
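For use inside a larger application, the request above can be wrapped in a function with a timeout and basic error handling. This is a sketch, not an official client: `convert_via_api` is a hypothetical name, the `/convert` path and the `file` field are taken from the example above, and the `post` parameter exists only so that the HTTP call can be replaced in tests.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_via_api(
    file_path: str,
    base_url: str = "http://127.0.0.1:8000",
    post: Optional[Callable] = None,
    timeout: float = 60.0,
) -> str:
    """Upload a file to the conversion endpoint and return the response body."""
    if post is None:
        import requests  # third-party; imported lazily so a fake can be injected
        post = requests.post
    path = Path(file_path)
    with path.open("rb") as handle:
        response = post(
            f"{base_url}/convert",
            files={"file": (path.name, handle)},
            timeout=timeout,
        )
    # Raise an exception on 4xx/5xx instead of returning an error page as Markdown.
    response.raise_for_status()
    return response.text
```

A caller would typically write `markdown_data = convert_via_api('example.pdf')` and catch `requests.RequestException` around it.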