General Introduction
Morphik Core is an open source project developed by the morphik-org team and hosted on GitHub. It was previously called DataBridge Core and has since been renamed Morphik Core. The tool is a database designed for AI applications that can handle a variety of data such as text, images, PDFs, and videos. It provides RAG (Retrieval Augmented Generation) features to help users retrieve and generate information quickly. Morphik Core supports large-scale data processing and can manage millions of documents while keeping retrieval fast. Whether you want to try out a new idea or build a production environment, it is intended to support both. The project is currently in development, and the team plans to launch a hosted service for which users can join a waiting list.
Function List
- Multimodal data support: handles text, PDF, images, video, and other formats.
- Intelligent file parsing: automatically breaks files into smaller chunks and generates embeddings.
- ColPali multimodal embedding: combines text and image content for efficient retrieval.
- Knowledge graph support: automatically extracts entities and relationships to enhance search results.
- Natural language rules: define rules over unstructured data to extract structured information.
- Efficient caching: pre-processes data to reduce computational costs and speed up responses.
- Extensible architecture: supports custom parsers and multiple storage backends.
- MCP (Model Context Protocol) support: facilitates knowledge sharing with AI systems.
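The file-parsing feature above (breaking files into smaller chunks before embedding) can be pictured with a minimal sketch. This is not Morphik Core's actual parser: the function name `chunk_text`, the character-based window, and the default sizes are all illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which is a common reason to chunk this way; real parsers often split on sentences or layout elements instead of raw characters.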
Usage Guide
Morphik Core is a developer tool; the code is obtained and used mainly through GitHub. Below is a detailed installation and operation guide to help you get started quickly.
Installation process
To get started with Morphik Core, you need to download the code from GitHub and configure your environment. The steps are as follows:
- Clone the repository
Enter the command in the terminal to download the project:
git clone https://github.com/morphik-org/morphik-core.git
Then go to the project directory:
cd morphik-core
- Create a virtual environment
Create an isolated environment with Python 3.12 to avoid dependency conflicts:
python3.12 -m venv .venv
Activate the environment:
- Linux/macOS:
source .venv/bin/activate
- Windows:
.venv\Scripts\activate
- Install dependencies
Use the project's requirements.txt file to install the required packages:
pip install -r requirements.txt
If the file is missing, check the GitHub README for the latest dependencies.
- Start the service
Configure and run the server:
python quick_setup.py
python start_server.py
Once this completes, the service will be running at localhost:8000.
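After the start-up commands finish, it can be useful to confirm that something is accepting connections on port 8000. The following is a minimal sketch using only the Python standard library. The function name `is_server_listening` is illustrative, and the check only confirms that the TCP port is open, not that the process behind it is Morphik Core.

```python
import socket


def is_server_listening(host: str = "localhost", port: int = 8000,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

Calling `is_server_listening()` with no arguments checks the default address mentioned above.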
Main Functions
At the core of Morphik Core is the ability to process multimodal data and provide RAG functionality. The main operations are described below:
1. Importing data
You can import text or files using the Python SDK. For example, import a piece of text:
from databridge import DataBridge
db = DataBridge("databridge://localhost:8000")
doc = db.ingest_text("This is a sample document about AI technology.", metadata={"category": "tech"})
- What it does: After connecting to the server, imports the text and attaches metadata.
- Result: The text is processed and stored for retrieval.
Import PDF files:
doc = db.ingest_file("path/to/document.pdf", metadata={"category": "research"})
- Function: Supports PDF, video, and other formats, with automatic content parsing.
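To make "stored for retrieval" more concrete: RAG systems generally rank stored chunks by vector similarity to the query and return the top `k`. The sketch below shows that idea with a toy bag-of-words vector in place of a real embedding model. It is not Morphik Core's implementation; `embed`, `cosine`, and `top_k` are hypothetical helpers written for this illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would use a neural model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "AI models need GPUs",
    "Cooking pasta takes ten minutes",
    "Cloud computing hosts AI workloads",
]
best = top_k("AI workloads", chunks, k=1)
```

Here `best` holds the cloud computing chunk, since it shares both query words while the first chunk shares only one.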
2. Multimodal search (ColPali)
Morphik Core uses ColPali to process documents containing images. Example:
doc = db.ingest_file("report_with_charts.pdf", use_colpali=True)
chunks = db.retrieve_chunks("Show the second-quarter revenue chart", use_colpali=True, k=3)
- Steps: Enable ColPali when importing the file; retrieval then returns both text and images.
- Result: You can find the content of a chart or picture directly.
3. Setting the rules
Rules can be defined in natural language to extract information:
rules = [
{"type": "metadata_extraction", "schema": {"title": "string", "author": "string"}},
{"type": "natural_language", "prompt": "Remove all personal information"}
]
doc = db.ingest_file("document.pdf", rules=rules)
- Purpose: Extract titles and authors from files, or clean up data on demand.
- Tip: Adapt the rules to the content of the document.
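The `metadata_extraction` rule above declares a schema of field names and types. How Morphik Core applies or validates that schema is not described here. As a separate illustration, the sketch below shows one way a caller might check extracted metadata against such a schema; `check_metadata` and `TYPE_MAP` are hypothetical names, and the supported type names are an assumption.

```python
# Hypothetical mapping from schema type names to Python types.
TYPE_MAP = {"string": str, "number": (int, float)}


def check_metadata(metadata: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata fits."""
    problems = []
    for field, type_name in schema.items():
        expected = TYPE_MAP.get(type_name)
        if expected is None:
            problems.append(f"unknown type in schema: {type_name}")
        elif field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems


schema = {"title": "string", "author": "string"}
```

A check like this is useful after ingestion, because rules driven by a language model may occasionally omit a field.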
4. Knowledge graphs
Create and use knowledge graphs to enhance retrieval:
db.create_graph("tech_graph", filters={"category": "tech"})
response = db.query("How is AI related to cloud computing?", graph_name="tech_graph", hop_depth=2)
- Steps: After the graph is created, queries return related information from it.
- Advantage: Results are more precise, which suits complex questions.
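The `hop_depth` parameter suggests that a query can follow relationships several steps away from the entities it mentions. The sketch below illustrates that idea with a breadth-first traversal over a small hand-written graph. Whether Morphik Core traverses its graph this way is an assumption; `neighbors_within` and the example entities are made up for illustration.

```python
from collections import deque


def neighbors_within(graph: dict[str, set[str]], start: str,
                     hop_depth: int) -> set[str]:
    """Return entities reachable from start in at most hop_depth hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hop_depth:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)  # report only the related entities
    return seen


# Toy entity graph: each key points to directly related entities.
graph = {
    "AI": {"machine learning", "GPU"},
    "GPU": {"cloud computing"},
    "cloud computing": {"data center"},
}
related = neighbors_within(graph, "AI", hop_depth=2)
```

With a depth of 2, "cloud computing" is reached through "GPU", while "data center" (three hops away) is left out.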
5. Batch processing
Supports batch import of files in folders:
docs = db.ingest_directory("data/documents", recursive=True, pattern="*.pdf")
- Function: Recursively scans the directory and imports all PDFs.
- Use case: Suitable for processing large amounts of data.
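Before a large batch import, it can help to preview which files a recursive pattern would match. The sketch below does this with the standard library's `pathlib`. It mirrors the `recursive` and `pattern` arguments shown above but is independent of the SDK; `find_files` is a hypothetical helper, and the SDK's own matching rules may differ.

```python
from pathlib import Path


def find_files(root: str, pattern: str = "*.pdf",
               recursive: bool = True) -> list[Path]:
    """List files under root matching pattern, optionally descending into subfolders."""
    base = Path(root)
    # rglob walks all subdirectories; glob looks only at the top level
    matches = base.rglob(pattern) if recursive else base.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```

For example, `find_files("data/documents")` returns a sorted list of every PDF below that folder, which can be counted or inspected before ingestion.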
Featured Functions
The highlights of Morphik Core are multimodal support and efficiency. Here is a detailed description:
ColPali multimodal embedding
ColPali lets text and images work together. For example:
db.ingest_file("report.pdf", use_colpali=True)
chunks = db.retrieve_chunks("Find the 2024 sales data chart", use_colpali=True)
- Result: Returns not only text but also matching charts.
- Use case: Analyzing documents that contain visual content.
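ColPali ranks pages with a late-interaction score: each query-token embedding is matched to its most similar page-patch embedding, and those maxima are summed. The sketch below shows that scoring rule with toy two-dimensional vectors. It is not Morphik Core's code; `maxsim_score` is an illustrative name, and real embeddings come from the ColPali model and have far more dimensions.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: list[list[float]],
                 page_vecs: list[list[float]]) -> float:
    """Sum, over query tokens, of the best-matching page patch similarity."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data: two query-token vectors and two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]   # has a strong match for both tokens
page_b = [[1.0, 0.0], [0.5, 0.5]]   # matches the second token only weakly
score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
```

Because every query token looks for its own best patch, a page that contains a relevant chart region can score well even if the rest of the page is unrelated.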
Efficient Caching
Preprocess data for faster retrieval:
db.cache_documents(filters={"category": "research"})
chunks = db.retrieve_chunks("Latest advances in AI", k=5)
- Benefit: Shorter response times, with computational costs reduced by 80%.
- Note: The cache takes up space and should be cleaned regularly.
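The trade-off described above (less recomputation in exchange for storage that needs occasional clearing) is the general compute-once-and-reuse pattern. The sketch below shows only that pattern. How Morphik Core implements its cache is not described here, and `ComputeOnceCache` and `toy_embed` are hypothetical names.

```python
class ComputeOnceCache:
    """Memoize an expensive per-text computation (illustrative only)."""

    def __init__(self, compute):
        self._compute = compute   # the expensive function to wrap
        self._store = {}          # text -> cached result
        self.misses = 0           # how many times compute actually ran

    def get(self, text):
        if text not in self._store:
            self.misses += 1
            self._store[text] = self._compute(text)
        return self._store[text]

    def clear(self):
        """Free the space held by cached results."""
        self._store.clear()


def toy_embed(text):
    # Stand-in for a costly embedding call; not a real embedding.
    return [len(text), sum(ord(ch) for ch in text) % 97]


cache = ComputeOnceCache(toy_embed)
first = cache.get("latest advances in AI")
second = cache.get("latest advances in AI")  # served from the cache
```

After the two calls, `cache.misses` is 1, because the second lookup reuses the stored result instead of recomputing it.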
Scalability
Connect to databases and process large-scale data:
db.connect_storage("postgresql://user:password@localhost:5432/dbname")
docs = db.ingest_directory("large_data")
- Support: Manages millions of documents with PostgreSQL or MongoDB.
- Speed: Retrieval times remain on the order of seconds.
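The connection string passed to `connect_storage` above follows the usual URI layout of scheme, credentials, host, port, and database name. As a small aid for checking such a string before use, the sketch below splits it with the standard library's `urllib.parse`. `describe_connection` is a hypothetical helper and is not part of the SDK.

```python
from urllib.parse import urlsplit


def describe_connection(uri: str) -> dict:
    """Break a database URI into its parts, leaving the password out."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_connection("postgresql://user:password@localhost:5432/dbname")
```

The password is deliberately omitted from the result so that the summary can be logged safely.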
Caveats
- Before using it for the first time, read the README.md on GitHub and the official documentation.
- Make sure that the Python version is 3.12 and that dependencies are installed correctly.
- Questions can be asked on Discord (https://discord.gg/BwMtv3Zaju) or submitted as issues on GitHub.
With these steps, you can easily install and use Morphik Core to handle a variety of data needs.
Application Scenarios
- Research Paper Management
Researchers import paper PDFs, extract titles and abstracts using rules, generate a knowledge graph, and quickly find related research.
- Enterprise Data Analytics
Companies process reports and contracts, retrieve charts and text with ColPali, and cache data for efficiency.
- Organization of Educational Resources
Teachers import textbooks and videos and set rules to extract key points, so that students can query course content.
FAQ
- Does Morphik Core charge a fee?
No. It is an open source project released under the MIT license and is free to use.
- Does it need a server?
Yes. Self-hosting requires a locally run server; cloud hosting options are planned for the future.
- Does it support video?
Yes. It can parse video and extract text and other content.