
Morphik Core: an open source RAG platform for processing multimodal data

General Introduction

Morphik Core is an open source project developed by the morphik-org team and hosted on GitHub. It was previously called DataBridge Core and has since been renamed Morphik Core. The tool is a database designed for AI applications that can handle a variety of data such as text, images, PDFs, and videos. It provides RAG (Retrieval Augmented Generation) features that help users retrieve relevant content and generate answers from it. Morphik Core supports large-scale data processing and can manage millions of documents while keeping retrieval fast. It can be used both for trying out a new idea and for building a production environment. The project is still under development, and the team plans to launch a hosted service that users can join a waiting list for.
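
To make the retrieve-then-generate idea concrete, here is a minimal sketch of a RAG flow. It is not Morphik Core's implementation: it assumes a toy word-overlap score in place of real embeddings, and the names `overlap_score`, `retrieve`, and `build_prompt` are hypothetical.

```python
def overlap_score(query: str, passage: str) -> int:
    """Count the distinct query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages that share the most words with the query."""
    return sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)[:k]


def build_prompt(query: str, passages: list[str], k: int = 2) -> str:
    """Place the retrieved passages ahead of the question for a generator model."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Toy corpus (made up for illustration).
passages = [
    "ColPali embeds page images for retrieval",
    "PostgreSQL stores document metadata",
    "Knowledge graphs link entities and relationships",
]
question = "how are page images embedded for retrieval"
print(build_prompt(question, passages, k=1))
```

In a real system the prompt would be sent to a language model; the sketch stops at prompt assembly because the generation step depends on the model in use.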



 

Feature List

  • Multimodal data support: handles text, PDF, images, video, and other formats.
  • Intelligent file parsing: automatically breaks files into smaller chunks and generates embeddings.
  • ColPali multimodal embedding: combines text and image content for efficient retrieval.
  • Knowledge graph support: automatically extracts entities and relationships to enhance search results.
  • Natural language rules: define rules that extract structured information from messy data.
  • Efficient caching: preprocesses data to reduce computational cost and speed up responses.
  • Extensible architecture: supports custom parsers and multiple storage backends.
  • MCP (Model Context Protocol) support: facilitates knowledge sharing with AI systems.
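
The chunking step mentioned in the list above can be pictured with a minimal sketch. It assumes fixed-size character windows with overlap; `chunk_text`, `size`, and `overlap` are hypothetical names, and Morphik Core's actual parser may split on different boundaries.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample_chunks = chunk_text("abcdefghij", size=4, overlap=2)
print(sample_chunks)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.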

 

Usage Guide

Morphik Core is a tool for developers, and the code is obtained and used mainly through GitHub. Below is a detailed installation and operation guide to help you get started quickly.

Installation process

To get started with Morphik Core, you need to download the code from GitHub and configure your environment. The steps are as follows:

  1. Clone the repository
    Enter this command in the terminal to download the project:
git clone https://github.com/morphik-org/morphik-core.git

Then go to the project directory:

cd morphik-core
  2. Create a virtual environment
    Create an isolated environment with Python 3.12 to avoid dependency conflicts:
python3.12 -m venv .venv

Activate the environment:

  • Linux/macOS:
    source .venv/bin/activate
    
  • Windows:
    .venv\Scripts\activate
    
  3. Install dependencies
    Use the project's requirements.txt file to install the required packages:
pip install -r requirements.txt

If the file is missing, check the GitHub README for the latest dependencies.

  4. Start the service
    Configure and run the server:
python quick_setup.py
python start_server.py

Once both commands complete, the service will be running at localhost:8000.
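
A quick way to confirm that the server came up is to check whether anything accepts TCP connections on that port. The sketch below uses only the Python standard library; `server_is_listening` is a hypothetical helper, and it checks only that the port is open, not that the process behind it is Morphik Core.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 8000,
                        timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("server reachable:", server_is_listening())
```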

Main Functions

At the core of Morphik Core is the ability to process multimodal data and provide RAG functionality. Here is how to use it:

1. Importing data

You can import text or files using the Python SDK. For example, import a piece of text:

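# The import below uses the SDK's earlier DataBridge naming; after the rename to
# Morphik Core the package and class names may differ, so check the README.
from databridge import DataBridge
db = DataBridge("databridge://localhost:8000")
doc = db.ingest_text("This is a sample document about AI technology.", metadata={"category": "tech"})
  • What it does: After connecting to the server, imports the text and attaches metadata.
  • Result: The text is processed and stored for retrieval.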

Import PDF files:

doc = db.ingest_file("path/to/document.pdf", metadata={"category": "research"})
  • Supported formats: PDF, video, and other formats, with automatic content parsing.
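
The metadata attached at import time is what later calls select on, for example `filters={"category": "tech"}`. As a rough sketch of how such a filter could work, the code below assumes exact-match semantics on every key; the `matches` helper and the sample records are hypothetical, and the server's real filter logic may be richer.

```python
def matches(metadata: dict, filters: dict) -> bool:
    """A document matches when every filter key is present with an equal value."""
    return all(key in metadata and metadata[key] == value
               for key, value in filters.items())


# Made-up records standing in for imported documents.
docs = [
    {"id": "a", "metadata": {"category": "tech"}},
    {"id": "b", "metadata": {"category": "research"}},
    {"id": "c", "metadata": {"category": "tech", "year": 2024}},
]

tech_ids = [d["id"] for d in docs if matches(d["metadata"], {"category": "tech"})]
print(tech_ids)
```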

2. Multimodal search (ColPali)

Morphik Core uses ColPali to process documents containing images. Example:

doc = db.ingest_file("report_with_charts.pdf", use_colpali=True)
chunks = db.retrieve_chunks("Show the second-quarter revenue chart", use_colpali=True, k=3)
  • Steps: Enable ColPali when importing the file; retrieval then returns both text and images.
  • Effect: You can find the content of a chart or picture directly.
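
ColPali is generally described as a multi-vector model that scores a query against a page using ColBERT-style late interaction: each query vector is matched to its most similar page-patch vector, and the maxima are summed. The sketch below shows that scoring rule with tiny hand-made 2-D vectors. The vectors and the `maxsim_score` name are illustrative assumptions, not output from the real model or part of the SDK.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query vector take its best-matching
    document vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Made-up embeddings: two query tokens, two patches per page.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]
page_b = [[0.5, 0.5], [0.4, 0.4]]

# Rank pages from highest to lowest score.
ranked = sorted({"page_a": page_a, "page_b": page_b}.items(),
                key=lambda kv: maxsim_score(query, kv[1]), reverse=True)
print([name for name, _ in ranked])
```

Because every query vector picks its own best patch, a page can score well when different parts of the query match different regions, such as a chart title and an axis label.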

3. Setting the rules

Rules can be defined in natural language to extract information:

rules = [
    {"type": "metadata_extraction", "schema": {"title": "string", "author": "string"}},
    {"type": "natural_language", "prompt": "Remove all personal information"}
]
doc = db.ingest_file("document.pdf", rules=rules)
  • Purpose: Extract the title and author from a file, or clean up data on demand.
  • Tip: Adapt the rules to the content of the document.
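
When a metadata_extraction rule runs, it is worth checking that what came back actually fits the schema you asked for. The sketch below is one way to do that on the client side. It assumes the schema maps field names to simple type names such as "string"; `TYPE_MAP` and `schema_problems` are hypothetical and not part of the SDK.

```python
# Assumed mapping from schema type names to Python types (illustrative only).
TYPE_MAP = {"string": str, "number": (int, float)}


def schema_problems(extracted: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in extracted:
            problems.append(f"missing field: {field}")
        elif not isinstance(extracted[field], TYPE_MAP[type_name]):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems


schema = {"title": "string", "author": "string"}
print(schema_problems({"title": "A Survey of RAG", "author": "J. Doe"}, schema))
print(schema_problems({"title": 3}, schema))
```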

4. Knowledge graphs

Create and use knowledge graphs to enhance retrieval:

db.create_graph("tech_graph", filters={"category": "tech"})
response = db.query("How is AI related to cloud computing?", graph_name="tech_graph", hop_depth=2)
  • How it works: After the graph is created, queries return related information from it.
  • Advantage: Results are more precise, which suits complex questions.
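
The `hop_depth` parameter suggests a bounded traversal outward from the entities found in the query. A minimal sketch of that idea is a breadth-first search that stops expanding at the hop limit. The toy graph and the `neighbors_within` function are assumptions for illustration, not the library's internals.

```python
from collections import deque


def neighbors_within(graph: dict[str, set[str]], start: str,
                     hop_depth: int) -> dict[str, int]:
    """Breadth-first search returning each reachable entity and its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == hop_depth:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    return distances


# Made-up entity graph (undirected, stored as adjacency sets).
graph = {
    "AI": {"machine learning", "GPU"},
    "machine learning": {"AI", "cloud computing"},
    "GPU": {"AI", "cloud computing"},
    "cloud computing": {"machine learning", "GPU", "data center"},
    "data center": {"cloud computing"},
}
related = neighbors_within(graph, "AI", hop_depth=2)
print(sorted(related.items(), key=lambda kv: kv[1]))
```

With a depth of 2, "cloud computing" is reached through an intermediate entity while "data center", three hops away, is left out; a larger depth widens the context at the cost of more loosely related material.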

5. Batch processing

Supports batch import of files in folders:

docs = db.ingest_directory("data/documents", recursive=True, pattern="*.pdf")
  • What it does: Recursively scans the directory and imports all PDFs.
  • Use case: Suitable for processing large amounts of data.
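
Before a large batch import, it can help to preview which files a given `recursive` and `pattern` combination would pick up. The sketch below does this with the standard library's pathlib; `find_files` is a hypothetical helper that mirrors the parameter names above, and the SDK's own matching rules may differ.

```python
from pathlib import Path


def find_files(root: str, pattern: str = "*.pdf", recursive: bool = True) -> list[Path]:
    """List files under root matching pattern, optionally descending into subfolders."""
    base = Path(root)
    found = base.rglob(pattern) if recursive else base.glob(pattern)
    return sorted(p for p in found if p.is_file())


if __name__ == "__main__":
    # "data/documents" is the example folder name used in the text above.
    for path in find_files("data/documents"):
        print(path)
```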

Featured Functions

The highlights of Morphik Core are multimodal support and efficiency. Here is a detailed description:

ColPali multimodal embedding

ColPali lets text and images work together. For example:

db.ingest_file("report.pdf", use_colpali=True)
chunks = db.retrieve_chunks("Find the 2024 sales data chart", use_colpali=True)
  • Effect: Returns not only text but also matching charts.
  • Use case: Analyzing documents that contain visual content.

Efficient Caching

Preprocess data for faster retrieval:

db.cache_documents(filters={"category": "research"})
chunks = db.retrieve_chunks("Latest advances in AI", k=5)
  • Benefit: Shorter response times and a stated 80% reduction in computational cost.
  • Note: The cache takes up space and should be cleaned regularly.
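
The trade-off between saved computation and occupied space can be seen in a generic memoization sketch. This is not how Morphik Core's cache is built; `ResultCache` is a hypothetical class that stores the result of an expensive per-text computation under a hash of the text and counts hits and misses.

```python
import hashlib


class ResultCache:
    """Memoize an expensive per-text computation, keyed by a hash of the text."""

    def __init__(self, compute):
        self._compute = compute
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute(text)  # pay the cost only once
        return self._store[key]

    def clear(self) -> None:
        """Free the space the cache occupies."""
        self._store.clear()


# The lambda is a stand-in for a real, costly model call.
cache = ResultCache(lambda text: [float(len(text))])
cache.get("research notes")
cache.get("research notes")
print(cache.hits, cache.misses)
```

Calling `clear()` periodically corresponds to the cleanup note above: it releases memory, and the next request for each text pays the full cost again.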

Scalability

Connect to databases and process large-scale data:

db.connect_storage("postgresql://user:password@localhost:5432/dbname")
docs = db.ingest_directory("large_data")
  • Supports: Managing millions of documents with PostgreSQL or MongoDB.
  • Speed: Retrieval times stay on the order of seconds.
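
The storage connection string packs several settings into one URI. To show which part is which, the sketch below splits the example URI with the standard library's `urllib.parse.urlsplit`. `describe_storage_uri` is a hypothetical helper, and it leaves the password out of its result so that it is safe to log.

```python
from urllib.parse import urlsplit


def describe_storage_uri(uri: str) -> dict:
    """Break a database URI into its named parts, omitting the password."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_storage_uri("postgresql://user:password@localhost:5432/dbname")
print(info)
```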

Notes

  • Before using it for the first time, read the README.md on GitHub and the official documentation.
  • Make sure that the Python version is 3.12 and that dependencies are installed correctly.
  • Questions can be raised on Discord (https://discord.gg/BwMtv3Zaju) or submitted as issues on GitHub.

With these steps, you can easily install and use Morphik Core to handle a variety of data needs.

 

Application Scenarios

  1. Research Paper Management
    A researcher imports paper PDFs, extracts titles and abstracts using rules, generates a knowledge graph, and quickly finds related research.
  2. Enterprise Data Analytics
    The company processes reports and contracts, retrieves charts and text with ColPali, and caches data for efficiency.
  3. Organization of educational resources
    Teachers import textbooks and videos, set rules to extract key points, and students can query course content.

 

Q&A

  1. Does Morphik Core charge a fee?
    No. It is an open source project under the MIT license and is free to use.
  2. Do I need a server?
    Yes. Self-hosting requires running a server locally; a cloud-hosted option is planned for the future.
  3. Does it support video?
    Yes. It parses video and extracts text and other content.
May not be reproduced without permission: Chief AI Sharing Circle » Morphik Core: an open source RAG platform for processing multimodal data