DeepSeek Multimodal Large Model Janus-Pro One-Click Installation Package with Tutorials

AI hands-on tutorials | Released 6 months ago | AI Sharing Circle

Today we introduce a powerful open-source multimodal model: Janus-Pro, the latest version in DeepSeek's Janus series. It can not only read images and answer questions about them, but also generate images from text descriptions. In short, it combines image understanding similar to GPT-4V with image generation similar to Stable Diffusion.

This post covers Janus's strengths and weaknesses, how to use it, how to deploy it, and the one-click integration packages.

What's Janus?

Janus is DeepSeek's open-source multimodal large model. Put simply, it is a model that can both understand the content of an image and generate images. There are currently three versions: Janus, JanusFlow, and Janus-Pro.

Janus Test

Color recognition

Here I uploaded a picture of helmets in different colors and asked Janus to interpret it.

Janus's reply:

This picture shows four helmets, red, white, yellow and blue. Each helmet has a vent for ventilation and heat dissipation. These helmets are typically used on construction sites, in factories, or other work environments where safety gear is required. The background of the image is white, making the color of the helmet more visible.

As you can see, Janus accurately identified both the colors and the number of helmets.

Object recognition

Here I found a picture of a CNC machine and gave it to Janus to see whether it could understand it.

Janus's reply:

This is an industrial machine, specifically a numerically controlled machine tool (CNC machine). The front of the machine has a control panel with several buttons and displays for operating and monitoring the machine. The left side of the machine has an adjustable tray for holding workpieces or materials. The top of the machine has a cooling fan for keeping the work area cool. There are a number of displays and indicators on the side and front panels of the machine to show machine status and operating information. The front and sides of the machine are marked with the "VNT-600L" logo, indicating that it is a VNT branded 600L model. The overall design is modern and functional, suitable for precision machining and manufacturing.

The result is mixed. Janus correctly recognizes that this is a CNC machine and describes the general structure of the whole machine. However, it misreads the model label as "VNT-600L"; the label actually reads VMC850L.

Text recognition

Here is a test using the official example. As you can see, Janus can recognize mathematical formulas as well.

Janus's reply:

This formula is the definition of a function \(A_n\) that describes the position of a number \(n\) in a set \(S\). Specifically, the formula defines the value of a function \(A_n\) that depends on the position of a number \(n\) in a set \(S\).
Specifically, the formula is defined as follows:
\[
A_n = a_0 \left[ 1 + \frac{3}{4} \sum_{k=1}^{n} \frac{4}{9} \right]^{k}
\]
This formula can be interpreted as:
The function \(a_0\) is a constant that defines the initial value of the function \(A_n\).
The value of the function \(A_n\) depends on the position of \(n\) in the set \(S\).
The position of \(n\) in \(S\) is represented by a number \(k\) whose value depends on the position of \(n\) in the set \(S\).
This formula can be used to calculate the value of \(A_n\) in the set \(S\), where \(S\) is a set containing \(n\).

In some cases, text recognition is less than ideal, as in the following figure.

Text-to-image

The other big draw of Janus is text-to-image generation, but in my experience it is not great: generation is slow and requires a lot of video memory.

But... Something is better than nothing 👍🏻

Differences between the different versions of the Janus series

Janus, the first-generation version, introduced the concept of "visual decoupling". Model size: 1.3B parameters.
JanusFlow, the second-generation version, improves the image generation method of its predecessor, increasing both the speed and the quality of image generation.
Janus-Pro, the latest version, is a full upgrade that substantially improves the model's comprehension. Model size: two versions, 1B and 7B.

The three versions are like successive upgrades of the same product, with significant improvements in each generation. Janus-Pro is by far the most powerful of the three.

Local deployment

Python version: 3.10

git clone https://github.com/deepseek-ai/Janus.git

cd Janus

pip install -e .

python demo/app_januspro.py
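Before running the commands above, it can help to confirm which interpreter is active. The following is a minimal sketch, not part of the Janus repository: the helper name matches_tested_python is made up for illustration, and 3.10 is simply the version named in this post, so a mismatch is treated as a warning and not as a hard failure.

```python
import sys

# The Python version this post was written against (taken from the text above).
TESTED_VERSION = (3, 10)


def matches_tested_python(version_info=None):
    """Return True if major.minor equals the version used in this post.

    Hypothetical helper for illustration; not part of the Janus code base.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == TESTED_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if matches_tested_python():
        print(f"Python {current}: matches the version used in this post")
    else:
        # Only a warning: other versions may work, they are just untested here.
        print(f"Python {current}: differs from 3.10, the version used in this post")
```

Run it with the same interpreter you plan to use for pip install, for example inside the virtual environment you created for Janus.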

Modify the model

Janus-Pro comes in two sizes, 1B and 7B. The two need different amounts of video memory, and 7B gives better results than 1B.

In the Mac version of the integration pack I provide, the 1B model is used by default.

If you have enough memory and want to use the 7B model, open the demo/app_januspro.py script and, in the load_model method, change

model_path = "deepseek-ai/Janus-Pro-1B"

to

model_path = "deepseek-ai/Janus-Pro-7B"

Of course, Windows users don't need to worry about this: the Windows package has separate 1B and 7B launch options to choose from.
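If you switch between the two models often, editing the script each time gets tedious. The sketch below shows one way to pick the model path without touching the source again. The two repository IDs are the ones given above; the function name resolve_model_path and the environment variable JANUS_MODEL_SIZE are assumptions made up for this example and do not exist in the Janus code.

```python
import os

# Repository IDs as given in this post.
MODEL_PATHS = {
    "1B": "deepseek-ai/Janus-Pro-1B",
    "7B": "deepseek-ai/Janus-Pro-7B",
}


def resolve_model_path(size=None):
    """Map a size label ("1B" or "7B") to its model path.

    Hypothetical helper. If no size is passed, it reads the (made-up)
    JANUS_MODEL_SIZE environment variable and falls back to "1B",
    mirroring the Mac package's default.
    """
    if size is None:
        size = os.environ.get("JANUS_MODEL_SIZE", "1B")
    key = size.strip().upper()
    if key not in MODEL_PATHS:
        raise ValueError(
            f"unknown model size {size!r}; expected one of {sorted(MODEL_PATHS)}"
        )
    return MODEL_PATHS[key]


if __name__ == "__main__":
    # This value would replace the hard-coded model_path in the demo script.
    print(resolve_model_path())
```

With something like this in place, model_path = resolve_model_path() could stand in for the hard-coded string, and setting JANUS_MODEL_SIZE=7B before launching would select the larger model.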

Running configuration

Windows

Windows with an NVIDIA graphics card: the 1B model requires 8 GB of video memory.

The integration package needs CUDA support to run, and CUDA 12.4 is required.

https://developer.nvidia.com/cuda-12-4-0-download-archive
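To see whether your machine meets the requirements above, you can ask PyTorch what it detects. This is a rough sketch, assuming PyTorch is the framework in use (it falls back gracefully if it is not installed). The function name cuda_report is made up for this example. Note that an "8 GB" card usually reports slightly less than 8 GiB, so compare the number by eye and do not rely on a strict threshold.

```python
def cuda_report():
    """Summarize what PyTorch can see; values stay None when unavailable.

    Hypothetical helper for illustration; not part of the integration package.
    """
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "torch_cuda_build": None,  # CUDA version PyTorch was built against
        "gpu_name": None,
        "vram_gib": None,
    }
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["torch_cuda_build"] = torch.version.cuda
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)  # first GPU only
        report["cuda_available"] = True
        report["gpu_name"] = props.name
        report["vram_gib"] = round(props.total_memory / 1024**3, 1)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a machine with an NVIDIA card, the usual suspects are a missing driver or a CPU-only PyTorch build.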

Mac

For Mac, considering the hardware most users have, I only included the 1B model. If you need 7B, follow the steps in the "Modify the model" section above.

For the Mac version I added memory optimizations that release memory promptly to avoid excess memory usage. Requirements: a Mac with an Apple Silicon M1/M2/M3/M4 chip and 16 GB of RAM.
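The post does not spell out how the memory is released, so the following is only an illustration of one common approach in PyTorch, not the code used in the integration pack. It assumes PyTorch 2.0 or newer for torch.mps.empty_cache(); the function name release_memory is made up for this example.

```python
import gc


def release_memory():
    """Collect Python garbage, then ask PyTorch to free cached device memory.

    Returns a label naming which cache was cleared. Hypothetical helper;
    an assumption about one way to "release memory in time".
    """
    gc.collect()  # drop unreachable Python objects (and the tensors they hold)
    try:
        import torch
    except ImportError:
        return "gc-only"

    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"

    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and hasattr(torch, "mps"):
        torch.mps.empty_cache()  # Apple Silicon GPU cache
        return "mps"

    return "gc-only"


if __name__ == "__main__":
    print("cleared:", release_memory())
```

A natural place to call something like this is right after each generation or chat request finishes, once references to the intermediate tensors have been dropped.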

Final thoughts

At this point, some readers may feel that the application scenarios for this project are fairly limited. But as far as I know, some people have already used Janus's strong image comprehension to turn engineering drawings, technical documents, and other PDF material into an intelligent knowledge base.

Here's a screenshot one of them sent me:

Even better, if you read my last post on WeChat BOT, it's totally possible to plug Janus in and give your bot the ability to look at and generate pictures as well!