Today we introduce a powerful open-source multimodal model: Janus-Pro, the latest version of DeepSeek's Janus series. It can not only read pictures and answer questions about them, but also generate pictures from text descriptions. In short, it combines image understanding similar to GPT-4V with image generation similar to Stable Diffusion.
This post covers some pros and cons of Janus, how to use it, how to deploy it, and the integration packages.
What's Janus?
Janus is DeepSeek's open-source multimodal large model. Put simply, it is a model that can both understand the content of pictures and generate pictures. There are currently three versions of the model: Janus, JanusFlow, and Janus-Pro.
Janus Test
Color recognition
Here I uploaded a picture of helmets in different colors and left it to Janus to interpret.
Janus's reply.
This picture shows four helmets, red, white, yellow and blue. Each helmet has a vent for ventilation and heat dissipation. These helmets are typically used on construction sites, in factories, or other work environments where safety gear is required. The background of the image is white, making the color of the helmet more visible.
As you can see, Janus accurately identified both the colors and the number of helmets.
Object recognition
Here I found a picture of a CNC machine and threw it at Janus to see if it understood it.
Janus's reply.
This is an industrial machine, specifically a numerically controlled machine tool (CNC machine). The front of the machine has a control panel with several buttons and displays for operating and monitoring the machine. The left side of the machine has an adjustable tray for holding workpieces or materials. The top of the machine has a cooling fan for keeping the work area cool. There are a number of displays and indicators on the side and front panels of the machine to show machine status and operating information. The front and sides of the machine are marked with the "VNT-600L" logo, indicating that it is a VNT branded 600L model. The overall design is modern and functional, suitable for precision machining and manufacturing.
There are some problems here. Janus correctly recognizes that this is a CNC machine and describes the general structure of the whole machine. However, it misreads the logo as "VNT-600L"; it should be VMC850L.
Text recognition
Here's a test using the official example, and you can see that Janus recognizes mathematical formulas as well.
Janus's reply.
This formula is the definition of a function (A_n) that describes the position of a number (n) in a set (S). Specifically, the formula defines the value of a function (A_n) that depends on the position of a number (n) in a set (S).
Specifically, the formula is defined as follows:
\[
A_n = a_0 \left[ 1 + \frac{3}{4} \sum_{k=1}^{n} \frac{4}{9} \right]^{k}
\]
This formula can be interpreted as:
- The function (a_0) is a constant that defines the initial value of the function (A_n).
- The value of the function (A_n) depends on the position of (n) in the set (S).
- The position of (n) in (S) is represented by a number (k) whose value depends on the position of (n) in the set (S).
This formula can be used to calculate the value of (A_n) in the set (S), where (S) is a set containing (n).
In some cases, though, text recognition is not ideal, as in the following figure:
Text-to-image generation
The other big draw of Janus is text-to-image generation, but after using it, I don't think it's great. It's not fast, and it requires a lot of video memory.
But... Something is better than nothing 👍🏻
Differences between the different versions of the Janus series
- Janus, the first-generation version, introduced the concept of "visual decoupling". Model size: 1.3B parameters.
- JanusFlow, the second-generation version, improves on its predecessor's image generation method, increasing both the speed and the quality of image generation.
- Janus-Pro, the latest version, is a full upgrade that substantially improves the model's comprehension. Model size: there are two versions, 1B and 7B.
The three versions are like successive upgrades of the same product, with significant improvements in each generation, and Janus-Pro is by far the most powerful version.
Local deployment
Python version: 3.10
git clone https://github.com/deepseek-ai/Janus.git
cd Janus
pip install -e .
python demo/app_januspro.py
Modify the model
Janus-Pro comes in two sizes, 1B and 7B, and they need different amounts of video memory. Of the two, 7B gives the better results.
In the Mac version of the integration pack I provided, the 1B model is used by default.
If you have enough memory and want to use the 7B model, open the demo/app_januspro.py script and, in the load_model method, change
model_path = "deepseek-ai/Janus-Pro-1B"
to
model_path = "deepseek-ai/Janus-Pro-7B"
Of course, Windows users don't need to worry about this: there are separate 1B and 7B launch options to choose from.
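Instead of editing the script every time you switch models, the choice could be read from the environment. The following is only a minimal sketch, not code from the Janus repository: the function name resolve_model_path and the environment variable JANUS_MODEL_SIZE are made up for illustration, and only the two model IDs come from this post.

```python
import os

# The two checkpoints named in this post.
MODEL_PATHS = {
    "1B": "deepseek-ai/Janus-Pro-1B",
    "7B": "deepseek-ai/Janus-Pro-7B",
}


def resolve_model_path(size=None):
    """Return the model ID for the requested size ("1B" or "7B").

    If no size is passed, fall back to the JANUS_MODEL_SIZE environment
    variable (hypothetical, invented for this sketch) and then to "1B".
    """
    chosen = (size or os.environ.get("JANUS_MODEL_SIZE", "1B")).upper()
    if chosen not in MODEL_PATHS:
        raise ValueError(
            f"unknown model size {chosen!r}; expected one of {sorted(MODEL_PATHS)}"
        )
    return MODEL_PATHS[chosen]


if __name__ == "__main__":
    # The result would be assigned to model_path in the demo script.
    print(resolve_model_path())
```

With something like this in place, the line in the script would become model_path = resolve_model_path(), and the 7B model could be selected by setting the variable before launching the demo.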
System requirements
Windows
Windows with an NVIDIA graphics card: the 1B model requires 8 GB of video memory.
The integration package needs CUDA support to run, and CUDA 12.4 is required.
https://developer.nvidia.com/cuda-12-4-0-download-archive
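If you are not sure whether your card meets the requirement above, a quick check from Python can help. This is a rough sketch under a few assumptions: it relies on PyTorch being installed, the helper names are my own, and the 8 GB threshold is simply the figure quoted in this post for the 1B model. Note that torch.version.cuda reports the CUDA version PyTorch was built against, which is not necessarily the toolkit installed on your system.

```python
REQUIRED_VRAM_GB = 8  # figure quoted in this post for the 1B model


def has_enough_vram(total_bytes, required_gb=REQUIRED_VRAM_GB):
    """Return True if a card with total_bytes of memory meets the requirement.

    Cards sold as "8 GB" usually report slightly less than 8 GiB, so the
    size is rounded to the nearest GiB before comparing.
    """
    return round(total_bytes / 1024**3) >= required_gb


def describe_gpu():
    """Return a one-line status string; PyTorch is imported lazily."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible to PyTorch"
    props = torch.cuda.get_device_properties(0)
    verdict = "enough" if has_enough_vram(props.total_memory) else "not enough"
    return (
        f"{props.name}: {props.total_memory / 1024**3:.1f} GiB "
        f"({verdict} for the 1B model), PyTorch CUDA build {torch.version.cuda}"
    )


if __name__ == "__main__":
    print(describe_gpu())
```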
Mac
For the Mac version, to suit the hardware most users have, I only included the 1B version of the model. If you need 7B, follow the steps in the "Modify the model" section above.
For the Mac version I also optimized memory use so that memory is released promptly and does not pile up. Requirements: a Mac with an Apple Silicon M1/M2/M3/M4 chip and 16 GB of RAM.
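To give an idea of what "releasing memory in time" can look like, here is a minimal sketch. It is not the actual code from the integration pack; the function name release_memory is hypothetical, and it assumes a PyTorch build with MPS support (torch.mps.empty_cache is available in recent PyTorch versions).

```python
import gc


def release_memory():
    """Free unreachable Python objects, then ask PyTorch to drop cached MPS memory.

    Returns the number of objects the garbage collector found unreachable.
    """
    collected = gc.collect()
    try:
        import torch

        # Only touch the MPS cache on Apple Silicon builds that expose it.
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
    except (ImportError, AttributeError):
        # PyTorch missing, or too old to have the MPS helpers: nothing more to do.
        pass
    return collected


if __name__ == "__main__":
    print(f"collected {release_memory()} objects")
```

A call like this would typically go right after each request finishes, once the generated images or answers have been handed back and the intermediate tensors are no longer referenced.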
Final thoughts
Having read this far, some readers may feel that the use cases for this project are fairly limited. But as far as I know, some people have already used Janus's strong image and text comprehension to turn engineering drawings, technical documents, and other PDF material into an intelligent knowledge base.
Here's a screenshot one of them sent me:
Even better, if you read my last post on the WeChat bot, it's entirely possible to plug Janus in and give your bot the ability to look at and generate pictures as well!