Alibaba's large model goes open source again! It can read images and recognize objects, is based on Tongyi Qianwen-7B, and is available for commercial use

Source: Qubit

Alibaba's open-source large models: here comes a new one~

Following Tongyi Qianwen-7B (Qwen-7B), Alibaba Cloud has launched the large vision-language model Qwen-VL, which was open-sourced directly upon release.

Specifically, Qwen-VL is a multimodal large model built on Tongyi Qianwen-7B. It accepts images, text, and bounding boxes as input, and can output bounding boxes in addition to text.

For example: given a picture of Anya and a question-and-answer style prompt, Qwen-VL-Chat can not only summarize the content of the picture, but also locate Anya within it.
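
If you want to see what such a query looks like in code, below is a minimal sketch using the Hugging Face transformers remote-code interface published with the model. The image path and prompt are placeholders, and helpers such as from_list_format, chat, and draw_bbox_on_latest_picture come from the model's own trust_remote_code package rather than the standard transformers API, so check the official model card for the exact usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Qwen-VL-Chat; trust_remote_code pulls in the model-specific helper methods.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a multimodal query: an image plus a text question (placeholder values).
query = tokenizer.from_list_format([
    {"image": "path/or/url/to/anya.jpg"},  # hypothetical image path
    {"text": "Who is in the picture? Please mark her with a bounding box."},
])

# The model answers in text and can emit <box>...</box> coordinates for grounding.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# If the response contains bounding boxes, render them on the image.
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image is not None:
    image.save("anya_with_box.jpg")
```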

In benchmark tests, Qwen-VL proved to be an all-rounder (a "hexagonal warrior", in Chinese internet slang): on standard English evaluations across four categories of multimodal tasks (Zero-shot Caption/VQA/DocVQA/Grounding), it achieved SOTA.

As soon as the open source news came out, it attracted a lot of attention.

Let's take a look at the specific performance~

The first general-purpose model to support open-domain grounding in Chinese

Let’s take a look at the characteristics of the Qwen-VL series models as a whole:

  • Multilingual dialogue: supports multilingual conversation and end-to-end recognition of long Chinese and English text in images;
  • Interleaved multi-image dialogue: supports multi-image input and comparison, question answering about a specified image, multi-image creative writing, etc.;
  • The first general-purpose model to support open-domain grounding in Chinese: it can produce bounding boxes from open-domain Chinese language expressions, i.e., accurately locate the target object in the image;
  • Fine-grained recognition and understanding: compared with the 224-pixel resolution used by other open-source LVLMs (large vision-language models), Qwen-VL is the first open-source LVLM to work at 448-pixel resolution. Higher resolution improves fine-grained text recognition, document question answering, and bounding-box annotation.

In terms of application scenarios, Qwen-VL can be used for knowledge question answering, image question answering, document question answering, fine-grained visual grounding, and more.

For example, suppose a foreign visitor who cannot read Chinese goes to a hospital and, faced with a bewildering floor guide, has no idea how to reach the right department. They can simply hand the guide map and their question to Qwen-VL and let it act as a translator based on the image.

Let's test the multi-image input and comparison:

Although it didn't recognize Anya, its read of the emotions was indeed quite accurate (doge).

As for visual grounding, even when the picture is very busy and contains many characters, Qwen-VL can accurately pick out the Hulk and Spider-Man as requested.

On the technical side, Qwen-VL uses Qwen-7B as its base language model, adds a ViT visual encoder to the architecture, and connects the two through a position-aware vision-language adapter so that the model can accept visual signals as input.
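
To make the adapter idea more concrete, here is a small, illustrative PyTorch sketch of a cross-attention adapter that compresses ViT patch features into a fixed number of query tokens for the language model, with positional information mixed in. The class name, dimensions, and query count are assumptions for illustration, not Qwen-VL's actual implementation or configuration.

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Illustrative position-aware adapter: learnable queries cross-attend to ViT
    patch features, producing a fixed-length sequence of visual tokens for the LLM.
    All dimensions below are made up for the example."""

    def __init__(self, vit_dim=1024, llm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vit_dim))
        self.cross_attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vit_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_feats, pos_emb):
        # patch_feats: (batch, num_patches, vit_dim) features from the ViT encoder
        # pos_emb:     (batch, num_patches, vit_dim) positional encodings, added to
        #              the keys so the compressed tokens stay position-aware
        batch = patch_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        visual_tokens, _ = self.cross_attn(queries, patch_feats + pos_emb, patch_feats)
        return self.proj(visual_tokens)  # (batch, num_queries, llm_dim)
```

In an architecture like this, the compressed visual tokens would be interleaved with the text token embeddings fed to the language model, which is what lets a decoder-only LLM consume image input.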

The specific training process is divided into three steps:

  • Pre-training: only the visual encoder and vision-language adapter are optimized, while the language model stays frozen; training uses large-scale image-text pairs at an input resolution of 224x224.
  • Multi-task pre-training: higher-resolution (448x448) multi-task vision-language data is introduced, such as VQA, text VQA, and referring expression comprehension, for joint multi-task pre-training.
  • Supervised fine-tuning: the visual encoder is frozen while the language model and adapter are optimized; dialogue interaction data is used for prompt tuning to obtain the final, interactive Qwen-VL-Chat model (a parameter-freezing sketch follows this list).
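
As a rough illustration of which parts are trainable at each stage, here is a hedged sketch; the attribute names (model.visual, model.adapter, model.llm) are hypothetical placeholders, and the second stage's unfreezing of the language model is an assumption, since the summary above does not spell it out.

```python
def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Attribute names (visual / adapter / llm) are placeholders for illustration.
    if stage == "pretrain":                 # 224x224 image-text pairs
        set_trainable(model.visual, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, False)     # language model frozen
    elif stage == "multitask_pretrain":     # 448x448 VQA / text VQA / grounding data
        set_trainable(model.visual, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)      # assumption: all components trained jointly
    elif stage == "sft":                    # dialogue interaction data
        set_trainable(model.visual, False)  # visual encoder frozen
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```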

The researchers tested Qwen-VL on standard English assessments in four categories of multimodal tasks (Zero-shot Caption/VQA/DocVQA/Grounding).

The results show that Qwen-VL achieves the best results among open-source LVLMs of comparable size.

In addition, the researchers built TouchStone, a test set based on a GPT-4 scoring mechanism.

In this comparison test, Qwen-VL-Chat achieved SOTA.

If you are interested in Qwen-VL, there are demos on the ModelScope community and Hugging Face that you can try directly; the links are at the end of the article~

Qwen-VL allows researchers and developers to carry out secondary development, and commercial use is also permitted; note, however, that commercial users need to fill out a questionnaire application first.

Project link:


Paper address:
