A GPT-4V Level Multimodal LLM on Your Phone

中文 | English

MiniCPM-Llama3-V 2.5 🤗 🤖 | MiniCPM-V 2.0 🤗 🤖 | Technical Blog

MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image and text as inputs and provide high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming to achieve strong performance and efficient deployment! The most notable models in this series currently include:

  • MiniCPM-Llama3-V 2.5: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 in overall performance. Equipped with enhanced OCR and instruction-following capabilities, the model also supports multimodal conversation in over 30 languages including English, Chinese, French, Spanish, and German. With the help of quantization, compilation optimizations, and several efficient inference techniques on CPUs and NPUs, MiniCPM-Llama3-V 2.5 can be efficiently deployed on end-side devices.

  • MiniCPM-V 2.0: The lightest model in the MiniCPM-V series. With 2B parameters, it surpasses larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It can accept image inputs of any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving performance comparable to Gemini Pro in scene-text understanding and matching GPT-4V in low hallucination rates.

News

  • [2024.05.24] We release the MiniCPM-Llama3-V 2.5 gguf, which supports llama.cpp inference and provides smooth decoding at 6~8 tokens/s on mobile phones. Try it now!
  • [2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio’s official account, is available here. Come and try it out!
  • [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations and multilingual capabilities 🌟📊🌍. Click here to view more details.
  • [2024.05.20] We open-source MiniCPM-Llama3-V 2.5! It has improved OCR capability, supports 30+ languages, and is the first end-side MLLM to achieve GPT-4V-level performance. We provide efficient inference and simple fine-tuning. Try it now!
  • [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click here to view more details.
  • [2024.04.18] We created a Hugging Face Space to host the demo of MiniCPM-V 2.0 here!
  • [2024.04.17] MiniCPM-V-2.0 supports deploying WebUI Demo now!
  • [2024.04.15] MiniCPM-V-2.0 now also supports fine-tuning with the SWIFT framework!
  • [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Click here to view the MiniCPM-V 2.0 technical blog.
  • [2024.03.14] MiniCPM-V now supports fine-tuning with the SWIFT framework. Thanks to Jintao for the contribution!
  • [2024.03.01] MiniCPM-V now can be deployed on Mac!
  • [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.

Contents

MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

  • 🔥 Leading Performance. MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max and greatly outperforms other Llama 3-based MLLMs.

  • 💪 Strong OCR Capabilities. MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a 700+ score on OCRBench and surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.

  • 🏆 Trustworthy Behavior. Leveraging the latest RLAIF-V method (the newest technique in the RLHF-V [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best level within the open-source community.

  • 🌏 Multilingual Support. Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from VisCPM, MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to over 30 languages including German, French, Spanish, Italian, Russian etc. All Supported Languages.

  • 🚀 Efficient Deployment. MiniCPM-Llama3-V 2.5 systematically employs model quantization, CPU optimizations, NPU optimizations and compilation optimizations, achieving high-efficiency deployment on edge devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a 150-fold acceleration in multimodal large model end-side image encoding and a 3-fold increase in language decoding speed.

Evaluation

Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.
| Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB test (en) | MMB test (cn) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
|:---|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **Proprietary** | | | | | | | | | | | | | |
| Gemini Pro | - | 680 | 74.6 | 88.1 | 62.9 | 2148.9 | 73.6 | 74.3 | 48.9 | 45.8 | 79.9 | 60.4 | - |
| GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.5 | 1771.5 | 77.0 | 74.4 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
| **Open-source** | | | | | | | | | | | | | |
| Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | - | - | 31.7 | - | - | - | - |
| Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 51.6 | 1860.0 | 61.8 | 56.3 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 |
| DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 54.6 | 1765.4 | 73.8 | 71.4 | 38.3 | 36.8 | 77.8 | 54.2 | - |
| Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.2 | 2050.2 | 72.4 | 70.7 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 |
| CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 54.2 | 1736.6 | 65.8 | 55.9 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 |
| TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
| Idefics2 | 8.0B | - | 73.0 | 74.0 | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
| Bunny-LLama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
| LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | - | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
| Phi-3-vision-128k-instruct | 4.2B | 639* | 70.9 | - | - | 1537.5* | - | - | 40.4 | 44.5 | 64.2* | 58.8* | - |
| MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.5 | 1650.2 | 64.1 | 62.6 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 |
| MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 54.5 | 1808.6 | 69.1 | 66.5 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 |
| MiniCPM-Llama3-V 2.5 | 8.5B | 725 | 76.6 | 84.8 | 65.1 | 2024.6 | 77.2 | 74.2 | 45.8 | 54.3 | 86.7 | 63.5 | 89.7 |
* We evaluate the officially released checkpoint by ourselves.

Evaluation results of multilingual LLaVA Bench

Examples

We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is a raw screen recording on a Xiaomi 14 Pro without editing.

MiniCPM-V 2.0

Click to view more details of MiniCPM-V 2.0

MiniCPM-V 2.0 is an efficient version with promising performance for deployment. The model is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.

  • 🔥 State-of-the-art Performance.

    MiniCPM-V 2.0 achieves state-of-the-art performance on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc) among models under 7B parameters. It even outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Notably, MiniCPM-V 2.0 shows strong OCR capability, achieving comparable performance to Gemini Pro in scene-text understanding, and state-of-the-art performance on OCRBench among open-source models.

  • 🏆 Trustworthy Behavior.

    LMMs are known for suffering from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is the first end-side LMM aligned via multimodal RLHF for trustworthy behavior (using the recent RLHF-V [CVPR'24] series technique). This allows the model to match GPT-4V in preventing hallucinations on Object HalBench.

  • 🌟 High-Resolution Images at Any Aspect Ratio.

    MiniCPM-V 2.0 can accept 1.8 million pixels (e.g., 1344x1344) images at any aspect ratio. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from LLaVA-UHD.

  • ⚡️ High Efficiency.

    MiniCPM-V 2.0 can be efficiently deployed on most GPU cards and personal computers, and even on end devices such as mobile phones. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with favorable memory cost and speed during inference even when dealing with high-resolution images. A toy sketch of this idea is shown after this feature list.

  • 🙌 Bilingual Support.

    MiniCPM-V 2.0 supports strong bilingual multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from VisCPM [ICLR'24].
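
To make the perceiver-resampler point above concrete, here is a toy PyTorch sketch of the general idea: a small set of learned queries cross-attends to the full grid of visual patch features and produces a short, fixed-length sequence of visual tokens. The module, dimensions, and query count are illustrative assumptions, not the actual MiniCPM-V implementation.

import torch
import torch.nn as nn

class ToyPerceiverResampler(nn.Module):
    """Toy sketch: compress a long sequence of patch features into a few visual tokens."""
    def __init__(self, dim=1152, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)   # learned query tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_feats):                       # patch_feats: (B, N_patches, dim)
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)   # queries cross-attend to all patches
        return self.proj(out)                             # (B, num_queries, dim) visual tokens for the LLM

# A high-resolution image may yield thousands of patch features;
# the resampler reduces them to a fixed 64 tokens regardless of input size.
feats = torch.randn(2, 2304, 1152)                        # fake vision-encoder output
print(ToyPerceiverResampler()(feats).shape)               # torch.Size([2, 64, 1152])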

Examples

We deploy MiniCPM-V 2.0 on end devices. The demo video is a raw screen recording on a Xiaomi 14 Pro without editing.

Legacy Models

| Model | Introduction and Guidance |
|:---|:---|
| MiniCPM-V 1.0 | Document |
| OmniLMM-12B | Document |

Online Demo

Click here to try out the demos of MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0.

Install

  1. Clone this repository and navigate to the source folder
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
  2. Create conda environment
conda create -n MiniCPM-V python=3.10 -y
conda activate MiniCPM-V
  3. Install dependencies
pip install -r requirements.txt

Inference

Model Zoo

| Model | Device | Memory | Description | Download |
|:---|:---|:---|:---|:---|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | The latest version, achieving state-of-the-art end-side multimodal performance. | 🤗 |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 5 GB | The gguf version, lower memory usage and faster inference. | 🤗 |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | 🤗 |
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | 🤗 |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | 🤗 |
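
As a quick illustration of the table above, the following minimal sketch loads the int4 model with 🤗 Transformers and reuses the chat interface shown in the Mac example further below. Assumptions: the Hugging Face repo id openbmb/MiniCPM-Llama3-V-2_5-int4 (inferred from the naming pattern; follow the 🤗 link in the table for the exact id), a CUDA GPU, and the bitsandbytes package installed. Treat it as a sketch, not the official loading recipe.

# Minimal sketch (assumptions noted above), not the official loading recipe.
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5-int4'   # assumed repo id; check the 🤗 link in the table
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)   # int4 weights load onto the GPU
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()

image = Image.open('./assets/airplane.jpeg').convert('RGB')
msgs = [{'role': 'user', 'content': 'Tell me the model of this aircraft.'}]

# Same chat interface as in the Mac example further below.
answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)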

Multi-turn Conversation

Please refer to the following code to run.

from chat import MiniCPMVChat, img2base64
import torch
import json

torch.manual_seed(0)

chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5')

im_64 = img2base64('./assets/airplane.jpeg')

# First round chat 
msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# Second round chat 
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Introduce something about Airbus A380."})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

You will get the following output:

"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."

"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."

Inference on Mac

Click to view an example of running MiniCPM-Llama3-V 2.5 on 💻 Mac with MPS (Apple silicon or AMD GPUs).
# test.py  Needs more than 16 GB of memory.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)

Run with command:

PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py

Deployment on Mobile Phone

MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click here to install the apk. MiniCPM-Llama3-V 2.5 is coming soon.

WebUI Demo

Click to see how to deploy the WebUI demo on different devices
pip install -r requirements.txt
# For NVIDIA GPUs, run:
python web_demo_2.5.py --device cuda

# For Mac with MPS (Apple silicon or AMD GPUs), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps

Inference with llama.cpp

MiniCPM-Llama3-V 2.5 can now run with llama.cpp! See our fork of llama.cpp for more details. This implementation supports smooth inference at 6~8 tokens/s on mobile phones (test environment: Xiaomi 14 Pro + Snapdragon 8 Gen 3).

Inference with vLLM

Click to see how to run inference with vLLM
Because our pull request to vLLM is still under review, we forked the repository to build and test our vLLM demo. Here are the steps:
  1. Clone our version of vLLM:
git clone https://github.com/OpenBMB/vllm.git
  2. Install vLLM:
cd vllm
pip install -e .
  3. Install timm:
pip install timm==0.9.10
  4. Run our demo:
python examples/minicpmv_example.py

Fine-tuning

Simple Fine-tuning

We support simple fine-tuning with Hugging Face for MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5.

Reference Document

With the SWIFT Framework

We now support MiniCPM-V series fine-tuning with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete Adapters Library including techniques such as NEFTune, LoRA+ and LLaMA-PRO.

Best Practices: MiniCPM-V 1.0, MiniCPM-V 2.0

TODO

  • MiniCPM-V fine-tuning support
  • Code release for real-time interactive assistant

Model License

The code in this repo is released under the Apache-2.0 License.

The usage of MiniCPM-V's and OmniLMM's parameters is subject to "General Model License Agreement - Source Notes - Publicity Restrictions - Commercial License"

The parameters are fully open to academic research

Please contact cpm@modelbest.cn to obtain written authorization for commercial uses. Free commercial use is also allowed after registration.

Statement

As LMMs, MiniCPM-V models (including OmniLMM) generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-V models does not represent the views and positions of the model developers.

We will not be liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the models.

Institutions

This project is developed by the following institutions:

Other Multimodal Projects from Our Team

👏 Welcome to explore other multimodal projects of our team:

VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V

🌟 Star History

Star History Chart

Citation

If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!

@article{yu2023rlhf,
  title={Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback},
  author={Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and others},
  journal={arXiv preprint arXiv:2312.00849},
  year={2023}
}
@article{viscpm,
    title={Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages}, 
    author={Jinyi Hu and Yuan Yao and Chongyi Wang and Shan Wang and Yinxu Pan and Qianyu Chen and Tianyu Yu and Hanghao Wu and Yue Zhao and Haoye Zhang and Xu Han and Yankai Lin and Jiao Xue and Dahai Li and Zhiyuan Liu and Maosong Sun},
    journal={arXiv preprint arXiv:2308.12038},
    year={2023}
}
@article{xu2024llava-uhd,
  title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
  author={Xu, Ruyi and Yao, Yuan and Guo, Zonghao and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
  journal={arXiv preprint arXiv:2403.11703},
  year={2024}
}
