Open VLM Retrieval Leaderboard
This leaderboard presents the performance of visual embedding models across business sectors and query languages. The evaluation measures retrieval accuracy on visual document search tasks.
Structure
- Sectors: Each column represents a different business sector (e.g., Energy, Education) with documents in either English (_EN) or French (_FR)
- Models: Each row shows a different model's performance
- Scores: Values range from 0 to 1, where higher is better (1.000 being perfect retrieval)
- Average: Overall mean performance across all sectors for each model
- Colors: Blue backgrounds indicate EU models, red backgrounds indicate Chinese models
The leaderboard was created in collaboration with the Intelligence Lab of the ECE (École centrale d'électronique).
How to Read the Results
- Select a language tab to see how models perform with queries in that language
- All scores are normalized retrieval accuracy metrics in the 0 to 1 range (an illustrative computation is sketched after this list)
- Background colors indicate model origins (Blue = EU, Red = Chinese)
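The leaderboard does not spell out the exact formula behind these scores, so the sketch below is purely illustrative: it computes a top-1 retrieval accuracy from cosine similarities between query and page embeddings, which is one simple example of a retrieval metric normalized to the 0 to 1 range. The function name, embedding shapes, and toy data are all hypothetical.

```python
# Illustrative only: the leaderboard does not specify its exact metric.
# This computes top-1 retrieval accuracy over cosine similarities as one
# example of a score that lives in the 0-1 range.
import numpy as np

def top1_retrieval_accuracy(query_embs: np.ndarray, doc_embs: np.ndarray,
                            relevant_doc_idx: np.ndarray) -> float:
    """query_embs: (n_queries, dim); doc_embs: (n_docs, dim);
    relevant_doc_idx: (n_queries,) index of the correct page per query."""
    # L2-normalize so that the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                    # (n_queries, n_docs) similarity matrix
    top1 = sims.argmax(axis=1)        # best-scoring page for each query
    return float((top1 == relevant_doc_idx).mean())  # fraction of hits, in [0, 1]

# Toy usage with random embeddings (hypothetical dimensions)
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 1536))
docs = rng.normal(size=(100, 1536))
labels = rng.integers(0, 100, size=8)
print(top1_retrieval_accuracy(queries, docs, labels))
```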
Average Performance Across Languages
This table shows the average performance of each model for each sector, averaged across all five query languages (English, French, Spanish, German, Italian). The sketch after the table shows how these averages can be reproduced from the per-language tables.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License (NC) | 0.908 | 0.912 | 0.904 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.903 | 0.896 | 0.909 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.866 | 0.889 | 0.843 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.863 | 0.885 | 0.841 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.860 | 0.902 | 0.818 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.845 | 0.865 | 0.825 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.842 | 0.869 | 0.815 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.835 | 0.857 | 0.814 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.821 | 0.832 | 0.809 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.785 | 0.824 | 0.746 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.767 | 0.794 | 0.740 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.536 | 0.600 | 0.473 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.193 | 0.195 | 0.191 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.182 | 0.200 | 0.164 |
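As a sanity check on how the rows above relate to the per-language tables that follow, here is a short sketch that recomputes the jinaai/jina-embeddings-v4 row from its five published per-language scores, assuming each sector column is a simple mean over the query languages and the Average column is the mean of the sector columns.

```python
# Reproduces the jinaai/jina-embeddings-v4 row of the cross-language table
# from the per-language scores published below (rounded to three decimals,
# as in the tables). The averaging scheme is an assumption: a plain mean.
import pandas as pd

per_language = pd.DataFrame(
    {
        "ENERGY_EN": [0.925, 0.913, 0.911, 0.904, 0.907],
        "ENERGY_FR": [0.905, 0.905, 0.906, 0.899, 0.906],
    },
    index=["English", "French", "Spanish", "German", "Italian"],
)

sector_avg = per_language.mean()      # mean over the five query languages
overall_avg = sector_avg.mean()       # "Average" column: mean over sectors

print(sector_avg.round(3).to_dict())  # {'ENERGY_EN': 0.912, 'ENERGY_FR': 0.904}
print(round(overall_avg, 3))          # 0.908
```

The recomputed values match the 0.912 / 0.904 / 0.908 entries reported for this model above.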
Performance with English Queries
The table below shows how each model performs when the search queries are in English.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.915 | 0.925 | 0.905 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.912 | 0.907 | 0.916 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.886 | 0.906 | 0.866 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.883 | 0.906 | 0.860 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.871 | 0.945 | 0.798 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.865 | 0.904 | 0.826 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.861 | 0.886 | 0.836 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.848 | 0.876 | 0.820 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.843 | 0.886 | 0.800 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.825 | 0.854 | 0.795 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.822 | 0.887 | 0.757 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.697 | 0.815 | 0.578 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.304 | 0.310 | 0.298 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.302 | 0.360 | 0.244 |
Performance with French Queries
The table below shows how each model performs when the search queries are in French.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.909 | 0.913 | 0.905 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.908 | 0.894 | 0.922 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.882 | 0.892 | 0.872 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.879 | 0.881 | 0.876 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.865 | 0.875 | 0.856 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.861 | 0.880 | 0.843 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.846 | 0.858 | 0.834 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.844 | 0.857 | 0.831 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.819 | 0.823 | 0.814 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.807 | 0.832 | 0.781 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.793 | 0.758 | 0.828 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.579 | 0.607 | 0.551 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.158 | 0.160 | 0.157 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.157 | 0.162 | 0.151 |
Performance with Spanish Queries
The table below shows how each model performs when the search queries are in Spanish.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.909 | 0.902 | 0.915 |
jinaai/jina-embeddings-v4 | Qwen Research License | 0.908 | 0.911 | 0.906 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.865 | 0.897 | 0.833 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.864 | 0.889 | 0.840 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.860 | 0.884 | 0.837 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.847 | 0.867 | 0.827 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.836 | 0.856 | 0.816 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.836 | 0.844 | 0.828 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.833 | 0.855 | 0.812 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.791 | 0.824 | 0.758 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.776 | 0.811 | 0.741 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.534 | 0.579 | 0.488 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.176 | 0.177 | 0.175 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.147 | 0.156 | 0.139 |
Performance with German Queries
The table below shows how each model performs when the search queries are in German.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.901 | 0.904 | 0.899 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.882 | 0.884 | 0.881 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.837 | 0.877 | 0.797 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.835 | 0.894 | 0.776 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.834 | 0.868 | 0.799 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.831 | 0.861 | 0.801 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.820 | 0.848 | 0.793 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.812 | 0.856 | 0.769 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.801 | 0.814 | 0.788 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.747 | 0.801 | 0.693 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.656 | 0.709 | 0.603 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.370 | 0.441 | 0.298 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.164 | 0.165 | 0.164 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.152 | 0.158 | 0.145 |
Performance with Italian Queries
The table below shows how each model performs when the search queries are in Italian.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.906 | 0.907 | 0.906 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.902 | 0.894 | 0.910 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.866 | 0.893 | 0.839 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.860 | 0.887 | 0.833 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.860 | 0.882 | 0.837 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.841 | 0.854 | 0.827 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.833 | 0.851 | 0.815 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.831 | 0.855 | 0.806 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.822 | 0.823 | 0.820 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.773 | 0.792 | 0.753 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.750 | 0.788 | 0.712 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.503 | 0.556 | 0.451 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.163 | 0.165 | 0.162 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.153 | 0.162 | 0.143 |
If you use this benchmark in your research, please cite:
@article{visual_embeddings_benchmark_2025,
  title={Cross-lingual Visual Embeddings Benchmark},
  author={racine.ai},
  year={2025}
}