This leaderboard presents the performance of various visual embedding models across different business sectors and languages. The evaluation is based on retrieval accuracy for visual search tasks.
## Structure
- Sectors: Each column represents a different business sector (e.g., Energy, Education) with documents in either English (_EN) or French (_FR)
- Models: Each row shows a different model's performance
- Scores: Values range from 0 to 1, where higher is better (1.000 being perfect retrieval)
- Average: Overall mean performance across all sectors for each model
- Colors: Blue backgrounds indicate EU models, red backgrounds indicate Chinese models
The leaderboard was created in collaboration with the Intelligence Lab at ECE (École centrale d'électronique).
## How to Read the Results
- Select a language tab to see how models perform with queries in that language
- All scores are normalized retrieval accuracy metrics
- Background colors indicate model origins (Blue = EU, Red = Chinese)
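
The leaderboard describes its scores only as "normalized retrieval accuracy," so the exact metric is an assumption here; a common choice for visual search evaluation is the top-1 hit rate over cosine similarities between query and document embeddings. A minimal sketch of that metric:

```python
import numpy as np

def top1_retrieval_accuracy(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Fraction of queries whose top-ranked document is the correct one.

    Assumes query i's relevant document is doc i (a common eval layout).
    Embeddings are L2-normalized so the dot product equals cosine similarity.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                        # (n_queries, n_docs) cosine similarities
    top1 = sims.argmax(axis=1)            # best-scoring document per query
    return float((top1 == np.arange(len(q))).mean())
```

A score of 1.000 under this metric means every query retrieved its matching document first, which is consistent with the "perfect retrieval" reading above.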
## Average Performance Across Languages
This table shows the average performance of each model for each sector, averaged across all query languages.
Model | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|
racineai/AMPERE-1 (1536 dim) (768 max pixels) | Coming Soon | ||
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | 0.866 | 0.889 | 0.843 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | 0.863 | 0.885 | 0.841 |
vidore/colqwen2-v1.0 | 0.860 | 0.902 | 0.818 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | 0.845 | 0.865 | 0.825 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | 0.842 | 0.869 | 0.815 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | 0.835 | 0.857 | 0.814 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 0.821 | 0.832 | 0.809 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | 0.785 | 0.824 | 0.746 |
racineai/Flantier-SmolVLM-2B-dse | 0.767 | 0.794 | 0.740 |
racineai/Flantier-SmolVLM-500M-dse | 0.536 | 0.600 | 0.473 |
HuggingFaceTB/SmolVLM-Instruct (base model) | 0.193 | 0.195 | 0.191 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | 0.182 | 0.200 | 0.164 |
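
Assuming a simple unweighted mean (the tables are consistent with this), the cross-language score for a model is the average of its per-language averages, and the Average column is the mean of its sector scores. Checking this against the llamaindex/vdr-2b-multi-v1 (1536 dim, 960 max pixels) numbers in the per-language tables:

```python
# Per-language averages for llamaindex/vdr-2b-multi-v1 (1536 dim, 960 max pixels),
# read from the English/French/Spanish/German/Italian query tables.
per_language_avg = {"en": 0.886, "fr": 0.882, "es": 0.864, "de": 0.837, "it": 0.860}

# Unweighted mean across query languages reproduces the 0.866 shown above.
cross_language = round(sum(per_language_avg.values()) / len(per_language_avg), 3)
print(cross_language)  # 0.866

# The Average column is likewise the mean of the sector scores:
print(round((0.889 + 0.843) / 2, 3))  # 0.866
```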
## Performance with English Queries
The table below shows how each model performs when the search queries are in English.
Model | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|
racineai/AMPERE-1 (1536 dim) (768 max pixels) | Coming Soon | ||
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | 0.886 | 0.906 | 0.866 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | 0.883 | 0.906 | 0.860 |
vidore/colqwen2-v1.0 | 0.871 | 0.945 | 0.798 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | 0.865 | 0.904 | 0.826 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | 0.861 | 0.886 | 0.836 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | 0.848 | 0.876 | 0.820 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | 0.843 | 0.886 | 0.800 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 0.825 | 0.854 | 0.795 |
racineai/Flantier-SmolVLM-2B-dse | 0.822 | 0.887 | 0.757 |
racineai/Flantier-SmolVLM-500M-dse | 0.697 | 0.815 | 0.578 |
HuggingFaceTB/SmolVLM-Instruct (base model) | 0.304 | 0.310 | 0.298 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | 0.302 | 0.360 | 0.244 |
## Performance with French Queries
The table below shows how each model performs when the search queries are in French.
Model | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|
racineai/AMPERE-1 (1536 dim) (768 max pixels) | Coming Soon | ||
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | 0.882 | 0.892 | 0.872 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | 0.879 | 0.881 | 0.876 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | 0.865 | 0.875 | 0.856 |
vidore/colqwen2-v1.0 | 0.861 | 0.880 | 0.843 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | 0.846 | 0.858 | 0.834 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | 0.844 | 0.857 | 0.831 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 0.819 | 0.823 | 0.814 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | 0.807 | 0.832 | 0.781 |
racineai/Flantier-SmolVLM-2B-dse | 0.793 | 0.758 | 0.828 |
racineai/Flantier-SmolVLM-500M-dse | 0.579 | 0.607 | 0.551 |
HuggingFaceTB/SmolVLM-Instruct (base model) | 0.158 | 0.160 | 0.157 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | 0.157 | 0.162 | 0.151 |
## Performance with Spanish Queries
The table below shows how each model performs when the search queries are in Spanish.
Model | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|
racineai/AMPERE-1 (1536 dim) (768 max pixels) | Coming Soon | ||
vidore/colqwen2-v1.0 | 0.865 | 0.897 | 0.833 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | 0.864 | 0.889 | 0.840 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | 0.860 | 0.884 | 0.837 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | 0.847 | 0.867 | 0.827 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | 0.836 | 0.856 | 0.816 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 0.836 | 0.844 | 0.828 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | 0.833 | 0.855 | 0.812 |
racineai/Flantier-SmolVLM-2B-dse | 0.791 | 0.824 | 0.758 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | 0.776 | 0.811 | 0.741 |
racineai/Flantier-SmolVLM-500M-dse | 0.534 | 0.579 | 0.488 |
HuggingFaceTB/SmolVLM-Instruct (base model) | 0.176 | 0.177 | 0.175 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | 0.147 | 0.156 | 0.139 |
## Performance with German Queries
The table below shows how each model performs when the search queries are in German.
Model | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|
racineai/AMPERE-1 (1536 dim) (768 max pixels) | Coming Soon | ||
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | 0.837 | 0.877 | 0.797 |
vidore/colqwen2-v1.0 | 0.835 | 0.894 | 0.776 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | 0.834 | 0.868 | 0.799 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | 0.831 | 0.861 | 0.801 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | 0.820 | 0.848 | 0.793 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | 0.812 | 0.856 | 0.769 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 0.801 | 0.814 | 0.788 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | 0.747 | 0.801 | 0.693 |
racineai/Flantier-SmolVLM-2B-dse | 0.656 | 0.709 | 0.603 |
racineai/Flantier-SmolVLM-500M-dse | 0.370 | 0.441 | 0.298 |
HuggingFaceTB/SmolVLM-Instruct (base model) | 0.164 | 0.165 | 0.164 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | 0.152 | 0.158 | 0.145 |
## Performance with Italian Queries
The table below shows how each model performs when the search queries are in Italian.
Model | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|
racineai/AMPERE-1 (1536 dim) (768 max pixels) | Coming Soon | ||
vidore/colqwen2-v1.0 | 0.866 | 0.893 | 0.839 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | 0.860 | 0.887 | 0.833 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | 0.860 | 0.882 | 0.837 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | 0.841 | 0.854 | 0.827 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | 0.833 | 0.851 | 0.815 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | 0.831 | 0.855 | 0.806 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 0.822 | 0.823 | 0.820 |
racineai/Flantier-SmolVLM-2B-dse | 0.773 | 0.792 | 0.753 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | 0.750 | 0.788 | 0.712 |
racineai/Flantier-SmolVLM-500M-dse | 0.503 | 0.556 | 0.451 |
HuggingFaceTB/SmolVLM-Instruct (base model) | 0.163 | 0.165 | 0.162 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | 0.153 | 0.162 | 0.143 |
## Additional Information
- Scores are updated regularly as new models are evaluated
- All evaluations use the same test set for fair comparison
- Models are evaluated on English and French documents, with queries in five languages (English, French, Spanish, German, Italian), to assess cross-lingual capabilities
- Color coding indicates model origin (Blue = EU, Red = Chinese)
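
Several models appear at two embedding dimensions (1536 dim and 768 dim). For Matryoshka-style embeddings, such as the `mrl` variant listed above, the smaller vector is typically obtained by truncating the full embedding and re-normalizing; whether each model here is evaluated this way is an assumption. A minimal sketch:

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int = 768) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    the usual way Matryoshka (MRL) embeddings are shrunk for cheaper search."""
    truncated = emb[..., :dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / norm

full = np.random.default_rng(0).normal(size=(4, 1536))
small = truncate_embedding(full, 768)
print(small.shape)  # (4, 768), each row unit-length
```

Halving the dimension halves index size and search cost, which is why the tables report both variants: the 768 dim rows trade a small accuracy drop for cheaper retrieval.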
## Citation
If you use these benchmarks in your research, please cite:
```bibtex
@article{visual_embeddings_benchmark_2025,
  title={Cross-lingual Visual Embeddings Benchmark},
  author={racine.ai},
  year={2025}
}
```