Open VLM Retrieval Leaderboard
This leaderboard presents the performance of visual embedding models across business sectors and query languages. The evaluation measures retrieval accuracy on visual document search tasks.
Structure
- Sectors: Each column represents a different business sector (e.g., Energy, Education) with documents in either English (_EN) or French (_FR)
- Models: Each row shows a different model's performance
- Scores: Values range from 0 to 1, where higher is better (1.000 being perfect retrieval)
- Average: Overall mean performance across all sectors for each model
- Colors: Blue backgrounds indicate EU models, red backgrounds indicate Chinese models
The leaderboard was created in collaboration with the Intelligence Lab of the ECE (École centrale d'électronique).
How to Read the Results
- Select a language tab to see how models perform with queries in that language
- All scores are normalized retrieval accuracy metrics in the 0 to 1 range (an illustrative computation is sketched after this list)
- Background colors indicate model origins (Blue = EU, Red = Chinese)
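The leaderboard does not spell out the exact formula behind these scores, so the sketch below is purely illustrative: it computes a top-1 retrieval accuracy from cosine similarities between query and page embeddings, which is one simple example of a retrieval metric normalized to the 0 to 1 range. The function name, embedding shapes, and toy data are all hypothetical.

```python
# Illustrative only: the leaderboard does not specify its exact metric.
# This computes top-1 retrieval accuracy over cosine similarities as one
# example of a score that lives in the 0-1 range.
import numpy as np

def top1_retrieval_accuracy(query_embs: np.ndarray, doc_embs: np.ndarray,
                            relevant_doc_idx: np.ndarray) -> float:
    """query_embs: (n_queries, dim); doc_embs: (n_docs, dim);
    relevant_doc_idx: (n_queries,) index of the correct page per query."""
    # L2-normalize so that the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                    # (n_queries, n_docs) similarity matrix
    top1 = sims.argmax(axis=1)        # best-scoring page for each query
    return float((top1 == relevant_doc_idx).mean())  # fraction of hits, in [0, 1]

# Toy usage with random embeddings (hypothetical dimensions)
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 1536))
docs = rng.normal(size=(100, 1536))
labels = rng.integers(0, 100, size=8)
print(top1_retrieval_accuracy(queries, docs, labels))
```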
Average Performance Across Languages
This table shows the average performance of each model for each sector, averaged across all five query languages (English, French, Spanish, German, Italian). The sketch after the table shows how these averages can be reproduced from the per-language tables.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License (NC) | 0.908 | 0.912 | 0.904 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.903 | 0.896 | 0.909 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.866 | 0.889 | 0.843 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.863 | 0.885 | 0.841 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.860 | 0.902 | 0.818 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.845 | 0.865 | 0.825 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.842 | 0.869 | 0.815 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.835 | 0.857 | 0.814 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.821 | 0.832 | 0.809 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.785 | 0.824 | 0.746 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.767 | 0.794 | 0.740 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.536 | 0.600 | 0.473 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.193 | 0.195 | 0.191 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.182 | 0.200 | 0.164 |
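As a sanity check on how the rows above relate to the per-language tables that follow, here is a short sketch that recomputes the jinaai/jina-embeddings-v4 row from its five published per-language scores, assuming each sector column is a simple mean over the query languages and the Average column is the mean of the sector columns.

```python
# Reproduces the jinaai/jina-embeddings-v4 row of the cross-language table
# from the per-language scores published below (rounded to three decimals,
# as in the tables). The averaging scheme is an assumption: a plain mean.
import pandas as pd

per_language = pd.DataFrame(
    {
        "ENERGY_EN": [0.925, 0.913, 0.911, 0.904, 0.907],
        "ENERGY_FR": [0.905, 0.905, 0.906, 0.899, 0.906],
    },
    index=["English", "French", "Spanish", "German", "Italian"],
)

sector_avg = per_language.mean()      # mean over the five query languages
overall_avg = sector_avg.mean()       # "Average" column: mean over sectors

print(sector_avg.round(3).to_dict())  # {'ENERGY_EN': 0.912, 'ENERGY_FR': 0.904}
print(round(overall_avg, 3))          # 0.908
```

The recomputed values match the 0.912 / 0.904 / 0.908 entries reported for this model above.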
Performance with English Queries
The table below shows how each model performs when the search queries are in English.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.915 | 0.925 | 0.905 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.912 | 0.907 | 0.916 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.886 | 0.906 | 0.866 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.883 | 0.906 | 0.860 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.871 | 0.945 | 0.798 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.865 | 0.904 | 0.826 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.861 | 0.886 | 0.836 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.848 | 0.876 | 0.820 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.843 | 0.886 | 0.800 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.825 | 0.854 | 0.795 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.822 | 0.887 | 0.757 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.697 | 0.815 | 0.578 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.304 | 0.310 | 0.298 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.302 | 0.360 | 0.244 |
Performance with French Queries
The table below shows how each model performs when the search queries are in French.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.909 | 0.913 | 0.905 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.908 | 0.894 | 0.922 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.882 | 0.892 | 0.872 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.879 | 0.881 | 0.876 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.865 | 0.875 | 0.856 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.861 | 0.880 | 0.843 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.846 | 0.858 | 0.834 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.844 | 0.857 | 0.831 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.819 | 0.823 | 0.814 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.807 | 0.832 | 0.781 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.793 | 0.758 | 0.828 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.579 | 0.607 | 0.551 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.158 | 0.160 | 0.157 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.157 | 0.162 | 0.151 |
Performance with Spanish Queries
The table below shows how each model performs when the search queries are in Spanish.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.909 | 0.902 | 0.915 |
jinaai/jina-embeddings-v4 | Qwen Research License | 0.908 | 0.911 | 0.906 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.865 | 0.897 | 0.833 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.864 | 0.889 | 0.840 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.860 | 0.884 | 0.837 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.847 | 0.867 | 0.827 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.836 | 0.856 | 0.816 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.836 | 0.844 | 0.828 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.833 | 0.855 | 0.812 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.791 | 0.824 | 0.758 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.776 | 0.811 | 0.741 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.534 | 0.579 | 0.488 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.176 | 0.177 | 0.175 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.147 | 0.156 | 0.139 |
Performance with German Queries
The table below shows how each model performs when the search queries are in German.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.901 | 0.904 | 0.899 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.882 | 0.884 | 0.881 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.837 | 0.877 | 0.797 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.835 | 0.894 | 0.776 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.834 | 0.868 | 0.799 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.831 | 0.861 | 0.801 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.820 | 0.848 | 0.793 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.812 | 0.856 | 0.769 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.801 | 0.814 | 0.788 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.747 | 0.801 | 0.693 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.656 | 0.709 | 0.603 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.370 | 0.441 | 0.298 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.164 | 0.165 | 0.164 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.152 | 0.158 | 0.145 |
Performance with Italian Queries
The table below shows how each model performs when the search queries are in Italian.
Model | License | Average | ENERGY_EN | ENERGY_FR |
---|---|---|---|---|
jinaai/jina-embeddings-v4 | Qwen Research License | 0.906 | 0.907 | 0.906 |
racineai/QwenAmann-4B-dse | Apache 2.0 | 0.902 | 0.894 | 0.910 |
vidore/colqwen2-v1.0 | Apache 2.0 | 0.866 | 0.893 | 0.839 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (768 max pixels) | Apache 2.0 | 0.860 | 0.887 | 0.833 |
llamaindex/vdr-2b-multi-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.860 | 0.882 | 0.837 |
marco/mcdse-2b-v1 (1536 dim) (960 max pixels) | Apache 2.0 | 0.841 | 0.854 | 0.827 |
marco/mcdse-2b-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.833 | 0.851 | 0.815 |
llamaindex/vdr-2b-multi-v1 (768 dim) (960 max pixels) | Apache 2.0 | 0.831 | 0.855 | 0.806 |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | Apache 2.0 | 0.822 | 0.823 | 0.820 |
racineai/Flantier-SmolVLM-2B-dse | Apache 2.0 | 0.773 | 0.792 | 0.753 |
MrLight/dse-qwen2-2b-mrl-v1 (1024 max pixels) | Apache 2.0 | 0.750 | 0.788 | 0.712 |
racineai/Flantier-SmolVLM-500M-dse | Apache 2.0 | 0.503 | 0.556 | 0.451 |
HuggingFaceTB/SmolVLM-Instruct (base model) | Apache 2.0 | 0.163 | 0.165 | 0.162 |
HuggingFaceTB/SmolVLM-500M-Instruct (base model) | Apache 2.0 | 0.153 | 0.162 | 0.143 |
If you use this benchmark in your research, please cite:
@article{visual_embeddings_benchmark_2025,
  title={Cross-lingual Visual Embeddings Benchmark},
  author={racine.ai},
  year={2025}
}