LLM-Free Image Captioning Evaluation in Reference-flexible Settings

Pearl Teaser Image

Evaluating image captioning models quickly and reliably without relying on heavy LLMs.

Abstract

We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, the neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image--caption and caption--caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics, that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings.

Method Overview

Overview of the Pearl Model Architecture
Figure: Overview of the Pearl Model Architecture

Pearl is designed to evaluate image captioning models efficiently.

Pearl achieves robust performance in both reference-based and reference-free settings.

Quantitative Results

Pearl demonstrates state-of-the-art performance on standard benchmarks including Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL. Notably, in the reference-based setting, Pearl achieves high correlation with human judgments while maintaining an inference time of only 8.2ms per sample.

Quantitative Comparison Table

Qualitative Results

Reference-based Evaluation

Evaluating captions with reference captions available

Reference-based Example 1 Preview 1
Reference-based Example 2 Preview 2
Reference-based Example 3 Preview 3
Reference-based Example 4 Preview 4
Reference-based Example 5 Preview 5
Reference-based Example 1

Reference Caption

a guy doing a jump with his skateboard

Candidate Caption

a boy on a skateboard doing a trick


Pearl Score

0.86

Human Score

0.91

BibTeX


comming soon