Authors:
(1) Ning Wang, Huawei Inc.;
(2) Jiangrong Xie, Huawei Inc.;
(3) Hang Luo, Huawei Inc.;
(4) Qinglin Cheng, Huawei Inc.;
(5) Jihao Wu, Huawei Inc.;
(6) Mingbo Jia, Huawei Inc.;
(7) Linlin Li, Huawei Inc.
Table of Links
3 Methodology and 3.1 Model Architecture
4 Experiments
4.1 Datasets and Metrics and 4.2 Implementation Details
4.3 Ablation Study
4.4 Inference on the Mobile Device and 4.5 State-of-the-art Comparison
4.1 Datasets and Metrics
Pre-training Datasets. In the experiments, we collect image-text pairs from Google Conceptual Captions (CC3M) (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), OpenImages (Shao et al. 2019), and MS-COCO (Lin et al. 2014) to form the pre-training data. In total, our pre-training corpus consists of about 5.8M image-text pairs.
Evaluation Datasets and Metrics. We evaluate the proposed method on the Karpathy split of the COCO caption dataset (Lin et al. 2014) and the nocaps validation set (Agrawal et al. 2019). To evaluate the quality of the generated captions, we use standard metrics in the image captioning task, including BLEU@4 (Papineni et al. 2002), METEOR (Banerjee and Lavie 2005), CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015), and SPICE (Anderson et al. 2016). During caption generation, beam search (beam size = 5) is adopted in all experiments, and the maximum generation length is restricted to 20 words.
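For concreteness, this decoding setup can be written as a minimal beam search sketch (no length normalization; the beam simply shrinks as hypotheses finish), where step_fn is a hypothetical stand-in for the captioning model's next-token predictor:

```python
import torch

@torch.no_grad()
def beam_search(step_fn, bos_id, eos_id, beam_size=5, max_len=20):
    """Minimal beam search for a single image.

    step_fn(tokens) -> log-probabilities over the vocabulary for the next token,
    given the partial caption `tokens` (a 1-D LongTensor starting with bos_id).
    """
    beams = [(torch.tensor([bos_id]), 0.0)]   # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):                  # cap the caption at max_len generated words
        candidates = []
        for tokens, score in beams:
            log_probs = step_fn(tokens)       # shape: (vocab_size,)
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((torch.cat([tokens, torch.tensor([idx])]), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1].item() == eos_id else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)                    # keep unfinished hypotheses as a fallback
    return max(finished, key=lambda c: c[1])[0]
```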
4.2 Implementation Details
Visual Encoder. We take the ResNet-50 backbone from the CLIP model (Radford et al. 2021) as the visual feature extractor, whose parameters are frozen in both pre-training and fine-tuning stages. The input image resolution is 224 × 224.
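A minimal sketch of this setup with the public CLIP package is shown below; it uses the pooled encode_image output for simplicity, whereas the 2048-dimensional modulator described next suggests that grid features taken before CLIP's attention pooling are used in practice (an assumption on our part):

```python
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)   # CLIP ResNet-50; preprocess resizes to 224 x 224

# Freeze the visual backbone so it is not updated in pre-training or fine-tuning.
for p in model.visual.parameters():
    p.requires_grad = False

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical input path
with torch.no_grad():
    image_feature = model.encode_image(image)           # pooled global image feature
```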
Cross-modal Modulator. The cross-modal modulator contains two sequential linear blocks with sizes of 312 × 39 and 39 × 2048. The token embedding layer in this modulator shares weights with the embedding layer in TinyBERT.
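A possible PyTorch sketch of such a modulator is given below; only the two layer sizes and the shared token embedding are taken from the description above, while the mean pooling over concept tokens and the sigmoid channel-wise gating are assumptions about the fusion form:

```python
import torch
import torch.nn as nn

class CrossModalModulator(nn.Module):
    """Sketch: map retrieved concept tokens (TinyBERT embeddings, dim 312)
    through a 312 -> 39 -> 2048 bottleneck and use the result to modulate
    the 2048-channel CLIP features. The gating form is an assumption."""

    def __init__(self, tinybert_embedding: nn.Embedding, hidden=39, clip_dim=2048):
        super().__init__()
        self.embed = tinybert_embedding           # shared with TinyBERT's token embedding layer
        self.fc1 = nn.Linear(312, hidden)
        self.fc2 = nn.Linear(hidden, clip_dim)

    def forward(self, concept_ids, clip_feats):
        # concept_ids: (B, K) retrieved concept tokens; clip_feats: (B, clip_dim)
        c = self.embed(concept_ids).mean(dim=1)   # (B, 312) pooled concept embedding
        gate = torch.sigmoid(self.fc2(torch.relu(self.fc1(c))))
        return clip_feats * gate                  # channel-wise modulation (assumed form)
```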
4.3 Ablation Study
Model Pre-training. It is well recognized that model pre-training on a large-scale image-text corpus benefits image captioning. As shown in Table 1, for the student model with limited capacity, model pre-training significantly improves the performance by 8.0 CIDEr score.
Visual Concept Extractor. The proposed visual concept extractor provides valuable clues for image captioning via efficient image-text retrieval. As shown in Table 1, for the student model, the visual concept extractor improves the captioning performance by 3.4 CIDEr score on the COCO dataset. This mechanism also improves the strong teacher model by 3.7 CIDEr score.
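As a rough illustration, the extractor can be viewed as ranking a pre-computed concept vocabulary (each concept word encoded once by the CLIP text encoder) against the CLIP image feature; the cosine-similarity ranking and the value of k below are illustrative assumptions rather than details from the paper:

```python
import torch

@torch.no_grad()
def retrieve_concepts(image_feat, concept_feats, concept_words, k=20):
    """image_feat: (1, D) CLIP image feature; concept_feats: (N, D) pre-computed
    CLIP text features of a concept vocabulary; concept_words: list of N strings."""
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    concept_feats = concept_feats / concept_feats.norm(dim=-1, keepdim=True)
    sims = (image_feat @ concept_feats.t()).squeeze(0)   # (N,) cosine similarities
    top = sims.topk(k).indices.tolist()
    return [concept_words[i] for i in top]                # top-k visual concepts
```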
Cross-modal Modulator. The cross-modal modulator takes advantage of the retrieved visual concepts to modulate the raw CLIP features. As shown in Table 1, based on the student model with a visual concept extractor, the proposed cross-modal modulator further improves the captioning performance by 1.8 CIDEr score. This tiny block promotes the strong teacher model by 2.1 CIDEr score.
Sequential Model Distillation. In Table 2, we ablate the knowledge distillation (KD) techniques in our approach. First, we investigate KD in the pre-training stage in Table 2 (top). In these experiments, we only adopt the standard cross-entropy optimization without any KD in the fine-tuning stage. In the pre-training stage, “attention & representation distillation” improves the CIDEr score by 0.8, and the distillation of the output token probabilities improves it by 2.0. Considering the characteristics of cross-modal training, we further propose to distill the soft predictions of the anchor words (i.e., visual concepts), which brings an additional 1.2 CIDEr gain. This indicates that concept distillation facilitates cross-modal alignment.
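The three pre-training distillation terms can be sketched as below; the equal weighting, the temperature, and the assumption that student states have already been projected to the teacher's width are illustrative choices rather than values from the paper:

```python
import torch.nn.functional as F

def pretraining_kd_loss(student, teacher, T=1.0):
    """student / teacher: dicts with attention maps, hidden states, logits, and a
    boolean mask marking the positions of the anchor (visual-concept) tokens."""
    # 1) attention & representation distillation (MSE on intermediate states;
    #    student states assumed already projected to the teacher's hidden size)
    l_att = F.mse_loss(student["attn"], teacher["attn"])
    l_rep = F.mse_loss(student["hidden"], teacher["hidden"])
    # 2) output token distillation: KL between softened next-token distributions
    l_tok = F.kl_div(F.log_softmax(student["logits"] / T, dim=-1),
                     F.softmax(teacher["logits"] / T, dim=-1),
                     reduction="batchmean") * T * T
    # 3) concept distillation: the same KL, restricted to anchor-word positions
    m = student["concept_mask"]                 # (B, L) bool
    l_con = F.kl_div(F.log_softmax(student["logits"][m] / T, dim=-1),
                     F.softmax(teacher["logits"][m] / T, dim=-1),
                     reduction="batchmean") * T * T
    return l_att + l_rep + l_tok + l_con        # equal weighting assumed
```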
Next, we investigate KD in the model fine-tuning stage. As shown in Table 2, based on the distilled fusion model from the pre-training stage, “attention & representation distillation” and “output token distillation” in the fine-tuning stage bring further gains of 1.1 CIDEr and 2.6 CIDEr, respectively. Combining the above KD techniques achieves the best result, a gain of 3.3 CIDEr. Finally, by virtue of model distillation in both pre-training and fine-tuning, our lightweight student model achieves a promising captioning performance of 37.1 BLEU@4 and 124.1 CIDEr, and even matches the strong teacher model (i.e., 37.5 BLEU@4 and 126.3 CIDEr in Table 1).
Ensemble Model Distillation. The above experiments are based on the single-head setting. In practice, our model adopts an ensemble head for superior performance. To encourage prediction diversity, we prepare three teachers to individually distill these heads. As shown in Table 2, the ensemble head module together with ensemble KD improves the CIDEr score by 1.7.
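One plausible reading of this setup is sketched below: several parallel output heads (averaged at inference), each distilled from its own teacher through a per-head KL term; the head structure and the averaging are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsembleHeads(nn.Module):
    def __init__(self, hidden_dim, vocab_size, num_heads=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(num_heads)])

    def forward(self, h):
        # Average the head logits at inference for the ensemble prediction.
        return torch.stack([head(h) for head in self.heads]).mean(dim=0)

    def kd_loss(self, h, teacher_logits_list, T=1.0):
        # Each head is distilled from its own teacher to encourage diversity.
        losses = []
        for head, t_logits in zip(self.heads, teacher_logits_list):
            s_log = F.log_softmax(head(h) / T, dim=-1)
            losses.append(F.kl_div(s_log, F.softmax(t_logits / T, dim=-1),
                                   reduction="batchmean") * T * T)
        return sum(losses) / len(losses)
```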
This paper is available on arxiv under CC BY 4.0 DEED license.