Authors:
(1) Ning Wang, Huawei Inc.;
(2) Jiangrong Xie, Huawei Inc.;
(3) Hang Luo, Huawei Inc.;
(4) Qinglin Cheng, Huawei Inc.;
(5) Jihao Wu, Huawei Inc.;
(6) Mingbo Jia, Huawei Inc.;
(7) Linlin Li, Huawei Inc.

Table of Links

Abstract and 1 Introduction
2 Related Work
3 Methodology and 3.1 Model Architecture
3.2 Model Training
3.3 Knowledge Distillation
4 Experiments
4.1 Datasets and Metrics and 4.2 Implementation Details
4.3 Ablation Study
4.4 Inference on the Mobile Device and 4.5 State-of-the-art Comparison
5 Conclusion and References
A Implementation Details
B Visualization Results
C Results on Nocaps
D Limitations and Future Work

C Results on Nocaps

Due to limited space, the main paper reports only the “out-of-domain” and “overall” comparison results on the Nocaps dataset (Agrawal et al. 2019). Table 8 of this supplementary material shows the complete results, covering “in-domain”, “near-domain”, “out-of-domain”, and “overall” performance.

D Limitations and Future Work

Despite its favorable balance between performance and efficiency, the proposed framework still has some limitations:

(1) Training a More Efficient CLIP. The main computational cost of our work lies in the visual backbone (i.e., ResNet-50). In the future, we plan to train an EfficientNet-based CLIP model to further reduce the feature extraction latency of the visual encoder.

(2) End-to-end Training. Currently, we freeze the parameters of the CLIP ResNet-50 backbone. We observe that end-to-end training of the visual backbone degrades performance, potentially due to the limited training data in the image captioning domain. In the future, we intend to include more data to facilitate joint training of the visual backbone and the fusion model.

(3) Adding More Pre-training Data. Although our approach adopts cross-modal pre-training, Table 9 shows that our pre-training data is much smaller than that of the recent LEMON (Hu et al. 2021a), BLIP (Li et al. 2022), and SimVLM (Wang et al. 2021). In the future, we plan to use more pre-training data to boost captioning quality.

This paper is available on arXiv under the CC BY 4.0 DEED license.