Authors:
(1) Ning Wang, Huawei Inc.;
(2) Jiangrong Xie, Huawei Inc.;
(3) Hang Luo, Huawei Inc.;
(4) Qinglin Cheng, Huawei Inc.;
(5) Jihao Wu, Huawei Inc.;
(6) Mingbo Jia, Huawei Inc.;
(7) Linlin Li, Huawei Inc.
Table of Links
3 Methodology and 3.1 Model Architecture
4 Experiments
4.1 Datasets and Metrics and 4.2 Implementation Details
4.4 Inference on the Mobile Device and 4.5 State-of-the-art Comparison
4.4 Inference on the Mobile Device
Table 3 reports the FLOPs and parameters of each block in LightCap. Note that the ResNet50 backbone in CLIP is trained in half precision, so the visual encoder occupies 56.5MB of storage. Overall, our LightCap consumes a total of 112.5MB of storage, which is affordable for most mobile devices.
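As a quick sanity check on these storage figures, a block's size can be estimated as its parameter count times the bytes per parameter. The sketch below (ours, not from the paper) illustrates this arithmetic; the ~29.6M parameter count is a back-of-the-envelope figure implied by a 56.5MB half-precision encoder, not a number taken from Table 3.

```python
# Rough storage estimate for a model block: parameters * bytes per parameter.
# The parameter count below is a placeholder for illustration, not a figure
# reported in Table 3 of the paper.

def block_size_mb(num_params: float, bytes_per_param: int) -> float:
    """Storage size in megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_params * bytes_per_param / (1024 ** 2)

if __name__ == "__main__":
    # Hypothetical example: a ~29.6M-parameter visual encoder stored in
    # FP16 (2 bytes/param) needs roughly half the space of FP32.
    visual_encoder_params = 29.6e6
    print(f"FP32: {block_size_mb(visual_encoder_params, 4):.1f} MB")
    print(f"FP16: {block_size_mb(visual_encoder_params, 2):.1f} MB")  # ~56.5 MB
```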
We then test the inference latency of the LightCap model on a Huawei P40 smartphone equipped with a Kirin 990 chip. To isolate the model inference speed, we set the beam search size to 1. Our light model takes only about 188ms to process a single image on the mobile CPU, which meets real-world efficiency requirements.
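For readers who want to reproduce a comparable measurement, the following is a minimal, generic timing harness; it is not the authors' on-device benchmarking code, and the `caption_model.generate` call in the usage comment is hypothetical.

```python
import time
import statistics

def benchmark_latency(run_inference, num_warmup: int = 5, num_runs: int = 20) -> float:
    """Measure single-image captioning latency of `run_inference`, a zero-argument
    callable that performs one full forward pass (e.g., greedy decoding with beam
    size 1). Returns the median latency in milliseconds."""
    for _ in range(num_warmup):          # warm-up runs to stabilize caches
        run_inference()
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        run_inference()
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.median(timings)

# Hypothetical usage: wrap the exported captioning model in a closure.
# latency_ms = benchmark_latency(lambda: caption_model.generate(image, num_beams=1))
# print(f"median latency: {latency_ms:.1f} ms")
```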
4.5 State-of-the-art Comparison
Comparison on Model Size and Efficiency. In Table 4, we compare LightCap with state-of-the-art captioning methods in terms of model size and inference cost in FLOPs. Most existing pre-training methods such as VLP (Zhou et al. 2020), Oscar (Li et al. 2020b), and UNIMO (Li et al. 2020a) use Faster R-CNN as the feature extractor and BERT-base as the fusion model, yielding about 173M parameters and about 800G FLOPs. It is worth noting that the current performance leaders such as VinVL (Zhang et al. 2021a) and LEMON (Hu et al. 2021a) require more than 1000G FLOPs. As detailed in Section 4.4, the overall cost of our LightCap is only 9.8G FLOPs. Consequently, compared with recent popular image captioners, our LightCap saves more than 98% of the FLOPs.
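The quoted savings follow directly from the FLOPs figures above; a minimal check, using only the numbers stated in the text:

```python
# FLOPs savings relative to a typical detector + BERT-base captioner,
# using the figures quoted in the text (9.8G vs. roughly 800G FLOPs).
lightcap_gflops, baseline_gflops = 9.8, 800.0
savings = 1.0 - lightcap_gflops / baseline_gflops
print(f"FLOPs saved: {savings:.1%}")  # ~98.8%, i.e. "more than 98%"
```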
To the best of our knowledge, DistillVLM (Fang et al. 2021b) and MiniVLM (Wang et al. 2020a) are the representative lightweight image captioners in the literature. These methods design a tiny object detector called Eff-DET based on EfficientNet (Tan and Le 2019). Nevertheless, their fusion model (i.e., MiniLM (Wang et al. 2020b)) is still much larger than our TinyBERT4. As discussed in MiniVLM, changing the fusion model from MiniLM to TinyBERT4 causes the captioning performance to drop sharply (by about 10 CIDEr). Thanks to our designed concept extractor, cross-modal modulator, and ensemble head, the lightweight TinyBERT4 also works well in our framework.
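As an illustration of what an ensemble head on top of a tiny fusion model might look like, the sketch below averages the logits of several parallel vocabulary classifiers over TinyBERT4-sized features (hidden size 312, BERT vocabulary). This is our guess at the general idea, not the authors' exact implementation; the class name and dimensions are ours.

```python
import torch
import torch.nn as nn

class EnsembleCaptionHead(nn.Module):
    """Sketch of an ensemble prediction head: several parallel vocabulary
    classifiers over the fusion features, with their logits averaged.
    Illustrative only, not the paper's exact design."""

    def __init__(self, hidden_dim: int = 312, vocab_size: int = 30522, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, fusion_features: torch.Tensor) -> torch.Tensor:
        # fusion_features: (batch, seq_len, hidden_dim) from the tiny fusion model
        logits = torch.stack([head(fusion_features) for head in self.heads], dim=0)
        return logits.mean(dim=0)  # average the per-head predictions in logit space

# Hypothetical usage with TinyBERT4-sized features:
# head = EnsembleCaptionHead()
# logits = head(torch.randn(2, 20, 312))   # -> (2, 20, 30522)
```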
Evaluation on COCO. In Table 5, we present the performance of state-of-the-art captioning methods on the COCO Karpathy test split (Karpathy and Fei-Fei 2015). These approaches are generally trained with the cross-entropy loss and further optimized with CIDEr as a reinforcement learning reward. Previous captioners without model pre-training, such as BUTD, AoANet, and X-LAN, mostly use Faster R-CNN as the visual feature extractor. The proposed LightCap outperforms all previous pre-training-free algorithms.
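The CIDEr-reward optimization referred to here is commonly implemented as self-critical sequence training (SCST); the policy-gradient loss is sketched below under that assumption. The function name and tensor layout are ours, not the paper's.

```python
import torch

def self_critical_loss(sample_log_probs: torch.Tensor,
                       sample_rewards: torch.Tensor,
                       baseline_rewards: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss of self-critical sequence training (SCST), the
    usual way CIDEr is used as a reinforcement learning reward.

    sample_log_probs:  (batch,) summed log-probabilities of sampled captions
    sample_rewards:    (batch,) CIDEr scores of the sampled captions
    baseline_rewards:  (batch,) CIDEr scores of greedy-decoded captions (baseline)
    """
    advantage = sample_rewards - baseline_rewards   # reward above the greedy baseline
    return -(advantage.detach() * sample_log_probs).mean()

# Hypothetical usage:
# loss = self_critical_loss(log_probs, cider(sampled_captions), cider(greedy_captions))
```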
Recent “pre-training then fine-tuning” methods typically adopt a BERT model for cross-modal fusion. These methods struggle to achieve fast inference with their large visual backbones and heavyweight BERT models. Using similar pre-training data and the same cross-entropy optimization, our LightCap (125.8 CIDEr) is superior to the heavyweight OscarB (123.7 CIDEr) and UNIMOB (124.4 CIDEr). Compared with other lightweight captioning methods such as MiniVLM and DistillVLM, our LightCap has fewer parameters and FLOPs, yet surpasses them by a notable margin of about 5 CIDEr. Note that BLIP and LEMON collect large-scale, high-quality pre-training datasets containing 129 and 200 million image-text pairs (more than 20× larger than ours), respectively. We believe the proposed LightCap can be further improved with more pre-training data, which we leave as future work.
Evaluation on Nocaps. The nocaps benchmark (Agrawal et al. 2019) contains 15,100 images collected from OpenImages (Shao et al. 2019). We evaluate the proposed method on this dataset to assess the model's generalizability. Due to limited space, we only present the out-of-domain and overall performance in Table 6. Following the protocol of this benchmark, we train the LightCap model only on COCO-caption without additional pre-training. Our captioning model is much smaller than all the compared methods such as VIVO and ViTCap. It is also worth mentioning that our method surpasses the human CIDEr score and even slightly outperforms the strong VinVL method on the out-of-domain split, which can be largely attributed to the representational power of the CLIP feature and our designed concept extractor that retrieves novel concepts.
5 Conclusion
In this paper, we propose a lightweight image captioning approach for resource-limited devices. To unveil the potential of a capacity-limited tiny model, we design a visual concept extractor, a cross-modal modulator, and an ensemble head to improve the captioning quality. By virtue of sequential knowledge distillation and ensemble distillation, our LightCap exhibits competitive performance under a limited model capacity. Extensive experiments verify the excellent balance between performance and efficiency achieved by the proposed LightCap.
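For reference, a minimal sketch of the soft-label distillation loss that sequential and ensemble distillation typically build on is given below (standard Hinton-style knowledge distillation, not necessarily the paper's exact objective).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label knowledge distillation: KL divergence between the teacher's
    and student's temperature-softened word distributions. Ensemble distillation
    can reuse the same loss with teacher_logits replaced by the average of
    several teachers' logits. This is a generic sketch, not the paper's code."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 as in the standard formulation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```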
References
Agrawal, H.; Desai, K.; Wang, Y.; Chen, X.; Jain, R.; Johnson, M.; Batra, D.; Parikh, D.; Lee, S.; and Anderson, P. 2019. Nocaps: Novel object captioning at scale. In ICCV.
Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. Spice: Semantic propositional image caption evaluation. In ECCV.
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop.
Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. Uniter: Universal image-text representation learning. In ECCV.
Cornia, M.; Baraldi, L.; Fiameni, G.; and Cucchiara, R. 2021. Universal captioner: Long-tail vision-and-language model training through content-style separation. arXiv preprint arXiv:2111.12727.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dou, Z.-Y.; Xu, Y.; Gan, Z.; Wang, J.; Wang, S.; Wang, L.; Zhu, C.; Liu, Z.; Zeng, M.; et al. 2021. An Empirical Study of Training End-to-End Vision-and-Language Transformers. arXiv preprint arXiv:2111.02387.
Fang, Z.; Wang, J.; Hu, X.; Liang, L.; Gan, Z.; Wang, L.; Yang, Y.; and Liu, Z. 2021a. Injecting semantic concepts into end-to-end image captioning. arXiv preprint arXiv:2112.05230.
Fang, Z.; Wang, J.; Hu, X.; Wang, L.; Yang, Y.; and Liu, Z. 2021b. Compressing visual-linguistic model via knowledge distillation. arXiv preprint arXiv:2104.02096.
Fei, Z. 2022. Attention-Aligned Transformer for Image Captioning. In AAAI.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; and Wang, L. 2021a. Scaling up vision-language pre-training for image captioning. arXiv preprint arXiv:2111.12233.
Hu, X.; Yin, X.; Lin, K.; Wang, L.; Zhang, L.; Gao, J.; and Liu, Z. 2021b. Vivo: Surpassing human performance in novel object captioning with visual vocabulary pre-training. In AAAI.
Huang, L.; Wang, W.; Chen, J.; and Wei, X.-Y. 2019. Attention on attention for image captioning. In ICCV.
Huang, Z.; Zeng, Z.; Huang, Y.; Liu, B.; Fu, D.; and Fu, J. 2021. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In CVPR.
Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; and Ji, R. 2021. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In AAAI.
Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q. V.; Sung, Y.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.
Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; and Liu, Q. 2019. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351.
Karpathy, A.; and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1): 32–73.
Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv preprint arXiv:2201.12086.
Li, W.; Gao, C.; Niu, G.; Xiao, X.; Liu, H.; Liu, J.; Wu, H.; and Wang, H. 2020a. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.
Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.-W.; and Ji, R. 2021. Dual-level collaborative transformer for image captioning. In AAAI.
Mukherjee, S.; and Awadallah, A. 2020. XtremeDistil: Multi-stage distillation for massive multilingual models. arXiv preprint arXiv:2004.05686.
Ordonez, V.; Kulkarni, G.; and Berg, T. 2011. Im2text: Describing images using 1 million captioned photographs. NeurIPS.
Pan, Y.; Yao, T.; Li, Y.; and Mei, T. 2020. X-linear attention networks for image captioning. In CVPR.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
Qin, Y.; Du, J.; Zhang, Y.; and Lu, H. 2019. Look back and predict forward in image captioning. In CVPR.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2016. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE TPAMI, 39(6): 1137–1149.
Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; and Sun, J. 2019. Objects365: A large-scale, high-quality dataset for object detection. In ICCV.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL.
Shen, S.; Li, L. H.; Tan, H.; Bansal, M.; Rohrbach, A.; Chang, K.- W.; Yao, Z.; and Keutzer, K. 2021. How much can CLIP benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383.
Song, Z.; Zhou, X.; Mao, Z.; and Tan, J. 2021. Image captioning with context-aware auxiliary guidance. In AAAI.
Tan, H.; and Bansal, M. 2019. Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP.
Tan, M.; and Le, Q. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML.
Ultralytics. 2020. YOLOv5. https://github.com/ultralytics/yolov5.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS.
Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In CVPR.
Wang, J.; Hu, X.; Zhang, P.; Li, X.; Wang, L.; Zhang, L.; Gao, J.; and Liu, Z. 2020a. Minivlm: A smaller and faster vision-language model. arXiv preprint arXiv:2012.06946.
Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; and Zhou, M. 2020b. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.
Wang, Y.; Xu, J.; and Sun, Y. 2022. End-to-End Transformer Based Model for Image Captioning. In AAAI.
Wang, Z.; Yu, J.; Yu, A. W.; Dai, Z.; Tsvetkov, Y.; and Cao, Y. 2021. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.
Xu, H.; Yan, M.; Li, C.; Bi, B.; Huang, S.; Xiao, W.; and Huang, F. 2021. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804.
Yang, X.; Liu, Y.; and Wang, X. 2022. Reformer: The relational transformer for image captioning. In ACM MM.
Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; and Gao, J. 2021a. Vinvl: Revisiting visual representations in vision-language models. In CVPR.
Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; and Ji, R. 2021b. RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words. In CVPR.
Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; and Gao, J. 2020. Unified vision-language pre-training for image captioning and vqa. In AAAI.
This paper is available on arxiv under CC BY 4.0 DEED license.