Automatic Caption Generation for Aerial Images: A Survey

Parag Jayant Mondhe, Manisha P. Satone, Gajanan K. Kharate

Abstract


Aerial images have long attracted attention from the research community. Generating a caption that comprehensively describes the content of an aerial image is a less studied but important task, with applications in agriculture, defence, disaster management, and many other areas. Although various approaches have been followed for caption generation on natural images, generating a caption for an aerial image remains challenging because of its special nature. The use of emerging techniques from the Artificial Intelligence (AI) and Natural Language Processing (NLP) domains has resulted in captions of acceptable quality for aerial images. However, much remains to be done to fully realize the potential of aerial image caption generation. This paper presents a detailed survey of the approaches researchers have followed for aerial image caption generation. The datasets available for experimentation, the criteria used for performance evaluation, and future directions are also discussed.

Keywords


aerial images; caption generation; description generation; remote sensing images; satellite images





 

Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
ISSN 2089-3272


This work is licensed under a Creative Commons Attribution 4.0 International License.
