基于表示学习的跨模态检索模型与特征抽取研究综述

doi:10.3772/j.issn.1000-0135.2018.04.008

情报学报, 2018, Vol. 37, Issue 4: 422-435


A Review of the Cross-Modal Retrieval Model and Feature Extraction Based on Representation Learning

Li Zhiyi, Huang Zifeng, Xu Xiaomian

Economic & Management College of South China Normal University, Guangzhou 510006


Abstract Representation learning, and deep learning in particular, has attracted wide attention and has been applied in speech recognition, image analysis, and natural language processing. It not only advances research and development in artificial intelligence but also pushes enterprises to consider new business and profit models. This paper reviews these studies with the aim of forming a complete overview of the topic. Drawing on a survey of the domestic and international literature, it summarizes research on cross-modal retrieval and feature extraction based on representation learning along two dimensions: information extraction and representation, and cross-modal system modeling. It first summarizes five traditional representation learning algorithms: the autoencoder, sparse coding, the restricted Boltzmann machine, deep belief networks, and convolutional neural networks. It then sums up the state of research on deep learning-based cross-modal modeling in terms of the shared-layer relationship between modalities, the representation space, and the correlation between modalities. Finally, it summarizes the evaluation metrics for cross-modal retrieval. The study finds that existing retrieval research concentrates on single-modal information retrieval, in which the query and the candidate set belong to the same modality, whereas cross-modal retrieval is largely limited to aligned data in two modalities, images and text. Future research should extend retrieval to audio, video, images, text, and other multimodal data, and should build deeper multimodal retrieval models and feature extraction algorithms to achieve retrieval across three or more modalities. In addition, evaluation metrics for multimodal retrieval systems need to be established.

Key words: representation learning; cross-modal retrieval; feature extraction; model; review

Received: 03 December 2017


Cite this article:

Li Zhiyi, Huang Zifeng, Xu Xiaomian. A Review of the Cross-Modal Retrieval Model and Feature Extraction Based on Representation Learning[J]. 情报学报, 2018, 37(4): 422-435.

URL:

https://qbxb.istic.ac.cn/EN/10.3772/j.issn.1000-0135.2018.04.008 OR https://qbxb.istic.ac.cn/EN/Y2018/V37/I4/422

