Deep Neural Network Language Model Based on a CNN and LSTM Hybrid Architecture
Wang Yi, Xie Juan, Cheng Ying |
School of Information Management, Nanjing University, Nanjing 210023 |
Abstract: Language modeling is one of the most important problems in natural language processing. It serves as a bridge that allows computers to recognize and comprehend human language, and its progress is a marker of the development of artificial intelligence. Language models are widely used in speech recognition, machine translation, information retrieval, and knowledge mapping. With the rapid advance of algorithms and hardware, language models have evolved from statistical models to neural network models and then to deep neural network models; the broad adoption of deep learning has also made language modeling larger in scale, more complex, and more expensive to train. This paper combines personalized input, convolutional neural network (CNN) encoding, and a union-gate mechanism with long short-term memory (LSTM) units to improve the language model. We call this gated integration of CNN and LSTM the Gated CLSTM. We implemented the Gated CLSTM architecture with the deep learning framework TensorFlow and adopted several classical optimization techniques, including noise contrastive estimation and a recurrent projection layer. We evaluated the Gated CLSTM on an open, large-scale corpus and trained a single-layer model and a three-layer model to observe how network depth influences performance. On four GPUs, the single-layer model reduced perplexity to 42.1 after 4 days of training, and the three-layer model reduced perplexity to 33.1 after 6 days. Compared with several classical benchmark models, the Gated CLSTM achieves significant improvements when hardware cost, training time, and perplexity are considered together.
Received: 25 June 2017
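
The abstract does not include implementation details, but the following minimal sketch gives a concrete picture of a Gated CLSTM-style block. It assumes a design in which a causal CNN encodes local n-gram features of the embedded tokens, an LSTM carries long-range context, and a learned sigmoid ("union") gate blends the two streams at every time step. All layer sizes, kernel widths, and the exact fusion point are illustrative assumptions rather than the authors' reported configuration, and the noise contrastive estimation loss and recurrent projection layer mentioned in the abstract are omitted for brevity.

import tensorflow as tf  # assumes TensorFlow 2.x with the built-in Keras API

VOCAB_SIZE = 10000   # toy vocabulary size (assumption; the paper uses a large open corpus)
EMBED_DIM = 128      # illustrative embedding width
HIDDEN_DIM = 256     # illustrative hidden width

tokens = tf.keras.Input(shape=(None,), dtype="int32")            # word ids
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)

# CNN stream: a causal 1-D convolution extracts local n-gram features
# of the embedded tokens ("CNN coding").
conv = tf.keras.layers.Conv1D(HIDDEN_DIM, kernel_size=3,
                              padding="causal", activation="tanh")(emb)

# LSTM stream: recurrent state carries long-range sequential context.
lstm = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(emb)

# Union gate (assumed form): a sigmoid gate decides, per time step and
# per dimension, how much of the CNN features versus the LSTM state to keep.
gate = tf.keras.layers.Dense(HIDDEN_DIM, activation="sigmoid")(
    tf.keras.layers.Concatenate()([conv, lstm]))
one_minus_gate = tf.keras.layers.Lambda(lambda g: 1.0 - g)(gate)
fused = tf.keras.layers.Add()([
    tf.keras.layers.Multiply()([gate, conv]),
    tf.keras.layers.Multiply()([one_minus_gate, lstm]),
])

# Next-word prediction head. In the paper the softmax is trained with noise
# contrastive estimation and combined with a recurrent projection layer;
# here a plain dense layer stands in for that machinery.
logits = tf.keras.layers.Dense(VOCAB_SIZE)(fused)
model = tf.keras.Model(tokens, logits)
model.summary()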