Diachronic Semantic Mining and Visualization of Chinese Words: A Knowledge Discovery Perspective
Pan Jun¹, Wu Zongda²,³
1. Department of Big Data Science, School of Science, Zhejiang University of Science and Technology, Hangzhou 310023
2. Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000
3. School of Information Management, Nanjing University, Nanjing 210093
|
|
Abstract Mining knowledge from diachronic shifts in word semantics has become an increasingly important problem in the temporal analysis of language. To this end, this paper designs a scalable framework for knowledge mining over diachronic corpora, built on a loosely coupled, service-oriented, configurable architecture. The bottom layer of the framework provides data-level services such as data cleansing, data normalization, and diachronic word vector learning. The middle layer defines customized data extraction strategies and user interface generation through configuration files in XML format. The top layer composes these services to fulfill specific requirements of knowledge discovery and visualization. This study also implements the framework on the People's Daily corpus, focusing on word semantic shifts, and identifies possible applications of diachronic word vectors in digital humanities and social computing research. The proposed framework and its implementation are highly scalable: they can serve as a basis for researchers to further develop applications for mining diachronic word semantic knowledge, and they can be extended to other diachronic corpora.
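A common building block for the diachronic word vector learning mentioned above is to train separate embeddings per time slice and align them into a shared space before comparing words across time. The sketch below is an illustrative assumption, not the paper's actual implementation: it aligns two time slices with orthogonal Procrustes and scores each word's semantic shift as the cosine distance between its aligned vectors.

```python
# Minimal sketch (an assumption for illustration, not the authors' code):
# align word vectors from two time slices with orthogonal Procrustes,
# then measure each word's semantic shift as cosine distance.
import numpy as np

def procrustes_align(A, B):
    """Return A rotated by the orthogonal W minimizing ||A @ W - B||_F.

    Rows of A and B are vectors for the same (shared) vocabulary,
    one matrix per time slice.
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    W = U @ Vt  # closest orthogonal map from slice A's space to slice B's
    return A @ W

def semantic_shift(A, B):
    """Per-word cosine distance between aligned slice A and slice B."""
    A_aligned = procrustes_align(A, B)
    num = (A_aligned * B).sum(axis=1)
    den = np.linalg.norm(A_aligned, axis=1) * np.linalg.norm(B, axis=1)
    return 1.0 - num / den

# Toy example: 3 words in a 2-d space; the second slice is a pure rotation
# of the first, so after alignment every word's shift should be ~0.
theta = 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = A @ R
shifts = semantic_shift(A, B)
```

In practice the per-slice embeddings would come from a model such as word2vec trained on each period of the corpus, and large shift scores would flag candidate words for the framework's visualization layer.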
|
Received: 28 September 2020
|
|
|
|
|
|
|