Automatic Indexing of Large-Scale Subject Words
Han Hongqi1,2, Gui Jie1, Zhang Yunliang1,2, Weng Mengjuan1,2, Xue Shan1,2, Yue Lindong1,2
1. Institute of Scientific and Technical Information of China, Beijing 100038
2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, National Press and Publication Administration, Beijing 100038
|
|
Abstract Existing subject indexing methods can only extract terms that appear in the text; they cannot select, from a thesaurus of tens or hundreds of thousands of subject words, terms that are strongly semantically related to the text but do not occur in it. Multi-label text classification algorithms based on machine learning require training data for every label, which limits their application to subject indexing. To meet the demand for indexing massive document collections against large-scale subject vocabularies, this study proposes an automatic indexing method based on distributed word vectors: word vectors trained on a large-scale corpus are used to generate representation vectors of the same dimension for subject words and text documents, so that the semantic similarity between them can be computed directly. A mapping table between subject words and common words is constructed from the large-scale corpus, so that each text vector is compared with only a small number of strongly related subject word vectors, which significantly reduces computation and improves indexing efficiency. The resulting automatic indexing tool has been applied to subject indexing of nearly 100 million documents at satisfactory speed. Compared with Jieba keyword extraction, the subject words produced by the proposed method overlap less with author keywords, and the method achieves better indexing accuracy than Jieba after non-subject words are removed from the Jieba keywords.
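The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the word vectors, the mapping table, and the vocabulary below are toy values invented for the example (the paper trains distributed word vectors on a large-scale corpus and builds the mapping table from corpus statistics). The sketch shows the two key ideas: representing a document and a multi-word subject term as same-dimension vectors (here, the mean of their word vectors) so cosine similarity applies, and using a common-word→subject-word mapping table so only a few candidate subject terms are scored instead of the whole thesaurus.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values; the paper uses
# vectors trained on a large-scale corpus).
WORD_VECS = {
    "neural":   [0.9, 0.1, 0.0],
    "network":  [0.8, 0.2, 0.1],
    "deep":     [0.7, 0.3, 0.0],
    "learning": [0.6, 0.4, 0.1],
    "protein":  [0.0, 0.1, 0.9],
    "folding":  [0.1, 0.0, 0.8],
}

# Mapping table from common words to candidate subject terms, built offline.
# At indexing time only these candidates are scored, not the full thesaurus.
CANDIDATES = {
    "neural":  ["deep learning"],
    "network": ["deep learning"],
    "protein": ["protein folding"],
}

def text_vector(words):
    """Represent a text (or a multi-word subject term) as the mean of its word vectors."""
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    dim = len(next(iter(WORD_VECS.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def index_document(doc_words, top_k=1):
    # 1. Collect candidate subject terms via the mapping table (prunes the search).
    cands = {t for w in doc_words for t in CANDIDATES.get(w, [])}
    # 2. Score each candidate by cosine similarity of same-dimension vectors.
    doc_vec = text_vector(doc_words)
    scored = [(t, cosine(doc_vec, text_vector(t.split()))) for t in cands]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]
```

Note that `index_document(["neural", "network", "learning"])` assigns the subject term "deep learning" even though the word "deep" never occurs in the document, which is exactly the case extraction-based indexing cannot handle.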
|
Received: 07 February 2021
|
|
|
|
|
|
|