Automatic Noise Reduction of Scientific Domain Document Sets Using Positive-Unlabeled Learning
Chen Guo, Yang Zeyu, Chen Jing, Shao Yu
Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
Abstract In the domain analysis of science and technology, a considerable proportion of unrelated literature (impurities) exists in the datasets constructed by mainstream methods, which weakens the reliability of the final analysis results. Noise reduction is therefore essential for removing these impurities. Performing automatic noise reduction on a domain document set without manual annotation is a prerequisite for any noise-reduction scheme to be universally applicable in practice at low cost. This study transforms the noise-reduction task into a classification problem rather than a clustering problem, while making full use of the characteristics of the original document set. We introduce positive-unlabeled (PU) learning, which can be conducted using the group of "absolutely positive samples" available in a domain dataset, to obtain reliable negative samples for the final classifiers to fit. Experiments were conducted on journal datasets from the Microsoft Academic Graph (MAG) in the fields of artificial intelligence, economics, and immunology. Beyond comparing the performance of different schemes, we constructed two benchmarks and introduced normalized discounted cumulative gain (NDCG) as an evaluation metric. The results demonstrate the effectiveness of our method in terms of noise-reduction gain, usability of the results, and document denoising in the context of scientific and technological information analysis.
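As a sketch of the two-step procedure described in the abstract, the code below shows one way PU learning can turn a set of "absolutely positive" seed documents into reliable negative samples and a final noise-reduction classifier. This is a minimal illustration, not the paper's implementation: it assumes TF-IDF features and logistic regression, and every identifier here (pu_denoise, texts, positive_idx, neg_quantile) is hypothetical.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def pu_denoise(texts, positive_idx, neg_quantile=0.1):
        # texts        -- all documents in the domain dataset (e.g., title/abstract strings)
        # positive_idx -- indices of the "absolutely positive" seed documents
        # neg_quantile -- assumed share of lowest-scoring unlabeled documents
        #                 to treat as reliable negatives
        X = TfidfVectorizer(max_features=20000).fit_transform(texts)
        is_pos = np.zeros(X.shape[0], dtype=bool)
        is_pos[positive_idx] = True

        # Step 1: fit a rough classifier on positives vs. all unlabeled
        # documents; its lowest-scoring unlabeled documents are taken as
        # reliable negatives.
        rough = LogisticRegression(max_iter=1000).fit(X, is_pos)
        scores = rough.predict_proba(X)[:, 1]
        cutoff = np.quantile(scores[~is_pos], neg_quantile)
        reliable_neg = ~is_pos & (scores <= cutoff)

        # Step 2: fit the final classifier on positives vs. reliable
        # negatives and score every document; low scores flag impurities.
        train = is_pos | reliable_neg
        final = LogisticRegression(max_iter=1000).fit(X[train], is_pos[train])
        return final.predict_proba(X)[:, 1]

Documents with low final scores are candidate impurities to be removed. Given manually labeled relevance judgments, the resulting ranking can be evaluated with NDCG, for example via sklearn.metrics.ndcg_score.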
Received: 18 December 2023