1 钟丽萍, 冷伏海, 罗世猛. 情报研究有效性的影响因素分析[J]. 情报理论与实践, 2013, 36(7): 6-9. 2 Borgman C L. Big data, little data, no data: scholarship in the networked world[M]. Cambridge: The MIT Press, 2015. 3 化柏林, 武夷山. 大数据更需要先清洗[J]. 情报学报, 2013, 32(6): 561. 4 刘自强, 岳丽欣, 王效岳, 等. 主题演化视角下的国际情报学研究热点与前沿分析[J]. 图书馆, 2017(3): 14-22. 5 侯剑华, 杨秀财, 周莉娟. 国际图书情报领域研究的前沿主题及其演化趋势分析[J]. 图书情报工作, 2016, 60(13): 82-90. 6 Shu F, Julien C A, Zhang L, et al. Comparing journal and paper level classifications of science[J]. Journal of Informetrics, 2019, 13(1): 202-225. 7 Shu F, Dinneen J D, Asadi B, et al. Mapping science using library of congress subject headings[J]. Journal of Informetrics, 2017, 11(4): 1080-1094. 8 Najmi A, Rashidi T H, Abbasi A, et al. Reviewing the transport domain: an evolutionary bibliometrics and network analysis[J]. Scientometrics, 2017, 110(2): 843-865. 9 Lu C, Bu Y, Dong X L, et al. Analyzing linguistic complexity and scientific impact[J]. Journal of Informetrics, 2019, 13(3): 817-829. 10 陈果, 邵雨, 王曰芬. 科技领域情报分析中文献集构造方式比较研究: 一致性与可靠性问题[J]. 情报学报, 2020, 39(10): 1034-1045. 11 沈艳红, 张娣. 文献计量分析中的数据准备工作研究[J]. 图书馆建设, 2012(5): 90-92. 12 冯璐, 冷伏海. 基于领域分析需求和目标的领域分析数据集界域研究[J]. 图书情报工作, 2009, 53(24): 51-54. 13 Sebastiani F. Machine learning in automated text categorization[J]. ACM Computing Surveys, 2002, 34(1): 1-47. 14 王曰芬, 章成志, 张蓓蓓, 等. 数据清洗研究综述[J]. 现代图书情报技术, 2007(12): 50-56. 15 蒋勋, 刘喜文. 大数据环境下面向知识服务的数据清洗研究[J]. 图书与情报, 2013(5): 16-21. 16 Chu X, Ilyas I F. Qualitative data cleaning[J]. Proceedings of the VLDB Endowment, 2016, 9(13): 1605-1608. 17 Aggarwal C C. Outlier analysis[M]. Cham: Springer, 2013. 18 Aggarwal C C, Zhai C X. Mining text data[M]. New York: Springer, 2012. 19 Pooja K M, Mondal S, Chandra J. A graph combination with edge pruning-based approach for author name disambiguation[J]. Journal of the Association for Information Science and Technology, 2020, 71(1): 69-83. 20 叶焕倬, 吴迪. 相似重复记录清理方法研究综述[J]. 现代图书情报技术, 2010, 26(9): 56-66. 21 Cheng J K, Mai X D, Wang S N. Research on abnormal data mining algorithm based on ICA[J]. Cluster Computing, 2019, 22(Suppl 2): 3613-3619. 22 Hittawe M M, Afzal S, Jamil T, et al. Abnormal events detection using deep neural networks: application to extreme sea surface temperature detection in the Red Sea[J]. Journal of Electronic Imaging, 2019, 28(2): 021012. 23 冯立伟, 张成, 李元. 基于统计模量和局部近邻标准化的局部离群因子故障检测方法[J]. 计算机应用, 2018, 38(4): 965-970. 24 Kieu T, Yang B, Guo C J, et al. Outlier detection for time series with recurrent autoencoder ensembles[C]// Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2019: 2725-2732. 25 何劲, 王曰芬, 傅柱. 专题情报研究中的数据集构造比较研究[J]. 情报理论与实践, 2023, 46(8): 175-181. 26 Zhou Z H. A brief introduction to weakly supervised learning[J]. National Science Review, 2018, 5(1): 44-53. 27 van Engelen J E, Hoos H H. A survey on semi-supervised learning[J]. Machine Learning, 2020, 109(2): 373-440. 28 Fazakis N, Karlos S, Kotsiantis S, et al. A multi-scheme semi-supervised regression approach[J]. Pattern Recognition Letters, 2019, 125: 758-765. 29 Qin Y, Ding S F, Wang L J, et al. Research progress on semi-supervised clustering[J]. Cognitive Computation, 2019, 11(5): 599-612. 30 Zhang D Q, Zhou Z H, Chen S C. Semi-supervised dimensionality reduction[C]// Proceedings of the 2007 SIAM International Conference on Data Mining. Philadelphia: Society for Industrial and Applied Mathematics, 2007: 629-634. 31 Liu B, Dai Y, Li X L, et al. Building text classifiers using positive and unlabeled examples[C]// Proceedings of the Third IEEE International Conference on Data Mining. Piscataway: IEEE, 2003: 179-186. 32 任亚峰, 姬东鸿, 张红斌, 等. 基于PU学习算法的虚假评论识别研究[J]. 计算机研究与发展, 2015, 52(3): 639-648. 33 Sch?lkopf B, Platt J C, Shawe-Taylor J, et al. Estimating the support of a high-dimensional distribution[J]. Neural Computation, 2001, 13(7): 1443-1471. 34 Jaskie K, Spanias A. Positive unlabeled learning[M]. Cham: Springe, 2022. 35 Patil R, Boit S, Gudivada V, et al. A survey of text representation and embedding techniques in NLP[J]. IEEE Access, 2023, 11: 36120-36146. 36 Joulin A, Grave E, Bojanowski P, et al. fastText.zip: compressing text classification models[OL]. (2016-12-12). https://arxiv.org/pdf/1612.03651. 37 Le Q, Mikolov T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on Machine Learning. JMLR.org, 2014: Ⅱ-1188 - Ⅱ-1196. 38 Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186. 39 Sinha A, Shen Z H, Song Y, et al. An overview of microsoft academic service (MAS) and applications[C]// Proceedings of the 24th International Conference on World Wide Web. New York: ACM Press, 2015: 243-246. 40 Chen G, Chen J, Shao Y, et al. Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning[J]. Scientometrics, 2023, 128(2): 1187-1204. 41 陈果, 王盼停, 王曰芬. 文献集规模对科技领域情报分析的影响: 多种任务场景下的实证分析[J]. 情报学报, 2021, 40(8): 869-878. 42 J?rvelin K, Kek?l?inen J. Cumulated gain-based evaluation of IR techniques[J]. ACM Transactions on Information Systems, 2002, 20(4): 422-446. 43 陈晶. 领域科技文献集自动降噪研究[D]. 南京: 南京理工大学, 2022. 44 陈果, 许天祥. 基于主动学习的科技论文句子功能识别研究[J]. 数据分析与知识发现, 2019, 3(8): 53-61.