Two-Stage Author Name Disambiguation Study Combining Rule-Based and Supervised Models
Chen Yifan1,2, Xie Ruixia3, Yang Ning1,2, Hu Wei1, Zhang Zhiqiang1,2
1. National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu 610299; 2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190; 3. School of Economics and Management, Tongji University, Shanghai 200092
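Abstract: Author name disambiguation (AND) is foundational, enabling work for academic research in fields such as information retrieval, information integration, and bibliometrics. Because a single disambiguation model struggles to meet practical disambiguation needs, this paper proposes TSD-RS (two-stage disambiguation framework for integrating rule-based and supervisory model), a two-stage automated AND framework that combines a rule-based model with a supervised model. In the first stage, the rule-based model is improved with a dynamic threshold method to raise preliminary disambiguation performance; on this basis, 12 orders of rule application are designed and their effects on AND are compared. In the second stage, an inter-cluster network is built with the paper clusters formed by preliminary disambiguation as nodes and the supervised model's predictions as edge weights, and the InfoMap algorithm partitions this network into communities to carry out a second, iterative round of disambiguation. Within this stage, four automated methods for constructing training sets (positive and negative sample pairs) and four supervised models (including large language models) are compared for their AND performance. Experiments on three gold-standard datasets of different sizes show that TSD-RS disambiguates best when the first-stage rule order is Order5, the second-stage positive-sample extraction method is 1/2-shell, and the supervised model is a random forest; in this configuration the 95% confidence interval of bF1 is 0.85±0.04, a clear improvement over the baseline model.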
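The second stage can be illustrated with a minimal sketch, offered under several assumptions rather than as the authors' implementation. The function `merge_clusters`, the toy `score_pair` scorer, the mean aggregation of pairwise predictions into one edge weight, and the `threshold` parameter are all hypothetical. The abstract uses InfoMap for community detection; to keep the example dependency-free, thresholded connected components (union-find) stand in for it here.

```python
from collections import defaultdict
from itertools import combinations


def merge_clusters(clusters, score_pair, threshold=0.5):
    """Treat first-stage clusters as nodes and merge those that look like one author.

    clusters:   dict mapping cluster id -> list of paper ids
    score_pair: callable(paper_a, paper_b) -> probability of same author
                (a trained supervised model would play this role)
    """
    # Edge weight between two clusters: mean predicted score over all
    # cross-cluster paper pairs (the aggregation rule is an assumption).
    edges = {}
    for a, b in combinations(sorted(clusters), 2):
        scores = [score_pair(p, q) for p in clusters[a] for q in clusters[b]]
        edges[(a, b)] = sum(scores) / len(scores)

    # Stand-in for InfoMap: union clusters joined by a sufficiently heavy edge.
    parent = {c: c for c in clusters}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for (a, b), weight in edges.items():
        if weight >= threshold:
            parent[find(b)] = find(a)

    # Collect the papers of every merged group.
    merged = defaultdict(list)
    for c, papers in clusters.items():
        merged[find(c)].extend(papers)
    return edges, sorted(sorted(group) for group in merged.values())


# Toy data: each paper is described only by its co-author set.
coauthors = {"p1": {"Li", "Wu"}, "p2": {"Li"}, "p3": {"Zhao"}, "p4": {"Wu", "Li"}}
clusters = {"c1": ["p1", "p2"], "c2": ["p4"], "c3": ["p3"]}


def score_pair(p, q):
    # Hypothetical scorer: 1.0 if the two papers share a co-author, else 0.0.
    return 1.0 if coauthors[p] & coauthors[q] else 0.0


edges, merged = merge_clusters(clusters, score_pair)
print(merged)
```

In this toy run, clusters `c1` and `c2` share co-authors and are merged, while `c3` stays separate. Replacing the union-find step with a real community detection algorithm would let edge weights, rather than a hard threshold, decide the partition.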