Two-Stage Author Name Disambiguation Study Combining Rule-Based and Supervised Models
Chen Yifan (1, 2), Xie Ruixia (3), Yang Ning (1, 2), Hu Wei (1), Zhang Zhiqiang (1, 2)
1. National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu 610299
2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
3. School of Economics and Management, Tongji University, Shanghai 200092
Abstract: Author name disambiguation (AND) is a foundational task in fields such as information retrieval, information integration, and bibliometrics. Because a single disambiguation model rarely satisfies practical requirements, this study proposes a two-stage automated author name disambiguation framework (TSD-RS) that combines rule-based and supervised models. In the first stage, a dynamic threshold method is used to improve the rule-based model and thereby the preliminary disambiguation, and 12 rule-application orders are designed and compared for their effect on AND. In the second stage, a cluster network is constructed with the paper clusters formed in the preliminary disambiguation as nodes and the supervised model's predictions as edge weights; the InfoMap algorithm is then applied for community detection to refine the disambiguation iteratively. For this stage, four automated methods for constructing training sets (positive and negative sample pairs) and four supervised models (including large language models) are compared for their AND effectiveness. Experiments on three gold-standard datasets of varying scales show that TSD-RS performs best with Order5 as the rule sequence in the first stage and, in the second stage, the 1/2-shell method for positive sample extraction and the random-forest model. The resulting bF1 has a 95% confidence interval of 0.85 ± 0.04, an improvement over the baseline models.
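The second stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: `pair_score` is a hypothetical stand-in for the supervised model's same-author prediction, the `threshold` value is arbitrary, connected components (via union-find) are used as a simple substitute for InfoMap community detection, and bF1 is assumed to denote the B-cubed F1 score.

```python
from collections import defaultdict
from itertools import combinations


def merge_clusters(clusters, pair_score, threshold=0.5):
    """Merge stage-one clusters linked by a high predicted same-author score.

    clusters:   dict mapping cluster id -> set of paper ids (stage-one output).
    pair_score: hypothetical callable (papers_a, papers_b) -> float in [0, 1],
                standing in for the supervised model's prediction.
    Connected components replace InfoMap here (an assumption for brevity).
    """
    parent = {cid: cid for cid in clusters}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each pair of clusters is a candidate edge weighted by the model score.
    for a, b in combinations(clusters, 2):
        weight = pair_score(clusters[a], clusters[b])
        if weight >= threshold:
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for cid, papers in clusters.items():
        merged[find(cid)] |= papers
    return list(merged.values())


def b_cubed_f1(predicted, gold):
    """B-cubed F1 of a predicted clustering against a gold clustering.

    Both arguments are iterables of sets of paper ids covering the same papers.
    """
    pred_of = {p: c for c in map(frozenset, predicted) for p in c}
    gold_of = {p: c for c in map(frozenset, gold) for p in c}
    papers = list(gold_of)
    precision = sum(
        len(pred_of[p] & gold_of[p]) / len(pred_of[p]) for p in papers
    ) / len(papers)
    recall = sum(
        len(pred_of[p] & gold_of[p]) / len(gold_of[p]) for p in papers
    ) / len(papers)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: the "topic" labels are invented purely for illustration.
    topic = {"p1": "x", "p2": "x", "p3": "x", "p4": "y"}

    def toy_score(a, b):
        return 1.0 if {topic[p] for p in a} & {topic[p] for p in b} else 0.0

    stage_one = {"c1": {"p1", "p2"}, "c2": {"p3"}, "c3": {"p4"}}
    gold = [{"p1", "p2", "p3"}, {"p4"}]
    print(b_cubed_f1(stage_one.values(), gold))
    print(b_cubed_f1(merge_clusters(stage_one, toy_score), gold))
```

In this toy run the stage-one clusters are pure but fragmented, so merging raises recall without lowering precision. A community-detection method such as InfoMap would use the edge weights themselves rather than a hard cut, which matters when a few spurious high-scoring edges would otherwise chain unrelated clusters together.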
Received: 15 May 2025