A Model of Author Name Disambiguation Based on the Strategy of Targeting Precision before Recall
Shen Zhe1, Wang Yi1, Ju Xiufang2, Cheng Ying1
1. School of Information Management, Nanjing University, Nanjing 210023
2. Institute for Chinese Social Sciences Research and Assessment, Nanjing University, Nanjing 210093
Abstract A complete and accurate record of each scholar's academic output is the fundamental data for bibliometrics and research evaluation. Existing author name disambiguation (AND) techniques do not yet meet the demands of practical application. This paper therefore proposes a two-step, rule-based AND model for high-level scientific talents. The model exploits the high precision of rule-based methods and adopts a strategy of targeting precision before recall. Because external data on high-level researchers (resumes, representative works, and research interests) can feasibly be collected, the method can draw on more features than would otherwise be available. The method was tested on data from the National Science Fund for Distinguished Young Scholars. The experimental results showed that it performed well in both precision and recall: the F1 scores were 0.93 and 0.95 on the two feature sets, both clearly better than the baseline model.
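The precision-before-recall strategy can be illustrated with a minimal sketch. The abstract does not state the actual rules, so everything here is an assumption made for illustration: the `Paper` and `Profile` fields, the strict rules in step 1 (a known representative work or a matching affiliation), the relaxed rules in step 2 (a shared coauthor or overlapping keywords), and the `min_shared_keywords` threshold are all hypothetical. Only the F1 formula, the harmonic mean of precision and recall, is standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    paper_id: str
    affiliations: frozenset
    coauthors: frozenset
    keywords: frozenset


@dataclass(frozen=True)
class Profile:
    """External data about the target scholar (hypothetical fields)."""
    affiliations: frozenset        # e.g. taken from a resume
    representative_ids: frozenset  # known representative works
    interests: frozenset           # stated research interests


def strict_match(paper, profile):
    """Step 1 rule (assumed): accept only on strong evidence."""
    return (paper.paper_id in profile.representative_ids
            or bool(paper.affiliations & profile.affiliations))


def disambiguate(papers, profile, min_shared_keywords=2):
    """Return (seed, recovered): high-precision matches, then recall gains."""
    # Step 1: build a high-precision seed set with strict rules.
    seed = [p for p in papers if strict_match(p, profile)]
    seed_ids = {p.paper_id for p in seed}

    # Evidence learned from the seed set widens the net in step 2.
    known_coauthors = set().union(*(p.coauthors for p in seed))
    topics = set(profile.interests).union(*(p.keywords for p in seed))

    # Step 2: recover remaining papers with relaxed rules (assumed).
    recovered = [
        p for p in papers
        if p.paper_id not in seed_ids
        and (p.coauthors & known_coauthors
             or len(p.keywords & topics) >= min_shared_keywords)
    ]
    return seed, recovered


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In this sketch the second step only ever adds papers supported by evidence from the seed set or the profile, so errors in step 1 are the main risk to precision; that is why the step 1 rules are kept strict.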
Received: 08 March 2021