Essential Reference Measurements from the Perspective of Full-Text: Concept Definition, Index System, and Identification Model
Lin Gege¹, Hou Haiyan¹, Pan Yuxin¹, Liang Guoqiang², Hu Zhigang³
1. School of Public Administration and Policy, Dalian University of Technology, Dalian 116024
2. College of Economics and Management, Beijing University of Technology, Beijing 100124
3. Institute for Science, Technology and Society, South China Normal University, Guangzhou 510006
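Abstract: Identifying the essential references in citing documents is an important foundation for in-depth evaluation of scientific and technological achievements. To this end, this paper examines the measurement of essential references from a full-text perspective, covering concept definition, construction of an index system, and optimization of identification models, with the aim of providing a more precise tool for scientific evaluation. First, essential references are defined, and an identification index system is constructed that comprises 2 dimensions (bibliographic information and citation information), 8 sub-dimensions, and 33 citation feature indicators. Second, several machine learning models (such as random forest, support vector machine, and logistic regression) are used to select and optimize the citation feature indicators: their correlation and information gain are analyzed, 21 important indicators are retained, and the effectiveness of the identification models is validated. The results show that citation feature indicators based on citation information have higher importance and contribution in identifying essential references. The machine learning models perform well in essential reference identification; for random forest, support vector machine, and logistic regression in particular, the AUC (area under curve) of the ROC (receiver operating characteristic) curve exceeds 0.85 in every case, which demonstrates the efficiency and robustness of the models. The essential reference measurement method and identification models not only provide a more precise tool for scientific evaluation systems but also lay a solid foundation for further research in citation analysis.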
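The two quantities the abstract relies on, information gain for screening citation feature indicators and ROC AUC for judging a classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, labels, and scores below are hypothetical toy values, and the feature is assumed to be discrete so that information gain can be computed by simple counting.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = essential reference, 0 = non-essential.
is_essential = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical indicator: is the reference mentioned more than once in the text?
multi_mentioned = [1, 1, 0, 0, 0, 0, 1, 0]
# Hypothetical indicator that is only weakly related to the label.
weak_indicator = [0, 1, 0, 1, 0, 1, 0, 1]
# Hypothetical classifier scores for the same eight references.
model_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.6, 0.35]

ig_multi = information_gain(multi_mentioned, is_essential)
ig_weak = information_gain(weak_indicator, is_essential)
auc = roc_auc(is_essential, model_scores)

print(f"information gain (multi_mentioned): {ig_multi:.3f}")
print(f"information gain (weak_indicator):  {ig_weak:.3f}")
print(f"ROC AUC of the toy scores:          {auc:.3f}")
```

In a screening step of this kind, indicators would be ranked by information gain and the low-ranked ones dropped. The pairwise AUC above is quadratic in the number of samples, so a rank-based routine from an established library would be preferable on real data.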
Lin Gege, Hou Haiyan, Pan Yuxin, Liang Guoqiang, Hu Zhigang. Essential Reference Measurements from the Perspective of Full-Text: Concept Definition, Index System, and Identification Model[J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2024, 43(10): 1199-1212.