Towards an Appropriate Scale of Datasets for Domain Bibliometrics: Empirical Study under Multiple Tasks
Chen Guo1,2, Wang Panting1, Wang Yuefen1
1. Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094; 2. Jiangsu Science and Technology Collaborative Innovation Center of Social Public Safety, Nanjing 210094
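Abstract: When scientific and technical intelligence analysis targets a specific domain, the concentration and scattering of the literature make it difficult to construct a complete document set. How large must a domain document set be to be reliable? The answer differs across analysis task scenarios. This paper designs experiments that jointly consider several common task scenarios: the size of the domain under analysis; the objects under analysis (subject categories, countries, institutions, keywords, citations, authors, and the co-occurrence relations of each); the number of top-ranked objects retained (for example, high-frequency words); and whether the results take ranking into account. Taking WoS (Web of Science) data for the field of "artificial intelligence" as an example, we sample the data at multiple scales and compute 4,800 fit-index values that measure how well the sampled subsets fit the full document set, thereby quantifying the document-set scale that different task scenarios require. The results show that analysis tasks involving subject and country categories yield fairly reliable results from very small document sets. Tasks involving authors place extremely high demands on document-set scale, so the full data set is necessary. Tasks involving institutions, keywords, and citations yield fairly reliable results once the document set reaches a certain scale, but that scale depends on several factors; in particular, co-occurrence analysis, retaining a larger number of top objects, and requiring ranked results each demand a larger document set.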
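The sampling experiment summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which fit indices were used, so the two measures here are assumptions chosen for illustration: the share of the full set's Top-k objects recovered by a sample (for tasks that ignore order) and a Spearman rank correlation over those Top-k objects (for tasks that require ranking). All function names, the `rate` parameter, and the toy data are hypothetical.

```python
import random
from collections import Counter


def item_counts(docs):
    """Document frequency of each object (e.g. keyword) across docs."""
    return Counter(item for doc in docs for item in set(doc))


def top_k(docs, k):
    """The k most frequent objects, ties broken alphabetically."""
    ranked = sorted(item_counts(docs).items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def overlap_fit(full_top, sample_top):
    """Assumed unranked fit: share of the full set's Top-k found in the sample's Top-k."""
    if not full_top:
        return 0.0
    return len(set(full_top) & set(sample_top)) / len(full_top)


def average_ranks(values):
    """Descending ranks (1 = largest value); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman_fit(full_docs, sample_docs, k):
    """Assumed ranked fit: Spearman correlation of the full set's Top-k,
    ranked by full-set frequency versus sample frequency."""
    items = top_k(full_docs, k)
    full_c, samp_c = item_counts(full_docs), item_counts(sample_docs)
    full_ranks = average_ranks([full_c[i] for i in items])
    samp_ranks = average_ranks([samp_c.get(i, 0) for i in items])
    return pearson(full_ranks, samp_ranks)


def sample_fit(full_docs, rate, k, seed=0):
    """Draw one random subset at the given sampling rate and score it."""
    rng = random.Random(seed)
    n = max(1, round(len(full_docs) * rate))
    sample = rng.sample(full_docs, n)
    return (overlap_fit(top_k(full_docs, k), top_k(sample, k)),
            spearman_fit(full_docs, sample, k))


# Toy corpus: each document is a list of keywords (hypothetical data).
docs = [["ai", "ml"], ["ai", "nlp"], ["ai", "ml"], ["cv"]]
unranked, ranked = sample_fit(docs, rate=1.0, k=3)
```

Repeating `sample_fit` over a grid of sampling rates, object types, and Top-k cutoffs would produce a table of fit values comparable in spirit to the study's, although the actual indices and sampling scheme may differ.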
Chen Guo, Wang Panting, Wang Yuefen. Towards an Appropriate Scale of Datasets for Domain Bibliometrics: Empirical Study under Multiple Tasks[J]. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2021, 40(8): 869-878.