A Study on the Measurement Methods of Term Discriminative Capacity for Academic Resources (面向学术资源的术语区分能力的测度方法研究)

doi:10.3772/j.issn.1000-0135.2019.10.008

Journal of the China Society for Scientific and Technical Information (情报学报)

2019, Vol. 38, Issue 10: 1078-1091

Information Analysis Methods and Technology



A Study on the Measurement Methods of Term Discriminative Capacity for Academic Resources

Wang Hao^1,2, Tang Huihui^1,2, Zhang Haichao^1,2, Zhang Jin^3, Zhang Zixuan^1,2

1. School of Information Management, Nanjing University, Nanjing 210023
2. Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023
3. School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee 53201


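Abstract: Improving how the quality of index terms is measured can effectively raise the retrieval efficiency of IR systems, but the inherent attributes of terms are easily affected by document length, which makes it difficult to assess term quality comprehensively. To address this, the paper starts from the intrinsic discriminative power of terms, borrows the basic idea of the bag-of-words model, and proposes the concept of term discriminative capacity (TDC) together with three different methods for computing it. As experimental data, the paper collects 900 records, each containing four bibliographic fields, from three sub-databases of Web of Science, in order to compute TDC at scale and to observe how the three algorithms differ in practice. The experimental analysis concludes that TDC-T is the best method for computing term discriminative capacity: it performs stably in several respects and is not affected by DF values, so it can serve as a new indicator of term quality, denoted TDC. However, the A&HCI database selected for this study contains relatively few records, which may unbalance the computed results for the other two domains.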
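As background for the idea of scoring a term by how well it separates documents, the following is a minimal sketch of the classic density-based term discrimination value computed on bag-of-words vectors. It is not the TDC-T algorithm proposed in this paper, whose formulas are not given on this page. The function names (`bag_of_words`, `space_density`, `discrimination_values`) and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Turn tokenized documents into term-frequency vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def space_density(vectors):
    """Average cosine similarity between each document and the collection centroid."""
    n = len(vectors)
    centroid = [sum(column) / n for column in zip(*vectors)]
    return sum(cosine(v, centroid) for v in vectors) / n


def discrimination_values(docs):
    """Score each term as (density without the term) - (density with all terms).

    A positive score means removing the term packs documents closer together,
    so the term was helping to tell them apart.
    """
    vocab, vectors = bag_of_words(docs)
    base = space_density(vectors)
    values = {}
    for k, term in enumerate(vocab):
        reduced = [v[:k] + v[k + 1:] for v in vectors]
        values[term] = space_density(reduced) - base
    return values


if __name__ == "__main__":
    # Toy collection (an assumption for illustration, not data from the paper).
    toy_docs = [["the", "index"], ["the", "density"], ["the", "ontology"]]
    for term, value in sorted(discrimination_values(toy_docs).items()):
        print(term, "good discriminator" if value > 0 else "poor discriminator")
```

In this toy collection, "the" occurs in every document and receives a negative score, while "index", "density", and "ontology" each occur in one document and receive positive scores. Recomputing the density once per term is the simplest formulation and becomes costly on large vocabularies.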


Keywords: index term, bag-of-words model, term discriminative capacity, term space density, term quality evaluation

Received: 2018-11-28

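Funding: National Natural Science Foundation of China, Young Scientists Fund project "Measurement and Analysis of TSD and TDC for Academic Resources" (71503121); the "Jiangsu Young Social Science Talents" talent training program; the "Nanjing University Zhongying Young Scholars" talent training program.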

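About the author: Wang Hao, male, born in 1981, PhD, doctoral supervisor. His main research interests include natural language processing, data mining applications, and ontology learning.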

Cite this article:

Wang Hao, Tang Huihui, Zhang Haichao, Zhang Jin, Zhang Zixuan. A Study on the Measurement Methods of Term Discriminative Capacity for Academic Resources[J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2019, 38(10): 1078-1091.

Link to this article:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2019.10.008 or https://qbxb.istic.ac.cn/CN/Y2019/V38/I10/1078
