A Study on the Measurement Methods of Term Discriminative Capacity for Academic Resources
Wang Hao1,2, Tang Huihui1,2, Zhang Haichao1,2, Zhang Jin3, Zhang Zixuan1,2
1. School of Information Management, Nanjing University, Nanjing 210023
2. Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023
3. School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee 53201
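Abstract: Improving the methods used to measure the quality of index terms can effectively raise the retrieval efficiency of IR systems, but the inherent attributes of terms are easily affected by document length, which makes it difficult to measure term quality comprehensively. To address this, starting from the intrinsic discriminative nature of terms and drawing on the basic idea of the bag-of-words model, this paper proposes the theory of term discriminative capacity (TDC) together with three different methods for computing it. We also collected 900 records, each containing four bibliographic fields, from three sub-databases of Web of Science as experimental data, in order to compute TDC at scale and to observe how the three algorithms differ in practice. Experimental analysis shows that the best method for computing term discriminative capacity is TDC-T: the algorithm performs stably in several respects and is unaffected by document frequency (DF) values, so it can serve as a new indicator of term quality, denoted TDC. However, the A&HCI database selected for this study contains relatively few records, which may unbalance the computed results for the other two domains.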
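The abstract does not state the formulas behind the three TDC variants, so the sketch below does not reproduce TDC-T or either of the other two methods. It only illustrates the general idea of scoring a term's discriminative power over bag-of-words vectors, using the classical centroid-based discrimination value (document-space density without the term minus density with it). The function names, the toy records, and the choice of cosine similarity are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def density(docs):
    """Average similarity of each document to the collection centroid.

    The centroid is kept as a sum rather than a mean, because cosine
    similarity is invariant to scaling.
    """
    centroid = Counter()
    for d in docs:
        centroid.update(d)
    return sum(cosine(d, centroid) for d in docs) / len(docs)


def discrimination_value(docs, term):
    """Density without `term` minus density with it.

    Positive: removing the term packs documents closer together, so the
    term helps tell them apart. Negative: the term makes documents look
    alike, so it is a poor discriminator.
    """
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)


# Hypothetical toy records standing in for bibliographic fields.
records = [
    "term weighting retrieval",
    "term indexing quality",
    "term clustering density",
]
docs = [Counter(r.split()) for r in records]

for t in ("term", "retrieval"):
    print(t, round(discrimination_value(docs, t), 3))
```

In this toy collection the word that occurs in every record receives a negative score, while a word unique to one record receives a positive one. That is the kind of behaviour a discriminative-capacity measure is expected to capture.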
Wang Hao, Tang Huihui, Zhang Haichao, Zhang Jin, Zhang Zixuan. A Study on the Measurement Methods of Term Discriminative Capacity for Academic Resources[J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2019, 38(10): 1078-1091.