A Study on the Measurement Methods of Term Discriminative Capacity for Academic Resources
Wang Hao1,2, Tang Huihui1,2, Zhang Haichao1,2, Zhang Jin3, Zhang Zixuan1,2
1. School of Information Management, Nanjing University, Nanjing 210023
2. Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023
3. School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee 53201
Abstract: Improving the quality of indexing terms can effectively improve the retrieval efficiency of an IR system, but a term's inherent properties are sensitive to document length, which makes it difficult to measure term quality fully. To address this, this paper starts from an intrinsic property of terms, their discriminative power, and proposes the theory of term discriminative capacity (TDC) together with three calculation methods based on the bag-of-words model. A total of 900 records containing 4 entries were collected from three sub-databases of Web of Science as experimental data, in order to compute TDC at scale and observe how the three algorithms differ in practice. The experimental analysis identifies TDC-T as the best method for calculating term discriminative capacity: the algorithm is stable in many respects and is not affected by the document frequency (DF) value. It is therefore adopted as a new indicator of term quality and denoted simply as TDC. One limitation is that the A&HCI database selected in this study has fewer records, which may cause an imbalance in the calculation results of the other two fields.
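For readers unfamiliar with term discrimination, the following is a minimal sketch of the classic term discrimination value from the vector space model literature (Salton, Yang, and Yu), computed over a bag-of-words representation. It is not the TDC-T method of this paper, whose formula is not given in the abstract. The function names, the whitespace tokenizer, the raw term-frequency weights, and the toy documents in the usage note are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density(vectors):
    """Average similarity of every document to the collection centroid."""
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight / len(vectors)
    return sum(cosine(vec, centroid) for vec in vectors) / len(vectors)


def discrimination_value(docs, term):
    """Density without `term` minus density with it.

    A positive value means the term spreads documents apart
    (a good discriminator); a negative value means it pulls
    them together (a poor discriminator).
    """
    # Assumed preprocessing: lowercase, split on whitespace, raw counts.
    vectors = [Counter(doc.lower().split()) for doc in docs]
    without = [
        Counter({t: w for t, w in vec.items() if t != term})
        for vec in vectors
    ]
    return density(without) - density(vectors)
```

With three toy documents such as "retrieval of indexing terms", "retrieval of fuzzy ranking", and "retrieval of spam detection", a word that occurs in every document ("retrieval") gets a negative value, while a word that occurs in only one ("fuzzy") gets a positive value. This is the general intuition behind measuring a term's discriminative power, independent of the specific TDC variants compared in the paper.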
Received: 28 November 2018