Review on Identifying the Semantics of Scientific Literature Content
Huang Hong, Chen Chong, Zhang Jingying
School of Government, Beijing Normal University, Beijing 100875
Abstract Identifying the semantics of the textual content of scientific literature reveals the research elements that a publication contains. The task is a form of fine-grained text mining and is essential for knowledge acquisition and utilization. This article reviews recent research on identifying the semantics of scientific literature content, with the aim of providing a comprehensive reference for subsequent studies. It first summarizes the existing semantic annotation models of literature content, then traces research on semantic identification at different granularities (i.e., chapters, sentences, and terms), illustrates typical applications, highlights existing problems, and suggests future research directions. The review addresses five questions: (1) Which semantic types of scientific literature content receive attention? (2) What granularity of text unit should be selected for semantic identification? (3) What identification approaches are available? (4) How should the identification results be evaluated? (5) What are the typical applications of semantic identification? Future work in this line of research should propose uniform standards for semantic types, enlarge the available training data sets, address multiple semantic types and the relations among them, and improve existing methods.
Received: 08 October 2021