Generative and Hierarchical Classification of Literature Based on Fine-tuned Large Language Models
Hu Zhongyi1,2, Shui Diancheng1, Wu Jiang1,2
1. School of Information Management, Wuhan University, Wuhan 430072
2. Center for E-commerce Research and Development, Wuhan University, Wuhan 430072
|
|
Abstract The automatic classification and indexing of literature facilitate its efficient organization, storage, arrangement, and retrieval. Previous studies have primarily used discriminative models to automatically identify the shallow categories of literature but have struggled with deep category classification. This study therefore transforms the hierarchical classification of literature into the task of generating hierarchical category labels and proposes a generative hierarchical classification and indexing framework based on a large language model (LLM). The framework first expresses the hierarchical classification indices of literature as natural-language labels with interpretations, then applies parameter-efficient fine-tuning techniques to perform supervised fine-tuning of the LLM. The fine-tuned LLM directly generates hierarchical classification labels for a document, and the corresponding Chinese Library Classification indices are obtained via label mapping. Data from three disciplines, namely economics, medicine and health, and industrial technology, are used to evaluate the proposed model. Experimental results show that supervised fine-tuning effectively improves the understanding and reasoning abilities of general-purpose LLMs for literature classification and indexing, and that LLMs achieve better classification performance than traditional discriminative models. Integrating the abstracts, titles, and keywords of the literature further improves the classification performance of the fine-tuned LLMs. Among Baichuan2 and Qwen1.5 models of different parameter sizes, the fine-tuned Qwen1.5-14B-Chat model performed best, achieving 98% accuracy on first-level categories and 80% accuracy on the most challenging fifth-level categories. An analysis of typical examples demonstrates that the fine-tuned Qwen1.5-14B-Chat model also has error-correction capabilities.
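To make the framework concrete, the following is a minimal sketch of the three steps the abstract describes: fusing title, keywords, and abstract into one instruction, parameter-efficient supervised fine-tuning with LoRA adapters, and mapping the generated label path back to Chinese Library Classification (CLC) notation. It assumes the Hugging Face transformers and peft libraries and the open Qwen/Qwen1.5-14B-Chat checkpoint; the prompt template, the ">" label separator, the LoRA hyperparameters, and the LABEL_TO_CLC table are illustrative assumptions, not the authors' released artifacts.

```python
# A minimal sketch (not the authors' released code) of the generative
# indexing pipeline, assuming the Hugging Face `transformers` and `peft`
# libraries and the open Qwen1.5-14B-Chat checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "Qwen/Qwen1.5-14B-Chat"  # best-performing base model in the experiments

def build_prompt(title: str, abstract: str, keywords: list[str]) -> str:
    # The framework fuses title, keywords, and abstract into one instruction;
    # this exact template is an assumption.
    return (
        "Assign hierarchical Chinese Library Classification labels to this paper.\n"
        f"Title: {title}\n"
        f"Keywords: {', '.join(keywords)}\n"
        f"Abstract: {abstract}\n"
        "Labels (level 1 > ... > level 5):"
    )

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# Parameter-efficient supervised fine-tuning: LoRA adapters on the attention
# projections, so only a small fraction of the weights is trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
# ... supervised fine-tuning on (prompt, natural-language label path) pairs
#     with the standard causal-LM loss goes here ...

# Hypothetical mapping from generated natural-language labels back to CLC
# notation; a real table would cover the full hierarchy of all three disciplines.
LABEL_TO_CLC = {"Economics": "F", "World economy": "F1"}

def labels_to_clc(generated: str) -> list[str]:
    # The model emits a readable label path; indexing needs CLC notation.
    return [LABEL_TO_CLC.get(part.strip(), "?") for part in generated.split(">")]
```

Generating readable label paths and then mapping them back to notation, rather than generating CLC codes directly, lets the fine-tuned model exploit the semantics of the category names, while the mapping step constrains the final indices to valid CLC classes.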
|
Received: 20 April 2024
|
|
|
|
|
|
|