TreeGPT: An LLM-Based Automated Method for Constructing Technical Topic Architectures
Wang Haofeng, Gao Yingfan, Wang Lijun, Yao Changqing
Institute of Scientific and Technical Information of China, Beijing 100038

Abstract The continuous growth of scientific and technological literature poses significant challenges for technical intelligence analysis, particularly in efficiently extracting technical information from large volumes of unstructured text and organizing it into a hierarchical semantic framework. This study proposes TreeGPT, an automated framework for constructing technical topic architectures with large language models (LLMs). By leveraging the intrinsic knowledge and semantic vectorization capabilities of LLMs, TreeGPT identifies technical topics in scientific literature and mines the relationships among them to generate a structured technical topic system for a target domain. To validate the method, a comparative empirical analysis against BERTopic and HLDA was conducted, using the integrated circuit domain as a case study. The experimental results show that TreeGPT significantly outperforms these traditional methods in semantic accuracy and hierarchical clarity while achieving an effective balance between LLM performance and cost. The proposed method also provides valuable support for domain knowledge, semantic modeling, and technical intelligence analysis.
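The abstract's idea of mining topic relationships from semantic vectors can be illustrated with a minimal sketch. It is not the TreeGPT procedure itself, which the abstract does not specify. It assumes topic labels and their embedding vectors are already available, and it attaches each lower-level topic to the most similar higher-level topic by cosine similarity. The topic names, the three-dimensional toy vectors, and the functions `cosine` and `build_topic_tree` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_topic_tree(parents, children):
    """Attach each child topic to the parent topic with the highest cosine similarity.

    parents, children: dicts mapping a topic label to its embedding vector.
    Returns a dict mapping each parent label to a list of child labels.
    """
    tree = {name: [] for name in parents}
    for child, vec in children.items():
        best = max(parents, key=lambda p: cosine(parents[p], vec))
        tree[best].append(child)
    return tree


# Hypothetical toy embeddings; a real pipeline would obtain vectors from an embedding model.
parents = {
    "chip design": [1.0, 0.1, 0.0],
    "chip manufacturing": [0.0, 0.2, 1.0],
}
children = {
    "logic synthesis": [0.9, 0.3, 0.1],
    "lithography": [0.1, 0.1, 0.9],
    "etching": [0.0, 0.4, 0.8],
}

tree = build_topic_tree(parents, children)
for parent, kids in tree.items():
    print(parent, "->", kids)
```

In this toy setup, "logic synthesis" is placed under "chip design", while "lithography" and "etching" are placed under "chip manufacturing". Nearest-parent assignment is only one simple way to turn vector similarity into a hierarchy; it is used here because it keeps the example short.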
Received: 18 June 2025