A Patent Text Similarity Calculation Method Based on Expert Feedback Fine-Tuning
Wang Shujun1,2, Gao Yingfan1,2, Yao Changqing1, Yuan Ming1,2
1. Institute of Scientific and Technical Information of China, Beijing 100038, China; 2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038, China
Abstract: Patents are an important knowledge carrier of innovative technology, and text similarity calculation is a core, widely applied task in natural language processing; computing the similarity of patent texts helps to mine potentially valuable patents and to support patent retrieval. This paper proposes a patent text similarity calculation method based on fine-tuning with expert feedback. On a small expert-evaluated dataset, a large language model is used to regenerate patent abstracts, thereby augmenting the data with negative examples; the expert evaluation dataset is then used to fine-tune a pre-trained model, and similar patents are recomputed on a large-scale dataset. In two emerging fields, new materials and electronic information, we further pre-train the BART (bidirectional and auto-regressive transformers) and BGE (Beijing Academy of Artificial Intelligence general embedding) models and fine-tune both on the expert evaluation dataset. Experimental results show that the Spearman correlation coefficient of the proposed method improves by 6.4% and 16.9%, respectively, over the initial models. An empirical study on identifying enterprises' technology competitors in the electronic information field further verifies the advantage of the method for technology competitor identification.
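The abstract describes a pipeline in which embedding models score candidate patent pairs and the scores are validated against expert judgments via the Spearman correlation coefficient. The sketch below is a minimal illustration of that evaluation step only, not the authors' released code: it assumes the sentence-transformers and scipy libraries, an off-the-shelf BGE checkpoint ("BAAI/bge-base-zh-v1.5"), and hypothetical expert-rated pairs.

```python
# Minimal sketch: cosine similarity from a BGE-style embedding model,
# compared against expert ratings with the Spearman correlation coefficient.
# The checkpoint name and the example pairs below are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util
from scipy.stats import spearmanr

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # assumed BGE checkpoint

# Hypothetical expert-evaluated pairs: (abstract_a, abstract_b, expert_score in [0, 1])
pairs = [
    ("Abstract of patent A on a graphene heat-dissipation film ...",
     "Abstract of patent B on a graphene composite cooling sheet ...", 0.8),
    ("Abstract of patent C on a 5G base-station RF front-end module ...",
     "Abstract of patent D on a lithium-ion battery anode material ...", 0.1),
    ("Abstract of patent E on an OLED pixel driving circuit ...",
     "Abstract of patent F on an OLED display panel ...", 0.6),
]

# Encode both sides and take the diagonal of the pairwise cosine-similarity matrix.
emb_a = model.encode([p[0] for p in pairs], normalize_embeddings=True)
emb_b = model.encode([p[1] for p in pairs], normalize_embeddings=True)
predicted = util.cos_sim(emb_a, emb_b).diagonal().tolist()
expert = [p[2] for p in pairs]

# Rank correlation between model scores and expert judgments.
rho, _ = spearmanr(predicted, expert)
print(f"Spearman correlation: {rho:.3f}")
```

In the method described above, the same Spearman measure would be computed before and after fine-tuning on the expert evaluation dataset to quantify the reported 6.4% and 16.9% improvements.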