A Patent Text Similarity Calculation Method Based on Expert Feedback Fine-Tuning
Wang Shujun1,2, Gao Yingfan1,2, Yao Changqing1, Yuan Ming1,2
1. Institute of Scientific and Technical Information of China, Beijing 100038, China; 2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038, China
Abstract: Patents are an important knowledge carrier of innovative technology, and text similarity calculation is a core, widely applied task in natural language processing; computing the similarity of patent texts helps to mine potentially valuable patents and to support patent retrieval. This paper proposes a patent text similarity calculation method based on fine-tuning with expert feedback. On a small expert-evaluated dataset, a large language model is used to regenerate patent abstracts, thereby augmenting the data with negative examples; the expert evaluation dataset is then used to fine-tune a pre-trained model, and similar patents are recomputed on a large-scale dataset. In two emerging fields, new materials and electronic information, we further pre-train the BART (bidirectional and auto-regressive transformers) and BGE (Beijing Academy of Artificial Intelligence general embedding) models and fine-tune both on the expert evaluation dataset. Experimental results show that the Spearman correlation coefficient of the proposed method improves by 6.4% and 16.9%, respectively, over the initial models. An empirical study on identifying enterprises' technology competitors in the electronic information field further verifies the advantage of the method for technology competitor identification.
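The abstract describes a pipeline in which embedding models score candidate patent pairs and the scores are validated against expert judgments via the Spearman correlation coefficient. The sketch below is a minimal illustration of that evaluation step only, not the authors' released code: it assumes the sentence-transformers and scipy libraries, an off-the-shelf BGE checkpoint ("BAAI/bge-base-zh-v1.5"), and hypothetical expert-rated pairs.

```python
# Minimal sketch: cosine similarity from a BGE-style embedding model,
# compared against expert ratings with the Spearman correlation coefficient.
# The checkpoint name and the example pairs below are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util
from scipy.stats import spearmanr

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # assumed BGE checkpoint

# Hypothetical expert-evaluated pairs: (abstract_a, abstract_b, expert_score in [0, 1])
pairs = [
    ("Abstract of patent A on a graphene heat-dissipation film ...",
     "Abstract of patent B on a graphene composite cooling sheet ...", 0.8),
    ("Abstract of patent C on a 5G base-station RF front-end module ...",
     "Abstract of patent D on a lithium-ion battery anode material ...", 0.1),
    ("Abstract of patent E on an OLED pixel driving circuit ...",
     "Abstract of patent F on an OLED display panel ...", 0.6),
]

# Encode both sides and take the diagonal of the pairwise cosine-similarity matrix.
emb_a = model.encode([p[0] for p in pairs], normalize_embeddings=True)
emb_b = model.encode([p[1] for p in pairs], normalize_embeddings=True)
predicted = util.cos_sim(emb_a, emb_b).diagonal().tolist()
expert = [p[2] for p in pairs]

# Rank correlation between model scores and expert judgments.
rho, _ = spearmanr(predicted, expert)
print(f"Spearman correlation: {rho:.3f}")
```

In the method described above, the same Spearman measure would be computed before and after fine-tuning on the expert evaluation dataset to quantify the reported 6.4% and 16.9% improvements.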