A Patent Text Similarity Calculation Method Based on Expert Feedback Fine-Tuning
Wang Shujun1,2, Gao Yingfan1,2, Yao Changqing1, Yuan Ming1,2
1. Institute of Scientific and Technical Information of China, Beijing 100038
2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038
Abstract Patents are important carriers of technological innovation, and calculating the similarity of patent texts is a widely applied natural language processing task that helps identify potentially valuable patents and supports patent retrieval. This study proposes a patent text similarity calculation method based on expert feedback fine-tuning. Starting from a small expert evaluation dataset, the method uses a large model to regenerate abstract texts, thereby augmenting the negative examples. The pretrained model is then fine-tuned on the expert evaluation dataset, and patent similarity is recalculated on a large-scale dataset. The study continues training the bidirectional and auto-regressive transformers (BART) and Beijing Academy of Artificial Intelligence general embedding (BGE) models in the emerging fields of new materials and electronic information, respectively, and fine-tunes both models on the expert evaluation dataset. Experimental results show that the Spearman correlation coefficient of the method increases by 6.4% and 16.9% compared with the initial models. In the empirical section, enterprises in the electronic information field are selected to identify technological competitors, which verifies the advantages of the method.
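The evaluation mentioned in the abstract, comparing model similarity scores with expert judgments through the Spearman correlation coefficient, can be illustrated with a minimal sketch. The embeddings and expert scores below are invented toy values, and the helper names (`cosine_similarity`, `average_ranks`, `spearman`) are hypothetical; the sketch assumes cosine similarity between embedding vectors as the model score, which the abstract does not state, and it is not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Toy data (assumption): each pair holds the embeddings of two patent abstracts.
pairs = [
    ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ([0.1, 0.9, 0.2], [0.7, 0.1, 0.6]),
    ([0.3, 0.3, 0.9], [0.2, 0.4, 0.8]),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
expert_scores = [4, 2, 5, 1]  # hypothetical expert similarity ratings

model_scores = [cosine_similarity(u, v) for u, v in pairs]
rho = spearman(model_scores, expert_scores)
print(f"Spearman correlation with expert ratings: {rho:.3f}")
```

Because Spearman correlation depends only on rank order, it rewards a model for ordering patent pairs the way the experts do, regardless of the absolute scale of its similarity scores.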
Received: 22 November 2024