Cross-Language Patent Text Representation Optimization Based on a Supervised Fine-Tuned SimCSE Approach
Wang Lijun¹,², Li Haotian¹,³, Gao Yingfan¹,², Wang Shujun¹,²
1. Institute of Scientific and Technical Information of China, Beijing 100038
2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038
3. Fengtai District Archives of Beijing, Beijing 100076