Cross-Language Patent Text Representation Optimization Based on a Supervised Fine-Tuning SimCSE Approach
Wang Lijun1,2, Li Haotian1,3, Gao Yingfan1,2, Wang Shujun1,2
1. Institute of Scientific and Technical Information of China, Beijing 100038
2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038
3. Fengtai District Archives of Beijing, Beijing 100076
|
|
Abstract This paper proposes a cross-language patent text representation optimization method to enhance the semantic representation of Chinese and English patent texts. The method integrates the SimCSE contrastive learning algorithm with a supervised fine-tuning strategy, leveraging Chinese-English parallel patent corpora to achieve effective cross-language text representation. First, building on unsupervised SimCSE fine-tuning, this paper introduces a supervised SimCSE fine-tuning algorithm to improve the model's performance in cross-language semantic understanding. Specifically, we propose a positive and negative sample mining strategy in which a high-quality positive sample set is constructed by analyzing citation relationships between patent texts, enabling the model to capture cross-lingual semantic similarities more accurately. In addition, we introduce the RetroMAE secondary pretraining model to optimize the mining of hard negative samples, further enhancing the model's discriminative ability and generalization performance. Compared with traditional cross-language text representation methods, the proposed method demonstrates significant advantages in handling cross-language patent texts, overcoming the limitations of previous methods in semantic alignment and differentiation, and thus provides a more precise and effective tool for cross-language patent analysis across multiple domains.
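The supervised SimCSE objective summarized above is, at its core, an InfoNCE-style contrastive loss computed over Chinese-English positive pairs together with mined hard negatives. The following PyTorch sketch is a minimal illustration of that general objective under stated assumptions; the function name, tensor shapes, and temperature value are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(anchor, positive, hard_negative, temperature=0.05):
    """Illustrative supervised SimCSE loss with in-batch and hard negatives.

    anchor:        (B, d) embeddings of source-language patent texts
    positive:      (B, d) embeddings of their aligned/cited counterparts
    hard_negative: (B, d) embeddings of mined hard negatives (one per anchor)
    """
    # Cosine similarity requires unit-normalized embeddings.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    hard_negative = F.normalize(hard_negative, dim=-1)

    pos_sim = anchor @ positive.t() / temperature       # (B, B)
    neg_sim = anchor @ hard_negative.t() / temperature  # (B, B)

    # Column i of pos_sim is the true positive for anchor i; all other
    # columns serve as in-batch negatives, extended by the hard negatives.
    logits = torch.cat([pos_sim, neg_sim], dim=1)        # (B, 2B)
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```

In this formulation, larger batches implicitly supply more in-batch negatives, while the explicitly mined hard negatives sharpen the decision boundary between semantically close but non-equivalent patents.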
|
Received: 22 November 2024
|
|
|
|
|
|
|