Cross-Language Patent Text Representation Optimization Based on a Supervised Fine-Tuned SimCSE Approach
Wang Lijun¹,², Li Haotian¹,³, Gao Yingfan¹,², Wang Shujun¹,²
1. Institute of Scientific and Technical Information of China, Beijing 100038
2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038
3. Fengtai District Archives of Beijing, Beijing 100076