Artificial Intelligence Technology: Novel Strategy for Patent Dataset Creation Based on Machine Learning
Chen Yue¹, Song Kai¹, Liu Anrong², Cao Xiaoyang²
1. Institution of Science of Science and S&T Management & WISE Lab, Dalian University of Technology, Dalian 116024
2. Chinese Academy of Engineering Innovation Strategy, Beijing 100089
Abstract: A disruptive technology is a technology group with a complex internal structure that spans multiple disciplines and fields; viewed spatially, it comprises leading, auxiliary, and supporting technologies. Using scientometrics to evaluate disruptive technologies and to trace the evolution of science and technology faces challenges, most visibly in data retrieval. This paper explores a novel machine-learning strategy for constructing patent datasets for complex technologies, treating patent retrieval as a binary classification task. The approach is similar to active-learning-based query classification in information retrieval. We also propose an improved text classification method that combines feature maximization with a convolutional neural network (CNN) model. The technical domain of artificial intelligence (AI) serves as the example. The results show an accuracy of 98.01%, a recall of 97.04%, and an F1 value of 97.89%, indicating that the proposed strategy accurately identifies AI patents, improves the accuracy and recall of patent searches, and facilitates the creation of accurate and comprehensive patent datasets for the AI domain.
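As a point of reference for the figures reported in the abstract, the following is a minimal Python sketch of how accuracy, precision, recall, and F1 are conventionally computed for a binary task such as AI versus non-AI patent classification. The names `Confusion`, `confusion_from_labels`, and `binary_metrics`, as well as the toy labels, are illustrative assumptions and do not come from the paper; the authors' own evaluation procedure may differ.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Confusion:
    """Confusion counts for a binary task (hypothetical helper type)."""
    tp: int  # AI patents correctly labeled as AI
    fp: int  # non-AI patents wrongly labeled as AI
    fn: int  # AI patents wrongly labeled as non-AI
    tn: int  # non-AI patents correctly labeled as non-AI


def confusion_from_labels(y_true: Sequence[int], y_pred: Sequence[int]) -> Confusion:
    """Count outcomes, assuming label 1 = AI patent and 0 = non-AI patent."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def binary_metrics(c: Confusion) -> Dict[str, float]:
    """Standard definitions; a zero denominator yields 0.0 by convention here."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the paper.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(confusion_from_labels(y_true, y_pred)))
```

Because F1 is the harmonic mean of precision and recall, it always lies between those two values, which is a useful sanity check when reading reported results.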
Received: 07 July 2020