Review of the Research Progress on the Open Scientific Datasets Unified Discovery Platform
Luo Pengcheng 1,2, Wang Jimin 1, Nie Lei 3
1. Department of Information Management, Peking University, Beijing 100871
2. Peking University Library, Beijing 100871
3. Academy of Regional and Global Governance, Beijing Foreign Studies University, Beijing 100089
Abstract In the open science environment, the reuse of scientific data is increasingly valued. To help researchers find data, many unified discovery platforms for scientific datasets have been launched, and dataset retrieval methods have accordingly attracted considerable research attention. This study reviews research and applications related to unified discovery platforms for open scientific datasets, both in China and internationally. It surveys progress in four areas (dataset collection, dataset organization, dataset retrieval, and the ranking of retrieval results) and then analyzes future research directions. Specifically, it introduces and analyzes dataset collection methods, methods for unifying multi-source metadata, metadata quality analysis, metadata enrichment methods, query expansion and ranking methods, relevance criteria, and comprehensive ranking methods. The review is intended as a reference for further research and applications.
Received: 17 May 2021