The rapid proliferation of digital data in the research landscape has underscored the critical need for sustainable data curation strategies, especially regarding the long-term preservation of valuable datasets. Research data repositories, as key infrastructures for data stewardship, face mounting challenges in determining which datasets should be preserved for future reuse, validation, and scientific advancement. Given the constraints of storage, funding, and technical resources, not all data generated by research activities can or should be preserved indefinitely. Consequently, defining rigorous, transparent, and contextually appropriate evaluation and selection criteria has emerged as a vital concern within the broader scope of research data management (RDM) and digital curation.
This study aims to identify and categorize the key criteria used to evaluate and select research data for long-term preservation in repositories. By conducting a systematic review of existing literature and practices, it seeks to offer a conceptual framework that supports repository managers, librarians, archivists, and data stewards in making informed and consistent decisions about what data to retain. The research further addresses the implications of these criteria on policy development, data sharing, and FAIR data principles (Findable, Accessible, Interoperable, and Reusable). Ultimately, the study contributes to improving data lifecycle management strategies and ensuring that preserved data retains its scientific, legal, ethical, and cultural value.
Methods and Materials
This research adopted a qualitative content analysis approach based on a systematic literature review. The primary goal was to identify, classify, and synthesize the evaluation and selection criteria applied by data repositories in preserving research data. The review focused on peer-reviewed journal articles, white papers, policy documents, and institutional guidelines published between 2000 and 2024. Major databases and search services, including Scopus, Web of Science, ScienceDirect, and Google Scholar, were searched using combinations of keywords such as "research data preservation," "data selection criteria," "data curation," and "digital repositories."
Literature was included if it explicitly or implicitly discussed the assessment or selection of research data for long-term storage, including frameworks, models, or institutional case studies. A total of 67 relevant documents were identified and analyzed. Through iterative coding and constant comparison, the evaluation criteria were distilled into thematic clusters such as scientific value, legal and ethical considerations, technical characteristics, economic feasibility, data usability, and policy alignment.
Results and Discussion
The findings of this study, based on a systematic review of the literature and a meta-synthesis of previous studies, identify a comprehensive set of criteria and components for evaluating and selecting research data for retention in data repositories. These criteria are categorized into eight main components: data preparation, data quality, physical conditions and technical features, metadata management and features, ethical principles of data, document-related criteria, compliance with FAIR principles, and repository policies and issues.
In the “Data Preparation” component, indicators such as data cleaning, data scale, presence of missing data, and evaluation of survey biases are highlighted. This component emphasizes the necessity of eliminating errors and inconsistencies, assessing the scale of data, and addressing missing values. It also stresses the importance of identifying and evaluating biases in survey data, such as sampling errors, non-response, and other confounding factors.
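To make these indicators more concrete, the following is a minimal sketch, in Python with pandas, of how a repository might screen an incoming tabular dataset for scale, duplicate records, and missing values before acceptance. The file name, column handling, and missingness threshold are hypothetical illustrations, not requirements drawn from the reviewed literature, and bias assessment (e.g., sampling error, non-response) would still require methodological review by a curator.

```python
import pandas as pd

def screen_dataset(path: str, max_missing_share: float = 0.2) -> dict:
    """Run basic data-preparation checks on a tabular dataset.

    The threshold and the notion of a 'record' are illustrative;
    a repository would adapt them to its own curation policy.
    """
    df = pd.read_csv(path)

    report = {
        "n_records": len(df),                              # data scale
        "n_variables": df.shape[1],
        "duplicate_records": int(df.duplicated().sum()),   # inconsistencies
        "missing_share_by_variable": df.isna().mean().round(3).to_dict(),
    }
    # Flag variables whose missingness exceeds the policy threshold.
    report["variables_exceeding_threshold"] = [
        col for col, share in report["missing_share_by_variable"].items()
        if share > max_missing_share
    ]
    return report

# Example use with a hypothetical deposit:
# print(screen_dataset("survey_2024.csv"))
```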
The “Data Quality” component includes indicators such as accuracy, reliability, completeness, validity, documentation of limitations, and timeliness of data. Accuracy and correctness of information must be carefully assessed, and data reliability should be evaluated based on how the data was produced and analyzed. Completeness refers to the presence of all necessary elements in the dataset, and validity relates to the soundness of data collection tools and the extent to which findings reflect reality. Acknowledging study limitations helps clarify weaknesses, and timeliness reflects whether the data remain relevant given when they were collected.
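As an illustration only, some of these indicators can be recorded in a simple, policy-defined checklist like the sketch below; the fields, the ten-year timeliness window, and the pass rule are assumptions, and indicators such as accuracy, validity, and reliability would still rest on expert judgement rather than automated checks.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class QualityChecklist:
    """Hypothetical checklist covering a subset of the quality indicators."""
    complete: bool               # all required elements present
    limitations_documented: bool
    collection_year: int         # proxy for timeliness
    reviewed_for_accuracy: bool  # outcome of expert review

    def timeliness_ok(self, max_age_years: int = 10) -> bool:
        # Timeliness: data collected within a policy-defined window.
        return date.today().year - self.collection_year <= max_age_years

    def passes(self) -> bool:
        # Only aggregates the checkable items; accuracy, validity, and
        # reliability judgements are recorded, not computed, here.
        return (self.complete and self.limitations_documented
                and self.reviewed_for_accuracy and self.timeliness_ok())

# print(QualityChecklist(True, True, 2021, True).passes())
```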
The remaining components and their indicators are as follows:
- Physical Conditions and Technical Features of Data: Includes data formats, future readability, required software for access, and compatibility with technical standards.
- Metadata Management and Features: Covers the presence of sufficient metadata, use of standardized structures for data description, supplementary documentation, and the information necessary for data reuse (a minimal illustrative record is sketched after this list).
- Ethical Principles of Data: Encompasses protection of participants' privacy, anonymization or encryption of sensitive information, obtaining informed consent, and respect for intellectual property rights.
- Document-Related Criteria: Includes the association of data with specific research projects, traceability of data to published scholarly articles, and documentation of data collection methods.
- Compliance with FAIR Principles: Covers Findability, Accessibility, Interoperability, and Reusability of the data.
- Repository Policies and Issues: Involves adherence to legal requirements and repository policies, access licenses, data sharing conditions, and security considerations for data storage.
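To illustrate the metadata- and FAIR-related components above, the fragment below sketches a minimal descriptive record as a Python dictionary. The field names loosely follow common practice in schemas such as DataCite (persistent identifier, creators, license, related publication), but the specific values and the exact schema a given repository would require are assumptions for illustration, not prescriptions from the reviewed literature.

```python
# Minimal descriptive record sketch; field names loosely follow the
# DataCite metadata schema, and all values are purely illustrative.
dataset_record = {
    "identifier": "10.1234/example.dataset.2024",     # persistent ID -> Findable
    "creators": [{"name": "Doe, Jane", "affiliation": "Example University"}],
    "title": "Household Survey on Digital Literacy, 2023",
    "publicationYear": 2024,
    "resourceType": "Dataset",
    "formats": ["text/csv"],                           # technical features
    "rights": "CC-BY-4.0",                             # access licence -> Reusable
    "relatedIdentifiers": [                            # link to a published article
        {"relationType": "IsSupplementTo",
         "relatedIdentifier": "10.5678/journal.article"}
    ],
    "descriptions": [
        {"descriptionType": "Methods",
         "description": "Stratified random sample; questionnaire attached as codebook.pdf."}
    ],
}
```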
These eight components and their corresponding indicators provide a comprehensive and evidence-based framework for evaluating and selecting suitable research data for long-term retention in data repositories.
Discussion and Conclusion
The discussion and conclusion of this paper emphasize the importance of various components in the evaluation and selection of research data for storage in data repositories. In the data preparation phase, accuracy in data cleaning and screening, particularly in quantitative research, is crucial. Challenges such as missing data and potential biases, including sampling errors, can complicate analyses and reduce data quality. Therefore, adherence to precise standards in cleaning and verifying data is essential.
In evaluating data quality, the accuracy and precision of information, reliability, and completeness of the data are key criteria. Data that are properly collected and analyzed facilitate more effective research and reuse. For both qualitative and quantitative data, the use of standardized formats and compatibility with diverse systems are significant technical considerations that affect preservation quality.
Metadata documentation also plays a critical role in data evaluation. Metadata provides essential information about the data, enhancing transparency, collaboration, and trust. Furthermore, adhering to ethical principles, such as obtaining informed consent from participants and ensuring their privacy during the use of data, is crucial. These actions help maintain public trust and prevent misuse of data.
The paper also emphasizes the importance of alignment with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) in data evaluation. Adherence to these principles makes data more usable and accessible for future research. Additionally, repository policies must balance user needs against technical limitations so that high-value research data are preserved for future use.
The study concludes that the evaluation and selection of research data for storage should be conducted carefully and against standardized criteria to improve the quality and effectiveness of data reuse in future research. It also offers practical recommendations, such as developing data evaluation guidelines, training data specialists, and implementing technological tools to strengthen data evaluation and storage processes.