Terminology
Research Data - any physical and/or digital materials that are collected, observed, or created in research activity for purposes of analysis to produce original research results or creative works. Research data are recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings.
Research Data Management - the organization of data, from its entry into the research cycle through to the dissemination and archiving of valuable results. It covers the planning, creating, storing, organizing, accessing, sharing, describing, publishing and curating of data.
Research Data Management
You need to think about data management as early as possible and throughout the research lifecycle. Data management is not a single task to be ticked off at a particular point in the research process; it is integral to conducting research.
Reasons to manage data:
- Risks of data loss;
- Non-repeatability of research;
- Institutional reputational risk;
- Need to repeat work if you can’t make sense of it, or if it is not documented effectively;
- ‘Big data’ – data so large or complex that traditional data processing applications are inadequate to deal with them;
- Give access to data and/or results to other researchers;
- Just as part of good practice – to share, cite, re-use;
- Funder requirements – gain new and continuation funding;
- Institutional reputational and funding risk if there is no infrastructure and/or poor practice;
- So that data are easy to find and to combine with others’ data;
- Identify versions of data;
- To enable sharing;
- Citation impact if made available – get credit for your work;
- Demonstrate value for funding and likelihood of further funding;
- Enable collaboration.
For more information, visit the Digital Curation Centre (DCC) website.
Managing your data will help you to:
- easily find the data when needed,
- avoid unnecessary duplication,
- validate your results if required,
- ensure your research is visible and has impact,
- get credit when others cite your work,
- comply with funder mandates.
Data Management Plan
A data management plan (DMP) is a formal, written document that describes the data you expect to acquire or generate during the course of a research project; how you will manage, describe, analyze, and store those data, both during the research and after the project is completed; and what mechanisms you will use at the end of your project to share and preserve your data.
The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; this ensures that data are well-managed in the present, and prepared for preservation in the future.
Elements of Data Management Plan
- Data description: A description of the information to be gathered; the nature and scale of the data that will be generated or collected.
- Existing data: A survey of existing data relevant to the project and a discussion of whether and how these data will be integrated.
- Format: Formats in which the data will be generated, maintained, and made available, including a justification for the procedural and archival appropriateness of those formats.
- Metadata: A description of the metadata to be provided along with the generated data, and a discussion of the metadata standards used.
- Storage and backup: Storage methods and backup procedures for the data, including the physical and cyber resources and facilities that will be used for the effective preservation and storage of the research data.
- Security: A description of technical and procedural protections for information, including confidential information, and how permissions, restrictions, and embargoes will be enforced.
- Responsibility: Names of the individuals responsible for data management in the research project.
- Intellectual property rights: Entities or persons who will hold the intellectual property rights to the data, and how IP will be protected if necessary. Any copyright constraints (e.g., copyrighted data collection instruments) should be noted.
- Access and sharing: A description of how data will be shared, including access procedures, embargo periods, technical mechanisms for dissemination and whether access will be open or granted only to specific user groups. A timeframe for data sharing and publishing should also be provided.
- Audience: The potential secondary users of the data.
- Selection and retention periods: A description of how data will be selected for archiving, how long the data will be held, and plans for eventual transition or termination of the data collection in the future.
- Archiving and preservation: The procedures in place or envisioned for long-term archiving and preservation of the data, including succession plans for the data should the expected archiving entity go out of existence.
- Ethics and privacy: A discussion of how informed consent will be handled and how privacy will be protected, including any exceptional arrangements that might be needed to protect participant confidentiality, and other ethical issues that may arise.
- Budget: The costs of preparing data and documentation for archiving and how these costs will be paid. Requests for funding may be included.
- Data organization: How the data will be managed during the project, with information about version control, naming conventions, etc.
- Quality Assurance: Procedures for ensuring data quality during the project (a minimal example of an automated check is sketched after this list).
- Legal requirements: A listing of all relevant federal or funder requirements for data management and data sharing.
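As an illustration of the kind of quality-assurance procedure a plan might describe, the sketch below checks a tabular data file for required columns, missing values, and out-of-range readings. The file name, column names, and acceptable range are hypothetical assumptions, not part of any particular funder template.

```python
# A minimal quality-assurance sketch; assumes the pandas library is installed.
# The file name, column names, and acceptable range are hypothetical examples.
import pandas as pd

REQUIRED_COLUMNS = ["site_id", "sample_date", "temperature_c"]
TEMPERATURE_RANGE = (-40.0, 60.0)  # plausible range for this hypothetical study

def check_quality(path):
    """Run basic checks on a CSV file and return a list of problems found."""
    problems = []
    df = pd.read_csv(path)

    # 1. Required columns are present.
    for column in REQUIRED_COLUMNS:
        if column not in df.columns:
            problems.append(f"missing column: {column}")

    # 2. No missing values in the required columns that are present.
    for column in REQUIRED_COLUMNS:
        if column in df.columns and df[column].isna().any():
            problems.append(f"missing values in column: {column}")

    # 3. Values fall within the agreed, documented range.
    if "temperature_c" in df.columns:
        low, high = TEMPERATURE_RANGE
        out_of_range = df[(df["temperature_c"] < low) | (df["temperature_c"] > high)]
        if not out_of_range.empty:
            problems.append(f"{len(out_of_range)} temperature values outside {TEMPERATURE_RANGE}")

    return problems

if __name__ == "__main__":
    for problem in check_quality("survey_data.csv"):
        print("QA issue:", problem)
```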
Importance of Data Management Plan
Preparing a data management plan before data are collected ensures that data are in the correct format, well organized, and thoroughly annotated.
This saves time in the long term because there is no need to re-organize, re-format, or try to remember details about data. It also increases research efficiency since both the data collector and other researchers will be able to understand and use well-annotated data in the future.
Many funding agencies now require that grant applications include data management plans for projects involving data collection.
Data Management Planning Tool
The DMPTool is a collaboration of multiple institutions, including DataONE, and is a service of the UC Curation Center. The DMPTool will help you:
- Create ready-to-use data management plans for specific funding agencies;
- Meet funder requirements for data management plans;
- Get step-by-step instructions and guidance for your data management plan as you build it;
- Learn about resources and services available at your institution to help fulfill the data management requirements of your grant.
Data Management Best Practices
To keep your data well organized, you should consider the following best practices:
- Use descriptive and informative file names
- Choose file formats that will ensure long-term access
- Track different versions of your documents
- Create metadata for every experiment or analysis you run
- Find helpful tools for analyzing your data
- Handle sensitive data in an appropriate manner
- Plan for the long term
- Document your plan
Once you have started to implement best practices for yourself and your research group, make an effort to document these plans. Include your and your group's procedures for the following:
- Naming files
- Saving and backing up files
- Describing data files
- Tracking versions
You might consider using a wiki or a Google doc that everyone in your group can access when needed. Be sure to define who is responsible for each task and for setting the overall policies.
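As one way of putting these procedures into practice, the sketch below shows a simple, hypothetical approach to version tracking: each saved version of a data file is kept as a numbered copy and recorded in a plain-text log that the whole group can read. The file names and log format are assumptions for illustration, not a prescribed tool.

```python
# A minimal sketch of manual version tracking: versions are kept as numbered
# copies alongside a plain-text log. File names and log format are hypothetical.
import shutil
from datetime import date
from pathlib import Path

def save_version(data_file, version, note, log_file="versions_log.txt"):
    """Copy data_file to a versioned name and record the change in a shared log."""
    source = Path(data_file)
    versioned = source.with_name(f"{source.stem}_v{version:02d}{source.suffix}")
    shutil.copy2(source, versioned)  # keep the working file; archive this snapshot

    with open(log_file, "a", encoding="utf-8") as log:
        log.write(f"{date.today()}\t{versioned.name}\t{note}\n")
    return versioned

# Example usage: snapshot version 2 of a results file before re-running an analysis.
# save_version("survey_results.csv", 2, "re-coded outliers before regression")
```

A dedicated version control system can serve the same purpose; the point is that the group agrees on one approach and writes it down.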
Metadata
In its most basic sense, metadata is information about data, and describes basic characteristics of the data, such as:
- Who created the data
- What the data file contains
- When the data were generated
- Where the data were generated
- Why the data were generated
- How the data were generated
Metadata makes it easier for you and others to identify and reuse data correctly at a later date.
Well-structured metadata not only supports the long-term discovery and preservation of your research data, but also allows for the aggregation and simultaneous searching of research data from tens, hundreds, or thousands of researchers.
This is why domain-specific repositories typically require highly structured metadata with your data submissions: it enables highly granular searches on their aggregated content. This in turn makes your data easier to find.
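As a simple illustration (not a formal metadata standard), the sketch below writes a small "sidecar" metadata file alongside a data file, answering the who/what/when/where/why/how questions above. The field names and values are hypothetical; a domain-specific repository will normally prescribe its own schema.

```python
# A minimal sketch of a "sidecar" metadata record, saved as JSON next to the
# data file it describes. Field names and values are hypothetical examples,
# not a formal metadata standard.
import json
from datetime import date

metadata = {
    "title": "Soil temperature readings, plot A",                # what the file contains
    "creator": "J. Smith, Example University",                   # who created the data
    "created": str(date.today()),                                # when the data were generated
    "location": "Field station, plot A (hypothetical)",          # where the data were generated
    "purpose": "Baseline measurements for a warming experiment", # why the data were generated
    "method": "Data logger, 10-minute sampling interval",        # how the data were generated
    "data_file": "soil_temperature_plotA.csv",
}

with open("soil_temperature_plotA.metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```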
A number of free tools are available for metadata creation. Some of them help you select controlled vocabularies to include in your documentation, while others combine that functionality with a fully supported metadata schema. See more tools here.
Naming Files
How you organize and name your files will have a big impact on your ability to find those files later and to understand what they contain. You should be consistent and descriptive in naming and organizing files so that it is obvious where to find specific data and what the files contain.
It's a good idea to set up a clear directory structure that includes information like the project title, a date, and some type of unique identifier. Individual directories may be set up by date, researcher, experimental run, or whatever makes sense for you and your research.
File names should allow you to identify a precise experiment from the name. Choose a format for naming your files and use it consistently.
You might consider including some of the following information in your file names, but you can include any information that will allow you to distinguish your files from one another.
- Project or experiment name or acronym
- Location/spatial coordinates
- Researcher name/initials
- Date or date range of experiment
- Type of data
- Conditions
- Version number of file
- Three-letter file extension for application-specific files
Another good idea is to include a readme.txt file in the directory that explains your naming format, along with any abbreviations or codes you have used; a minimal sketch of this approach follows.
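The sketch below shows one possible way to assemble such a file name and to generate the accompanying readme.txt; the project acronym, initials, conditions, and the convention itself are hypothetical examples, not a required standard.

```python
# A minimal sketch of a file-naming convention: project acronym, date,
# researcher initials, condition, and version number, joined with underscores.
# All component values here are hypothetical examples.
from datetime import date

def build_filename(project, initials, condition, version, extension="csv"):
    """Return a descriptive file name such as SOILTEMP_20240301_JS_control_v01.csv."""
    today = date.today().strftime("%Y%m%d")
    return f"{project}_{today}_{initials}_{condition}_v{version:02d}.{extension}"

print(build_filename("SOILTEMP", "JS", "control", 1))

# Document the convention in a readme.txt kept in the same directory.
readme_text = (
    "File naming convention: PROJECT_YYYYMMDD_INITIALS_CONDITION_vNN.ext\n"
    "PROJECT  = project acronym (e.g. SOILTEMP)\n"
    "INITIALS = researcher initials\n"
    "vNN      = two-digit version number\n"
)
with open("readme.txt", "w", encoding="utf-8") as f:
    f.write(readme_text)
```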
Guidelines for Choosing Formats
When selecting file formats for archiving, the formats should ideally be:
- Non-proprietary;
- Unencrypted;
- Uncompressed;
- In common usage by the research community;
- Adherent to an open, documented standard;
- Interoperable among diverse platforms and applications;
- Fully published and available royalty-free;
- Fully and independently implementable by multiple software providers on multiple platforms without any intellectual property restrictions for necessary technology;
- Developed and maintained by an open standards organization with a well-defined inclusive process for evolution of the standard.
The file formats you use have a direct impact on your ability to open those files at a later date and on the ability of other people to access those data. There are generally two types of file format:
- Proprietary Format
- Open Format
A proprietary format is a file format owned by a company, organization, or individual, in which data are ordered and stored according to a particular encoding scheme that is designed to be secret, so that the stored data can only easily be decoded and interpreted with particular software or hardware that the company itself has developed.
An open format is a file format for storing digital data, defined by a published specification usually maintained by a standards organization, and which can be used and implemented by anyone.
You should save data in a non-proprietary (open) file format when possible. If conversion to an open data format will result in some data loss from your files, you might consider saving the data in both the proprietary format and an open format. Having at least some of the information available to you later will be better than having none of it available!
When it is necessary to save files in a proprietary format, consider including a readme.txt file in your directory that documents the name and version of the software used to generate the file, as well as the company who made the software. This could help you down the road if you need to figure out how to open these files again!
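For example, the sketch below keeps the original proprietary spreadsheet, exports an open CSV copy, and notes the originating software in a readme.txt. The file names and software details are hypothetical, and the pandas and openpyxl libraries are assumed to be installed.

```python
# A minimal sketch: keep the original proprietary file, export an open copy,
# and note the originating software in a readme. File names and software
# details are hypothetical examples.
import pandas as pd

source = "experiment_results.xlsx"    # proprietary spreadsheet format
open_copy = "experiment_results.csv"  # open, plain-text copy

df = pd.read_excel(source)            # reading .xlsx requires the openpyxl engine
df.to_csv(open_copy, index=False)     # note: formulas and formatting are not preserved

with open("readme.txt", "a", encoding="utf-8") as f:
    f.write(
        f"{source}: created with Example Spreadsheet 2021 (Example Corp); "
        f"an open CSV copy is kept as {open_copy}.\n"
    )
```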
Examples of preferred formats by data type:
- Containers: TAR, GZIP, ZIP
- Databases: XML, CSV
- Geospatial: SHP, DBF, GeoTIFF, NetCDF
- Moving images: MOV, MPEG, AVI, MXF
- Sounds: WAVE, AIFF, MP3, MXF
- Statistics: ASCII, DTA, POR, SAS, SAV
- Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
- Tabular data: CSV
- Text: XML, PDF/A, HTML, ASCII, UTF-8
- Web archive: WARC
See the Library of Congress' Sustainability of Digital Formats website for more complete listings and discussions of formats, including guidance for the preservation of data sets, geospatial data, and web archives, or visit the LOC's page on Recommended Format Specifications for preservation.
Data Storage Guidelines
State how often the data will be backed up, to which locations, and how many copies will be made.
Storing data on laptops, computer hard drives, or external storage devices alone is very risky. The use of robust, managed storage provided by university IT teams is preferable. Similarly, it is normally better to use automatic backup services provided by IT Services than to rely on manual processes.
When deciding on what type of storage solution you will use, you will need to think about several things, such as how much storage you need, what your budget is for storage, what platform you are using, and whether you have data security issues.
Questions to consider:
- Do you have sufficient storage or will you need to include charges for additional services?
- How will the data be backed up?
- Who will be responsible for backup and recovery?
- How will the data be recovered in the event of an incident?
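As a simple illustration of an automated backup step (not a substitute for institutionally managed storage and backup services), the sketch below copies a project data directory to a date-stamped folder on a second storage location; the paths are hypothetical.

```python
# A minimal backup sketch: copy the project data directory to a date-stamped
# folder on a second storage location. Paths are hypothetical; managed,
# automatic backup services from institutional IT remain the preferred option.
import shutil
from datetime import date
from pathlib import Path

DATA_DIR = Path("project_data")                       # working copy
BACKUP_ROOT = Path("/mnt/backup_drive/project_data")  # second, separate location

def backup_data():
    """Copy DATA_DIR to a new, date-stamped directory under BACKUP_ROOT."""
    target = BACKUP_ROOT / date.today().isoformat()
    shutil.copytree(DATA_DIR, target)  # raises an error if today's target already exists
    return target

if __name__ == "__main__":
    print("Backed up to", backup_data())
```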
Useful Links
Data Management Planning Tool
DMPTool for Data Management Plans
Checklist for a Data Management Plan