Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia. A dataset (also spelled 'data set') is a structured and stable collection of data generally associated with a unique body of work (for example, a research study). For a dataset to be reusable for research purposes, it needs to be FAIR (findable, accessible, interoperable, reusable). This means that it needs, for example, a unique identifier such as a DOI or URN, sufficient metadata including provenance and creator information, and a license enabling reuse. Datasets also need to fulfill discipline-specific requirements and standards. For more on the difference between data and datasets, see Data types.
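The FAIR requirements listed above can be illustrated with a minimal sketch that checks whether a metadata record has a unique identifier, creator and provenance information, and a license. The field names (`identifier`, `creators`, `provenance`, `license`), the function name, and the sample record are hypothetical and do not correspond to the schema of any particular CSC service.

```python
# Hypothetical field names for illustration only; real metadata schemas differ.
REQUIRED_FIELDS = ("identifier", "creators", "provenance", "license")


def missing_fair_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Sample record with placeholder values; the provenance field is left empty.
record = {
    "identifier": "doi:10.xxxx/example",
    "creators": ["Example Researcher"],
    "provenance": "",
    "license": "CC-BY-4.0",
}

print(missing_fair_fields(record))  # prints ['provenance']
```

A check like this only verifies that the fields are filled in; whether the content meets discipline-specific standards still has to be assessed separately.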
Datasets are the cornerstone of data-driven computing and data analysis. They make it possible to focus on the origin, life cycle, and ethical use of data resources instead of the technicalities of single data files or computing methods. CSC provides services for dataset-oriented research and develops future services to better support datasets and other higher-level aspects of data.
The ownership, copyrights, and license of data are often best defined for the whole dataset, though in some cases finer-grained definitions might be needed. In scientific writing, a dataset is usually cited as a single entity.
Discover research data
When using or reusing data collected or produced by others, you need to know its origin, content, location, license, restrictions on use, and other relevant information. Search services provide descriptive information (metadata) on research datasets. The better a dataset is described, the easier it is to find and use. Existing research datasets may be available for reuse.
Specific datasets hosted in the CSC computing environment
CSC also hosts or provides access to several datasets on different platforms.
- CSD - Cambridge Structural Database – organic and metallo-organic crystal structures and tools
- Molport 6M molecule database preprocessed for fast GPU screening with Schrödinger Shape
Language research and other digital humanities and social sciences
- The latest versions of CLARIN PUB or ACA licensed corpora are available unpacked in /appl/data/kielipankki/
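To see what is available under the corpus directory mentioned above, a short script can list its contents. This is a minimal sketch: the helper name `list_subdirectories` is hypothetical, and it assumes that each corpus is stored in its own subdirectory, which the text above does not confirm. If the directory does not exist on the current system, the function returns an empty list.

```python
from pathlib import Path


def list_subdirectories(root: str) -> list[str]:
    """Return the sorted names of the subdirectories directly under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Directory is missing, e.g. when not running in the CSC environment.
        return []
    return sorted(entry.name for entry in root_path.iterdir() if entry.is_dir())


# Assumption: one subdirectory per corpus under the path given above.
for name in list_subdirectories("/appl/data/kielipankki/"):
    print(name)
```

Check the license of each corpus before use, since the access conditions differ between PUB and ACA licensed material.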
Processing and analysing data
Read more in CSC's Data analysis guide