Dataset sources
Overview
Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia. A dataset (also spelled data set) is a structured and stable collection of data generally associated with a unique body of work (for example, a research study). To be reusable for research purposes, a dataset needs to be FAIR (findable, accessible, interoperable, reusable). This means that it needs, for example, a unique identifier such as a DOI or URN, sufficient metadata including provenance and creator information, and a license enabling reuse. Datasets also need to fulfill discipline-specific requirements and standards. For more about the difference between data and a dataset, see Data types.
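As a rough illustration of the FAIR minimum described above, the following sketch checks whether a metadata record carries an identifier, creator information, provenance, and a license. The field names and the example record are hypothetical and chosen only for this example; real metadata schemas define their own required properties.

```python
from __future__ import annotations

# Hypothetical field names chosen for illustration; real metadata schemas
# define their own required properties.
REQUIRED_FIELDS = ("identifier", "creators", "provenance", "license")


def missing_fair_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


example_record = {
    "identifier": "doi:10.1234/example",  # placeholder, not a real DOI
    "creators": ["Example Researcher"],
    "provenance": "Collected in a hypothetical field survey",
    "license": "",  # no license has been chosen yet
}

# Lists the fields that still need attention before the dataset is published.
print(missing_fair_fields(example_record))
```

A check like this covers only the presence of basic metadata. It does not verify that the identifier resolves or that the license suits the intended reuse.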
Datasets are the cornerstone of data-driven computing and data analysis. Working at the dataset level lets you focus on the origin, life cycle and ethical use of data resources instead of the technicalities of single data files or computing methods. CSC provides services for dataset-oriented research and develops future services to better support datasets and other higher-level aspects of data.
Note
The ownership, copyrights and license of data are often best defined for the whole dataset, though in some cases finer-grained definitions might be needed. In scientific writing, a dataset is usually cited as a single entity.
Gathering datasets
The first phase of dataset-driven work is gathering the datasets. You can locate and take into use existing, well-defined datasets, or create new datasets by sourcing data and organizing it into datasets. The gathering phase lays the foundation on which data-driven computing and analysis are built. During this phase, the focus should be on making sure that the licenses and terms of use for the data are known and match the intended use, that the origin of the data is known so that provenance can be documented, and that the data is organized and documented well.
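One way to keep the origin and license of gathered data on record is to store a small provenance file next to each data file. The sketch below is only an illustration under assumptions: the record fields, the `.provenance.json` suffix, and the function names are hypothetical and not part of any CSC service.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(path: Path, source: str, license_name: str) -> dict:
    """Build a small provenance record for one data file (hypothetical fields)."""
    return {
        "file": path.name,
        "sha256": sha256_of(path),
        "source": source,
        "license": license_name,
        "retrieved": datetime.now(timezone.utc).isoformat(),
    }


def write_provenance(path: Path, source: str, license_name: str) -> Path:
    """Write the record next to the data file as <name>.provenance.json."""
    record = provenance_record(path, source, license_name)
    out = path.with_name(path.name + ".provenance.json")
    out.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out
```

Calling `write_provenance(Path("data.csv"), "where the file came from", "license name")` would create `data.csv.provenance.json` in the same directory. The checksum makes it possible to verify later that the file has not changed since it was gathered.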
What if the dataset terms of use are made by the producers themselves?
The dataset creator reserves the right to specify the terms of use for their
dataset, even without ready-made licenses. In such cases, the terms of use
specified must be observed, but you can also negotiate the terms of use by
contacting the owner of the dataset.
What if no terms of use have been specified for data?
If there are no terms of use or the terms of use given are unclear, you should
always contact the owner of the dataset in question.
Discover research data
When you use or reuse data collected or produced by others, you need to know its origin, content, location, license, restrictions of use, and other necessary information. Search services include descriptive information (metadata) on research datasets. The better a dataset is described, the easier it is to find and use. Existing research datasets may be available for reuse.
See CSC's services for discovering datasets
Specific datasets hosted in CSC computing environment
CSC also hosts or provides access to several datasets on different platforms.
Biosciences
- Chipster_genomes: tool to download aligner indexes used by the Chipster software to Puhti
- AlphaFold databases are available on Puhti
Chemistry
- CSD (Cambridge Structural Database): organic and metallo-organic crystal structures and tools
- Molport 6M molecule database preprocessed for fast GPU screening with Schrödinger Shape
Geosciences
Language research and other digital humanities and social sciences
- The latest versions of
CLARIN PUB or ACA licensed corpora are
available unpacked on Puhti in
/appl/data/kielipankki/
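
To see what is available under that directory, a short script along the following lines could be run on Puhti. This is a minimal sketch: only the path comes from the text above, while the function name is hypothetical and nothing is assumed about how the corpora are organized below the top level.

```python
from __future__ import annotations

from pathlib import Path

# Path taken from the text above; it exists only on Puhti.
CORPUS_ROOT = Path("/appl/data/kielipankki/")


def list_corpora(root: Path = CORPUS_ROOT) -> list[str]:
    """Return the sorted names of entries directly under the corpus root."""
    if not root.is_dir():
        return []  # not on Puhti, or the directory is not available
    return sorted(entry.name for entry in root.iterdir())


for name in list_corpora():
    print(name)
```

On a machine where the directory does not exist, the script prints nothing instead of raising an error.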
Processing and analyzing data
Read more in CSC's Data analysis guide