Scientific Data Reduction Benchmarks

This site has been established as part of the ECP CODAR project.

This site provides reference scientific datasets, data reduction techniques, error metrics, error controls and error assessment tools for users and developers of scientific data reduction techniques.  

Important: when publishing results from one or more datasets presented in this webpage, make sure to:

 

Data sets:

 

Each entry lists the dataset name and source, followed by its type, format, size (data), and download links.

CESM-ATM
Source: Mark Taylor (SNL)
Type: Climate simulation
Format: Dataset1: 79 fields, 2D, 1800 x 3600. Dataset2: 1 field, 3D, 26 x 1800 x 3600. Both are single precision, binary.
Size (data): 1.47 GB + 17 GB
Link: Dataset1, Dataset2, Metadata

EXAALT
Source: EXAALT team. This dataset has been approved for unlimited release by Los Alamos National Laboratory and has been assigned LA-UR-18-25670.
Type: Molecular dynamics simulation
Format: 6 fields (x, y, z, vx, vy, vz), each field stored separately, single precision, binary, little-endian
Size (data): 60 MB
Link: Dataset, Metadata

Hurricane ISABEL
Source: http://vis.computer.org/vis2004contest/data.html
Type: Climate simulation
Format: 13 fields, 3D, 100 x 500 x 500, single precision, binary (cleaned dataset: background values replaced by 0)
Size (data): 1.25 GB
Link: Dataset, Metadata

EXAFEL
Source: LCLS
Type: Images from the LCLS instrument
Format: 2D, single precision, HDF5 and binary
Size (data): 51 MB
Link: Dataset, Metadata

HACC
Source: HACC team (ECP EXASKY)
Type: Cosmology: particle simulation
Format: 1 snapshot, 6 fields (x, y, z, vx, vy, vz), each field stored separately, single precision, binary, little-endian
Size (data): 19 GB; 5 GB
Link: Dataset1, Metadata; Dataset2, Metadata

NYX
Source: Lukic et al., “methods: numerical, intergalactic medium, quasars: absorption lines, large-scale structure of universe”, Monthly Notices of the Royal Astronomical Society
Type: Cosmology: adaptive mesh hydrodynamics + N-body cosmological simulation
Format: 6 fields, 3D, 512 x 512 x 512, single precision, binary, little-endian
Size (data): 2.7 GB
Link: Dataset, Metadata

NWChem
Source: Example molecular 2-electron integral values generated by libint, a library developed by the Valeev research group at Virginia Tech. See https://github.com/evaleev/libint for the library. Libint is an integral evaluation option in NWChemEx.
Type: Two-electron repulsion integrals computed over Gaussian-type orbital basis sets
Format: 3 fields, 1D, double precision, binary, little-endian
Size (data): 16 GB
Link: Dataset, Metadata

SCALE-LETKF
Source: Simulation data generated by the Local Ensemble Transform Kalman Filter (LETKF) data assimilation package for the SCALE-RM weather model (provided by RIKEN). Contact: Guo-yuan Lien (guoyuan.lien@gmail.com)
Type: Climate simulation
Format: 13 fields, 3D, single precision, binary, little-endian
Size (data): 4.9 GB
Link: Dataset, Metadata

QMCPACK
Source: QMCPACK performance test (contact: Ye Luo, yeluo@anl.gov)
Type: Many-body ab initio Quantum Monte Carlo (electronic structure of atoms, molecules, and solids)
Format: 288 orbitals, 3D, 69 x 69 x 115, single precision, binary, little-endian
Size (data): 1 GB
Link: Dataset, Metadata

S3D
Source: Hemanth Kolla (hnkolla@sandia.gov)
Type: Combustion simulation
Format: 11 fields, 3D, 500 x 500 x 500, double precision, binary, little-endian
Size (data): 44 GB
Link: Dataset, Metadata

XGC
Source: Princeton Plasma Physics Laboratory (PPPL), https://jychoi-hpc.github.io/adios-python-docs/XGC-mesh-and-field-data.html
Type: Fusion simulation
Format: 9 timesteps, 3D, unstructured mesh (the mesh data is in the archive), 20694 x 512, double precision, binary, little-endian
Size (data): 1.2 GB
Link: Dataset, Metadata

NSTX GPI
Source: Michael Churchill, Princeton Plasma Physics Laboratory (PPPL)
Copyright: Free to use, but check with Michael Churchill (rchurchi@pppl.gov) before publishing results.
Type: Fusion Gas Puff Image (GPI) data
Format: 369,357 steps, 2D time-series data (movie), 80 x 64 image, double precision, binary, little-endian
Size (data): 4.1 GB
Link: Dataset, Metadata

Brown Samples
Source: Brown University
Type: Synthetic, generated to specified regularity
Format: 1D, double precision, binary, little-endian (3 datasets with 3 different regularities)
Size (data): 256 MB per dataset (3 datasets)
Link: Dataset, Dataset, Dataset, Metadata

 

Note: This table will be augmented with metrics that matter for users of these datasets as well as recommended settings for error control (lossy compression).
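Most entries in the table above are described as binary files in little-endian single or double precision, with the array dimensions given only in the Format column. The following is a minimal sketch of reading one such field with NumPy. It assumes the file is a raw array with no header and that the last listed dimension varies fastest (C order); both assumptions should be checked against each dataset's metadata. The function name load_raw_field is illustrative and not part of any SDRBench tool.

```python
import numpy as np


def load_raw_field(path, shape, dtype="<f4"):
    """Load one raw binary field into an array of the given shape.

    dtype "<f4" means little-endian single precision; use "<f8" for the
    double-precision datasets. The shape is not stored in the file and
    must be supplied from the table or the dataset's metadata.
    """
    expected = int(np.prod(shape))
    data = np.fromfile(path, dtype=np.dtype(dtype))
    if data.size != expected:
        raise ValueError(
            f"{path}: found {data.size} values, expected {expected} for shape {shape}"
        )
    # Assumption: C (row-major) order, i.e. the last dimension varies fastest.
    return data.reshape(shape)
```

For example, a 512 x 512 x 512 single-precision field stored in a hypothetical file named field.f32 would be read with load_raw_field("field.f32", (512, 512, 512)). For files too large to hold in memory, np.memmap with the same dtype and shape may be a better fit than np.fromfile.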

                          

In general, data file extensions follow this naming convention:

Others: please submit dataset proposals to codar-reduction (at) cels.anl.gov. Requirements: datasets will be open to public access. Each dataset should be linked to a simulation application or a scientific instrument. Metadata should explain the origin of the dataset and how it was produced (what simulation, what instrument, what settings). After review by the SDRBenchmarks committee, the dataset will (or will not) be added to the SDRBenchmarks repository.

Lossy compressors:

Lossless compressors:

Commonly used metrics for reduction technique assessment:
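As an illustration only (this is not the site's own list), metrics often reported when assessing lossy reduction of floating-point fields include the maximum absolute error, RMSE, value-range-normalized RMSE, PSNR, and compression ratio. The sketch below computes them with NumPy; the function name reduction_metrics and the dictionary keys are hypothetical, and PSNR is defined here relative to the value range of the original field, which is one common convention among several.

```python
import math

import numpy as np


def reduction_metrics(original, reconstructed, compressed_nbytes):
    """Return common quality and size metrics for one field."""
    raw = np.asarray(original)
    orig = raw.astype(np.float64)
    recon = np.asarray(reconstructed, dtype=np.float64)
    if orig.shape != recon.shape:
        raise ValueError("original and reconstructed shapes differ")
    if compressed_nbytes <= 0:
        raise ValueError("compressed_nbytes must be positive")

    value_range = float(orig.max() - orig.min())
    if value_range == 0.0:
        raise ValueError("constant field: range-based metrics are undefined")

    err = recon - orig
    rmse = float(np.sqrt(np.mean(err ** 2)))
    # PSNR is infinite for an exact reconstruction.
    psnr_db = math.inf if rmse == 0.0 else 20.0 * math.log10(value_range / rmse)

    return {
        "max_abs_error": float(np.abs(err).max()),
        "rmse": rmse,
        "nrmse": rmse / value_range,
        "psnr_db": psnr_db,
        # In-memory size of the original array divided by the compressed size.
        "compression_ratio": raw.nbytes / compressed_nbytes,
    }
```

The compressed size is passed in as a byte count so that the sketch stays independent of any particular compressor.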

Error controls:
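As an illustration only (this is not the site's own list), two error-control modes commonly offered by error-bounded lossy compressors are an absolute bound on the pointwise error and a bound relative to the value range of the field. The sketch below converts a value-range-relative bound into an absolute one and checks whether a reconstruction respects it. The function names are hypothetical and do not correspond to any compressor's API.

```python
import numpy as np


def absolute_bound(original, rel_bound):
    """Convert a value-range-relative bound into an absolute error bound."""
    orig = np.asarray(original, dtype=np.float64)
    return rel_bound * float(orig.max() - orig.min())


def satisfies_bound(original, reconstructed, abs_bound):
    """Return True if every pointwise error is within abs_bound."""
    orig = np.asarray(original, dtype=np.float64)
    recon = np.asarray(reconstructed, dtype=np.float64)
    if orig.shape != recon.shape:
        raise ValueError("original and reconstructed shapes differ")
    return bool(np.max(np.abs(recon - orig)) <= abs_bound)
```

A check of this kind is typically run on the decompressed output of each compressor before any other metric is reported, since a violated bound invalidates the comparison.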

Assessment tools, metrics and error control:

Contributors/maintainers:

Sponsors: