Scientific Data Reduction Benchmarks

This site has been established as part of the ECP CODAR project.

 

This site provides reference scientific datasets, data reduction techniques, error metrics, error controls and error assessment tools for users and developers of scientific data reduction techniques. 

 

Important: when publishing results from one or more datasets presented in this webpage, please:

        Cite: SDRBench: https://sdrbench.github.io

        Please also cite: K. Zhao, S. Di, X. Liang, S. Li, D. Tao, J. Bessac, Z. Chen, and F. Cappello, “SDRBench: Scientific Data Reduction Benchmark for Lossy Compressors”, International Workshop on Big Data Reduction (IWBDR2020), in conjunction with IEEE Bigdata20.

        Acknowledge: the source of the dataset you used, the DOE NNSA ECP project, and the ECP CODAR project.

        Check: the condition of publications (some dataset sources request prior check)

        Contact: the compressor authors to get the correct compressor configuration according to each dataset and each comparison metrics.

        Dimension: the order of the dimensions shown in the 'Format' column of the table is in row-major order (aka. C order), which is consistent with well-known I/O libraries such as HDF5. For example, for the CESM-ATM dataset (1800 x 3600), 1800 is higher dimension (changing slower) and 3600 is lower dimension (changing faster). For most compressors (such as SZ, ZFP and FPZIP), the dimensions should be given in the reverse order (such as -2 3600 1800) for their executables. If you are not sure about the order of dimension, one simple method is trying different dimension orders and selecting the results with highest compression ratios.

 

Data sets:

 

Name

Type

Format

Size (data)

Command Examples

Link

CESM-ATM

Source:

Mark Taylor (SNL)

Climate simulation

 

Dataset1 : 79 fields: 2D, 1800 x 3600 ;

Dataset2 : 1 field :

3D, 26x1800x3600.

Both are single precision, binary

 

 

8 bit Entropy

32 bit Entropy

avg

6.5

19.8

min

1.5

2.3

max

7.8

22.6

Dataset1 (raw): 1.47GB

Dataset1 (cleared): 1.47GB

Dataset2: 17GB

 

(cleared data zeroed all background data)

SZ(Compress): sz -z -f -i CLDHGH_1_1800_3600.f32 -M REL -R 1E-2 -2 3600 1800 

SZ(Decompress): sz -x -f -i CLDHGH_1_1800_3600.f32 -2 3600 1800 -s CLDHGH_1_1800_3600.f32.sz -a

ZFP: zfp -f -i CLDHGH_1_1800_3600.f32 -z CLDHGH_1_1800_3600.f32.zfp -o CLDHGH_1_1800_3600.f32.zfp.out -2 3600 1800 -a 1E-2 -s

LibPressio: pressio -b compressor=$COMP -i CLDHGH_1_1800_3600.f32 -d 3600 -d 1800 -t float -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc

 

Z-checker-installer:  ./runZCCase.sh -f REL CESM-ATM raw-data-dir f32 3600 1800

Dataset1 (raw)

Dataset1 (cleared)

Dataset2 (raw)

Metadata

 

Dataset1’s property

Dataset2’s property

EXAALT

Source:

EXAALT team

This dataset has been approved for unlimited release by Los Alamos National Laboratory and has been assigned LA-UR-18-25670.

Molecular dynamics simulation

6 fields: x,y,z,vx,vy,vz,

Each field stored separately,

Single precision,

Binary, Little-endian

 

Data1

8 bit Entropy

32 bit Entropy

avg

7.3

19.9

min

7

18.4

max

7.4

20.4

Data2

8 bit Entropy

32 bit Entropy

avg

6.7

15.4

min

6

9.40

max

7.1

20.1

Data3

8 bit Entropy

32 bit Entropy

avg

6.9

16.3

min

6.8

15.6

max

7.2

17.6

Dataset1:

 60 MB

Dataset2:

973 MB

Dataset3:

2.4 GB

SZ(Compress): sz -z -f -i xx.f32 -M REL -R 1E-2 -1 2869440

SZ(Decompress): sz -x -f -i xx.f32 -1 2869440 -s xx.f32.sz -a

ZFP: zfp -f -i xx.f32 -z xx.f32.zfp -o xx.f32.zfp.out -1 2869440 -a 1E-2 -s

LibPressio: pressio -b compressor=$COMP -i xx.f32 -d 2869440 -t float -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -f REL EXAALT raw-data-dir f32 2869440

Dataset1

Metadata1

Property1

Dataset2

Metadata2

Property2

Dataset 3

Metadata3

Property3

                       

Hurricane ISABEL

Source:

http://vis.computer.org/ vis2004contest/data.html

                                    

Weather simulation

 

13 fields: 3D, 100x500x500, single-precision, binary (cleared dataset by replacing background by 0)

 

 

8 bit Entropy

32 bit Entropy

avg

5.9

18.9

min

0.7

1.3

max

7.5

24.2

                                     

1.25GB

 

SZ(Compress): sz -z -f -i Pf48.bin.f32 -M REL -R 1E-2 -3 500 500 100

SZ(Decompress): sz -x -f -i Pf48.bin.f32 -3 500 500 100 -s Pf48.bin.f32.sz -a

ZFP: zfp -f -i Pf48.bin.f32 -3 500 500 100 -z Pf48.bin.f32.zfp -o Pf48.bin.f32.zfp.out -a 1E-2 -s

LibPressio: pressio -b compressor=$COMP -i Pf48.bin.f32 -d 500 -d 500 -d 100  -t float -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc

Z-checker-installer: ./runZCCase.sh -f REL Hurricane raw-data-dir f32 500 500 100

                                    

Dataset

Metadata

Property

 

EXAFEL

Source:

LCLS

Images from the LCLS instrument

1 field: 2D,

Single precision

HDF5 and binary

 

8 bit Entropy

32 bit Entropy

7.4

25.7

51 MB, 

8.5 GB,

1GB

SZ(Compress): sz -z -f -i EXAFEL-LCLS-986x32x185x388.f32 -M REL -R 1E-2 -2 388 5837120

SZ(Decompress): sz -x -f -i EXAFEL-LCLS-986x32x185x388.f32 -s EXAFEL-LCLS-986x32x185x388.f32.sz -2 388 5837120 -a

ZFP: zfp -f -a 1E-2 -i EXAFEL-LCLS-986x32x185x388.f32 -z EXAFEL-LCLS-986x32x185x388.f32.zfp -o EXAFEL-LCLS-986x32x185x388.f32.zfp.out -2 388 5837120 -s

LibPressio: pressio -b compressor=$COMP -i EXAFEL-LCLS-986x32x185x388.f32 -d 5837120 -d 388  -t float -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -f REL EXAFEL raw-data-dir f32 388 5837120

Dataset1

Dataset2

Metadata

Property

 

Dataset3

Metadata

 

HACC

Source:

HACC team

(ECP EXASKY)

Cosmology:

particle simulation

 

1 snapshot: 6 fields x,y,z,vx,vy,vz)

Each field stored separately,

Single precision,

Binary, Little-endian

 

 

8 bit Entropy

32 bit Entropy

avg

7.2

25.3

min

7.1

24.7

max

7.4

25.9

19 GB

 

 

 

5 GB

SZ(Compress): sz -z -f -i xx.f32 -M ABS -A 0.003 -1 1073726487

SZ(Decompress): sz -x -f -i xx.f32 -s xx.f32.sz -1 1073726487 -a

ZFP: zfp -f -i xx.f32 -z xx.f32.zfp -o xx.f32.zfp.out -a 0.003 -1 1073726487

LibPressio: pressio -b compressor=$COMP -i xx.f32 -d 1073726487 -t float -o abs=3e-3  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -f ABS HACC raw-data-dir f32 1073726487

Dataset1

Metadata1

Property1

Dataset2

Metadata2

Property2

NYX

Source:

Lukic et al. “methods: numerical, intergalactic medium, quasars: absorption lines, large-scale structure of universe”, journal of Monthly Notices of Royal Astronomical Society

Cosmology:

Adaptive mesh hydrodynamics + N-body cosmological simulation

6 fields, 3D,

512 x 512 x 512

Single precision,

Binary, Little-endian

 

 

8 bit Entropy

32 bit Entropy

avg

7.2

25.1

min

7.0

24

max

7.4

25.9

2.7 GB

SZ(Compress): sz -z -f -i temperature.f32 -M REL -R 1E-2 -3 512 512 512

SZ(Decompress): sz -x -f -i temperature.f32 -s temperature.f32.sz -3 512 512 512 -a 

ZFP: zfp -f -i temperature.f32 -z temperature.f32.zfp -o temperature.f32.zfp.out -3 512 512 512 -a 1E-2 -s

LibPressio: pressio -b compressor=$COMP -i temperature.f32 -d 512 -d 512 -d 512 -t float -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -f REL NYX raw-data-dir f32 512 512 512

Dataset

Metadata

Property

NWChem

Source:

Example molecular 2-electron integral values generated by libint, a library developed by the Valeev research group at Virginia Tech.  See https://github.com/evaleev/libint   Libint is an integral evaluation option in NWChemEx.

 Two-electron repulsion integrals computed over Gaussian-type orbital basis sets

3 fields, 1D,

Double precision,

Binary, Little-endian   

16GB

SZ(Compress): sz -z -d -i 631-tst.bin.d64 -M REL -R 1E-2 -1 102953248

SZ(Decompress): sz -x -d -i 631-tst.bin.d64 -s 631-tst.bin.d64.sz -1 102953248 -a

ZFP: zfp -d -i 631-tst.bin.d64 -z 631-tst.bin.d64.zfp -o 631-tst.bin.d64.zfp.out -a 1E-2 -1 102953248 -s

LibPressio: pressio -b compressor=$COMP -i 631-tst.bin.d64 -d 102953248  -t double -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -d REL NWCHEM raw-data-dir d64 102953248

Dataset

Metadata

Property

 

 

SCALE-LETKF

Source:

simulation data are generated by The Local Ensemble Transform Kalman Filter (LETKF) data assimilation package for the SCALE-RM weather model (provided by RIKEN)

Contact: Guo-yuan Lien (guoyuan.lien@gmail.com)

Climate simulation

13 fields, 3D, Single precision, binary, little-endian

 

 

8 bit Entropy

32 bit Entropy

avg

6.7

21.7

min

4.6

10.2

max

7.4

26.1

4.9GB

SZ(Compress): sz -f -z -i T-98x1200x1200.f32 -M REL -R 1E-2 -3 1200 1200 98

SZ(Decompress): sz -f -x -i T-98x1200x1200.f32 -s T-98x1200x1200.f32.sz -3 1200 1200 98 -a

ZFP: zfp -f -i T-98x1200x1200.f32 -z T-98x1200x1200.f32.zfp -o T-98x1200x1200.f32.zfp.out -a 1E-2 -3 1200 1200 98 -s

LibPressio: pressio -b compressor=$COMP -i T-98x1200x1200.f32 -d 1200 -d 1200 -d 98  -t float -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -f REL SCALE-LETKF raw-data-dir f32 1200 1200 98

Dataset

Metadata

Property

QMCPACK

Source:

QMCPACK performance test

(contact: Ye Luo: yeluo@anl.gov)

Many-body ab initio Quantum Monte Carlo (electronic structure of atoms, molecules, and solids)

1 field, 288 orbitals, 3D,

69 x 69 x 115,

Single precision and

Binary, Little endian

 

8 bit Entropy

32 bit Entropy

7.6

26.1

1 GB

SZ(Compress): sz -z -f -i einspline_288_115_69_69.pre.f32 -M REL -R 1E-2 -3 69 69 33120

SZ(Decompress): sz -x -f -i einspline_288_115_69_69.pre.f32 -s einspline_288_115_69_69.pre.f32.sz -3 69 69 33120 -a

ZFP: zfp -f -i einspline_288_115_69_69.pre.f32 -z einspline_288_115_69_69.pre.f32.zfp -o einspline_288_115_69_69.pre.f32.zfp.out -a 1E-2 -3 69 69 33120

LibPressio: pressio -b compressor=$COMP -i einspline_288_115_69_69.pre.f32 -d 69 -d 69 -d 33120 -t float -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc

Z-checker-installer: ./runZCCase.sh -f REL QMCPack raw-data-dir f32 69 69 33120

Dataset

Metadatat

Property

Miranda

Source: hydrodynamics code for large turbulence simulations

(contact: Peter Lindstrom lindstrom2@llnl.gov)

hydrodynamics code for large turbulence simulations

3D, totally 7 fields,

each field is

256x384x384,

Double precision and Binary, Little endian

 

 

8 bit Entropy

32 bit Entropy

avg

7.1

22.5

min

3.62

7.2

max

7.7

25.1

Small

1.87GB

Big

106GB

SZ(Compress): sz -z -d -i density.d64 -M REL -R 1E-2 -3 384 384 256

SZ(Decompress): sz -x -d -i density.d64 -s density.d64.sz -3 384 384 256 -a

ZFP: zfp -d -3 384 384 256 -i density.d64 -z density.d64.zfp -o density.d64.zfp.out -a 1E-2 -s

LibPressio: pressio -b compressor=$COMP -i density.d64 -d 384 -d 384 -d 256 -t double -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -d REL Miranda raw-data-dir d64 384 384 256

Small (256x384x384)

Dataset

Metadata

Property

 

Big (3072x3072x3072)

Dataset

Metadata

Property

 

S3D

Source:

Kolla, Hemanth (hnkolla@sandia.gov)

Combustion simulation

 

11 fields, 3D, 500x500x500,

Double precision

Binary, Little-endian

 

 

8 bit Entropy

32 bit Entropy

avg

7.6

21.9

min

7.6

21.8

max

7.6

22

 44 GB

SZ(Compress): sz -d -z -i stat_planar.1.1000E-03.field.d64 -M REL -R 1E-2 -3 500 500 500

SZ(Decompress): sz -d -x -i stat_planar.1.1000E-03.field.d64 -s stat_planar.1.1000E-03.field.d64 -3 500 500 500 -a

ZFP: zfp -d -3 500 500 500 -i stat_planar.1.1000E-03.field.d64 -z stat_planar.1.1000E-03.field.d64.zfp -o stat_planar.1.1000E-03.field.d64.zfp.out -a 1E-2 -s

LibPressio: pressio -b compressor=$COMP -i stat_planar.1.1000E-03.field.d64 -d 500 -d 500 -d 500  -t double -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc


Z-checker-installer: ./runZCCase.sh -f REL S3D raw-data-dir f32 500 500 500

Dataset

Metadata

Property

XGC

Source:

Princeton Plasma Physics Laboratory (PPPL)

https://jychoi-hpc.github.io/adios-python-docs/XGC-mesh-and-field-data.html

Fusion Simulation

 9 timesteps, 3D,

unstructured mesh

(the mesh data is in the archive),

 20694x512,

Double precision

Binary, Little-endian

1.2 GB

SZ(Compress): NA

SZ(Decompress): NA

ZFP: NA

LibPressio: N/A

Z-checker-installer: NA

 

(the files are stored in .bp format, you need to install ADIOS to read them)

 

Dataset

Metadata

NSTX GPI

Source:

Michael Churchill, Princeton Plasma Physics Laboratory (PPPL)

- Copyright: Free to use but need to check with Michael Churchil (rchurchi@pppl.gov) before publishing results.

Fusion Gas Puff Image (GPI) data

369,357 steps,

2D time-series data (movie), 80x64 image with

Double precision

Binary, Little-endian

4.1 GB

SZ(Compress): NA

SZ(Decompress): NA

ZFP: NA

LibPressio: N/A

Z-checker-installer: NA

 

(the files are stored in .bp format, you need to install ADIOS to read them)

Dataset

Metadata

Brown Samples

Source:

Brown University

Synthetic, generated to specified regularity

1D,

Double precision

Binary, Little-endian

(3 datasets with 3 different regularity)

 

 

8 bit Entropy

32 bit Entropy

avg

7.5

23

min

7.5

23

max

7.6

23

 

 

256 MB

256 MB

256 MB

SZ(Compress): sz -z -d -i sample_r_B_0.5_26.d64 -M REL -R 1E-2 -1 33554433

SZ(Decompress): sz -x -d -i sample_r_B_0.5_26.d64 -s sample_r_B_0.5_26.d64.sz -1 33554433 -a 

ZFP: zfp -d -1 33554433 -i sample_r_B_0.5_26.d64 -z sample_r_B_0.5_26.d64.zfp -o sample_r_B_0.5_26.d64.zfp.out -a 1E-2 -s

LibPressio: pressio -b compressor=$COMP -i sample_r_B_0.5_26.d64 -d 33554433 -t double -o rel=1e-2  -m time -m size -M all

where $COMP can be sz, zfp, sz3, mgard, etc

Z-checker-installer: ./runZCCase.sh -d REL Brown-samples raw-data-dir d64 33554433

Dataset

Metadata

Property

 

Note: This table will be augmented with metrics that matter for users of these datasets as well as recommended settings for error control (lossy compression).

 

                                   

In general, the extension of the data file is named in the following convention:

      .f32 means single-precision floating point data in little-endian type   

      .F32 means single-precision floating point data in big-endian type   

      .f64 means double-precision floating point data in little-endian type 

      .F64 means double-precision floating point data in big-endian type

 

Others: please submit your proposal of datasets to codar-reduction (at) cels.anl.gov. Requirements: datasets will be open to public access. Dataset should be linked to a simulation application or a scientific instrument. Metadata should explain the source origin of the dataset and how it has been produced (what simulation, what instrument, what settings). Upon review by the SDRBenchmarks committee, the dataset will (or will not) be added to the SDRBenchmarks repository.

 

Lossy compressors:

      SZ: https://github.com/szcompressor/SZ

      ZFP: https://github.com/LLNL/zfp

      TTHRESH: https://github.com/rballester/tthresh

      MGARD: https://github.com/CODARcode/MGARD

      ISABELA: http://freescience.org/cs/ISABELA/ISABELA.html

      DCTZ: https://github.com/swson/DCTZ

      Digit rounding: https://github.com/CNES/Digit_Rounding

(standalone version: https://github.com/disheng222/digitroundingZ )

      Bit grooming: https://github.com/nco/nco

(standalone version: https://github.com/disheng222/BitGroomingZ.git )

      Others: please submit your suggestion of lossy compressor, with github link to codar-reduction (at) cels.anl.gov

 

Unifying generic Interface for Compression:

      LibPressio: https://robertu94.github.io/libpressio/

 

Lossless compressors:

      FPZIP: https://computation.llnl.gov/projects/floating-point-compression

      ZFP: https://github.com/LLNL/zfp

      FPC: http://cs.txstate.edu/~burtscher/research/FPC/.

      pFPC (parallel) http://cs.txstate.edu/~burtscher/research/pFPC/.

      SPDP http://cs.txstate.edu/~burtscher/research/SPDP/.

      GFC (GPUs): http://cs.txstate.edu/~burtscher/research/GFC/ 

      MPC (GPUs): http://cs.txstate.edu/~burtscher/research/MPC

      BLOSC: https://github.com/Blosc/c-blosc

      Others: please submit your suggestion of lossless compressor, with github link to codar-reduction (at) cels.anl.gov

 

Unifying generic Interface for Compression (lossy and lossless):

 

Commonly used metrics for reduction technique assessment:

      Reduction/reconstruction rate in [G|M|K]B/s

      Reduction ratio (Initial size/compressed size)

      [Lossy] Rate distortion (PSNR at different bit rates)

      [Lossy] PSNR in dB, MSE, RMSE and NRMSE

      [Lossy] SSIM (structural similarity index)

      [Lossy] Pearson correlation of the initial and reconstructed dataset

      [Lossy] Autocorrelation of the compression error (1D, … nD)

      [Lossy] Spectral alteration (difference of power spectrum)

      [Lossy] Preservation of the n order derivatives

      Others: please submit your suggestion of assessment metric, to codar-reduction (at) cels.anl.gov

 

Error controls:

      [Common] Point wise absolute error bound

      [Common] Point wise relative error bound

      [Variation] Value range point wise relative error bound

      [Other] Fixed PSNR

 

Assessment tools, metrics and error control:

      Z-checker: https://github.com/CODARcode/Z-checker

      Foresight: https://github.com/lanl/VizAly-Foresight

      SFS https://github.com/JulianKunkel/statistical-file-scanner

      Others: please submit your suggestion of assessment tools, with github link to codar-reduction (at) cels.anl.gov

 

Contributors/maintainers:

      F. Cappello (ANL, lead),

      M. Ainsworth (Brown University),

      J. Bessac (ANL),

      Martin Burtscher (Texas State University),

      Jong Youl Choi (ORNL)

      E. Constantinescu (ANL),

      S. Di (ANL),

      H. Guo (ANL),

      Peter Lindstrom (LLNL),

      Ozan Tugluk (Brown University).

 

We sincerely acknowledge that the following contributors agreed on the release of their scientific datasets:

Salman Habib (ANL), Katrin Heitmann (ANL), Zarija Lubik (LBNL), Rob Jacob (ANL), Danny Perez (LANL), Chun Hong Yoon (SLAC), Kristopher W. Keipert (ANL), Guo-Yuan Lien (Central Weather Bureau Taiwan), Ye Luo (ANL), Peter Lindstrom (LLNL), Hemanth Kolla (SNL), Jong Choi (ORNL), Michael Churchill (PPPL), Ozan Tugluk (Brown University)

 

 

Sponsors: