Using Blosc2 NDim As A Fast Explorer Of The Milky Way (Or Any Other NDim Dataset)

Project Blosc; Francesc Alted; Marta Iborra; Oscar Guiñón; David Ibáñez; Sergio Barrachina

doi:10.25080/gerudo-f2bc6f59-000

Project Blosc
Project Blosc

Francesc Alted
Project Blosc

Marta Iborra
Project Blosc

Oscar Guiñón
Project Blosc

David Ibáñez
Project Blosc

Sergio Barrachina
Universitat Jaume I

Abstract

Large multidimensional datasets are widely used in engineering and scientific applications. Fast access to subsets of these datasets is crucial for an efficient exploration experience. To facilitate this, we have added support for large multidimensional datasets to Blosc2, a compression and format library. The extension includes a special encoding of zeros that allows for efficient handling of sparse datasets. Additionally, the new two-level data partition used in Blosc2 reduces the amount of data that is decompressed unnecessarily, further accelerating slicing.

The Blosc2 NDim layer enables the creation and reading of n-dimensional datasets in an extremely efficient manner. This is due to a completely general n-dimensional two-level partitioning, which allows for slicing and dicing arbitrarily large (and compressed) data in a more fine-grained way. Having a second partition provides more flexibility to fit the different partitions into the different CPU cache levels, making compression even more efficient.
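To make the benefit of the second partition level concrete, the following minimal sketch counts how many items would have to be decompressed to serve a slice, first when whole chunks are the smallest unit and then when smaller blocks inside each chunk can be decompressed individually. It is an illustration only and does not use the Blosc2 API: the function names, the array extent, and the chunk and block shapes are hypothetical, and it assumes that blocks evenly divide chunks and that no padding is involved.

```python
from math import prod


def touched(start, stop, part_shape):
    """Per-dimension ranges of partition indices overlapped by the
    half-open slice [start, stop)."""
    return [
        range(lo // p, (hi - 1) // p + 1)
        for lo, hi, p in zip(start, stop, part_shape)
    ]


def decompressed_items(start, stop, chunks, blocks=None):
    """Count the items decompressed to serve array[start:stop].

    One level (blocks=None): every touched chunk is decompressed whole.
    Two levels: only the touched blocks are decompressed.
    Assumption: blocks evenly divide chunks, so the block grid is aligned
    with the chunk grid.
    """
    part = chunks if blocks is None else blocks
    n_parts = prod(len(r) for r in touched(start, stop, part))
    return n_parts * prod(part)


# Hypothetical layout: a 1000 x 1000 x 1000 array from which we read a
# single 2D plane along the first axis.
start, stop = (250, 0, 0), (251, 1000, 1000)
chunks = (100, 100, 100)  # first-level partition (illustrative)
blocks = (10, 10, 10)     # second-level partition (illustrative)

requested = prod(hi - lo for lo, hi in zip(start, stop))
one_level = decompressed_items(start, stop, chunks)
two_level = decompressed_items(start, stop, chunks, blocks)

print(f"requested items:          {requested:>11,}")
print(f"decompressed, one level:  {one_level:>11,}")
print(f"decompressed, two levels: {two_level:>11,}")
```

With these illustrative shapes, the two-level layout decompresses ten times fewer items than the chunk-only layout for the same plane. The real ratio depends on the chunk and block shapes chosen and on the orientation of the slice.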

Additionally, Blosc2 can make use of Btune, a library that automatically finds the optimal combination of compression parameters to suit user needs. Btune employs various techniques, such as a genetic algorithm and a neural network model, to discover the best parameters for a given dataset quickly. This is a significant improvement over the traditional trial-and-error method, which can take hours or even days to find the best parameters.

As an example, we will demonstrate how Blosc2 NDim enables fast exploration of the Milky Way using the Gaia DR3 dataset.

Keywords

explore datasets, n-dimensional datasets, Gaia DR3, Milky Way, Blosc2, compression
