itsh5py#

While there are many ways to store different data types, most of them have drawbacks. hdf is a common format for storing large arrays. Sometimes it is practical to store arrays together with additional (pythonic) data in a single file. While hdf attributes support some types, many exceptions exist, especially for python types.

This is a small implementation of recursive dict support for python to write and read hdf files with many different pythonic data types. Almost all types implemented in default python and numpy should be supported, even in nested structures. The resulting files work in hdfview and Panoply with some small drawbacks.

A major convenience is the ability to store iterables like lists and tuples, even in nested form. Mixed types are also supported.

Conversion, obscuration or changes to the saved types are kept to a bare minimum. So if, for any reason, the files have to be used without itsh5py, all the data will be accessible with just a little added inconvenience.

Installation#

itsh5py is available on PyPI and can be readily installed via

pip install itsh5py

Run pip uninstall itsh5py in order to remove the package from your system.

To work, this requires some additional packages. Obviously, h5py is used for data storage. numpy is used for array handling. DataFrames are also supported using pandas. Finally, for serialization of difficult data types, yaml is used via pyyaml.

All the source packages above are available on PyPI for all common OS.

Limitations and warning#

Some limitations still exist:

  • While most of the core data types should be implemented, there is arbitrary complexity especially with nested iterables. Most likely there are still some cases and types which are not supported and may fail with different levels of grace. Since this package will most likely be used for data storage please always consider checking if your type is saved and loaded correctly. If in doubt, always open the file with h5py.File() and check. Feel free to report missing or buggy data types and they will be implemented if possible.

  • numpy object arrays are not supported.

  • Keys of the dictionary which will be saved should only be strings to avoid any ambiguity. Any other types are not tested and most likely will fail.

  • Lazy slicing of arrays is not supported (yet).

  • Long tuples and mixed type lists will be saved element-wise and thus be slow. This is recognizable starting at approx. 100 elements.

  • Path objects are supported as single datasets or as items of a list or tuple, but only if the iterable is not nested.

  • Closing a LazyHdfDict will close the file reference - even if another LazyHdfDict accesses the same file (which should not happen too often).
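The advice in the first limitation, to check that your types are saved and loaded correctly, can be sketched as a small comparison helper. This is only an illustration and not part of itsh5py: the function name `find_mismatches` is hypothetical, and the `loaded` dict below is hand-written to stand in for data read back from a file. Array-like values are compared with `numpy.array_equal`, which is imported only when such a value is encountered.

```python
def find_mismatches(original, loaded, path=""):
    """Return human-readable descriptions of every difference found.

    Hypothetical helper for illustration, not an itsh5py function.
    """
    here = path or "/"
    # A changed type (e.g. tuple -> list) is reported before values are compared.
    if type(original) is not type(loaded):
        return [f"{here}: type {type(original).__name__} became {type(loaded).__name__}"]
    if isinstance(original, dict):
        problems = []
        for key in sorted(set(original) | set(loaded), key=str):
            if key not in original or key not in loaded:
                problems.append(f"{path}/{key}: present on one side only")
            else:
                problems += find_mismatches(original[key], loaded[key], f"{path}/{key}")
        return problems
    if isinstance(original, (list, tuple)):
        if len(original) != len(loaded):
            return [f"{here}: length {len(original)} became {len(loaded)}"]
        problems = []
        for index, (a, b) in enumerate(zip(original, loaded)):
            problems += find_mismatches(a, b, f"{path}[{index}]")
        return problems
    if hasattr(original, "shape"):
        import numpy as np  # only needed when array-like values are compared
        return [] if np.array_equal(original, loaded) else [f"{here}: array contents differ"]
    return [] if original == loaded else [f"{here}: {original!r} became {loaded!r}"]


saved = {"meta": ["type1", 2.0], "shape": (200, 200), "nested": {"flag": True}}
# Hand-written stand-in for a reloaded dict in which a tuple came back as a list.
loaded = {"meta": ["type1", 2.0], "shape": [200, 200], "nested": {"flag": True}}
print(find_mismatches(saved, loaded))  # ['/shape: type tuple became list']
```

In practice the second argument would be a fully loaded plain dict, for example the result of loading with lazy mode switched off, since a lazy container has a different type than the original dict.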

Tutorial#

Let’s start with some sample data: Some coordinate arrays and a derived data field:

import numpy as np
import itsh5py

# Taken from mayavi examples!
x, y = np.mgrid[-5.:5.:200j, -5.:5:200j]
z = np.sin(x + y) + np.sin(2 * x - y) + np.cos(3 * x + 4 * y)

This makes a nice contour plot. Saving it to hdf is of course possible with h5py, but it would mean creating a file and datasets and filling them manually. Using itsh5py, this is as easy as

itsh5py.save('demo', {'x': x, 'y': y, 'data': z})

Still, this is a default hdf file which can be opened and inspected in tools like hdfview or Panoply.

However, sometimes you just need to store some metadata with your file and attributes just won’t do it. Most of the types used in python are supported, thus

itsh5py.save('demo2', {'x': x, 'y': y, 'data': z, 'meta': ['type1', 2.]})

This file can be inspected in the same tools.

In the resulting file, the mixed list is split into its elements since this type is not supported by hdf. For other types, other conversions exist. They are visible when opening the files with h5py or similar tools, but when loading them with itsh5py they are converted back.
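To picture how a list can end up as separate entries, here is a minimal pure-Python sketch that maps a nested dict to hdf-style paths. It is not the itsh5py implementation: the function name `flatten_to_paths` is hypothetical, the `i_0`, `i_1` member names follow the layout this documentation shows for `demo2`, and the sketch splits every list or tuple, whereas the package only needs to do so for types hdf cannot store directly.

```python
def flatten_to_paths(data, prefix=""):
    """Map a nested dict to {hdf-style path: leaf value}.

    Hypothetical helper for illustration. Lists and tuples are treated
    as groups whose members are named i_0, i_1, ...
    """
    flat = {}
    for key, value in data.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            # A dict becomes a group: recurse with the extended path.
            flat.update(flatten_to_paths(value, path))
        elif isinstance(value, (list, tuple)):
            # An iterable becomes a group with one member per element.
            members = {f"i_{i}": item for i, item in enumerate(value)}
            flat.update(flatten_to_paths(members, path))
        else:
            flat[path] = value
    return flat


paths = flatten_to_paths({"data": 1.5, "meta": ["type1", 2.0]})
print(paths)  # {'/data': 1.5, '/meta/i_0': 'type1', '/meta/i_1': 2.0}
```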

Loading can be done in two ways: Lazy, which keeps everything possible with weak references (default) or just loading all data. If lazy is active, the result is a LazyHdfDict:

lazy_demo = itsh5py.load('demo')
lazy_demo_2 = itsh5py.load('demo2')
itsh5py.config.use_lazy = False
basic_demo = itsh5py.load('demo')
basic_demo_2 = itsh5py.load('demo2')

Inspecting the results shows the following:

basic_demo
{
    'data': array([[ 0.59925318,  0.47366702,  0.38353246, ..., -0.93771155,
        -0.65135851, -0.36662565],
       [ 0.52598534,  0.43316361,  0.37683558, ..., -0.80275349,
        -0.52186765, -0.24853893],
       [ 0.46326134,  0.40428732,  0.38196713, ..., -0.66053162,
        -0.39007068, -0.13290833],
       ...,
       [ 1.24416509,  1.14695653,  1.03256892, ..., -2.31421298,
        -2.40068459, -2.44342076],
       [ 1.09745781,  0.99214212,  0.87544695, ..., -2.36468556,
        -2.42493877, -2.44148253],
       [ 0.93395003,  0.82435405,  0.70941222, ..., -2.3818948 ,
        -2.41563926, -2.40663759]]),
    'x': array([[-5.        , -5.        , -5.        , ..., -5.        ,
        -5.        , -5.        ],
       [-4.94974874, -4.94974874, -4.94974874, ..., -4.94974874,
        -4.94974874, -4.94974874],
       [-4.89949749, -4.89949749, -4.89949749, ..., -4.89949749,
        -4.89949749, -4.89949749],
       ...,
       [ 4.89949749,  4.89949749,  4.89949749, ...,  4.89949749,
         4.89949749,  4.89949749],
       [ 4.94974874,  4.94974874,  4.94974874, ...,  4.94974874,
         4.94974874,  4.94974874],
       [ 5.        ,  5.        ,  5.        , ...,  5.        ,
         5.        ,  5.        ]]),
    'y': array([[-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       ...,
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ]])
}

basic_demo_2
{
    'data': array([[ 0.59925318,  0.47366702,  0.38353246, ..., -0.93771155,
        -0.65135851, -0.36662565],
       [ 0.52598534,  0.43316361,  0.37683558, ..., -0.80275349,
        -0.52186765, -0.24853893],
       [ 0.46326134,  0.40428732,  0.38196713, ..., -0.66053162,
        -0.39007068, -0.13290833],
       ...,
       [ 1.24416509,  1.14695653,  1.03256892, ..., -2.31421298,
        -2.40068459, -2.44342076],
       [ 1.09745781,  0.99214212,  0.87544695, ..., -2.36468556,
        -2.42493877, -2.44148253],
       [ 0.93395003,  0.82435405,  0.70941222, ..., -2.3818948 ,
        -2.41563926, -2.40663759]]),
    'meta': ['type1', 2.0],
    'x': array([[-5.        , -5.        , -5.        , ..., -5.        ,
        -5.        , -5.        ],
       [-4.94974874, -4.94974874, -4.94974874, ..., -4.94974874,
        -4.94974874, -4.94974874],
       [-4.89949749, -4.89949749, -4.89949749, ..., -4.89949749,
        -4.89949749, -4.89949749],
       ...,
       [ 4.89949749,  4.89949749,  4.89949749, ...,  4.89949749,
         4.89949749,  4.89949749],
       [ 4.94974874,  4.94974874,  4.94974874, ...,  4.94974874,
         4.94974874,  4.94974874],
       [ 5.        ,  5.        ,  5.        , ...,  5.        ,
         5.        ,  5.        ]]),
    'y': array([[-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       ...,
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ],
       [-5.        , -4.94974874, -4.89949749, ...,  4.89949749,
         4.94974874,  5.        ]])
}

This is, while not pretty, what was expected since it is the same as the input.

Taking a look at the LazyHdfDict, this is structured better:

demo.hdf
├─ /data::(200, 200)
├─ /x::(200, 200)
└─ /y::(200, 200)

demo2.hdf
├─ /data::(200, 200)
├─ Group /meta
│  ├─ /meta/i_0::b'type1'
│  └─ /meta/i_1::2.0
├─ /x::(200, 200)
└─ /y::(200, 200)

You can always unlazy a LazyHdfDict by either calling dict() or using the .unlazy() method. The latter is a wrapper that takes care of closing the then unused reference.
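The idea behind lazy loading can be illustrated with a small mapping that only calls a loader when a key is first requested. This is a conceptual sketch under stated assumptions, not the `LazyHdfDict` implementation: the class name `LazySketch` and the loader callables are hypothetical, and the file handling (closing the reference on `.unlazy()`) is not modelled.

```python
from collections.abc import Mapping


class LazySketch(Mapping):
    """Hypothetical lazy mapping: values are produced on first access."""

    def __init__(self, loaders):
        self._loaders = dict(loaders)  # key -> zero-argument callable
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            # Load once, then serve the cached value.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __iter__(self):
        return iter(self._loaders)

    def __len__(self):
        return len(self._loaders)

    def unlazy(self):
        # Load everything into a plain dict.
        return {key: self[key] for key in self}


calls = []  # records which loaders actually ran


def make_loader(name, value):
    def loader():
        calls.append(name)
        return value
    return loader


lazy = LazySketch({"x": make_loader("x", [1, 2, 3]), "y": make_loader("y", 4.0)})
print(calls)       # [] - nothing has been loaded yet
print(lazy["x"])   # [1, 2, 3]
print(calls)       # ['x']
print(dict(lazy))  # {'x': [1, 2, 3], 'y': 4.0}
```

Because the sketch derives from `Mapping`, calling `dict()` on it loads all remaining values, which mirrors the two ways of unlazying described above.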

Attributes#

Attributes can be used to add (scalar) quantities to hdf types (Files, Groups, Datasets). They can be loaded using the unpack_attrs option to itsh5py.load() which will place them in a dict called attrs. This is off by default. Otherwise, they can be accessed via the h5py backend, see below.

To quickly store some attributes with your data, you can use the same attrs key:

file = itsh5py.save('demo_att',
                    {'x': x, 'y': y, 'data': z,
                     'attrs': {'additional_str': 'meta_string',
                               'addition_float': 100.,
                               },
                     })
reloaded = itsh5py.load(file)
reloaded:

demo_att.hdf
├─ /data::(200, 200)
├─ /x::(200, 200)
└─ /y::(200, 200)

While the attributes were added to the file, they are not loaded by default. Access them via either of the two methods:

[f'{k}: {v} (Type {type(v)})' for k, v in reloaded.h5file.attrs.items()]
["addition_float: 100.0 (Type <class 'numpy.float64'>)",
 "additional_str: meta_string (Type <class 'str'>)"]

reloaded = itsh5py.load(file, unpack_attrs=True)
reloaded['attrs']
{'addition_float': 100.0, 'additional_str': 'meta_string'}

h5py Backend#

After loading lazy (by default), the underlying hdf can be accessed via the LazyHdfDict.h5file property. This allows the creation, extraction, slicing and so on with all basic h5py methods on the file.

Queue System#

Open files (at least lazy ones) are stored in a queue. The handling functions are mostly hidden and do not need to be accessed. However, there are two things to note here: The number of files open at once can be controlled via the itsh5py.max_open_files attribute. Currently open files can be shown using

itsh5py.open_filenames()
['demo2.hdf', 'demo.hdf']

There might be situations where large amounts of open files can be present, e.g. in list comprehensions. This can be handled in two ways:

  1. Setting itsh5py.max_open_files to a large number. Be aware that this, combined with unlazy files, can use a lot of RAM and slow down the process considerably.

  2. Using itsh5py.config.allow_fallback_open = True (defaults to False). Since closing a LazyHdfDict does not remove the python instance, this allows reopening a file on the fly to access unwrapped data from a previously open file. The file is only opened to get the data and is closed again afterwards, which prevents memory issues but also slows down the process.
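The effect of a limit on open files can be sketched with a bounded queue that closes the oldest handle once the limit is exceeded. This is only an illustration of the general technique and makes no claim about how itsh5py's queue handler works internally: the class `OpenFileQueue`, its `register` method and the `DummyHandle` stand-in are hypothetical, and listing the newest file first is an arbitrary choice for this sketch.

```python
from collections import OrderedDict


class OpenFileQueue:
    """Hypothetical queue: keep at most `max_open` handles, close the oldest."""

    def __init__(self, max_open=3):
        self.max_open = max_open
        self._handles = OrderedDict()  # filename -> handle, oldest first

    def register(self, filename, handle):
        # Re-registering a filename moves it to the newest slot.
        self._handles.pop(filename, None)
        self._handles[filename] = handle
        while len(self._handles) > self.max_open:
            _, oldest = self._handles.popitem(last=False)
            oldest.close()

    def open_filenames(self):
        # Newest first (an arbitrary choice for this sketch).
        return list(reversed(self._handles))


class DummyHandle:
    """Stand-in for a file handle that only tracks whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


queue = OpenFileQueue(max_open=2)
handles = {name: DummyHandle() for name in ("a.hdf", "b.hdf", "c.hdf")}
for name, handle in handles.items():
    queue.register(name, handle)

print(queue.open_filenames())   # ['c.hdf', 'b.hdf']
print(handles["a.hdf"].closed)  # True
```

With a small limit, older handles are closed early and memory stays bounded. With a large limit, more files stay open at once, which is the trade-off described in the first option above.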

Usage#

Overview documentation of the public API functions.

save

Adds keys of given dict as groups and values as datasets to the given hdf-file (by string or object) or group object.

load

Returns a dictionary containing the groups as keys and the datasets as values from given hdf file.

LazyHdfDict

Helps loading data only if values from the dict are requested.

queue_handler

Base module to handle the queue of open (in memory) files.

config

Package-wide config options

Releases#

0.7.2#

This is just a maintenance release after some time with just a small QoL feature when working with large files.

  • Published on 2023-01-28

  • Added support for the new config max_tree_children to reduce too large tree views with a default of 30 elements per group.

  • Fixed some minor issues in the documentation

  • Set minimum versions for dependencies

  • Minimum Python version is set at 3.7 still - this might change when h5py updates their requirement.

  • Increased some backend versions to reflect some changes in the past

0.7.1#

  • Published on 2022-02-02

  • Changed affiliations and contact for main author

  • Some minor code style changes

0.7.0#

  • Published on 2021-08-09

  • Initial Release on PyPI