itsh5py#
While there are many ways to store different data types, many of them have their drawbacks. hdf is a common way to store large arrays. Sometimes it can be practical to store arrays together with additional (pythonic) data in a single file. While hdf attributes can support some types, many exceptions exist, especially with python types.
This is a small implementation of recursive dict support for python to write and read hdf-files with many different pythonic data types. Almost all types implemented in default python and numpy should be supported, even in nested structures. The resulting files work in hdfview and panoply with some small drawbacks.
A major convenience is the ability to store iterables like lists and tuples, even in nested form. Mixed types are also supported.
Conversion, obscuration or changes to the saved types are kept to a bare minimum. So if, for any reason, the files have to be used without itsh5py, all the data will be accessible with just a little added inconvenience.
Installation#
itsh5py is available on PyPI and can be readily installed via
pip install itsh5py
Run pip uninstall itsh5py
in order to remove the package from your system.
To work, this requires some additional packages. Obviously, h5py is used for data storage. numpy is used for array handling. DataFrames are also supported using pandas. Finally, for serialization of difficult data types, yaml is used via pyyaml.
All the source packages above are available on PyPI for all common OS.
Limitations and warnings#
Some limitations still exist:
While most of the core data types should be implemented, there is arbitrary complexity, especially with nested iterables. Most likely there are still some cases and types which are not supported and may fail with different levels of grace. Since this package will most likely be used for data storage, please always consider checking if your type is saved and loaded correctly. If in doubt, always open the file with h5py.File() and check (see the sketch after this list). Feel free to report missing or buggy data types and they will be implemented if possible.
numpy object arrays are not supported.
Keys of the dictionary which will be saved should only be strings to avoid any ambiguity. Any other types are not tested and most likely will fail.
Lazy slicing of arrays is not supported (yet).
Long tuples and mixed-type lists will be saved element-wise and will thus be slow. This becomes noticeable at approximately 100 elements.
Path objects are supported as single datasets or as list or tuple iterables, however only in non-nested form.
Closing a LazyHdfDict will close the file reference - even if another LazyHdfDict accesses the same file (which should not happen too often).
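A minimal sketch of such a check with plain h5py, using the demo file created in the tutorial below; the inspection calls are standard h5py, not part of itsh5py:
import h5py

# Open the file read-only and inspect what was actually written.
with h5py.File('demo.hdf', 'r') as f:
    f.visit(print)          # prints every group and dataset name
    print(dict(f.attrs))    # file-level attributes, if any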
Tutorial#
Let’s start with some sample data: some coordinate arrays and a derived data field:
import numpy as np
import itsh5py
# Taken from mayavi examples!
x, y = np.mgrid[-5.:5.:200j, -5.:5:200j]
z = np.sin(x + y) + np.sin(2 * x - y) + np.cos(3 * x + 4 * y)
Saving this to hdf is obviously possible with plain h5py, but it would mean creating a file and datasets and filling them manually. Using itsh5py, this is as easy as
itsh5py.save('demo', {'x': x, 'y': y, 'data': z})
Still, this is a standard hdf file which can be opened and inspected in tools like hdfview or Panoply.
However, sometimes you just need to store some metadata with your file and attributes just won’t do it. Most of the types used in python are supported, thus
itsh5py.save('demo2', {'x': x, 'y': y, 'data': z, 'meta': ['type1', 2.]})
This can be inspected too.
As you can see, a mixed list is split into its elements since this type is not supported by hdf. For other types, other conversions exist. They will be visible when opening the files using h5py or similar, but when loading them using itsh5py, they are converted back.
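To make this concrete, here is a minimal sketch with plain h5py; the element-wise names i_0 and i_1 match the file tree shown later in this tutorial:
import h5py

# Raw view with h5py: the mixed list was split into element-wise datasets,
# which itsh5py reassembles into ['type1', 2.0] on load.
with h5py.File('demo2.hdf', 'r') as f:
    print(list(f['meta']))    # ['i_0', 'i_1']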
Loading can be done in two ways: lazily, which keeps everything possible as weak references (the default), or by just loading all data. If lazy loading is active, the result is a LazyHdfDict:
lazy_demo = itsh5py.load('demo')
lazy_demo_2 = itsh5py.load('demo2')
itsh5py.config.use_lazy = False
basic_demo = itsh5py.load('demo')
basic_demo_2 = itsh5py.load('demo2')
Inspecting the results shows the following:
basic_demo
{
'data': array([[ 0.59925318, 0.47366702, 0.38353246, ..., -0.93771155,
-0.65135851, -0.36662565],
[ 0.52598534, 0.43316361, 0.37683558, ..., -0.80275349,
-0.52186765, -0.24853893],
[ 0.46326134, 0.40428732, 0.38196713, ..., -0.66053162,
-0.39007068, -0.13290833],
...,
[ 1.24416509, 1.14695653, 1.03256892, ..., -2.31421298,
-2.40068459, -2.44342076],
[ 1.09745781, 0.99214212, 0.87544695, ..., -2.36468556,
-2.42493877, -2.44148253],
[ 0.93395003, 0.82435405, 0.70941222, ..., -2.3818948 ,
-2.41563926, -2.40663759]]),
'x': array([[-5. , -5. , -5. , ..., -5. ,
-5. , -5. ],
[-4.94974874, -4.94974874, -4.94974874, ..., -4.94974874,
-4.94974874, -4.94974874],
[-4.89949749, -4.89949749, -4.89949749, ..., -4.89949749,
-4.89949749, -4.89949749],
...,
[ 4.89949749, 4.89949749, 4.89949749, ..., 4.89949749,
4.89949749, 4.89949749],
[ 4.94974874, 4.94974874, 4.94974874, ..., 4.94974874,
4.94974874, 4.94974874],
[ 5. , 5. , 5. , ..., 5. ,
5. , 5. ]]),
'y': array([[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
...,
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ]])
}
basic_demo_2
{
'data': array([[ 0.59925318, 0.47366702, 0.38353246, ..., -0.93771155,
-0.65135851, -0.36662565],
[ 0.52598534, 0.43316361, 0.37683558, ..., -0.80275349,
-0.52186765, -0.24853893],
[ 0.46326134, 0.40428732, 0.38196713, ..., -0.66053162,
-0.39007068, -0.13290833],
...,
[ 1.24416509, 1.14695653, 1.03256892, ..., -2.31421298,
-2.40068459, -2.44342076],
[ 1.09745781, 0.99214212, 0.87544695, ..., -2.36468556,
-2.42493877, -2.44148253],
[ 0.93395003, 0.82435405, 0.70941222, ..., -2.3818948 ,
-2.41563926, -2.40663759]]),
'meta': ['type1', 2.0],
'x': array([[-5. , -5. , -5. , ..., -5. ,
-5. , -5. ],
[-4.94974874, -4.94974874, -4.94974874, ..., -4.94974874,
-4.94974874, -4.94974874],
[-4.89949749, -4.89949749, -4.89949749, ..., -4.89949749,
-4.89949749, -4.89949749],
...,
[ 4.89949749, 4.89949749, 4.89949749, ..., 4.89949749,
4.89949749, 4.89949749],
[ 4.94974874, 4.94974874, 4.94974874, ..., 4.94974874,
4.94974874, 4.94974874],
[ 5. , 5. , 5. , ..., 5. ,
5. , 5. ]]),
'y': array([[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
...,
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ],
[-5. , -4.94974874, -4.89949749, ..., 4.89949749,
4.94974874, 5. ]])
}
This is, while not pretty, what was expected, since it’s the same as the input.
Taking a look at the LazyHdfDicts, their representation is structured better:
demo.hdf
├─ /data::(200, 200)
├─ /x::(200, 200)
└─ /y::(200, 200)
demo2.hdf
├─ /data::(200, 200)
├─ Group /meta
│ ├─ /meta/i_0::b'type1'
│ └─ /meta/i_1::2.0
├─ /x::(200, 200)
└─ /y::(200, 200)
You can always unlazy a LazyHdfDict by either calling dict() on it or using the .unlazy() method. The latter is a wrapper that also takes care of closing the then unused file reference.
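As a minimal sketch, reusing the lazy dicts from above; the semantics follow the description just given:
# dict() converts the lazy dict; the file reference stays open.
full_demo = dict(lazy_demo)

# unlazy() does the same conversion but also closes the file reference.
lazy_demo_2.unlazy()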
Attributes#
Attributes can be used to add (scalar) quantities to hdf types (Files, Groups, Datasets). They can be loaded using the unpack_attrs option to itsh5py.load(), which will place them in a dict called attrs. This is off by default. Otherwise, they can be accessed via the h5py backend, see below.
To quickly store some attributes with your data, you can use the same attrs key:
file = itsh5py.save('demo_att',
                    {'x': x, 'y': y, 'data': z,
                     'attrs': {'additional_str': 'meta_string',
                               'addition_float': 100.,
                               },
                     })
reloaded = itsh5py.load(file)
reloaded:
demo_att.hdf
├─ /data::(200, 200)
├─ /x::(200, 200)
└─ /y::(200, 200)
While the attributes were added to the file, they are not loaded by default. Access them via either of the two methods:
[f'{k}: {v} (Type {type(v)})' for k, v in reloaded.h5file.attrs.items()]
["addition_float: 100.0 (Type <class 'numpy.float64'>)",
"additional_str: meta_string (Type <class 'str'>)"]
reloaded = itsh5py.load(file, unpack_attrs=True)
reloaded['attrs']
{'addition_float': 100.0, 'additional_str': 'meta_string'}
h5py Backend#
After loading lazily (the default), the underlying hdf file can be accessed via the LazyHdfDict.h5file property. This allows creation, extraction, slicing and so on with all basic h5py methods on the file.
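A minimal sketch, assuming lazy loading is enabled and the demo file from the tutorial exists; the slicing calls are plain h5py:
itsh5py.config.use_lazy = True         # make sure lazy loading is on
lazy_demo = itsh5py.load('demo')
h5 = lazy_demo.h5file                  # the underlying h5py.File
print(h5['data'].shape, h5['data'].dtype)
first_row = h5['data'][0, :]           # slice without loading the full array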
Queue System#
Open files (at least lazy ones) are stored in a queue. The handling functions are mostly hidden and do not need to be accessed. However, there are two things to note here: the number of files open at once can be controlled via the itsh5py.max_open_files attribute, and the currently open files can be shown using
itsh5py.open_filenames()
['demo2.hdf', 'demo.hdf']
There might be situations where large numbers of open files can be present, e.g. in list comprehensions. This can be handled in two ways (see the sketch after this list):
Setting itsh5py.max_open_files to a large number. Be aware that this, combined with unlazy files, can be hard on RAM and slow down the process considerably.
Using itsh5py.config.allow_fallback_open = True (defaults to False). Since closing a LazyHdfDict does not remove the python instance, this allows reopening a file on the fly to access unwrapped data from a previously open file. This will only open the file to get the data and subsequently close it again, preventing memory issues but also slowing down the process.
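A short sketch of both options; the file names in the comprehension are hypothetical:
import itsh5py

# Option 1: allow more files to stay open in the queue at once.
itsh5py.max_open_files = 256

# Option 2: let itsh5py reopen already closed files on access.
itsh5py.config.allow_fallback_open = True

# e.g. loading many (hypothetical) files lazily in a comprehension
results = [itsh5py.load(f'run_{i}') for i in range(500)]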
Usage#
Overview documentation of the public API functions.
itsh5py.save: Adds keys of a given dict as groups and values as datasets to the given hdf file (by string or object) or group object.
itsh5py.load: Returns a dictionary containing the groups as keys and the datasets as values from a given hdf file.
LazyHdfDict: Helps loading data only if values from the dict are requested.
The queue module: Base module to handle the queue of open (in memory) files.
itsh5py.config: Package-wide config options.
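As a compact round trip of the two main functions (a sketch based on the tutorial above; the key names are arbitrary examples):
import numpy as np
import itsh5py

data = {'x': np.arange(10), 'meta': ['run_1', 3.5]}
path = itsh5py.save('overview_demo', data)   # returns the created file
loaded = itsh5py.load(path)                  # LazyHdfDict by default
print(loaded['meta'])                        # ['run_1', 3.5]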
Releases#
0.7.2#
This is a maintenance release after some time, with just a small QoL feature for working with large files.
Published on 2023-01-28
Added support for the new config max_tree_children to reduce too large tree views, with a default of 30 elements per group.
Fixed some minor issues in the documentation.
Set minimum versions for dependencies.
Minimum Python version is still set at 3.7; this might change when h5py updates their requirement.
Increased some backend versions to reflect some changes in the past.
0.7.1#
Published on 2022-02-02
Changed affiliations and contact for main author
Some minor code style changes
0.7.0#
Published on 2021-08-09
Initial Release on PyPI