Hierarchical Data Format 5

filename extension:
 .h5, .hdf5

The file structure of HDF5 includes two major types of objects:

  • Datasets
    Multidimensional arrays of a homogeneous type.
  • Groups
    Container structures which can hold datasets or other groups.

In that way we end up with a data format that somewhat resembles a filesystem.

Python support for HDF5 is provided by the h5py package, which can be installed via

pip install h5py

and be used after importing it via

import h5py

File objects

To open an HDF5 file the recommended way is to use

with h5py.File('my_file.h5', 'r') as f:
    # interact with the file object

where the first argument is the path to the HDF5 file, and the second one is the mode in which to open the file. Available modes are

r        read-only, file must exist
r+       read/write, file must exist
w        create file, truncate if exists
w- or x  create file, fail if exists
a        read/write if exists, create otherwise

with the default mode being r since h5py 3.0 (older versions defaulted to a).
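The following is a minimal sketch of how the modes listed above behave in practice. The file name demo.h5 and the temporary directory are assumptions made for illustration only; any writable path should work.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')

# 'w' creates the file (or truncates an existing one).
with h5py.File(path, 'w') as f:
    f.create_group('first')

# 'a' opens the existing file for reading and writing and keeps its content.
with h5py.File(path, 'a') as f:
    f.create_group('second')
    names_after_append = sorted(f.keys())

# 'w-' (or 'x') refuses to touch a file that already exists.
try:
    with h5py.File(path, 'w-'):
        pass
    exclusive_create_failed = False
except OSError:
    exclusive_create_failed = True

# 'r' is read-only; the groups written above are still there.
with h5py.File(path, 'r') as f:
    names_read_only = sorted(f.keys())

print(names_after_append, exclusive_create_failed, names_read_only)
```

Passing the mode explicitly, as done here, avoids depending on the default of the installed h5py version.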

Danger

Similarly to open() there is another way of opening an HDF5 file:

f = h5py.File('my_file.h5', 'r')
# interact with the file object
f.close()

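This has the disadvantage that you have to take care of closing the file yourself. If the file is not closed properly, for example because an exception is raised before f.close() is reached, its contents may end up corrupted, and the binary nature of HDF5 files essentially means that one corrupted bit may lead to the loss of all the data in the file. We therefore strongly recommend the previously mentioned way with the context manager.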

Every file object is also an HDF5 group object representing the root group of the file.
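As a small illustration of this statement, the sketch below checks that a file object passes an isinstance test against h5py.Group and reports the root path as its name. The scratch file name root_demo.h5 is a made-up placeholder.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'root_demo.h5')

with h5py.File(path, 'w') as f:
    # h5py.File derives from h5py.Group, so group methods work on it.
    file_is_group = isinstance(f, h5py.Group)
    # The file object stands for the root group, whose name is '/'.
    root_name = f.name
    # Indexing with '/' also yields the root group.
    indexed_root_name = f['/'].name

print(file_is_group, root_name, indexed_root_name)
```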

Groups

The hierarchical organization of the HDF5 file format is achieved by groups. Similarly to directories on your regular filesystem, they help you organize your data within an HDF5 file. As mentioned before, the file object returned by the File initialization is itself a group, namely the root group. A new group is created by calling the create_group() method of an HDF5 Group object:

with h5py.File('my_file.h5', 'w') as f:
    print('Name of group:', f.name)
    nested_group = f.create_group('nested group')
    print('Name of nested group:', nested_group.name)
    recursively_created_group = f.create_group('/this/is/deep')
    print(
        'Name of recursively created group:',
        recursively_created_group.name)

Which results in the following output:

Name of group: /
Name of nested group: /nested group
Name of recursively created group: /this/is/deep

Working with groups may also be compared to working with dict objects: they offer the indexing syntax to traverse groups, support iteration, and provide the keys(), values() and items() methods:

with h5py.File('my_file.h5', 'w') as f:
    f.create_group('/favorite/group/one')
    f.create_group('/favorite/group/two')
    f.create_group('/favorite/group/three')
    group = f['/favorite/group']
    for subgroup_name in sorted(group):
        print(subgroup_name)
    subgroup_one = group['one']
    print(subgroup_one.name)
    for subgroup_name, subgroup in sorted(group.items()):
        print(subgroup_name, subgroup.name)

Which results in the following output:

one
three
two
/favorite/group/one
one /favorite/group/one
three /favorite/group/three
two /favorite/group/two
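Since the layout resembles a filesystem, it can be handy to walk the whole tree at once. The sketch below uses the visititems() method of a group for that purpose; the scratch file name, the dataset called numbers and the helper function collect are assumptions introduced for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'walk_demo.h5')

with h5py.File(path, 'w') as f:
    f.create_group('/favorite/group/one')
    f.create_dataset('/favorite/numbers', data=np.arange(3))

    found = []

    def collect(name, obj):
        # name is the path relative to the group visititems() was called on.
        kind = 'dataset' if isinstance(obj, h5py.Dataset) else 'group'
        found.append((name, kind))
        # Returning None tells h5py to keep visiting.

    f.visititems(collect)

for name, kind in sorted(found):
    print(kind, name)
```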

Datasets

Creating a dataset is done via calling the create_dataset method on an HDF5 group. Although we can create empty datasets in HDF5 and fill them with data later on, we much more commonly just want to store data that has been worked on in a NumPy array. This can be done like this:

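import h5py
import numpy as np

# np.arange yields 675 elements, which can be reshaped to (5, 5, 3, 3, 3)
my_array = np.arange(5*5*3*3*3).reshape(5, 5, 3, 3, 3)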

with h5py.File('my_file.h5', 'w') as f:
    group = f.create_group('my_group')
    dset = group.create_dataset('my_dataset', data=my_array)

with h5py.File('my_file.h5', 'r') as f:
    retrieved_array = np.array(f['my_group/my_dataset'])

print(np.allclose(my_array, retrieved_array))

Which results in the output True. Handling the datatypes is done automatically both when dumping and when loading the data, for regular as well as structured arrays.
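The other route mentioned above, creating an empty dataset first and filling it later, might look like the following sketch. The file name, the dataset name filled_later and the chosen shape are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'fill_demo.h5')

with h5py.File(path, 'w') as f:
    # No data yet: only shape and dtype are fixed at creation time.
    dset = f.create_dataset('filled_later', shape=(4, 3), dtype='float64')
    # Fill parts of the dataset using NumPy-style slicing.
    dset[0, :] = [1.0, 2.0, 3.0]
    dset[1:3, :] = np.ones((2, 3))

with h5py.File(path, 'r') as f:
    dset = f['filled_later']
    shape = dset.shape
    # Slicing a dataset reads just the requested part into a NumPy array.
    first_two_rows = dset[:2]
    # Elements that were never written hold the fill value (0 by default).
    last_row = dset[3]

print(shape)
print(first_two_rows)
print(last_row)
```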

As the data we store tends to get quite large, we can leverage the compression options HDF5 offers. h5py keeps this simple: passing a number in the range of 1 to 9 as the compression keyword enables gzip compression, with the number indicating the CPU-time/compression trade-off, from “least compression” to “densest compression.”

my_array = np.arange(5*5*3*3*3).reshape(5, 5, 3, 3, 3)

with h5py.File('my_file.h5', 'w') as f:
    group = f.create_group('my_group')
    dset = group.create_dataset('my_dataset', data=my_array, compression=9)
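The integer form is a shorthand. A more explicit spelling, sketched below, names the filter via compression='gzip' and passes the level via compression_opts; the settings can then be inspected on the dataset object. The file and dataset names are placeholders.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'compression_demo.h5')

# Highly repetitive data, which compresses well.
data = np.zeros((100, 100))

with h5py.File(path, 'w') as f:
    dset = f.create_dataset(
        'compressed', data=data, compression='gzip', compression_opts=9)
    # The chosen filter and its level can be read back from the dataset.
    compression = dset.compression
    level = dset.compression_opts
    # Compressed datasets are stored in chunks; h5py picks a chunk shape
    # automatically when none is given.
    chunks = dset.chunks

print(compression, level, chunks)
```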

Attributes

Each group and dataset may have attributes. Attributes are metadata that can be assigned in a dictionary-like fashion using the attrs attribute of the group or dataset object.

import json

import h5py
import numpy as np

md_dtype = [
    ('atom_id', np.int32),
    ('type', np.bytes_, 2),  # np.string_ was removed in NumPy 2.0
    ('position', np.float64, 3),
    ('velocity', np.float64, 3)
]
md_data = np.array(
    [
        (0, 'He', (5.7222e-07, 4.8811e-09, 2.0415e-07), (-29.245, 100.45, 128.28)),
        (1, 'He', (9.7710e-07, 3.6371e-07, 4.7311e-07), (-199.26, 232.75, -534.38)),
        (2, 'Ar', (6.4989e-07, 6.7873e-07, 9.5000e-07), (-1.5592, -378.76, 84.091)),
        (3, 'Ar', (5.9024e-08, 3.7138e-07, 7.3455e-08), (342.82, 156.82, -38.991)),
        (4, 'He', (7.6746e-07, 8.3017e-08, 4.8520e-07), (-30.45, -379.75, -336.32)),
        (5, 'Ar', (1.7226e-07, 4.6023e-07, 4.7356e-08), (-311.51, -429.39, -694.74)),
        (6, 'Ar', (9.6394e-07, 7.2845e-07, 8.8623e-07), (-82.636, 45.098, -10.626)),
        (7, 'He', (5.4450e-07, 4.6373e-07, 6.2270e-07), (158.89, 258.58, -151.5)),
        (8, 'He', (7.9322e-07, 9.4700e-07, 3.5194e-08), (-197.03, 156.74, -185.2)),
        (9, 'Ar', (2.7797e-07, 1.6487e-07, 8.2403e-07), (-38.65, -696.32, 216.42)),
        (10, 'He', (1.1842e-07, 6.3244e-07, 5.0958e-07), (-149.63, 422.88, -76.309)),
        (11, 'Ar', (2.0359e-07, 8.3369e-07, 9.6348e-07), (484.57, -267.41, -352.54)),
        (12, 'He', (5.1019e-07, 2.2470e-07, 2.3846e-08), (-231.92, -99.51, 32.77)),
        (13, 'Ar', (3.5383e-07, 8.4581e-07, 7.2340e-07), (-303.95, 47.316, 222.53)),
        (14, 'He', (3.8515e-07, 2.8940e-07, 5.6028e-07), (233.08, 254.18, 429.83)),
        (15, 'He', (1.5842e-07, 9.8225e-07, 5.7859e-07), (199.63, 203.11, -425.6)),
        (16, 'He', (3.6831e-07, 7.6520e-07, 2.9884e-07), (66.341, 222.32, -97.653)),
        (17, 'He', (2.8696e-07, 1.5129e-07, 6.4060e-07), (90.358, -67.459, -64.782)),
        (18, 'He', (1.0325e-07, 9.9012e-07, 3.4381e-07), (71.108, 11.06, 15.912)),
        (19, 'Ar', (4.3929e-07, 7.5363e-07, 9.9974e-07), (239.19, 173.83, 335.29))
    ],
    dtype=md_dtype)
with h5py.File('md_results.h5', 'w') as f:
    # Generic information regarding the file/the simulation
    f.attrs['units'] = 'All quantities are in SI units.'
    f.attrs['atom-types'] = json.dumps(
        {
            'He': 'Helium',
            'Ar': 'Argon'
        })
    f.attrs['He potential'] = json.dumps(
        {
            'type': 'Lennard-Jones',
            'parameters': {
                'epsilon/k_B': 10.22,
                'sigma': 256e-12
            }
        })
    f.attrs['Ar potential'] = json.dumps(
        {
            'type': 'Lennard-Jones',
            'parameters': {
                'epsilon/k_B': 120,
                'sigma': 341e-12
            }
        })
    f.attrs['system size'] = np.array([100e-6, 200e-6, 300e-6])
    f.attrs['boundary conditions'] = json.dumps(
        ['periodic', 'periodic', 'periodic'])

    # Information specific to this dataset
    dset = f.create_dataset('0000', data=md_data, compression=9)
    dset.attrs['step'] = 0
    dset.attrs['time'] = 0
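Reading attributes back works through the same attrs interface. The sketch below writes a reduced version of the metadata above to a scratch file (the file name is a placeholder) and then reads it again, decoding the JSON-encoded entries with json.loads.

```python
import json
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'attrs_demo.h5')

with h5py.File(path, 'w') as f:
    f.attrs['atom-types'] = json.dumps({'He': 'Helium', 'Ar': 'Argon'})
    f.attrs['system size'] = np.array([100e-6, 200e-6, 300e-6])
    dset = f.create_dataset('0000', data=np.arange(3))
    dset.attrs['step'] = 0

with h5py.File(path, 'r') as f:
    # attrs supports the usual dict-style access and iteration.
    attr_names = sorted(f.attrs.keys())
    # JSON-encoded attributes have to be decoded explicitly.
    atom_types = json.loads(f.attrs['atom-types'])
    # Array-valued attributes come back as NumPy arrays.
    system_size = f.attrs['system size']
    step = int(f['0000'].attrs['step'])

print(attr_names)
print(atom_types['He'], system_size, step)
```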

HDFView

The HDF Group offers a tool for browsing and editing HDF5 files: HDFView. Often it can be installed via the package manager of your operating system, and subsequently be executed via