
Comments (10)

gheber avatar gheber commented on August 10, 2024

We might want to split this into pieces... My first comment is that there is only one "storage scheme". If there is documentation suggesting otherwise, it's erroneous and we should correct it. All dataset elements are stored in the file in C order. Multi-dimensional array data is stored in C order; in other words, the β€œlast” dimension changes fastest. This is true for contiguous layout, but also for the arrangement of dataset elements in a chunk.
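As a concrete illustration (a plain-Python sketch, independent of the HDF5 API), the linear offset of a dataset element in C order can be computed like this; the "last dimension changes fastest" rule falls out of the formula:

```python
def c_order_offset(index, shape):
    """Linear offset of a multi-dimensional index in C (row-major)
    order: the last dimension varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

# In a 2x3 dataset, element (1, 0) comes right after the first row:
assert c_order_offset((1, 0), (2, 3)) == 3

# Sorting all indices by offset reproduces the storage sequence
# (0,0),(0,1),(0,2),(1,0),(1,1),(1,2) -- last index fastest:
order = sorted(((i, j) for i in range(2) for j in range(3)),
               key=lambda ix: c_order_offset(ix, (2, 3)))
assert order == [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```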

from hdf.pinvoke.

gheber avatar gheber commented on August 10, 2024

In other words, there is no way to tell if a dataset was written by a C or Fortran program. It always ends up in the file in C-order.


gheber avatar gheber commented on August 10, 2024

How can one reduce the performance penalty for applications that expect FORTRAN order? There's no general answer or magic bullet. Appropriate chunking might be one strategy:

  1. If a file/dataset is only or mostly accessed by FORTRAN-order applications, consider chunking the dataset layout with "column-oriented" chunks.
  2. If you need it both ways, try to find a chunk shape that favors neither C nor FORTRAN but doesn't overly penalize either.
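A rough way to see this trade-off in plain Python (the shapes are made up, and "distinct chunks visited in sequence" is only a crude proxy for chunk-cache pressure):

```python
def chunks_touched(chunk, traversal):
    """Count chunk-to-chunk transitions during a linear traversal of
    element indices -- a crude proxy for chunk-cache pressure."""
    visited = []
    for idx in traversal:
        c = tuple(i // s for i, s in zip(idx, chunk))
        if not visited or visited[-1] != c:
            visited.append(c)
    return len(visited)

row_major = [(i, j) for i in range(8) for j in range(8)]  # C order
col_major = [(i, j) for j in range(8) for i in range(8)]  # FORTRAN order

# Row-shaped chunks favor C-order traversal, penalize FORTRAN order:
assert chunks_touched((1, 8), row_major) < chunks_touched((1, 8), col_major)
# A square chunk treats both traversal orders equally:
assert chunks_touched((4, 4), row_major) == chunks_touched((4, 4), col_major)
```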


hokb avatar hokb commented on August 10, 2024

In other words, there is no way to tell if a dataset was written by a C or Fortran program. It always ends up in the file in C-order

This is what makes it hard for, say, a FORTRAN developer to use HDF5 for distributing data and trust that it comes out in the same shape, no? The recipient would need to know that it was written in FORTRAN and handle the dimension flipping accordingly. (No rant - I'm just trying to figure it out.)


gheber avatar gheber commented on August 10, 2024

My apologies for creating a lot of confusion around this issue. I will try to
explain it once more, and maybe resolve the issue.

For simplicity, let's assume that we are writing datasets with a contiguous
storage layout. The dataset elements are always written to the file in the
order in which they are stored in memory.
(Order is defined as increasing
virtual address order.) That means, for example, that a C application writes the
elements of a two-dimensional dataset into the file in row-major order, i.e.,
the elements of the first row, followed by the elements of the second row, etc.
It also means that a FORTRAN application would write a two-dimensional array
in column-major order, i.e., the elements of the first column, followed by the
elements of the second column, etc.

In which order does a FORTRAN application see the data written by a
C-application, and, vice versa, how does a C application see the data written
by a FORTRAN application? The correct answer is: "As it was written by the
other application." What does that mean?

Let me illustrate that with an example. Suppose a C-app writes a 2x3 matrix
into a 2D dataset in an HDF5 file. On disk, the elements appear in this order:
00 01 02 10 11 12 (0-based matrix indices). When a FORTRAN application
reads this dataset, it will read them in this very order. What is the shape
returned by SUBROUTINE h5dget_space_f(dataset_id, dataspace_id, hdferr) ?
The answer is: 3x2.
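The example can be checked in a few lines of plain Python (no HDF5 involved; the strings stand for the matrix elements):

```python
# A C app writes a 2x3 matrix A row by row; the file then holds
# this flat sequence of elements:
A = [['00', '01', '02'],
     ['10', '11', '12']]
file_stream = [elem for row in A for elem in row]
assert file_stream == ['00', '01', '02', '10', '11', '12']

# A FORTRAN reader gets the same sequence, but a reported shape of
# 3x2, and interprets it column-major: B(i,j) = stream[j*3 + i].
rows, cols = 3, 2
B = [[file_stream[j * rows + i] for j in range(cols)] for i in range(rows)]

# B is exactly the transpose of A: B(i,j) == A(j,i).
assert B == [['00', '10'], ['01', '11'], ['02', '12']]
```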

You might say, "A 2x3 matrix is not a 3x2 matrix.", and I would agree with you.
The hidden and problematic assumption is that HDF5 knows what a matrix is.
HDF5 doesn't. All it knows are sequences of dataset elements, and all it
guarantees is that it maintains that order, and it doesn't alter your data.
(Otherwise it's a bug, or you've applied a filter or dataset transformation.)

Does that make (more) sense?


hokb avatar hokb commented on August 10, 2024

The key is this statement:

The hidden and problematic assumption is that HDF5 knows what a matrix is. HDF5 doesn't.

Indeed this view is new to me and makes the current situation more clear.
My question is: "Does HDF5 want to know?"
Reasoning: it appears that HDF5 datasets are often used to store matrices and arrays of higher dimensionality. Also, there are several interfaces to HDF5:

  • C, h5py, HDF5.PInvoke, Java, ... [row-major]
  • FORTRAN, Matlab, ILNumerics, Julia, Octave, ... [column-major]

Currently, when exchanging a matrix between these interfaces/languages, we must accept that the resulting matrix may appear scrambled unless we flip the dimensions we read from the dataset. AFAIK all column-major interfaces perform this flip automatically today. Even after flipping the dimensions, we still end up with a transposed matrix and must either correct it or live with the transposed version. Whether we need to do any of this depends on some magic metadata that we must acquire from the creator of the dataset: she must at least give us a hint about the API used to store it. If the storage scheme of the writing language differs from our own, we must permute the dimensions.

Usually, this metadata would today be stored in an attribute of the dataset or of the HDF5 file. However, the name of such an attribute is arbitrary, so we still need to exchange documentation about it. This fact renders the attribute solution ... well, not completely useless, but a lot less useful ...
Obviously, the choice of interface used to access HDF5 currently matters a lot. My request or suggestion is about a way to change this and make HDF5 more compatible across individual languages.

One way could be to introduce an 'official' attribute on datasets in which creators/API authors could store a dataset's dimension order (striding). Another option, less flexible, less compatible, and more error-prone, would be an 'official' attribute storing the type of the writing API. This way, API authors could react accordingly, perform any necessary permutations, and relieve the end user of having to deal with this issue.
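A sketch of how such a convention could look on the reader's side (everything here, including the attribute name, is hypothetical; this is not an existing HDF5 feature):

```python
# Hypothetical, non-official attribute recording the writer's memory order.
STORAGE_ORDER_ATTR = "storage_order"

def needs_transpose(attrs, reader_order):
    """Return True when the reader must transpose the data to recover
    the writer's logical matrix; attrs is the dataset's attribute dict,
    orders are "row-major" or "column-major"."""
    writer_order = attrs.get(STORAGE_ORDER_ATTR)
    return writer_order is not None and writer_order != reader_order

# A column-major (FORTRAN-style) writer tagged the dataset:
attrs = {STORAGE_ORDER_ATTR: "column-major"}
assert needs_transpose(attrs, "row-major")         # C-style reader transposes
assert not needs_transpose(attrs, "column-major")  # FORTRAN-style reader doesn't
assert not needs_transpose({}, "row-major")        # untagged: no basis to decide
```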

Maybe there is an official way established already and I have only missed it?


gheber avatar gheber commented on August 10, 2024

There's no official way as far as I know, but I'll check with the folks here.


gheber avatar gheber commented on August 10, 2024

There's very little appetite on this end for anything "official." A first step might be to create and circulate an RFC, and see how much or how little enthusiasm it creates.


BenTeoNeb avatar BenTeoNeb commented on August 10, 2024

Is there any news on this issue?

Is there a way to find out whether data has been stored by a C or a Fortran program, and then choose whether or not to transpose the dataset after reading it?

As stated in the documentation:

When a C application reads data stored from a Fortran program, the data will appear to be transposed due to the difference in the C and Fortran storage orders. For example, if Fortran writes a 4x6 two-dimensional dataset to the file, a C program will read it as a 6x4 two-dimensional dataset into memory. The HDF5 C utilities h5dump and h5ls will also display transposed data, if data is written from a Fortran program.


JanWosnitza avatar JanWosnitza commented on August 10, 2024

There is a good explanation when you scroll down to section 7.3.2.5, "C versus Fortran Dataspaces":

When a Fortran application describes a dataspace to store an array as A(20,100), it specifies the value of the first dimension to be 20 and the second to be 100. Since Fortran stores data by columns, the first-listed dimension with the value 20 is the fastest-changing dimension and the last-listed dimension with the value 100 is the slowest-changing. In order to adhere to the HDF5 storage convention, the HDF5 Fortran wrapper transposes dimensions, so the first dimension becomes the last. The dataspace dimensions stored in the file will be 100,20 instead of 20,100 in order to correctly describe the Fortran data that is stored in 100 columns, each containing 20 elements.

When a Fortran application reads the data back, the HDF5 Fortran wrapper transposes the dimensions once more, returning the first dimension to be 20 and the second to be 100, describing correctly the sizes of the array that should be used to read data in the Fortran array A(20,100).

To me that means you don't have to be concerned about the order in which the data is saved in the file. Just treat the data as transposed when you switch to a language with a different memory layout.

Nevertheless I still see an issue when you try to label the axes of a dataset by adding an attribute to the dataset. The dimensions get reversed but the labels don't:

  • seen by Fortran: Dataset: [10, 20], Labels-Attribute: ["X-axis", "Y-axis"]
  • seen by C: Dataset: [20, 10], Labels-Attribute: ["X-axis", "Y-axis"]
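A minimal sketch of the fix on the reader's side (plain Python; the helper and its arguments are illustrative, not part of any HDF5 API):

```python
def align_labels(labels, writer_column_major, reader_column_major):
    """Hypothetical helper: when writer and reader differ in memory
    order, the dataset's dimensions appear reversed to the reader, so
    the per-axis labels must be reversed too to stay aligned."""
    if writer_column_major != reader_column_major:
        return list(reversed(labels))
    return list(labels)

# Fortran wrote labels ["X-axis", "Y-axis"] for a (10, 20) dataset.
# A C reader sees shape (20, 10) and must flip the labels to match:
assert align_labels(["X-axis", "Y-axis"], True, False) == ["Y-axis", "X-axis"]
# A Fortran reader sees (10, 20) and keeps the labels as stored:
assert align_labels(["X-axis", "Y-axis"], True, True) == ["X-axis", "Y-axis"]
```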

