hpc-io / vol-async
Asynchronous I/O for HDF5
Home Page: https://hdf5-vol-async.readthedocs.io
License: Other
Hi,
I am running on an x86-64 Linux OpenMPI cluster, and I have built following the instructions in the README, but the tests do not complete successfully:
$ make check_serial
python3 ./pytest.py
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -11 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful
The backtrace is:
$ cat async_vol_test.err
[gadi-login-07:3639707:0:3639707] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x110)
==== backtrace (tid:3639707) ====
0 0x0000000000012c20 .annobin_sigaction.c() sigaction.c:0
1 0x0000000000007f5c get_n_running_task_in_queue_obj() /home/120/bw0729/vol-async/src/h5_async_vol.c:2138
2 0x0000000000008f0c H5VL_async_request_wait() /home/120/bw0729/vol-async/src/h5_async_vol.c:24279
3 0x000000000045238a H5VL__request_wait() /home/120/bw0729/hdf5/src/H5VLcallback.c:6435
4 0x00000000004653f6 H5VL_request_wait() /home/120/bw0729/hdf5/src/H5VLcallback.c:6469
5 0x0000000000177597 H5ES__wait_cb() /home/120/bw0729/hdf5/src/H5ESint.c:669
6 0x0000000000178ce2 H5ES__list_iterate() /home/120/bw0729/hdf5/src/H5ESlist.c:171
7 0x00000000001786a4 H5ES__wait() /home/120/bw0729/hdf5/src/H5ESint.c:754
8 0x0000000000174130 H5ESwait() /home/120/bw0729/hdf5/src/H5ES.c:342
9 0x000000000040129a main() /home/120/bw0729/vol-async/test/async_test_multifile.c:61
10 0x0000000000023493 __libc_start_main() ???:0
11 0x000000000040106e _start() ???:0
=================================
Hello vol-async team,
I'm trying to get this HDF5 Asynchronous I/O VOL Connector installed on my system and I can get it to a point where it is passing the serial tests (in vol-async/test/pytest.py) but never the parallel ones; I think there may be some inconsistencies with the directory structures / paths as written so hopefully we can clear this up together. Let me walk you through how I got here:
export H5_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/
export VOL_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/
export ABT_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/
./configure --prefix=$H5_DIR/install --enable-parallel --enable-threadsafe --enable-unsupported CC=mpicc
using my system's HPE MPT installation for MPI.
make install
with no issues. I then switched to $ABT_DIR and ran
./autogen.sh && CC=cc ./configure --prefix=$ABT_DIR/build && make install
with no issues.
I then edited the Makefile so that:
HDF5_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/install
ABT_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/build
Notice these are not as written in the repo's README: I had to add /install to the end of HDF5_DIR for it to find the correct header files. If I did not do this, it would complain that hdf5dev.h could not be found (as it should, since that header file is not in $H5_DIR as Makefile.summit would have you believe).
6. After editing that Makefile, I run make and it completes smoothly. Next, I run
export LD_LIBRARY_PATH=$VOL_DIR/src:$H5_DIR/lib:$LD_LIBRARY_PATH
export HDF5_PLUGIN_PATH="$VOL_DIR"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"
although here again I find that $H5_DIR/lib doesn't exist; perhaps it should be $H5_DIR/install/lib.
7. I copy Makefile.summit to Makefile and again edit it so that:
ASYNC_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/src
HDF5_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/install
ABT_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/build
I run make with no issues.
When I run make check (my Python is version 3.7.0), I get the following:
./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -6 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful
Running async_test_multifile.exe alone gives me:
async_test_multifile.exe: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed.
Aborted (core dumped)
In my other attempts changing various things I was able to get it to pass all the way to here:
./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
Test # 3 : async_test_multifile.exe PASSED
Test # 4 : async_test_serial_event_set.exe PASSED
ERROR: Test async_test_serial_event_set_error_stack.exe : returned non-zero exit status= 255 aborting test
run_cmd= ./async_test_serial_event_set_error_stack.exe
pytest was unsuccessful
Running that test individually gives:
H5Fcreate start
H5Fcreate done
H5Gcreate start
H5Gcreate done
H5Gcreate 2 start (should fail when executed)
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
#000: H5G.c line 268 in H5Gcreate_async(): unable to asynchronously create group
major: Symbol table
minor: Unable to create file
#001: H5G.c line 185 in H5G__create_api_common(): unable to create group
major: Symbol table
minor: Unable to initialize object
#002: H5VLcallback.c line 4920 in H5VL_group_create(): group create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLcallback.c line 4887 in H5VL__group_create(): group create failed
major: Virtual Object Layer
minor: Unable to create file
#004: H5VLnative_group.c line 103 in H5VL__native_group_create(): unable to create group
major: Symbol table
minor: Unable to initialize object
#005: H5Gint.c line 328 in H5G__create_named(): unable to create and link to group
major: Symbol table
minor: Unable to initialize object
#006: H5L.c line 2383 in H5L_link_object(): unable to create new link to object
major: Links
minor: Unable to initialize object
#007: H5L.c line 2625 in H5L__create_real(): can't insert link
major: Links
minor: Unable to insert object
#008: H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#009: H5Gtraverse.c line 614 in H5G__traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#010: H5L.c line 2418 in H5L__link_cb(): name already exists
major: Links
minor: Object already exists
Error with group create
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
#000: H5S.c line 496 in H5Sclose(): not a dataspace
major: Invalid arguments to routine
minor: Inappropriate type
Closing dataset's dataspace failed
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
#000: H5D.c line 472 in H5Dclose(): not a dataset ID
major: Invalid arguments to routine
minor: Inappropriate type
Closing dataset failed
I am wondering if there is anything here that is obviously inconsistent with how I should be installing things. Let me know, thanks!
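The exports in step 6 of this report are easy to get subtly wrong (the report itself notes that $H5_DIR/lib does not exist). As a minimal sketch, using only the C standard library, a test program could confirm that the three variables are set before it makes any HDF5 call. The helper count_missing_env and the array ASYNC_VOL_ENV are hypothetical names for illustration and are not part of vol-async. The check only verifies that each variable is non-empty, not that the paths in it exist.

```c
#include <stdio.h>
#include <stdlib.h>

/* The three variables exported in the report above. */
static const char *const ASYNC_VOL_ENV[] = {
    "LD_LIBRARY_PATH", "HDF5_PLUGIN_PATH", "HDF5_VOL_CONNECTOR"
};

/* Hypothetical helper: returns how many of the named variables are unset
 * or empty, printing one line to stderr for each of them. */
static int count_missing_env(const char *const names[], int n)
{
    int missing = 0;
    for (int i = 0; i < n; i++) {
        const char *value = getenv(names[i]);
        if (value == NULL || value[0] == '\0') {
            fprintf(stderr, "%s is not set\n", names[i]);
            missing++;
        }
    }
    return missing;
}
```

Calling count_missing_env(ASYNC_VOL_ENV, 3) at the top of main and exiting when the result is non-zero would turn a confusing connector-load failure into a direct message about the environment.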
One of the many checks makes this comparison:
if ((attempt_count = check_app_acquire_mutex(task, &mutex_count, &acquired)) < 0)
    goto done;
This check is a no-op because attempt_count is an unsigned int, so the result of the assignment can never be less than zero. The if condition can be removed.
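The effect described above can be reproduced in isolation. The sketch below is not the connector's code: acquire_attempts is a hypothetical stand-in that returns -1 on failure, and it is an assumption that the real function can return a negative value at all. It shows that a negative result stored in an unsigned int becomes UINT_MAX, so a later "< 0" test cannot fire, while a signed variable keeps the failure visible.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a call that returns -1 on failure and a
 * non-negative attempt count on success. */
static int acquire_attempts(bool fail)
{
    return fail ? -1 : 3;
}

/* Storing the result in an unsigned int converts -1 to UINT_MAX, so a
 * later "attempt_count < 0" comparison can never be true. */
static unsigned int stored_as_unsigned(bool fail)
{
    unsigned int attempt_count = (unsigned int)acquire_attempts(fail);
    return attempt_count;
}

/* Keeping the result in a signed int lets the caller see the failure. */
static bool failed_when_signed(bool fail)
{
    int attempt_count = acquire_attempts(fail);
    return attempt_count < 0;
}
```

If the error path is meant to be reachable, declaring the variable as a signed int is one alternative to deleting the check. GCC can flag comparisons of this kind with -Wtype-limits, which is part of -Wextra.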
Runtime segfault; the valgrind output is below.
(base) jain @ compute001 ~/F5/async_hdf5 (rajeeja/async_hdf5_io)
└─ $ ▶ valgrind --leak-check=full ./flash5
==65029== Memcheck, a memory error detector
==65029== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==65029== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==65029== Command: ./flash5
==65029==
[Driver_initParallel]: Called MPI_Init_thread - requested level 3, given level 3
RuntimeParameters_read: ignoring unknown parameter "nriem"...
Grid_init: resolution based on runtime params:
lrefine dx dy
1 1.250 1.250
2 0.625 0.625
3 0.312 0.312
MaterialProperties initialized
attribute # 1 = 2 ->meshVar 1 1
attribute # 2 = 7 ->meshVar 8 1
pt_gcMaskForAdvance: T F F F F F F F T T F
pt_gcMaskForWrite: T F F F F F F T F F F
Particles_init: pt_velNumAttrib is 2
Particles_init: pt_velAttrib is 9 9 10 10 0 0
Source terms initialized
5.0000000000000000 1 1
flash: 2 dimensional vortex initialization
Parameters read:
gamma = 1.3999999999999999
ambient density = 1.0000000000000000
ambient pressure = 1.0000000000000000
ambient x-velocity = 1.0000000000000000
ambient y-velocity = 1.0000000000000000
vortex_strength = 5.0000000000000000
x center = 5.0000000000000000
y center = 5.0000000000000000
x subintervals = 1
y subintervals = 1
Parameters computed :
ambient temperature = 1.2027239580856474E-008
ambient int. energy = 2.5000000000000004
gas constant = 83144598.000000000
iteration, no. not moved = 0 0
Done with refinement: total blocks = 1
[amr_morton_process]: Initializing surr_blks using standard orrery implementation
INFO: Grid_fillGuardCells is ignoring masking.
iteration, no. not moved = 0 0
Done with refinement: total blocks = 5
iteration, no. not moved = 0 0
Done with refinement: total blocks = 21
Finished with Grid_initDomain, no restart
Ready to call Hydro_init
Hydro initialized
Gravity initialized
Warning: The initial timestep is too large.
initial timestep = 2.5000000000000001E-002
CFL timestep = 0.10170685742456619
Resetting dtinit to dr_tstepSlowStartFactor*dtcfl.
Initial dt verified
Particles_initPositions on processor 0 done, pt_numLocal= 100
arrays freed
==65029== Warning: client switching stacks? SP change: 0xffeffedd0 --> 0xe86c078
==65029== to suppress, use: --max-stackframe=68458982744 or greater
==65029== Warning: client switching stacks? SP change: 0xe86bfa0 --> 0xec6d078
==65029== to suppress, use: --max-stackframe=4198616 or greater
==65029== Warning: client switching stacks? SP change: 0xec6cf60 --> 0xffeffedd0
==65029== to suppress, use: --max-stackframe=68454784624 or greater
==65029== further instances of this message will not be shown.
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 0:
#000: H5Pfapl.c line 5671 in H5Pget_vol_info(): not a property list
major: Invalid arguments to routine
minor: Inappropriate type
==65029== Use of uninitialised value of size 8
==65029== at 0xB62B875: H5VL_async_file_create (h5_async_vol.c:21067)
==65029== by 0x102D103: H5VL__file_create (H5VLcallback.c:3393)
==65029== by 0x102D37C: H5VL_file_create (H5VLcallback.c:3427)
==65029== by 0xC6C7E8: H5F__create_api_common (H5F.c:613)
==65029== by 0xC6D0BD: H5Fcreate_async (H5F.c:703)
==65029== by 0x8189EC: io_h5init_file_ (io_h5file_interface.c:205)
==65029== by 0x8200F3: io_initfile_ (io_initFile.F90:56)
==65029== by 0x529F0A: io_writecheckpoint_ (IO_writeCheckpoint.F90:112)
==65029== by 0x52966C: io_outputinitial_ (IO_outputInitial.F90:76)
==65029== by 0x412219: driver_initflash_ (Driver_initFlash.F90:194)
==65029== by 0x42C217: MAIN__ (Flash.F90:49)
==65029== by 0x42C284: main (Flash.F90:43)
==65029==
==65029== Invalid read of size 8
==65029== at 0xB62B875: H5VL_async_file_create (h5_async_vol.c:21067)
==65029== by 0x102D103: H5VL__file_create (H5VLcallback.c:3393)
==65029== by 0x102D37C: H5VL_file_create (H5VLcallback.c:3427)
==65029== by 0xC6C7E8: H5F__create_api_common (H5F.c:613)
==65029== by 0xC6D0BD: H5Fcreate_async (H5F.c:703)
==65029== by 0x8189EC: io_h5init_file_ (io_h5file_interface.c:205)
==65029== by 0x8200F3: io_initfile_ (io_initFile.F90:56)
==65029== by 0x529F0A: io_writecheckpoint_ (IO_writeCheckpoint.F90:112)
==65029== by 0x52966C: io_outputinitial_ (IO_outputInitial.F90:76)
==65029== by 0x412219: driver_initflash_ (Driver_initFlash.F90:194)
==65029== by 0x42C217: MAIN__ (Flash.F90:49)
==65029== by 0x42C284: main (Flash.F90:43)
==65029== Address 0x900000000000001 is not stack'd, malloc'd or (recently) free'd
==65029==
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x5272777
#1 0x5272D7E
#2 0x5F31CAF
#3 0xB62B875
#4 0x102D103 in H5VL__file_create at H5VLcallback.c:3393
#5 0x102D37C in H5VL_file_create at H5VLcallback.c:3427
#6 0xC6C7E8 in H5F__create_api_common at H5F.c:613
#7 0xC6D0BD in H5Fcreate_async at H5F.c:703
#8 0x8189EC in io_h5init_file_ at io_h5file_interface.c:205
#9 0x8200F3 in io_initfile_ at io_initFile.F90:56
#10 0x529F0A in io_writecheckpoint_ at IO_writeCheckpoint.F90:112
#11 0x52966C in io_outputinitial_ at IO_outputInitial.F90:76
#12 0x412219 in driver_initflash_ at Driver_initFlash.F90:194
#13 0x42C217 in flash at Flash.F90:49
==65029==
==65029== HEAP SUMMARY:
==65029== in use at exit: 83,157,515 bytes in 6,205 blocks
==65029== total heap usage: 32,452 allocs, 26,247 frees, 94,766,509 bytes allocated
==65029==
==65029== 8 bytes in 1 blocks are possibly lost in loss record 178 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4001149: allocate_and_init (dl-tls.c:529)
==65029== by 0x4001149: tls_get_addr_tail (dl-tls.c:742)
==65029== by 0xB840C83: local_set_xstream_internal (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB84542C: xstream_launch_root_ythread (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB854F3D: xstream_context_thread_func (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0x4E3F183: start_thread (pthread_create.c:312)
==65029== by 0x5FF903C: clone (clone.S:111)
==65029==
==65029== 336 bytes in 1 blocks are possibly lost in loss record 4,356 of 5,182
==65029== at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4012EE4: allocate_dtv (dl-tls.c:296)
==65029== by 0x4012EE4: _dl_allocate_tls (dl-tls.c:460)
==65029== by 0x4E3FD92: allocate_stack (allocatestack.c:589)
==65029== by 0x4E3FD92: pthread_create@@GLIBC_2.2.5 (pthread_create.c:500)
==65029== by 0xB855073: ABTD_xstream_context_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845379: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845CEC: ABT_xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA7FC: async_instance_init (h5_async_vol.c:1133)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029== by 0x1046321: H5VL__set_def_conn (H5VLint.c:442)
==65029== by 0x104543D: H5VL_init_phase2 (H5VLint.c:201)
==65029==
==65029== 4,194,432 bytes in 1 blocks are possibly lost in loss record 5,175 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4C2D227: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0xB850576: ABTI_ythread_create_root (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845259: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB846F7F: ABTI_xstream_create_primary (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB83F24D: ABT_init (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA1C6: async_init (h5_async_vol.c:925)
==65029== by 0xB5FA544: async_instance_init (h5_async_vol.c:1054)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029== by 0x1046321: H5VL__set_def_conn (H5VLint.c:442)
==65029==
==65029== 4,194,432 bytes in 1 blocks are possibly lost in loss record 5,176 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4C2D227: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0xB84967B: ythread_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB850713: ABTI_ythread_create_main_sched (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845300: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB846F7F: ABTI_xstream_create_primary (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB83F24D: ABT_init (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA1C6: async_init (h5_async_vol.c:925)
==65029== by 0xB5FA544: async_instance_init (h5_async_vol.c:1054)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029==
==65029== 4,194,432 bytes in 1 blocks are possibly lost in loss record 5,177 of 5,182
==65029== at 0x4C2D110: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0x4C2D227: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==65029== by 0xB84967B: ythread_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB850713: ABTI_ythread_create_main_sched (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845300: xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB845CEC: ABT_xstream_create (in /nfs/proj-flash5/argobots/install/lib/libabt.so.1.1.0)
==65029== by 0xB5FA7FC: async_instance_init (h5_async_vol.c:1133)
==65029== by 0xB5FAF6D: H5VL_async_init (h5_async_vol.c:1386)
==65029== by 0x104983D: H5VL__register_connector (H5VLint.c:1237)
==65029== by 0x104A2EE: H5VL__register_connector_by_name (H5VLint.c:1379)
==65029== by 0x1046321: H5VL__set_def_conn (H5VLint.c:442)
==65029== by 0x104543D: H5VL_init_phase2 (H5VLint.c:201)
==65029==
==65029== LEAK SUMMARY:
==65029== definitely lost: 0 bytes in 0 blocks
==65029== indirectly lost: 0 bytes in 0 blocks
==65029== possibly lost: 12,583,640 bytes in 5 blocks
==65029== still reachable: 70,573,875 bytes in 6,200 blocks
==65029== suppressed: 0 bytes in 0 blocks
==65029== Reachable blocks (those to which a pointer was found) are not shown.
==65029== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==65029==
==65029== For counts of detected and suppressed errors, rerun with: -v
==65029== Use --track-origins=yes to see where uninitialised values come from
==65029== ERROR SUMMARY: 7 errors from 7 contexts (suppressed: 0 from 0)
Killed
(base) jain @ compute001 ~/F5/async_hdf5 (rajeeja/async_hdf5_io)
└─ $ ▶ uname -a
Linux compute001 3.13.0-170-generic #220-Ubuntu SMP Thu May 9 12:40:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
(base) jain @ compute001 ~/F5/async_hdf5 (rajeeja/async_hdf5_io)
└─ $ ▶ lsb_release
LSB Version: core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
(base) jain @ compute001 ~/F5/async_hdf5 (rajeeja/async_hdf5_io)
└─ $ ▶ lsb_release -a
LSB Version: core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 14.04.6 LTS
Release: 14.04
Codename: trusty
On macOS, one may encounter the following segfault:
*** Process received signal ***
Signal: Segmentation fault: 11 (11)
Signal code: (0)
Failing at address: 0x0
[ 0] 0 libsystem_platform.dylib 0x00007fff20428d7d _sigtramp + 29
[ 1] 0 ??? 0x0000000000000000 0x0 + 0
[ 2] 0 libabt.1.dylib 0x0000000105bdbdc0 ABT_thread_create + 128
[ 3] 0 libh5async.dylib 0x00000001064bde1f push_task_to_abt_pool + 559
[ 4] 0 libh5async.dylib 0x00000001064e6a02 async_group_create + 1890
[ 5] 0 libh5async.dylib 0x00000001064c4061 H5VL_async_group_create + 321
[ 6] 0 libhdf5.1000.dylib 0x0000000105f6f794 H5VL__group_create + 180
[ 7] 0 libhdf5.1000.dylib 0x0000000105f6f569 H5VL_group_create + 217
[ 8] 0 libhdf5.1000.dylib 0x0000000105d48a04 H5G__create_api_common + 660
[ 9] 0 libhdf5.1000.dylib 0x0000000105d485f5 H5Gcreate2 + 325
[10] 0 async_test_parallel.exe 0x0000000105bbcb43 main + 739
[11] 0 libdyld.dylib 0x00007fff203fef5d start + 1
[12] 0 ??? 0x0000000000000001 0x0 + 1
The solution suggested by an Argobots developer is to set the following environment variable when running the application:
ABT_THREAD_STACKSIZE=100000 ./your_app.exe
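When the launch command cannot easily be changed, the same variable could in principle be set from inside the program. This is only a sketch under two assumptions that are not confirmed here: that Argobots reads ABT_THREAD_STACKSIZE when it is initialized, and that this initialization has not yet happened when the function is called (that is, it runs before the first HDF5 call). The function name ensure_abt_stacksize is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <stdlib.h>

/* Hypothetical helper: set ABT_THREAD_STACKSIZE to the given number of
 * bytes (as a string) unless the user already exported a value.
 * Returns 0 on success and -1 if setenv() fails. */
static int ensure_abt_stacksize(const char *bytes)
{
    /* overwrite = 0 keeps any value already present in the environment */
    return setenv("ABT_THREAD_STACKSIZE", bytes, 0);
}
```

Calling ensure_abt_stacksize("100000") as the first statement in main mirrors the command line above while still letting a value exported in the shell take precedence.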
Using Async I/O VOL version 1.4 with H5S_BLOCK as the memory space causes the errors below.
HDF5-DIAG: Error detected in HDF5 (1.13.3) MPI-process 0:
#000: ../../hdf5-1.13.3/src/H5S.c line 487 in H5Scopy(): not a dataspace
major: Invalid arguments to routine
minor: Inappropriate type
[ASYNC ABT LOG] Argobots execute async_dataset_write_fn failed
free(): invalid pointer
Abort (core dumped)
Here is a short test program to reproduce.
https://github.com/DataLib-ECP/vol-log-based/blob/master/tests/basic/h5s_block.c
Need to update the code to support the new multi-dataset API changes.
Hi,
I am getting errors when I run the test cases in the code.
For example:
$>./async_test_multifile.exe
async_test_multifile.exe: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed
and:
$ ./async_test_serial_event_set_error_stack.exe
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
#000: H5.c line 1010 in H5open(): library initialization failed
major: Function entry/exit
minor: Unable to initialize object
#001: H5.c line 277 in H5_init_library(): unable to initialize vol interface
major: Function entry/exit
minor: Unable to initialize object
#002: H5VLint.c line 202 in H5VL_init_phase2(): unable to set default VOL connector
major: Virtual Object Layer
minor: Can't set value
#003: H5VLint.c line 444 in H5VL__set_def_conn(): can't register connector
major: Virtual Object Layer
minor: Unable to register new ID
#004: H5VLint.c line 1376 in H5VL__register_connector_by_name(): unable to load VOL connector
major: Virtual Object Layer
minor: Unable to initialize object
H5Fcreate start
H5Fcreate done
H5Gcreate start
H5Gcreate done
H5Gcreate 2 start (should fail when executed)
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
#000: H5G.c line 268 in H5Gcreate_async(): unable to asynchronously create group
major: Symbol table
minor: Unable to create file
#001: H5G.c line 185 in H5G__create_api_common(): unable to create group
major: Symbol table
minor: Unable to initialize object
#002: H5VLcallback.c line 4248 in H5VL_group_create(): group create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLcallback.c line 4215 in H5VL__group_create(): group create failed
major: Virtual Object Layer
minor: Unable to create file
#004: H5VLnative_group.c line 103 in H5VL__native_group_create(): unable to create group
major: Symbol table
minor: Unable to initialize object
#005: H5Gint.c line 328 in H5G__create_named(): unable to create and link to group
major: Symbol table
minor: Unable to initialize object
#006: H5L.c line 2546 in H5L_link_object(): unable to create new link to object
major: Links
minor: Unable to initialize object
#007: H5L.c line 2788 in H5L__create_real(): can't insert link
major: Links
minor: Unable to insert object
#008: H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#009: H5Gtraverse.c line 614 in H5G__traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#010: H5L.c line 2581 in H5L__link_cb(): name already exists
major: Links
minor: Object already exists
Error with group create
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
#000: H5S.c line 496 in H5Sclose(): not a dataspace
major: Invalid arguments to routine
minor: Inappropriate type
Closing dataset's dataspace failed
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
#000: H5D.c line 472 in H5Dclose(): not a dataset ID
major: Invalid arguments to routine
minor: Inappropriate type
Closing dataset failed
Thanks
Dear Sir:
When I use HDF5 VOL-Async with HDF5 1.13.1 and run my application with HDF5 asynchronous I/O, I get messages like "ASYNC ABT INFO 0 write size 18920009385957 larger than async memory limit 23632764928, switch to synchronous write".
Have I perhaps failed to set some environment variable on my system? As it stands, the async mode does not work properly.
Thanks
Li Jian
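The write size in the message above, 18920009385957 bytes, is roughly 17 TiB, while the reported limit of 23632764928 bytes is about 22 GiB, so it may be worth checking on the application side that the requested size is what was intended. The following is a minimal sketch for that purpose only. The helper names are hypothetical, and how the connector itself computes the size and the limit is not shown in the report.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: multiply the dimensions by the element size,
 * reporting overflow instead of silently wrapping. On success returns
 * true and stores the byte count in *out_bytes. */
static bool write_size_bytes(const uint64_t *dims, size_t rank,
                             uint64_t elem_size, uint64_t *out_bytes)
{
    uint64_t total = elem_size;
    for (size_t i = 0; i < rank; i++) {
        if (dims[i] != 0 && total > UINT64_MAX / dims[i])
            return false; /* product would not fit in 64 bits */
        total *= dims[i];
    }
    *out_bytes = total;
    return true;
}

/* True when a request of the given size is larger than the limit. */
static bool exceeds_limit(uint64_t bytes, uint64_t limit)
{
    return bytes > limit;
}
```

Printing the computed size next to the dimensions just before the write call would show whether a dimension or element size is unexpectedly large, for example because it was never initialized.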
I am testing the E3SM-IO benchmark with the Log VOL stacked on top of the Async VOL, using release 1.7, and the test crashed.
I think #34 fixed the bug, so may I suggest making a new release?
Dear Authors, @houjun @jeanbez, I was trying to compile 2.1 but hit an issue when I ran the second command, which is
> ./configure --prefix=$H5_DIR/install --enable-parallel --enable-threadsafe --enable-unsupported #(may need to add CC=cc or CC=mpicc)
I tried both CC options and also added some flags to make it work, but it still gives me errors. ./autogen.sh works, but it does not generate any makefile to compile with either.
I get two different errors when adding CC=cc or CC=mpicc.
In some cases I added flags like --with-zlib CFLAGS="03". I followed this link but it still did not resolve my issue, and I am not sure what is blocking me from executing it successfully: Link:
[2.2 works fine
2.3 Fixed but can't make
it as H5 is required]
I tried figuring it out, but I really need some suggestions or help to debug it.
Any suggestions will be highly appreciated. Thank you.
As I'm developing the FORTRAN async tests in HDF5, I'm seeing an issue with H5Aopen_async_f (backtrace below)
Sometimes the test fails and sometimes it does not. I'm running on 6 ranks.
It is basically doing:
CALL h5fopen_async_f(filename, H5F_ACC_RDWR_F, file_id, es_id, hdferror, access_prp = fapl_id )
CALL check("h5fopen_async_f",hdferror, total_error)
f_ptr = C_LOC(exists0)
CALL H5Aexists_async_f(file_id, attr_name, f_ptr, es_id, hdferror)
CALL check("H5Aexists_async_f",hdferror, total_error)
f_ptr = C_LOC(exists1)
CALL H5Aexists_async_f(file_id, TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
CALL check("H5Aexists_async_f",hdferror, total_error)
f_ptr = C_LOC(exists2)
CALL H5Aexists_by_name_async_f(file_id, "/", attr_name, f_ptr, es_id, hdferror)
CALL check("H5Aexists_by_name_async_f",hdferror, total_error)
f_ptr = C_LOC(exists3)
CALL H5Aexists_by_name_async_f(file_id, "/", TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
CALL check("H5Aexists_by_name_async_f",hdferror, total_error)
CALL H5Aopen_async_f(file_id, attr_name, attr_id0, es_id, hdferror) <--- fails here
CALL check("H5Aopen_async_f", hdferror, total_error)
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x7f5c5f7734e2 in ???
#1 0x7f5c5f772675 in ???
#2 0x7f5c5e280d4f in ???
#3 0x7f5c5e280cbb in ???
#4 0x7f5c5e282354 in ???
#5 0x7f5c5e278cb9 in ???
#6 0x7f5c5e278d41 in ???
#7 0x7f5c60813ec7 in H5F__get_objects_cb
at ../../src/H5Fint.c:631
#8 0x7f5c608e3555 in H5I__iterate_cb
at ../../src/H5Iint.c:1526
#9 0x7f5c608e4eb2 in H5I_iterate
at ../../src/H5Iint.c:1592
#10 0x7f5c60813dc0 in H5F__get_objects
at ../../src/H5Fint.c:599
#11 0x7f5c608173a0 in H5F_get_obj_count
at ../../src/H5Fint.c:475
#12 0x7f5c60920b98 in H5O__attr_find_opened_attr
at ../../src/H5Oattribute.c:661
#13 0x7f5c60921f31 in H5O__attr_open_by_name
at ../../src/H5Oattribute.c:473
#14 0x7f5c606fcacc in H5A__open
at ../../src/H5Aint.c:535
#15 0x7f5c60b04368 in H5VL__native_attr_open
at ../../src/H5VLnative_attr.c:154
#16 0x7f5c60ae073d in H5VL__attr_open
at ../../src/H5VLcallback.c:1104
#17 0x7f5c60ae8827 in H5VLattr_open
at ../../src/H5VLcallback.c:1175
#18 0x7f5c60d8527b in async_attr_open_fn
at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5675
#19 0x7f5c5c1bbc97 in ???
#20 0x7f5c5c1c1e98 in ???
#21 0xffffffffffffffff in ???
When I try to run hdf5-iotest on more than one node, I get the crash below. It works fine when using one node:
#0 0x000020001ac6bfb4 in ABT_thread_create () from /ccs/home/brtnfld/packages/argobots/build/argobots//lib/libabt.so.1
#1 0x0000200003d98870 in push_task_to_abt_pool (qhead=0x4b22fed0, pool=0x4b2a1980) at h5_async_vol.c:2249
#2 0x0000200003db98e4 in async_file_open (qtype=REGULAR, aid=0x4b22fed0, name=0x7fffdc00e840 "hdf5_iotest.h5", flags=0, fapl_id=792633534417208627, dxpl_id=792633534417207304, req=0x0) at h5_async_vol.c:13253
#3 0x0000200003dd4b3c in H5VL_async_file_open (name=0x7fffdc00e840 "hdf5_iotest.h5", flags=0, fapl_id=792633534417207316, dxpl_id=792633534417207304, req=0x0) at h5_async_vol.c:22141
#4 0x00002000004a85e4 in H5VL__file_open (name=<optimized out>, name@entry=0x7fffdc00e840 "hdf5_iotest.h5", flags=flags@entry=0, fapl_id=<optimized out>, fapl_id@entry=792633534417207316, dxpl_id=<optimized out>,
dxpl_id@entry=792633534417207304, req=<optimized out>, req@entry=0x0, cls=<optimized out>, cls=<optimized out>) at ../../src/H5VLcallback.c:3497
#5 0x00002000004b199c in H5VL_file_open (connector_prop=0x7fffdc00e440, name=0x7fffdc00e840 "hdf5_iotest.h5", flags=<optimized out>, fapl_id=792633534417207316, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:3646
#6 0x000020000025346c in H5F__open_api_common (filename=filename@entry=0x7fffdc00e840 "hdf5_iotest.h5", flags=flags@entry=0, fapl_id=<optimized out>, fapl_id@entry=792633534417207316, token_ptr=token_ptr@entry=0x0)
at ../../src/H5F.c:795
#7 0x0000200000255c38 in H5Fopen_async (app_file=0x1000f878 "../../src/read_test.c", app_func=0x1000fbc8 "read_test", app_line=<optimized out>, filename=0x7fffdc00e840 "hdf5_iotest.h5", flags=<optimized out>,
fapl_id=792633534417207316, es_id=0) at ../../src/H5F.c:880
#8 0x0000000010009284 in ?? ()
#9 0x000000001000820c in ?? ()
#10 0x00002000008b4078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#11 0x00002000008b4264 in __libc_start_main () from /lib64/power9/libc.so.6
#12 0x0000000000000000 in ?? ()
For the serial tests (test/API in HDF5), only h5_api_test_attribute fails with:
1: Testing shared datatype for attributes *FAILED*
1: reference count of the named datatype is wrong: 1
For the parallel tests (testpar/API), only h5_api_test_parallel_async fails with:
9: **********************************************
9: * *
9: * API Parallel Async Tests *
9: * *
9: **********************************************
9:
9: Testing single dataset I/O
9: Testing test setup HDF5-DIAG: Error detected in HDF5 (1.15.0) MPI-process 0:
9: #000: ../../src/H5VLcallback.c line 6321 in H5VLintrospect_get_conn_cls(): NULL obj pointer
9: major: Invalid arguments to routine
9: minor: Bad value
9: HDF5-DIAG: Error detected in HDF5 (1.15.0) MPI-process 0:
9: #000: ../../src/H5VL.c line 658 in H5VLobject_is_native(): can't determine if object is a native connector object
9: major: Virtual Object Layer
9: minor: Can't get value
9: #001: ../../src/H5VLint.c line 1077 in H5VL_object_is_native(): can't get VOL connector class
9: major: Virtual Object Layer
9: minor: Can't get value
9: #002: ../../src/H5VLcallback.c line 6289 in H5VL_introspect_get_conn_cls(): can't query connector class
9: major: Virtual Object Layer
9: minor: Can't get value
9: #003: ../../src/H5VLcallback.c line 6256 in H5VL__introspect_get_conn_cls(): can't query connector class
9: major: Virtual Object Layer
9: minor: Can't get value
9: #004: ../../src/H5VLcallback.c line 6321 in H5VLintrospect_get_conn_cls(): NULL obj pointer
9: major: Invalid arguments to routine
9: minor: Bad value
9: *FAILED*
I am using the develop branch of vol-async (73a870d) to test the E3SM-IO benchmark.
One of the tests failed. The failing command was run on 1 MPI process, but
the same command runs fine with 16 processes.
Below are the related env variables.
HDF5_PLUGIN_PATH=$HOME/ASYNC_VOL/lib
HDF5_VOL_CONNECTOR=async under_vol=0;under_info={}
LD_LIBRARY_PATH=$HOME/ASYNC_VOL/lib:$HOME/Argobots/1.1/lib:$HOME/HDF5/1.14.1-2-thread/lib
Here is the run command.
e3sm_io -k -r 2 -y 2 datasets/map_f_case_16p.h5 -o blob_f_out.h5 -a hdf5 -x blob
Part of the GDB trace is given below.
#26 0x00007f717436f218 in H5D__write (count=count@entry=1, dset_info=dset_info@entry=0x7f71565fff00)
at ../../hdf5-1.14.1-2/src/H5Dio.c:745
#27 0x00007f71745b1f61 in H5VL__native_dataset_write (count=1, obj=<optimized out>,
mem_type_id=<optimized out>, mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=<optimized out>,
buf=0x191c130, req=0x0) at ../../hdf5-1.14.1-2/src/H5VLnative_dataset.c:407
#28 0x00007f717459db47 in H5VL__dataset_write (cls=<optimized out>, req=0x0, buf=0x191c130,
dxpl_id=792633534417207497, file_space_id=0x191b230, mem_space_id=0x1922630, mem_type_id=0x191a430,
obj=0x1915350, count=1) at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2236
#29 H5VLdataset_write (count=1, obj=0x1915350, connector_id=648518346341351424, mem_type_id=0x191a430,
mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=792633534417207497, buf=0x191c130, req=0x0)
at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2396
#30 0x00007f71725a8ef0 in async_dataset_write_fn (foo=0x1a335a0)
at /homes/wkliao/ASYNC_VOL/vol-async/src/h5_async_vol.c:9712
#31 0x00007f717238104a in ABTD_ythread_func_wrapper (p_arg=0x7f71566001e0)
at ../../argobots-1.1/src/arch/abtd_ythread.c:21
Hi,
This is just a comment on an issue I found in the test/Makefile.
I got this error while running the test async_test_serial.exe:
./async_test_serial.exe: symbol lookup error: /home/myuser/hdf5-async/vol-async/src/libh5async.so: undefined symbol: ABT_initialized
I noticed that LDFLAGS in the test/Makefile has:
LDFLAGS = $(DEBUG) -L$(ASYNC_DIR) -L$(ABT_DIR)/lib -L$(HDF5_DIR)/lib -Wl,-rpath=$(ASYNC_DIR) -Wl,-rpath=$(ABT_DIR)/lib -Wl,-rpath=$(HDF5_DIR)/lib -labt -lhdf5 -lh5async -lasynchdf5 -labt
So I removed '-lh5async', which points to the dynamic library, from LDFLAGS. Now the test async_test_serial.exe passes.