Comments (12)

HalfPhoton commented on June 16, 2024

Hi @Rafael-Cast thanks for reporting this issue.

We'll look into this right away.

Kind regards,
Rich

from pod5-file-format.

0x55555555 commented on June 16, 2024

Hi @Rafael-Cast ,

I've had a look internally and can't reproduce trivially by writing > 1000 reads.

Can you provide more info about how you are calling the function to add reads?

Thanks,

  • George

Rafael-Cast commented on June 16, 2024

Hi, thanks for looking into the issue.

I'm reading a POD5 file and copying it with a different compression method (either uncompressed or VBZ, as requested on input). I could provide the full source code, but it's long (~400 lines) and not very tidy. Nonetheless, if you think it would help, I have no problem providing it.

Here is the relevant snippet:

	for (size_t current_batch_idx = 0; current_batch_idx < batch_count; current_batch_idx++)
	{
		Pod5ReadRecordBatch_t *current_batch;
		LOG_PROGRAM_ERROR(pod5_get_read_batch(&current_batch, reader, current_batch_idx))
		size_t batch_row_count;
		LOG_PROGRAM_ERROR(pod5_get_read_batch_row_count(&batch_row_count, current_batch))
		ReadBatchRowInfo_t read_record_batch_array[batch_row_count]; // For large batches this could cause a stack overflow. Should push data to heap
		size_t sample_count[batch_row_count];
		int16_t *signal[batch_row_count];

		for (size_t current_batch_row = 0; current_batch_row < batch_row_count;
			 current_batch_row++)
		{
			// Load ReadBatchRowInfo to memory
			uint16_t read_table_version;
			LOG_PROGRAM_ERROR(pod5_get_read_batch_row_info_data(
				current_batch, current_batch_row, READ_BATCH_ROW_INFO_VERSION,
				&read_record_batch_array[current_batch_row], &read_table_version));

			LOG_PROGRAM_ERROR(pod5_get_read_complete_sample_count(
				reader, current_batch, current_batch_row,
				&sample_count[current_batch_row]))

			signal[current_batch_row] =
				(int16_t *)malloc(sizeof(int16_t) * sample_count[current_batch_row]);
			LOG_PROGRAM_ERROR(pod5_get_read_complete_signal(
				reader, current_batch, current_batch_row,
				sample_count[current_batch_row], signal[current_batch_row]))
		}

		uint32_t signal_length[batch_row_count];
		for (size_t i = 0; i < batch_row_count; i++)
		{
			signal_length[i] = sample_count[i];
		}

		static ReadBatchRowInfoArray_t flattened_array;
		transform_read_data_batch_array(read_record_batch_array, batch_row_count, current_batch, writer, &flattened_array);
		LOG_PROGRAM_ERROR(pod5_add_reads_data(
			writer, batch_row_count, READ_BATCH_ROW_INFO_VERSION, &flattened_array,
			const_cast<const int16_t **>(signal), signal_length))

		free_batch_array(&flattened_array);

		for (size_t i = 0; i < batch_row_count; i++)
		{
			free(signal[i]);
		}
		LOG_PROGRAM_ERROR(pod5_free_read_batch(current_batch))
	}

Here:

		uint32_t signal_length[batch_row_count];
		for (size_t i = 0; i < batch_row_count; i++)
		{
			signal_length[i] = sample_count[i];
		}

This converts the array's element type from size_t to uint32_t (it might not be necessary, but it shouldn't be part of the problem).

And:

		static ReadBatchRowInfoArray_t flattened_array;
		transform_read_data_batch_array(read_record_batch_array, batch_row_count, current_batch, writer, &flattened_array);

This flattens the array of ReadBatchRowInfo_t into a ReadBatchRowInfoArray_t, as required by pod5_add_reads_data (I couldn't find a way to directly feed what pod5_get_read_batch_row_info_data returns into pod5_add_reads_data without flattening).
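The flattening described here is an array-of-structs to struct-of-arrays conversion. The following is only a minimal sketch of that idea: `RowInfo`, `RowInfoArrays`, and `flatten` are hypothetical stand-ins, not the real `ReadBatchRowInfo_t`, `ReadBatchRowInfoArray_t`, or `transform_read_data_batch_array`, and the real types carry many more fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-read record: one struct per row (array-of-structs form).
struct RowInfo {
    std::uint32_t read_number;
    std::int16_t run_info;   // index into the run info table
    std::int16_t pore_type;  // index into the pore type table
};

// Hypothetical flattened form: one array per field (struct-of-arrays form).
struct RowInfoArrays {
    std::vector<std::uint32_t> read_number;
    std::vector<std::int16_t> run_info;
    std::vector<std::int16_t> pore_type;
};

// Copies each field of every row into the matching per-field array.
RowInfoArrays flatten(const std::vector<RowInfo>& rows) {
    RowInfoArrays out;
    out.read_number.reserve(rows.size());
    out.run_info.reserve(rows.size());
    out.pore_type.reserve(rows.size());
    for (const RowInfo& row : rows) {
        out.read_number.push_back(row.read_number);
        out.run_info.push_back(row.run_info);
        out.pore_type.push_back(row.pore_type);
    }
    // Every per-field array must end up with one entry per input row.
    assert(out.read_number.size() == rows.size());
    assert(out.run_info.size() == rows.size());
    assert(out.pore_type.size() == rows.size());
    return out;
}
```

Holding the arrays in `std::vector` also avoids a separate free step for the flattened data, at the cost of passing `.data()` pointers if a C API expects raw arrays.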

By the way, the Python script check_pod5_files_equal.py reports the produced files as consistent when the program doesn't crash on this assert. Is this the "intended" use of that script? I'm using it as a kind of observational equivalence test.

Thanks,
Rafael.

0x55555555 commented on June 16, 2024

I assume you are adding pore types and run infos in transform_read_data_batch_array, rather than just reusing the bare integer values?

Do you have a gdb core dump (and built executable) I could have a look at? Or better, a full buildable project I can poke at?

Thanks,

  • George

Rafael-Cast commented on June 16, 2024

I'm adding both pore types and run infos in transform_read_data_batch_array.

The buildable project is located at https://github.com/Rafael-Cast/pod5-file-format-debug.git under the "debug" branch.
This branch contains both the executable and the data. It is a fork of your project with my code added under "examples". You shouldn't need any dependencies beyond those included in your project.
To compile from scratch, simply run "bash install.sh"; "bash run.sh" will then execute the failing test case.

I'm using gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-16) to compile the project.

Thanks,
Rafael.

0x55555555 commented on June 16, 2024

Hi @Rafael-Cast ,

I get an error about a missing input file: ".../pod5-file-format-debug/batch12_new.pod5" I can't see it in the repo?

Thanks,

  • George

0x55555555 commented on June 16, 2024

Reviewing the code though, this looks a bit dodgy:

https://github.com/Rafael-Cast/pod5-file-format-debug/blob/debug/c%2B%2B/examples/copy.cpp#L200

> run_info_id[i] = in_data[i].run_info;

It copies the run info id from the source file to the dest file, but none of the run info data.

This could cause the issue - although the error should be better.

edit: running a test locally gives me the same call stack as you get - I'll get a better error message in.

Thanks,

  • George

Rafael-Cast commented on June 16, 2024

I forgot to add the data sample which crashes. I've added it now to the repo.

Rafael-Cast commented on June 16, 2024

When copying the read data with pod5_add_reads_data, only the run info id is required; that's why I'm only copying the ID on that line.
The run info is copied after copying the read data. Is this correct, or should the read data be copied first?

On the other hand, if this line is switched (https://github.com/Rafael-Cast/pod5-file-format-debug/blob/070836d28828c5956f312216a6a61ec61c8627e5/c%2B%2B/examples/copy.cpp#L357):

from:

const Pod5WriterOptions_t writer_options = {0, comp_opt, 0, 0};

to:

	const Pod5WriterOptions_t writer_options = {0, comp_opt, 0, 10000};

On versions prior to 0.2.0, this change solves the issue.

Nonetheless, I think the problem is that I'm assuming something about your API that isn't true. Later I'll revise the line you pointed out, write the run info first and then the read data, and try again.
This does seem to be an issue with my code. I'm sorry for (probably) reporting a nonexistent bug.

edit: Markdown to show code

Thanks,
Rafael.

0x55555555 commented on June 16, 2024

Yes - if you add the run info data first the issue will go away - but as you say, the API should make it clear this order is unsupported with an error, not crash.
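To illustrate the ordering constraint, here is a sketch using a mock writer. `MockWriter`, `add_run_info`, `add_read`, and `Status` are hypothetical names and are not the pod5 API; the sketch only shows the general idea of rejecting a read whose run info index has not been added yet, instead of crashing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical status codes for the mock writer.
enum class Status { Ok, InvalidRunInfo };

// Hypothetical writer that tracks which run info indices exist.
class MockWriter {
public:
    // Registers a run info record and returns its index.
    std::int16_t add_run_info(const std::string& acquisition_id) {
        // The returned index must fit in int16_t.
        assert(run_infos_.size() < static_cast<std::size_t>(INT16_MAX));
        run_infos_.push_back(acquisition_id);
        return static_cast<std::int16_t>(run_infos_.size() - 1);
    }

    // Adds a read; fails cleanly if the run info index is unknown.
    Status add_read(std::int16_t run_info_index) {
        if (run_info_index < 0 ||
            static_cast<std::size_t>(run_info_index) >= run_infos_.size()) {
            return Status::InvalidRunInfo;
        }
        reads_.push_back(run_info_index);
        return Status::Ok;
    }

    std::size_t read_count() const { return reads_.size(); }

private:
    std::vector<std::string> run_infos_;
    std::vector<std::int16_t> reads_;
};
```

With this mock, calling `add_read(0)` on a fresh writer returns `Status::InvalidRunInfo`, while calling `add_run_info` first and passing the returned index to `add_read` returns `Status::Ok`.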

I'll get that new error in asap.

  • George

0x55555555 commented on June 16, 2024

Hi @Rafael-Cast ,

0.2.2 is now live; it should return an error when adding a read with an invalid run info id.

Hope that helps!

  • George

Rafael-Cast commented on June 16, 2024

Hi @jorj1988,

Thanks for the help and update!

Rafael
