Giter Club home page Giter Club logo

Comments (10)

braingram avatar braingram commented on July 2, 2024

@perrygreenfield what do you think about dropping saving the 'base' array. @eslavich mentioned that you might have some thoughts on this. One 'pro' is the above issue would be fixed. One 'con' is that files with views of other arrays in the same file would contain duplicate data. I have not yet started investigating how much work would be required to drop this feature (and what effect it would have on downstream packages). I am envisioning a result/solution that would:

  • by default, not save the base array (to avoid the above issue)
  • allow the user to specify that a specific array should have it's 'base' array saved (to allow data sharing in instances where the user knows this benefit could be had). I'm thinking this might be possible using the array_storage options but this might require changes to the asdf-standard as this information should likely make it into the file either in the tree or in the block header.

I haven't constructed a case for this yet but I imagine we might see the above issue if, for instance, a user opens a roman file, takes a cut-out, and attempts to save this cut-out to an asdf file (they would currently end up with a file that is likely larger than the original file).

from asdf.

schlafly avatar schlafly commented on July 2, 2024

FWIW, in the context of Roman we had accidentally saved the entire L1 file in the L2 files because we wanted to save the 4 reference pixels and were bringing those in as a view. If we keep this feature, it feels worthwhile to throw some kind of warning if the views are much smaller than the image.

Brett was able to rapidly diagnose this, but I was not---I looked at all the arrays and the dtypes and saw they were correct but I could not find the "missing memory," which required looking at the block manager et al..

from asdf.

perrygreenfield avatar perrygreenfield commented on July 2, 2024

If I'm not mistaken, you should be able to tell from the yaml definition for the small arrays that they are views based on the metadata necessary to present a view into a larger array. And now I see a long ago question I need to answer.

from asdf.

schlafly avatar schlafly commented on July 2, 2024

It's true that I remember seeing words like 'strides' in there, but not having ever stared at the yaml definitions before, for me it was not enough to trigger alarms.

from asdf.

perrygreenfield avatar perrygreenfield commented on July 2, 2024

Regarding the old question. If there is only one view of an array (and the base array is not referred to the tree) it should be possible to determine that and thus only save the data in the view. It becomes more complex if there are multiple views (again without a reference to the base array in the tree), but still in principle possible to determine the subarray needed to contain all views. But this suggests some pre-write analysis to do such optimizations (it has to be held until then since the base array may be referenced later) . Also, if views don't share common areas, they could also be made separate arrays.

But I would be against losing the shared aspect of views since that has meaning and duplication would break.

from asdf.

braingram avatar braingram commented on July 2, 2024

Thanks for the discussion. I don't think the approach of looking for views that overlap would solve the roman issue. There are 4 row/cols of reference pixels on all edges (top, bottom, left, right) of the detector array and views are generated by slicing out these rows/cols.
https://github.com/spacetelescope/romancal/blob/6c9b6c7816e7e20a8cf5bf218bb712043270f0f6/romancal/dq_init/dq_init_step.py#L106-L109
Since these views overlap (in the 4x4 pixel regions at the corners of the detector) asdf would still have to save the entire L1 data array.

I think minimally we should provide a way to disable this behavior because as seen above there are some cases where we know we don't want to save the base array. I think this could be accomplished by extending the current block options (although those are stored using the base array). I started a draft PR taking this approach: #1753

from asdf.

schlafly avatar schlafly commented on July 2, 2024

And two cents, the default should be to warn if you're putting a view in a file, or at least provide an easier mechanism to see the size of the blocks (e.g., in asdf.info? or something?). We were definitely making files 3x too big for 6 months without noticing!

from asdf.

perrygreenfield avatar perrygreenfield commented on July 2, 2024

As far as the warning goes, sure until the behavior changes. But when people make slices that they don't care that they share some identical data, I think the onus is on the developer to copy the views before saving.

from asdf.

braingram avatar braingram commented on July 2, 2024

I was curious how the current asdf behavior (saving the base array) is impacting roman file sizes (beyond the issue mentioned above). I modified the compare_asdf function to calculate the file size for 2 conditions:

  • "shared", the current asdf behavior (saving the base array)
  • "not-shared", disabling the base array saving (using the feature in #1753)

I then ran the regression tests (https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FRoman-Developers-Pull-Requests/detail/Roman-Developers-Pull-Requests/601/tests) where each test failure is due to a difference in file size between the 2 conditions. The table below shows the file sizes for the 22 test files. Only 2 had a smaller "shared" size (for these difference was <1%). Many files are significantly smaller for the "not-shared" case (not saving the base array). Just looking at the mean of the ratios, disabling saving the base array will reduce the roman file sizes by ~22%.

test name "shared" "not-shared" "not-shared" to "shared" ratio
test_flat_field_image_step 1151629085 414480658 0.3599081192014181
test_dark_current_subtraction_step 2327063200 1590438649 0.6834531391326201
test_level2_image_processing_pipeline 415435017 414911149 0.9987389893038313
test_linearity_step 2327063171 1590438620 0.6834531351877942
test_dq_init_image_step 1587554674 1590438522 1.0018165346033305
test_refpix_step 2327063136 1590438585 0.6834531304267973
test_absolute_photometric_calibration 878475679 409238421 0.4658505986936947
test_dq_init_grism_step 1587554771 1590438620 1.0018165351222392
test_rampfit_step[image_full] 348927658 347617992 0.9962465973390966
test_flat_field_crds_match_image_step 1151629093 414480666 0.3599081236479322
test_dark_current_outfile_step 2327063167 1590438616 0.6834531346436803
test_saturation_image_step 1654663642 1590438560 0.9611854153498104
test_rampfit_step[spec_full] 348927629 347617963 0.9962465970271446
test_linearity_outfile_step 2327063138 1590438587 0.6834531306988543
test_rampfit_step[image_trunc] 347878872 346569206 0.9962352815723744
test_saturation_grism_step 1654663739 1590438657 0.961185417625206
test_level2_grism_processing_pipeline 348960667 347651001 0.9962469523821721
test_dark_current_outfile_suffix 2327063167 1590438616 0.6834531346436803
test_rampfit_step[spec_trunc] 347878869 346569203 0.9962352815399087
test_dark_current_output 2327063200 1590438649 0.6834531391326201
test_processing_pipeline_all_saturated 1654663619 1590438536 0.9611854142059311
test_tweakreg 1152206554 415058127 0.3602289238497076

from asdf.

braingram avatar braingram commented on July 2, 2024

xref: #1005

from asdf.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.