
Comments (4)

jgriesfeller commented on August 30, 2024

Hi,
the caching should be variable specific. I don't think JSON is the right format for the cache files: converting the data to strings slows everything down. I would choose a standard binary format and try netCDF first. Given the restrictions of netCDF, HDF might be better suited because it does not share dimensions between variables. But in the end it might help to run some tests before making a decision.

Another way of getting around caching problems is to make the reading fast enough ;-)

Another thing: pickle also saves all the internal variables that are part of the object but, strictly speaking, not part of the data. Some of them might need caching as well (e.g. the file date of the newest file of a dataset, the version of the pyaerocom reading class used, etc.).

To be honest, I think there is no quick help with the programming, simply because you are the only one who knows the pyaerocom source well enough to develop on it right now. The reality is that I cannot stay focused on pyaerocom long enough to get to know it again, due to other things coming up all the time and the satellite work I am doing (Sentinel-5P work was just added to my todo list). The month of August might be an exception (but it might be needed for other things, like updating the CAM43 alert).


jgliss commented on August 30, 2024

@jgriesfeller @MichaelSchulzMETNO @AugustinMortier @hannasv @paulina-t

I think we should resolve this issue with medium to high priority. The main reason is that the current caching strategy is rather inefficient: it leads to repeated recreation of the ungridded cache files whenever different variable combinations (for one dataset) are used alternately during reading of observations (e.g. scattering & absorption coeffs., pm10 & pm25, or sconco3 & sconcso4); more below. And, as I painfully found out during the last 2 weeks (working from home), it is terribly slow to reload non-cached data from lustre over VPN (or even when working on lustre directly). Overall: caching needs to be more flexible, and we need to be able to add / update / remove individual variables in an existing cache file for a given dataset.

Current caching strategy

  • Someone wants to read something, e.g. EBAS, O3 and SO4 surface concentrations:
    data = ReadUngridded().read("EBASMC", vars_to_retrieve=["sconco3", "sconcso4"])
  • During reading, the data object (instance of UngriddedData class) will be cached automatically into a pickle file EBASMC_MultipleVars_None_None.pkl, in the cache directory (pyaerocom.const.CACHEDIR).
  • Side note: The None_None in the cache filename indicates the start / stop time of the dataset and is only written if an explicit time constraint was applied to the data before caching. I think this should be removed from the convention, as it is currently not used and won't be necessary with a more flexible caching strategy (more below). MultipleVars indicates that the cached data object contains more than one variable.
  • Now consider, a day later, the same someone wants to read:
    data = ReadUngridded().read("EBASMC", vars_to_retrieve=["absc550aer", "scatc550dryaer"])
  • the ReadUngridded.read method finds the existing pickle cache file, looks into it, and finds that it only contains sconco3 and sconcso4.
  • Since it cannot add data to that file, it deletes the existing cache file, reads absc550aer and scatc550dryaer, and saves a new cache file (with the same name): EBASMC_MultipleVars_None_None.pkl.
  • Now consider that, the next day, someone wants to read sconcpm10 and sconcpm25 ...
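The walkthrough above boils down to an all-or-nothing check. A minimal sketch (the function is illustrative, not pyaerocom's actual code) of why alternating variable sets thrash the single cache file:

```python
# Hypothetical sketch of the current all-or-nothing cache logic: the single
# cache file is discarded whenever the requested variables are not all
# already contained in it.

def needs_rebuild(cached_vars, requested_vars):
    """Return True if the single cache file must be deleted and recreated."""
    return not set(requested_vars).issubset(set(cached_vars))

# Day 1: the cache holds sconco3 / sconcso4.
cache = {"sconco3", "sconcso4"}
# Day 2: a different variable pair forces a full rebuild ...
assert needs_rebuild(cache, ["absc550aer", "scatc550dryaer"])
# ... which throws away the old variables, so day 3 thrashes again.
```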

To conclude: this is rather inefficient and I propose to investigate the following strategy instead:

Proposed new caching strategy

  • Move away from pickle and use the json standard for the cache files.
  • One cache file for each dataset (e.g. EBASMC.json) or one cache file per dataset and variable (e.g. EBASMC_scatc550aer.json)
  • The file structure could be something like:
EBASMC.json
{
   "sconco3" : {"revision_and_pyaerocom_version_identifier" : <O3 data>},
   "sconcso4" : {"revision_and_pyaerocom_version_identifier" : <SO4 data>},
   ...,
   "scatc550dryaer" : {"revision_and_pyaerocom_version_identifier" : <scattering data>},
}

Advantages

  • It is easy to add / update / remove individual variable datasets from the cache file
  • Reloading from lustre is then done per variable and only if either the pyaerocom version or the dataset revision date (i.e. when it was updated on lustre) has changed.
  • We use the json standard in all our interfaces and it would probably not be too hard to create the json files for the web interfaces directly from the cache json files.

ToDos

  • Investigate the performance of json serialization / deserialization vs. pickle (i.e. how fast can we read from / write to json based cache files)

  • Implement json I/O methods in UngriddedData:

    • starting with two methods to convert an instance of UngriddedData to and from a dictionary, to_dict() and from_dict(<data_dict>), with an architecture that resembles the architecture of the UngriddedData object as closely as possible, while being organised so that it can be efficiently read from and written to json, as indicated above (e.g. nested dictionaries organised in a hierarchy: dataset -> variable -> metablock -> dataarray).
    • then, based on that, implement read / write methods for json in UngriddedData:
      to_json(<file_path>) and from_json(<file_path>).
      These two methods should accept additional constraints, such as var_name (to specify the variables that are supposed to be read / written from / to the input json file), and to_json should also be able to append to / update an existing cache file if the input file location (<file_path>) already exists and is a valid cache file.
  • Implementing filtering by variable in UngriddedData could be helpful as well, i.e. to extract single-variable subsets from multi-variable instances of UngriddedData.
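The to_dict / from_dict idea could look roughly like this; this is an illustrative skeleton under assumed names, not the actual UngriddedData class, showing the dataset -> variable -> metadata/data hierarchy and per-variable extraction:

```python
# Hypothetical skeleton of the proposed dictionary conversion, organised as
# dataset -> variable -> {"meta": ..., "data": ...} so that individual
# variables can be written to / read from json independently.

class UngriddedDataSketch:
    def __init__(self, dataset_id, data_by_var=None):
        self.dataset_id = dataset_id
        # {var_name: {"meta": {...}, "data": [...]}}
        self.data_by_var = data_by_var or {}

    def to_dict(self, var_name=None):
        """Export all variables, or a single-variable subset if var_name is set."""
        variables = [var_name] if var_name else list(self.data_by_var)
        return {self.dataset_id: {v: self.data_by_var[v] for v in variables}}

    @classmethod
    def from_dict(cls, data_dict):
        """Rebuild an instance from the nested dictionary representation."""
        (dataset_id, by_var), = data_dict.items()
        return cls(dataset_id, dict(by_var))
```

The var_name argument of to_dict doubles as the variable filter mentioned in the last ToDo.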

Things to consider

  • A cached data object can be several GB in size if it contains multiple variables
  • Maybe there is a better alternative than json?
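The serialization-performance question from the ToDos could be probed with a rough benchmark along these lines. Note that real UngriddedData holds large numpy arrays, which json cannot serialise directly; plain lists of floats stand in here, which already hints at json's string-conversion overhead for numeric data:

```python
import json
import pickle
import random
import timeit

# Rough benchmark sketch: round-trip a cache-like nested dict through json
# and through pickle, and compare wall-clock times.
payload = {"sconco3": {"rev20200101_v0.10": [random.random() for _ in range(50_000)]}}

t_json = timeit.timeit(lambda: json.loads(json.dumps(payload)), number=10)
t_pickle = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)),
    number=10,
)
print(f"json: {t_json:.3f} s   pickle: {t_pickle:.3f} s")
```

Pickle is generally expected to win on raw numeric data, so the trade-off is speed versus the portability and partial-update friendliness of json.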

Happy to hear your ideas / opinions on this!

Cheers,
Jonas


jgliss commented on August 30, 2024

Hi Jan,
thanks for your feedback. I agree with you, and json I/O performance was one of my main concerns. For now, due to lack of time, I will stick with pickle, but get rid of the MultipleVars object and pickle single-variable instances of UngriddedData (this should be easy to implement and will solve most of the problems mentioned above).
Then, whenever someone requests data (e.g. EBASMC, absc550aer, scatc550dryaer), the reading routine will check, for each requested variable, whether an up-to-date pickled UngriddedData object exists and load it; otherwise it loads from lustre. At the end of the reading, the individual single-variable UngriddedData objects are merged into one and returned.

"Another way of getting around caching problems is to make the reading fast enough ;-)" -> Agreed, but there will be a speed limit if the original data is stored in hundreds of thousands of text files (may it be due to python, due to connection, due to loads of traffic on lustre, or due to inefficient coding) :-)


jgliss commented on August 30, 2024

Caching of UngriddedData is now done per single variable, and the data objects are pickled. During reading (using the ReadUngridded class), existing cache files are checked for each input variable and, if available, loaded into single-variable instances of UngriddedData. All variables for which no cache files exist are read (and cached). At the end of the reading routine, all data objects are merged into a single instance of UngriddedData containing all variables.

Closing this issue for now.

