Python implementation of the SynDiffix synthetic data generation mechanism.
License: Other
It occurs to me that we should not include AID columns in the output.
The main reason is that the AID columns have no value, but the user might not know this.
In particular, we don't capture any event information, for instance the distribution of rows over AIDs, or inter-event timing or sequences. But if we include some kind of AID column, then the user might assume that we do capture this stuff, and get bad results.
Until we implement event information, it seems it would be cleaner and clearer just to exclude the AID columns from the output.
Once we have all the individual stages available, we need a top-level module that assembles the full syndiffix pipeline:
I get different features on different runs over the same table/column pair. We must manage the random_state
to be deterministic.
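As a minimal illustration of the fix (the helper name is hypothetical, not the syndiffix API): seeding numpy's `Generator` with a fixed value, or threading an integer `random_state` through to every randomized call, makes repeated runs produce the same features.

```python
import numpy as np

def pick_features(seed: int) -> list[int]:
    """Stand-in for a randomized feature-selection step (hypothetical)."""
    rng = np.random.default_rng(seed)
    return sorted(rng.choice(10, size=3, replace=False).tolist())

# Same seed -> same features on every run.
assert pick_features(42) == pick_features(42)
```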
I was playing with a custom script that takes the first 50K rows from the taxi-one-day.csv dataset and selects 5 specific columns for processing, and this is the output that I get:
Loaded 50000 rows. Columns:
0: pickup_longitude (float64)
1: pickup_latitude (float64)
2: fare_amount (float64)
3: rate_code (int64)
4: passenger_count (int64)
Fitting the synthesizer over the data...
Column clusters:
Initial= [2, 4]
Derived= (SHARED, [2], [0])
Derived= (SHARED, [2], [1])
Derived= (SHARED, [2], [3])
Notice how the initial cluster only has 2 columns in it.
The F# implementation produces this output:
=== Columns ===
0 pickup_longitude (RealType); Entropy = 11.828231226413235
1 pickup_latitude (RealType); Entropy = 11.799770332830937
2 fare_amount (RealType); Entropy = 5.4025513988026965
3 rate_code (IntegerType); Entropy = 0.2387340586221852
4 passenger_count (IntegerType); Entropy = 0.6315105894728646
Assigning clusters...
Clusters: { InitialCluster = [|0; 1; 2|]
DerivedClusters = [(Shared, [|2|], [|3|]); (Shared, [|2|], [|4|])] }.
Once we have an interface for clustering strategies, one should be able to tweak the default behavior of ML feature selection, which currently uses a Decision Tree Classifier/Regressor.
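A sketch of what that default looks like with scikit-learn (the data and estimator configuration here are illustrative, not the actual syndiffix code); note that fixing `random_state` also makes the selection reproducible:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=300)  # column 2 is informative

# With a fixed random_state, two fits rank features identically.
model = DecisionTreeRegressor(random_state=0).fit(X, y)
best = int(np.argmax(model.feature_importances_))
assert best == 2
```

A pluggable strategy interface would let users swap in a different estimator (e.g. a random forest) without touching the rest of the pipeline.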
Since we don't expose a CLI, they serve no purpose.
>>> Synthesizer(pandas.DataFrame(numpy.ones((2,200)))).sample()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python310\lib\site-packages\syndiffix\synthesizer.py", line 99, in sample
rows, root_combination = build_table(
File "C:\Python310\lib\site-packages\syndiffix\clustering\stitching.py", line 381, in build_table
acc = _stitch(materialize_tree, forest, metadata, acc, derived_cluster)
File "C:\Python310\lib\site-packages\syndiffix\clustering\stitching.py", line 372, in _stitch
return _do_stitch(forest, metadata, left, right, derived_cluster)
File "C:\Python310\lib\site-packages\syndiffix\clustering\stitching.py", line 300, in _do_stitch
raise ValueError(f"Empty sequence in cluster {right_combination}.")
ValueError: Empty sequence in cluster (0, 2).
Do we have anything that configures the salt?
It seems to me that we should automatically create a good salt the first time syndiffix-py is run.
I see that the typical usage for obtaining the seed is like this:
noise = _generate_noise(anon_params.salt, "noise", noise_sd, (context.bucket_seed, aid_seed))
What we could do instead is to use a get_salt() routine instead of anon_params.salt, which always checks whether the salt is set to something other than the default and, if not, sets it to a cryptographically strong value. Here is a library for that:
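One stdlib option (an assumption on my part; the suggested library isn't named above) is Python's secrets module. A sketch of the proposed get_salt() routine, with hypothetical names:

```python
import secrets

DEFAULT_SALT = b""  # assumed default; the real sentinel may differ

def get_salt(current: bytes = DEFAULT_SALT) -> bytes:
    """Return the configured salt, or generate a strong one if unset."""
    if current != DEFAULT_SALT:
        return current
    return secrets.token_bytes(32)  # 256 bits from the OS CSPRNG

salt = get_salt()
assert len(salt) == 32 and salt != DEFAULT_SALT
```

For the "create a good salt the first time syndiffix-py is run" behavior, the generated value would also need to be persisted so subsequent runs reuse it.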
The current implementation makes xxx*yyy-type strings when the original string needs to be suppressed (where xxx is a possibly-null prefix, and yyy is a number).
The primary reason for this is to avoid releasing any strings that fail LCF. The problem is that we are too aggressive about this, and suppress strings that strictly speaking don't need to be suppressed. This happens because we partition a given column's values at 2dim and above.
What we should do instead is to use the 1dim tree to determine the set of valid strings, and then at Ndim choose strings from that set.
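The proposed rule could be illustrated like this (helper names and the suppressed-token format are assumptions for the sketch): the 1dim pass fixes the vocabulary, and higher-dim trees may only emit strings from it.

```python
# Vocabulary that survived LCF in the (hypothetical) 1dim tree.
valid_strings = {"Current", "Older"}

def choose_string(candidate: str) -> str:
    """At Ndim, emit the candidate only if the 1dim tree approved it;
    otherwise fall back to a suppressed token."""
    return candidate if candidate in valid_strings else "*0"

assert choose_string("Current") == "Current"
assert choose_string("Rare") == "*0"
```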
There are different ways to build clusters:
This error happens when running with an AID column. The specific case here is for the taxi table. The AID column is hack. I ran with 10000 rows. I tried several different dataframes. Two had one column (med and passenger_count). And one had two columns (start and end datetime). All failed like this.
Fitting the synthesizer over the data...
c:\paul\GitHub\syndiffix-py\syndiffix\forest.py:62: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
self.aid_data: npt.NDArray[np.uint64] = aids.applymap(hash_aid).to_numpy(Hash)
Traceback (most recent call last):
File "C:\paul\GitHub\misc\python\syndiffix-py-play\testSynDiffixPy.py", line 120, in <module>
tsd = testSynDiffixPy(df, csvFile, ['med','hack'], output_dir, aidsColumns=aidsColumns)
File "C:\paul\GitHub\misc\python\syndiffix-py-play\testSynDiffixPy.py", line 55, in __init__
synthesizer = Synthesizer(self.dfOrig, aids=aids)
File "c:\paul\GitHub\syndiffix-py\syndiffix\synthesizer.py", line 46, in __init__
self.forest = Forest(
File "c:\paul\GitHub\syndiffix-py\syndiffix\forest.py", line 75, in __init__
tree = self._build_tree(combination).push_down_1dim_root()
File "c:\paul\GitHub\syndiffix-py\syndiffix\forest.py", line 120, in _build_tree
tree = tree.add_row(0, RowId(index))
File "c:\paul\GitHub\syndiffix-py\syndiffix\tree.py", line 221, in add_row
self._create_child_leaf(child_index, row) if child is None else child.add_row(depth + 1, row)
File "c:\paul\GitHub\syndiffix-py\syndiffix\tree.py", line 223, in add_row
self.update_aids(row)
File "c:\paul\GitHub\syndiffix-py\syndiffix\tree.py", line 50, in update_aids
self.entity_counter.add(self.context.aid_data[row])
File "c:\paul\GitHub\syndiffix-py\syndiffix\counters.py", line 43, in add
self.aid_sets[i].add(aid)
File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pybloom_live\pybloom.py", line 141, in add
raise IndexError("BloomFilter is at capacity")
IndexError: BloomFilter is at capacity
PS C:\paul\GitHub\misc\python\syndiffix-py-play>
Make a shortcut syntax like this:
df_synth = Synthesizer(df_orig, target_column='col1')
to accomplish this:
df_synth = Synthesizer(df_orig, clustering=MLClustering(target_column='col1'))
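A hypothetical sketch of how the shortcut could desugar inside the constructor (the resolve logic and stub class here are assumptions, not the real syndiffix code):

```python
class MLClustering:
    """Stub standing in for the real clustering strategy class."""
    def __init__(self, target_column: str):
        self.target_column = target_column

def resolve_clustering(clustering=None, target_column=None):
    # target_column='col1' is sugar for clustering=MLClustering(target_column='col1')
    if target_column is not None:
        if clustering is not None:
            raise ValueError("pass either target_column or clustering, not both")
        return MLClustering(target_column=target_column)
    return clustering

c = resolve_clustering(target_column="col1")
assert isinstance(c, MLClustering) and c.target_column == "col1"
```

Rejecting the case where both keywords are given avoids silently ignoring one of them.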
There should be some existing library that supports a dependence measure.
We might be able to use scikit-learn's F-tests.
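For example (an option to evaluate, not the current implementation), scikit-learn's f_regression scores each column's linear dependence on a target, so a strongly dependent column gets a much higher F statistic than an independent one:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
x = rng.normal(size=500)
dependent = 2.0 * x + rng.normal(scale=0.1, size=500)   # strongly related to x
independent = rng.normal(size=500)                       # unrelated to x

F, _ = f_regression(np.column_stack([dependent, independent]), x)
assert F[0] > F[1]  # the dependent column scores far higher
```

Note that F-tests only capture linear dependence; mutual-information scorers would be needed for nonlinear relationships.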
I need to know whether a column is integral or real. The forest carries no such data. I added orig_data, but that one is still converted to floats.
I might also need this for ML encoding of text columns with random shuffling.
We need to provide various transformers for converting data values to floats.
For example, in order to:
Here are some examples:
Loading data from `..\test\intrusion.csv`...
Loaded 494021 rows. Columns:
0: Unnamed: 0 (int64)
1: duration (int64)
2: protocol_type (object)
3: service (object)
4: flag (object)
5: src_bytes (int64)
6: dst_bytes (int64)
7: land (int64)
8: wrong_fragment (int64)
9: urgent (int64)
10: hot (int64)
11: num_failed_logins (int64)
12: logged_in (int64)
13: num_compromised (int64)
14: root_shell (int64)
15: su_attempted (int64)
16: num_root (int64)
17: num_file_creations (int64)
18: num_shells (int64)
19: num_access_files (int64)
20: is_host_login (int64)
21: is_guest_login (int64)
22: count (int64)
23: srv_count (int64)
24: serror_rate (int64)
25: srv_serror_rate (int64)
26: rerror_rate (int64)
27: srv_rerror_rate (int64)
28: same_srv_rate (int64)
29: diff_srv_rate (int64)
30: srv_diff_host_rate (int64)
31: dst_host_count (int64)
32: dst_host_srv_count (int64)
33: dst_host_same_srv_rate (int64)
34: dst_host_diff_srv_rate (int64)
35: dst_host_same_src_port_rate (int64)
36: dst_host_srv_diff_host_rate (int64)
37: dst_host_serror_rate (int64)
38: dst_host_srv_serror_rate (int64)
39: dst_host_rerror_rate (int64)
40: dst_host_srv_rerror_rate (int64)
41: label (object)
Fitting the synthesizer over the data...
Column clusters:
Initial= [9]
Derived= (SHARED, [9], [33, 4, 24, 26, 28])
Derived= (SHARED, [4], [7])
Derived= (SHARED, [4], [19])
Derived= (SHARED, [4], [15])
Derived= (SHARED, [4], [34])
Derived= (SHARED, [4], [17])
Derived= (SHARED, [33, 4, 28], [32, 1, 35, 38])
Derived= (SHARED, [4], [10])
Derived= (SHARED, [4], [20])
Derived= (SHARED, [32, 28, 4], [40, 2, 27, 31])
Derived= (SHARED, [4], [14])
Derived= (SHARED, [4], [36])
Derived= (SHARED, [1, 2, 35], [41, 3, 12, 6])
Derived= (SHARED, [4], [16])
Derived= (SHARED, [4], [11])
Derived= (SHARED, [26, 27, 4], [39])
Derived= (SHARED, [4], [29])
Derived= (SHARED, [33, 35, 28], [5, 22, 23])
Derived= (SHARED, [4], [21])
Derived= (SHARED, [22], [30])
Derived= (SHARED, [4], [8])
Derived= (SHARED, [32, 22], [0])
Derived= (SHARED, [24, 4, 38], [25, 37])
Derived= (SHARED, [4], [18])
Derived= (SHARED, [4], [13])
Loading data from `..\test\insurance.csv`...
Loaded 14000 rows. Columns:
0: GoodStudent (bool)
1: Age (object)
2: SocioEcon (object)
3: RiskAversion (object)
4: VehicleYear (object)
5: ThisCarDam (object)
6: RuggedAuto (object)
7: Accident (object)
8: MakeModel (object)
9: DrivQuality (object)
10: Mileage (object)
11: Antilock (bool)
12: DrivingSkill (object)
13: SeniorTrain (bool)
14: ThisCarCost (object)
15: Theft (bool)
16: CarValue (object)
17: HomeBase (object)
18: AntiTheft (bool)
19: PropCost (object)
20: OtherCarCost (object)
21: OtherCar (bool)
22: MedCost (object)
23: Cushioning (object)
24: Airbag (bool)
25: ILiCost (object)
26: DrivHist (object)
Fitting the synthesizer over the data...
Column clusters:
Initial= [15]
Derived= (SHARED, [15], [4, 14, 19, 20, 24])
Derived= (SHARED, [24, 4, 14], [16, 10])
Derived= (SHARED, [19, 20, 14], [9, 26, 5, 7])
Derived= (SHARED, [24, 16, 4], [11, 17, 18, 2])
Derived= (SHARED, [19, 14], [25])
Derived= (SHARED, [9, 18, 26], [1, 3, 13])
Derived= (SHARED, [1], [0])
Derived= (SHARED, [24, 17, 2], [8, 6, 23])
Derived= (SHARED, [9, 26, 7], [12])
Derived= (SHARED, [17, 2, 4], [21])
Derived= (SHARED, [19, 20, 14], [22])
Once we have a Forest class and instance available: #16 (comment)
Please share how you envision initiating synthesis from the top level.
What we need to be configurable:
Initially an alpha version.
When running slurm, I notice that in a lot of files some columns are almost completely suppressed. An example here is insurance.csv:
GoodStudent,Age,SocioEcon,RiskAversion,VehicleYear,ThisCarDam,RuggedAuto,Accident,MakeModel,DrivQuality,Mileage,Antilock,DrivingSkill,SeniorTrain,ThisCarCost,Theft,CarValue,HomeBase,AntiTheft,PropCost,OtherCarCost,OtherCar,MedCost,Cushioning,Airbag,ILiCost,DrivHist
False,*0,*0,*0,Current,*3,*1,*3,*1,*1,*2,False,*1,False,T*2,False,*3,*1,True,*1,T*3,True,*0,*0,True,T*3,*0
False,*0,*0,*0,Current,*0,*1,*2,*0,Poor,*0,False,*0,False,HundredThou,False,TwentyThou,*1,False,*0,T*3,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Older,*2,Tank,*2,*0,*1,*2,False,*0,False,T*3,False,*2,*1,False,*0,T*3,False,*0,*0,True,T*3,*0
False,*0,*0,*0,Older,*2,*1,*2,*0,Poor,*1,False,SubStandard,False,HundredThou,False,TwentyThou,S*2,True,*1,T*2,True,*0,*0,False,T*2,*0
False,*0,*0,*0,Current,*0,*1,Severe,*1,*0,*0,False,*0,False,T*3,False,*2,*0,False,*0,T*3,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*0,*0,*2,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,S*3,False,*0,HundredThou,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*1,*0,*0,*0,Poor,*0,False,SubStandard,False,T*3,False,TwentyThou,*0,True,*0,T*2,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,None,*0,None,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,S*2,True,*0,T*2,True,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*2,*1,*2,*1,Poor,*1,False,SubStandard,False,T*2,False,*1,*1,False,*1,T*2,False,*0,*0,False,T*2,*0
False,*0,*0,*0,Current,*0,*0,*1,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,*0,False,*1,T*3,True,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*2,Tank,Severe,*2,*0,*2,False,*0,False,Thousand,False,*1,*0,False,*1,HundredThou,True,*0,*0,True,T*3,*0
False,*0,*0,*0,Current,*2,*1,*2,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,*1,False,*0,T*2,False,*0,*0,True,T*2,*0
...
We might want to catch any potential I/O errors and re-raise with a helpful message explaining that the user didn't provide a salt and we failed to produce a persisted one.
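A hypothetical sketch of that wrapper (the function name and message are assumptions; the real persistence logic may differ):

```python
import secrets
from pathlib import Path

def load_or_create_salt(path: Path) -> bytes:
    """Load a persisted salt, creating and persisting one if absent;
    re-raise I/O failures with a message explaining the context."""
    try:
        if path.exists():
            return path.read_bytes()
        salt = secrets.token_bytes(32)
        path.write_bytes(salt)
        return salt
    except OSError as e:
        raise OSError(
            f"No salt was provided, and persisting an auto-generated salt to "
            f"{path} failed. Pass an explicit salt or make the path writable."
        ) from e
```

Chaining with `from e` preserves the original I/O error for debugging while surfacing the actionable explanation.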
File is expedia_hotel_logs, column date_time, I saw this result:
log_id date_time ... hotel_market hotel_cluster
0 LOG_0076*532 2014-01-*86 ... 29 25
1 LOG_000*16 2014-0*393 ... 366 22
2 LOG_008*587 2014-08-1*424 ... 191 25
3 LOG_007*525 2014-07-1*265 ... 633 70
4 LOG_007*499 2014-07-1*291 ... 24 15
The suppression pattern indicates it's being parsed as a string.
Because of scipy, we are limited to supporting Python versions >= 3.10 and < 3.13.
We should fix dependencies to allow any version of Python >= 3.10.
Not sure what all should be here, but at least a simple home page with a few tabs like "home", "contact", and "download".
Model sdx_py
Using source file census.csv
Model sdx_py for dataset /INS/syndiffix/work/edon/results/sdx_py/csv/train/census.csv, focus column detailed household summary in household
Training dataframe shape (before features) (209499, 41)
['age', 'class of worker', 'detailed industry recode', 'detailed occupation recode', 'education', 'wage per hour', 'enroll in edu inst last wk', 'marital stat', 'major industry code', 'major occupation code', 'race', 'hispanic origin', 'sex', 'member of a labor union', 'reason for unemployment', 'full or part time employment stat', 'capital gains', 'capital losses', 'dividends from stocks', 'tax filer stat', 'region of previous residence', 'state of previous residence', 'detailed household and family stat', 'detailed household summary in household', 'migration code-change in msa', 'migration code-change in reg', 'migration code-move within reg', 'live in this house 1 year ago', 'migration prev res in sunbelt', 'num persons worked for employer', 'family members under 18', 'country of birth father', 'country of birth mother', 'country of birth self', 'citizenship', 'own business or self employed', "fill inc questionnaire for veteran's admin", 'veterans benefits', 'weeks worked in year', 'year', 'label']
Columns ['age', 'class of worker', 'detailed industry recode', 'detailed occupation recode', 'education', 'wage per hour', 'enroll in edu inst last wk', 'marital stat', 'major industry code', 'major occupation code', 'race', 'hispanic origin', 'sex', 'member of a labor union', 'reason for unemployment', 'full or part time employment stat', 'capital gains', 'capital losses', 'dividends from stocks', 'tax filer stat', 'region of previous residence', 'state of previous residence', 'detailed household and family stat', 'detailed household summary in household', 'migration code-change in msa', 'migration code-change in reg', 'migration code-move within reg', 'live in this house 1 year ago', 'migration prev res in sunbelt', 'num persons worked for employer', 'family members under 18', 'country of birth father', 'country of birth mother', 'country of birth self', 'citizenship', 'own business or self employed', "fill inc questionnaire for veteran's admin", 'veterans benefits', 'weeks worked in year', 'year', 'label']
Running with ML target detailed household summary in household...
Column clusters:
Initial= [6, 7, 9, 10, 11, 12, 23]
Derived= (SHARED, [23], [13, 14, 15, 19, 20, 24])
Derived= (SHARED, [23], [25, 26, 27, 28, 29])
Derived= (SHARED, [23], [0, 37, 39, 40, 30])
Derived= (SHARED, [23], [2, 3, 4, 5])
Derived= (SHARED, [23], [8, 16, 17, 18, 21])
Derived= (SHARED, [23], [32, 33, 38, 22, 31])
Derived= (SHARED, [], [1])
Derived= (SHARED, [], [34])
Derived= (SHARED, [], [35])
Derived= (SHARED, [], [36])
Traceback (most recent call last):
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 598, in <module>
fire.Fire(oneModel)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 530, in oneModel
runSynDiffix(df, outPath, focusColumn, doPatches, testData, job)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 136, in runSynDiffix
synData = synthesizer.sample()
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/synthesizer.py", line 70, in sample
rows, root_combination = build_table(
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/clustering/stitching.py", line 377, in build_table
acc = materialize_tree(forest, clusters.initial_cluster)
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/synthesizer.py", line 62, in materialize_tree
generate_microdata(
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 198, in generate_microdata
microdata_rows.extend(
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 148, in _microdata_row_generator
yield [_generate(i, c, nm) for i, c, nm in zip(intervals, convertors, null_mappings)]
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 148, in <listcomp>
yield [_generate(i, c, nm) for i, c, nm in zip(intervals, convertors, null_mappings)]
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 139, in _generate
return convertor.from_interval(interval) if interval.min != null_mapping else (None, null_mapping)
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 122, in from_interval
return self._map_interval(interval)
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 127, in _map_interval
min_value = self.value_map[int(interval.min)]
IndexError: list index out of range
I think it somehow ended up with an empty set of strings.
Make a clustering option called NoClustering that simply makes every column a patch column.
I think they should be part of the ClusteringStrategy sub-classes (with a NoClustering sub-class to disable any clustering).
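A minimal sketch of what such a strategy could look like; the base-class interface and the cluster-tuple shapes here are assumptions, not the real syndiffix types:

```python
from abc import ABC, abstractmethod

class ClusteringStrategy(ABC):
    """Assumed base class; the real interface may differ."""
    @abstractmethod
    def build_clusters(self, num_columns: int) -> tuple[list[int], list[tuple]]:
        ...

class NoClustering(ClusteringStrategy):
    """Disable clustering: one-column initial cluster, all others patched on."""
    def build_clusters(self, num_columns: int) -> tuple[list[int], list[tuple]]:
        initial = [0]
        # each remaining column is synthesized independently as a patch
        derived = [("PATCH", [], [c]) for c in range(1, num_columns)]
        return initial, derived

initial, derived = NoClustering().build_clusters(4)
assert initial == [0]
assert derived == [("PATCH", [], [1]), ("PATCH", [], [2]), ("PATCH", [], [3])]
```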
More a test-syndiffix issue. Looks like datasets with datetimes fail to produce output.
Traceback (most recent call last):
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 599, in <module>
fire.Fire(oneModel)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 531, in oneModel
runSynDiffix(df, outPath, focusColumn, doPatches, testData, job)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 155, in runSynDiffix
json.dump(outJson, f, indent=4)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/__init__.py", line 179, in dump
for chunk in iterable:
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Timestamp is not JSON serializable
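A common fix for this error (a sketch, not necessarily what test-syndiffix should do): pandas.Timestamp subclasses datetime, which json refuses to encode by default, so supplying a default= hook stringifies it instead of raising.

```python
import json
from datetime import datetime

# default=str is called for any object json can't serialize natively,
# which covers datetime and pandas.Timestamp values.
out = json.dumps({"t": datetime(2014, 1, 1, 12, 0)}, default=str)
assert json.loads(out)["t"] == "2014-01-01 12:00:00"
```

Using `.isoformat()` in a custom hook instead of `str` would give a stricter ISO-8601 form if the consumer needs it.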
Stitching and harvesting rely on random state, and we have different RNG algorithms across implementations. This will be trickier to test...