Python implementation of the SynDiffix synthetic data generation mechanism.
License: Other
It occurs to me that we should not include AID columns in the output.
The main reason is that the AID columns have no value, but the user might not know this.
In particular, we don't capture any event information, for instance the distribution of rows over AIDs, or inter-event timing or sequences. But if we include some kind of AID column, then the user might assume that we do capture this stuff, and get bad results.
Until we implement event information, it seems it would be cleaner and clearer just to exclude the AID columns from the output.
Once we have all the individual stages available, we need a top-level module that assembles the full syndiffix pipeline:
I get different features on different runs over the same table/column pair. We must manage the random_state
to be deterministic.
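As a minimal illustration of the fix (the helper name is hypothetical, not the syndiffix API): seeding numpy's `Generator` with a fixed value, or threading an integer `random_state` through to every randomized call, makes repeated runs produce the same features.

```python
import numpy as np

def pick_features(seed: int) -> list[int]:
    """Stand-in for a randomized feature-selection step (hypothetical)."""
    rng = np.random.default_rng(seed)
    return sorted(rng.choice(10, size=3, replace=False).tolist())

# Same seed -> same features on every run.
assert pick_features(42) == pick_features(42)
```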
I was playing with a custom script that takes the first 50K rows from the taxi-one-day.csv dataset and selects 5 specific columns for processing, and this is the output that I get:
Loaded 50000 rows. Columns:
0: pickup_longitude (float64)
1: pickup_latitude (float64)
2: fare_amount (float64)
3: rate_code (int64)
4: passenger_count (int64)
Fitting the synthesizer over the data...
Column clusters:
Initial= [2, 4]
Derived= (SHARED, [2], [0])
Derived= (SHARED, [2], [1])
Derived= (SHARED, [2], [3])
Notice how the initial cluster only has 2 columns in it.
The F# implementation produces this output:
=== Columns ===
0 pickup_longitude (RealType); Entropy = 11.828231226413235
1 pickup_latitude (RealType); Entropy = 11.799770332830937
2 fare_amount (RealType); Entropy = 5.4025513988026965
3 rate_code (IntegerType); Entropy = 0.2387340586221852
4 passenger_count (IntegerType); Entropy = 0.6315105894728646
Assigning clusters...
Clusters: { InitialCluster = [|0; 1; 2|]
DerivedClusters = [(Shared, [|2|], [|3|]); (Shared, [|2|], [|4|])] }.
Once we have an interface for clustering strategies, one should be able to tweak the default behavior of ML feature selection, which currently uses a Decision Tree Classifier/Regressor.
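A sketch of what that default looks like with scikit-learn (the data and estimator configuration here are illustrative, not the actual syndiffix code); note that fixing `random_state` also makes the selection reproducible:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=300)  # column 2 is informative

# With a fixed random_state, two fits rank features identically.
model = DecisionTreeRegressor(random_state=0).fit(X, y)
best = int(np.argmax(model.feature_importances_))
assert best == 2
```

A pluggable strategy interface would let users swap in a different estimator (e.g. a random forest) without touching the rest of the pipeline.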
Since we don't expose a CLI, they serve no purpose.
>>> Synthesizer(pandas.DataFrame(numpy.ones((2,200)))).sample()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python310\lib\site-packages\syndiffix\synthesizer.py", line 99, in sample
rows, root_combination = build_table(
File "C:\Python310\lib\site-packages\syndiffix\clustering\stitching.py", line 381, in build_table
acc = _stitch(materialize_tree, forest, metadata, acc, derived_cluster)
File "C:\Python310\lib\site-packages\syndiffix\clustering\stitching.py", line 372, in _stitch
return _do_stitch(forest, metadata, left, right, derived_cluster)
File "C:\Python310\lib\site-packages\syndiffix\clustering\stitching.py", line 300, in _do_stitch
raise ValueError(f"Empty sequence in cluster {right_combination}.")
ValueError: Empty sequence in cluster (0, 2).
Do we have anything that configures the salt?
It seems to me that we should automatically create a good salt the first time syndiffix-py is run.
I see that the typical usage for obtaining the seed is like this:
noise = _generate_noise(anon_params.salt, "noise", noise_sd, (context.bucket_seed, aid_seed))
What we could do instead is to use a get_salt() routine instead of anon_params.salt, which always checks whether the salt is set to something other than the default and, if not, sets it to a cryptographically strong value. Here is a library for that:
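One stdlib option (an assumption on my part; the suggested library isn't named above) is Python's secrets module. A sketch of the proposed get_salt() routine, with hypothetical names:

```python
import secrets

DEFAULT_SALT = b""  # assumed default; the real sentinel may differ

def get_salt(current: bytes = DEFAULT_SALT) -> bytes:
    """Return the configured salt, or generate a strong one if unset."""
    if current != DEFAULT_SALT:
        return current
    return secrets.token_bytes(32)  # 256 bits from the OS CSPRNG

salt = get_salt()
assert len(salt) == 32 and salt != DEFAULT_SALT
```

For the "create a good salt the first time syndiffix-py is run" behavior, the generated value would also need to be persisted so subsequent runs reuse it.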
The current implementation makes xxx*yyy-type strings when the original string needs to be suppressed (where xxx is a possibly-null prefix, and yyy is a number).
The primary reason for this is to avoid releasing any strings that fail LCF. The problem is that we are too aggressive about this, and suppress strings that strictly speaking don't need to be suppressed. This happens because we partition a given column's values at 2dim and above.
What we should do instead is to use the 1dim tree to determine the set of valid strings, and then at Ndim choose strings from that set.
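The proposed rule could be illustrated like this (helper names and the suppressed-token format are assumptions for the sketch): the 1dim pass fixes the vocabulary, and higher-dim trees may only emit strings from it.

```python
# Vocabulary that survived LCF in the (hypothetical) 1dim tree.
valid_strings = {"Current", "Older"}

def choose_string(candidate: str) -> str:
    """At Ndim, emit the candidate only if the 1dim tree approved it;
    otherwise fall back to a suppressed token."""
    return candidate if candidate in valid_strings else "*0"

assert choose_string("Current") == "Current"
assert choose_string("Rare") == "*0"
```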
There are different ways to build clusters:
This error happens when running with an AID column. The specific case here is for the taxi table. The AID column is hack. I ran with 10000 rows. I tried several different dataframes. Two had one column (med and passenger_count). And one had two columns (start and end datetime). All failed like this.
Fitting the synthesizer over the data...
c:\paul\GitHub\syndiffix-py\syndiffix\forest.py:62: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
self.aid_data: npt.NDArray[np.uint64] = aids.applymap(hash_aid).to_numpy(Hash)
Traceback (most recent call last):
File "C:\paul\GitHub\misc\python\syndiffix-py-play\testSynDiffixPy.py", line 120, in <module>
tsd = testSynDiffixPy(df, csvFile, ['med','hack'], output_dir, aidsColumns=aidsColumns)
File "C:\paul\GitHub\misc\python\syndiffix-py-play\testSynDiffixPy.py", line 55, in __init__
synthesizer = Synthesizer(self.dfOrig, aids=aids)
File "c:\paul\GitHub\syndiffix-py\syndiffix\synthesizer.py", line 46, in __init__
self.forest = Forest(
File "c:\paul\GitHub\syndiffix-py\syndiffix\forest.py", line 75, in __init__
tree = self._build_tree(combination).push_down_1dim_root()
File "c:\paul\GitHub\syndiffix-py\syndiffix\forest.py", line 120, in _build_tree
tree = tree.add_row(0, RowId(index))
File "c:\paul\GitHub\syndiffix-py\syndiffix\tree.py", line 221, in add_row
self._create_child_leaf(child_index, row) if child is None else child.add_row(depth + 1, row)
File "c:\paul\GitHub\syndiffix-py\syndiffix\tree.py", line 223, in add_row
self.update_aids(row)
File "c:\paul\GitHub\syndiffix-py\syndiffix\tree.py", line 50, in update_aids
self.entity_counter.add(self.context.aid_data[row])
File "c:\paul\GitHub\syndiffix-py\syndiffix\counters.py", line 43, in add
self.aid_sets[i].add(aid)
File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pybloom_live\pybloom.py", line 141, in add
raise IndexError("BloomFilter is at capacity")
IndexError: BloomFilter is at capacity
PS C:\paul\GitHub\misc\python\syndiffix-py-play>
Make a shortcut syntax like this:
df_synth = Synthesizer(df_orig, target_column='col1')
to accomplish this:
df_synth = Synthesizer(df_orig, clustering=MLClustering(target_column='col1'))
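A hypothetical sketch of how the shortcut could desugar inside the constructor (the resolve logic and stub class here are assumptions, not the real syndiffix code):

```python
class MLClustering:
    """Stub standing in for the real clustering strategy class."""
    def __init__(self, target_column: str):
        self.target_column = target_column

def resolve_clustering(clustering=None, target_column=None):
    # target_column='col1' is sugar for clustering=MLClustering(target_column='col1')
    if target_column is not None:
        if clustering is not None:
            raise ValueError("pass either target_column or clustering, not both")
        return MLClustering(target_column=target_column)
    return clustering

c = resolve_clustering(target_column="col1")
assert isinstance(c, MLClustering) and c.target_column == "col1"
```

Rejecting the case where both keywords are given avoids silently ignoring one of them.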
There should be some existing library that supports a dependence measure.
We might be able to use scikit-learn's F-tests.
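For example (an option to evaluate, not the current implementation), scikit-learn's f_regression scores each column's linear dependence on a target, so a strongly dependent column gets a much higher F statistic than an independent one:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
x = rng.normal(size=500)
dependent = 2.0 * x + rng.normal(scale=0.1, size=500)   # strongly related to x
independent = rng.normal(size=500)                       # unrelated to x

F, _ = f_regression(np.column_stack([dependent, independent]), x)
assert F[0] > F[1]  # the dependent column scores far higher
```

Note that F-tests only capture linear dependence; mutual-information scorers would be needed for nonlinear relationships.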
I need to know whether a column is integral or real. The forest carries no such data. I added orig_data, but that one is still converted to floats.
I might also need this for ML encoding of text columns with random shuffling.
We need to provide various transformers for converting data values to floats.
For example, in order to:
Here are some examples:
Loading data from `..\test\intrusion.csv`...
Loaded 494021 rows. Columns:
0: Unnamed: 0 (int64)
1: duration (int64)
2: protocol_type (object)
3: service (object)
4: flag (object)
5: src_bytes (int64)
6: dst_bytes (int64)
7: land (int64)
8: wrong_fragment (int64)
9: urgent (int64)
10: hot (int64)
11: num_failed_logins (int64)
12: logged_in (int64)
13: num_compromised (int64)
14: root_shell (int64)
15: su_attempted (int64)
16: num_root (int64)
17: num_file_creations (int64)
18: num_shells (int64)
19: num_access_files (int64)
20: is_host_login (int64)
21: is_guest_login (int64)
22: count (int64)
23: srv_count (int64)
24: serror_rate (int64)
25: srv_serror_rate (int64)
26: rerror_rate (int64)
27: srv_rerror_rate (int64)
28: same_srv_rate (int64)
29: diff_srv_rate (int64)
30: srv_diff_host_rate (int64)
31: dst_host_count (int64)
32: dst_host_srv_count (int64)
33: dst_host_same_srv_rate (int64)
34: dst_host_diff_srv_rate (int64)
35: dst_host_same_src_port_rate (int64)
36: dst_host_srv_diff_host_rate (int64)
37: dst_host_serror_rate (int64)
38: dst_host_srv_serror_rate (int64)
39: dst_host_rerror_rate (int64)
40: dst_host_srv_rerror_rate (int64)
41: label (object)
Fitting the synthesizer over the data...
Column clusters:
Initial= [9]
Derived= (SHARED, [9], [33, 4, 24, 26, 28])
Derived= (SHARED, [4], [7])
Derived= (SHARED, [4], [19])
Derived= (SHARED, [4], [15])
Derived= (SHARED, [4], [34])
Derived= (SHARED, [4], [17])
Derived= (SHARED, [33, 4, 28], [32, 1, 35, 38])
Derived= (SHARED, [4], [10])
Derived= (SHARED, [4], [20])
Derived= (SHARED, [32, 28, 4], [40, 2, 27, 31])
Derived= (SHARED, [4], [14])
Derived= (SHARED, [4], [36])
Derived= (SHARED, [1, 2, 35], [41, 3, 12, 6])
Derived= (SHARED, [4], [16])
Derived= (SHARED, [4], [11])
Derived= (SHARED, [26, 27, 4], [39])
Derived= (SHARED, [4], [29])
Derived= (SHARED, [33, 35, 28], [5, 22, 23])
Derived= (SHARED, [4], [21])
Derived= (SHARED, [22], [30])
Derived= (SHARED, [4], [8])
Derived= (SHARED, [32, 22], [0])
Derived= (SHARED, [24, 4, 38], [25, 37])
Derived= (SHARED, [4], [18])
Derived= (SHARED, [4], [13])
Loading data from `..\test\insurance.csv`...
Loaded 14000 rows. Columns:
0: GoodStudent (bool)
1: Age (object)
2: SocioEcon (object)
3: RiskAversion (object)
4: VehicleYear (object)
5: ThisCarDam (object)
6: RuggedAuto (object)
7: Accident (object)
8: MakeModel (object)
9: DrivQuality (object)
10: Mileage (object)
11: Antilock (bool)
12: DrivingSkill (object)
13: SeniorTrain (bool)
14: ThisCarCost (object)
15: Theft (bool)
16: CarValue (object)
17: HomeBase (object)
18: AntiTheft (bool)
19: PropCost (object)
20: OtherCarCost (object)
21: OtherCar (bool)
22: MedCost (object)
23: Cushioning (object)
24: Airbag (bool)
25: ILiCost (object)
26: DrivHist (object)
Fitting the synthesizer over the data...
Column clusters:
Initial= [15]
Derived= (SHARED, [15], [4, 14, 19, 20, 24])
Derived= (SHARED, [24, 4, 14], [16, 10])
Derived= (SHARED, [19, 20, 14], [9, 26, 5, 7])
Derived= (SHARED, [24, 16, 4], [11, 17, 18, 2])
Derived= (SHARED, [19, 14], [25])
Derived= (SHARED, [9, 18, 26], [1, 3, 13])
Derived= (SHARED, [1], [0])
Derived= (SHARED, [24, 17, 2], [8, 6, 23])
Derived= (SHARED, [9, 26, 7], [12])
Derived= (SHARED, [17, 2, 4], [21])
Derived= (SHARED, [19, 20, 14], [22])
Once we have a Forest class and instance available: #16 (comment)
Please share how you envision initiating synthesis from the top level.
What we need to be configurable:
Initially an alpha version.
When running slurm, I notice that in a lot of files some columns are almost completely suppressed. An example here is insurance.csv:
GoodStudent,Age,SocioEcon,RiskAversion,VehicleYear,ThisCarDam,RuggedAuto,Accident,MakeModel,DrivQuality,Mileage,Antilock,DrivingSkill,SeniorTrain,ThisCarCost,Theft,CarValue,HomeBase,AntiTheft,PropCost,OtherCarCost,OtherCar,MedCost,Cushioning,Airbag,ILiCost,DrivHist
False,*0,*0,*0,Current,*3,*1,*3,*1,*1,*2,False,*1,False,T*2,False,*3,*1,True,*1,T*3,True,*0,*0,True,T*3,*0
False,*0,*0,*0,Current,*0,*1,*2,*0,Poor,*0,False,*0,False,HundredThou,False,TwentyThou,*1,False,*0,T*3,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Older,*2,Tank,*2,*0,*1,*2,False,*0,False,T*3,False,*2,*1,False,*0,T*3,False,*0,*0,True,T*3,*0
False,*0,*0,*0,Older,*2,*1,*2,*0,Poor,*1,False,SubStandard,False,HundredThou,False,TwentyThou,S*2,True,*1,T*2,True,*0,*0,False,T*2,*0
False,*0,*0,*0,Current,*0,*1,Severe,*1,*0,*0,False,*0,False,T*3,False,*2,*0,False,*0,T*3,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*0,*0,*2,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,S*3,False,*0,HundredThou,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*1,*0,*0,*0,Poor,*0,False,SubStandard,False,T*3,False,TwentyThou,*0,True,*0,T*2,False,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,None,*0,None,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,S*2,True,*0,T*2,True,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*2,*1,*2,*1,Poor,*1,False,SubStandard,False,T*2,False,*1,*1,False,*1,T*2,False,*0,*0,False,T*2,*0
False,*0,*0,*0,Current,*0,*0,*1,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,*0,False,*1,T*3,True,*0,*0,True,T*2,*0
False,*0,*0,*0,Current,*2,Tank,Severe,*2,*0,*2,False,*0,False,Thousand,False,*1,*0,False,*1,HundredThou,True,*0,*0,True,T*3,*0
False,*0,*0,*0,Current,*2,*1,*2,*0,Poor,*0,False,SubStandard,False,HundredThou,False,TwentyThou,*1,False,*0,T*2,False,*0,*0,True,T*2,*0
...
We might want to catch any potential I/O errors and re-raise with a helpful message explaining that the user didn't provide a salt and we failed to produce a persisted one.
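A hypothetical sketch of that wrapper (the function name and message are assumptions; the real persistence logic may differ):

```python
import secrets
from pathlib import Path

def load_or_create_salt(path: Path) -> bytes:
    """Load a persisted salt, creating and persisting one if absent;
    re-raise I/O failures with a message explaining the context."""
    try:
        if path.exists():
            return path.read_bytes()
        salt = secrets.token_bytes(32)
        path.write_bytes(salt)
        return salt
    except OSError as e:
        raise OSError(
            f"No salt was provided, and persisting an auto-generated salt to "
            f"{path} failed. Pass an explicit salt or make the path writable."
        ) from e
```

Chaining with `from e` preserves the original I/O error for debugging while surfacing the actionable explanation.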
File is expedia_hotel_logs, column date_time, I saw this result:
log_id date_time ... hotel_market hotel_cluster
0 LOG_0076*532 2014-01-*86 ... 29 25
1 LOG_000*16 2014-0*393 ... 366 22
2 LOG_008*587 2014-08-1*424 ... 191 25
3 LOG_007*525 2014-07-1*265 ... 633 70
4 LOG_007*499 2014-07-1*291 ... 24 15
The suppression pattern indicates it's being parsed as a string.
Because of scipy, we are limited to supporting Python versions >= 3.10 and < 3.13.
We should fix dependencies to allow any version of Python >= 3.10.
Not sure what all should be here, but at least a simple home page with a few tabs like "home", "contact", and "download".
Model sdx_py
Using source file census.csv
Model sdx_py for dataset /INS/syndiffix/work/edon/results/sdx_py/csv/train/census.csv, focus column detailed household summary in household
Training dataframe shape (before features) (209499, 41)
['age', 'class of worker', 'detailed industry recode', 'detailed occupation recode', 'education', 'wage per hour', 'enroll in edu inst last wk', 'marital stat', 'major industry code', 'major occupation code', 'race', 'hispanic origin', 'sex', 'member of a labor union', 'reason for unemployment', 'full or part time employment stat', 'capital gains', 'capital losses', 'dividends from stocks', 'tax filer stat', 'region of previous residence', 'state of previous residence', 'detailed household and family stat', 'detailed household summary in household', 'migration code-change in msa', 'migration code-change in reg', 'migration code-move within reg', 'live in this house 1 year ago', 'migration prev res in sunbelt', 'num persons worked for employer', 'family members under 18', 'country of birth father', 'country of birth mother', 'country of birth self', 'citizenship', 'own business or self employed', "fill inc questionnaire for veteran's admin", 'veterans benefits', 'weeks worked in year', 'year', 'label']
Columns ['age', 'class of worker', 'detailed industry recode', 'detailed occupation recode', 'education', 'wage per hour', 'enroll in edu inst last wk', 'marital stat', 'major industry code', 'major occupation code', 'race', 'hispanic origin', 'sex', 'member of a labor union', 'reason for unemployment', 'full or part time employment stat', 'capital gains', 'capital losses', 'dividends from stocks', 'tax filer stat', 'region of previous residence', 'state of previous residence', 'detailed household and family stat', 'detailed household summary in household', 'migration code-change in msa', 'migration code-change in reg', 'migration code-move within reg', 'live in this house 1 year ago', 'migration prev res in sunbelt', 'num persons worked for employer', 'family members under 18', 'country of birth father', 'country of birth mother', 'country of birth self', 'citizenship', 'own business or self employed', "fill inc questionnaire for veteran's admin", 'veterans benefits', 'weeks worked in year', 'year', 'label']
Running with ML target detailed household summary in household...
Column clusters:
Initial= [6, 7, 9, 10, 11, 12, 23]
Derived= (SHARED, [23], [13, 14, 15, 19, 20, 24])
Derived= (SHARED, [23], [25, 26, 27, 28, 29])
Derived= (SHARED, [23], [0, 37, 39, 40, 30])
Derived= (SHARED, [23], [2, 3, 4, 5])
Derived= (SHARED, [23], [8, 16, 17, 18, 21])
Derived= (SHARED, [23], [32, 33, 38, 22, 31])
Derived= (SHARED, [], [1])
Derived= (SHARED, [], [34])
Derived= (SHARED, [], [35])
Derived= (SHARED, [], [36])
Traceback (most recent call last):
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 598, in <module>
fire.Fire(oneModel)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 530, in oneModel
runSynDiffix(df, outPath, focusColumn, doPatches, testData, job)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 136, in runSynDiffix
synData = synthesizer.sample()
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/synthesizer.py", line 70, in sample
rows, root_combination = build_table(
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/clustering/stitching.py", line 377, in build_table
acc = materialize_tree(forest, clusters.initial_cluster)
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/synthesizer.py", line 62, in materialize_tree
generate_microdata(
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 198, in generate_microdata
microdata_rows.extend(
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 148, in _microdata_row_generator
yield [_generate(i, c, nm) for i, c, nm in zip(intervals, convertors, null_mappings)]
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 148, in <listcomp>
yield [_generate(i, c, nm) for i, c, nm in zip(intervals, convertors, null_mappings)]
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 139, in _generate
return convertor.from_interval(interval) if interval.min != null_mapping else (None, null_mapping)
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 122, in from_interval
return self._map_interval(interval)
File "/INS/syndiffix/work/edon/syndiffix-py/syndiffix/microdata.py", line 127, in _map_interval
min_value = self.value_map[int(interval.min)]
IndexError: list index out of range
I think it somehow ended up with an empty set of strings.
Make a clustering option called NoClustering that simply makes every column a patch column.
I think they should be part of the ClusteringStrategy sub-classes (with a NoClustering sub-class to disable any clustering).
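A minimal sketch of what such a strategy could look like; the base-class interface and the cluster-tuple shapes here are assumptions, not the real syndiffix types:

```python
from abc import ABC, abstractmethod

class ClusteringStrategy(ABC):
    """Assumed base class; the real interface may differ."""
    @abstractmethod
    def build_clusters(self, num_columns: int) -> tuple[list[int], list[tuple]]:
        ...

class NoClustering(ClusteringStrategy):
    """Disable clustering: one-column initial cluster, all others patched on."""
    def build_clusters(self, num_columns: int) -> tuple[list[int], list[tuple]]:
        initial = [0]
        # each remaining column is synthesized independently as a patch
        derived = [("PATCH", [], [c]) for c in range(1, num_columns)]
        return initial, derived

initial, derived = NoClustering().build_clusters(4)
assert initial == [0]
assert derived == [("PATCH", [], [1]), ("PATCH", [], [2]), ("PATCH", [], [3])]
```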
More a test-syndiffix issue. Looks like datasets with datetimes fail to produce output.
Traceback (most recent call last):
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 599, in <module>
fire.Fire(oneModel)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 531, in oneModel
runSynDiffix(df, outPath, focusColumn, doPatches, testData, job)
File "/INS/syndiffix/work/edon/test-syndiffix/test_syndiffix/oneModel.py", line 155, in runSynDiffix
json.dump(outJson, f, indent=4)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/__init__.py", line 179, in dump
for chunk in iterable:
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/egashi/.asdf/installs/python/3.10.13/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Timestamp is not JSON serializable
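A common fix for this error (a sketch, not necessarily what test-syndiffix should do): pandas.Timestamp subclasses datetime, which json refuses to encode by default, so supplying a default= hook stringifies it instead of raising.

```python
import json
from datetime import datetime

# default=str is called for any object json can't serialize natively,
# which covers datetime and pandas.Timestamp values.
out = json.dumps({"t": datetime(2014, 1, 1, 12, 0)}, default=str)
assert json.loads(out)["t"] == "2014-01-01 12:00:00"
```

Using `.isoformat()` in a custom hook instead of `str` would give a stricter ISO-8601 form if the consumer needs it.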
Stitching and harvesting rely on random state, and we have different RNG algorithms across implementations. This will be trickier to test...