hitsz-ids / synthetic-data-generator
SDG is a specialized framework designed to generate high-quality structured tabular data.
License: Apache License 2.0
I have searched for issues similar to this one.
Since we are refactoring, the contents of the two .py files currently in the example/
directory are outdated and cannot be run.
In this Issue, please rewrite valid examples based on the ideas in the existing example scripts.
Update the code in these two example scripts:
TBD
Currently, sdgx lacks a complete data processor, which performs the necessary pre-processing of input data and post-processing of generated data.
Implement a data processor that supports pre-processing and post-processing.
When this component is initially introduced, it implements the following two functions:
The implemented data processor should support sdgx's DataLoader access method.
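A minimal sketch of what such an interface could look like; the class and method names here are illustrative assumptions, not the final sdgx API:

# Hypothetical data processor interface: pre-process input data, post-process generated data.
from abc import ABC, abstractmethod
import pandas as pd

class DataProcessor(ABC):
    @abstractmethod
    def pre_process(self, raw_df: pd.DataFrame) -> pd.DataFrame:
        """Prepare input data before it is fed to a model (e.g. drop/encode columns)."""

    @abstractmethod
    def post_process(self, synthetic_df: pd.DataFrame) -> pd.DataFrame:
        """Restore or clean generated data after sampling (e.g. decode columns)."""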
TBD
Please add the code of the base class to metrics/base.py, and use Python 3.
Please try to define enough methods in the base class to avoid repeated additions when implementing metrics.
sdgx/metrics/base.py
- Remove the pass statement from the file.
- Import ABC and abstractmethod from the abc module at the top of the file.
- Create a new class named BaseMetric that inherits from ABC.
- Inside the BaseMetric class, define an __init__ method that takes two parameters: real_data and synthetic_data. These parameters should be stored as instance variables.
- Still inside the BaseMetric class, define an abstract method named calculate that takes no parameters. This method will be used to calculate the metric and should be implemented in each subclass.
- Still inside the BaseMetric class, define a method named validate_datasets that takes no parameters. This method should check if the real_data and synthetic_data instance variables are valid datasets. For now, this method can simply pass.
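A sketch that follows the steps above; treat it as a starting point rather than the final sdgx/metrics/base.py:

from abc import ABC, abstractmethod

class BaseMetric(ABC):
    def __init__(self, real_data, synthetic_data):
        # store both datasets as instance variables
        self.real_data = real_data
        self.synthetic_data = synthetic_data

    @abstractmethod
    def calculate(self):
        """Calculate the metric; implemented by each subclass."""

    def validate_datasets(self):
        # Check that real_data and synthetic_data are valid datasets.
        # For now this can simply pass, as described above.
        pass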
I have searched for issues similar to this one.
In the current implementation of CTGAN, all training data will be loaded into memory at once, which will greatly increase memory consumption.
Thanks.
When using CSV files to store training data, the current implementation uses a pd.DataFrame to load all the data at once. I plan to load batch data in CTGAN on demand, loading the next batch only after the current one has been used.
There will be no parameter changes to the current implementation.
This optimized implementation will not change the existing example code.
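A rough illustration of the intended direction, assuming CSV input and pandas chunked reading (the actual design is still pending):

import pandas as pd

def iter_batches(csv_path, batch_size=500):
    # pd.read_csv with chunksize returns an iterator of DataFrame chunks,
    # so only one batch is held in memory at a time.
    for chunk in pd.read_csv(csv_path, chunksize=batch_size):
        yield chunk

# for batch in iter_batches("train.csv"):
#     ...  # feed `batch` to one CTGAN training step, then discard it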
Waiting for design:
Please pay attention to the LICENSE of any referenced implementation/code. We accept MIT, BSD, Apache 2.0, and other non-contagious, commercially friendly licenses, or licenses compatible with Apache 2.0.
I have searched for issues similar to this one.
Currently, demo data can only be obtained through the function sdgx.utils.io.csv_utils.get_demo_single_table, and only the adult dataset is supported. In this issue, please implement a more systematic demo data management module.
We recommend stripping this module out of the script sdgx/utils/io/csv_utils.py and implementing a separate script in the sdgx/utils/io/ directory.
We recommend creating a file demo_data.py and implementing the functions or class in this file.
We provide a class example for your reference:
# A DemoData example
import pandas as pd

class DemoData(object):
    def __init__(self, dataset_name) -> None:
        # the dataset name should be checked
        pass

    def get_data(self, offline_path=None) -> pd.DataFrame:
        # if offline_path is not None,
        # read data from the input path
        pass

    def download_data(self) -> None:
        # download the dataset to a local cache
        pass
Some operations that enhance user experience are also worthwhile, such as:
Known blocking issues:
I have searched for issues similar to this one.
Implement OCT-GAN model for single table synthetic data generation.
TBD
TBD
TBD
When I use this tool to synthesize multi-table databases, I cannot evaluate the differences in correlation between the generated data and real data across tables.
Some commonly used statistical metrics, such as mutual information, can be introduced, as follows:
Mutual information can be used to measure the correlation between each pair of columns in two tables.
Let X and Y be two datasets with the same number of discrete columns/features (m), where one is the original dataset, and the other is a simulated dataset.
For each dataset, we calculate the normalized mutual information between each pair of columns.
The normalized mutual information between random variables X and Y is defined as

\mathrm{NMI}(X, Y) = \frac{I(X; Y)}{\sqrt{H(X)\, H(Y)}}

Since I(X; Y) \le \min(H(X), H(Y)) \le \sqrt{H(X)\, H(Y)}, we can get \mathrm{NMI}(X, Y) \le 1. That is, 0 \le \mathrm{NMI}(X, Y) \le 1.

The element at the i-th row and j-th column of the pairwise mutual information matrix M is

M_{ij} = \mathrm{NMI}(X_i, X_j)

where X_i and X_j are the i-th and j-th columns of X.

Let M and N be the pairwise mutual information matrices for X and Y. Then, mutual information similarity is defined as

\mathrm{MISim}(X, Y) = J(M, N)

where J is the Jaccard index defined as

J(M, N) = \frac{\sum_{i,j} \min(M_{ij}, N_{ij})}{\sum_{i,j} \max(M_{ij}, N_{ij})}
This metric is bounded by 0 and 1 with 1 being the maximal and best value.
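A possible sketch of this metric using scikit-learn's normalized_mutual_info_score; the pairwise-matrix construction and the weighted-Jaccard aggregation follow the description above and are assumptions, not existing sdgx code:

import numpy as np
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def pairwise_nmi(df: pd.DataFrame) -> np.ndarray:
    # NMI between every pair of (discrete) columns of one table
    cols = list(df.columns)
    m = np.zeros((len(cols), len(cols)))
    for i, a in enumerate(cols):
        for j, b in enumerate(cols):
            m[i, j] = normalized_mutual_info_score(df[a], df[b], average_method="geometric")
    return m

def mi_similarity(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    m, n = pairwise_nmi(real), pairwise_nmi(synth)
    # weighted Jaccard index between the two non-negative matrices
    return np.minimum(m, n).sum() / np.maximum(m, n).sum()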
TBD
please add @iokk3732 for code.
please add @Femi-lawal for code.
@all-contributors
please add @Wh1isper for code.
please add @MooooCat for code.
please add @joeyscave for code.
We currently only provide a Python API for interaction; it is difficult for other languages to call sdgx code directly.
One possible approach is to add a CLI that lets the user specify input and output paths for cleaned data and models.
$sdgx fit --model=CTGAN --table_path=/path/to/table.csv --output_path=/path/to/model --param_path=/path/to/config.toml
$sdgx sample --model_path=/path/to/model --num_rows=1000 # Generate 1000 samples with model
Since each model has a different sample and fit method, I thought it would be easier to add a --param_path
and make it point to a file for configuration.
Perhaps we could introduce a plugin system so that we can not only easily add new models, but also let users add their own algorithms via plugins without modifying this project; pluggy is a good choice.
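A rough illustration of how a pluggy-based plugin system could look; the project name passed to the markers and the register_models hook are hypothetical, not an sdgx API:

import pluggy

hookspec = pluggy.HookspecMarker("sdgx")
hookimpl = pluggy.HookimplMarker("sdgx")

class ModelSpec:
    @hookspec
    def register_models(self):
        """Return a mapping of model name -> model class."""

class MyPlugin:
    @hookimpl
    def register_models(self):
        # a user-defined model class would be returned in practice
        return {"MyGAN": object}

pm = pluggy.PluginManager("sdgx")
pm.add_hookspecs(ModelSpec)
pm.register(MyPlugin())
print(pm.hook.register_models())  # [{'MyGAN': <class 'object'>}]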
In particular, when the primary key is a composite key, foreign keys might also be composite keys. How can this tool support multi-table synthesis in this scenario?
A simple approach is to concatenate the associated parent and child table data through foreign keys, and then use an existing model to learn their data distribution.
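A toy pandas sketch of this concatenation idea for a composite foreign key (all column names are made up):

import pandas as pd

parent = pd.DataFrame({"k1": [1, 1], "k2": ["a", "b"], "p_val": [10, 20]})
child = pd.DataFrame({"k1": [1, 1, 1], "k2": ["a", "a", "b"], "c_val": [7, 8, 9]})

# Join parent and child on the composite foreign key (k1, k2),
# then train a single-table model on `joined`.
joined = child.merge(parent, on=["k1", "k2"], how="inner")
print(joined)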
@all-contributors
please add @Z712023 for code.
I have searched for issues similar to this one.
This inspector is used to infer whether the columns in the tabular data are of type Address (Mainland China), thereby better labeling the data.
This inspector can be implemented through regular expressions and some rules.
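A very rough heuristic sketch; the regular expression and threshold below are illustrative assumptions, not a complete rule set:

import re
import pandas as pd

# common suffix characters in Mainland-China addresses (province/city/district/county/road/street/number)
ADDRESS_PATTERN = re.compile(r"(省|市|区|县|路|街道|号)")

def looks_like_address_column(series: pd.Series, match_rate: float = 0.8) -> bool:
    values = series.dropna().astype(str)
    if values.empty:
        return False
    matched = values.apply(lambda v: bool(ADDRESS_PATTERN.search(v))).mean()
    return matched >= match_rate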
Implementation content should include:
I have searched for issues similar to this one.
This inspector determines whether a column matches the regular expression given by the user and outputs the column names.
Add an Inspector that accepts two parameters:
The content is as follows:
- Inherit from sdgx.data_models.inspector.base.Inspector and implement the fit method;
- Inherit from sdgx.data_models.inspector.base.Inspector and implement the inspect method.
For the __init__ method:
For the fit method, the input parameters should be:
Add a match_rate parameter (default set to 0.8 or another value). This parameter is between 0 and 1: when a "match_rate" ratio of the values in a column matches the regular expression, that column should appear in the inspect results.
For the inspect method:
inspectors = InspectorManager().init_inspcetors(
    include_inspectors, exclude_inspectors, **(inspector_init_kwargs or {})
)
for inspector in inspectors:
    inspector.fit(df)
metadata = Metadata(primary_keys=[df.columns[0]], column_list=list(df.columns))
for inspector in inspectors:
    metadata.update(inspector.inspect())
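A hedged sketch of such an inspector; it assumes the fit/inspect interface described above, so the real base-class signatures may differ:

import re
import pandas as pd

class RegexInspector:
    def __init__(self, pattern: str, match_rate: float = 0.8):
        self.pattern = re.compile(pattern)
        self.match_rate = match_rate
        self.matched_columns = set()

    def fit(self, df: pd.DataFrame):
        # record columns where at least `match_rate` of values fully match the pattern
        for col in df.columns:
            values = df[col].dropna().astype(str)
            if len(values) == 0:
                continue
            ratio = values.apply(lambda v: bool(self.pattern.fullmatch(v))).mean()
            if ratio >= self.match_rate:
                self.matched_columns.add(col)

    def inspect(self) -> dict:
        # the key name here is illustrative; the real metadata key may differ
        return {"regex_columns": sorted(self.matched_columns)}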
Table data often includes special ID fields, such as a fixed string "AXBSAX" followed by a variable string X, where the fixed string holds a static physical meaning and X increments in quantity, such as "0001", "0002", and so on.
Use regular expressions to analyze the ID format, and synthesize the different meaningful segments separately while preserving the static meaning of the original ID field.
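A small illustration of splitting such an ID into its fixed prefix and incrementing suffix with a regular expression, using the "AXBSAX" example above:

import re

id_pattern = re.compile(r"^(?P<prefix>AXBSAX)(?P<seq>\d+)$")

m = id_pattern.match("AXBSAX0002")
print(m.group("prefix"), m.group("seq"))  # AXBSAX 0002
# Keep the prefix as-is and regenerate `seq` (e.g. an incrementing counter)
# when synthesizing new IDs.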
We can consider two conditions and handle them separately:
When modeling multi-table data, I've observed that real-world data often doesn't satisfy foreign key constraints. For example: parents_id=[1,2,3], children_id=[1,2]
, or parents_id=[1,2,3], children_id=[1,2,3,4]
.
I wish this tool could automatically assist me in cleaning the data, ensuring that foreign keys exist in both the parent and child tables (e.g., parents_id=[1,2], children_id=[1,2]
).
In this way, the data used for multi-table simulation modeling can accurately reflect the associative relationships of foreign keys.
Retain only the intersection of foreign keys between the parent table and the child table.
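A pandas sketch of keeping only the intersection of foreign-key values, using the toy parents_id/children_id example above:

import pandas as pd

parent = pd.DataFrame({"parents_id": [1, 2, 3]})
child = pd.DataFrame({"children_id": [1, 2, 3, 4]})

# keep only foreign-key values that exist in both tables
common = set(parent["parents_id"]) & set(child["children_id"])
parent_clean = parent[parent["parents_id"].isin(common)]
child_clean = child[child["children_id"].isin(common)]
print(sorted(common))  # [1, 2, 3]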
TBD
This tool aims to support composite primary keys, but it seems that it does not guarantee the uniqueness of composite primary keys.
Add validation to the metadata to ensure the uniqueness of composite primary keys during validation.
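A minimal sketch of such a uniqueness check; how it would hook into the metadata validation is assumed, not the actual sdgx API:

import pandas as pd

def composite_key_is_unique(df: pd.DataFrame, key_columns: list) -> bool:
    # duplicated() over the key columns flags any repeated key combination
    return not df.duplicated(subset=key_columns).any()

df = pd.DataFrame({"k1": [1, 1, 2], "k2": ["a", "a", "a"]})
print(composite_key_is_unique(df, ["k1", "k2"]))  # False: (1, "a") repeats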
TBD
The output of the API normalized_mutual_info_score may violate symmetry, although the official documentation claims that
"This metric is furthermore symmetric: switching label_true with label_pred will return the same score value." (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html)
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster import normalized_mutual_info_score

a = "guest"
b = "user"
c = "admin"
src = [a, c, b, b, c, a, c, a, c, c]
tar = [a, b, a, a, b, c, b, b, c, a]
test_nums = 100
for i in range(test_nums):
    le = LabelEncoder()
    src_list = list(set(src))
    tar_list = list(set(tar))
    fit_list = tar_list + src_list
    le.fit(fit_list)
    src_col = le.transform(src)
    tar_col = le.transform(tar)
    test1 = normalized_mutual_info_score(src_col, tar_col, average_method='geometric')
    test2 = normalized_mutual_info_score(tar_col, src_col, average_method='geometric')
    print(f"iter:{i}: test1:{test1} test2:{test2}")
    print(src_col, tar_col)
    print(tar_col, src_col)
    assert test2 == test1
I also tried changing the average_method parameter, but the error was still there.
Keep the symmetry property of normalized mutual information. We may need to rewrite the code ourselves.
I have searched for issues similar to this one.
This GFI enables sdgx to obtain column descriptions from raw_data or data sampled from raw_data, and return them as a text string that an LLM can understand. The information in the text should include but not be limited to:
Implement the _form_columns_description method of sdgx.models.LLM.single_table.base.LLMBaseModel. This method returns a string.
Developers can refer to the implementation ideas of the _form_message_with_offtable_features and _form_dataset_description methods.
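A rough, hypothetical sketch of the kind of text this method could produce; the helper below works on a sampled DataFrame and is not the actual LLMBaseModel API:

import pandas as pd

def form_columns_description(sample_df: pd.DataFrame) -> str:
    # one line per column: name, dtype, and an example value the LLM can read
    lines = ["The table contains the following columns:"]
    for col in sample_df.columns:
        dtype = sample_df[col].dtype
        example = sample_df[col].dropna().iloc[0] if sample_df[col].notna().any() else "N/A"
        lines.append(f"- {col} (dtype={dtype}), e.g. {example!r}")
    return "\n".join(lines)

print(form_columns_description(pd.DataFrame({"age": [25, 31], "city": ["Berlin", None]})))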
The data types of foreign keys need to be consistent.
Add validation to the metadata to ensure that the data types of the two columns related to foreign keys are the same.
Hi everybody,
Welcome to the SDG community !
This Issue empowers first-time contributors of open-source software.
We created a list of Good First Issues for you to pick from. To get more people into the program faster, we came up with this Issue. Thank you for clicking on this issue and thinking about how you can contribute to the SDG program.
Participation in a task that has already been assigned is also encouraged, but you must be aware that it will require more effort to get up to speed.
You can also filter the tasks by difficulty label:
There are currently three maintainers responsible for maintaining this project:
You are the future of SDG, and we appreciate your contribution!!! 🌠🌌🌇🌄🌅🌁
The two base classes that need to be merged are:
sdgx/models/base.py
sdgx/models/single_table/ctgan.py
Currently these two base classes implement similar functionality
Please merge their code into one base class, and modify the import part of the file header, while ensuring that the code can be executed.
The rewritten code of the base class should be placed in base.py instead of ctgan.py
sdgx/models/base.py
- Rename the class BaseGeneratorModel to BaseSynthesizerModel.
- Add the methods and properties from the BaseSynthesizer class in sdgx/models/single_table/ctgan.py to the BaseSynthesizerModel class. This includes the methods __getstate__, __setstate__, save, load, set_random_state, and the property random_states.
- Remove the fit method as it is already defined in the BaseSynthesizer class.

sdgx/models/single_table/ctgan.py
- Replace all instances of BaseSynthesizer with BaseSynthesizerModel.
- Remove the BaseSynthesizer class definition.
- Update the import statement at the top of the file to import BaseSynthesizerModel from sdgx.models.base instead of BaseSynthesizer.
I have searched for issues similar to this one.
Add more logs, which will help users understand the current process and locate problems (if any).
Use from sdgx.utils import logger
to initialize a logger, and then use logger.info
or other levels to output logs.
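A usage sketch based on the import path given above:

from sdgx.utils import logger

def fit(data):
    logger.info("Start fitting the model.")
    # ... training steps ...
    logger.warning("Column 'age' contains missing values.")  # example of another level
    logger.info("Fitting finished.")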
Welcome feature request!
Files:
DB:
NOSQL:
I have searched for issues similar to this one.
In probability theory and statistics, the Jensen–Shannon divergence, “JS散度” in Chinese, is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen–Shannon distance.
The value range of JSD is [0, 1]: when two distributions are similar, JSD approaches 0, and if the two distributions do not overlap at all, JSD equals the constant 1. The closer JSD is to 0, the more similar the distributions, and thus the higher the quality of the synthetic data.
Implement the calculation of Jensen–Shannon divergence by coding or calling methods in existing modules. The implementation code should be located in the sdgx/metrics
directory, and the class implementation should inherit from the base class in sdgx/metrics/base.py
.
In addition to implementing JSD, the demo code in example/
and the related code in Readme.md
should also be added.
Since this is the first metric method implemented, the base class can be modified if necessary to make it more suitable for various metric calculations.
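A hedged sketch of such a calculation using scipy; the per-column distribution construction is an assumption, and the actual class would inherit from the base class in sdgx/metrics/base.py:

import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_for_column(real_col, synthetic_col) -> float:
    # build aligned probability distributions over the union of categories
    categories = sorted(set(real_col) | set(synthetic_col))
    p = np.array([(np.asarray(real_col) == c).mean() for c in categories])
    q = np.array([(np.asarray(synthetic_col) == c).mean() for c in categories])
    # scipy returns the JS *distance*; square it to get the divergence,
    # and base 2 keeps the value in [0, 1]
    return jensenshannon(p, q, base=2) ** 2

print(jsd_for_column(["a", "a", "b"], ["a", "b", "b"]))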
Metadata's _extend field is missing when saving Metadata to disk (as JSON).
# import packages
import pandas as pd
from pathlib import Path
from sdgx.data_models.metadata import Metadata
from sdgx.utils import download_demo_data
# get a metadata, I use a demo dataset
# every dataset is OK
p = download_demo_data()
df = pd.read_csv(p)
m = Metadata.from_dataframe(df)
# I add a k-v pair
# this will add the `.extend` field
m.add('a', "something")
# then save the model
m.save(Path('here.json'))
print(m.get('a'))
"""The output is:
{'something'}
"""
# load the model from disk
n = Metadata.load(Path("here.json"))
# the value "something" is missing
print(n.get('a'))
"""The output is:
set()
"""
# the `_extend`is empty
print(n._extend)
''' The output is :
defaultdict(<class 'set'>, {})
'''
# load the model from disk
n = Metadata.load(Path("here.json"))
# the value "something" is missing
print(n.get('a'))
"""The expected output should be:
{'something'}
I initially think this bug is related to the model_dump_json() method in pydantic.BaseModel. The JSON string output by this method does not contain _extend.
Maybe it is related to the fact that _extend is a private member of the class?
I have searched for issues similar to this one.
Is it possible to get/find the code that is used for the preprocessing of the dataset? After evaluating the code, I saw that there is almost no correlation anymore and I was wondering how this was done.
I have searched for issues similar to this one.
Implement GaussianCopula model for single table synthetic data generation.
TBD
TBD
TBD
I have searched for issues similar to this one.
Add metadata code usage examples; an IPython notebook is preferred.
There are some examples for reference in the sdgx documentation and unit tests.
Add detailed descriptions in the IPython notebook.
I have searched for issues similar to this one.
Implement Table-GAN model for single table synthetic data generation.
When sdgx is used as a CLI, the results of the task should be communicated to the calling program in the form of JSON output and an exit code.
Another benefit of this change would be a feature similar to progress reporting, but this would require additional supporting work.
For synthetic data, it is important to maintain the rules/constraints between original data columns. For example, the opening time of a bank card must be earlier than its expiration time. Applying such rules to generated data can effectively improve data quality.
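A small sketch of applying the bank-card rule above to generated rows (the column names are illustrative):

import pandas as pd

def enforce_open_before_expire(df: pd.DataFrame) -> pd.DataFrame:
    # drop generated rows that violate the constraint open_time < expire_time
    return df[df["open_time"] < df["expire_time"]].reset_index(drop=True)

df = pd.DataFrame({
    "open_time": pd.to_datetime(["2020-01-01", "2022-05-01"]),
    "expire_time": pd.to_datetime(["2023-01-01", "2021-05-01"]),
})
print(enforce_open_before_expire(df))  # keeps only the first row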
When introducing this Feature, one should first implement:
TBD
I have searched for issues similar to this one.
Many users are asking the same or similar questions.
We will first collect related/similar/same questions, then summarize them as FAQ in the document.
This will help new users get started with sdgx as soon as possible and resolve common doubts.
Details will be updated in our documentation.
I have searched for issues similar to this one.
Update class DatetimeInspector(Inspector)
in sdgx/data_models/inspectors/datetime.py
to achieve:
As with the existing DatetimeInspector, after running the fit method you can infer which columns belong to the datetime type. After implementing this Issue, DatetimeInspector will also be able to output the specific format strings of some of the columns (not all of them), which will help improve the quality of synthetic data.
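For example, format detection could try a list of candidate format strings and keep the first one that parses every value; the candidate list below is an illustrative assumption:

from typing import Optional
import pandas as pd

CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]

def detect_datetime_format(series: pd.Series) -> Optional[str]:
    values = series.dropna().astype(str)
    for fmt in CANDIDATE_FORMATS:
        try:
            # raises ValueError if any value does not match this format
            pd.to_datetime(values, format=fmt)
            return fmt
        except (ValueError, TypeError):
            continue
    return None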
Briefly speaking, the following steps are required:
- In the __init__ method, add parameters for user-specified formats;
- In the fit method, add the datetime format string matching steps;
- In the inspect method, add the output of the datetime format string, which will be passed to metadata.
I have searched for issues similar to this one.
Implement TVAE model for single table synthetic data generation.
TBD
TBD
TBD
I have searched for issues similar to this one.
In the transformer_opt module, add a function to the method below that writes the np.ndarray data output by the module to disk in npz format.
Compared with directly writing the entire CSV file, this function can effectively save disk space. Since the transformer_opt module already processes the CSV file in batches, writing npz files for each batch can reduce repeated batching when other modules process the data later, and it is also more convenient for parallel processing.
Modifications for this issue should be located in the sdgx/transform/transformer_opt.py path.
Please find the _synchronous_transform method in the class DataTransformer; it is necessary to add the parameter output_type to determine the storage type.
For the coding implementation details of this Issue, please refer to the comments in the following code block:
# add the parameter `output_type` to this method
def _synchronous_transform(self, input_data_path,
                           column_transform_info_list,
                           output_path,
                           output_type):  # new argument
    """Method Description ... """
    loop = True
    # has_write_header = True
    # use iterator=True
    reader = pd.read_csv(input_data_path, iterator=True, chunksize=1000000)
    while loop:
        # Existing code ...
        # Some code is omitted here for brevity
        # Add your code here
        chunk_array = np.concatenate(column_data_list, axis=1).astype(float)
        # file object
        f = open(output_path, 'a')
        np.savetxt(f, chunk_array, fmt="%g", delimiter=',')
        f.close()
    # end while
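A hedged sketch of the requested npz branch; the per-chunk file naming is an assumption, not a decided convention:

import numpy as np

def write_chunk(chunk_array, output_path, output_type, chunk_index):
    if output_type == "npz":
        # one compressed .npz file per processed batch
        np.savez_compressed(f"{output_path}.part{chunk_index}.npz", data=chunk_array)
    else:
        # fall back to the existing CSV append behaviour
        with open(output_path, "a") as f:
            np.savetxt(f, chunk_array, fmt="%g", delimiter=",")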
TBD
When a large amount of real data is used to train a CTGAN model, the current implementation does not work well.
Since all the data (a DataFrame) is loaded into memory during training, this causes huge memory consumption, which is not an elegant solution.
Fortunately, in this refactoring, sdgx provides the new DataLoader and the NDArrayLoader under development.
We can use these new data-related components to modify the Data transformer, Data sampler, and CTGAN model.
The data will not be loaded into the memory all at once, instead, the data will be loaded into the memory in rows or columns (chunks) according to needs, then the data will be used to train the model.
This will effectively reduce memory consumption and provide larger data processing capabilities.
TBD
I have searched for issues similar to this one.
As is common practice for the OpenAI API, developers are accustomed to using dotenv to manage environment variables (the OpenAI API key).
Currently, sdgx's single-table LLM (GPT) model has not yet used this convenient tool.
It is not very difficult; refer to the OpenAI interface documentation to understand the relevant usage.
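A typical python-dotenv usage sketch; the variable name follows the common OPENAI_API_KEY convention:

import os
from dotenv import load_dotenv

load_dotenv()  # reads a local .env file into the process environment
api_key = os.environ.get("OPENAI_API_KEY")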
I have searched for issues similar to this one.
Implement CTAB-GAN model for single table synthetic data generation.
TBD
TBD
TBD
Benchmarks aim to measure the performance of the library.
Now we provide a simple benchmark of our CTGAN implementation against the original one: fit both on a big random dataset and compare their memory consumption.
https://github.com/hitsz-ids/synthetic-data-generator/tree/main/benchmarks
Add benchmarks for: