hitsz-ids / synthetic-data-generator
SDG is a specialized framework designed to generate high-quality structured tabular data.
License: Apache License 2.0
I have searched for issues similar to this one.
Since we are refactoring, the contents of the two .py files currently in the example/
directory are outdated and cannot be run.
In this Issue, please rewrite valid examples based on the ideas in the existing example scripts.
Update the code in these two example scripts:
TBD
Currently, sdgx lacks a complete data processor, which performs the necessary pre-processing of input data and post-processing of generated data.
Implement a data processor that supports pre-processing and post-processing.
When this component is initially introduced, it implements the following two functions:
The implemented data processor should support sdgx's DataLoader access method.
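A minimal sketch of what such an interface could look like; the class and method names here are illustrative assumptions, not the final sdgx API:

# Hypothetical data processor interface: pre-process input data, post-process generated data.
from abc import ABC, abstractmethod
import pandas as pd

class DataProcessor(ABC):
    @abstractmethod
    def pre_process(self, raw_df: pd.DataFrame) -> pd.DataFrame:
        """Prepare input data before it is fed to a model (e.g. drop/encode columns)."""

    @abstractmethod
    def post_process(self, synthetic_df: pd.DataFrame) -> pd.DataFrame:
        """Restore or clean generated data after sampling (e.g. decode columns)."""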
TBD
Please add the code of the base class to metrics/base.py, and use Python 3.
Please try to define enough methods in the base class to avoid repeated additions when implementing metrics.
sdgx/metrics/base.py
- Remove the pass statement from the file.
- Import ABC and abstractmethod from the abc module at the top of the file.
- Create a new class named BaseMetric that inherits from ABC.
- Inside the BaseMetric class, define an __init__ method that takes two parameters: real_data and synthetic_data. These parameters should be stored as instance variables.
- Still inside the BaseMetric class, define an abstract method named calculate that takes no parameters. This method will be used to calculate the metric and should be implemented in each subclass.
- Still inside the BaseMetric class, define a method named validate_datasets that takes no parameters. This method should check if the real_data and synthetic_data instance variables are valid datasets. For now, this method can simply pass.
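A sketch that follows the steps above; treat it as a starting point rather than the final sdgx/metrics/base.py:

from abc import ABC, abstractmethod

class BaseMetric(ABC):
    def __init__(self, real_data, synthetic_data):
        # store both datasets as instance variables
        self.real_data = real_data
        self.synthetic_data = synthetic_data

    @abstractmethod
    def calculate(self):
        """Calculate the metric; implemented by each subclass."""

    def validate_datasets(self):
        # Check that real_data and synthetic_data are valid datasets.
        # For now this can simply pass, as described above.
        pass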
I have searched for issues similar to this one.
In the current implementation of CTGAN, all training data will be loaded into memory at once, which will greatly increase memory consumption.
Thanks.
When using CSV files to store training data, the current implementation uses a pd.DataFrame to load all the data at once. I plan to load batch data in CTGAN on demand, loading the next batch only after the current one has been used.
There will be no parameter changes to the current implementation.
This optimized implementation will not change the existing example code.
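A rough illustration of the intended direction, assuming CSV input and pandas chunked reading (the actual design is still pending):

import pandas as pd

def iter_batches(csv_path, batch_size=500):
    # pd.read_csv with chunksize returns an iterator of DataFrame chunks,
    # so only one batch is held in memory at a time.
    for chunk in pd.read_csv(csv_path, chunksize=batch_size):
        yield chunk

# for batch in iter_batches("train.csv"):
#     ...  # feed `batch` to one CTGAN training step, then discard it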
Waiting for design:
Please pay attention to the LICENSE of any referenced implementation/code. We accept MIT, BSD, Apache 2.0, and other non-contagious, commercially friendly licenses, or licenses compatible with Apache 2.0.
I have searched for issues similar to this one.
Currently, demo data can only be obtained through the function sdgx.utils.io.csv_utils.get_demo_single_table, and only the adult dataset is supported. In this issue, please implement a more systematic demo data management module.
We recommend stripping this module out of the script sdgx/utils/io/csv_utils.py and implementing a separate script in the sdgx/utils/io/ directory.
We recommend creating a file demo_data.py and implementing the functions or class in this file.
We provide a class example for your reference:
# A DemoData example
import pandas as pd

class DemoData(object):
    def __init__(self, dataset_name) -> None:
        # the dataset name should be checked
        pass

    def get_data(self, offline_path=None) -> pd.DataFrame:
        # if offline_path is not None,
        # read data from the input path
        pass

    def download_data(self) -> None:
        # download the dataset to a local cache
        pass
Some operations that enhance user experience are also worthwhile, such as:
Known blocking issues:
I have searched for issues similar to this one.
Implement OCT-GAN model for single table synthetic data generation.
TBD
TBD
TBD
When I use this tool to synthesize multi-table databases, I cannot evaluate the differences in correlation between the generated data and real data across tables.
Some commonly used statistical metrics, such as mutual information, can be introduced, as follows:
Mutual information can be used to measure the correlation between each pair of columns in two tables.
Let X and Y be two datasets with the same number of discrete columns/features (m), where one is the original dataset, and the other is a simulated dataset.
For each dataset, we calculate the normalized mutual information between each pair of columns.
The normalized mutual information between random variables X and Y is defined as

\mathrm{NMI}(X, Y) = \frac{I(X; Y)}{\sqrt{H(X)\, H(Y)}}

Since I(X; Y) \le \min(H(X), H(Y)) \le \sqrt{H(X)\, H(Y)}, we can get \mathrm{NMI}(X, Y) \le 1. That is, 0 \le \mathrm{NMI}(X, Y) \le 1.

The element at the i-th row and j-th column of the pairwise mutual information matrix M is

M_{ij} = \mathrm{NMI}(X_i, X_j)

where X_i and X_j are the i-th and j-th columns of X.

Let M and N be the pairwise mutual information matrices for X and Y. Then, mutual information similarity is defined as

\mathrm{MISim}(X, Y) = J(M, N)

where J is the Jaccard index defined as

J(M, N) = \frac{\sum_{i,j} \min(M_{ij}, N_{ij})}{\sum_{i,j} \max(M_{ij}, N_{ij})}
This metric is bounded by 0 and 1 with 1 being the maximal and best value.
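A possible sketch of this metric using scikit-learn's normalized_mutual_info_score; the pairwise-matrix construction and the weighted-Jaccard aggregation follow the description above and are assumptions, not existing sdgx code:

import numpy as np
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def pairwise_nmi(df: pd.DataFrame) -> np.ndarray:
    # NMI between every pair of (discrete) columns of one table
    cols = list(df.columns)
    m = np.zeros((len(cols), len(cols)))
    for i, a in enumerate(cols):
        for j, b in enumerate(cols):
            m[i, j] = normalized_mutual_info_score(df[a], df[b], average_method="geometric")
    return m

def mi_similarity(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    m, n = pairwise_nmi(real), pairwise_nmi(synth)
    # weighted Jaccard index between the two non-negative matrices
    return np.minimum(m, n).sum() / np.maximum(m, n).sum()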
TBD
please add @iokk3732 for code.
please add @Femi-lawal for code.
@all-contributors
please add @Wh1isper for code.
please add @MooooCat for code.
please add @joeyscave for code.
We currently only provide a Python API for interaction; it is difficult for other languages to call sdgx code directly.
One possible approach is to add a CLI that lets the user specify input and output paths for cleaned data and models.
$sdgx fit --model=CTGAN --table_path=/path/to/table.csv --output_path=/path/to/model --param_path=/path/to/config.toml
$sdgx sample --model_path=/path/to/model --num_rows=1000 # Generate 1000 samples with model
Since each model has a different sample and fit method, I thought it would be easier to add a --param_path
and make it point to a file for configuration.
Perhaps we could introduce a plugin system so that we can not only easily add new models, but also let users add their own algorithms via plugins without modifying this project; pluggy is a good choice.
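A rough illustration of how a pluggy-based plugin system could look; the project name passed to the markers and the register_models hook are hypothetical, not an sdgx API:

import pluggy

hookspec = pluggy.HookspecMarker("sdgx")
hookimpl = pluggy.HookimplMarker("sdgx")

class ModelSpec:
    @hookspec
    def register_models(self):
        """Return a mapping of model name -> model class."""

class MyPlugin:
    @hookimpl
    def register_models(self):
        # a user-defined model class would be returned in practice
        return {"MyGAN": object}

pm = pluggy.PluginManager("sdgx")
pm.add_hookspecs(ModelSpec)
pm.register(MyPlugin())
print(pm.hook.register_models())  # [{'MyGAN': <class 'object'>}]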
In particular, when the primary key is a composite key, foreign keys might also be composite keys. How can this tool support multi-table synthesis in this scenario?
A simple approach is to concatenate the associated parent and child table data through foreign keys, and then use an existing model to learn their data distribution.
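A toy pandas sketch of this concatenation idea for a composite foreign key (all column names are made up):

import pandas as pd

parent = pd.DataFrame({"k1": [1, 1], "k2": ["a", "b"], "p_val": [10, 20]})
child = pd.DataFrame({"k1": [1, 1, 1], "k2": ["a", "a", "b"], "c_val": [7, 8, 9]})

# Join parent and child on the composite foreign key (k1, k2),
# then train a single-table model on `joined`.
joined = child.merge(parent, on=["k1", "k2"], how="inner")
print(joined)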
@all-contributors
please add @Z712023 for code.
I have searched for issues similar to this one.
This inspector is used to infer whether the columns in the tabular data are of type Address (Mainland China), thereby better labeling the data.
This inspector can be implemented through regular expressions and some rules.
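A very rough heuristic sketch; the regular expression and threshold below are illustrative assumptions, not a complete rule set:

import re
import pandas as pd

# common suffix characters in Mainland-China addresses (province/city/district/county/road/street/number)
ADDRESS_PATTERN = re.compile(r"(省|市|区|县|路|街道|号)")

def looks_like_address_column(series: pd.Series, match_rate: float = 0.8) -> bool:
    values = series.dropna().astype(str)
    if values.empty:
        return False
    matched = values.apply(lambda v: bool(ADDRESS_PATTERN.search(v))).mean()
    return matched >= match_rate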
Implementation content should include:
I have searched for issues similar to this one.
This inspector determines whether a column matches the regular expression given by the user and outputs the column names.
Add an Inspector that accepts two parameters:
The content is as follows:
- Inherit from sdgx.data_models.inspector.base.Inspector and implement the fit method;
- Inherit from sdgx.data_models.inspector.base.Inspector and implement the inspect method.
For the __init__ method:
For the fit method, the input parameters should be:
Add a match_rate parameter (default set to 0.8 or another value). This parameter is between 0 and 1: when a "match_rate" ratio of the values in a column matches the regular expression, that column should appear in the inspect results.
For the inspect method:
inspectors = InspectorManager().init_inspcetors(
    include_inspectors, exclude_inspectors, **(inspector_init_kwargs or {})
)
for inspector in inspectors:
    inspector.fit(df)
metadata = Metadata(primary_keys=[df.columns[0]], column_list=list(df.columns))
for inspector in inspectors:
    metadata.update(inspector.inspect())
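A hedged sketch of such an inspector; it assumes the fit/inspect interface described above, so the real base-class signatures may differ:

import re
import pandas as pd

class RegexInspector:
    def __init__(self, pattern: str, match_rate: float = 0.8):
        self.pattern = re.compile(pattern)
        self.match_rate = match_rate
        self.matched_columns = set()

    def fit(self, df: pd.DataFrame):
        # record columns where at least `match_rate` of values fully match the pattern
        for col in df.columns:
            values = df[col].dropna().astype(str)
            if len(values) == 0:
                continue
            ratio = values.apply(lambda v: bool(self.pattern.fullmatch(v))).mean()
            if ratio >= self.match_rate:
                self.matched_columns.add(col)

    def inspect(self) -> dict:
        # the key name here is illustrative; the real metadata key may differ
        return {"regex_columns": sorted(self.matched_columns)}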
Table data often includes special ID fields, such as a fixed string "AXBSAX" followed by a variable string X, where the fixed string holds a static physical meaning and X increments in quantity, such as "0001", "0002", and so on.
Use regular expressions to analyze the ID format, and synthesize the different meaningful segments separately while preserving the static meaning of the original ID field.
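A small illustration of splitting such an ID into its fixed prefix and incrementing suffix with a regular expression, using the "AXBSAX" example above:

import re

id_pattern = re.compile(r"^(?P<prefix>AXBSAX)(?P<seq>\d+)$")

m = id_pattern.match("AXBSAX0002")
print(m.group("prefix"), m.group("seq"))  # AXBSAX 0002
# Keep the prefix as-is and regenerate `seq` (e.g. an incrementing counter)
# when synthesizing new IDs.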
We can consider two conditions and handle them separately:
When modeling multi-table data, I've observed that real-world data often doesn't satisfy foreign key constraints. For example: parents_id=[1,2,3], children_id=[1,2]
, or parents_id=[1,2,3], children_id=[1,2,3,4]
.
I wish this tool could automatically assist me in cleaning the data, ensuring that foreign keys exist in both the parent and child tables (e.g., parents_id=[1,2], children_id=[1,2]
).
In this way, the data used for multi-table simulation modeling can accurately reflect the associative relationships of foreign keys.
Retain only the intersection of foreign keys between the parent table and the child table.
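A pandas sketch of keeping only the intersection of foreign-key values, using the toy parents_id/children_id example above:

import pandas as pd

parent = pd.DataFrame({"parents_id": [1, 2, 3]})
child = pd.DataFrame({"children_id": [1, 2, 3, 4]})

# keep only foreign-key values that exist in both tables
common = set(parent["parents_id"]) & set(child["children_id"])
parent_clean = parent[parent["parents_id"].isin(common)]
child_clean = child[child["children_id"].isin(common)]
print(sorted(common))  # [1, 2, 3]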
TBD
This tool aims to support composite primary keys, but it seems that it does not guarantee the uniqueness of composite primary keys.
Add validation to the metadata to ensure the uniqueness of composite primary keys during validation.
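A minimal sketch of such a uniqueness check; how it would hook into the metadata validation is assumed, not the actual sdgx API:

import pandas as pd

def composite_key_is_unique(df: pd.DataFrame, key_columns: list) -> bool:
    # duplicated() over the key columns flags any repeated key combination
    return not df.duplicated(subset=key_columns).any()

df = pd.DataFrame({"k1": [1, 1, 2], "k2": ["a", "a", "a"]})
print(composite_key_is_unique(df, ["k1", "k2"]))  # False: (1, "a") repeats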
TBD
The output of the API normalized_mutual_info_score may violate symmetry, although the official documentation claims that
"This metric is furthermore symmetric: switching label_true with label_pred will return the same score value." (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html)
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster import normalized_mutual_info_score

a = "guest"
b = "user"
c = "admin"
src = [a, c, b, b, c, a, c, a, c, c]
tar = [a, b, a, a, b, c, b, b, c, a]
test_nums = 100
for i in range(test_nums):
    le = LabelEncoder()
    src_list = list(set(src))
    tar_list = list(set(tar))
    fit_list = tar_list + src_list
    le.fit(fit_list)
    src_col = le.transform(src)
    tar_col = le.transform(tar)
    test1 = normalized_mutual_info_score(src_col, tar_col, average_method='geometric')
    test2 = normalized_mutual_info_score(tar_col, src_col, average_method='geometric')
    print(f"iter:{i}: test1:{test1} test2:{test2}")
    print(src_col, tar_col)
    print(tar_col, src_col)
    assert test2 == test1
I also tried changing the average_method parameter, but the error was still there.
Keep the symmetry property of normalized mutual information. We may need to rewrite the code ourselves.
I have searched for issues similar to this one.
This GFI enables sdgx to obtain column descriptions from raw_data or data sampled from raw_data, and return them as a text string that an LLM can understand. The information in the text should include but not be limited to:
Implement the _form_columns_description method of sdgx.models.LLM.single_table.base.LLMBaseModel. This method returns a string.
Developers can refer to the implementation ideas of the _form_message_with_offtable_features and _form_dataset_description methods.
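A rough, hypothetical sketch of the kind of text this method could produce; the helper below works on a sampled DataFrame and is not the actual LLMBaseModel API:

import pandas as pd

def form_columns_description(sample_df: pd.DataFrame) -> str:
    # one line per column: name, dtype, and an example value the LLM can read
    lines = ["The table contains the following columns:"]
    for col in sample_df.columns:
        dtype = sample_df[col].dtype
        example = sample_df[col].dropna().iloc[0] if sample_df[col].notna().any() else "N/A"
        lines.append(f"- {col} (dtype={dtype}), e.g. {example!r}")
    return "\n".join(lines)

print(form_columns_description(pd.DataFrame({"age": [25, 31], "city": ["Berlin", None]})))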
The data types of foreign keys need to be consistent.
Add validation to the metadata to ensure that the data types of the two columns related to foreign keys are the same.
Hi everybody,
Welcome to the SDG community !
This Issue empowers first-time contributors of open-source software.
We created a list of Good First Issues for you to pick from. To get more people into the program faster, we came up with this Issue. Thank you for clicking on this issue and thinking about how you can contribute to the SDG program.
Participation in a task that has already been assigned is also encouraged, but you must be aware that it will require more effort to get up to speed.
You can also filter the tasks by difficulty label:
There are currently three maintainers responsible for maintaining this project:
You are the future of SDG, and we appreciate your contribution!!! 🌠🌌🌇🌄🌅🌁
The two base classes that need to be merged are:
sdgx/models/base.py
sdgx/models/single_table/ctgan.py
Currently these two base classes implement similar functionality
Please merge their code into one base class, and modify the import part of the file header, while ensuring that the code can be executed.
The rewritten code of the base class should be placed in base.py instead of ctgan.py
sdgx/models/base.py
- Rename the class BaseGeneratorModel to BaseSynthesizerModel.
- Add the methods and properties from the BaseSynthesizer class in sdgx/models/single_table/ctgan.py to the BaseSynthesizerModel class. This includes the methods __getstate__, __setstate__, save, load, set_random_state, and the property random_states.
- Remove the fit method as it is already defined in the BaseSynthesizer class.

sdgx/models/single_table/ctgan.py
- Replace all instances of BaseSynthesizer with BaseSynthesizerModel.
- Remove the BaseSynthesizer class definition.
- Update the import statement at the top of the file to import BaseSynthesizerModel from sdgx.models.base instead of BaseSynthesizer.
I have searched for issues similar to this one.
Add more logs, which will help users understand the current process and locate problems (if any).
Use from sdgx.utils import logger
to initialize a logger, and then use logger.info
or other levels to output logs.
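A usage sketch based on the import path given above:

from sdgx.utils import logger

def fit(data):
    logger.info("Start fitting the model.")
    # ... training steps ...
    logger.warning("Column 'age' contains missing values.")  # example of another level
    logger.info("Fitting finished.")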
Welcome feature request!
Files:
DB:
NOSQL:
I have searched for issues similar to this one.
In probability theory and statistics, the Jensen–Shannon divergence, “JS散度” in Chinese, is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen–Shannon distance.
The value range of JSD is [0, 1]: when two distributions are similar, JSD approaches 0, and if the two distributions do not overlap at all, JSD equals the constant 1. The closer JSD is to 0, the more similar the distributions, and thus the higher the quality of the synthetic data.
Implement the calculation of Jensen–Shannon divergence by coding or calling methods in existing modules. The implementation code should be located in the sdgx/metrics
directory, and the class implementation should inherit from the base class in sdgx/metrics/base.py
.
In addition to implementing JSD, the demo code in example/
and the related code in Readme.md
should also be added.
Since this is the first metric method implemented, the base class can be modified if necessary to make it more suitable for various metric calculations.
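A hedged sketch of such a calculation using scipy; the per-column distribution construction is an assumption, and the actual class would inherit from the base class in sdgx/metrics/base.py:

import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_for_column(real_col, synthetic_col) -> float:
    # build aligned probability distributions over the union of categories
    categories = sorted(set(real_col) | set(synthetic_col))
    p = np.array([(np.asarray(real_col) == c).mean() for c in categories])
    q = np.array([(np.asarray(synthetic_col) == c).mean() for c in categories])
    # scipy returns the JS *distance*; square it to get the divergence,
    # and base 2 keeps the value in [0, 1]
    return jensenshannon(p, q, base=2) ** 2

print(jsd_for_column(["a", "a", "b"], ["a", "b", "b"]))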
Metadata's _extend field is missing when saving Metadata to disk (as JSON).
# import packages
import pandas as pd
from pathlib import Path
from sdgx.data_models.metadata import Metadata
from sdgx.utils import download_demo_data
# get a metadata, I use a demo dataset
# every dataset is OK
p = download_demo_data()
df = pd.read_csv(p)
m = Metadata.from_dataframe(df)
# I add a k-v pair
# this will add the `.extend` field
m.add('a', "something")
# then save the model
m.save(Path('here.json'))
print(m.get('a'))
"""The output is:
{'something'}
"""
# load the model from disk
n = Metadata.load(Path("here.json"))
# the value "something" is missing
print(n.get('a'))
"""The output is:
set()
"""
# the `_extend`is empty
print(n._extend)
''' The output is :
defaultdict(<class 'set'>, {})
'''
# load the model from disk
n = Metadata.load(Path("here.json"))
# the value "something" is missing
print(n.get('a'))
"""The expected output should be:
{'something'}
I initially think this bug is related to the model_dump_json() method in pydantic.BaseModel. The JSON string output by this method does not contain _extend.
Maybe it is related to the fact that _extend is a private member of the class?
I have searched for issues similar to this one.
Is it possible to get/find the code that is used for the preprocessing of the dataset? After evaluating the code, I saw that there is almost no correlation anymore and I was wondering how this was done.
I have searched for issues similar to this one.
Implement GaussianCopula model for single table synthetic data generation.
TBD
TBD
TBD
I have searched for issues similar to this one.
Add metadata code usage examples; an IPython notebook is preferred.
There are some examples for reference in the sdgx documentation and unit tests.
Add detailed descriptions in the IPython notebook.
I have searched for issues similar to this one.
Implement Table-GAN model for single table synthetic data generation.
When sdgx is used as a CLI, the results of the task should be communicated to the calling program in the form of JSON output and an exit code.
Another benefit of this change would be a feature similar to progress reporting, but this would require additional supporting work.
For synthetic data, it is important to maintain the rules/constraints between original data columns. For example, the opening time of a bank card must be earlier than its expiration time. Applying such rules to generated data can effectively improve data quality.
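A small sketch of applying the bank-card rule above to generated rows (the column names are illustrative):

import pandas as pd

def enforce_open_before_expire(df: pd.DataFrame) -> pd.DataFrame:
    # drop generated rows that violate the constraint open_time < expire_time
    return df[df["open_time"] < df["expire_time"]].reset_index(drop=True)

df = pd.DataFrame({
    "open_time": pd.to_datetime(["2020-01-01", "2022-05-01"]),
    "expire_time": pd.to_datetime(["2023-01-01", "2021-05-01"]),
})
print(enforce_open_before_expire(df))  # keeps only the first row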
When introducing this Feature, one should first implement:
TBD
I have searched for issues similar to this one.
Many users are asking the same or similar questions.
We will first collect related/similar/same questions, then summarize them as FAQ in the document.
This will help new users get started with sdgx as soon as possible and resolve common doubts.
Details will be updated in our documentation.
I have searched for issues similar to this one.
Update class DatetimeInspector(Inspector)
in sdgx/data_models/inspectors/datetime.py
to achieve:
As with the existing DatetimeInspector, after running the fit method you can infer which columns belong to the datetime type. After implementing this Issue, DatetimeInspector will also be able to output the specific format strings of some of the columns (not all of them), which will help improve the quality of synthetic data.
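For example, format detection could try a list of candidate format strings and keep the first one that parses every value; the candidate list below is an illustrative assumption:

from typing import Optional
import pandas as pd

CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]

def detect_datetime_format(series: pd.Series) -> Optional[str]:
    values = series.dropna().astype(str)
    for fmt in CANDIDATE_FORMATS:
        try:
            # raises ValueError if any value does not match this format
            pd.to_datetime(values, format=fmt)
            return fmt
        except (ValueError, TypeError):
            continue
    return None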
Briefly speaking, the following steps are required:
- In the __init__ method, add parameters for user-specified formats;
- In the fit method, add the datetime format string matching steps;
- In the inspect method, add the output of the datetime format string, which will be passed to metadata.
I have searched for issues similar to this one.
Implement TVAE model for single table synthetic data generation.
TBD
TBD
TBD
I have searched for issues similar to this one.
In the transformer_opt module, add a function to the method below that writes the np.ndarray data output by the module to disk in npz format.
Compared with directly writing the entire CSV file, this function can effectively save disk space. Since the transformer_opt module already processes the CSV file in batches, writing npz files for each batch can reduce repeated batching when other modules process the data later, and it is also more convenient for parallel processing.
Modifications for this issue should be located in the sdgx/transform/transformer_opt.py path.
Please find the _synchronous_transform method in the class DataTransformer; it is necessary to add the parameter output_type to determine the storage type.
For the coding implementation details of this Issue, please refer to the comments in the following code block:
# add the parameter `output_type` to this method
def _synchronous_transform(self, input_data_path,
                           column_transform_info_list,
                           output_path,
                           output_type):  # new argument
    """Method Description ... """
    loop = True
    # has_write_header = True
    # use iterator=True
    reader = pd.read_csv(input_data_path, iterator=True, chunksize=1000000)
    while loop:
        # Existing code ...
        # Some code is omitted here for brevity
        # Add your code here
        chunk_array = np.concatenate(column_data_list, axis=1).astype(float)
        # file object
        f = open(output_path, 'a')
        np.savetxt(f, chunk_array, fmt="%g", delimiter=',')
        f.close()
    # end while
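A hedged sketch of the requested npz branch; the per-chunk file naming is an assumption, not a decided convention:

import numpy as np

def write_chunk(chunk_array, output_path, output_type, chunk_index):
    if output_type == "npz":
        # one compressed .npz file per processed batch
        np.savez_compressed(f"{output_path}.part{chunk_index}.npz", data=chunk_array)
    else:
        # fall back to the existing CSV append behaviour
        with open(output_path, "a") as f:
            np.savetxt(f, chunk_array, fmt="%g", delimiter=",")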
TBD
When a large amount of real data is used to train a CTGAN model, the current implementation does not work well.
Since all the data (a DataFrame) is loaded into memory during training, this causes huge memory consumption, which is not an elegant solution.
Fortunately, in this refactoring, sdgx provides the new DataLoader and the NDArrayLoader under development.
We can use these new data-related components to modify the Data transformer, Data sampler, and CTGAN model.
The data will not be loaded into the memory all at once, instead, the data will be loaded into the memory in rows or columns (chunks) according to needs, then the data will be used to train the model.
This will effectively reduce memory consumption and provide larger data processing capabilities.
TBD
I have searched for issues similar to this one.
As is common practice for the OpenAI API, developers are accustomed to using dotenv to manage environment variables (the OpenAI API key).
Currently, sdgx's single-table LLM (GPT) model has not yet used this convenient tool.
It is not very difficult; refer to the OpenAI interface documentation to understand the relevant usage.
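A typical python-dotenv usage sketch; the variable name follows the common OPENAI_API_KEY convention:

import os
from dotenv import load_dotenv

load_dotenv()  # reads a local .env file into the process environment
api_key = os.environ.get("OPENAI_API_KEY")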
I have searched for issues similar to this one.
Implement CTAB-GAN model for single table synthetic data generation.
TBD
TBD
TBD
Benchmarks aim to measure the performance of the library.
Now we provide a simple benchmark of our CTGAN implementation against the original one: fit both on a big random dataset and compare their memory consumption.
https://github.com/hitsz-ids/synthetic-data-generator/tree/main/benchmarks
Add benchmarks for: