

Term Project: Synthetic Data Generation using GAN

Why synthetic data?

In 2020, the amount of data on the internet hit 40 zettabytes. A zettabyte is about a trillion gigabytes.

minute_in_internet.png

So why do we need synthetic data?

There are many good reasons behind it. Here are some important ones:

  • Cost of preparing and labeling data
  • Prototype development
  • Edge-case simulation
  • Data privacy

TimeGAN:

  • TimeGAN was proposed in 2019 by Yoon et al. (NeurIPS 2019).

  • It differs from other GAN architectures: it has four components:

    • Generator
    • Discriminator
    • Recovery
    • Embedder
  • It introduces a supervised loss and an embedding network.

    • Supervised loss: the model captures the conditional distribution over time by using the original data as supervision.
    • Embedding network: it reduces the dimensionality of the adversarial learning space.
  • It can generate both static features and sequential (time-series) data at the same time.

  • It is less sensitive to hyperparameters.

  • Its training process is more stable.

TimeGAN_Block_Diagram.png

With respect to the loss functions:

  • Reconstruction loss: compares the reconstruction of the encoded data to the original data.
  • Supervised loss: measures how well the generator approximates the next time step in the latent space.
  • Unsupervised loss: the familiar min-max game between the generator and discriminator networks.
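To make the roles of the three losses concrete, here is a minimal NumPy sketch. It is not the project's training code: the arrays stand in for the outputs of the embedder, recovery, supervisor, and discriminator networks, which in the real model are recurrent networks.

```python
import numpy as np

def mse(a, b):
    """Mean squared error."""
    return float(np.mean((a - b) ** 2))

def bce(labels, probs, eps=1e-7):
    """Binary cross-entropy on discriminator probabilities."""
    probs = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 24, 5))             # a batch of real sequences
h = x + 0.01 * rng.normal(size=x.shape)     # stand-in latent codes (embedder output)
x_tilde = h                                 # stand-in recovery output

# Reconstruction loss: recovered sequences vs. the originals.
loss_reconstruction = mse(x_tilde, x)

# Supervised loss: predicted next latent step vs. the actual next step.
h_pred_next = h[:, :-1, :]                  # stand-in for supervisor(h[:, :-1])
loss_supervised = mse(h_pred_next, h[:, 1:, :])

# Unsupervised (adversarial) loss: discriminator scores on real vs. synthetic.
d_real = rng.uniform(0.6, 0.9, size=(8, 1))
d_fake = rng.uniform(0.1, 0.4, size=(8, 1))
loss_discriminator = bce(np.ones_like(d_real), d_real) + bce(np.zeros_like(d_fake), d_fake)
```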

There are three training phases:

  • Phase I: Training the autoencoder (embedder and recovery networks).
  • Phase II: Training the supervisor to capture the temporal behavior.
  • Phase III: The combined training of the generator, discriminator, and embedder. We try to minimize all three loss functions in this phase. Following the paper's suggestion, the generator and embedder are trained twice as often as the discriminator.
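The three-phase schedule can be sketched as a plain Python skeleton. The train_* functions are hypothetical stand-ins for the real optimization steps; the point is the ordering and the 2:1 generator-to-discriminator ratio in Phase III.

```python
# Hypothetical stand-ins for the real training steps.
def train_autoencoder_step():      # Phase I: embedder + recovery
    return "reconstruction"

def train_supervisor_step():       # Phase II: supervisor on latent sequences
    return "supervised"

def train_generator_step():        # Phase III: joint generator + embedder update
    return "generator"

def train_discriminator_step():
    return "discriminator"

def train_timegan(steps):
    log = []
    for _ in range(steps):                 # Phase I
        log.append(train_autoencoder_step())
    for _ in range(steps):                 # Phase II
        log.append(train_supervisor_step())
    for _ in range(steps):                 # Phase III
        for _ in range(2):                 # generator/embedder updated twice
            log.append(train_generator_step())
        log.append(train_discriminator_step())  # ...per discriminator step
    return log

schedule = train_timegan(steps=2)
```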

Data:

Data was fetched from Yahoo Finance and saved to (and shared from) my Google Drive. I used Google, Amazon, and Apple stock data for the last 3 to 5 years.
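TimeGAN trains on fixed-length windows rather than one long series, so the downloaded prices must be sliced and scaled first. Here is a minimal NumPy sketch of that preprocessing; the window length of 24 and the random-walk array standing in for the downloaded stock data are assumptions, not the project's actual settings.

```python
import numpy as np

def make_windows(series, seq_len=24):
    """Min-max scale each feature to [0, 1], then slice a (T, features)
    array into overlapping (seq_len, features) windows."""
    lo, hi = series.min(axis=0), series.max(axis=0)
    scaled = (series - lo) / (hi - lo + 1e-7)
    return np.stack([scaled[i:i + seq_len]
                     for i in range(len(series) - seq_len + 1)])

# Stand-in for downloaded stock data (1000 days, 5 features, e.g. OHLCV).
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=(1000, 5)), axis=0)

windows = make_windows(prices, seq_len=24)   # shape (977, 24, 5)
```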

Result:

I experimented with 10K, 20K, and 50K training steps, then compared the synthetic and real data in two ways: for each variable I plotted a sample of the data, and I reduced the dimensionality with PCA and t-SNE and plotted the results in two dimensions.
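The 2-D comparison idea can be sketched with a NumPy-only PCA (via SVD); the real and synthetic arrays below are illustrative stand-ins, and the actual project presumably used scikit-learn's PCA and t-SNE.

```python
import numpy as np

def pca_2d(data):
    """Project rows of `data` onto the first two principal components via SVD."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 5))                     # stand-in real samples
synthetic = real + 0.3 * rng.normal(size=(200, 5))   # stand-in synthetic samples

# Fit PCA on the combined set so both clouds share one projection;
# the two 2-D clouds can then be scatter-plotted for visual comparison.
combined = np.vstack([real, synthetic])
proj = pca_2d(combined)
real_2d, synth_2d = proj[:200], proj[200:]
```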

Finally, two regressors were trained:

  • one on real data
  • another one on synthesized data

Then both are tested on real data, and the results are compared.
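This "train on synthetic, test on real" comparison can be sketched as follows. An ordinary least-squares regressor stands in for the project's regressor, and the synthetic arrays are noisy copies of the real ones purely for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with a bias term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

rng = np.random.default_rng(2)
true_w = np.array([0.5, -1.0, 0.3, 0.8])
X_real = rng.normal(size=(300, 4))
y_real = X_real @ true_w + 0.05 * rng.normal(size=300)
X_synth = X_real + 0.1 * rng.normal(size=X_real.shape)  # stand-in synthetic data
y_synth = y_real + 0.1 * rng.normal(size=300)

X_test = rng.normal(size=(100, 4))                      # held-out real data
y_test = X_test @ true_w

# One regressor per data source; both evaluated on the same real test set.
mae_real = mae(y_test, predict(fit_linear(X_real, y_real), X_test))
mae_synth = mae(y_test, predict(fit_linear(X_synth, y_synth), X_test))
```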

The runtime:

The training is computationally intensive, especially in Phase III, where the generator and embedder are trained twice as often as the discriminator.

I ran it on a ThinkPad with a 10th Gen i7 CPU and 16GB of memory, with no NVIDIA GPU. Here are the runtime stats:

  • For 10K:
    • Phase I (Embedding network): ~15m
    • Phase II (Supervisor network): ~15m
    • Phase III (Joint network training): ~2h 30m
  • For 20K:
    • Phase I (Embedding network): ~30m
    • Phase II (Supervisor network): ~30m
    • Phase III (Joint network training): ~3h 30m
  • For 50K:
    • Phase I (Embedding network): ~7h
    • Phase II (Supervisor network): ~7h
    • Phase III (Joint network training): ~21h

As you can see, the runtime increases significantly, but we do not benefit much from the extra time.

Data Comparison:

After training for 10K steps, here are the variable comparisons and the PCA and t-SNE plots.

data_comparison.png

synthetic_vs_real.png

The regressor shows promising results.

train_synth_test_real.png

After training for 20K steps, here are the variable comparisons and the PCA and t-SNE plots.

data_comparison.png

synthetic_vs_real.png

The regressor shows promising results.

train_synth_test_real.png

It is evident that after spending almost twice the time training the model, we didn't gain much in terms of predictive power. However, the PCA and t-SNE plots show a tighter fit.

Finally, I trained the TimeGAN for 50K steps; here are the variable comparisons and the PCA and t-SNE plots.

data_comparison.png

synthetic_vs_real.png

The regressor shows promising results.

train_synth_test_real.png

I can fairly conclude that the extra training wasn't worth the time spent. We even lost a bit of predictive power, probably due to the added noise.

References:

  • Yoon, J., Jarrett, D., & van der Schaar, M. (2019). Time-series Generative Adversarial Networks. NeurIPS 2019.

Contributors: shsab
