Data Balance Simulator CLI

To run the Swift REPL in a Docker container:

docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined swift:5.10.1 swift repl

Run

To run the project in a container:

make run CONFIG_FILE_PATH=config-files/base_config.json SIMULATOR_ARGS=[...]

If the environment already has Swift installed (e.g. when you are developing with the VSCode devcontainer feature):

make run IS_DEVCONTAINER=true CONFIG_FILE_PATH=config-files/base_config.json SIMULATOR_ARGS=[...]

Simulation Complexity

The number of simulations is determined by the execution parameters:

$\sum_{s = MinServices}^{MaxServices} \sum_{n = MinNodes}^{MaxNodes} \sum_{w = 1}^{\min(n, MaxWindowSize)} \left( s^{w} \cdot (n - w + 1) + n \right)$

An execution with:

$ nodes = 5 \newline services = 6 \newline maxWindowSize = 4 \newline $

Includes the following number of samplings:

$ winSize = 1 \to samplings = (6^{1}) * 5 + 5\newline winSize = 2 \to samplings = (6^{2}) * 4 + 5\newline winSize = 3 \to samplings = (6^{3}) * 3 + 5\newline winSize = 4 \to samplings = (6^{4}) * 2 + 5\newline $

$6^{w}$ is the number of service combinations in a window of size $w$ (in general, $s^{w}$); it is multiplied by the number of windows in a simulation, $n - w + 1$. Once a service is chosen for a node, it is executed and the resulting dataset is stored and cached. The $+ n$ term accounts for these executions (one service for each node).
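
The formula above can be checked with a few lines of Python. This is only a sketch of the counting rule, not code from the simulator; the function and parameter names (`samplings_for_window`, `count_samplings`, and so on) are hypothetical and simply mirror the symbols in the formula.

```python
def samplings_for_window(services: int, nodes: int, window_size: int) -> int:
    """Samplings for one (s, n, w) triple: s^w * (n - w + 1) + n."""
    combinations = services ** window_size   # service combinations in one window
    windows = nodes - window_size + 1        # number of windows in a simulation
    return combinations * windows + nodes    # + n: one executed service per node


def count_samplings(min_services: int, max_services: int,
                    min_nodes: int, max_nodes: int,
                    max_window_size: int) -> int:
    """Total samplings over all service counts, node counts and window sizes."""
    total = 0
    for s in range(min_services, max_services + 1):
        for n in range(min_nodes, max_nodes + 1):
            for w in range(1, min(n, max_window_size) + 1):
                total += samplings_for_window(s, n, w)
    return total


# The example from the text: nodes = 5, services = 6, maxWindowSize = 4
per_window = [samplings_for_window(6, 5, w) for w in range(1, 5)]
print(per_window)                        # [35, 149, 653, 2597]
print(count_samplings(6, 6, 5, 5, 4))    # 3434
```

Because the $s^{w}$ term grows exponentially with the window size, the largest window dominates the total (2597 of the 3434 samplings in this example).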


Datasets

Datasets are located in the datasets folder. The following table describes the characteristics of each dataset:

| Dataset | Mean column entropy | Variance of column entropy | Std dev of column entropy |
| --- | --- | --- | --- |
| high_variability | 11.80 | 0.24 | 0.49 |
| low_variability | 1.7 | 0.2 | 0.45 |
| inmates_enriched_10k | 5.35 | 13.09 | 3.62 |
| IBM_HR_Analytics_employee_attrition | 3.13 | 8.56 | 2.93 |
| red_wine_quality | 5.61 | 2.01 | 1.42 |
| avocado | 9.36 | 22.13 | 4.7 |

To compute the entropy of each column and the statistics reported in the table:

import numpy as np
import pandas as pd

# Name of the dataset file without the .csv extension (example value, change as needed)
dataset_name = "high_variability"
dataset = pd.read_csv(dataset_name + ".csv")

def get_column_probability(column: pd.Series) -> pd.Series:
    # Relative frequency of each distinct value in the column
    return column.value_counts(normalize=True)

def get_column_entropy(column: pd.Series) -> float:
    # Shannon entropy in bits: -sum(p * log2(p))
    column_probability = get_column_probability(column)
    return float(-np.sum(column_probability * np.log2(column_probability)))


# Mean, (population) variance and standard deviation of the per-column entropies
entropies = [get_column_entropy(dataset[column]) for column in dataset.columns]
print(f"{round(np.mean(entropies), 2)}, {round(np.var(entropies), 2)}, {round(np.std(entropies), 2)}")
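
As a quick sanity check of the entropy definition, the same quantity can be computed for a toy column with only the standard library. This is a sketch for illustration; `column_entropy` is a hypothetical helper, not part of the project.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def column_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four equally likely values carry log2(4) = 2 bits
print(column_entropy(["a", "b", "c", "d"]))   # 2.0
# Two equally likely values carry 1 bit
print(column_entropy(["a", "a", "b", "b"]))   # 1.0
```

A column with a single repeated value has entropy 0, so low averages in the table indicate columns dominated by a few values, while high averages indicate many distinct, evenly spread values.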

Logging

To set the logger level, set an environment variable called LOGGER_LEVEL to one of the following values: trace, debug, info, notice, warning, error, critical (default is info). Alternatively, pass this variable to make run.


DB Migrations and DB queries

To apply a DB migration, run make migrate-db SQL_CODE="your_migration_sql".

To run queries on the DB, run make run-query SQL_CODE="your_plain_sql".


Deepnote experiments
