Giter Club home page Giter Club logo

badges's Introduction

Code used in the following paper:

T. Kusmierczyk, M. Gomez-Rodriguez: On the Causal Effect of Badges. WWW 2018. (SLIDES)


Framework for passive causal inference

We developed a novel causal inference framework designed for cases where the goal is to validate an effect of the change applied to a group of users at a certain time point.

Our framework avoids modeling the mechanisms underlying individual user actions and instead adopts a data-driven approach based on survival analysis and statistical hypothesis testing. At the heart of our approach there are two technical innovations:

  1. a robust survival-based hypothesis testing procedure, inspired by the discrete choice literature on latent variable models, which allows us to account for the utility heterogeneity,
  2. a bootstrap difference-in-differences method, inspired by the economics literature on natural experiments, which allows us to control for the random fluctuations in users'{} behavior over time.

Application: first-time digital badges

A wide variety of online platforms use digital badges to encourage users to take certain types of desirable actions. However, despite their growing popularity, their causal effect on users' behavior is not well understood. This is partly due to the lack of counterfactual data and the myriad of complex factors that influence users' behavior over time. As a consequence, their design and deployment lacks general principles.

We focus on first-time badges, which are awarded after a user takes a particular type of action for the first time, and study their causal effect by harnessing the delayed introduction of several badges in a popular Q&A website. In doing so, we introduce a novel causal inference framework for badges whose main technical innovations are a robust survival-based hypothesis testing procedure, which controls for the utility heterogeneity across users, and a bootstrap difference-in-differences method, which controls for the random fluctuations in users' behavior over time.

Data

In our experiments, we used StackOverflow data archive that was processed to extract the following files:

  • data/badges/tagedits/tageditor.tsv - data related to TagEditor badge
  • data/badges/bounties/bounty1.tsv - data related to the bounty badge of type1
  • data/badges/bounties/bounty2.tsv - data related to the bounty badge of type2
  • data/badges/user_features.tsv - users' characteristics necessary to calculate SMDs

Files are TAB separated and every included event consist of user id, time and event type (either eligiblity time, action time or data censoring).

Code

  • Real data experiments - src/experiments:

    • RUN_ALL_EXPERIMENTS.sh - Runs main experiments for TagEditor badge and bounties and plots the results.
    • sliding_window.py - Calculates p-values from various models (Fisher=counts, Simple=simple survival, Robust='Bayesian' with priors) on real world data using sliding window. Outputs a TSV file containing the test statistics for each time point.
    • sliding_window_plot.py - Visualises outputs of the sliding_window.py.
    • sliding_window_extract_users.py - Extracts users appearing in each of the (sliding) window.
    • sliding_window_extract_users_smd.py - Calculates SMD scores using results (user lists for each of the windows) and users' characteristics (data/badges/user_features.tsv).
  • Synthetic data experiments - src/synthetic:

    • RUN_ALL_SIMULATIONS.sh - Runs main experiments on synthetic data and plots the results.
    • sliding_window.py - Calculates p-values from various models (Fisher=counts, Simple=simple survival, Robust='Bayesian' with priors) using sliding window over synthetically generated data. Outputs a TSV file containing the test statistics for each time point.
    • sliding_window_plot.py - Analysis (including empirical p-value calculation) and plotting of output data from sliding_window.py
    • sliding_window_plot_survival.py - Plotting intensity values extracted from output data calculated in sliding_window.py
    • thinning.py - Sampling from non-homogenous Poisson temporal point processes with known maximum.
  • Modeling and fitting of non-homogenous Poisson temporal point processes - src/processes:

    • process.py - Model of a survival process with one hazard rate until switching point and another after that.
    • process_factory.py - Factory that produces processes (process.py)
    • trend_process.py - Model of a survival process with hazard rate a1+trend*time until switching point and a2+trend*time after that.
    • trend_process_factory.py - Factory that produces processes (trend_process.py)
    • fit_simple.py - Fits the simple (intensities shared between users) survival model in two cases. First, assuming intensity to be constant (H0) and second, allowing for intensity change around certain time point (H1).
    • fit_bayes.py - Base functions for fitting of the robust (bayesian with priors) survival model.
    • fit_bayes1.py - Fits the robust (Bayesian with priors) survival model under H0.
    • fit_bayes2.py - Fits the robust (Bayesian with priors) survival model under H1.
  • Statistical tests - src/testing:

    • fisher_test.py - Calculation of p-values from Fisher exact test on counts (before vs after change).
    • wilks_test.py - Calculation of p-values from log-likelihood ratio test of nested models using Chi2 distribution (Wilks theorem).
    • bootstrap.py - Helpers to calculate p-value from empirical distribution of the test statistic.
  • Auxiliary files used for plotting, arguments parsing and data processing - src/analytics, src/aux

Execution and multiprocessing

Two main scripts to reproduce our results are src/experiments/RUN_ALL_EXPERIMENTS.sh and src/synthetic/RUN_ALL_EXPERIMENTS.sh. We support multiprocessing, e.g., every time window can be executed in a separate process. The number of used cores can be specified inside the scripts.

badges's People

Contributors

tkusmierczyk avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.