Giter Club home page Giter Club logo

transition-counting-gpu's Introduction

Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs

This repository contains the source code for the paper Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs presented at the 12th International Workshop on Business Process Intelligence (BPI 2016) in Rio de Janeiro, Brazil, September 2016.

Source files

The source code comprises the following files:

  • simulator.py - A Python script to generate event logs of various sizes. The input parameter for this script is the number of cases that the event log should contain.

  • preproc.py - A Python script to do the preprocessing of an event log. This script receives a CSV file (.csv) as input and produces a binary file (.pre) as output.

  • common.h - Some common routines and variables used by all programs, namely to read the event log in the binary format (.pre) created by the previous script.

  • flow1.h - The single-threaded CPU version of the flow algorithm.

  • flow2.h - The multi-threaded CPU version of the flow algorithm.

  • flow3.h - The GPU version of the flow algorithm.

  • flow_cpu.cc - The C/C++ program that runs the multi-threaded CPU version of the flow algorithm.

  • flow_gpu.cu - The CUDA C program that runs the GPU version of the flow algorithm.

  • handover1.h - The single-threaded CPU version of the handover algorithm.

  • handover2.h - The multi-threaded CPU version of the handover algorithm.

  • handover3.h - The GPU version of the handover algorithm.

  • handover_cpu.cc - The C/C++ program that runs the multi-threaded CPU version of the handover algorithm.

  • handover_gpu.cu - The CUDA C program that runs the GPU version of the handover algorithm.

  • together1.h - The single-threaded CPU version of the together algorithm.

  • together2.h - The multi-threaded CPU version of the together algorithm.

  • together3.h - The GPU version of the together algorithm.

  • together_cpu.cc - The C/C++ program that runs the multi-threaded CPU version of the together algorithm.

  • together_gpu.cu - The CUDA C program that runs the GPU version of the together algorithm.

  • runtests.py - A Python script that runs a series of tests and generates results in a similar form to Table 2 in the paper.

How to compile the code

Operating system: Ubuntu (or another Linux distro).

Check that you have all the necessary pre-requisites, namely CUDA. Also, make sure that CUDA is properly configured, as described here. A good place to configure the environment variables (PATH and LD_LIBRARY_PATH) is in ~/.bashrc.

Use the Makefile to compile the code. However, before running make, change the value of the ARCH variable to the compute capability of your GPU. For a list of GPUs and their compute capabilities, check this page.

Generating the event logs

Use simulator.py to generate event logs with size ranging from 10 to 107 cases. For this purpose, run the following commands:

  • python simulator.py 10 > eventlog_10.csv
  • python simulator.py 100 > eventlog_100.csv
  • python simulator.py 1000 > eventlog_1000.csv
  • python simulator.py 10000 > eventlog_10000.csv
  • python simulator.py 100000 > eventlog_100000.csv
  • python simulator.py 1000000 > eventlog_1000000.csv
  • python simulator.py 10000000 > eventlog_10000000.csv

Note that generating the larger event logs may take a significant amount of time.

Preprocessing the event logs

After generating the event logs, run the following commands to preprocess them:

  • python preproc.py 0 1 2 3 eventlog_10.csv
  • python preproc.py 0 1 2 3 eventlog_100.csv
  • python preproc.py 0 1 2 3 eventlog_1000.csv
  • python preproc.py 0 1 2 3 eventlog_10000.csv
  • python preproc.py 0 1 2 3 eventlog_100000.csv
  • python preproc.py 0 1 2 3 eventlog_1000000.csv
  • python preproc.py 0 1 2 3 eventlog_10000000.csv

The arguments 0 1 2 3 specify that the case id is in column 0 (first column), task is in column 1, user is in column 2, and timestamp is in column 3, respectively.

You may be surprised that preprocessing the largest event log can take about 5min. Most of this time is spent on reading the data (i.e. parsing the CSV file). This is precisely the reason why we decided to separate the preprocessing from the main programs. This way we can preprocess the event log only once, and then run several different programs on it, without having to reprocess it again.

Running the programs

To run the multi-threaded CPU version of each algorithm, you can use instructions similar to the following:

  • ./flow_cpu 100 4 eventlog_1000000.pre
  • ./handover_cpu 100 4 eventlog_1000000.pre
  • ./together_cpu 100 4 eventlog_1000000.pre

where the first argument is the number of runs, and the second argument is the number of threads. As a rule of thumb, set the number of threads to be equal to the number of physical cores available in the CPU.

To run the GPU version of each algorithm, you can use instructions similar to the following:

  • ./flow_gpu 100 128 eventlog_1000000.pre
  • ./handover_gpu 100 128 eventlog_1000000.pre
  • ./together_gpu 100 128 eventlog_1000000.pre

where the first argument is the number of runs, and the second argument is the number of threads per block to be used on the GPU. Please check the hardware architecture of your GPU to determine the number of threads per block that should be used. As a rule of thumb, use a number of threads per block that is equal to the number of cores per SM (streaming multiprocessor). For example, all Maxwell generation cards (e.g. GTX 750 Ti, GTX 980, GTX Titan X) have 128 cores per SM.

Running a series of tests

For convenience, we provide a Python script that runs all algorithms on all event logs. Note that the event logs must have been preprocessed first by preproc.py (see above).

To run the tests, execute:

  • python runtests.py

This will present the results in a similar form to Table 2 in the paper.

Testing on the BPI Challenge 2016 event logs

To run the experiment that we describe in the paper, do the following:

  • Grab the largest event log for BPI Challenge 2016 (BPI2016_Clicks_NOT_Logged_In.csv) from here.

  • We use the columns SessionID as case id, and PAGE_NAME both as task and user. Therefore, preprocess the event log with the following command:

 python preproc.py 0 5 5 2 BPI2016_Clicks_NOT_Logged_In.csv

  • Run the flow algorithm on the GPU (adjust the number of threads per block according to the number of cores per SM):

 ./flow_gpu 1 128 BPI2016_Clicks_NOT_Logged_In.pre

  • Run the together algorithm on the CPU (adjust the number of threads according to the number of physical cores):

 ./together_cpu 1 4 BPI2016_Clicks_NOT_Logged_In.pre

How to cite this work

See the publisher's website to download a citation in the desired format.

transition-counting-gpu's People

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.