SURF Automatic Collection Engine (SURFace)

This repository contains several scripts to analyze and visualize data collected from SURF's Lisa cluster. The data can be found on Zenodo at https://zenodo.org/record/4459519.

Usage

  1. Download the dataset using the link mentioned above.
  2. Clone this repository to some folder.
  3. For each script, modify the paths as required. "/path/to/surfsara-jobdata/", "path to machine metric dataset", and variants should point to the dataset downloaded in step 1. ./cache should point to a location where scratch data can be stored.
  4. Run the notebooks on a machine that has 64GB or more RAM, as some analyses hold large amounts of data in memory. Some scripts require a Spark cluster because of the volume of data and processing involved. If a script uses Koalas, you will most likely need to set up a small Spark cluster; 4-10 machines, each with 64GB or more RAM, will suffice. In correlation_plot_koalas.py, we use 5 machines (1 master, 4 workers), each with 64GB of RAM.
  5. Figures are written to the folder where the notebook resides, or to whichever output path you configure for Matplotlib/Seaborn. Tables are printed in the notebook as strings.
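The path setup in step 3 can be sketched as follows. This is a minimal illustration, not code taken from the repository: the names `DATASET_ROOT`, `CACHE_DIR`, and `prepare_paths` are hypothetical, and the placeholder path must be replaced with the location of your own download.

```python
from pathlib import Path

# Hypothetical locations (assumptions): point DATASET_ROOT at the folder
# holding the Zenodo download and CACHE_DIR at any writable scratch folder.
DATASET_ROOT = Path("/path/to/surfsara-jobdata/")
CACHE_DIR = Path("./cache")


def prepare_paths(dataset_root: Path, cache_dir: Path) -> Path:
    """Verify the dataset folder exists and create the scratch folder if needed."""
    if not dataset_root.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {dataset_root}")
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

Calling `prepare_paths(DATASET_ROOT, CACHE_DIR)` at the top of a notebook fails early with a clear message if the dataset path has not been adjusted yet.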

Scripts

The scripts generally have self-describing names. Below we provide some more details per script.

| Script | Explanation |
| --- | --- |
| !LSTM V2.ipynb | Investigates the effect of different sampling intervals on predictions of metric values. |
| !Network data analysis.ipynb | Performs various analyses related to network IO. |
| !Z-Score.ipynb | Investigates whether anomalies can be detected using z-scores. |
| !jobdata_analysis_new.ipynb | Performs various analyses related to the jobs executed within Lisa. |
| Full_Cluster_bottleneck_analysis.ipynb | Creates a holistic normalized overview of the dataset by aligning job arrivals with various machine metrics. |
| Generic_outline_dataset.ipynb | Computes various generic properties of the dataset. The overview table below in the readme is constructed using this script. |
| correlation_single_rack_one_day.ipynb | Computes the Pearson, Spearman, and Kendall correlation coefficients for all pairs of metrics within the dataset on individual days. |
| analysis_coefficient_separate_days.ipynb | Visualizes in various ways the output of correlation_single_rack_one_day.ipynb. |
| correlation_plot_koalas.py | Computes a dense correlation plot of normalized histograms and scatterplots with linear regression lines per metric pair, and visualizes the Pearson, Spearman, and Kendall correlation coefficients per metric pair. |
| koalas_correlation_plot_data_only.ipynb | Creates a better visualization of the plot of correlation_plot_koalas.py by creating a variant of Seaborn's pairgrid. |
| daily_weekly_trend_load.ipynb | Creates several weekly and diurnal trend visualizations. |
| file_sizes_different_granularities.ipynb | Computes the storage overhead for different sampling frequencies using a selection of metrics. |
| generate_barplots.py | Generates barplots of metric values in covid vs non-covid periods. |
| generate_boxplots.py | Generates boxplots of metric values in covid vs non-covid periods. |
| job_arrival_characterization.ipynb | Creates several visualizations and performs different kinds of analyses based on job arrivals. |
| mean_memory_utilization_nodes.ipynb | Analyzes different aspects of the node RAM usage and creates several visualizations. |
| power_consumption_analysis.ipynb | Performs several analyses on the rack and power consumption and creates several visualizations. |
| rack_temp_noenc.py | Analyzes node temperatures for various racks and creates visualizations of them. |
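As an illustration of the idea behind the z-score script, the sketch below flags samples that lie far from the mean of a series. It is not the code from the notebook: the function name `zscore_anomalies`, the threshold of 3 standard deviations, and the use of the population standard deviation are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return (index, value, z) for samples whose |z-score| exceeds the threshold.

    The threshold of 3.0 is a common convention and an assumption here,
    not a value taken from the repository.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # constant series: no sample deviates from the mean
    anomalies = []
    for index, value in enumerate(values):
        z = (value - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((index, value, z))
    return anomalies


# Hypothetical metric series: twenty steady readings followed by one spike.
readings = [50.0] * 20 + [95.0]
flagged = zscore_anomalies(readings)
```

Here `flagged` contains only the final reading at index 20. Note that a single extreme value inflates both the mean and the standard deviation, so short series can hide anomalies under a fixed threshold.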

Outline of the dataset


The dataset spans from 2019-12-29 to 2020-08-07.

| Element | Value |
| --- | --- |
| Sampling frequency | 15 seconds |
| Max. samples per metric per node | 1,258,646 |
| Number of metrics | 327 |
| Number of measurements | 66,541,895,243 |
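The maximum sample count can be sanity-checked against the date range and the sampling interval. The sketch below assumes the range runs from midnight on the start date to midnight on the end date; the exact start and end timestamps are not stated in this readme, and the function name `max_samples` is hypothetical.

```python
from datetime import date

SAMPLING_INTERVAL_S = 15          # from the table above
START = date(2019, 12, 29)        # first day of the dataset
END = date(2020, 8, 7)            # last day, treated as exclusive (assumption)


def max_samples(start, end, interval_s):
    """Upper bound on samples per metric per node for a gap-free series."""
    days = (end - start).days
    return days * 24 * 60 * 60 // interval_s


upper_bound = max_samples(START, END, SAMPLING_INTERVAL_S)
print(upper_bound)  # 1278720
```

The reported maximum of 1,258,646 samples is about 98% of this bound, which is consistent with a 15-second interval and some gaps in collection. Counting the end date as a full day would raise the bound by 5,760 samples.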

Libraries used

Most tools in this repository were created and tested using the following libraries and their versions:

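| Library | Version |
| --- | --- |
| Pandas | 1.2.0 |
| NumPy | 1.19.4 |
| SciPy | 1.5.3 |
| statsmodels | 0.12.1 |
| pytz | 2020.4 |
| scikit-learn | 0.24.0 |
| TensorFlow | 2.3.1 |
| pyarrow | 3.0.0 |
| Dask | 2021.03.0 |
| Matplotlib | 3.4.1 |
| Seaborn | 0.11.1 |
| Koalas | 1.5.0 |
| Spark | 3.0.0 |
| Hadoop | 2.7.7 |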
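To compare a local environment against the versions listed above, something like the following sketch may help. The distribution names in `EXPECTED` (for example `scikit-learn` and `pyspark`) are assumptions about how the libraries are named on PyPI, only a subset is shown, and `compare_versions` is a hypothetical helper. Hadoop is not a Python package and is left out.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names mapped to the versions listed in this readme.
EXPECTED = {
    "pandas": "1.2.0",
    "numpy": "1.19.4",
    "scipy": "1.5.3",
    "statsmodels": "0.12.1",
    "scikit-learn": "0.24.0",
    "pyarrow": "3.0.0",
    "matplotlib": "3.4.1",
    "seaborn": "0.11.1",
    "koalas": "1.5.0",
    "pyspark": "3.0.0",
}


def compare_versions(expected):
    """Map each package to (wanted, installed or None, exact match?)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in compare_versions(EXPECTED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: wanted {wanted}, found {installed} ({status})")
```

A mismatch does not necessarily mean a script will fail; the listed versions are only those the tools were created and tested with.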

Contributors

casgr, cetinmehmet, lfdversluis, magthalion, saradamian, vcodreanu
