Giter Club home page Giter Club logo

kaggle-lanl-earthquake-prediction's Introduction

Kaggle-LANL-Earthquake-Prediction

26th place entry (out of 4540) for Kaggle LANL Earthquake Prediction competition

The goal of this competition was to predict the timing of a laboratory earthquake based on acoustic time series data. The acoustic_data input signal is used to predict the time remaining before the next laboratory earthquake (time_to_failure).

More info can be found at the Kaggle site: https://www.kaggle.com/c/LANL-Earthquake-Prediction/overview

Solution

Data

Data

The time series data shows 16 experiments, with 14 of them complete. There are approximately 600,000,000 acoustic time samples in the data. Each time sample also has the time remaining before the next earthquake occurs. The 'timetofailure' data is retrospective -not causal. At the time point the experiment is started, no one knows the timetofailure, since it has not yet physically failed in the run. That is what makes this prediction problem hard.

The test data reflects many small (150,000 time sample) segments. For each the model is required to predict a single time to failure.

Data Selection

Only training data reflecting the 14 completed experiments were used in the training to help ensure the training data distribution reflects full experiments. Training data was segmented into 150,000 sample batches in order to match the test conditions. Batches were taken out of each of the 14 experiments. Batches were selected so that experiments were not mixed within a batch. Batches were allowed to overlap within a particular experiment, using a step of 30,000 for a batch size of 150,000.

Feature Generation

Feature Generation was inspired by this starter kernel: https://www.kaggle.com/artgor/earthquakes-fe-more-features-and-samples The following additional features were added into the mix for final selection:

Matched filtering

Matched Filtering

A matched filtering approach was used to generate some additional features. Matched filtering is a way to identify patterns in a time series and is used in radar, sonar, well as earthquake analysis. I reviewed the acoustic data in Audacity and identified what looked like common and repeated patterns both early and late in the experiments. These short samples (1000-3000 time stamps) were used to build a set of matched filter detectors using cross correlation between the sample and the training data set.

Hilbert transform for envelope detection

The hilbert transform was used to derive the analytical envelope of the matched filter outputs as well as the unfiltered training data. For the unflitered training data the hilbert transform was low-pass filtered and the 10th percentile value was used (over the 150,000 time samples) in order to reflect the amount of peaks.

Short Term Fourier Transform (STFT)

A short term fourier transform was applied to both the match filtered and raw training data at various frequency bands. The bands were selected based on bands with most signal energy during exploratory FFT analysis of the whole training data set.

Peak detection

Peak detection was used on both the matched filter features as well as STFT features.

Total Feature Space

As a result of all the combinations of the above approaches 451 features were generated.

Feature Selection

Approach 1 - correlation

Features were ranked in order of absolute correlation (ignoring features where correlation had no statistical significance)

Approach 2 - model-based

Feature selection

Features were selected using the feature importances given by the LGBM model. 5- fold cross-validation was used, grouping experiments within the folds. This resulted in an ordered list of features.

Training and Model combinations

The following commbinations were made, resulting in 384 models n_fold of 5, 14, 20. For n_fold of 14, experiments were not split between folds. number of features of 10, 15, 20, 30, 50, 60, 70, all features sordered via correlation, or via model number of leaves of 31, 51, 91, 121 boost type of gbdt or random forest

Model selection

Model selection was based on the best cross-validation scores. The contest allowed two submissions to be considered. These were selected so one was a result of cross-validation which does not group experiments, and one where experiment grouping was allowed.

MPreds

Full notebook (via nbviewer): https://nbviewer.jupyter.org/github/filipmu/Kaggle-LANL-Earthquake-Prediction/blob/master/Kaggle%20LANL%20Earthquake%20Prediction%2013-best%20scoring%20submission.ipynb

kaggle-lanl-earthquake-prediction's People

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Forkers

chihhao428

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.