gillesvandewiele / ehg-oversampling
Reproducing feature engineering & oversampling experiments on TPEHG DB and assessing the real impact of over-sampling
Has an entire list of features (some are similar to other related work):
Proposes the following features:
A support vector machine (SVM) was implemented to classify the features; notably, using the 10 best-ranked features, the accuracy, sensitivity, and specificity were 97.1%, 95%, and 99%, respectively.
Like Acharya, but with Yule-Walker AR
In this work, SVM classifier with RBF kernel function is used for classification of term and preterm delivery using EHG signal records.
Using the ADASYN method, we resample the obtained features, increasing the dataset by 514 synthetic samples.
The system achieves 95.5% accuracy on publicly available Term-Preterm EHG Database.
In this research, four types of features are extracted from the EHG signals for categorizing the EHG waveforms: median frequency [33], Shannon energy [34], log energy [35], and the Lyapunov exponent [36].
This research uses support vector machine (SVM)
In this research, the adaptive synthetic sampling approach (ADASYN) [31, 37, 38] is used
Our approach shows an improvement on existing studies with 96% sensitivity, 90% specificity, and a 95% area under the curve value with 8% global error using the polynomial classifier.
['Weight', 'Rectime', 'Age', 'Parity', 'Abortions']
|| Categorical: ['Hypertension', 'Diabetes', 'Placental_position', 'Bleeding_first_trimester', 'Bleeding_second_trimester', 'Funneling', 'Smoker']
logistic classifier
SMOTE
the proposed approach shows an improvement on existing studies with 89% sensitivity, 91% specificity, 90% positive predictive value, 90% negative predictive value, and an overall accuracy of 90%
root mean squares, peak frequency, median frequency, and sample entropy.
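These four features can be sketched with plain NumPy as below. The sampling rate, embedding dimension `m`, and tolerance factor are assumed defaults for illustration, not values taken from the paper:

```python
import numpy as np

def basic_ehg_features(signal, fs=20.0, m=2, r_factor=0.15):
    """Sketch of the four features above; fs, m, r_factor are assumptions."""
    # Root mean square of the signal
    rms = np.sqrt(np.mean(signal ** 2))

    # One-sided power spectrum via the FFT
    psd = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

    # Peak frequency: frequency bin with maximal power
    peak_freq = freqs[np.argmax(psd)]

    # Median frequency: frequency below which half of the total power lies
    cumulative = np.cumsum(psd)
    median_freq = freqs[np.searchsorted(cumulative, cumulative[-1] / 2.0)]

    # Sample entropy: -log of the ratio of (m+1)- to m-length template matches
    r = r_factor * np.std(signal)
    def count_matches(mm):
        templates = np.array([signal[i:i + mm] for i in range(len(signal) - mm)])
        dists = np.max(np.abs(templates[:, None] - templates[None, :]), axis=2)
        return np.sum(dists <= r) - len(templates)  # exclude self-matches
    b, a = count_matches(m), count_matches(m + 1)
    samp_en = -np.log(a / b) if a > 0 and b > 0 else np.inf

    return {'rms': rms, 'peak_freq': peak_freq,
            'median_freq': median_freq, 'sample_entropy': samp_en}
```

The pairwise-distance sample entropy is O(n²) in memory, so for long EHG signals a windowed or chunked variant would be needed in practice.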
The self-organised network inspired by the immune algorithm is developed to improve recognition and generalization capability of the backpropagation neural networks.
The first evaluation uses the original TPEHG dataset (38 preterm and 262 term); the preterm records are oversampled using min and max to produce 262 preterm records.
--> Very unclear...
Sounds like a modeling technique that never gets used... But other, more well-known techniques are reported in the paper as well, so we can use one of those (Decision Trees, SVM).
Extract all features for each of the signals and for each of the channels. It is important to note that some preprocessing of the signal might be required, which should be checked against the literature or tuned as a hyper-parameter. E.g. digital filtering (although this is already done for the TPEHGDB dataset), removal of the first and last measurements (I found that cutting off 3000 values from the start and end better reproduced the provided features), normalization, ...
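A per-channel extraction loop following the trimming/normalization note above might look like the sketch below. The `TRIM` value comes from the note; the `feature_fn` callable is a placeholder for whatever extractor is used (the real ones live in `ehgfeatures`):

```python
import numpy as np

# Empirically, cutting 3000 samples from both ends better reproduced the
# provided features (see the note above); this value is an assumption here.
TRIM = 3000

def extract_per_channel(record, feature_fn):
    """record: array of shape (n_channels, n_samples); feature_fn is any
    callable mapping a 1-D signal to a dict of features (hypothetical)."""
    features = {}
    for ch, signal in enumerate(record):
        trimmed = signal[TRIM:-TRIM]                          # drop transients
        trimmed = (trimmed - trimmed.mean()) / trimmed.std()  # normalise
        for name, value in feature_fn(trimmed).items():
            features['ch{}_{}'.format(ch, name)] = value
    return features
```

Whether normalization helps is itself something to tune, per the note above.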
All the ranked features are fed to a support vector machine (SVM) classifier for automated differentiation, achieving an accuracy of 96.25%, sensitivity of 95.08%, and specificity of 97.33% using only ten EHG signal features
The signals are decomposed only up to 11 IMFs
In this work, a 6-level WPD is applied to each IMF of 300 EHG signals using Daubechies 8 (db8) wavelets, obtaining a total of 12 coefficients.
Feature selection (and oh yes, they of course used all data to do this...) is applied to obtain 10 features:
we have employed data balancing using adaptive synthetic sampling approach (ADASYN)
Overall, our results show a clear improvement in prediction accuracy of preterm delivery risk compared with previous approaches, achieving an impressive maximum AUC value of 0.986 when using signals from an electrode positioned below the navel
Entropy is one of the most widely used complexity measures in biomedical signal analysis [41]. In our study Shannon entropy was used to calculate the average uncertainty or unpredictability of the instantaneous amplitude and the instantaneous frequency of the first ten IMF components of the uterine EMG signals obtained by EMD. In this way twenty entropy values can be derived from each EMG recording.
Hence, in our study the entropy ratios of the instantaneous amplitude and the instantaneous frequency of each two IMFs of the uterine EMG signals were calculated for the purpose of exploring the intrinsic relations between IMFs, given by Eqs (7) and (8).
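As a rough sketch of the entropy-ratio idea: histogram-based Shannon entropies per IMF, then pairwise ratios. Assumptions: the bin count is arbitrary, and this operates on raw IMF samples rather than the Hilbert instantaneous amplitude/frequency the paper actually uses (that would need an analytic-signal step not shown here):

```python
import numpy as np

def shannon_entropy(x, bins=64):
    """Shannon entropy of the empirical distribution of x.
    The bin count is an assumed parameter, not from the paper."""
    hist, _ = np.histogram(x, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log(p)))

def entropy_ratios(imfs):
    """Pairwise entropy ratios H(IMF_i)/H(IMF_j) for i < j, in the spirit
    of Eqs (7) and (8) of the paper."""
    ents = [shannon_entropy(imf) for imf in imfs]
    return {'H{}/H{}'.format(i, j): ents[i] / ents[j]
            for i in range(len(ents)) for j in range(i + 1, len(ents))}
```

With 10 IMFs and both amplitude and frequency series, this pairwise scheme is where the large per-channel feature counts come from.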
Table 1 shows the classification performances of the features extracted from channel 3 based on both the EMD (180 entropy ratios) and non-EMD methods (P. Fergus et al. used four extracted features together: root mean square, median frequency, peak frequency and sample entropy [30]).
In addition, when we used all the features extracted from three uterine EMG signal channels (180 features per channel, 540 features in total) to classify the preterm and term delivery recordings, this still only achieved an average AUC value of 0.778. However, if we only use the features extracted from channel 3 (180 features) alone, the average AUC value can reach up to 0.89.
AdaBoost
A previous study has applied the synthetic minority over-sampling technique (SMOTE) to classify the records of preterm and term delivery groups in the TPEHG dataset [32]. In our study, the same SMOTE approach was also used.
Not exactly sure how they reach 180 features per channel...
Let's clarify the scope of the present work, with a deadline of 01/08.
Proposes "Empirical Mode Decomposition"
Create a notebook that analyses each of the features (individually):
The achieved classification accuracy was 100% for early records (recorded around the 23rd week of pregnancy), and 96.33%, with an area under the curve of 99.44%, for all records of the database.
For the classification of the entire preterm and term EHG records, the sample entropy, SE, median frequency, MF, and peak amplitude, PA, of the normalized power spectrum were derived, in each of the frequency bands B0, B1, B2, and B3, and for each of the EHG signals, S1, S2, and S3, of the database. Due to the normalization of each power spectrum, the PA from the frequency band B0 was omitted, resulting in 11 features per signal per record.
For these reasons, the QDA classifier seems a suitable choice for this study.
The ADASYN technique, used to balance the class distribution, increased the number of samples in the preterm minority class from 19 to 140 for early records, and from 38 to 256 for all records.
This study focuses on two datasets (TPEHGDB and TPEHG DS), and has a few features which are not yet included (using frequency bands).
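For illustration, here is a minimal NumPy-only sketch of the ADASYN mechanism used above (imbalanced-learn's `ADASYN` is the usual implementation; this toy version only shows the core idea that harder minority samples receive more synthetic points):

```python
import numpy as np

def adasyn_sketch(X_min, X_maj, n_new, k=5, rng=None):
    """Minimal sketch of ADASYN, not the reference implementation.
    Minority samples with more majority-class points among their k nearest
    neighbours are considered harder and get proportionally more synthetics."""
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_min, X_maj])
    # Fraction of majority points among each minority point's k nearest
    # neighbours in the combined set (column 0 is the point itself).
    d_all = np.linalg.norm(X_min[:, None] - X_all[None, :], axis=2)
    nn_all = np.argsort(d_all, axis=1)[:, 1:k + 1]
    r = (nn_all >= len(X_min)).mean(axis=1)
    if r.sum() == 0:  # classes fully separated: fall back to uniform weights
        weights = np.full(len(X_min), 1.0 / len(X_min))
    else:
        weights = r / r.sum()
    base = rng.choice(len(X_min), size=n_new, p=weights)
    # SMOTE-style interpolation towards random minority-class neighbours.
    d_min = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    nn_min = np.argsort(d_min, axis=1)[:, 1:k + 1]
    picked = nn_min[base, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))
    return X_min[base] + gaps * (X_min[picked] - X_min[base])
```

This would turn 38 preterm records into 256 by generating 218 synthetic minority samples.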
The results illustrate that the Random Forest performed best, with a sensitivity of 97%, specificity of 85%, area under the receiver operator curve (AUROC) of 94%, and a mean square error rate of 14%.
Root Mean Square of EHG Signal, Peak Frequency of EHG Signal, Median Frequency, Sample Entropy
Random Forest
To address this issue, the minority class (preterm) has been oversampled using the Synthetic Minority Over-Sampling Technique (SMOTE).
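A minimal sketch of the SMOTE interpolation step referenced above (for real experiments, imbalanced-learn's `SMOTE` is the standard choice; this only illustrates the mechanism):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Toy SMOTE: each synthetic point lies on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    dists = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]  # column 0 is self
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gaps * (X_min[picked] - X_min[base])
```

Unlike ADASYN, plain SMOTE spreads the synthetic points uniformly over the minority samples rather than concentrating them near the class boundary.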
S., Bhandary, S.V.: Automated detection of premature delivery using empirical mode and wavelet packet decomposition techniques with uterine electromyogram signals.
Proposes the following features:
Hi Gyuri,
I thought it might be interesting to have an early discussion about the author list as well. As you may have noticed, the author list is already quite extensive, because this study has been part of a larger project.
Would you agree with being third author in that list? Or would you prefer another position?
The final list would then be:
Gilles Vandewiele, Isabelle Dehaene, György Kovács, Lucas Sterckx, Olivier Janssens, Femke Ongenae, Femke De Backere, Filip De Turck, Kristien Roelens, Sofie Van Hoecke, and Thomas Demeester.
Proposes the following features:
Based on MMFE features, an improvement in the classification accuracy of term-preterm deliveries was achieved, with a maximum area under the curve (AUC) value of 0.99.
Then both MMFE (Fuzzy Entropy) and MMSE (Sample Entropy) analyses were performed on each one-min epoch (which had 60 × 20 = 1200 samples) and afterwards averaged over the 27 epochs to produce the MMFE or MMSE curves for each record. In this multiscale study, we considered 10 scales for each epoch, so that the coarse graining process of MMFE/MMSE analysis
yielded only 120 samples at the highest scale, which was nonetheless sufficient for MFSampEn calculation. These MSampEn or MFSampEn values, calculated on 10 different coarse-graining scales, were used as features in the classification stage
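The coarse-graining step described above (1200 samples reduce to 120 at scale 10) is simply a block average over non-overlapping windows:

```python
import numpy as np

def coarse_grain(signal, scale):
    """Coarse-graining step of multiscale entropy: average consecutive,
    non-overlapping windows of length `scale`."""
    n = len(signal) // scale
    return signal[:n * scale].reshape(n, scale).mean(axis=1)
```

A sample entropy (or its fuzzy variant) is then computed on each coarse-grained series, one value per scale, to form the MSampEn/MFSampEn curve.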
Gaussian (??) SVM? Is this RBF?
In this study, to solve the class skew problem, the Adaptive Synthetic Sampling (ADASYN) [44,45] technique was used.
We have no Fuzzy Entropy, but I think it is enough to only consider Sample Entropy
Uses the following features:
Proposes a variant on the Sample Entropy (Multivariate Fuzzy Sample Entropy)
There seems to be a bug in FeaturesSadiAhmed: when calling `FeaturesAllEHG().extract(signal_ch3[3000:-3000])` for tpehg929, the following exception occurs:
```
Traceback (most recent call last):
  File "all_features.py", line 38, in <module>
    results_ch3 = fe.extract(signal_ch3[3000:-3000])
  File "/usr/local/lib/python3.6/dist-packages/ehgfeatures-0.0.1-py3.6.egg/ehgfeatures/features/_FeatureGroup.py", line 17, in extract
    results = {**results, **(f.extract(signal))}
  File "/usr/local/lib/python3.6/dist-packages/ehgfeatures-0.0.1-py3.6.egg/ehgfeatures/features/_FeaturesSadiAhmed.py", line 49, in extract
    emd = emds['emd_' + str(i)]
KeyError: 'emd_6'
```
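The KeyError suggests the EMD produced fewer IMFs than the hard-coded index range expects for this record. A defensive loop over only the IMFs that actually exist could fix it; the names below mirror the traceback, but the library internals are assumptions:

```python
def iterate_imfs(emds, max_imfs=10):
    """Yield (index, imf) pairs for the IMFs present in the `emds` dict,
    stopping gracefully when a level is missing instead of raising KeyError."""
    for i in range(max_imfs):
        imf = emds.get('emd_' + str(i))
        if imf is None:  # fewer IMFs than expected: stop here
            break
        yield i, imf
```

Downstream features would then be computed only for the available IMFs (padding with NaN if a fixed-length feature vector is required).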
The results illustrate that the combination of the Levenberg-Marquardt trained Feed-Forward Neural Network, Radial Basis Function Neural Network and the Random Neural Network classifiers performed the best, with 91% for sensitivity, 84% for specificity, 94% for the area under the curve and 12% for the mean error rate
Some weird flavors of neural nets...
To address this issue, the minority class (preterm) is oversampled using the Synthetic Minority Over-Sampling Technique (SMOTE).
I would not focus too much on the specifics of their neural net and just use a simple feed-forward network.
--> Should be provided with the original data (so no work here)
--> Uses four different frequency bands to extract the following features from:
Also applies some extra pre-processing of the signal (Fourier, Hanning windows, ...)
After employing the adaptive synthetic sampling approach and six-fold cross-validation, the accuracy (ACC), sensitivity, specificity and area under the curve (AUC) were applied to evaluate RF classification. For PL and TL group, RF achieved the ACC of 0.93, sensitivity of 0.89, specificity of 0.97, and AUC of 0.80. Similarly, their corresponding values were 0.92, 0.88, 0.96 and 0.88 for PE and TE group, indicating that RF could be used to recognize preterm delivery effectively with EHG signals recorded before the 26th week of gestation.
31 features per EHG recording:
Random Forest
Adasyn
Seems like we have all features in place except for time reversibility... I implemented that one in a previous project:
```python
import numpy as np

def time_reversibility(data):
    """Third-order time-reversibility statistic of a 1-D signal:
    the mean cubed first difference."""
    norm = 1 / (len(data) - 1)
    lagged_data = data[1:]
    return norm * np.sum(np.power(lagged_data - data[:-1], 3))
```
Additionally, we could also use TSFRESH and HCTSA to extract a few thousand extra features. These are very generic time-series features (and thus maybe not suited for high-frequency biomedical signals), but could contain a few interesting ones. Of course, this would make our feature elimination much more expensive...
Proposes wavelet-based features (four-step algorithm)
We can perform an analysis of the predictive power of the extracted features using different datasets: