
linregoutliers's Introduction


LinRegOutliers

A Julia package for outlier detection in linear regression.

Implemented Methods

  • Ordinary Least Squares and Weighted Least Squares regression
  • Regression diagnostics (DFBETA, DFFIT, CovRatio, Cook's Distance, Mahalanobis, Hadi's measure, etc.)
  • Hadi & Simonoff (1993)
  • Kianifard & Swallow (1989)
  • Sebert & Montgomery & Rollier (1998)
  • Least Median of Squares
  • Least Trimmed Squares
  • Minimum Volume Ellipsoid (MVE)
  • MVE & LTS Plot
  • Billor & Chatterjee & Hadi (2006)
  • Pena & Yohai (1995)
  • Satman (2013)
  • Satman (2015)
  • Setan & Halim & Mohd (2000)
  • Least Absolute Deviations (LAD)
  • Quantile Regression Parameter Estimation (quantileregression)
  • Least Trimmed Absolute Deviations (LTA)
  • Hadi (1992)
  • Marchette & Solka (2003) Data Images
  • Satman's GA based LTS estimation (2012)
  • Fischler & Bolles (1981) RANSAC Algorithm
  • Minimum Covariance Determinant Estimator
  • Imon (2005) Algorithm
  • Barratt & Angeris & Boyd (2020) CCF algorithm
  • Atkinson (1994) Forward Search Algorithm
  • BACON Algorithm (Billor & Hadi & Velleman (2000))
  • Hadi (1994) Algorithm
  • Chatterjee & Mächler (1997)
  • Theil-Sen estimator for multiple regression
  • Deepest Regression Estimator
  • Summary

Unimplemented Methods

  • Pena & Yohai (1999). See #25 for the related issue.

Installation

LinRegOutliers can be installed using the Julia REPL.

julia> ]
(@v1.9) pkg> add LinRegOutliers

or

julia> using Pkg
julia> Pkg.add("LinRegOutliers")

then

julia> using LinRegOutliers

to make everything ready to use.

Examples

We provide some examples here.
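A minimal sketch of a typical call, following the pattern used in the documentation; it assumes the bundled phones dataset, the @formula macro, createRegressionSetting, and smr98 are all available after loading the package:

julia> using LinRegOutliers
julia> # build a regression setting from a formula and a dataset
julia> setting = createRegressionSetting(@formula(calls ~ year), phones);
julia> # run a detector; smr98 returns the indices of suspected outliers
julia> smr98(setting)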

Documentation

Please check out the reference manual here.

News

  • We implemented ~25 outlier detection algorithms, which cover a large portion of the literature.
  • Visit the CHANGELOG.md for the log of latest changes.

Contributions

You are probably the right contributor

  • If you have a statistics background
  • If you like Julia

However, the second condition is more important, because an outlier detection algorithm is just an algorithm: reading the implemented methods is enough to implement new ones. Please follow the issues; they include a number of first introductions for newcomers. Welcome, and thank you in advance!

Citation

Please cite our original paper if you use the package in your research:

Satman et al., (2021). LinRegOutliers: A Julia package for detecting outliers in linear regression. Journal of Open Source Software, 6(57), 2892, https://doi.org/10.21105/joss.02892

or use the BibTeX entry

@article{Satman2021,
  doi = {10.21105/joss.02892},
  url = {https://doi.org/10.21105/joss.02892},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {57},
  pages = {2892},
  author = {Mehmet Hakan Satman and Shreesh Adiga and Guillermo Angeris and Emre Akadal},
  title = {LinRegOutliers: A Julia package for detecting outliers in linear regression},
  journal = {Journal of Open Source Software}
}

Contact & Communication

  • Please use issues for feature requests or bug reports.
  • We are in the #linregoutliers channel on the Julia Slack for any discussion that requires online chat.

linregoutliers's People

Contributors

akadal, angeris, jbytecode, tantei3, timholy

linregoutliers's Issues

BACON algorithm is missing

The BACON algorithm with reference

Billor, Nedret, Ali S. Hadi, and Paul F. Velleman. "BACON: blocked adaptive computationally efficient outlier nominators." Computational statistics & data analysis 34.3 (2000): 279-298.

is required. Is there any contributor for this? Volunteers will be assigned to this issue.

Hadi (1994) algorithm is missing

The algorithm with reference

Hadi, Ali S. "A modification of a method for the detection of outliers in multivariate samples." Journal of the Royal Statistical Society: Series B (Methodological) 56.2 (1994): 393-396.

is required. This algorithm is a modification of the Hadi (1992) algorithm, which is implemented in /src/hadi1992.jl.

Implementing this method requires small refactors of the existing implementation.

Volunteers will be assigned to this issue.

Minimum Covariance Determinant (MCD) estimator is missing

The Minimum Covariance Determinant estimator is missing from the package.

The algorithm can be implemented using the paper

Rousseeuw, Peter J., and Katrien Van Driessen. "A fast algorithm for the minimum covariance determinant estimator." Technometrics 41.3 (1999): 212-223.

Documentation - 1. Functions

I think that documentation related to the algorithms provided in the package needs to be improved. Although each method has its clear reference provided in the reference manual, I think it would be nice to have a couple of sentences providing an idea of what the algorithm does.

In addition, outputs from different algorithms are not consistent with each other. This is not a problem per se, but I think that there should be a description of what the different output elements are. For example, ks89 and smr98 seem to return an Array of indexes for the outliers, while lms returns a Dictionary with 6 entries. These entries are not obvious to interpret, and I think they should be documented.
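For illustration, a hedged sketch of the two output shapes described above, assuming a setting built from the bundled phones data; the "outliers" key follows the naming discussed elsewhere in these issues:

julia> setting = createRegressionSetting(@formula(calls ~ year), phones);
julia> ks89(setting)           # an Array of indexes of suspected outliers
julia> result = lms(setting);  # a Dict with several entries
julia> result["outliers"]      # indexes of the detected outliers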

openjournals/joss-reviews#2892

Singularity

@tantei3 we have

ols(X::Array{Float64,2}, y::Array{Float64,1})::OLS = OLS(X, y, inv(X' * X) * X' * y)

in /src/ols.jl, and the inv method throws an error in the case of singularity. How about changing the method to

ols(X::Array{Float64,2}, y::Array{Float64,1}; inversefunction = inv)::OLS = OLS(X, y, inversefunction(X' * X) * X' * y)

so that some methods can call it with pinv from the LinearAlgebra package:

ols(X, y, inversefunction = pinv)

With this change, you can remove the try/catch block in the ransac method.

MVE

The mve() function in the package returns a different covariance matrix when compared to the R results from the cov.mve() function in the MASS package.

Contributions are also welcome if there is a bug in the implemented method. However, the mve & lts plot gives satisfactory results.

Julia compat entry should be updated to 1.7 to accommodate RANSAC

Using Julia 1.6.0, the RANSAC example code gave me the following error: UndefVarError: ColumnNorm not defined

I tracked the change to this commit; judging by the commit date, I suspect it was introduced in version 1.7.0. I also see that the LinRegOutliers tests are passing for Julia 1.7 (the README badge is out of date).

The error was resolved after I upgraded Julia to 1.7.1.
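For reference, a minimal sketch of the corresponding compat entry in Project.toml, assuming 1.7 is the intended lower bound:

[compat]
julia = "1.7"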

Versioning

From now on, we are using the version scheme

major.minor.patch

  • minor will be increased when a new algorithm is added
  • patch will be increased for any modification, refactoring, documentation of an existing method, bug fix, or other minor change
  • major will be increased for milestones.

The definition of milestone is not yet clear. It could be discussed in issues.

feature request: methods for `(y,X)`

Hi,

following up on the discussion on Discourse, I would kindly ask for methods for (y, X).

Motivation: while GLM and friends are often useful, it is sometimes easier to just do 'b = X\y' etc.

Feasibility: looking at your code, it sometimes (like in lad.jl) starts with

X = designMatrix(setting)
Y = responseVector(setting)

In these cases, it should be straightforward to add such methods; a rough sketch follows below. (I am busy with teaching right now, but might be able to submit PRs later this autumn.)
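A minimal, self-contained sketch of the pattern, using hypothetical names rather than the package's actual types or internals: the core estimator works on raw arrays, the setting-based method extracts them, and the requested (X, y) method becomes a one-line wrapper.

# hypothetical stand-ins, not the package's types or functions
struct DemoSetting
    X::Matrix{Float64}
    y::Vector{Float64}
end

designmat(s::DemoSetting) = s.X
responsevec(s::DemoSetting) = s.y

# core estimator on raw arrays (ordinary least squares as a placeholder)
fitcore(X::AbstractMatrix, y::AbstractVector) = X \ y

# setting-based entry point delegates to the core
fitmodel(s::DemoSetting) = fitcore(designmat(s), responsevec(s))

# the requested (X, y)-style method is then trivial to add
fitmodel(X::AbstractMatrix, y::AbstractVector) = fitcore(X, y)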

Robust regression: A weighted least squares approach.

I think it is worth looking at the publication given below, since it seeks robust estimates of regression parameters in a different way:

Chatterjee, Samprit, and Martin Mächler. "Robust regression: A weighted least squares approach." Communications in Statistics-Theory and Methods 26.6 (1997): 1381-1394.

Consistency of objects returned from functions

I am not sure this is strictly related to the review, but I have noticed that satman2013 returns a Dictionary with 1 entry named outliers, while the functions ks89 and smr98 return an Array. Aren't all these functions returning indexes of outliers? If so, I think it would be nice if the objects returned were consistent across functions (either a Dictionary with one entry or an Array).

openjournals/joss-reviews#2892

Documentation - 2. Examples

As for the documentation, I think that the examples could be clearer if expanded a bit. I like the fact that the algorithm calls follow the same structure; the function usage is pretty self-explanatory.
However, I would suggest expanding the example page to better reflect how the functions of the package help to perform data analysis, for example by providing a few sentences describing the data and the objective of the analysis, and by commenting on the results. The example could also clarify that algorithms can provide different outputs, and that they all provide a list of outliers named outliers, either directly or contained in the Dictionary.

openjournals/joss-reviews#2892

Implement minimization of clipped convex functions for outlier detection

(For original context on this issue, see here and thread.)

There are some basic heuristics in this paper (here's an arXiv PDF with better formatting) for outlier detection and data fitting, when the problem of fitting with outliers is specified as the minimization of a sum of clipped convex functions. The algorithm is relatively short, doesn't really require many parameters, and seems to have good performance in practice, so, if there is any interest, it might be worth including!

CC: @sbarratt

Paper submission

Hi folks,

Firstly, thanks to all contributors and to those who are trying to implement new functionality right now.

We have nearly 20 algorithms / papers implemented and the package is in a good state. It seems that, after one or two months, the package will cover nearly 95% of the classical literature. Our long-term goal is, of course, to implement all of the pioneering algorithms in this scientific area and make them all production ready for Julia users and researchers.

Since our package is also a scientific contribution and / or a useful tool for those who will build their scientific research on it, it would be good to have a citable manuscript in an academic journal. Our package fits the contextual requirements of the Journal of Open Source Software (JOSS) perfectly. I am planning to prepare a manuscript when the current status of our package can be considered mature (yes, being mature is not a clear criterion, but in terms of coverage and completeness, our package is in a good state).

All valuable contributors are also natural co-authors of the paper.

So, please have your assigned algorithm(s) ready by January 1st, 2021.

Fyi, have a nice week.

Deepest Regression (DR) estimator

It would be good to have the deepest regression estimator, referenced as

Van Aelst, Stefan, et al. "The deepest regression method." Journal of Multivariate Analysis 81.1 (2002): 138-166.

in the package. Any contributions are welcome.

JOSS Submission

Dear friends,

I prepared the first draft of the paper, and it is in the paper folder. Since you all have valuable contributions, you are naturally co-authors.

Please have a look at the paper first. You can compile it using the Whedon site of JOSS here by feeding the repository address into the input box. You can use pandoc to convert the markdown to PDF on your local computer; however, I am not even sure whether you can attach the bibliography file this way. I prefer to use vscode with the markdown add-in and use Whedon to check the final PDF.

In the header of the paper.md file, you must correct your ORCID and affiliation info. The order of the authors is determined by the amount of contribution. If this order does not seem right to you, please let me know.

The paper is in its early stages and can be considered a first draft. Please make your corrections and contributions, if there are any. Since @angeris is a native English speaker (at least he uses English in his everyday life), I would be very grateful if he could read the article and do a language and typo check in addition to his contributions.

The summary and state of the field sections look good, but the statement of need may be incomplete in this version.

I am waiting for your pull requests, or any contributions under this issue.

Have a nice week!

@tantei3
@angeris
@akadal

Repeated medians

Is there anybody interested in this subject? We can discuss whether we should implement this or not.

Siegel, Andrew F. "Robust regression using repeated medians." Biometrika 69.1 (1982): 242-244.

Pena & Yohai (1999)

The paper

Daniel Peña & Victor Yohai (1999) A Fast Procedure for Outlier Diagnostics in Large Regression Problems, Journal of the American Statistical Association, 94:446, 434-445, DOI: 10.1080/01621459.1999.10474138

is within the scope of the package and it would be nice to have it. The algorithm consists of two stages, which are relatively easy to implement. We have a py95() implementation for another paper by the same authors.

Performance improvements, micro benchmarks, enhanced tests.

Dear all,

Since our package covers a good percentage of the corresponding literature, we can focus on performance improvements using micro benchmarks. We can also expand the test coverage for a more robust package.

Any contributions that significantly enhance the performance are welcome. Our motivation here is to have

  • faster code
  • better handling of edge cases
  • enhanced test coverage

Any ideas can be discussed here or, if needed, a new issue can be opened to address the problem.
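A minimal micro-benchmark sketch, assuming BenchmarkTools.jl and a setting built from the bundled phones data; the chosen estimators are only examples:

julia> using BenchmarkTools, LinRegOutliers
julia> setting = createRegressionSetting(@formula(calls ~ year), phones);
julia> @btime lts($setting);    # least trimmed squares
julia> @btime smr98($setting);  # Sebert, Montgomery & Rollier (1998)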

Have a nice weekend.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

A. H. M. Rahmatullah Imon (2005)

This paper suggests an outlier detection algorithm based on regression diagnostics that is relatively easy to implement.

A. H. M. Rahmatullah Imon (2005) Identifying multiple influential observations in linear regression, Journal of Applied Statistics, 32:9, 929-946, DOI: 10.1080/02664760500163599

I can assign this to any of our friends who is interested.

Atkinson (1994) Forward search and outlier plot

This algorithm generates a useful outlier plot based on the lms estimation, which is already available in our package.

Atkinson, Anthony C. "Fast very robust methods for the detection of multiple outliers." Journal of the American Statistical Association 89.428 (1994): 1329-1339.

I plan to implement this in the next few days if nobody else is interested.
