
tabletransforms.jl's Introduction


This package provides transforms that are commonly used in statistics and machine learning. It was developed to address specific needs in feature engineering and works with general Tables.jl tables.

Installation

Get the latest stable release with Julia's package manager:

] add TableTransforms

Documentation

  • STABLE: most recently tagged version of the documentation.
  • LATEST: in-development version of the documentation.

Contributing

Contributions are very welcome. Please open an issue if you have questions.

We have open issues with missing transforms that you can contribute.

tabletransforms.jl's People

Contributors

ablaom, antholzer, ceferisbarov, clarohenrique, dependabot[bot], dianapat, eliascarv, essamwisam, frank-iii, github-actions[bot], jay-sanjay, juliohm, jwscook, mrr00b00t, omar-elrefaei, rajatrc1705, spaette, vickydeka


tabletransforms.jl's Issues

Hyperparameter tuning

Looking a bit into MLJ integration. For better or worse, hyper-parameter optimization (e.g., grid search) in MLJ generally works by mutating the field values of the model struct. I wonder if TableTransforms.jl would consider changing its transformer types to mutable structs? I think in ML applications, at least, any loss in performance would be pretty minimal, but perhaps there are wider use cases to consider?

The alternative for our use case is for the MLJ model wrapper to be mutable (for now a wrapper is necessary anyway) and that a user wanting to do a search does something like

values = [Scale(low=0, high=x) for x in 1.0:0.1:10]  # <-- extra step
values = range(wrapped_transformer, :model, values=values)

However, while this might be fine for Grid search, it doesn't really work for other optimization strategies.

Thoughts?

Replace could accept a predicate

Currently we have to write Replace(oldval => newval). It would be useful to replace entries of a table based on a given predicate as in Replace(<(0) => newval)
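A minimal sketch of the proposed semantics, written against a plain NamedTuple of vectors rather than the package internals. The helper name `replace_where` is hypothetical and not part of TableTransforms.jl, and the sketch assumes the predicate returns a `Bool` for every entry (columns with `missing` values would need extra handling).

```julia
# Hypothetical helper, not part of TableTransforms.jl.
# `rule` is a pair `predicate => newval`; every entry for which the
# predicate holds is replaced by `newval` in all columns.
function replace_where(table::NamedTuple, rule::Pair)
  pred, newval = rule
  map(col -> [pred(x) ? newval : x for x in col], table)
end

t = (a = [-1, 2, -3], b = [0.5, -0.5, 1.5])
r = replace_where(t, <(0) => 0)
```

Here `<(0)` is the curried form of `x -> x < 0` from Base, so the pair reads the same way as in the proposed `Replace(<(0) => newval)`.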

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Add documentation with Documenter.jl

The project has grown a lot with the latest additions by @eliascarv and @ceferisbarov ❤️ I think it is time to add a documentation page with Documenter.jl.

In my opinion an ideal documentation for this package would consist of two pages:

  1. A page with the examples from the README and other examples showing revertibility of pipelines.
  2. A page with a list of docstrings for all available transforms and an index table (or side page) that users could click to jump directly to any transform's docstring.

We can then focus on writing very good docstrings and never bother again about updating the documentation. Any newly added transform could just be appended to the docstring page.

Add `pratio` option to EigenAnalysis

Currently we need to specify the number of output dimensions, but a useful option available in MultivariateStats.jl is pratio, which determines the number of dimensions to preserve in order to account for a given ratio of the total variance.
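As a rough illustration of what such an option could compute, the sketch below picks the smallest number of leading components whose cumulative variance reaches the requested ratio. The function name `outdim` is hypothetical, and the sketch assumes the eigenvalues of the covariance matrix are already available.

```julia
# Hypothetical helper: smallest number of leading components whose
# cumulative share of the total variance is at least `pratio`.
function outdim(eigenvalues::AbstractVector{<:Real}, pratio::Real)
  λ = sort(eigenvalues, rev=true)      # largest variance first
  ratios = cumsum(λ) ./ sum(λ)         # cumulative variance ratios
  findfirst(≥(pratio), ratios)
end

# cumulative ratios for these eigenvalues are 0.4, 0.7, 0.9, 1.0
d = outdim([4.0, 3.0, 2.0, 1.0], 0.8)
```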

OneHotEncoding

The transform takes categorical columns and converts them into multiple columns with 1s and 0s using the CategoricalArrays.jl API.
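A minimal sketch of the encoding for a single column. To stay self-contained it uses `unique` from Base instead of the `levels` function of CategoricalArrays.jl, so the level order here is simply the sorted order; the function name `onehot` is hypothetical.

```julia
# Hypothetical helper: one indicator column (1s and 0s) per distinct value.
# Assumes the values can be sorted and converted to Symbols.
function onehot(col::AbstractVector)
  lvls  = sort(unique(col))
  names = Tuple(Symbol.(lvls))
  cols  = Tuple(Int.(col .== l) for l in lvls)
  NamedTuple{names}(cols)
end

oh = onehot(["a", "b", "a", "c"])
```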

Introduce RowTable and ColTable transforms

For row table inputs, will the output of a transform necessarily have rowaccess? Is there a general principle guiding the form of the output? Can I arrange that the output is Tables.materialized?

In MLJ we generally materialize based on input type, but it has been argued that an option to specify the output type would be useful. I'm wondering what the authors of TableTransforms.jl think about such questions.

Fix issues in documentation

  1. Copy theme from Meshes.jl/GeoStats.jl
  2. Add docstring for ProjectionPursuit
  3. Remove section on external transforms

Regarding item (3), the list will become obsolete very quickly and it is probably better to just list the dependent packages in TransformsBase.jl's README.

Can TableTransforms do transforms on one row at a time?

Hi there,

I have a project with this pipeline:

  1. Prepare data in SQL
  2. For each of 100s of statistical models: select a table, apply transforms, convert to a matrix, and use it for model training
  3. Use the trained models in an agent-based simulation.

The transforms used in Step 2 are also used in Step 3. The difference is that Step 2 applies the transforms to a table, whereas Step 3 applies them to a Dict (representing an agent's state). This state object can be thought of as a row/observation of a table.

With that in mind I've built a transforms package with the following API:

transformtable!(table, t)  # Used in model training
transform!(obs, t)   # Used in simulation

Then:

  • By default transformtable! applies transform! to each observation one at a time.
  • Variable-level transforms, which don't need to be applied to each observation (e.g., setlevels), have:
    • A custom implementation of transformtable!, which dispatches on the transform's type
    • transform!(obs, t) = nothing
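The default behavior described above could look roughly like the sketch below. Only the function names `transform!` and `transformtable!` come from the post; the `AbstractTransform` and `AddConstant` types, and the choice of a vector of Dicts as the table, are assumptions made for illustration.

```julia
# Hypothetical types for illustration only.
abstract type AbstractTransform end

struct AddConstant <: AbstractTransform
  name::Symbol
  c::Float64
end

# Observation-level method: mutates a single observation (a Dict).
transform!(obs::AbstractDict, t::AddConstant) = (obs[t.name] += t.c; nothing)

# Default table-level method: applies transform! to one observation at a time.
function transformtable!(table::Vector{<:AbstractDict}, t::AbstractTransform)
  foreach(obs -> transform!(obs, t), table)
  table
end

table = [Dict(:x => 1.0), Dict(:x => 2.0)]
transformtable!(table, AddConstant(:x, 0.5))
```

A variable-level transform would add its own `transformtable!` method for its type, and dispatch would pick that method over the default.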

This gives me everything I need.
I'd like to open source this functionality, but I'd rather not clutter the ecosystem if this functionality already exists or is planned.
Does it exist in TableTransforms? Or is it planned?

Cheers,
Jock

Unexport internal types and functions

These types:

Transform, Stateless, Colwise

and these functions:

assertions, colapply, colrevert

are part of the internal API and should not be exported.

Treatment of metadata

This issue tracks our efforts to generalize the API to handle metadata (or special columns). The overall idea consists of renaming all apply, revert, reapply functions to internal functions that operate on the feature table. End users won't feel any difference given that the fallbacks don't change the current behavior.

Enable transforms that are functions of multiple variables?

Hi there,

This package looks great!

I have a module in one of my projects that also applies transforms to tabular data, and I'm wondering if I should port those to TableTransforms.jl.

I often have transforms of the form: newcol = f(col1, col2, ...)
E.g., the difference between 2 Date columns, interaction terms, indicator functions, etc.
Although such functions are not yet available in this package, are you open to allowing this?
I guess this amounts to allowing functions of rows, along these lines: newcol[i] = f(row[i]...)
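A minimal sketch of the idea for a plain NamedTuple of vectors. The function name `addcolumn` and its argument order are hypothetical and not an existing or planned API of the package.

```julia
# Hypothetical helper: newcol[i] = f(col1[i], col2[i], ...)
function addcolumn(table::NamedTuple, newname::Symbol, f, names::Symbol...)
  cols   = map(n -> table[n], names)   # tuple of the input columns
  newcol = map(f, cols...)             # apply f row by row
  merge(table, NamedTuple{(newname,)}((newcol,)))
end

t = (a = [1, 2, 3], b = [10, 20, 30])
r = addcolumn(t, :diff, -, :b, :a)               # difference of two columns
s = addcolumn(t, :inter, (x, y) -> x * y, :a, :b) # interaction term
```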

Cheers,
Jock

Unexpected behavior for row table

I expected this:

julia> X = Tables.table(rand(10, 3))
Tables.MatrixTable{Matrix{Float64}} with 10 rows, 3 columns, and schema:
 :Column1  Float64
 :Column2  Float64
 :Column3  Float64

julia> rt = Tables.rowtable(X)
10-element Vector{NamedTuple{(:Column1, :Column2, :Column3), Tuple{Float64, Float64, Float64}}}:
 (Column1 = 0.5565428815102202, Column2 = 0.9460318649319521, Column3 = 0.3359377652605777)
 (Column1 = 0.5075982188653294, Column2 = 0.18315493443083453, Column3 = 0.9682078694091394)
 (Column1 = 0.9963268867042159, Column2 = 0.9329899214178856, Column3 = 0.5470116730645322)
 (Column1 = 0.3586863538486338, Column2 = 0.028403436910423796, Column3 = 0.37539265155858015)
 (Column1 = 0.6219233944778164, Column2 = 0.3499360310552866, Column3 = 0.14038853850080102)
 (Column1 = 0.587164895366161, Column2 = 0.9365631957076301, Column3 = 0.4638888590010579)
 (Column1 = 0.5896922517858991, Column2 = 0.29876303863960274, Column3 = 0.8361667225744209)
 (Column1 = 0.3924751962812454, Column2 = 0.7981085942387889, Column3 = 0.11151780408924594)
 (Column1 = 0.26805907763502645, Column2 = 0.6863527449570381, Column3 = 0.5322558300965325)
 (Column1 = 0.4000480227789831, Column2 = 0.5888482501962962, Column3 = 0.25538366367233634)

julia> X |> Center()
(Column1 = [0.02869116358486712, -0.020253499060023716, 0.46847516877886275, -0.16916536407671934, 0.09407167655246329, 0.059313177440807885, 0.061840533860546, -0.13537652164410774, -0.25979264029032667, -0.12780369514637002], Column2 = [0.37111666368337826, -0.39176026681773934, 0.35807472016931174, -0.5465117643381501, -0.22497917019328728, 0.36164799445905627, -0.27615216260897113, 0.223193392990215, 0.11143754370846426, 0.013933048947722293], Column3 = [-0.12067737246214477, 0.5115927316864168, 0.09039653534180975, -0.08122248616414235, -0.3162265992219215, 0.007273721278335421, 0.37955158485169843, -0.34509733363347656, 0.07564069237381005, -0.20123147405038616])

But not this:

julia> rt |> Center()
NamedTuple{(), Tuple{}}[]

StdNames

Implement a StdNames transform that standardizes the column names.

We could use symbols to specify naming conventions:

StdNames(:snake) # column_name
StdNames(:camel) # ColumnName
StdNames(:upper) # COLUMNNAME
StdNames() # default to :upper

This transform generalizes the functions in janitor and Cleaner.jl.
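The three conventions above could be sketched for a single column name as follows. The function name `stdname` is hypothetical, and the sketch assumes that words in the original name are separated by whitespace, hyphens or underscores.

```julia
# Hypothetical helper: standardize one column name according to `spec`.
function stdname(name::Symbol, spec::Symbol=:upper)
  words = split(string(name), r"[\s\-_]+", keepempty=false)
  if spec == :snake
    Symbol(join(lowercase.(words), "_"))
  elseif spec == :camel
    Symbol(join(uppercasefirst.(lowercase.(words))))
  elseif spec == :upper
    Symbol(join(uppercase.(words)))
  else
    throw(ArgumentError("unknown naming convention: $spec"))
  end
end

snake = stdname(Symbol("column name"), :snake)
camel = stdname(Symbol("column name"), :camel)
upper = stdname(Symbol("column name"))
```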

Levels

Hi there,

Is there appetite for a SetLevels transform that resets the levels of a column of scitype Finite?

Something along the lines of: SetLevels(:mycol, levels)

I often receive binary data with levels ["yes", "no"] and need to reorder them as ["no", "yes"], or simply recast them as [0, 1].

Another example is replacing alphabetical order with a more meaningful order.
E.g., ["large", "medium", "small"] changes to ["small", "medium", "large"]

Add dummy table type with metadata in tests

Our tests currently don't cover the metadata case. We can create a test/dummy.jl file with a simple struct definition that mimics a table with metadata. Then use this table type in all test sets.

Pipeline with `Parallel` and `Select` can not be reverted

Using Select inside a Parallel pipeline makes it no longer possible to use revert and results in an error.

For example the following:

julia> p = Select(:a) ⊔ Select(:b);

julia> t = (a=rand(3), b=rand(3));

julia> r, c = apply(p, t)
((a = [0.7731147741011499, 0.969760210089194, 0.7926637097373099], b = [0.6707783348893381, 0.8919247172804263, 0.6657373751200266]), ((:a, :b), Tuple{Vector{Symbol}, Vector{Vector{Float64}}, Vector{Int64}, Vector{Union{Nothing, Int64}}}[([:b], [[0.6707783348893381, 0.8919247172804263, 0.6657373751200266]], [1], [2]), ([:a], [[0.7731147741011499, 0.969760210089194, 0.7926637097373099]], [1], [1])], (1, 1:1)))

julia> revert(p, r, c)
ERROR: NamedTuple names and field types must have matching lengths
Stacktrace:
 [1] (NamedTuple{(:a, :b), T} where T<:Tuple)(args::Tuple{Vector{Float64}})
   @ Core ./boot.jl:591
 [2] (NamedTuple{(:a, :b), T} where T<:Tuple)(itr::Vector{Vector{Float64}})
   @ Base ./namedtuple.jl:105
 [3] merge(a::NamedTuple{(), Tuple{}}, b::Base.Iterators.Zip{Tuple{Tuple{Symbol, Symbol}, Vector{Vector{Float64}}}})
   @ Base ./namedtuple.jl:261
 [4] revert(p::Parallel, newtable::NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}}, cache::Tuple{Tuple{Symbol, Symbol}, Vector{Tuple{Vector{Symbol}, Vector{Vector{Float64}}, Vector{Int64}, Vector{Union{Nothing, Int64}}}}, Tuple{Int64, UnitRange{Int64}}})
   @ TableTransforms ~/.julia/dev/TableTransforms/src/transforms/parallel.jl:78
 [5] top-level scope
   @ REPL[28]:1

ColSpec is broken for ZScore

MWE:

using TableTransforms

(a=rand(Int,3), b=rand(3)) |> ZScore(:b)

AssertionError: columns must hold continuous variables

assert_continuous(::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}})@assertions.jl:8
applyfeat(::TableTransforms.ZScore{TableTransforms.NameSpec}, ::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}, ::Nothing)@transforms.jl:175
apply@transforms.jl:131[inlined]
Transform@interface.jl:84[inlined]
|>(::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}, ::TableTransforms.ZScore{TableTransforms.NameSpec})@operators.jl:911
top-level scope@Local: 1[inlined]

More general assertion API

Currently we provide a few basic checks as assertion functions that can be used before the transform is applied. For example, some transforms like EigenAnalysis only make sense with continuous variables. We need to refactor this API and consider assertions that apply to a specific ColSpec and assertions that hold state.

In particular, we could return a list of structs in the assertions function that represent different types of assertion. For example, we could consider something like:

"""
    SciTypeAssertion{T}(colspec)

Asserts that the columns in the `colspec` have a scientific type `T`.
"""
struct SciTypeAssertion{T,S}
  colspec::S
end

We could then introduce an API apply(assertion, table) to throw an error whenever the checks fail. The name of the function could also be something else to avoid confusion with the apply of transforms.
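As a rough, self-contained illustration of an assertion that applies only to selected columns, the sketch below checks element types instead of scientific types, so it does not depend on any scitype package. The names `EltypeAssertion` and `check` are hypothetical.

```julia
# Hypothetical assertion type: the listed columns must have element type <: T.
struct EltypeAssertion{T}
  names::Vector{Symbol}
end

# Throws an AssertionError for the first column that fails the check.
function check(a::EltypeAssertion{T}, table::NamedTuple) where {T}
  for n in a.names
    eltype(table[n]) <: T ||
      throw(AssertionError("column $n must have element type <: $T"))
  end
  nothing
end

t = (a = [1, 2, 3], b = [0.1, 0.2, 0.3])
check(EltypeAssertion{AbstractFloat}([:b]), t)   # passes, column :a is ignored
```

Because the assertion carries its own list of columns, the integer column `a` does not trigger an error when only `b` is being transformed.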

Dealing with missing values

I am not sure whether this is intentional, but when the input table has missing values, the assert_continuous function throws AssertionError: columns must hold continuous variables, even though all other values are continuous.

Considering that there is still no transform for imputation, I think it would make sense to add an option to skip the missing values.
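One way such an option could behave is sketched below. It assumes that "continuous" corresponds to floating-point element types, and the helper names `iscontinuous` and `center` are hypothetical rather than the package's own functions.

```julia
using Statistics

# Hypothetical check: ignore `Missing` in the element type before testing.
iscontinuous(col) = nonmissingtype(eltype(col)) <: AbstractFloat

# Hypothetical centering that skips missing values when computing the mean;
# missing entries stay missing in the result.
center(col) = col .- mean(skipmissing(col))

x = [1.0, missing, 3.0]
c = center(x)
```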

Restrict transform to columns?

At the moment, in order to apply a Colwise transform to selected columns, one has to use Select. Is there another way?

I am finding that Select has some drawbacks when used in this way, since it puts all the non-selected columns into the cache. For the following pipeline, for example, this would not be necessary:

(Select(:a) → MinMax()) ⊔ (Select(:b) → ZScore())

If I have a large pipeline/table, getting a large cache could be annoying. There is also an issue with revert, see #80.

So would it make sense to introduce a wrapper transform (or something else), where we give it a subset of columns, that only applies the (Colwise) transform to the subset of columns? For the pipeline above this could for example look like the following

Restrict(:a)(MinMax()) → Restrict(:b)(ZScore())

Consider restricting Levels to columns with categorical scitype

Currently the Levels transform contributed by @Frank-III and refactored by @eliascarv can operate directly on any type of column, including columns with the wrong scitype. I think we could consider a more consistent approach where we ask users to be explicit about the scitypes using the Coerce transform first. Even though this is a breaking change, it will provide more safety in real-world applications of pipelines.

Add option to rename columns during selection

A nice feature of the DataFrames.jl select function is that it allows in-place renaming:

select(df, :x => :y, :a => :b)

Currently, we have to write a Select(:x, :a) and then use Rename(:x => :y, :a => :b) on the result. Maybe we could introduce a SelectRename transform or generalize the Select to accept renaming.
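A minimal sketch of selection with renaming for a plain NamedTuple of vectors. The function name `selectrename` is hypothetical and only illustrates the semantics that a generalized Select could have.

```julia
# Hypothetical helper: keep the columns on the left of each pair and
# give them the names on the right.
function selectrename(table::NamedTuple, pairs::Pair{Symbol,Symbol}...)
  newnames = map(last, pairs)
  cols     = map(p -> table[first(p)], pairs)
  NamedTuple{newnames}(cols)
end

t = (x = [1, 2], a = [3, 4], z = [5, 6])
r = selectrename(t, :x => :y, :a => :b)
```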

Tests are taking too long

Perhaps some tests are using unnecessarily large example tables? Can we reduce the tables to speed up the testing experience?

Lazy Select/Reject

Currently our Select/Reject are greedy and generate copies of the input tables. We could create a lazy type for selections similar to what TableOperations.jl does.

Reverting `Select` inside pipeline fails if columntypes differ

For example in the following

df = (; a=rand(Float32, 2), b=rand(2))
p = Select(:b) → MinMax()
newtable, cache = apply(p, df)
revert(p, newtable, cache)

the column a gets converted into a Float64 vector.
If one uses the table df = (; a=[rand(3), rand(3)], b=rand(2)) instead, the inversion fails since it tries to convert a Vector{Float64} to Float64.

Is this behavior intentional?

Explicitly creating an Any vector should fix it, i.e.

diff --git a/src/transforms/select.jl b/src/transforms/select.jl
index 5748372..09ae813 100644
--- a/src/transforms/select.jl
+++ b/src/transforms/select.jl
@@ -108,7 +108,7 @@ function revert(::Select, newtable, cache)
   # selected columns
   cols   = Tables.columns(newtable)
   select = Tables.columnnames(cols)
-  scols  = [Tables.getcolumn(cols, name) for name in select]
+  scols  = Any[Tables.getcolumn(cols, name) for name in select]
 
   # rejected columns
   reject, rcols, sperm, rinds = cache

Shall I open a pull request for this fix?

Missing transforms

Below is a list of transforms to be implemented:

  • Rename
  • Filter
  • DropMissing
  • Replace
  • Coalesce
  • Coerce

Contributions are very welcome! Just comment on the issue if you plan to start working on some of these.

Add `FA` transform for factor analysis

The Factor Analysis (FA) transform is a useful probabilistic variation of PCA that identifies latent factors in a data set. We could provide a transform that returns a new table with columns FC1, FC2, ... representing the factor loadings.

Sample

Add a Sample transform that samples rows of tables a la sample in StatsBase.jl. In particular, we should be able to specify the number of samples and specify if the sampling allows replacement or not.
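A minimal sketch of row sampling for a plain NamedTuple of vectors, using only the Random standard library rather than StatsBase.jl. The function name `samplerows` and its keyword are hypothetical; sampling without replacement assumes the requested size does not exceed the number of rows.

```julia
using Random

# Hypothetical helper: sample `nsamples` rows with or without replacement.
function samplerows(rng::AbstractRNG, table::NamedTuple, nsamples::Int;
                    replace::Bool=true)
  n = length(first(table))   # number of rows
  inds = replace ? rand(rng, 1:n, nsamples) : randperm(rng, n)[1:nsamples]
  map(col -> col[inds], table)
end

t = (a = collect(1:10), b = collect(11:20))
s = samplerows(MersenneTwister(42), t, 5; replace=false)
```

The same row indices are used for every column, so rows stay aligned across columns.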

Sort

Add a Sort transform that sorts the rows a la sort from Base. The transform should allow the specification of columns, ascending vs descending order, etc.
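A minimal sketch of row sorting by a single column for a plain NamedTuple of vectors, built on `sortperm` from Base. The function name `sortrows` is hypothetical, and sorting by several columns is left out.

```julia
# Hypothetical helper: reorder all columns by the sort order of column `by`.
function sortrows(table::NamedTuple, by::Symbol; rev::Bool=false)
  perm = sortperm(table[by]; rev=rev)
  map(col -> col[perm], table)
end

t = (a = [3, 1, 2], b = ["c", "a", "b"])
asc  = sortrows(t, :a)
desc = sortrows(t, :a; rev=true)
```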

Improve IO methods of all transforms

Currently we do not provide Base.show methods for the transforms and their output is quite difficult to digest. We could use some fancy unicode to display the transforms and pipelines to end users.
