juliaml / tabletransforms.jl

Transforms and pipelines with tabular data in Julia

Home Page: https://juliaml.github.io/TableTransforms.jl/stable

License: MIT License

Julia 100.00%

table data-science statistics machine-learning transforms pipelines

tabletransforms.jl's Introduction

This package provides transforms that are commonly used in statistics and machine learning. It was developed to address specific needs in feature engineering and works with general Tables.jl tables.

Installation

Get the latest stable release with Julia's package manager:

] add TableTransforms

Documentation

STABLE — most recently tagged version of the documentation.
LATEST — in-development version of the documentation.

Contributing

Contributions are very welcome. Please open an issue if you have questions.

We have open issues with missing transforms that you can contribute.

tabletransforms.jl's Issues

Hyperparameter tuning

Looking a bit into MLJ integration. For better or worse, hyper-parameter optimization (e.g., grid search) in MLJ generally works by mutating the field values of the model struct. I wonder if TableTransforms.jl would consider changing their transformer types to mutable structs? I think in ML applications, at least, any loss in performance would be pretty minimal, but perhaps there are wider use-cases to consider?

The alternative for our use case is for the MLJ model wrapper to be mutable (for now a wrapper is necessary anyway) and that a user wanting to do a search does something like

values = [Scale(low=0, high=x) for x in 1.0:0.1:10]     # <-- extra step
values = range(wrapped_transformer, :model, values=values)

However, while this might be fine for Grid search, it doesn't really work for other optimization strategies.

Thoughts?

Replace could accept a predicate

Currently we have to write Replace(oldval => newval). It would be useful to replace entries of a table based on a given predicate as in Replace(<(0) => newval)
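To make the request concrete, here is a minimal sketch of predicate-based replacement on a plain NamedTuple of vectors. The helper name `replacewhere` is hypothetical and is not part of TableTransforms.jl; it only illustrates the behavior that Replace(<(0) => newval) could have.

```julia
# Hypothetical helper (not the package API): replace every entry of `col`
# that satisfies `pred` with `newval`, converted to the entry's own type.
replacewhere(pred, newval, col) = map(x -> pred(x) ? oftype(x, newval) : x, col)

table = (a = [-1, 2, -3], b = [0.5, -0.5, 1.5])

# map over a NamedTuple applies the function to each column and keeps the names
newtable = map(col -> replacewhere(<(0), 0, col), table)
```

The conversion with `oftype` keeps each column's element type, at the cost of requiring that the new value be convertible to it.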

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Rearrange transforms

Move Levels to after Coerce and Sort to after Filter.

Add documentation with Documenter.jl

The project has grown a lot with the latest additions by @eliascarv and @ceferisbarov ❤️ I think it is time to add a documentation page with Documenter.jl.

In my opinion an ideal documentation for this package would consist of two pages:

A page with the examples from the README and other examples showing revertibility of pipelines.
A page with a list of docstrings for all available transforms and an index table (or side page) that users could click to jump directly to any transform's docstring.

We can then focus on writing very good docstrings and never bother again about updating the documentation. Any newly added transform could just be appended to the docstring page.

Add ColSpec support for more transforms

EigenAnalysis, PCA, SDS, DRS, ProjectionPursuit could all leverage the ColSpec for consistency.

Add support for rolling functions

It would be nice to support rolling a function over data, i.e., running a statistic along a (possibly weighted) data window.

ref: https://github.com/JeffreySarnoff/RollingFunctions.jl

Add `pratio` option to EigenAnalysis

Currently we need to specify the number of output dimensions, but a useful option available in MultivariateStats.jl is pratio, which determines the number of dimensions to preserve in order to account for a given ratio of the total variance.
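As a rough illustration of what such an option would compute, the sketch below picks the number of dimensions from a vector of eigenvalues. The function name `outdim` is hypothetical, and this is not how MultivariateStats.jl or EigenAnalysis implement it.

```julia
# Hypothetical helper: smallest number of dimensions whose eigenvalues
# account for at least `pratio` of the total variance.
function outdim(λ, pratio)
  λs = sort(λ, rev=true)            # largest variance first
  ratios = cumsum(λs) ./ sum(λs)    # cumulative explained-variance ratio
  # fall back to all dimensions if rounding keeps the last ratio below pratio
  something(findfirst(≥(pratio), ratios), length(λs))
end

d = outdim([1.0, 4.0, 3.0, 2.0], 0.8)
```

Here the cumulative ratios are 0.4, 0.7, 0.9 and 1.0, so three dimensions are needed to reach 80% of the variance.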

Allow regular expression in Select/Reject

We should be able to write Select(r"foo") to select all columns containing the string "foo".
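A minimal sketch of the matching logic, assuming the table is a NamedTuple of columns. The function `selectmatch` is a hypothetical name used only for illustration; the actual Select transform may resolve names differently.

```julia
# Hypothetical helper: keep only the columns whose name matches the regex.
function selectmatch(re::Regex, table::NamedTuple)
  names = filter(n -> occursin(re, String(n)), keys(table))
  NamedTuple{names}(table)
end

table = (foo1 = [1, 2], bar = [3, 4], afoo = [5, 6])
selected = selectmatch(r"foo", table)
```

A Reject counterpart would use the negated condition in the filter.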

OneHot should probably produce categorical arrays

Currently the resulting columns are just Vector{Bool}, maybe they should be categorical arrays.

Avoid inverting matrices in EigenAnalysis

We can be more clever and avoid inversions of S here:

TableTransforms.jl/src/transforms/eigenanalysis.jl

Lines 38 to 48 in 1841f05

function drsproj(λ, V)
  Λ = Diagonal(sqrt.(λ))
  S = V * inv(Λ)
  S, inv(S)
end

function sdsproj(λ, V)
  Λ = Diagonal(sqrt.(λ))
  S = V * inv(Λ) * transpose(V)
  S, inv(S)
end

We should be able to express these inverses by swapping the order of the products and by transposing the matrices instead.
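The idea can be sketched as follows, under the assumption that V has orthonormal columns (as is the case for the eigenvectors of a symmetric matrix), so that inv(V) equals transpose(V). The `_noinv` function names are hypothetical and only meant to contrast with the originals.

```julia
using LinearAlgebra

# Assumes V is orthogonal: inv(V * inv(Λ)) == Λ * transpose(V)
function drsproj_noinv(λ, V)
  Λ = Diagonal(sqrt.(λ))
  S = V * inv(Λ)               # inverting a Diagonal is cheap (elementwise)
  Sinv = Λ * transpose(V)
  S, Sinv
end

# Assumes V is orthogonal: inv(V * inv(Λ) * V') == V * Λ * V'
function sdsproj_noinv(λ, V)
  Λ = Diagonal(sqrt.(λ))
  S = V * inv(Λ) * transpose(V)
  Sinv = V * Λ * transpose(V)
  S, Sinv
end

# Quick check on a small symmetric positive-definite matrix
Σ = [2.0 0.5; 0.5 1.0]
λ, V = eigen(Symmetric(Σ))
S1, S1inv = drsproj_noinv(λ, V)
S2, S2inv = sdsproj_noinv(λ, V)
```

The only inverse left is that of a Diagonal matrix, which amounts to taking reciprocals of the diagonal entries.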

OneHotEncoding

The transform takes categorical columns and converts into multiple columns with 1s and 0s using the CategoricalArrays.jl API.
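For reference, a minimal sketch of the encoding on a NamedTuple of columns. It uses `unique` from Base to discover the levels instead of the CategoricalArrays.jl API mentioned above, and the function name `onehot` and the column naming scheme are assumptions made for illustration.

```julia
# Hypothetical helper: replace column `name` by one Bool column per level.
function onehot(table::NamedTuple, name::Symbol)
  col = getproperty(table, name)
  lvls = sort(unique(col))                        # stand-in for levels(col)
  newnames = Tuple(Symbol(name, "_", l) for l in lvls)
  newcols = Tuple(col .== l for l in lvls)
  keep = filter(!=(name), keys(table))            # all other columns
  merge(NamedTuple{keep}(table), NamedTuple{newnames}(newcols))
end

table = (id = [1, 2, 3], size = ["s", "l", "s"])
encoded = onehot(table, :size)
```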

Introduce RowTable and ColTable transforms

For row table inputs, will the output of a transform necessarily have rowaccess? Is there a general principle guiding the form of the output in general? Can I arrange that the output is Tables.materialized?

In MLJ we generally materialize based on input type, but it has been argued that an option to specify the output type would be useful. I'm wondering what the authors of TableTransforms.jl think about such questions.

Migrate to new VSCode test infrastructure

We should experiment with the new test infra: https://discourse.julialang.org/t/prerelease-of-new-testing-framework-and-test-run-ui-in-vs-code/86355

Fix issues in documentation

1. Copy theme from Meshes.jl/GeoStats.jl
2. Add docstring for ProjectionPursuit
3. Remove section on external transforms

Regarding item (3), the list will become obsolete very quickly and it is probably better to just list the dependent packages in TransformsBase.jl's README.

Can TableTransforms do transforms on one row at a time?

Hi there,

I have a project with this pipeline:

1. Prepare data in SQL.
2. For each of 100s of statistical models: select a table, apply transforms, convert to a matrix, and use it for model training.
3. Use the trained models in an agent-based simulation.

The transforms used in Step 2 are also used in Step 3. The difference is that Step 2 applies the transforms to a table, whereas Step 3 applies them to a Dict (representing an agent's state). This state object can be thought of as a row/observation of a table.

With that in mind I've built a transforms package with the following API:

transformtable!(table, t)  # Used in model training
transform!(obs, t)   # Used in simulation

Then:

By default transformtable! applies transform! to each observation one at a time.
Variable-level transforms, which don't need to be applied to each observation (e.g., setlevels), have:
- A custom implementation of transformtable!, which dispatches on the transform's type
- transform!(obs, t) = nothing

This gives me everything I need.
I'd like to open source this functionality, but I'd rather not clutter the ecosystem if this functionality already exists or is planned.
Does it exist in TableTransforms? Or is it planned?

Cheers,
Jock

Unexport internal types and functions

These types:

Transform, Stateless, Colwise

and these functions:

assertions, colapply, colrevert

are part of the internal API and should not be exported.

Allow specification of columns in Functional transform

It would be nice to assign functions to specific columns as in Functional(:a => cos, :b => sin).
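A minimal sketch of the intended semantics on a NamedTuple of columns: each function is broadcast over its own column and the remaining columns pass through unchanged. The name `applyfuns` is hypothetical; the real Functional transform would also need to handle revert.

```julia
# Hypothetical helper: apply `f` elementwise to column `name` for each
# `name => f` pair, leaving all other columns untouched.
function applyfuns(table::NamedTuple, specs::Pair{Symbol}...)
  newcols = (; (name => f.(getproperty(table, name)) for (name, f) in specs)...)
  merge(table, newcols)   # keeps the original column order
end

table = (a = [0.0, 0.0], b = [0.0, 0.0], c = [1, 2])
result = applyfuns(table, :a => cos, :b => sin)
```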

Please move colspec.jl to the root folder

I noticed that you added colspec.jl to the transforms folder, but it is not a transform @eliascarv. Can you please move it to the root folder and update the include?

Treatment of metadata

This issue tracks our efforts to generalize the API to handle metadata (or special columns). The overall idea consists of renaming all apply, revert, reapply functions to internal functions that operate on the feature table. End users won't feel any difference given that the fallbacks don't change the current behavior.

Enable transforms that are functions of multiple variables?

Hi there,

This package looks great!

I have a module in one of my projects that also applies transforms to tabular data, and I'm wondering if I should port those to TableTransforms.jl.

I often have transforms of the form: newcol = f(col1, col2, ...)
E.g., the difference between 2 Date columns, interaction terms, indicator functions, etc.
Although such functions are not yet available in this package, are you open to allowing this?
I guess this amounts to allowing functions of rows, along these lines: newcol[i] = f(row[i]...)

Cheers,
Jock

Add `NarrowTypes` transform that converts columns to the most specific types possible

Add a transform that takes a table with possibly generic column eltypes and returns a new table with specific column types.

Unexpected behavior for row table

I expected this:

julia> X = Tables.table(rand(10, 3))
Tables.MatrixTable{Matrix{Float64}} with 10 rows, 3 columns, and schema:
 :Column1  Float64
 :Column2  Float64
 :Column3  Float64

julia> rt = Tables.rowtable(X)
10-element Vector{NamedTuple{(:Column1, :Column2, :Column3), Tuple{Float64, Float64, Float64}}}:
 (Column1 = 0.5565428815102202, Column2 = 0.9460318649319521, Column3 = 0.3359377652605777)
 (Column1 = 0.5075982188653294, Column2 = 0.18315493443083453, Column3 = 0.9682078694091394)
 (Column1 = 0.9963268867042159, Column2 = 0.9329899214178856, Column3 = 0.5470116730645322)
 (Column1 = 0.3586863538486338, Column2 = 0.028403436910423796, Column3 = 0.37539265155858015)
 (Column1 = 0.6219233944778164, Column2 = 0.3499360310552866, Column3 = 0.14038853850080102)
 (Column1 = 0.587164895366161, Column2 = 0.9365631957076301, Column3 = 0.4638888590010579)
 (Column1 = 0.5896922517858991, Column2 = 0.29876303863960274, Column3 = 0.8361667225744209)
 (Column1 = 0.3924751962812454, Column2 = 0.7981085942387889, Column3 = 0.11151780408924594)
 (Column1 = 0.26805907763502645, Column2 = 0.6863527449570381, Column3 = 0.5322558300965325)
 (Column1 = 0.4000480227789831, Column2 = 0.5888482501962962, Column3 = 0.25538366367233634)

julia> X |> Center()
(Column1 = [0.02869116358486712, -0.020253499060023716, 0.46847516877886275, -0.16916536407671934, 0.09407167655246329, 0.059313177440807885, 0.061840533860546, -0.13537652164410774, -0.25979264029032667, -0.12780369514637002], Column2 = [0.37111666368337826, -0.39176026681773934, 0.35807472016931174, -0.5465117643381501, -0.22497917019328728, 0.36164799445905627, -0.27615216260897113, 0.223193392990215, 0.11143754370846426, 0.013933048947722293], Column3 = [-0.12067737246214477, 0.5115927316864168, 0.09039653534180975, -0.08122248616414235, -0.3162265992219215, 0.007273721278335421, 0.37955158485169843, -0.34509733363347656, 0.07564069237381005, -0.20123147405038616])

But not this:

julia> rt |> Center()
NamedTuple{(), Tuple{}}[]

Add code examples in docstrings

We need code examples in the transformations to have good documentation.

StdNames

Implement a StdNames transform that standardizes the column names.

We could use symbols to specify naming conventions:

StdNames(:snake) # column_name
StdNames(:camel) # ColumnName
StdNames(:upper) # COLUMNNAME
StdNames() # default to :upper

This transform generalizes the functions in janitor and Cleaner.jl.
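The three conventions listed above could be sketched as a plain string function. The name `stdname` and the word-splitting rule (whitespace, hyphens and underscores) are assumptions for illustration, not the behavior of janitor or Cleaner.jl.

```julia
# Hypothetical helper: standardize a single column name.
function stdname(name::AbstractString, spec::Symbol=:upper)
  # assumption: words are separated by whitespace, hyphens or underscores
  words = split(name, r"[\s\-_]+"; keepempty=false)
  if spec == :snake
    join(lowercase.(words), "_")              # column_name
  elseif spec == :camel
    join(uppercasefirst.(lowercase.(words)))  # ColumnName
  elseif spec == :upper
    join(uppercase.(words))                   # COLUMNNAME
  else
    throw(ArgumentError("unknown naming convention: $spec"))
  end
end

snake = stdname("Column Name", :snake)
camel = stdname("column_name", :camel)
upper = stdname("column name")
```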

Levels

Hi there,

Is there appetite for a SetLevels transform that resets the levels of a column of scitype Finite?

Something along the lines of: SetLevels(:mycol, levels)

I often receive binary data with levels ["yes", "no"] and need to reorder them as ["no", "yes"], or simply recast them as [0, 1].

Another example is replacing alphabetical order with a more meaningful order.
E.g., ["large", "medium", "small"] changes to ["small", "medium", "large"]

Add dummy table type with metadata in tests

Our tests currently don't cover the metadata case. We can create a test/dummy.jl file with a simple struct definition that mimics a table with metadata. Then use this table type in all test sets.

Pipeline with `Parallel` and `Select` can not be reverted

Using Select inside a Parallel pipeline makes it no longer possible to use revert and results in an error.

For example the following:

julia> p = Select(:a) ⊔ Select(:b);

julia> t = (a=rand(3), b=rand(3));

julia> r, c = apply(p, t)
((a = [0.7731147741011499, 0.969760210089194, 0.7926637097373099], b = [0.6707783348893381, 0.8919247172804263, 0.6657373751200266]), ((:a, :b), Tuple{Vector{Symbol}, Vector{Vector{Float64}}, Vector{Int64}, Vector{Union{Nothing, Int64}}}[([:b], [[0.6707783348893381, 0.8919247172804263, 0.6657373751200266]], [1], [2]), ([:a], [[0.7731147741011499, 0.969760210089194, 0.7926637097373099]], [1], [1])], (1, 1:1)))

julia> revert(p, r, c)
ERROR: NamedTuple names and field types must have matching lengths
Stacktrace:
 [1] (NamedTuple{(:a, :b), T} where T<:Tuple)(args::Tuple{Vector{Float64}})
   @ Core ./boot.jl:591
 [2] (NamedTuple{(:a, :b), T} where T<:Tuple)(itr::Vector{Vector{Float64}})
   @ Base ./namedtuple.jl:105
 [3] merge(a::NamedTuple{(), Tuple{}}, b::Base.Iterators.Zip{Tuple{Tuple{Symbol, Symbol}, Vector{Vector{Float64}}}})
   @ Base ./namedtuple.jl:261
 [4] revert(p::Parallel, newtable::NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}}, cache::Tuple{Tuple{Symbol, Symbol}, Vector{Tuple{Vector{Symbol}, Vector{Vector{Float64}}, Vector{Int64}, Vector{Union{Nothing, Int64}}}}, Tuple{Int64, UnitRange{Int64}}})
   @ TableTransforms ~/.julia/dev/TableTransforms/src/transforms/parallel.jl:78
 [5] top-level scope
   @ REPL[28]:1

Fix missing docstrings in docs

The docstring of → (\to) is not showing in the docs due to the recent refactoring.

ColSpec is broken for ZScore

MWE:

using TableTransforms

(a=rand(Int,3), b=rand(3)) |> ZScore(:b)

AssertionError: columns must hold continuous variables

assert_continuous(::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}})@assertions.jl:8
applyfeat(::TableTransforms.ZScore{TableTransforms.NameSpec}, ::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}, ::Nothing)@transforms.jl:175
apply@transforms.jl:131[inlined]
Transform@interface.jl:84[inlined]
|>(::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}, ::TableTransforms.ZScore{TableTransforms.NameSpec})@operators.jl:911
top-level scope@Local: 1[inlined]

More general assertion API

Currently we provide a few basic checks as assertion functions that can be used before the transform is applied. For example, some transforms like EigenAnalysis only make sense with continuous variables. We need to refactor this API and consider assertions that apply to a specific ColSpec and assertions that hold state.

In particular, we could return a list of structs in the assertions function that represent different types of assertion. For example, we could consider something like:

"""
    SciTypeAssertion{T}(colspec)

Asserts that the columns in the `colspec` have a scientific type `T`.
"""
struct SciTypeAssertion{T,S}
  colspec::S
end

We could then introduce an API apply(assertion, table) to throw an error whenever the checks fail. The name of the function could also be something else to avoid confusion with the apply of transforms.
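A rough sketch of how such an assertion object could be checked. To stay self-contained it tests element types rather than scientific types, so the struct is called `ElTypeAssertion`. That name, the `check` function and the tuple-of-symbols colspec are all assumptions made for illustration.

```julia
# Stand-in for SciTypeAssertion that checks element types (assumption:
# a real implementation would query scientific types instead).
struct ElTypeAssertion{T,S}
  colspec::S
end

ElTypeAssertion{T}(colspec::S) where {T,S} = ElTypeAssertion{T,S}(colspec)

# Hypothetical name, chosen to avoid confusion with `apply` of transforms.
function check(a::ElTypeAssertion{T}, table::NamedTuple) where {T}
  for name in a.colspec
    col = getproperty(table, name)
    eltype(col) <: T ||
      throw(AssertionError("column $name must hold elements of type $T"))
  end
  nothing
end

table = (a = [1, 2, 3], b = [0.1, 0.2, 0.3])
ok = check(ElTypeAssertion{AbstractFloat}((:b,)), table)
```

Because the assertion only inspects the columns in its colspec, the integer column a does not trigger a failure here.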

Dealing with missing values

I am not sure whether this is intentional, but when the input table has missing values, assert_continuous function throws AssertionError: columns must hold continuous variables, even though all other values are continuous.

Considering that there is still no transform for imputation, I think it would make sense to add an option to skip the missing values.
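One possible shape for this, sketched with Base and the Statistics standard library only. The element-type check is a stand-in for the scientific-type check performed by the package, and both function names are hypothetical.

```julia
using Statistics

# Stand-in check: a column counts as continuous if its non-missing
# element type is a floating-point number.
iscontinuous(col) = nonmissingtype(eltype(col)) <: AbstractFloat

# Center a column while skipping missing values in the mean;
# missing entries stay missing in the result.
center(col) = col .- mean(skipmissing(col))

col = [1.0, missing, 3.0]
centered = center(col)
```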

Restrict transform to columns?

At the moment, in order to apply a Colwise transform to selected columns, one has to use Select. Is there another way?

I am finding that Select has some drawbacks when used in this way: all the non-selected columns are put into the cache, which would not be necessary for a pipeline such as

(Select(:a) → MinMax()) ⊔ (Select(:b) → ZScore())

If I have a large pipeline or table, getting a large cache could be annoying. There is also an issue with revert (#80).

So would it make sense to introduce a wrapper transform (or something else), where we give it a subset of columns, that only applies the (Colwise) transform to the subset of columns? For the pipeline above this could for example look like the following

Restrict(:a)(MinMax()) → Restrict(:b)(ZScore())

Add ColSpec support for all Colwise transforms

We can edit the fallbacks for Colwise to make use of the ColSpec. We need to modify all transforms that are Colwise to include a colspec field and let the fallback access it.

Consider restricting Levels to columns with categorical scitype

Currently the Levels transform contributed by @Frank-III and refactored by @eliascarv can operate directly on any type of column, including columns with the wrong scitype. I think we could consider a more consistent approach where we ask users to be explicit about the scitypes using the Coerce transform first. Even though this is a breaking change, it will provide more safety in real-world applications of pipelines.

Quantile and MinMax fail with constant columns

We need to handle these corner cases properly.
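For MinMax, the failure comes from dividing by xmax - xmin, which is zero for a constant column. The sketch below shows one possible convention (mapping a constant column to `low`). Both the function name `minmaxscale` and that convention are assumptions, not a decision made in this issue.

```julia
# Hypothetical helper: min-max scaling with a guard for constant columns.
function minmaxscale(x; low=0.0, high=1.0)
  xmin, xmax = extrema(x)
  # constant column: avoid 0/0 and return `low` everywhere (one possible choice)
  xmin == xmax && return fill(float(low), length(x))
  @. low + (high - low) * (x - xmin) / (xmax - xmin)
end

scaled = minmaxscale([1, 2, 3])
constant = minmaxscale([5, 5, 5])
```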

Add option to rename columns during selection

A nice feature of the DataFrames.jl select function is that it allows in-place renaming:

select(df, :x => :y, :a => :b)

Currently, we have to write a Select(:x, :a) and then use Rename(:x => :y, :a => :b) on the result. Maybe we could introduce a SelectRename transform or generalize the Select to accept renaming.
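A minimal sketch of the combined operation on a NamedTuple of columns. The function name `selectrename` is hypothetical and only illustrates what a generalized Select could return.

```julia
# Hypothetical helper: select columns and rename them in one step,
# given `old => new` pairs.
function selectrename(table::NamedTuple, specs::Pair{Symbol,Symbol}...)
  olds = first.(specs)   # broadcasting over a tuple returns a tuple
  news = last.(specs)
  NamedTuple{news}(map(o -> getproperty(table, o), olds))
end

table = (x = [1, 2], a = [3, 4], z = [5, 6])
result = selectrename(table, :x => :y, :a => :b)
```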

Move Identity transform to TransformsAPI.jl

The transform is not specific to tables and we should be able to add it to pipelines with other types of objects.

Tests are taking too long

Perhaps some tests are using unnecessarily large example tables? Can we reduce the tables to speed up the testing experience?

Add dim option to EigenAnalysis to reduce the number of dimensions

Currently we retain all scores. It would be nice to have an option to retain only the first dim elements in the basis.

Lazy Select/Reject

Currently our Select/Reject are greedy and generate copies of the input tables. We could create a lazy type for selections similar to what TableOperations.jl does.

Reverting `Select` inside pipeline fails if columntypes differ

For example in the following

df = (; a=rand(Float32, 2), b=rand(2))
p = Select(:b) → MinMax()
newtable, cache = apply(p, df)
revert(p, newtable, cache)

the column a gets converted into a Float64 vector.
If one uses the table df = (; a=[rand(3), rand(3)], b=rand(2)) instead, the inversion fails since it tries to convert Vector{Float64} to Float64.

Is this behavior intentional?

Explicitly creating an Any vector should fix it, i.e.

diff --git a/src/transforms/select.jl b/src/transforms/select.jl
index 5748372..09ae813 100644
--- a/src/transforms/select.jl
+++ b/src/transforms/select.jl
@@ -108,7 +108,7 @@ function revert(::Select, newtable, cache)
   # selected columns
   cols   = Tables.columns(newtable)
   select = Tables.columnnames(cols)
-  scols  = [Tables.getcolumn(cols, name) for name in select]
+  scols  = Any[Tables.getcolumn(cols, name) for name in select]
 
   # rejected columns
   reject, rcols, sperm, rinds = cache

Shall I open a pull request for this fix?
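The conversion can be reproduced outside the package: an untyped array literal or comprehension promotes its elements to a common type, whereas the Any[...] form stores them as they are. The snippet below is a standalone illustration and does not call TableTransforms.

```julia
a = Float32[1, 2]
b = Float64[3, 4]

# Untyped literal: the elements are promoted to a common vector type,
# so the Float32 column is converted to Float64.
promoted = [a, b]

# Any[...] keeps each column exactly as it was.
preserved = Any[a, b]
```

With the second form, eltype(preserved[1]) is still Float32, which is what revert needs in order to return the original column.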
