Comments (4)
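Here are some comparisons between DataVecs and Vectors. DataVecs are a bit faster with nafilter(x) (about 0.20 s versus 0.27 s in the runs below). For everything else, Vectors are faster, often by quite a bit: roughly 30x for a plain sum(x) and roughly 90x for sum(naFilter(x)).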
This is a little unfair in that I tweaked naFilter ahead of time to run better. Also, the code needed for NA support in vectors is here:
https://github.com/tshort/JuliaData/blob/floatNA/src/alternate_NA.jl
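N = 10000000
v = randn(N)                # plain Vector of N random Float64 values
datavec = DataVec(v)        # the same values wrapped in a DataVec
idx = randi(N, 10000)       # 10000 random indices to overwrite with NA
vna = copy(v)
vna[idx] = NA               # Vector with NAs (needs the alternate_NA.jl code linked above)
datavecna = copy(datavec)
datavecna[idx] = NA         # DataVec with NAs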
# Results without NA's in the data:
julia> @time sum(v)
elapsed time: 0.06444692611694336 seconds
-3206.747631560774
julia> @time sum(datavec)
elapsed time: 2.0190951824188232 seconds
-3206.7476315604167
julia> @time sum(nafilter(v))
elapsed time: 0.265045166015625 seconds
-3206.747631560774
julia> @time sum(nafilter(datavec))
elapsed time: 0.19718289375305176 seconds
-3206.747631560774
julia> @time sum(naFilter(v))
elapsed time: 0.046386003494262695 seconds
-3206.7476315604167
julia> @time sum(naFilter(datavec))
elapsed time: 4.131659030914307 seconds
-3206.7476315604167
# Results with NA's in the data:
julia> @time sum(vna)
elapsed time: 0.05554509162902832 seconds
NaN
julia> @time sum(datavecna)
no method +(Float64,NAtype)
in method_missing at base.jl:60
in sum at reduce.jl:63
julia> @time sum(nafilter(vna))
elapsed time: 0.23963308334350586 seconds
-3061.658194729886
julia> @time sum(nafilter(datavecna))
elapsed time: 0.18987488746643066 seconds
-3061.658194729886
julia> @time sum(naFilter(vna))
elapsed time: 0.04549598693847656 seconds
-3061.658194729652
julia> @time sum(naFilter(datavecna))
elapsed time: 4.118844985961914 seconds
-3061.658194729652
from dataframes.jl.
With a DataVec-specific method for sum, one should be able to make these as fast as the Vector versions.
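To make that suggestion concrete, here is a rough sketch of what a type-specific sum could look like. It is written in current Julia syntax, not the 2012 syntax used in the timings above, and it does not use the real DataVec type: MaskedVec, its data and na fields, and the choice to skip NA entries are all assumptions made for illustration. The idea is only that a specialized method can walk the underlying storage in a single pass instead of going through generic element access or building a filtered copy.

```julia
# Hypothetical stand-in for DataVec (not the real type):
# the raw values plus a Boolean mask marking which entries are NA.
struct MaskedVec{T}
    data::Vector{T}
    na::BitVector
end

# Specialized sum: a single loop over the raw storage that skips
# masked (NA) entries and allocates no intermediate array.
# Skipping NAs here is an assumption of this sketch.
function Base.sum(mv::MaskedVec{T}) where {T}
    total = zero(T)
    for i in eachindex(mv.data)
        if !mv.na[i]
            total += mv.data[i]
        end
    end
    return total
end

mv = MaskedVec([1.0, 2.0, 3.0, 4.0], BitVector([false, true, false, false]))
println(sum(mv))  # prints 8.0 (the masked 2.0 is skipped)
```

Whether this actually matches Vector speed would need to be measured with the same @time comparisons as above.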
We should build upon this benchmark suite once we've settled a bit more on the basic behaviors that DataVecs and DataFrames should exhibit. One question I have: where will we store results as they accumulate over time? In a CSV that gets updated and kept on GitHub? Running this benchmark suite repeatedly is very important if we want to make sure that new changes actually improve the overall performance of DataFrames rather than improving one component at the expense of others.
Closed by 319eab6
Please add new benchmarks using the Benchmark package. Please append results to benchmarks/results.csv so that we can track performance over time.