Comments (7)
Interesting. ~100x slower is certainly surprising. I think the {:list, :string} dtype needs a fair bit of tricky memory allocation which might account for the difference, but I wouldn't expect it to be that stark.
By chance, are you able to test this same thing in Polars? Python Polars would be easiest.
from explorer.
@billylanchantin we had some past discussions about this. The main issue I think is that we always traverse the given data, even if a dtype is provided, and we should have a mechanism to say: "this is the dtype, don't try to check or cast it", and we send the data as is, without inferring or casting. The big question is: which one do we want as default? To cast or not to cast? We can probably do this in time for the upcoming v0.9, it is not a big change.
The improvements from #863 should also help here.
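To make the traversal cost concrete, here is a minimal sketch of the idea in Python (chosen only because Python Polars came up earlier in the thread). This is not Explorer's actual code: `infer_dtype`, `from_list`, and the `trust_dtype` flag are hypothetical names, and the sketch only counts how many elements get visited on each path.

```python
# Hypothetical illustration; none of these names exist in Explorer.

def infer_dtype(rows, stats):
    """Visit every nested element to confirm the column is a list of strings."""
    kinds = set()
    for row in rows:
        for item in row:
            stats["visits"] += 1
            kinds.add(type(item).__name__)
    if kinds - {"str"}:
        raise TypeError(f"expected only strings, found: {sorted(kinds)}")
    return ("list", "string")


def from_list(rows, dtype=None, trust_dtype=False):
    """Build a toy 'series'; return it together with the number of elements visited."""
    stats = {"visits": 0}
    if trust_dtype and dtype is not None:
        # Caller vouches for the dtype: hand the data over as is, no traversal.
        return {"dtype": dtype, "data": rows}, stats["visits"]
    # Default path: traverse the data even when a dtype was given.
    inferred = infer_dtype(rows, stats)
    if dtype is not None and dtype != inferred:
        raise TypeError(f"dtype mismatch: {dtype} vs {inferred}")
    return {"dtype": inferred, "data": rows}, stats["visits"]


rows = [["a", "b"], ["c"]]
_, checked_visits = from_list(rows, dtype=("list", "string"))
_, trusted_visits = from_list(rows, dtype=("list", "string"), trust_dtype=True)
print(checked_visits, trusted_visits)
```

The trade-off behind the "which default?" question is visible here: the trusting path skips the traversal, but a wrong dtype is no longer caught before the data is handed over.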
from explorer.
@josevalim Ah gotcha, thanks. Yeah I see #863 has the todo:
- Remove the type checking Elixir-side
When you say "it is not a big change", do you mean finishing #863 isn't big? Or adding a separate mechanism to skip the type check isn't big?
from explorer.
I am not sure how hard the PR is; I meant removing the casting and making the inference on the Elixir side optional.
I pushed a commit that removes the casting: c5c2eb7
And this PR removes the inference: #923 - @spencerkent, can you please try the PR and let us know how it impacts the numbers? Altogether, those changes should remove at least two traversals on the Elixir side.
from explorer.
Hi all, thanks for the quick follow-up! I was out of commission this weekend but had a chance to test PR #923 this morning, and it looks like it does make a large improvement:
Operating System: Linux
CPU Information: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
Number of Available Cores: 16
Available memory: 19.53 GB
Elixir 1.16.0
Erlang 26.2.1
JIT enabled: true
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 1 min
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 2 min 8 s
Benchmarking no_list_column ...
Benchmarking with_list_column ...
Calculating statistics...
Formatting results...
Name                       ips        average  deviation         median         99th %
no_list_column          366.50        2.73 ms    ±54.29%        2.42 ms        7.29 ms
with_list_column         60.24       16.60 ms    ±30.71%       15.37 ms       28.83 ms

Comparison:
no_list_column          366.50
with_list_column         60.24 - 6.08x slower +13.87 ms

Memory usage statistics:
Name                Memory usage
no_list_column         449.64 KB
with_list_column       558.02 KB - 1.24x memory usage +108.38 KB
**All measurements for memory usage were the same**
So, still slower, but much better :)
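For anyone reading the output above, the comparison figures follow directly from the reported ips, average, and memory values. A small Python check, using only the numbers printed in the benchmark:

```python
# Values copied from the benchmark output above.
ips_no_list, ips_with_list = 366.50, 60.24
avg_no_list_ms, avg_with_list_ms = 2.73, 16.60
mem_no_list_kb, mem_with_list_kb = 449.64, 558.02

# Slowdown is the ratio of iterations per second.
slowdown = round(ips_no_list / ips_with_list, 2)
# Extra time per call is the difference of the averages.
extra_ms = round(avg_with_list_ms - avg_no_list_ms, 2)
# Memory is compared the same way: ratio and absolute difference.
mem_ratio = round(mem_with_list_kb / mem_no_list_kb, 2)
extra_kb = round(mem_with_list_kb - mem_no_list_kb, 2)

print(slowdown, extra_ms, mem_ratio, extra_kb)  # 6.08 13.87 1.24 108.38
```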
from explorer.
Hi @spencerkent, feel free to try main, we merged that PR and a couple other improvements by @philss. We have one more down the pipeline. :)
from explorer.
Closing this based on the latest PRs. Maybe there are some list-specific optimizations we could look into, but there should be a sizable improvement right now. Thank you!
from explorer.