jinyus / related_post_gen
Data Processing benchmark featuring Rust, Go, Swift, Zig, Julia etc.
License: MIT License
Does the rule "No: Specific hardware targeting" mean that the code should not be written to target specific hardware, or does it also include compilation options such as Rust's:
File .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
From https://nnethercote.github.io/perf-book/build-configuration.html#cpu-specific-instructions
This doesn't target one specific machine; it compiles for whatever hardware it runs on, which is different.
With that configuration, I observe a ~25% speedup for Rust (on a GitHub codespace), but rust_con is slower.
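A middle ground, for what it's worth, would be pinning an explicit microarchitecture level instead of native (a sketch; x86-64-v3 is a standard rustc target-cpu value, not something the repo currently uses):

```toml
# .cargo/config.toml -- fixed baseline rather than the host CPU:
# the same binary runs on any x86-64 machine with AVX2 support.
[build]
rustflags = ["-C", "target-cpu=x86-64-v3"]
```

This avoids the "compiled for whoever built it" problem while still enabling newer instructions than the plain x86-64 baseline.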
Can someone explain? The overall process of running and checking it locally is not clear to me.
I am clearly a Julia user and supporter - so this is expected to come from me.
Just saying that I would like to (also) see Julia in the description:
Data Processing benchmark featuring Rust, Go, Swift, Zig etc.
This is (obviously) up to you and I also have no idea what are your criteria for including/excluding specific languages from that list.
It would be interesting to see results for the Crystal language.
Currently the code has no explicit license - please choose one (e.g. MIT, Apache, GPL) and license the repo.
@MichalStrehovsky, @delneg: did you try using TLS (thread-local storage) for the count array in the concurrent versions of C# and F#?
dotnet/runtime#82973
In doing the D concurrent SIMD solution, we copied the C# algorithm (very nice!)
We realized the max call isn't necessary. Why?
Basically, the only way you get into the part that checks the vector for larger post counts is when the Vector.LessThanOrEqualAll call fails. This means at least one byte is going to be greater than the minimum count.
But the Max call merely raises the counts below the minimum up to the minimum. It doesn't affect the top5 algorithm at all.
Essentially, the code is the equivalent (for a single number) of:
if(val > minVal)
if(max(val, minVal) != minVal) // might as well be if(true)
So you can remove the whole call to Vector.Max
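A minimal sketch of the argument (in Python, not the actual C#/D SIMD code, with made-up counts): a value clamped up to the running minimum still fails the strictly-greater test, so the top-5 result is identical with or without the clamp.

```python
def top5(counts, clamp=False):
    """Running top-5: only a strictly greater count displaces the minimum."""
    best = [(-1, -1)] * 5                 # (count, index) pairs
    for i, c in enumerate(counts):
        mn = min(best)
        if clamp:
            c = max(c, mn[0])             # the redundant Max: lifts c at most to mn
        if c > mn[0]:
            best[best.index(mn)] = (c, i)
    return sorted(best, reverse=True)

counts = [3, 9, 1, 7, 7, 2, 8, 0, 5, 6]
assert top5(counts, clamp=True) == top5(counts)
```

Any value below the running minimum becomes exactly the minimum after the clamp, which still does not satisfy the strict comparison, so it never enters the top five.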
ping @zigzag312
It seems like I kicked off a conversation about whether the first call to the related function can stay outside of the measured time or not.
Original conversation was here:
#204
In testing on my local machine, switching from C++11 to C++20 and using std::string_view instead of std::string is generally faster. With std::string_view we also don't need to iterate by const reference; we can iterate by value. This abuses the fact that we're keeping the JSON data alive.
Also, I'm not sure using std::chrono::high_resolution_clock is necessarily a good idea. It's not guaranteed to be monotonic; it could end up using the system clock, which can be affected by network time sync and the like.
I'm not really up for doing a PR myself. But I figured I'd mention some of the more low hanging fruit. I'm sure there's more that could be done.
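The clock concern generalizes beyond C++; as a small Python illustration of why a monotonic clock matters for measuring durations:

```python
import time

# time.time() follows the wall clock, which NTP sync or a manual adjustment
# can move backwards; time.monotonic() can only advance, so deltas are safe.
start = time.monotonic()
time.sleep(0.01)                     # stand-in for the work being measured
elapsed = time.monotonic() - start
assert elapsed > 0                   # never negative, unlike wall-clock deltas
```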
It's interesting to see the max scores of performant versions, but if there are rules, I think all implementations should follow them. Also, in some languages there are really simple tricks that speed up processing while adding literally no, or almost no, additional lines of code.
So it seems that in the D implementation, instead of
taggedPostsCount.ptr[idx]++;
it should be:
taggedPostsCount[idx]++;
This applies to both the d and d_con implementations.
https://dlang.org/blog/2016/09/28/how-to-write-trusted-code-in-d/
But it would be nice to have some notes in the Readme, or perhaps an additional comment column or row in the results table, about such tricks.
Most languages/compilers have sane defaults (i.e. targeting x64 means the binary runs on any x64 machine).
Others like to live dangerously and assume something higher. For example, Java/GraalVM builds for x86-64-v3 by default, and the executable will fail to start on any CPU that doesn't have AVX2. -march=compatibility
needs to be passed to the compiler to target the sane baseline.
Can we set compilers to assume AVX2/BMI1/BMI2/etc.?
foreach (tag; post.tags)
foreach (idx; tagMap[tag])
taggedPostsCount[idx]++;
should be replaceable by
foreach (tag; post.tags)
foreach (idx; tagMap[tag])
taggedPostsCount.ptr[idx]++;
because it is proven safe, as stated in the code comment:
"The size of taggedPostsCount is equal to the number of posts (postsCount). Since idx is generated from indices within the posts array itself (through the tagMap), idx will always be a valid index within taggedPostsCount. As such, you won't index taggedPostsCount with an out-of-bounds value."
It turns out this out-performs C++ when built with ReleaseFast.
With the current ReleaseSafe approach, everything is safeguarded at runtime, which adds significant overhead.
If we squat low enough and examine the top scorers, none of them is doing this.
If the benchmark is intended as a real-world representation, I would recommend considering mixing non-ASCII characters into the data.
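For instance (a hypothetical sketch in Python; the field names merely mimic the benchmark's post shape), the test data could mix in multi-byte tags:

```python
import json

# Hypothetical posts; the non-ASCII tags exercise multi-byte string handling
# (hashing, comparison, and any byte-indexed slicing) in every implementation.
posts = [
    {"_id": "post1", "tags": ["history", "café"]},
    {"_id": "post2", "tags": ["日本語", "science"]},
    {"_id": "post3", "tags": ["café", "日本語"]},
]
data = json.dumps(posts, ensure_ascii=False)  # keep raw UTF-8 in the file
assert any(ord(ch) > 127 for ch in data)
```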
I'm working this evening on a revision to the Nim implementation. I noticed that when running ./run.sh rust or ./run.sh nim or whatever, if I then manually edit the output in related_posts_rust.json, it doesn't cause verification to fail. For example, I tried removing some entries from the top-level array, and I also tried replacing the contents of the file with [] or {}; in any of those cases, verify.py does not exit with an error.
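A sketch of the kind of check one might expect (a hypothetical helper, not the repo's actual verify.py): parse both sides and fail on any mismatch, so a truncated array or an empty document cannot pass.

```python
import json

def outputs_match(out_text, expected_text):
    # Parse both sides so formatting differences are ignored,
    # but any removed, added, or altered entry causes a failure.
    try:
        return json.loads(out_text) == json.loads(expected_text)
    except json.JSONDecodeError:
        return False

reference = '[{"_id": "p1", "related": ["p2", "p3"]}]'
assert outputs_match(reference, reference)      # identical output passes
assert not outputs_match("[]", reference)       # emptied array fails
assert not outputs_match("{}", reference)       # wrong top-level type fails
```

A caller would then exit non-zero when outputs_match returns False, so run scripts can detect the failure.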
Previously Java hit OOM (but Graal didn't), and now NumPy does (but plain Python doesn't), though it didn't before:
I suspect your timing is off too, because of conditions on the VM. NumPy doesn't actually do badly; see the other columns, and I believe not even for that one (rather than something in the code having changed).
I would recommend always reviewing when some code doesn't finish, and rerunning; only believe it if it fails repeatedly (or never worked for that column). It seems you order by Total, not by the next-to-last column as I first thought.
Please close the issue once the results have been viewed.
Hyperfine results
./run.sh all main
running all
Running Go
Benchmark 1: ./related
Time (mean ± σ): 7.299 s ± 0.193 s [User: 9.648 s, System: 0.609 s]
Range (min … max): 7.040 s … 7.608 s 10 runs
Running Rust
Benchmark 1: ./target/release/rust
Time (mean ± σ): 67.0 ms ± 7.2 ms [User: 58.4 ms, System: 8.0 ms]
Range (min … max): 54.5 ms … 77.1 ms 50 runs
Running Rust w/ Rayon
Benchmark 1: ./target/release/rust_rayon
Time (mean ± σ): 23.1 ms ± 3.3 ms [User: 59.1 ms, System: 10.4 ms]
Range (min … max): 19.5 ms … 37.5 ms 128 runs
Running Python
Benchmark 1: python3 ./related.py
Time (mean ± σ): 14.966 s ± 0.056 s [User: 14.887 s, System: 0.027 s]
Range (min … max): 14.882 s … 15.056 s 10 runs
Raw Results
./run.sh all main
running all
Running Go
10.01s 17680k
Running Rust
0.08s 8832k
Running Rust w/ Rayon
0.02s 32828k
Running Python
14.92s 24260k
In .NET 8, TieredPGO is enabled by default (https://devblogs.microsoft.com/dotnet/announcing-dotnet-8-preview-5/).
I'm wondering what effect it has on the code in this benchmark.
Can someone run it with and without the option in the project settings:
<TieredPGO>false</TieredPGO>
It would also be interesting how big the effect is for Nim.
I think you could improve it by having multiple solutions per language, similar to how this was done for benchmarking primes in all the different languages:
https://github.com/PlummersSoftwareLLC/Primes
It also seems better to use a Docker container so it's easier to reproduce locally; the shared GitHub runner is unreliable at the best of times, depending on what else is going on at that moment.
Maybe there should be a PR template that populates the description box with a commented-out section like
<!--
💡 Benchmark results in the project's readme are gathered from builds/runs performed in a Microsoft Azure free tier VM:
Standard F4s v2 (4 vcpus, 8 GiB memory)
Your changes or new implementation may perform better in a different environment, e.g. your local machine or a different cloud provider.
To avoid having your contributions reverted because of performance regressions, consider doing a build/run in an equivalent Azure VM before submitting your PR. 🙏
-->
I'm suggesting it because reverts for perf regressions keep landing in main, and contributors may appreciate a reminder to try out their changes in an equivalent environment.
For example, I was disappointed to learn that a consistent 50% speedup on my local machines turns into a 200%+ slowdown in the benchmarking VM.
I now have my own Standard F4s v2 (4 vcpus, 8 GiB memory) VM and am hunting down the problem, but I might have worked with one from the start if it had been suggested during the PR submission process.
I noticed that some languages have their statistics calculated from ten runs while others are evaluated only five times.
I suggest using the same number of runs for all languages: that way, an outlier (on either side) has the chance to affect the mean with the same weight. Right now, an outlier in the 5-run languages carries greater weight.
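The weighting difference is easy to see with a toy example (made-up numbers):

```python
# One 150 ms outlier among otherwise-identical 100 ms runs: with 5 runs it
# shifts the mean by 10 ms, with 10 runs by only 5 ms, i.e. double the weight.
base, outlier = 100.0, 150.0
mean5 = (4 * base + outlier) / 5       # 110.0
mean10 = (9 * base + outlier) / 10     # 105.0
assert mean5 - base == 2 * (mean10 - base)
```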
I found that GCC is now in the dependencies as well.
GCC provides several compilers, including Go, D, and Rust (not finished, WIP).
Would it be interesting from your point of view to add them as separate participants?
If I understand the flow correctly:
related_post_gen/docker_start.sh
Lines 18 to 31 in e7814a4
We execute run.sh three times: first without append, then twice with append.
In both graal and graal_con, we have a check that ensures the test only gets built when append was not specified:
Lines 443 to 447 in e7814a4
However, both graal and graal_con generate the build output into ./java/target/related.
This means that only the 5k run measures graal and graal_con. The rest of the runs measure graal_con because it overwrote graal and we never rebuilt it again.
The benchmark results for graal and graal_con are pretty much the same which seems to corroborate this theory.
Would you care to write a sentence or two in the Readme about your hardware configuration?
Hi,
it seems that the Julia solution uses the StaticArrays.jl package, which uses unsafe code: https://github.com/JuliaArrays/StaticArrays.jl/blob/d80455d2e1aba57a6cf65cc8d787dc6474ec3980/src/MArray.jl#L25
Honestly, my understanding of the package's source is rather poor, so maybe the unsafe part is not called in the solution.
But if the unsafe part is called, then it probably violates the rules.
Compare main
Line 496 in 37b0d1b
Line 494 in 7cf49de
-d is an alias for --depsOnly. From nimble --help:
install [-d, --depsOnly] Only install dependencies. Leave out pkgname
Without -d, nimble derives a package from the local sources and installs it in ${HOME}/.nimble/pkgs2. It doesn't make sense for this project to install nim_con or nim in the local package cache, and it might be screwing up the builds. A lot of work went into nimble, but it also has a lot of problems and limitations.
I added -d for the single-threaded implementation in #237, but that PR wasn't merged and I forgot to add it to run_nim like I did for run_nim_con in #291.
I can make a PR but opened an issue first in case you have a particular reason for not using -d.
It would be good to know what hardware these benchmarks are being measured on.
Would a solution using SIMD registers be allowed if it does not use inline assembly?
.NET 8 will be released by the .NET team this November. Should we add it to the benchmark?
/cc @davidfowl
I see this project not as a race between languages but as a good educational project, where a user of language X can see interesting techniques and assumptions from other users and other languages.
That's why I propose making a table and a small document describing the applied optimizations, so users don't have to monitor and check all the other PRs and implementations but have a single place listing everything applied.
It would be even more useful if the authors of the optimizations provided a small description and the main idea behind them.
Like what I already saw:
Once such descriptions are prepared, it will be possible to create a table, with optimizations in columns and languages in rows, to see which languages used which optimizations.
It would be nice to open discussions for talking through some of these things.
Hi,
I have two questions about the rules. The ReadMe forbids the following:
But, for example, the Rust solution uses the smallstr crate, which contains unsafe code.
Or even Rust's standard library uses unsafe and disables runtime checks.
The situation for .NET is similar: the standard library uses unsafe.
Another example is Zig, which is compiled with ReleaseFast, which according to the documentation disables runtime checks, instead of ReleaseSafe, which enables them.
the following loop
for (const auto &tag : posts[i].tags)
{
auto it = tagMap.insert({tag, std::vector<int>{i}});
if (!it.second) {
it.first->second.push_back(i);
}
}
is not idiomatic and has a performance issue (a std::vector is constructed even when it's not used).
The right (and simpler) way:
for (const auto &tag : posts[i].tags)
{
tagMap[tag].push_back(i);
}
on my machine:
original code: Processing time (w/o IO): 36 ms
optimized code: Processing time (w/o IO): 28 ms
Rationale: node and bun are not the only representatives of JavaScript runtimes.
I noticed that the solution for Julia uses a data structure that does not have bounds checks ([StrideArray](https://github.com/JuliaSIMD/StrideArraysCore.jl)). Is this now allowed? If so, I can drop down to managed pointers (included in .NET) in F# and speed it up by removing bounds checks.
I agree that this is pretty silly. The [rust code uses the `bumpalo` package](https://github.com/jinyus/related_post_gen/blob/14d1b33831a0f610cc7180901b6c2a454479cc7f/rust/src/main.rs#L80C24-L80C24) for allocating in an arena, which also contains explicitly `unsafe` code due to implementing an allocator (which must be `unsafe` due to the `Alloc` trait mandating that):
If Julia is not allowed to use any "unsafe" constructs or dependencies, why is Rust?
I see two possible violations: the use of an unsafe context, and bypassing any standard runtime check. Surely the bounds check is not the only check Julia performs on array access. IIUC the function re-adds the bounds check and GC check and disregards all other runtime checks.
Could you clarify what you mean by "standard runtime checks"? Beyond bounds checking, the "standard check" Julia usually guarantees is liveness of the array, which the GC.@preserve ensures. Comparing to Rust again, using unsafe blocks there also disables "standard runtime checks" (such as borrow checking and bounds checking), so it would have to be disallowed as well for consistency's sake.
Originally posted by @Seelengrab in #194 (comment)
After the discussion here, I think it's becoming clear that prohibiting unsafe in any user library is going to be difficult to uphold while also ensuring fairness.
I see a few challenges here:
bumpalo contains unsafe code, so: owner, please ban bumpalo. To some extent, it seems that some developers are concerned with going over the source code of the packages used by the languages that happen to be at the top. So now we have a non-user of language X reading the code implemented in language X and opening an issue. But then some experienced users, and even language developers, of X intervene and present an expertise-backed opinion, and they can easily be dismissed (because they are biased, and it is natural to take the side of their own language). So where is this going? Are we starting a rust + bumpalo round now?
Maybe an objective solution is to call this the battle of STDLIBs and prohibit any usage of external packages (while also prohibiting any unsafe calls from the stdlib).
But if we do the above, then this benchmark is no longer real-world (because who would avoid battle-tested packages and implement functionality from scratch)?
So how about an even better one: prohibit the developers from doing unsafe tricks, and call it a language-ecosystem benchmark.