jinyus / related_post_gen
Data Processing benchmark featuring Rust, Go, Swift, Zig, Julia etc.
License: MIT License
Does the rule "No: Specific hardware targeting" mean that the code should not be written to target specific hardware, or does it also include compilation options such as Rust's:
File .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
From https://nnethercote.github.io/perf-book/build-configuration.html#cpu-specific-instructions
This doesn't target one specific machine; it compiles for whatever hardware it runs on, which is different.
With that configuration, I observe a ~25% speedup for Rust (on a GitHub codespace), but rust_con is slower.
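A middle ground, for what it's worth, would be pinning an explicit microarchitecture level instead of native (a sketch; x86-64-v3 is a standard rustc target-cpu value, not something the repo currently uses):

```toml
# .cargo/config.toml -- fixed baseline rather than the host CPU:
# the same binary runs on any x86-64 machine with AVX2 support.
[build]
rustflags = ["-C", "target-cpu=x86-64-v3"]
```

This avoids the "compiled for whoever built it" problem while still enabling newer instructions than the plain x86-64 baseline.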
Can someone explain? The overall process of running and checking it locally is not clear to me.
I am clearly a Julia user and supporter - so this is expected to come from me.
Just saying that I would like to (also) see Julia in the description:
Data Processing benchmark featuring Rust, Go, Swift, Zig etc.
This is (obviously) up to you and I also have no idea what are your criteria for including/excluding specific languages from that list.
It would be interesting to see results for the Crystal language.
Currently the code has no explicit license - please choose one (e.g. MIT, Apache, GPL) and license the repo.
@MichalStrehovsky, @delneg: did you try using TLS (thread-local storage) for the count array in the concurrent versions of C# and F#?
dotnet/runtime#82973
In doing the D concurrent SIMD solution, we copied the C# algorithm (very nice!)
We realized the max call isn't necessary. Why?
Basically, the only way you get into the part that checks the vector for larger post counts is when the Vector.LessThanOrEqualAll call fails. This means at least one byte is going to be greater than the minimum count.
But the Max call merely raises the counts below the minimum up to the minimum. It doesn't affect the top5 algorithm at all.
Essentially, the code is the equivalent (for a single number) of:
if(val > minVal)
if(max(val, minVal) != minVal) // might as well be if(true)
So you can remove the whole call to Vector.Max
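A minimal sketch of the argument (in Python, not the actual C#/D SIMD code, with made-up counts): a value clamped up to the running minimum still fails the strictly-greater test, so the top-5 result is identical with or without the clamp.

```python
def top5(counts, clamp=False):
    """Running top-5: only a strictly greater count displaces the minimum."""
    best = [(-1, -1)] * 5                 # (count, index) pairs
    for i, c in enumerate(counts):
        mn = min(best)
        if clamp:
            c = max(c, mn[0])             # the redundant Max: lifts c at most to mn
        if c > mn[0]:
            best[best.index(mn)] = (c, i)
    return sorted(best, reverse=True)

counts = [3, 9, 1, 7, 7, 2, 8, 0, 5, 6]
assert top5(counts, clamp=True) == top5(counts)
```

Any value below the running minimum becomes exactly the minimum after the clamp, which still does not satisfy the strict comparison, so it never enters the top five.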
ping @zigzag312
It seems like I kicked off a conversation about whether the first call to the related function can stay outside of the measured time or not.
Original conversation was here:
#204
In testing on my local machine, switching from C++11 to C++20 and using std::string_view instead of std::string is generally faster. With std::string_view we also don't need to iterate by const reference; we can iterate by value. This abuses the fact that we're keeping the JSON data alive.
Also, I'm not sure using std::chrono::high_resolution_clock is necessarily a good idea. It's not guaranteed to be monotonic; it could end up using the system clock, which can be affected by network time sync and the like.
I'm not really up for doing a PR myself. But I figured I'd mention some of the more low hanging fruit. I'm sure there's more that could be done.
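The clock concern generalizes beyond C++; as a small Python illustration of why a monotonic clock matters for measuring durations:

```python
import time

# time.time() follows the wall clock, which NTP sync or a manual adjustment
# can move backwards; time.monotonic() can only advance, so deltas are safe.
start = time.monotonic()
time.sleep(0.01)                     # stand-in for the work being measured
elapsed = time.monotonic() - start
assert elapsed > 0                   # never negative, unlike wall-clock deltas
```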
It's interesting to see the max scores of performant versions, but if there are rules, I think all implementations should follow them. Also, in some languages there are really simple tricks that speed up processing while adding literally no, or almost no, additional lines of code.
So it seems that in the D implementation, instead of
taggedPostsCount.ptr[idx]++;
it should be:
taggedPostsCount[idx]++;
This applies to both the d and d_con implementations.
https://dlang.org/blog/2016/09/28/how-to-write-trusted-code-in-d/
But it would be nice to have some notes in the Readme, or perhaps an additional comment column or row in the results table, about such tricks.
Most languages/compilers have sane defaults (i.e. targeting x64 means the binary runs on any x64 machine).
Others like to live dangerously and assume something higher. For example, Java/GraalVM builds for x86-64-v3 by default, and the executable will fail to start on any CPU that doesn't have AVX2. -march=compatibility
needs to be passed to the compiler to target the sane baseline.
Can we set compilers to assume AVX2/BMI1/BMI2/etc.?
foreach (tag; post.tags)
foreach (idx; tagMap[tag])
taggedPostsCount[idx]++;
should be replaceable by
foreach (tag; post.tags)
foreach (idx; tagMap[tag])
taggedPostsCount.ptr[idx]++;
because it is proven safe, as stated in the code comment:
"The size of taggedPostsCount is equal to the number of posts (postsCount). Since idx is generated from indices within the posts array itself (through the tagMap), idx will always be a valid index within taggedPostsCount. As such, you won't index taggedPostsCount with an out-of-bounds value."
It turns out this out-performs C++ when built with ReleaseFast.
With the current ReleaseSafe approach, everything is safeguarded at runtime, which adds significant overhead.
If we squat low enough and examine the top scorers, none of them is doing this.
If the benchmark is intended as a real-world representation, I would recommend considering mixing non-ASCII characters into the data.
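For instance (a hypothetical sketch in Python; the field names merely mimic the benchmark's post shape), the test data could mix in multi-byte tags:

```python
import json

# Hypothetical posts; the non-ASCII tags exercise multi-byte string handling
# (hashing, comparison, and any byte-indexed slicing) in every implementation.
posts = [
    {"_id": "post1", "tags": ["history", "café"]},
    {"_id": "post2", "tags": ["日本語", "science"]},
    {"_id": "post3", "tags": ["café", "日本語"]},
]
data = json.dumps(posts, ensure_ascii=False)  # keep raw UTF-8 in the file
assert any(ord(ch) > 127 for ch in data)
```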
I'm working this evening on a revision to the Nim implementation. I noticed that when running ./run.sh rust or ./run.sh nim or whatever, if I then manually edit the output in related_posts_rust.json, it doesn't cause verification to fail. For example, I tried removing some entries from the top-level array, and I also tried replacing the contents of the file with [] or {}; in any of those cases, verify.py does not exit with an error.
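A sketch of the kind of check one might expect (a hypothetical helper, not the repo's actual verify.py): parse both sides and fail on any mismatch, so a truncated array or an empty document cannot pass.

```python
import json

def outputs_match(out_text, expected_text):
    # Parse both sides so formatting differences are ignored,
    # but any removed, added, or altered entry causes a failure.
    try:
        return json.loads(out_text) == json.loads(expected_text)
    except json.JSONDecodeError:
        return False

reference = '[{"_id": "p1", "related": ["p2", "p3"]}]'
assert outputs_match(reference, reference)      # identical output passes
assert not outputs_match("[]", reference)       # emptied array fails
assert not outputs_match("{}", reference)       # wrong top-level type fails
```

A caller would then exit non-zero when outputs_match returns False, so run scripts can detect the failure.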
Previously Java hit OOM (but Graal didn't), and now NumPy does (but plain Python doesn't), though it didn't before:
I suspect your timing is off too, because of conditions on the VM. NumPy doesn't actually do badly; see the other columns, and I believe not even for that one (rather than something in the code having changed).
I would recommend always reviewing when some code doesn't finish, and rerunning; only believe it if it fails repeatedly (or never worked for that column). It seems you order by Total, not by the next-to-last column as I first thought.
Please close the issue once the results have been viewed.
Hyperfine results
./run.sh all main
running all
Running Go
Benchmark 1: ./related
Time (mean ± σ): 7.299 s ± 0.193 s [User: 9.648 s, System: 0.609 s]
Range (min … max): 7.040 s … 7.608 s 10 runs
Running Rust
Benchmark 1: ./target/release/rust
Time (mean ± σ): 67.0 ms ± 7.2 ms [User: 58.4 ms, System: 8.0 ms]
Range (min … max): 54.5 ms … 77.1 ms 50 runs
Running Rust w/ Rayon
Benchmark 1: ./target/release/rust_rayon
Time (mean ± σ): 23.1 ms ± 3.3 ms [User: 59.1 ms, System: 10.4 ms]
Range (min … max): 19.5 ms … 37.5 ms 128 runs
Running Python
Benchmark 1: python3 ./related.py
Time (mean ± σ): 14.966 s ± 0.056 s [User: 14.887 s, System: 0.027 s]
Range (min … max): 14.882 s … 15.056 s 10 runs
Raw Results
./run.sh all main
running all
Running Go
10.01s 17680k
Running Rust
0.08s 8832k
Running Rust w/ Rayon
0.02s 32828k
Running Python
14.92s 24260k
In .NET 8, TieredPGO is enabled by default (https://devblogs.microsoft.com/dotnet/announcing-dotnet-8-preview-5/).
I'm wondering what effect it has on the code in this benchmark.
Can someone run it with and without the option in the project settings:
<TieredPGO>false</TieredPGO>
It would also be interesting how big the effect is for Nim.
I think you could improve it by having multiple solutions per language, similar to how this was done for benchmarking primes in all the different languages:
https://github.com/PlummersSoftwareLLC/Primes
It also seems better to use a Docker container so it's easier to reproduce locally; the shared GitHub runner is unreliable at the best of times, depending on what else is going on at that moment.
Maybe there should be a PR template that populates the description box with a commented-out section like
<!--
💡 Benchmark results in the project's readme are gathered from builds/runs performed in a Microsoft Azure free tier VM:
Standard F4s v2 (4 vcpus, 8 GiB memory)
Your changes or new implementation may perform better in a different environment, e.g. your local machine or a different cloud provider.
To avoid having your contributions reverted because of performance regressions, consider doing a build/run in an equivalent Azure VM before submitting your PR. 🙏
-->
I'm suggesting it because reverts for perf regressions keep landing in main, and contributors may appreciate a reminder to try out their changes in an equivalent environment.
For example, I was disappointed to learn that a consistent 50% speedup on my local machines turns into a 200%+ slowdown in the benchmarking VM.
I now have my own Standard F4s v2 (4 vcpus, 8 GiB memory) VM and am hunting down the problem, but I might have worked with one from the start if it had been suggested during the PR submission process.
I noticed that some languages have their statistics calculated from ten runs while others are evaluated only five times.
I suggest using the same number of runs for all languages: that way, an outlier (on either side) has the chance to affect the mean with the same weight. Right now, an outlier in the 5-run languages carries greater weight.
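The weighting difference is easy to see with a toy example (made-up numbers):

```python
# One 150 ms outlier among otherwise-identical 100 ms runs: with 5 runs it
# shifts the mean by 10 ms, with 10 runs by only 5 ms, i.e. double the weight.
base, outlier = 100.0, 150.0
mean5 = (4 * base + outlier) / 5       # 110.0
mean10 = (9 * base + outlier) / 10     # 105.0
assert mean5 - base == 2 * (mean10 - base)
```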
I found that GCC is now in the dependencies as well.
GCC provides several compilers, including Go, D, and Rust (not finished, WIP).
Would it be interesting from your point of view to add them as separate participants?
If I understand the flow correctly:
related_post_gen/docker_start.sh
Lines 18 to 31 in e7814a4
We execute run.sh three times: first without append, then twice with append.
In both graal and graal_con, we have a check that ensures the test only gets built when append was not specified:
Lines 443 to 447 in e7814a4
However, both graal and graal_con generate the build output into ./java/target/related.
This means that only the 5k run measures graal and graal_con. The rest of the runs measure graal_con because it overwrote graal and we never rebuilt it again.
The benchmark results for graal and graal_con are pretty much the same which seems to corroborate this theory.
Would you care to write a sentence or two in the Readme about your hardware configuration?
Hi,
it seems that the Julia solution uses the StaticArrays.jl package, which uses unsafe code: https://github.com/JuliaArrays/StaticArrays.jl/blob/d80455d2e1aba57a6cf65cc8d787dc6474ec3980/src/MArray.jl#L25
Honestly, my understanding of the package's source is rather poor, so maybe the unsafe part is not called in the solution.
But if the unsafe part is called, then it probably violates the rules.
Compare main
Line 496 in 37b0d1b
Line 494 in 7cf49de
-d is an alias for --depsOnly. From nimble --help:
install [-d, --depsOnly] Only install dependencies. Leave out pkgname
Without -d, nimble derives a package from the local sources and installs it in ${HOME}/.nimble/pkgs2. It doesn't make sense for this project to install nim_con or nim in the local package cache, and it might be screwing up the builds. A lot of work went into nimble, but it also has a lot of problems and limitations.
I added -d for the single-threaded implementation in #237, but that PR wasn't merged and I forgot to add it to run_nim like I did for run_nim_con in #291.
I can make a PR but opened an issue first in case you have a particular reason for not using -d.
It would be good to know what hardware these benchmarks are being measured on.
Would a solution using SIMD registers be allowed if it does not use inline assembly?
.NET 8 will be released by the .NET team this November. Should we add it to the benchmark?
/cc @davidfowl
I see this project not as a race between languages but as a good educational project, where a user of language X can see interesting techniques and assumptions from other users and other languages.
That's why I propose making a table and a small document describing the applied optimizations, so users don't have to monitor and check all the other PRs and implementations but have a single place listing everything applied.
It would be even more useful if the authors of the optimizations provided a small description and the main idea behind them.
Like what I already saw:
Once such descriptions are prepared, it will be possible to create a table, with optimizations in columns and languages in rows, to see which languages used which optimizations.
It would be nice to open discussions for talking through some of these things.
Hi,
I have two questions about the rules. The ReadMe forbids the following:
But, for example, the Rust solution uses the smallstr crate, which contains unsafe code.
Or even Rust's standard library uses unsafe and disables runtime checks.
The situation for .NET is similar: the standard library uses unsafe.
Another example is Zig, which is compiled with ReleaseFast, which according to the documentation disables runtime checks, instead of ReleaseSafe, which enables them.
the following loop
for (const auto &tag : posts[i].tags)
{
auto it = tagMap.insert({tag, std::vector<int>{i}});
if (!it.second) {
it.first->second.push_back(i);
}
}
is not idiomatic and has a performance issue (a std::vector is constructed even when it's not used).
The right (and simpler) way:
for (const auto &tag : posts[i].tags)
{
tagMap[tag].push_back(i);
}
on my machine:
original code: Processing time (w/o IO): 36 ms
optimized code: Processing time (w/o IO): 28 ms
Rationale: node and bun are not the only representatives of JavaScript runtimes.
I noticed that the solution for Julia uses a data structure that does not have bounds checks ([StrideArray](https://github.com/JuliaSIMD/StrideArraysCore.jl)). Is this now allowed? If so, I can drop down to managed pointers (included in .NET) in F# and speed it up by removing bounds checks.
I agree that this is pretty silly. The [rust code uses the `bumpalo` package](https://github.com/jinyus/related_post_gen/blob/14d1b33831a0f610cc7180901b6c2a454479cc7f/rust/src/main.rs#L80C24-L80C24) for allocating in an arena, which also contains explicitly `unsafe` code due to implementing an allocator (which must be `unsafe` due to the `Alloc` trait mandating that):
If Julia is not allowed to use any "unsafe" constructs or dependencies, why is Rust?
I see two possible violations: the use of an unsafe context, and bypassing any standard runtime check. Surely the bounds check is not the only check Julia performs on array access. IIUC the function re-adds the bounds check and GC check and disregards all other runtime checks.
Could you clarify what you mean by "standard runtime checks"? Beyond bounds checking, the "standard check" Julia usually guarantees is liveness of the array, which the GC.@preserve ensures. Comparing to Rust again, using unsafe blocks there also disables "standard runtime checks" (such as borrow checking and bounds checking), so it would have to be disallowed as well for consistency's sake.
Originally posted by @Seelengrab in #194 (comment)
After the discussion here, I think it's becoming clear that prohibiting unsafe in any user library is going to be difficult to uphold while also ensuring fairness.
I see a few challenges here:
bumpalo contains unsafe code, so: owner, please ban bumpalo. To some extent, it seems that some developers are concerned with going over the source code of the packages used by the languages that happen to be at the top. So now we have a non-user of language X reading the code implemented in language X and opening an issue. But then some experienced users, and even language developers, of X intervene and present an expertise-backed opinion, and they can easily be dismissed (because they are biased, and it is natural to take the side of their own language). So where is this going? Are we starting a rust + bumpalo round now?
Maybe an objective solution is to call this the battle of STDLIBs and prohibit any usage of external packages (while also prohibiting any unsafe calls from the stdlib).
But if we do the above, then this benchmark is no longer real-world (because who would avoid battle-tested packages and implement functionality from scratch)?
So how about an even better one: prohibit the developers from doing unsafe tricks, and call it a language-ecosystem benchmark.