
related_post_gen's People

Contributors

akhildevelops, algunion, artemkakun, azrethos, copper280z, cyrusmsk, dave-andersen, delneg, drblury, fizmat, ilevd, imperatorn, inv2004, jinyus, kaiwu, kokizzu, leetwinski, masmullin2000, masonprotter, matthewcrews, mcronce, michaelsbradleyjr, michalstrehovsky, mo8it, moelf, noobsenslaver, schoettl, thelissimus, tortar, zigzag312


related_post_gen's Issues

[Rules] Specific hardware targeting (rust)

Does the rule "No: Specific hardware targeting" mean that the code should not be written to target specific hardware, or does it also cover compilation options such as the following for Rust:

File .cargo/config.toml

[build]
rustflags = ["-C", "target-cpu=native"]

From https://nnethercote.github.io/perf-book/build-configuration.html#cpu-specific-instructions

This doesn't target one specific piece of hardware; it compiles for whatever hardware the build runs on, which is different:

  • The code works on any hardware
  • Only the compiled binary is restricted to the hardware it was compiled on (and even without the flag, a Rust binary is still restricted to its CPU architecture class)

With that configuration, I observe a ~25% speedup for rust (on a GitHub codespace), though rust_con gets slower.

[Question] process of running benchmark

Can someone explain:

  • How does it run? Is it an external Azure server that runs Docker, or a remote GitHub runner?
  • Does it use Docker runner images?
  • Where can I get the images (Docker images?) to check locally?
  • Does it follow the GitHub workflow YAML or a Dockerfile?
  • How can I run it locally in the same environment?

The overall process of running and checking it locally is not clear.

Suggestion: add Julia to the repo description

I am clearly a Julia user and supporter - so this is expected to come from me.

Just saying that I would like to (also) see Julia in the description:

Data Processing benchmark featuring Rust, Go, Swift, Zig etc.

This is (obviously) up to you, and I also have no idea what your criteria are for including/excluding specific languages from that list.

test crystal

It would be interesting to see the results for the Crystal language.

C# concurrent solution needlessly uses max

In doing the D concurrent SIMD solution, we copied the C# algorithm (very nice!)

We realized the max call isn't necessary. Why?

Basically, the only way you get into the part to check the vector for larger post counts is when the Vector.LessThanOrEqualAll call fails. This means at least one byte is going to be greater than the minimum count.

But then the Max call just brings up all the counts less than the minimum to the minimum. It doesn't affect the top5 algorithm at all.

Essentially, the code is equivalent (for a single number) to:

if(val > minVal)
   if(max(val, minVal) != minVal) // might as well be if(true)

So you can remove the whole call to Vector.Max
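
To make the argument concrete, here is a scalar Python sketch with made-up counts (not the benchmark code): values at or below the current minimum are never picked up by the top-5 logic, so clamping them up to the minimum changes nothing.

min_count = 4
counts = [1, 3, 7, 2, 9]  # the branch is only entered because at least one count exceeds min_count

# without the Max/clamp step
kept_without_max = [c for c in counts if c > min_count]

# with the Max/clamp step first
clamped = [max(c, min_count) for c in counts]
kept_with_max = [c for c in clamped if c > min_count]

assert kept_without_max == kept_with_max == [7, 9]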

ping @zigzag312

Potential C++ Improvements

In testing on my local machine, switching from C++11 to C++20 and using std::string_view instead of std::string is generally faster. With std::string_view we also don't need to iterate by const reference; we can iterate by value. This abuses the fact that we're keeping the JSON data alive.

Also, I'm not sure using std::chrono::high_resolution_clock is necessarily a good idea. It's not guaranteed to be monotonic; it could end up using the system clock, which can be affected by network time sync and the like.

I'm not really up for doing a PR myself, but I figured I'd mention some of the lower-hanging fruit. I'm sure there's more that could be done.

D implementation

It's interesting to see the best scores of highly tuned versions, but if there are rules, I think all implementations should follow them. Also, in some languages there are really simple tricks that speed up processing while adding literally no (or almost no) additional code.

So it seems that in the D implementation, instead of

taggedPostsCount.ptr[idx]++;

should be:

taggedPostsCount[idx]++;

This applies to both the d and d_con implementations.

https://dlang.org/blog/2016/09/28/how-to-write-trusted-code-in-d/

It would also be nice to have some notes in the README, or perhaps an additional comment column or row in the results table, for such cases.

What's the assumed CPU baseline?

Most languages/compilers have sane defaults (i.e. targeting x64 means one can run the binary on any x64 machine).

Others like to live dangerously and assume something higher. For example, Java/GraalVM builds for x86-64-v3 by default, and the executable will fail to start on any CPU that doesn't have AVX2; -march=compatibility needs to be passed to the compiler to target the sane baseline.

Can we set compilers to assume AVX2/BMI1/BMI2/etc.?

Allow .ptr if proven safe?

		foreach (tag; post.tags)
			foreach (idx; tagMap[tag])
				taggedPostsCount[idx]++;

should be replaceable by

		foreach (tag; post.tags)
			foreach (idx; tagMap[tag])
				taggedPostsCount.ptr[idx]++;

because it is provably safe, as stated in the code:

"The size of taggedPostsCount is equal to the number of posts (postsCount). Since idx is generated from indices within the posts array itself (through the tagMap), idx will always be a valid index within taggedPostsCount. As such, you won't index taggedPostsCount with an out-of-bounds value."

Can Zig use ReleaseFast to build the target?

It turns out Zig outperforms C++ when built with ReleaseFast.
With the current ReleaseSafe approach, everything is safety-checked at runtime, which adds significant overhead.
If we look closely at the top scorers, none of them is doing this.

Does verify.py work as intended?

I'm working this evening on a revision of the Nim implementation. I noticed that when running ./run.sh rust or ./run.sh nim or whatever, if I then manually edit the output in related_posts_rust.json, it doesn't cause verification to fail. For example, I tried removing some entries from the top-level array, and I also tried replacing the contents of the file with [] or {}; in none of those cases does verify.py exit with an error.
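
For reference, a stricter check could look roughly like the Python sketch below. This is not the actual verify.py; the reference file name (expected_related_posts.json) and the flat-list structure are assumptions made purely for illustration.

import json
import sys

# Hypothetical strict verification: compare the generated output against a
# known-good reference, entry by entry, and exit non-zero on any mismatch.
with open("related_posts_rust.json") as f:
    actual = json.load(f)
with open("expected_related_posts.json") as f:  # assumed reference file
    expected = json.load(f)

if len(actual) != len(expected):
    sys.exit(f"length mismatch: {len(actual)} != {len(expected)}")

for i, (a, e) in enumerate(zip(actual, expected)):
    if a != e:
        sys.exit(f"entry {i} differs")

print("ok")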

Rerun benchmarks

Earlier, Java hit OOM (but Graal didn't); now NumPy does (but Python doesn't), even though it didn't before:

6bcf716

I suspect the timing is off too, because of conditions on the VM. NumPy doesn't actually do badly (see the other columns), and I believe the same holds for that column; it's not that something in the code changed.

I would recommend always reviewing when some code doesn't finish, and rerunning; only believe the failure if it happens repeatedly (or if that column never worked). Also, it seems you order by Total, not by the next-to-last column as I first thought.

Results on a Lenovo x13s w/ Arch Arm Linux

Please close the issue once the results have been viewed.

Hyperfine results

./run.sh all                                                                                                                           main 
running all
Running Go
Benchmark 1: ./related
  Time (mean ± σ):      7.299 s ±  0.193 s    [User: 9.648 s, System: 0.609 s]
  Range (min … max):    7.040 s …  7.608 s    10 runs

Running Rust
Benchmark 1: ./target/release/rust
  Time (mean ± σ):      67.0 ms ±   7.2 ms    [User: 58.4 ms, System: 8.0 ms]
  Range (min … max):    54.5 ms …  77.1 ms    50 runs

Running Rust w/ Rayon
Benchmark 1: ./target/release/rust_rayon
  Time (mean ± σ):      23.1 ms ±   3.3 ms    [User: 59.1 ms, System: 10.4 ms]
  Range (min … max):    19.5 ms …  37.5 ms    128 runs

Running Python
Benchmark 1: python3 ./related.py
  Time (mean ± σ):     14.966 s ±  0.056 s    [User: 14.887 s, System: 0.027 s]
  Range (min … max):   14.882 s … 15.056 s    10 runs

Raw Results

./run.sh all                                                                                                                           main 
running all
Running Go
10.01s 17680k
Running Rust
0.08s 8832k
Running Rust w/ Rayon
0.02s 32828k
Running Python
14.92s 24260k

Improvements

I think you could improve it by having multiple solutions per language, similar to how it was done for benchmarking primes in all the different languages:
https://github.com/PlummersSoftwareLLC/Primes

It also seems better to use a Docker container so runs are easier to reproduce locally; the shared GitHub runner is unreliable at the best of times, depending on what else is going on at that moment.

Reverts

Maybe there should be a PR template that populates the description box with a commented-out section like

<!--
💡 Benchmark results in the project's readme are gathered from builds/runs performed in a Microsoft Azure free tier VM:

Standard F4s v2 (4 vcpus, 8 GiB memory)

Your changes or new implementation may perform better in a different environment, e.g. your local machine or a different cloud provider.

To avoid having your contributions reverted because of performance regressions, consider doing a build/run in an equivalent Azure VM before submitting your PR. 🙏
-->

I'm suggesting it because reverts for perf regressions keep landing in main and contributors may appreciate a reminder to try out their changes in an equivalent environment.

For example, I was disappointed to learn a consistent 50% speedup on my local machines turns into a 200%+ slowdown in the benchmarking VM.

I now have my own Standard F4s v2 (4 vcpus, 8 GiB memory) VM and am hunting down the problem, but I might have worked with one from the start if it had been suggested in the process of submitting a PR.

Difference in the number of runs

I noticed that some languages have the statistics calculated based on ten runs while others are only evaluated five times.

I suggest using the same number of runs for all languages: that way, an outlier (in either direction) affects the mean with the same weight. Right now, an outlier in the 5-run languages carries more weight.
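
A quick numerical illustration of that weighting (Python, with made-up timings in seconds):

# The same single outlier shifts the mean twice as much when averaged
# over 5 runs as when averaged over 10 runs.
runs_5 = [1.00] * 4 + [2.00]
runs_10 = [1.00] * 9 + [2.00]

mean_5 = sum(runs_5) / len(runs_5)     # 1.20 -> outlier adds 0.20
mean_10 = sum(runs_10) / len(runs_10)  # 1.10 -> outlier adds 0.10

print(mean_5, mean_10)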

GCC compilers

I noticed that GCC is now among the dependencies as well.
GCC provides compilers for several other languages, including Go and D, with a Rust frontend still a work in progress.

Would it be interesting, from your point of view, to add them as separate participants?

Is the graalvm benchmark measuring the concurrent version?

If I understand the flow correctly:

echo "Run Benchmark (5k posts)" &&
./run.sh "$TEST_NAME" raw_results.md &&
#
echo "Generate $RUN2 posts" &&
python gen_fake_posts.py "$RUN2" &&
#
echo "Run Benchmark (15k posts)" &&
./run.sh "$TEST_NAME" raw_results.md append &&
#
echo "Generate $RUN3 posts" &&
python gen_fake_posts.py "$RUN3" &&
#
echo "Run Benchmark (30k posts)" &&
./run.sh "$TEST_NAME" raw_results.md append &&

We execute run.sh three times: the first time without append, then twice with append.

In both graal and graal_con, we have a check that ensures the test only gets built when append was not specified:

related_post_gen/run.sh

Lines 443 to 447 in e7814a4

if [ -z "$appendToFile" ]; then # only build on 5k run
java -version &&
mvn -q -B clean package &&
mvn -q -B -Pnative,pgo package
fi &&

However, both graal and graal_con generate the build output into ./java/target/related.

This means that only the 5k run measures graal and graal_con separately. The later runs measure graal_con in both cases, because its build overwrote graal's and we never rebuilt it.

The benchmark results for graal and graal_con are pretty much the same, which seems to corroborate this theory.

[Nim] nimble flags

Compare main

nimble install -y &&

With what I had in #291
nimble -y install -d &&

-d is an alias for --depsOnly

From nimble --help

install
[-d, --depsOnly] Only install dependencies. Leave out pkgname

Without -d, nimble derives a package from the local sources and installs it in ${HOME}/.nimble/pkgs2. It doesn't make sense for this project to install nim_con or nim in the local package cache, and it might be screwing up the builds. A lot of work went into nimble, but it's got a lot of problems and limitations also.

I added -d for the single-threaded implementation in #237, but that PR wasn't merged and I forgot to add it to run_nim like I did for run_nim_con in #291.

I can make a PR but opened an issue first in case you have a particular reason for not using -d.

Clarification for the use of SIMD/Assembly

Context

  • The README.md originally said that assembly cannot be inlined.
  • .NET can detect the presence of hardware acceleration for 128- and 256-bit registers
  • .NET enables the use of such instructions without having to write inline assembly

Question

Would a solution using SIMD registers be allowed if it does not inline assembly?

Optimizations techniques description

I see this project not as a race between languages, but as a good educational project, where a user of language X can see interesting techniques and assumptions from users of other languages.

That's why I propose making a table and a small document describing the applied optimizations, so users don't have to monitor and check all the other PRs and implementations, but instead have a single place listing everything that was applied.

It would be even more useful if the author of each optimization provided a short description and its main idea.

Like what I already saw:

  • using pointers instead of strings in Go
  • building a List first and initializing the HashMap from that array afterwards (probably to avoid many allocations while appending post indexes to the HashMap)
  • several languages using a flat list for the top-5 calculation (see the sketch below)
  • etc...

Once such descriptions are prepared, it will be possible to create a table with optimizations as columns and languages as rows, to see which languages used which optimizations.
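
As a rough illustration of the flat-list top-5 idea mentioned above, here is a minimal Python sketch. It is not taken from any of the implementations; the function name and the example input are made up.

# Track the top 5 (count, index) pairs in two small flat lists, shifting
# entries down insertion-sort style whenever a count beats the current
# minimum. This avoids sorting the whole counts array or using a heap.
def top5(counts):
    top_counts = [0] * 5
    top_indices = [0] * 5
    min_count = 0
    for idx, count in enumerate(counts):
        if count > min_count:
            pos = 3
            # find where the new count belongs among the current top 5
            while pos >= 0 and count > top_counts[pos]:
                top_counts[pos + 1] = top_counts[pos]
                top_indices[pos + 1] = top_indices[pos]
                pos -= 1
            top_counts[pos + 1] = count
            top_indices[pos + 1] = idx
            min_count = top_counts[4]
    return top_indices

print(top5([3, 1, 7, 7, 2, 9, 4, 0]))  # -> [5, 2, 3, 6, 0]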

Discussions

It would be nice to have the Discussions feature enabled for talking through some of these things.

Rules clarification

Hi,

I have two questions about the rules. The README forbids the following:

  • Unsafe code blocks
  • Disabling runtime checks (bounds etc)

But, for example, the Rust solution uses the smallstr crate, which contains unsafe code.
Even Rust's standard library uses unsafe and skips runtime checks in places.
The situation for .NET is similar: its standard library uses unsafe.

Another example is Zig, which is compiled with ReleaseFast, which according
to the documentation disables runtime checks,
instead of ReleaseSafe, which enables them.

  1. Is it OK to use a library that uses unsafe code or assembly, or that disables runtime checks?
  2. What about compiler settings? Is it okay to disable runtime checks with compiler flags?

cpp performance issue

the following loop

for (const auto &tag : posts[i].tags)
{
    auto it = tagMap.insert({tag, std::vector<int>{i}});
    if (!it.second) {
        it.first->second.push_back(i);
    }
}

is not idiomatic and has a performance issue (a std::vector is created even when it isn't used).
the right (and simpler) way:

for (const auto &tag : posts[i].tags)
{
   tagMap[tag].push_back(i);
}

on my machine:
original code: Processing time (w/o IO): 36 ms
optimized code: Processing time (w/o IO): 28 ms

Bounds checking a requirement?

I noticed that the solution for Julia uses a data structure that does not have bounds checks: StrideArray (https://github.com/JuliaSIMD/StrideArraysCore.jl). Is this now allowed? If so, I can drop down to managed pointers (included in .NET) in F# and speed it up by removing bounds checks.

delete

I agree that this is pretty silly. The Rust code uses the bumpalo package for allocating in an arena (https://github.com/jinyus/related_post_gen/blob/14d1b33831a0f610cc7180901b6c2a454479cc7f/rust/src/main.rs#L80C24-L80C24), and bumpalo also contains explicitly unsafe code because it implements an allocator (which must be unsafe, since the Alloc trait mandates that):

https://github.com/fitzgen/bumpalo/blob/4ff9827c00d7baa9bf532a9d0bf56f8b4e240087/src/alloc.rs#L203C15-L219

If Julia is not allowed to use any "unsafe" constructs or dependencies, why is Rust?

I see two possible violations: the use of an unsafe context, and bypassing standard runtime checks. Surely the bounds check is not the only check Julia performs on array access. IIUC, the function re-adds the bounds check and the GC check and disregards all other runtime checks.

Could you clarify what you mean by "standard runtime checks"? The "standard checks" beyond bounds checking that Julia usually guarantees are liveness of the array, which the GC.@preserve ensures. Comparing to Rust again, using unsafe blocks there also disables any "standard runtime checks" (such as borrow checking & bounds checking), so it would have to be disallowed as well for consistency's sake.

Originally posted by @Seelengrab in #194 (comment)

We reached the witch-hunting phase

After the discussion here, I think it is becoming clear that prohibiting unsafe code in any user library is going to be difficult to uphold while also ensuring fairness.

I see a few challenges here:

  • The developer challenge: is everybody willing to provide an implementation here under the obligation to invest a tremendous amount of time checking every nook and cranny of the packages they use, so that a case can be made that there is no unsafe code?
  • The watching-over-the-fence issue: "I don't like that Rust is fast. I feel I have reached the limits of my language, so let's take a peek: Rust uses bumpalo, which contains unsafe code; owner, please ban bumpalo." To some extent, it seems that some developers are busy going over the source code of the packages used by the languages that happen to be at the top. So now we have non-users of language X reading code implemented in language X and opening issues. But when experienced users, and even language developers, of X intervene and present an expertise-backed opinion, they can be easily dismissed (because they are biased, and it is natural to take the side of their own language).
  • The owner's challenge: the owner of the repository cannot have expertise in all the languages used, and so cannot ensure a fair judgment on all these issues.

So where is this going? Are we starting a rust + bumpalo round now?

Maybe an objective solution is to call this the battle of STDLIBS and prohibit any usage of any external package (while also prohibiting any unsafe call from stdlib).

But if we do the above, then this benchmark is no longer real-world, because who would avoid battle-tested packages and implement the functionality from scratch?

So how about an even better one: prohibit the developers themselves from doing unsafe tricks, and call it a language-ecosystem benchmark.
