
Announcement

This project is looking for a new owner. Discussion here.

wordcount

Counting words in different programming languages.

See the article on this project: http://juditacs.github.io/2015/11/26/wordcount.html

or the follow-up article: http://juditacs.github.io/2016/03/19/wordcount2.html

Leaderboard

Full Hungarian Wikipedia

Updated: June 13, 2016

Only the ones that finish are listed. The rest run out of memory.

Rank Experiment CPU seconds User time (s) Maximum memory (KB) Contributor
1 java -Xms2G -Xmx8G -classpath java:java/zah-0.6.jar WordCountOptimized 140.64 191.44 4232308
2 java -Xms2G -Xmx8G -classpath java:java/trove-3.0.3.jar WordCountOptimized 183.17 219.39 4391804
3 rust/wordcount/wordcount 208.86 193.34 6966228 Joshua Holmer
4 cpp/wordcount_clang 224.23 211.76 4933968
5 cpp/wordcount 227.74 213.86 4934204 Dmitry Andreev, Matias Fontanini, Judit Acs
6 c/wordcount 270.15 255.95 2453228 gaebor
7 php7.0 php/wordcount.php 285.01 267.78 4174284 Andrey Bukatov, Braun Patrik
8 d/wordcount 297.95 285.76 6363420 Pavel Chebotarev
9 hhvm php/wordcount.php 315.48 294.06 5076280 Andrey Bukatov, Braun Patrik
10 java -Xms2G -Xmx8G -classpath java WordCountBaseline 378.77 577.71 7036308
11 python/wordcount_py2.py 403.6 389.15 3943716 Judit Acs
12 go/bin/wordcount 410.81 418.64 5274756 David Siklosi
13 scala -J-Xms2G -J-Xmx8g -classpath scala Wordcount 546.28 710.57 7075976
14 mono csharp/WordCount.exe 610.95 588.56 4517848 Joe Amenta, Tim Posey, Peter Szel
15 python/wordcount_py2_baseline.py 627.22 606.13 8920684 Judit Acs
16 php5.6 php/wordcount.php 671.8 646.55 12745360 Andrey Bukatov, Braun Patrik
17 perl/wordcount.pl 942.27 915.93 7206948 Larion Garaczi, Judit Acs
18 cpp/wordcount_baseline 1094.24 990.64 6043624 Judit Acs
19 python/wordcount_py3.py 1192.51 1162.29 7771396 Judit Acs
20 lua lua/wordcount.lua 1344.87 1219.83 7255316 daurnimator
21 julia julia/wordcount.jl 1828.39 1789.74 7457092 Attila Zseder, getzdan
22 elixir/wordcount 2326.65 2290.94 12542340 Norbert Melzer
23 bash/wordcount.sh 2561.07 2643.59 13728 Judit Acs

5 million lines from the Hungarian Wikipedia

Updated: April 16, 2016

Rank Experiment CPU seconds User time (s) Maximum memory (KB) Contributor
1 rust/wordcount/wordcount 21.23 20.28 990008 Joshua Holmer
2 java -Xmx6G -classpath java:java/trove-3.0.3.jar WordCountOptimized 29.76 34.41 962244 Sam Van Oort
3 d/wordcount 29.79 28.97 752676 Pavel Chebotarev
4 cpp/wordcount 33.58 32.51 758728 Dmitry Andreev, Matias Fontanini, Judit Acs
5 cpp/wordcount_clang 33.84 32.95 758724
6 go/bin/wordcount 38.84 37.86 859284 David Siklosi
7 python/wordcount_py2.py 38.89 37.98 595500 Gabor Szabo
8 c/wordcount 39.64 38.49 427180 gaebor
9 php7.0 php/wordcount.php 41.11 27.13 709756 Andrey Bukatov, Braun Patrik
10 mono csharp/WordCount.exe 48.18 46.37 836040 Joe Amenta, Tim Posey, Peter Szel
11 hhvm php/wordcount.php 51.19 31.72 882584 Andrey Bukatov, Braun Patrik
12 python/wordcount_py2_baseline.py 64.56 63.02 1438388 Judit Acs
13 php5.6 php/wordcount.php 82.56 66.12 2113976 Andrey Bukatov, Braun Patrik
14 java -Xmx6G -classpath java WordCountBaseline 89.12 111.27 1602580 Sam Van Oort, Rick Hendricksen, Dávid Márk Nemeskey
15 python/wordcount_py3.py 98.33 96.62 1246564 Judit Acs
16 perl/wordcount.pl 113.04 110.78 1242056 Larion Garaczi, Judit Acs
17 lua lua/wordcount.lua 132.04 116.02 1210248 daurnimator
18 ruby2.3 ruby/wordcount.rb 137.09 131.31 3847648 Joshua Holmer
19 java -classpath java WordCount 143.01 128.67 1828536
20 julia julia/wordcount.jl 152.88 146.15 2490572 Attila Zseder, getzdan
21 scala -J-Xmx2g -classpath scala Wordcount 174.09 221.95 1457272 Hans van den Bogert
22 elixir/wordcount 184.63 181.5 2590384 Norbert Melzer
23 bash/wordcount.sh 276.14 290.33 13608 Judit Acs
24 haskell/WordCount 295.73 286.33 4216656 Larion Garaczi
25 cpp/wordcount_baseline 354.01 338.81 983292 Judit Acs
26 java -cp clojure.jar clojure.main clojure/wordcount.clj 357.75 379.64 2158764 lverweijen
27 elixir elixir/wordcount.ex 417.88 407.45 2545708
28 nodejs javascript/wordcount.js 578.55 577.19 972904 Laci Kundra
29 nodejs typescript/wordcount.js 628.72 604.63 920764 Braun Patrik

The task

The task is to split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.

Rules

  • the input is read from STDIN
  • the input is always encoded in UTF-8
  • output is printed to STDOUT
  • break only on space, tab and newline (do not break on non-breaking space)
  • do not write anything to STDERR
  • the output is tab-separated
  • sort by frequency AND secondary sort in alphabetical order
  • try to write simple code with few dependencies
    • standard library
  • single-thread is preferred but you can add multi-threaded or multicore versions too

The output should contain lines like this:

word <tab> freq

Example

$ echo "apple pear apple art" | python2 python/wordcount.py
apple   2
art     1
pear    1
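
The rules above can be sketched as a minimal Python implementation (an illustrative sketch in the spirit of the py2/py3 baselines, not one of the benchmarked programs):

```python
import sys
from collections import Counter

def wordcount(stream):
    counts = Counter()
    for line in stream:
        # Break ONLY on space, tab and newline. str.split() with no
        # argument would also break on non-breaking space and other
        # Unicode whitespace, which the rules forbid.
        for word in line.replace("\t", " ").replace("\n", " ").split(" "):
            if word:  # drop empty tokens produced by runs of separators
                counts[word] += 1
    return counts

def main():
    # Primary sort: frequency, decreasing; ties broken alphabetically.
    ranked = sorted(wordcount(sys.stdin).items(), key=lambda kv: (-kv[1], kv[0]))
    for word, freq in ranked:
        sys.stdout.write("%s\t%d\n" % (word, freq))

if __name__ == "__main__":
    main()
```

Piping the example input through this script reproduces the output shown above.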

Test corpus: Hungarian Wikisource and Wikipedia

scripts/create_input.sh downloads and unpacks the latest Hungarian Wikisource XML dump. Why Wikisource? It's neither too small nor too large, and more importantly, it's valid UTF-8. Why Hungarian? It has many non-ASCII characters and a high number of distinct word types.

scripts/create_large_input.sh downloads and unpacks the latest Hungarian Wikipedia. This is the largest input used for comparison, see the first leaderboard.

Usage

To test on a small sample:

time cat data/huwikisource-latest-pages-meta-current.xml | head -10000 | python3 python/wordcount_py3.py > python_out

Installation with Docker

I strongly recommend building the Docker image rather than installing every package by hand, but manual installation is possible: see the commands in the Dockerfile.

Docker image

You can run the experiment in a Docker container.

Pull from DockerHub & tag it:

docker pull svanoort/allthelanguages:latest && docker tag svanoort/allthelanguages:latest allthelanguages

If you wish to build locally (this will take quite a while):

docker build -t allthelanguages --rm .

In either case, this requires quite a bit of storage, currently about 2.4 GB.

Run the image, mounting the local directory into the working directory of the docker file as a volume:

docker run -h DOCKER -it --rm -v $(pwd):/allthelanguages allthelanguages as_user.sh $(id -un) $(id -u) $(id -gn) $(id -g)

Changes you make in the folder will show up in the Docker container, and any output (builds, results) will be written to the folder as well. All permissions are maintained.

Note that building the node.js/typescript entries needs root access inside the container, so you'll need to run it this way, which creates files owned by root (you'll need to chown them afterwards):

docker run -h DOCKER -it --rm -v $(pwd):/allthelanguages allthelanguages

Downloading the dataset

bash scripts/create_input.sh

or the full dataset:

bash scripts/create_large_input.sh

Compile/build/whatever the wordcount scripts

bash scripts/build.sh

Run tests on one language

scripts/test.sh runs all tests for one language — strictly speaking, for a single command.

bash scripts/test.sh "python2 python/wordcount_py2.py"

Or

bash scripts/test.sh python/wordcount_py2.py

if the file is executable and has a valid shebang line.

The script either prints OK or the list of failed tests and a final FAIL.

Run tests on all languages

All commands are listed in the file run_commands.txt and the script scripts/test_all.sh runs test.sh with each command:

bash scripts/test_all.sh

Run the actual experiment on a larger dataset

If all tests pass, the programs work reasonably well. This does not mean that all outputs will be identical; see the full test later. For now, we consider them good enough for testing.

This command will run each program twice and append the results to results.txt. It's possible to add a comment at the end of each line.

bash scripts/compare.sh data/huwikisource-latest-pages-meta-current.xml 2 "full huwikisource"

Or test it on a part of huwikisource:

bash scripts/compare.sh <( head -10000 data/huwikisource-latest-pages-meta-current.xml) 1

results.txt is a tab-separated file that can be formatted as a Markdown table with this command:

cat results.txt | python2 scripts/evaluate_results.py

This script prints the fastest run for each command in a Markdown table like this:

Rank Experiment CPU seconds User time (s) Maximum memory (KB) Contributor
1 rust/wordcount/wordcount 20.57 19.79 990008 Joshua Holmer
2 cpp/wc_vector 33.3 31.93 775952 Matias Fontanini, Judit Acs
3 python/wordcount_py2gabor.py 40.13 38.71 594800 Gabor Szabo

Adding a new program

Adding a new programming language or a new version for an existing programming language consists of the following steps:

  1. Add dependencies to the Dockerfile. Basically add the package to the existing apt-get package list.
  2. If it needs compiling or any other setup method, add it to scripts/build.sh
  3. Add the actual invoke command to run_commands.txt
  4. If your executable differs from the source file, add the executable-to-source mapping to binary_mapping.txt; scripts/evaluate_results.py uses it to find the contributors of each program. The file is tab-separated.

Adding your program to this experiment

  1. Make sure all dependencies are installed via standard packages and your code compiles.
  2. Your code passes all the tests.
  3. Make sure it runs in less than two minutes on 100,000 lines of text. If it is slower, it doesn't make much sense to add it.

Contributors

airbreather, bpatrik, coderdreams, crbelaus, daurnimator, davidnemeskey, eksperimental, flababah, gaborszabo88, gaebor, getzdan, hansbogert, juditacs, kpeu3i, kundralaci, larion, leventov, matias-te, nexor, nobbz, shedar, svanoort, szarnyasg, szelpe, timposey2, unipolar, xupwup, zseder

Issues

Concurrency or not concurrency?

There is that rule in the README:

  • single-thread is preferred but you can add multi-threaded or multicore versions too

This reads as if there should be one single-threaded contribution per language, and multi-threaded contributions can then be added as separate entries.

Now there was this cleanup which threw out quite a lot of programs in languages that had multiple entries.

Since Elixir is a language that bets on concurrency, I'd like to retry a concurrent version before starting this in Erlang.

So how do we handle this?

It would be very easy to add a CLI switch/subcommand to the script that enables/disables concurrency, but the question remains: how should it be handled during the runs?

Both Java versions fail on the first two tests

bash scripts/test.sh java -classpath java WordCount

---- java -classpath java WordCount ----
  test1 fails
  test2 fails
FAIL

or testing just one file:

cat data/test/test1.in | java -classpath java WordCount
aaa     3
abc     3
bbb     2
        1
ccc     1

It looks like empty lines are counted as words.
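
The symptom is consistent with splitting on single separator characters and counting the empty strings that consecutive separators produce. Illustrated in Python (an analogy for the pitfall, not the Java code itself):

```python
import re

line = "aaa\n\nbbb\n"
# Splitting on each separator character yields empty strings wherever two
# separators are adjacent -- e.g. at blank lines:
tokens = re.split(r"[ \t\n]", line)
assert tokens == ["aaa", "", "bbb", ""]

# The fix is to drop empty tokens before counting:
words = [t for t in tokens if t]
assert words == ["aaa", "bbb"]
```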

Reorganize setup

Reorganize the setup of everything such that it is much easier to begin with everything.

I struggled a lot during the setup phase, and I'm still unsure if everything is working as it should. Probably either the documentation or the workflow itself should be updated in a way that makes the process more clear for potential contributors.

Python 3 version is not working

When executed with a UTF-8 locale, it outputs completely different counts than e.g. the py2/java versions. When executed with the C locale, it fails at the output phase with

UnicodeEncodeError: 'ascii' codec can't encode character '\xee' in position 0: ordinal not in range(128)
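
Under the C locale, Python 3 encodes stdout as ASCII, so any non-ASCII word crashes exactly like this. A common workaround (a sketch of the usual fix, not a patch verified against this repo's script) is to bypass the locale-dependent text layer and write UTF-8 bytes explicitly:

```python
import io
import sys

def write_counts_utf8(pairs, out=None):
    # Write to a binary stream and encode explicitly, so the output does
    # not depend on the locale's preferred encoding (ASCII under LC_ALL=C).
    out = sys.stdout.buffer if out is None else out
    for word, freq in pairs:
        out.write(("%s\t%d\n" % (word, freq)).encode("utf-8"))

# U+00EE ('î') encodes fine regardless of locale:
buf = io.BytesIO()
write_counts_utf8([("\xee", 1)], out=buf)
assert buf.getvalue() == b"\xc3\xae\t1\n"
```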

The project is looking for a new owner

Due to numerous other engagements I am unable to continue maintaining the project and am therefore looking for a new owner. I will try to fix the current issues (mostly due to bitrot), after which the project will officially be discontinued until someone is willing to take over.

Memory usage of Bash isn't comparable to rest

I think the memory usage shown on the overview page is not comparable to the other languages.

The bash script forks many subprocesses by piping output to other binaries. I think the memory usage of those separate processes is not accounted for.

Rename solutions

  • have one solution for each language except the two original baselines (Python2 and cpp)
  • every source file should be named wordcount (or whatever capitalization is the naming convention in that language)

Use unambiguous docker base image

Using "ubuntu" as base is not best practice. In my case it fails: I installed Docker a long time ago, and my ubuntu image was still at 12.04.

Let's use

FROM ubuntu:14.04

wordcount.js fails on all tests

bash scripts/test.sh node javascript/wordcount.js 

---- node javascript/wordcount.js ----
  test1 fails
  test2 fails
  test3 fails
  test4 fails
FAIL

The test files are located in data/test/test*; the ones ending in .in are the inputs and the corresponding .out files are their expected outputs.

java implementation bug

The Java implementation produces different output:

$ cat data/huwiki-latest-pages-meta-current.xml | /usr/bin/time -v cpp/wordcount_clang >out-cpp.txt
$ cat data/huwiki-latest-pages-meta-current.xml | /usr/bin/time -v java -Xms2G -Xmx8G -classpath java:java/zah-0.6.jar WordCountOptimized >out-java.txt
$ cat data/huwiki-latest-pages-meta-current.xml | /usr/bin/time -v rust/wordcount/wordcount >out-rust.txt
$ wc -c out-*
1208432559 out-cpp.txt
1208432559 out-java.txt
1208432559 out-rust.txt
$ sha256sum out-*
ea81dd93280cfc6e64d0037ba63388cc0fad9f9ac325393c625e8838c8607ed5  out-cpp.txt
71ff0ac64c298edbaff79c81fe9c86e2bdcbc250b3df3a99781241d5db162eee  out-java.txt
ea81dd93280cfc6e64d0037ba63388cc0fad9f9ac325393c625e8838c8607ed5  out-rust.txt
$ java -version
openjdk version "21.0.2" 2024-01-16
OpenJDK Runtime Environment (build 21.0.2+13)
OpenJDK 64-Bit Server VM (build 21.0.2+13, mixed mode, sharing)

Tidy up Dockerfile

The Dockerfile is a mess right now.
Install commands should be grouped by the language they are required for.

Any help would be welcome.

Simple vs. optimized versions

Originally I wanted only one version in each language but simple/vanilla and optimized versions reasonably differ (Java is a prime example of this), so we should support more than one version. Should it be one simple and one optimized or maybe more?

Needs a better dataset for comparison with large data size

The full Hungarian wiki has ~4.3 GB of data, but ~2.5GB of unique string content:

cat data/huwiki-latest-pages-meta-current.xml | sed 's/[\t ]/\n/g' | grep -v ^$ | sort | uniq | wc -m

2507384541

There are ~25M unique tokens.

This means that we are generating gigantic hashtables with generally count = 1, and languages that store Unicode strings as 2-byte representations in memory suffer greatly due to memory overheads. Much of the memory used will simply be storing the unique strings.
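
A back-of-envelope check of those figures (the 25M token count is approximate, so treat this as an order-of-magnitude estimate):

```python
# wc -m output from the pipeline above; includes one newline per token.
unique_chars = 2507384541
unique_tokens = 25_000_000  # approximate

avg_token_len = unique_chars / unique_tokens
assert 100 < avg_token_len < 101  # ~100 characters per unique token

# With a 2-byte in-memory character representation (e.g. UTF-16 strings),
# the raw string payload alone is around 5 GB, before any per-entry
# hashtable overhead:
payload_utf16_gb = unique_chars * 2 / 1e9
assert 5.0 < payload_utf16_gb < 5.1
```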

AWS test environment + automated test/benchmark in Jenkins

I am looking at setting up an AWS environment (spun up on demand only) that will run tests in a fast and automated fashion, using my personal Jenkins host to trigger it when commits are pushed.

Work progress:

  • Create an r3.large instance with ephemeral storage and assigned benchmarking-specific IAM role
  • Create setup script that installs docker, xz, git, starts docker, and pulls the ubuntu-16 allthelanguages docker image
  • Create and attach policy to IAM role for benchmarking that allows read of the dataset S3 bucket, and write of the results bucket
  • Compress huwiki, huwikisource, cleaned huwiki with xz -9 (smallest size) and upload to new S3 buckets (data is a private bucket, benchmark results is initially private, but later public).
  • Add script commands to setup script that will download data from S3 and decompress it
  • Set the AWS host to use ephemeral (instance) storage for /tmp folder
  • Run benchmark using docker
  • Upload first result to S3 - available here
  • Create scripting to grab instance + package info to metadata file
    • Git hash used in build
    • Timestamp
    • Host type, from aws cli
    • Hash of input file
  • Timeouts and resource limits on individual runs (Node.js for example hung on the instance, and needed to be manually killed, another one ran out of RAM and broke the Docker session)
  • Create scripting to name results by run/host info individually
  • Jenkins: job to run tests (inside a resource-limited container) against main wordcount branch + PRs
  • Jenkins - role or similar to allow control of benchmarking host?
    • Public view-only access to builds now enabled on dynamic.codeablereason.com/jenkins
    • HTTPS access added to dynamic.codeablereason.com (with LetsEncrypt)
    • Enforce HTTPS for all but badges/static resources on Jenkins (for performance/access reasons)
    • Enable limited-access users for wordcount use
  • Jenkins - job to fire benchmarks (github triggering)

Hardware/specs:

  • Storage: use SSD instance storage to benchmark (limits instance types). General purpose EBS SSD storage is generally slower and would run out of I/O credits after 1/2 hour (benchmarks need several hours).
  • Memory: either 7 GB (small datasets or where memory is not needed) or 15 GB (large or high-memory datasets).
  • CPUs: 2 or 4 core.
  • Instance types: m3.large (2-core, 7.5 GB RAM) for the small datasets, and r3.large (2-core, 15.25 GB RAM) for big ones. If we do lots of parallelized implementations, add m3.xlarge (4-core, 15 GB RAM).
  • Cost: I am not spending more than $10-15/month on it, beyond my existing Jenkins host (reserved t2.micro) and domain/S3 hosting. Instances will be created to run a set of benchmarks and then terminated, with frequency to keep costs within limits.

Architecture:

  • Instances are spun up by my Jenkins host, with an appropriate IAM role or credentials to do this in a limited way.
  • Benchmark datasets will be self-hosted to not hit their sources hard. They won't be fully public unless small.
  • Instance gets an IAM role that allows uploading to a public (?) S3 results bucket.
  • Instance runs benchmarks on instance storage
  • Instance will upload each result to the S3 bucket as it completes, stamped with the git commit hash, timestamp run, language, etc.
  • All testing will use a reasonable timeout for both individual tests and the whole test set, if it hangs it is killed or skipped.
  • All testing uses the docker image, for reproducibility across hardware.

Two options for how to set it up:

  • EBS based & on-demand instances:
    • Use an EBS volume containing benchmark data and preconfigured system, and just start/stop the instance.
    • When run, the git repo is cloned, the dataset is copied to the data folder, and tests are run & uploaded.
    • Easier to set up and run, but more expensive.
  • S3 based/spot instances:
    • cheaper (1/4 the instance price) but more maintenance.
    • Submit spot bids, instances are configured using the "user data" field to submit a startup script which sets up and runs benchmarks.
    • Private S3 buckets host compressed corpus data, these are fetched and decompressed.

Open questions:

  • What to use for controlling instances?
    • AWS CLI is easy
    • Jenkins AWS EC2 plugin will spin jenkins agents in EC2 (far easier to generate and report results from this), but comes with performance overheads
    • Ansible is kind of amazing and easy to work with

Yesterday I had good results tinkering with a spot-purchased c3.large instance for benchmarking, doing all I/O to the /media/ephemeral0 instance store. Pricing was only about $0.04/hour for the spot buy (bid at 2x the current spot price to prevent termination on price spikes).
