
Comments (13)

ddevienne avatar ddevienne commented on May 23, 2024

How does C++17's https://en.cppreference.com/w/cpp/header/charconv compare to those?

from wuffs.

nigeltao avatar nigeltao commented on May 23, 2024

https://git.sr.ht/~atrieu/simple_fastfloat_benchmark/tree/master/item/README.md

says

cmake -B build .
cmake --build build
./build/benchmarks/benchmark

This builds a debug (non-optimized) binary instead of a release (optimized) binary. Wuffs differs from the other libraries you're measuring (netlib, doubleconversion, etc.) in that you're building Wuffs from source. The other libraries are pre-built.


I'm not very familiar with cmake but try replacing

cmake -B build .

with

cmake -B build -DCMAKE_BUILD_TYPE=Release .

Basically, running grep -r ^CXX_FLAGS your_build_directory should show CXX_FLAGS = -O3 -DNDEBUG. In your benchmark.cpp file, you can also add these lines:

#ifndef NDEBUG
#error "Benchmarks should only be compiled with full optimization."
#endif


nigeltao avatar nigeltao commented on May 23, 2024

CC @lemire in case there's anything worth backporting to the upstream simple_fastfloat_benchmark repo. I'll repeat that I'm not very familiar with cmake. I also don't have a Windows system readily available, so I can't test out its "Build on Windows" instructions.


lemire avatar lemire commented on May 23, 2024

The expected file format, in simple_fastfloat_benchmark, is a list of ASCII numbers, one per line, like so...

-65.613616999999977
43.420273000000009
-65.619720000000029
43.418052999999986
-65.625
43.421379000000059
-65.636123999999882
43.449714999999969
-65.633056999999951
43.474709000000132
-65.611389000000031
43.513054000000068
-65.605835000000013
43.516105999999979
-65.598343
43.515830999999935
-65.566101000000003
43.508331000000055
...

I am not sure what happens if you have some other format. I would consider the benchmark results invalid.

The other libraries are pre-built.

The fast_float library is header-only. Except for strtod, everything is built by cmake.

This builds a debug (non-optimized) binary instead of a release (optimized) binary

You'd be correct if everything were left at the defaults, but the CMakeLists.txt file overrides the default build type. You can check that in simple_fastfloat_benchmark the default build is Release:

(screenshot: CMakeLists.txt setting the default CMAKE_BUILD_TYPE to Release)

This does not work under Windows, but the README file provides the instructions:

(screenshot: the README's Windows build instructions)

I'd suggest (though I did not investigate) that the atrocious wuffs results are not meaningful. They are likely caused by pointing the software at an unexpected file format.


lemire avatar lemire commented on May 23, 2024

Though I am not very explicit in the README, it does say that we try to parse each line as a number:

(screenshot: README excerpt stating that each line is parsed as a number)

Please make sure that this is what you are pointing the software at.


atrieu avatar atrieu commented on May 23, 2024

I only followed the instructions from the original README, so it should be compiled in Release mode as I understand it.
I do run the benchmarks on correctly formatted files, I think, e.g.,

.0
.0000
0
0.0
0.00
0.000
0.00000
0.00000000000000000000000000000000000000000000000000000000000
0.000000000000000000000000000000000000000000000000000000000000
0.00000e+0
0.00e+0
0.0e+0
0E1
0E11328
0E3
0E4
0E6
0E8
0E86
0e+0
0e+59
1e-325
2.5e-324
40000e-328
4e-324
5e-324
3.6118391954597633033930746e-310
4.8980019078642963771876647e-310
1.0263847514109591822395668e-309
1.2191449367984545884039792e-309
1.3586641673566107715029324e-309
1.5947055924688004253007524e-309
1.6597654066767748210494446e-309
1.8594943567529094619479745e-309
1.9852357578498988925763729e-309
2.1233738662160330012710983e-309

These were obtained by extracting the fourth column from Nigel's data set, i.e., awk '{print $4}' ../../parse-number-fxx-test-data/data/google-double-conversion.txt > google-double-conversion-extracted.txt

The diff on benchmark.cpp is as follows.

diff --git a/benchmarks/benchmark.cpp b/benchmarks/benchmark.cpp
index 0246082..3380eba 100644
--- a/benchmarks/benchmark.cpp
+++ b/benchmarks/benchmark.cpp
@@ -3,6 +3,12 @@
 #include "absl/strings/numbers.h"
 #include "fast_float/fast_float.h"
 
+#define WUFFS_IMPLEMENTATION
+#define WUFFS_CONFIG__STATIC_FUNCTIONS
+#define WUFFS_CONFIG__MODULES
+#define WUFFS_CONFIG__MODULE__BASE
+#include "wuffs-unsupported-snapshot.c"
+
 #ifdef ENABLE_RYU
 #include "ryu_parse.h"
 #endif
@@ -128,6 +134,28 @@ double findmax_strtod(std::vector<std::string> &s) {
   }
   return answer;
 }
+
+uint32_t wuffs_options = WUFFS_BASE__PARSE_NUMBER_XXX__ALLOW_UNDERSCORES \
+  | WUFFS_BASE__PARSE_NUMBER_XXX__ALLOW_MULTIPLE_LEADING_ZEROES;
+
+double findmax_wuffs(std::vector<std::string> &s) {
+  double answer = 0;
+  double x = 0;
+  for (std::string &st : s) {
+    wuffs_base__slice_u8 sx = {
+      .ptr = (uint8_t*)st.data(),
+      .len = st.size()
+    };
+    wuffs_base__result_f64 res = wuffs_base__parse_number_f64(sx, wuffs_options);
+    if (res.status.repr) {
+      throw std::runtime_error("bug in findmax_wuffs");
+    }
+    x = res.value;
+    answer = answer > x ? answer : x;
+  }
+  return answer;
+}
+
 // Why not `|| __cplusplus > 201703L`? Because GNU libstdc++ does not have
 // float parsing for std::from_chars.
 #if defined(_MSC_VER)
@@ -292,6 +320,7 @@ void process(std::vector<std::string> &lines, size_t volume) {
 #endif
   pretty_print(volume, lines.size(), "abseil", time_it_ns(lines, findmax_absl_from_chars, repeat));
   pretty_print(volume, lines.size(), "fastfloat", time_it_ns(lines, findmax_fastfloat, repeat));
+  pretty_print(volume, lines.size(), "wuffs", time_it_ns(lines, findmax_wuffs, repeat));
 #ifdef FROM_CHARS_AVAILABLE_MAYBE
   pretty_print(volume, lines.size(), "from_chars", time_it_ns(lines, findmax_from_chars, repeat));
 #endif
@@ -308,6 +337,7 @@ void fileload(const char *filename) {
   lines.reserve(10000); // let us reserve plenty of memory.
   size_t volume = 0;
   while (getline(inputfile, line)) {
+    if (0 < line.size() && line[0] == '.') { line = "0" + line; }
     volume += line.size();
     lines.push_back(line);
   }

The defines are the same as in manual-test-parse-number-f64.cc, so hopefully they are correct. I think I'm calling wuffs_base__parse_number_f64 correctly (?). It should throw an error otherwise. The last modification avoids a leading dot, which wuffs' implementation doesn't support; it happens outside the timed section, so it shouldn't affect the numbers.


lemire avatar lemire commented on May 23, 2024

I only followed the instructions from the original README, so it should be compiled in Release mode as I understand.

It should yes.

20 MB/s is very slow, however. Have you done some profiling? Just disable all the other kernels, and run the benchmark with perf record <command line> followed by perf report. You should see something glaring at you.


atrieu avatar atrieu commented on May 23, 2024

No, I had not done any profiling as I don't really know much about it. But here are the results from using perf
(screenshot: perf report for data/ulfjack-ryu-extracted.txt, with the small_xshift functions near the top)

For comparison, on data/canada.txt where wuffs performs more as would be expected.
(screenshot: perf report for data/canada.txt)

I'm not sure I understand what this all means. Are the small_xshift functions too slow?


nigeltao avatar nigeltao commented on May 23, 2024

"Debug vs Release" compiler flags was an incorrect guess. Sorry.

Digging deeper into it, test data like data/ulfjack-ryu-extracted.txt contains many lines like 1.0168286519992372611942638e+100. Crucially, this has more than 19 significant digits, so the Eisel-Lemire algorithm, the fastest algorithm, does not apply. Both Wuffs and fast_float fail over to a slower fallback algorithm.

Wuffs' fallback algorithm is Simple Decimal Conversion. fast_float also used to fall back to SDC, but more recently uses the big-integer arithmetic algorithm. I'm not familiar with this newer algorithm but it's presumably faster than SDC.

(The parse-number-fxx-test-data test suite is more about testing correctness than testing performance. It isn't about collecting a representative sample of real world numbers. For example, when printing float64 numbers, 19 significant digits is often enough.)


atrieu avatar atrieu commented on May 23, 2024

I do understand that data/ulfjack-ryu-extracted.txt is not really representative of real world numbers since Wuffs performs similarly to fast_float on canada.txt for instance. But, for the record, here are the results when using tag v2.0.0 (last release still using the SDC algorithm) of fast_float on data/ulfjack-ryu-extracted.txt.

# read 599458 lines
volume = 14.4666 MB
netlib                                  :   122.65 MB/s (+/- 5.0 %)     5.08 Mfloat/s      86.69 i/B  2193.64 i/f (+/- 0.0 %)     28.83 c/B   729.60 c/f (+/- 2.7 %)      3.01 i/c      3.71 GHz
doubleconversion                        :   285.83 MB/s (+/- 2.0 %)    11.84 Mfloat/s      42.37 i/B  1072.09 i/f (+/- 0.0 %)     11.94 c/B   302.11 c/f (+/- 1.0 %)      3.55 i/c      3.58 GHz
strtod                                  :   177.26 MB/s (+/- 1.6 %)     7.35 Mfloat/s      53.96 i/B  1365.46 i/f (+/- 0.0 %)     19.38 c/B   490.34 c/f (+/- 1.1 %)      2.78 i/c      3.60 GHz
abseil                                  :   544.78 MB/s (+/- 4.0 %)    22.57 Mfloat/s      25.27 i/B   639.51 i/f (+/- 0.0 %)      6.27 c/B   158.59 c/f (+/- 2.7 %)      4.03 i/c      3.58 GHz
fastfloat                               :   629.08 MB/s (+/- 2.0 %)    26.07 Mfloat/s      21.08 i/B   533.39 i/f (+/- 0.0 %)      5.49 c/B   138.99 c/f (+/- 1.0 %)      3.84 i/c      3.62 GHz
wuffs                                   :    21.53 MB/s (+/- 1.5 %)     0.89 Mfloat/s     490.44 i/B 12410.66 i/f (+/- 0.0 %)    160.77 c/B  4068.41 c/f (+/- 0.7 %)      3.05 i/c      3.63 GHz

So, while changing the fallback algorithm to big-integer arithmetic did improve performance by ~100 MB/s, I'm not sure it explains why Wuffs is so slow comparatively.


nigeltao avatar nigeltao commented on May 23, 2024

Running the bisect further back than fast_float v2.0.0 leads to fastfloat/fast_float@05ad45d "Let us try the long path" being the difference. The key part of that patch is:

diff --git a/include/fast_float/simple_decimal_conversion.h b/include/fast_float/simple_decimal_conversion.h
index 410ba05..ef1f0ad 100644
--- a/include/fast_float/simple_decimal_conversion.h
+++ b/include/fast_float/simple_decimal_conversion.h
@@ -368,8 +354,20 @@ adjusted_mantissa compute_float(decimal &d) {
 template <typename binary>
 adjusted_mantissa parse_long_mantissa(const char *first, const char* last) {
     decimal d = parse_decimal(first, last);
+    const uint64_t mantissa = d.to_truncated_mantissa();
+    const int64_t exponent =  d.to_truncated_exponent();
+    adjusted_mantissa am1 = compute_float<binary>(exponent, mantissa);
+    adjusted_mantissa am2 = compute_float<binary>(exponent, mantissa+1);
+    if( am1 == am2 ) { return am1; }
     return compute_float<binary>(d);
 }

Even if we're presented with more than 19 digits, we try Eisel-Lemire twice, for a lower and an upper bound. If the two bounds are equal, return it.

That code was removed in
fastfloat/fast_float@192b271 "Removing dead code" but the optimization presumably lives on somewhere near https://github.com/fastfloat/fast_float/blob/24374ece716db48f974f49da4aa5851aa371cfa9/include/fast_float/parse_number.h#L208-L212 if you follow the trail through adjusted_mantissa, compute_float, compute_error etc.


nigeltao avatar nigeltao commented on May 23, 2024

Thanks @atrieu for the bug report and the informative follow-ups, even when I was confidently wrong. :-)

Your ./build/benchmarks/benchmark -f data/ulfjack-ryu-extracted.txt Wuffs times should be faster now. Let me know if they're not.


atrieu avatar atrieu commented on May 23, 2024

Thanks! Wuffs is definitely faster, here are new numbers.
tencent-rapidjson-extracted still seems to be an outlier, though, if you feel like investigating.

> ./build/benchmarks/benchmark -f data/google-double-conversion-extracted.txt
# read 564745 lines
volume = 12.7842 MB
netlib                                  :   117.11 MB/s (+/- 6.1 %)     5.17 Mfloat/s      95.72 i/B  2272.09 i/f (+/- 0.0 %)     31.37 c/B   744.59 c/f (+/- 3.9 %)      3.05 i/c      3.85 GHz
doubleconversion                        :   298.31 MB/s (+/- 2.3 %)    13.18 Mfloat/s      43.91 i/B  1042.18 i/f (+/- 0.0 %)     12.07 c/B   286.39 c/f (+/- 0.9 %)      3.64 i/c      3.77 GHz
strtod                                  :   174.28 MB/s (+/- 4.0 %)     7.70 Mfloat/s      56.60 i/B  1343.55 i/f (+/- 0.0 %)     20.39 c/B   483.88 c/f (+/- 1.8 %)      2.78 i/c      3.73 GHz
abseil                                  :   560.13 MB/s (+/- 3.3 %)    24.74 Mfloat/s      26.22 i/B   622.35 i/f (+/- 0.0 %)      6.43 c/B   152.66 c/f (+/- 1.6 %)      4.08 i/c      3.78 GHz
fastfloat                               :   748.82 MB/s (+/- 2.3 %)    33.08 Mfloat/s      18.68 i/B   443.46 i/f (+/- 0.0 %)      4.78 c/B   113.55 c/f (+/- 0.8 %)      3.91 i/c      3.76 GHz
wuffs                                   :   160.94 MB/s (+/- 3.0 %)     7.11 Mfloat/s      67.34 i/B  1598.37 i/f (+/- 0.0 %)     23.11 c/B   548.58 c/f (+/- 1.5 %)      2.91 i/c      3.90 GHz
> ./build/benchmarks/benchmark -f data/ulfjack-ryu-extracted.txt
# read 599458 lines
volume = 14.4666 MB
netlib                                  :   131.34 MB/s (+/- 3.5 %)     5.44 Mfloat/s      86.69 i/B  2193.64 i/f (+/- 0.0 %)     28.49 c/B   720.89 c/f (+/- 0.3 %)      3.04 i/c      3.92 GHz
doubleconversion                        :   316.52 MB/s (+/- 3.0 %)    13.12 Mfloat/s      42.37 i/B  1072.09 i/f (+/- 0.0 %)     11.75 c/B   297.43 c/f (+/- 0.8 %)      3.60 i/c      3.90 GHz
strtod                                  :   186.76 MB/s (+/- 4.4 %)     7.74 Mfloat/s      53.96 i/B  1365.46 i/f (+/- 0.0 %)     19.40 c/B   490.92 c/f (+/- 1.7 %)      2.78 i/c      3.80 GHz
abseil                                  :   596.31 MB/s (+/- 3.3 %)    24.71 Mfloat/s      25.27 i/B   639.51 i/f (+/- 0.0 %)      6.27 c/B   158.61 c/f (+/- 0.8 %)      4.03 i/c      3.92 GHz
fastfloat                               :   759.54 MB/s (+/- 1.8 %)    31.47 Mfloat/s      18.42 i/B   466.17 i/f (+/- 0.0 %)      4.70 c/B   118.91 c/f (+/- 0.6 %)      3.92 i/c      3.74 GHz
wuffs                                   :   172.83 MB/s (+/- 2.0 %)     7.16 Mfloat/s      63.21 i/B  1599.60 i/f (+/- 0.0 %)     21.52 c/B   544.48 c/f (+/- 0.4 %)      2.94 i/c      3.90 GHz
> ./build/benchmarks/benchmark -f data/tencent-rapidjson-extracted.txt
# read 3563 lines
volume = 0.0328283 MB
netlib                                  :   176.31 MB/s (+/- 53.5 %)    19.14 Mfloat/s      35.00 i/B   338.14 i/f (+/- 0.0 %)     12.88 c/B   124.45 c/f (+/- 5.2 %)      2.72 i/c      2.38 GHz
doubleconversion                        :   194.22 MB/s (+/- 17.1 %)    21.08 Mfloat/s      58.26 i/B   562.90 i/f (+/- 0.0 %)     18.12 c/B   175.09 c/f (+/- 2.9 %)      3.21 i/c      3.69 GHz
strtod                                  :   138.89 MB/s (+/- 4.7 %)    15.07 Mfloat/s      74.26 i/B   717.41 i/f (+/- 0.0 %)     25.35 c/B   244.94 c/f (+/- 2.5 %)      2.93 i/c      3.69 GHz
abseil                                  :   265.08 MB/s (+/- 5.6 %)    28.77 Mfloat/s      47.16 i/B   455.59 i/f (+/- 0.0 %)     13.29 c/B   128.38 c/f (+/- 3.5 %)      3.55 i/c      3.69 GHz
fastfloat                               :   622.92 MB/s (+/- 9.2 %)    67.61 Mfloat/s      20.78 i/B   200.80 i/f (+/- 0.0 %)      5.66 c/B    54.65 c/f (+/- 6.7 %)      3.67 i/c      3.69 GHz
wuffs                                   :    73.49 MB/s (+/- 4.3 %)     7.98 Mfloat/s     154.97 i/B  1497.23 i/f (+/- 0.0 %)     49.19 c/B   475.23 c/f (+/- 1.1 %)      3.15 i/c      3.79 GHz
> ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines
volume = 1.93374 MB
netlib                                  :   287.27 MB/s (+/- 6.7 %)    16.51 Mfloat/s      31.95 i/B   582.90 i/f (+/- 0.0 %)     12.43 c/B   226.85 c/f (+/- 1.0 %)      2.57 i/c      3.74 GHz
doubleconversion                        :   260.15 MB/s (+/- 3.4 %)    14.95 Mfloat/s      52.52 i/B   958.32 i/f (+/- 0.0 %)     13.76 c/B   251.00 c/f (+/- 1.0 %)      3.82 i/c      3.75 GHz
strtod                                  :   138.97 MB/s (+/- 5.0 %)     7.99 Mfloat/s      70.11 i/B  1279.24 i/f (+/- 0.0 %)     26.11 c/B   476.46 c/f (+/- 1.9 %)      2.68 i/c      3.81 GHz
abseil                                  :   420.22 MB/s (+/- 4.4 %)    24.15 Mfloat/s      31.15 i/B   568.47 i/f (+/- 0.0 %)      8.53 c/B   155.66 c/f (+/- 1.8 %)      3.65 i/c      3.76 GHz
fastfloat                               :  1075.84 MB/s (+/- 5.9 %)    61.82 Mfloat/s      14.14 i/B   257.92 i/f (+/- 0.0 %)      3.36 c/B    61.32 c/f (+/- 2.4 %)      4.21 i/c      3.79 GHz
wuffs                                   :   865.96 MB/s (+/- 5.9 %)    49.76 Mfloat/s      16.96 i/B   309.50 i/f (+/- 0.0 %)      4.18 c/B    76.18 c/f (+/- 2.6 %)      4.06 i/c      3.79 GHz

