Comments (25)
https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/ suggests that the number of digits for double could be well over 1000. (link provided by @plokhotnyuk).
Okay. I understand now, why you wrote 'well over 1000' and I wrote 768. What I meant to say is that we need to perform the costly conversion from decimal to binary with up to 768 digits. Leading zero digits are cheap: we need to skip over leading zeroes that are before the decimal point, and we need to count leading zeros that are after the decimal point.
In the end, I think jackson-core is better off enforcing limits on the number of chars in a number than trying to eke out better performance for edge case scenarios.
Yes. If a 'number' in a JSON has more than 24 characters, it is probably not a double value.
from fastdoubleparser.
Phew! Very interesting discussion, some of which I even understood. :)
(but more importantly, people that matter are talking and grok it)
From what I can gather, it does not sound like there would be any simple fixed (independent of textual representation) length to safely use to truncate, below something like 768 characters (varies b/w float
, double
, BigDecimal
but that aside).
So in case of Jackson we do 2-phase processing:
- Tokenize valid number representations (as per JSON spec). At this point we are now adding basic configurable max length where there previously were none. I think initial limit is 1024 but I guess we could tune it to 768 safely. Regardless, at this point we do not know type at which user/app wants to access number as
- On accessing value as specific numeric type, we perform decoding. At this point we do know target type and could further limit length if we chose to -- but ideally we'd quietly drop whatever (if anything) we can.
But. From above numbers it also does seem like there was diminishing return -- the goal (for me & Jackson, I think) is not to try to optimize performance of sub-optimal cases for legit but misguided users, but to try to limit DoS attacks. If 1k characters would still give over 60k / sec, that does not look too worrisome in grand scheme of things.
Having said that, if there was a way to determine possible truncation, it seems to me that some basic heuristics ("only worry about truncation if charlen above 100; otherwise proceed as-is") could be useful.
from fastdoubleparser.
I get the following results on my Mac mini 2018:
# JMH version: 1.35
# VM version: JDK 19, OpenJDK 64-Bit Server VM, 19+36-2238
Benchmark Mode Cnt Score Error Units
DoubleParserBench.double10 thrpt 5 38852169.446 ± 2051582.451 ops/s
DoubleParserBench.double1000 thrpt 5 1615511.309 ± 52510.462 ops/s
DoubleParserBench.double1000000 thrpt 5 1591.276 ± 42.500 ops/s
DoubleParserBench.fastDouble10 thrpt 5 52012968.516 ± 4261938.881 ops/s
DoubleParserBench.fastDouble1000 thrpt 5 658186.310 ± 86431.775 ops/s
DoubleParserBench.fastDouble1000000 thrpt 5 669.971 ± 17.593 ops/s
The FastDoubleParser is slower for long digit sequences, because it falls back to Double.parseDouble(). Therefore it scans the digits twice. This is not a useful vector for a denial of service attack, because it is a linear overhead.
For double numbers, we indeed stop parsing the significand after 768 digits¹. This is not visible in the code, because of the fallback to Double.parseDouble(). What you can see in the code, is that we scan all digits and do some multiplications that we then throw away.
To demonstrate this, I have included the JavaBigDecimalParser
in your benchmark. I also added the annotations @OutputTimeUnit(TimeUnit.NANOSECONDS) @BenchmarkMode(Mode.AverageTime)
. This yields the following results:
Benchmark Mode Cnt Score Error Units
DoubleParserBench.double10 avgt 5 25.480 ± 0.437 ns/op
DoubleParserBench.double1000 avgt 5 623.658 ± 22.041 ns/op
DoubleParserBench.double1000000 avgt 5 623618.515 ± 24134.389 ns/op
DoubleParserBench.fastDouble10 avgt 5 18.165 ± 0.181 ns/op
DoubleParserBench.fastDouble1000 avgt 5 1371.018 ± 9.989 ns/op
DoubleParserBench.fastDouble1000000 avgt 5 1408092.544 ± 8572.876 ns/op
DoubleParserBench.fastBigDecimal10 avgt 5 19.451 ± 0.703 ns/op
DoubleParserBench.fastBigDecimal1000 avgt 5 6786.623 ± 77.915 ns/op
DoubleParserBench.fastBigDecimal1000000 avgt 5 228543520.443 ± 4037012.871 ns/op
We can see that only the JavaBigDecimalParser
shows the massive slow down for very long digit sequences. For BigDecimal
numbers, we have to parse all those digits. If we did the same for Double
numbers, we would see the same slow down.
I am aware that the JavaDoubleParser
is slower than Double.parseDouble
for long digit sequences. My intention is to provide a faster implementation once the pull request for JDK-8205592 is integrated into the JDK.
¹ See 'Daniel Lemire, Number Parsing at a Gigabyte per Second, Software: Practice and Experience 51 (8), 2021, Chapter 11 Processing Numbers Quickly': "it may be necessary to read tens or even hundreds of digits (up to 768 digits in the worst case)"
from fastdoubleparser.
Shouldn't JSON number
inputs with more than 24 characters be rejected anyway?
(I think Double.toString(double) never produces more than 24 characters. I haven't tested this though).
from fastdoubleparser.
Thanks @wrandelshofer - with jackson-core, we are introducing checks on number sizes. From your detailed analysis, it looks like we should probably have a few different sizes that we allow. So far, the unreleased dev code has one limit that defaults to 1000 chars. This is fine for BigDecimal/BigInteger. For Floats and Doubles, we should have different and much smaller limits.
from fastdoubleparser.
@wrandelshofer You are considering that JSON is limited to what Javascript supports but I think this is not really the case -- format is specified in terms of textual representation and thereby languages/platforms can and do support higher precision too: for Java BigDecimal
supports arbitrary lengths (or at least way past 24 characters).
So I don't think 24 would be a good limit in practice as it may impact some use cases.
But at the same time we do want to impose limits based on performance characteristics.
from fastdoubleparser.
Ok, another probably naive question: would it make sense, from Jackson side, to truncate "too long" input Strings for double
/ float
parsing: so we'd have higher max limit for allowed tokens (... because checks are done at the point where eventually desired type [ float, double, BigDecimal ] is not yet known), but only use "useful portion"?
Or the problem really that due to algorithm needed for base-2-fp-from-decimal-string may actually up to hundreds of digits in certain cases, meaning that for full accuracy this is not an option?
from fastdoubleparser.
@wrandelshofer You are considering that JSON is limited to what Javascript supports but I think this is not really the case ...
No, no, I just conjectured this from the topic of this issue
I believe it is sensible to assume that the producer and the conumser of data agree on the format of the data.
Producing JSON numbers with superfluous digits is most likely a bug if the consumer is known to interpret them as double values. So, it would make sense to accept no more than 24 characters by default. Being more lenient should need extra configuration.
Likewise, if someone needs to parse BigDecimal with hundreds of digits, then they have to provision their system for this workload. But they may not want to provision their system for BigDecimals with thousands or even millions of digits.
from fastdoubleparser.
Or the problem really that due to algorithm needed for base-2-fp-from-decimal-string may actually up to hundreds of digits in certain cases, meaning that for full accuracy this is not an option?
These are crafted string representations. Libraries (like the JDK) do not produce strings with that many digits from a double. Therefore, In my opinion, this should not be accepted by default.
from fastdoubleparser.
Ok, another probably naive question: would it make sense, from Jackson side, to truncate "too long" input Strings for
double
/float
parsing: so we'd have higher max limit for allowed tokens (... because checks are done at the point where eventually desired type [ float, double, BigDecimal ] is not yet known), but only use "useful portion"?
No, because this is what Double.parseDouble
and JavaDoubleParser
already do.
They have to scan through all digits though, because there can be an exponent after the digits.
Scanning is very fast in Double.parseDouble
, and somewhat slower in JavaDoubleParser
.
from fastdoubleparser.
Ok, thank you @wrandelshofer -- very good points.
Now, as to maximums: one reason why I definitely would not want to add failing-beyond-24-digits case is that we have no idea of how many users would now get failure on previously working use cases -- that is, getting an exception where there used to be none. Never mind that usage was sub-optimal and most of, say, 60-digit value was ignored.
My gut feeling is that there'd be enough to get lots of bug reports for the change, and for most users it comes down to "why did you break my system" from their POV. In a way it is also case of "if it ain't broke don't fix it", if we leave out legit DoS concerns; there's not much upside from maintainer perspective. Breaking perceived backwards-compatibility is generally a bad idae.
However. I'd be quite interested in knowing about quietly truncating unused digits if there is specific limit -- that sounds like something that would be backwards-compatible approach. If we can quietly truncate reminder of useless tail for double
/float
parsing before calling FDP (or JDK equivalent), sure: that would avoid even more of waste.
So.... what kind of limit would be safe? Since JSON does not allow leading zeroes in integral part, truncating beginning of the String would work. Although leading zeroes in fractional part would count so maybe it's not that trivial.
from fastdoubleparser.
Or the problem really that due to algorithm needed for base-2-fp-from-decimal-string may actually up to hundreds of digits in certain cases, meaning that for full accuracy this is not an option?
These are crafted string representations. Libraries (like the JDK) do not produce strings with that many digits from a double. Therefore, In my opinion, this should not be accepted by default.
Yes, but they may produce them from BigDecimal
, no? (or anything else on other platforms that produces/uses higher precision).
But what I was trying to get at was specifically performance differences between BigDecimal
(which internally is 10-based, right?) and double
(etc). I realize that as per its name FastDoubleParser focuses on its area, but Jackson supports both and as such there are places where stricter limits can only be imposed when we know actual usage via API calls -- but not yet at point of decoding where verification just checks that JSON number representation is followed (which does not, at spec level, impose any specific limits but just states implementations may impose their limits).
from fastdoubleparser.
However. I'd be quite interested in knowing about quietly truncating unused digits if there is specific limit -- that sounds like something that would be backwards-compatible approach. If we can quietly truncate reminder of useless tail for
double
/float
parsing before calling FDP (or JDK equivalent), sure: that would avoid even more of waste.So.... what kind of limit would be safe? Since JSON does not allow leading zeroes in integral part, truncating beginning of the String would work. Although leading zeroes in fractional part would count so maybe it's not that trivial.
Truncating involves parsing the number. We have to truncate from the first non-zero digit. That digit can be before or after the decimal point in the significand. So, it is best done inside the parser. Otherwise we would have to scan and parse the number twice.
Of course, we would have to be faster than Double.parseDouble()
. And that's no easy feat. On my trusty Mac mini 2018, I get the following performance with Double.parseDouble()
:
digits | ns/op | ns/digit | op/sec |
---|---|---|---|
24 | 181 | 7.56 | 5'512'709.5519 |
1 | 14 | 13.59 | 73'561'865.5289 |
10 | 22 | 2.22 | 45'069'406.8866 |
100 | 277 | 2.77 | 3'610'668.8042 |
1'000 | 441 | 0.44 | 2'267'661.1117 |
10'000 | 4'426 | 0.44 | 225'926.5133 |
100000 | 47'251 | 0.47 | 21'163.4093 |
1000000 | 469'428 | 0.47 | 2'130.2525 |
10000000 | 4'821'577 | 0.48 | 207.4010 |
100000000 | 48'427'968 | 0.48 | 20.6492 |
1000000000 | 490'691'955 | 0.49 | 2.0379 |
2147483643 | 1'040'876'939 | 0.48 | 0.9607 |
We can have up to Integer.MAX_VALUE - 4
= 2147483643 characters in a String.
The table shows that Double.parseDouble()
takes 0.48 ns per digit.
The table is also interesting for assessing whether there is a need for imposing a maximal character length for numbers in Jackson.
Double.parseDouble()
accepts up to Integer.MAX_VALUE - 4
= 2147483643 characters per value.
If we allow for that many input characters, we can guarantee a performance of 0.9607 values per second.
However, if we are able to limit number values to 24 characters, we can guarantee a performance of 5.5 million values per second.
from fastdoubleparser.
https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/ suggests that the number of digits for double could be well over 1000. (link provided by @plokhotnyuk).
With floats, the limit is a lot lower (149 if I read the values correctly).
In the end, I think jackson-core is better off enforcing limits on the number of chars in a number than trying to eke out better performance for edge case scenarios.
from fastdoubleparser.
If you have crazily long input strings, you can almost always determine exactly the right number with only the leading 19 digits... See section 11 of the paper at https://arxiv.org/pdf/2101.11408.pdf cited by @wrandelshofer above...
It involves adding at most 2-3 lines of code...
It is also conceptually trivial. Suppose that I have...
1.321342134212321321321...321312e10
and I want to parse it... suppose I pick a bunch of leading digits...
1.321342134212e10
now if I add one to these digits...
1.3213421342123e10
The exact number I need, is somewhat between these two, right? Well, if I parse both and they end up generating the same IEEE floating-point number, then it implies that I do not need to look further... I can discard the leftover digits.
from fastdoubleparser.
https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/ suggests that the number of digits for double could be well over 1000. (link provided by @plokhotnyuk).
No, that page does not suggest that.
The "Summary" section shows a table with the value 767 for row 'double / Max Fraction Digits Significant'. The paragraph below the table states: "It is the maximum number of digits in a fraction that determines the maximum number of digits for a given IEEE format."
This is consistent with the 768 digits in Lemire's paper.
from fastdoubleparser.
If you have crazily long input strings, you can almost always determine exactly the right number with only the leading 19 digits... See section 11 of the paper at https://arxiv.org/pdf/2101.11408.pdf cited by @wrandelshofer above...
It involves adding at most 2-3 lines of code...
This is awesome!
However this is way more than 2-3 lines of code!
In an earlier iteration, I had ported all code for the 'slow path' from your fast_float project. But at that time, the code was a lot slower than the one in java.util.Double
. So I removed it later on. Now I am waiting until JDK-8205592 is resolved. When this code becomes available, I will be able to port the 'slow path' with much less work to Java. And - hopefully - it will be faster than `java.util.Double.parseDouble(String)'.
This issue is about a malicious use of the parser. I expect that a malicious actor will use a number that can only be resolved by processing the maximal number of digits.
from fastdoubleparser.
However this is way more than 2-3 lines of code!
Not really. It is much easier than you seem to imagine.
Here is the pseudocode:
adjusted_mantissa am = compute_float(exponent, mantissa);
if(too_many_digits) {
if(am != compute_float(exponent, mantissa + 1)) {
go to slow path;
}
}
If you really don't see it, I can try to do a PR.
This issue is about a malicious use of the parser. I expect that a malicious actor will use a number that can only be resloved after the maximal number of digits has been processed.
The trick that I describe will solve the issue that you encounter with the benchmarks and make it harder for malicious users to trick you.
But even then... you need to rescale the running time with respect to the size of the input.
Let us look at that...
DoubleParserBench.fastDouble1000 avgt 5 1371.018 ± 9.989 ns/op
DoubleParserBench.fastDouble1000000 avgt 5 1408092.544 ± 8572.876 ns/op
So you have 1000 more digits... and the code gets... well... 1000 times slower. This means that on a per-byte basis, you have constant time performance.
I mean... If I send you a JSON file that is 1000 x larger and you take 1000 x longer to process it, it is fine. It is not much of an attack opportunity.
At that point, you are probably more vulnerable for out-of-memory errors...
from fastdoubleparser.
Not really. It is much easier than you seem to imagine.
I am sure you are right.
I imagined, that you proposed code hat progressively consumes additional digits until it is able to compute the proper value of the mantissa.
However, when I look at your code snippet again, then I believe now that it corresponds to code that I have already ported from fast_float to Java.
Is this correct?
from fastdoubleparser.
Is this correct?
Yes. And this code should (if it is correct) cover 99% of the cases if you randomly generate digits.
from fastdoubleparser.
Great.
Since I am malicious, I used a sequence of the character '7' though.
from fastdoubleparser.
But what I was trying to get at was specifically performance differences between
BigDecimal
(which internally is 10-based, right?) anddouble
(etc).
The significand of BigDecimal is 2-based, its exponent is 10-based. Therefore we still need to convert the significand from 10-base to 2-base.
For BigDecimal we can only drop leading zeroes of the significand; we have to convert all other digits of the significand. Because of this the conversion is very costly.
On my trusty Mac mini 2018, I get the following performance with new BigDecimal(String)
:
Digits | ns/op | ns/digit | op/sec | duration/op |
---|---|---|---|---|
24 | 117 | 5 | 8'562'670 | 0ms |
1 | 12 | 12 | 80'173'174 | 0ms |
10 | 21 | 2 | 48'132'461 | 0ms |
100 | 481 | 5 | 2'079'793 | 0ms |
1'000 | 14'833 | 15 | 67'419 | 0ms |
10'000 | 1'200'633 | 120 | 833 | 1ms |
100000 | 117'587'913 | 1'176 | 8.5043 | 118ms |
1000000 | 12'027'995'151 | 12'028 | 0.0831 | 12s 28ms |
10000000 | 1'281'686'999'490 | 128'169 | 0.0008 | 21m 21s 687ms |
100000000 | 128'168'699'949'000 | 1'281'687 | 0.00000780 | 1d 11h 36m 8s 700ms |
1000000000 | 12'816'869'994'900'000 | 12'816'870 | 0.00000008 | 21w 1d 8h 14m 29s 995ms |
1292782621 | 21'420'666'987'609'800 | 16'569'427 | 0.00000005 | 35w 2d 22h 11m 6s 988ms |
The values in italics are estimates.
from fastdoubleparser.
Do we need to make adjustmens in FastDoubleParser?
The current behavior is as follows:
float, double
- asymptotic performance: O(n) where n is the number of input characters
- throws NumberFormatException if n exceeds Integer.MAX_VALUE - 4
- silently crops significand to fixed size of the binary mantissa
- silently yields infinity if exponent is out of range
BigInteger
- asymptotic performance: O(nc) where n is the number of input characters
- throws NumberFormatException if n exceeds 1,292,782,622
- throws NumberFormatException if the significand does not fit into 2^31 - 1 bits
BigDecimal
- asymptotic performance: O(nc) where n is the number of input characters
- throws NumberFormatException if n exceeds 1,292,782,635
- throws NumberFormatException if the significand does not fit into 2^31 - 1 bits
- throws NumberFormatException if the exponent is out of range
from fastdoubleparser.
I am now integrating FFT multiplication from the project https://github.com/tbuktu/bigint/tree/floatfft
Using this multiplication algorithm, we can give very good guarantees for the maximal required computation time and memory usage.
The computation times are given for a Mac mini 2018 with Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz.
The memory usage seem large. But they are not really, considering that we have to perform multiplications of huge bit sequences for parsing a BigInteger/BigDecimal number.
Parser | Result Type | Maximal input length |
Memory usage JVM -Xmx |
Computation Time |
---|---|---|---|---|
JavaDoubleParser | java.lang.Double | 2^31 - 5 | 10 gigabytes | < 5 sec |
JavaFloatParser | java.lang.Float | 2^31 - 5 | 10 gigabytes | < 5 sec |
JavaBigIntegerParser | java.math.BigInteger | 1,292,782,622 | 14 gigabytes | < 7 min |
JavaBigDecimalParser | java.math.BigDecimal | 1,292,782,635 | 14 gigabytes | < 7 min |
from fastdoubleparser.
The BigInteger and BigDecimal parsers with improved worst-case performance are now available in release 0.6.0.
from fastdoubleparser.
Related Issues (20)
- make it allocation free on happy path HOT 6
- release jar with jdk8 compatible HOT 2
- float parser HOT 4
- SWAR routines accept invalid non-digit chars/bytes HOT 4
- FastDoubleParser accepts illegal inputs "." and ".e2" HOT 1
- FastDoubleParser doesn't support all input formats as the default OpenJDK Float/Double parsers HOT 10
- The parser throws StringIndexOutOfBoundsException/ArrayIndexOutOfBoundsException for some inputs HOT 6
- BigDecimal parser HOT 12
- Publish a multi-release JAR HOT 2
- Parsing of hexadecimal floating point numbers is broken in release 0.5.0 HOT 1
- Publish 0.5.2 to maven central HOT 2
- BigInteger parser HOT 6
- Builds should be reproducible HOT 1
- Is this a mistake with hex float parsing? HOT 1
- Incorrect maven command sequence HOT 6
- Please bundle LICENSE/NOTICE files in the produced jar files HOT 49
- Implement faster slow path for double parser (JDK 21)
- Document which code signing keys will be used for published artifacts. HOT 2
- issue with module-info classes in v0.9.0 release HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from fastdoubleparser.