Comments (21)
Yeah I know, I'm mostly talking to myself.
from junrar.
I am curious on how you use that exactly in Tika. Would you be able to explain high level what kind of tests you run on those ? I'm also interested in how you can setup the CI to test on all that, given the massive file size?
Y, our corpus is way too large for CI, and given the sources of the data, we're not going to put those into our official ASF repo.
Before we make a release, we manually run Tika against the corpus or a 1 million file sample and then compare the results with the previous version of Tika. We also use the corpus for development if we're working on a new parser or looking for issues in a specific parser. We use tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval) in compare
mode to do the comparisons. On a million files, it takes about 4 hours to parse and an hour to run the comparisons on a decently sized server.
from junrar.
@tballison while trying to rebase your PR, i managed to find a way to prevent the infinite loops for the files 2 and 3 without having to change the RAR VM code.
I just pushed the changes, and also published a new
7.5.3-SNAPSHOT
.Do you think you could rerun that on the corpus ? I really want to have some ways to do that myself using the rars you provided, but i didn't had time yet to work out something (even on my local machine).
Happy to do so. I may not have time today, but I'll see what I can do.
from junrar.
Had a quick look, and it seems junrar doesn't perform CRC checks on headers, despite headers storing CRC.
When running the provided files through unrar
cli, it actually detects file corruption, and does not proceed to extracting anything:
Main archive header is corrupt
Unexpected end of archive
Testing archive loop1.rar
Unexpected end of archive
No files to extract
I have added CRC check for the end of archive header, and that takes care of loop1.rar
. Haven't had time to check on the other files, or how to implement CRC checks for other header types.
from junrar.
loop2.rar
doesn't have an End of Archive header. I am no rar expert, but i would say that means the archive is corrupt.
from junrar.
Hi @gotson , these files we provided are indeed corrupted RAR files. We use fuzzing to iteratively mutate some valid RAR files to test the junrar. Though it is not a valid RAR file, it does make the applications using junrar vulnerable to Denial of Service attack.
from junrar.
The fix proposed in #93 fixes the infinite loop in the three files attached. This is my first time working with your codebase, please forgive failure to follow junrar coding practice etc. I'm happy to learn.
We recently had a report of infinite loops in junrar here: https://issues.apache.org/jira/browse/TIKA-3798
Without the triggering file on that issue, we cannot know if this will fix that infinite loop, but we may as well fix the ones here.
Thank you!
from junrar.
I tgz'd Apache Tika's regression corpus 7GB of rars here: https://corpora.tika.apache.org/base/share/rars.tgz if you have any interest in running on a broader set of docs. I'm currently running our fuzzers against those as seeds.
from junrar.
I tgz'd Apache Tika's regression corpus 7GB of rars here: https://corpora.tika.apache.org/base/share/rars.tgz if you have any interest in running on a broader set of docs. I'm currently running our fuzzers against those as seeds.
I am curious on how you use that exactly in Tika. Would you be able to explain high level what kind of tests you run on those ? I'm also interested in how you can setup the CI to test on all that, given the massive file size?
from junrar.
@tballison i took some time to review the unmerged changes I did for this issue, i just pushed a commit with some checks on header validity.
I have published a 7.5.3-SNAPSHOT also, not sure if that's something you can use to see if that addresses TIKA-3798.
from junrar.
@tballison i took some time to review the unmerged changes I did for this issue, i just pushed a commit with some checks on header validity.
I have published a 7.5.3-SNAPSHOT also, not sure if that's something you can use to see if that addresses TIKA-3798.
Awesome thank you. We don't yet have a test file for TIKA-3798, but I can integrate the new version and run our fuzzing code on it.
As a side note, do you know if IP
is only allowed to progress or is it random access?
from junrar.
As a side note, do you know if
IP
is only allowed to progress or is it random access?
I have no idea. That code was ported from the C unrar code more than 10 years ago. I only maintain the library, the low level code is entirely cryptic to me.
I checked the current unrar code and this doesn't exist anymore, it changed completely.
I will try to make use of your corpus, that could be a good addition to our small test suite. If I manage to do so, I'll try your change and compare the results.
from junrar.
I pulled from master and rebuilt. I'm still getting infinite loops on https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop2.rar and https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop3.rar These are just renamed versions of two of the files in the zip attached to this issue.
Then I added:
if (ip < IP) {
System.out.println("crash less than");
System.exit(3);
}
if (ip == IP) {
System.out.println("crash equal");
System.exit(4);
}
to test how often this happens on the full corpus of ~12000 files plus the test files on this project. Those crashes are only triggered on the two broken files mentioned above. This suggests that it would be safe to throw an exception if ip <= IP
.
from junrar.
I will try to make use of your corpus, that could be a good addition to our small test suite
Given the murkiness of licensing, I would encourage you not to add those to this repo. Definitely use the files privately, but they really are just such a huge heap for the project's repo, and licenses/trademark/copyright headaches...
from junrar.
I ran the full tika-eval process with the change I proposed and the latest release. There were no differences except on the three files attached here.
from junrar.
I'm very hesitant to change the logic of code that we don't understand, but requiring that IP progress seems to be ok at least on these test files and prevents the attached infinite loop files.
from junrar.
I'm still getting infinite loops on https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop2.rar and https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop3.rar These are just renamed versions of two of the files in the zip attached to this issue.
Yes, that's normal. My initial fix which wasn't merged only fixed the first file, not the other 2. I still wanted that fix merge, because it adds some controls at a higher level in the code, by doing CRC checks on headers.
to test how often this happens on the full corpus of ~12000 files plus the test files on this project. Those crashes are only triggered on the two broken files mentioned above. This suggests that it would be safe to throw an exception if
ip <= IP
I ran the full tika-eval process with the change I proposed and the latest release. There were no differences except on the three files attached here.
I'm very hesitant to change the logic of code that we don't understand, but requiring that IP progress seems to be ok at least on these test files and prevents the attached infinite loop files.
Thanks a lot for doing that ! I wanted to do something along those lines, but you beat me to it! That is very valuable information, and while i agree with you that doing changes in the lower level of the RAR virtual machine is tricky, your tests shows that we can be quite confident in doing so.
from junrar.
@tballison while trying to rebase your PR, i managed to find a way to prevent the infinite loops for the files 2 and 3 without having to change the RAR VM code.
I just pushed the changes, and also published a new 7.5.3-SNAPSHOT
.
Do you think you could rerun that on the corpus ? I really want to have some ways to do that myself using the rars you provided, but i didn't had time yet to work out something (even on my local machine).
from junrar.
I'm still getting infinite loops on https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop2.rar and https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop3.rar These are just renamed versions of two of the files in the zip attached to this issue.
Yes, that's normal. My initial fix which wasn't merged only fixed the first file, not the other 2. I still wanted that fix merge, because it adds some controls at a higher level in the code, by doing CRC checks on headers.
Sorry, of course. I didn't read your earlier point closely enough. Sorry.
from junrar.
@tballison while trying to rebase your PR, i managed to find a way to prevent the infinite loops for the files 2 and 3 without having to change the RAR VM code.
I just pushed the changes, and also published a new7.5.3-SNAPSHOT
.
Do you think you could rerun that on the corpus ? I really want to have some ways to do that myself using the rars you provided, but i didn't had time yet to work out something (even on my local machine).Happy to do so. I may not have time today, but I'll see what I can do.
No rush at all, it's already very kind of you to do so!
from junrar.
Finally had the time to implement regression testing using the corpus provided by @tballison. We now have regression tests on 12k+ files, and amazingly running under a minute on my Macbook!
I'm closing this issue, as it's resolved in 7.5.3
.
from junrar.
Related Issues (20)
- Get a specific inputstream from a rar containing multiple items. HOT 1
- Error unpacking large file(>2G) HOT 10
- Could the junrar project benefit from free security help?
- [BUG] Special characters are not recognized anymore HOT 8
- How to disable logging in System.err? HOT 2
- FileHeader times change depending on the current timezone HOT 9
- FileTime.fromMillis(getDateDos(fileTime)) is not working below android API 26 HOT 2
- [BUG]com.github.junrar.exception.UnsupportedRarV5Exception
- CrcErrorException HOT 4
- [BUG] Compress HOT 1
- [BUG]A carefully crafted RAR archive can trigger an infinite loop while parsing. HOT 10
- com.github.junrar.exception.CrcErrorException during extracting of archive HOT 2
- [BUG] NullPointerException when extracting solid archives HOT 6
- [BUG]com.github.junrar.exception.CorruptHeaderException: null HOT 2
- [BUG] Exception swallowed during initialization -- silently missing data we used to get in 7.4.1 HOT 11
- [BUG] "IllegalArgumentException: Pipe Size <= 0" with an empty file in the archive HOT 9
- Integration with Google oss-fuzz fuzzing service HOT 2
- Hrlp me HOT 1
- [BUG] Pom License link leads to 404 HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from junrar.