Giter Club home page Giter Club logo

Comments (21)

gotson avatar gotson commented on May 30, 2024 2

Yeah I know, I'm mostly talking to myself.

from junrar.

tballison avatar tballison commented on May 30, 2024 1

I am curious on how you use that exactly in Tika. Would you be able to explain high level what kind of tests you run on those ? I'm also interested in how you can setup the CI to test on all that, given the massive file size?

Y, our corpus is way too large for CI, and given the sources of the data, we're not going to put those into our official ASF repo.

Before we make a release, we manually run Tika against the corpus or a 1 million file sample and then compare the results with the previous version of Tika. We also use the corpus for development if we're working on a new parser or looking for issues in a specific parser. We use tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval) in compare mode to do the comparisons. On a million files, it takes about 4 hours to parse and an hour to run the comparisons on a decently sized server.

from junrar.

tballison avatar tballison commented on May 30, 2024 1

@tballison while trying to rebase your PR, i managed to find a way to prevent the infinite loops for the files 2 and 3 without having to change the RAR VM code.

I just pushed the changes, and also published a new 7.5.3-SNAPSHOT.

Do you think you could rerun that on the corpus ? I really want to have some ways to do that myself using the rars you provided, but i didn't had time yet to work out something (even on my local machine).

Happy to do so. I may not have time today, but I'll see what I can do.

from junrar.

gotson avatar gotson commented on May 30, 2024

Had a quick look, and it seems junrar doesn't perform CRC checks on headers, despite headers storing CRC.

When running the provided files through unrar cli, it actually detects file corruption, and does not proceed to extracting anything:

Main archive header is corrupt
Unexpected end of archive

Testing archive loop1.rar

Unexpected end of archive
No files to extract

I have added CRC check for the end of archive header, and that takes care of loop1.rar. Haven't had time to check on the other files, or how to implement CRC checks for other header types.

from junrar.

gotson avatar gotson commented on May 30, 2024

loop2.rar doesn't have an End of Archive header. I am no rar expert, but i would say that means the archive is corrupt.

from junrar.

Han0nly avatar Han0nly commented on May 30, 2024

Hi @gotson , these files we provided are indeed corrupted RAR files. We use fuzzing to iteratively mutate some valid RAR files to test the junrar. Though it is not a valid RAR file, it does make the applications using junrar vulnerable to Denial of Service attack.

from junrar.

tballison avatar tballison commented on May 30, 2024

The fix proposed in #93 fixes the infinite loop in the three files attached. This is my first time working with your codebase, please forgive failure to follow junrar coding practice etc. I'm happy to learn.

We recently had a report of infinite loops in junrar here: https://issues.apache.org/jira/browse/TIKA-3798
Without the triggering file on that issue, we cannot know if this will fix that infinite loop, but we may as well fix the ones here.

Thank you!

from junrar.

tballison avatar tballison commented on May 30, 2024

I tgz'd Apache Tika's regression corpus 7GB of rars here: https://corpora.tika.apache.org/base/share/rars.tgz if you have any interest in running on a broader set of docs. I'm currently running our fuzzers against those as seeds.

from junrar.

gotson avatar gotson commented on May 30, 2024

I tgz'd Apache Tika's regression corpus 7GB of rars here: https://corpora.tika.apache.org/base/share/rars.tgz if you have any interest in running on a broader set of docs. I'm currently running our fuzzers against those as seeds.

I am curious on how you use that exactly in Tika. Would you be able to explain high level what kind of tests you run on those ? I'm also interested in how you can setup the CI to test on all that, given the massive file size?

from junrar.

gotson avatar gotson commented on May 30, 2024

@tballison i took some time to review the unmerged changes I did for this issue, i just pushed a commit with some checks on header validity.

I have published a 7.5.3-SNAPSHOT also, not sure if that's something you can use to see if that addresses TIKA-3798.

from junrar.

tballison avatar tballison commented on May 30, 2024

@tballison i took some time to review the unmerged changes I did for this issue, i just pushed a commit with some checks on header validity.

I have published a 7.5.3-SNAPSHOT also, not sure if that's something you can use to see if that addresses TIKA-3798.

Awesome thank you. We don't yet have a test file for TIKA-3798, but I can integrate the new version and run our fuzzing code on it.

As a side note, do you know if IP is only allowed to progress or is it random access?

from junrar.

gotson avatar gotson commented on May 30, 2024

As a side note, do you know if IP is only allowed to progress or is it random access?

I have no idea. That code was ported from the C unrar code more than 10 years ago. I only maintain the library, the low level code is entirely cryptic to me.

I checked the current unrar code and this doesn't exist anymore, it changed completely.

I will try to make use of your corpus, that could be a good addition to our small test suite. If I manage to do so, I'll try your change and compare the results.

from junrar.

tballison avatar tballison commented on May 30, 2024

I pulled from master and rebuilt. I'm still getting infinite loops on https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop2.rar and https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop3.rar These are just renamed versions of two of the files in the zip attached to this issue.

Then I added:

        if (ip < IP) {
            System.out.println("crash less than");
            System.exit(3);
        }
        if (ip == IP) {
            System.out.println("crash equal");
            System.exit(4);
        }

to test how often this happens on the full corpus of ~12000 files plus the test files on this project. Those crashes are only triggered on the two broken files mentioned above. This suggests that it would be safe to throw an exception if ip <= IP.

from junrar.

tballison avatar tballison commented on May 30, 2024

I will try to make use of your corpus, that could be a good addition to our small test suite

Given the murkiness of licensing, I would encourage you not to add those to this repo. Definitely use the files privately, but they really are just such a huge heap for the project's repo, and licenses/trademark/copyright headaches...

from junrar.

tballison avatar tballison commented on May 30, 2024

I ran the full tika-eval process with the change I proposed and the latest release. There were no differences except on the three files attached here.

from junrar.

tballison avatar tballison commented on May 30, 2024

I'm very hesitant to change the logic of code that we don't understand, but requiring that IP progress seems to be ok at least on these test files and prevents the attached infinite loop files.

from junrar.

gotson avatar gotson commented on May 30, 2024

I'm still getting infinite loops on https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop2.rar and https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop3.rar These are just renamed versions of two of the files in the zip attached to this issue.

Yes, that's normal. My initial fix which wasn't merged only fixed the first file, not the other 2. I still wanted that fix merge, because it adds some controls at a higher level in the code, by doing CRC checks on headers.

to test how often this happens on the full corpus of ~12000 files plus the test files on this project. Those crashes are only triggered on the two broken files mentioned above. This suggests that it would be safe to throw an exception if ip <= IP

I ran the full tika-eval process with the change I proposed and the latest release. There were no differences except on the three files attached here.

I'm very hesitant to change the logic of code that we don't understand, but requiring that IP progress seems to be ok at least on these test files and prevents the attached infinite loop files.

Thanks a lot for doing that ! I wanted to do something along those lines, but you beat me to it! That is very valuable information, and while i agree with you that doing changes in the lower level of the RAR virtual machine is tricky, your tests shows that we can be quite confident in doing so.

from junrar.

gotson avatar gotson commented on May 30, 2024

@tballison while trying to rebase your PR, i managed to find a way to prevent the infinite loops for the files 2 and 3 without having to change the RAR VM code.

I just pushed the changes, and also published a new 7.5.3-SNAPSHOT.

Do you think you could rerun that on the corpus ? I really want to have some ways to do that myself using the rars you provided, but i didn't had time yet to work out something (even on my local machine).

from junrar.

tballison avatar tballison commented on May 30, 2024

I'm still getting infinite loops on https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop2.rar and https://github.com/tballison/junrar/blob/issue81/src/test/resources/com/github/junrar/bugfixes/gh81-infinite-loop3.rar These are just renamed versions of two of the files in the zip attached to this issue.

Yes, that's normal. My initial fix which wasn't merged only fixed the first file, not the other 2. I still wanted that fix merge, because it adds some controls at a higher level in the code, by doing CRC checks on headers.

Sorry, of course. I didn't read your earlier point closely enough. Sorry.

from junrar.

gotson avatar gotson commented on May 30, 2024

@tballison while trying to rebase your PR, i managed to find a way to prevent the infinite loops for the files 2 and 3 without having to change the RAR VM code.
I just pushed the changes, and also published a new 7.5.3-SNAPSHOT.
Do you think you could rerun that on the corpus ? I really want to have some ways to do that myself using the rars you provided, but i didn't had time yet to work out something (even on my local machine).

Happy to do so. I may not have time today, but I'll see what I can do.

No rush at all, it's already very kind of you to do so!

from junrar.

gotson avatar gotson commented on May 30, 2024

Finally had the time to implement regression testing using the corpus provided by @tballison. We now have regression tests on 12k+ files, and amazingly running under a minute on my Macbook!

I'm closing this issue, as it's resolved in 7.5.3.

from junrar.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.