Comments (10)

patdunlavey commented on June 16, 2024

Is there a reason it would not make sense to modify the line to use "include-exts=..." where we only list extensions for a few basic mimetypes?
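
To make the suggestion concrete, here is a minimal Python sketch of swapping an exclusion list for an inclusion list on the Tika tool entry. It assumes the FITS tool element accepts an `include-exts` attribute, as the comment above implies. The sample XML, the class names in it, and the extension list are illustrative stand-ins, not the contents of the real `fits.xml` shipped with ISLE.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a FITS tool list. The real fits.xml
# in an ISLE image may differ in layout, class names, and attribute values.
SAMPLE = """<fits_configuration>
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool" exclude-exts="jar,avi" />
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt" />
  </tools>
</fits_configuration>"""


def restrict_tool(xml_text, class_suffix, extensions):
    """Replace exclude-exts with include-exts on tools whose class ends with class_suffix."""
    root = ET.fromstring(xml_text)
    for tool in root.iter("tool"):
        if tool.get("class", "").endswith(class_suffix):
            # Drop the exclusion list and allow only the listed extensions.
            tool.attrib.pop("exclude-exts", None)
            tool.set("include-exts", ",".join(extensions))
    return ET.tostring(root, encoding="unicode")


# The extension list here is a placeholder for "a few basic mimetypes".
updated = restrict_tool(SAMPLE, "TikaTool", ["pdf", "doc", "docx"])
print(updated)
```

Only the matching tool entry is changed; other tools keep their existing attributes.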

from isle.

DiegoPino commented on June 16, 2024

True, that would also help and could be a solution. I was just thinking about saving the day (I filled 500 GB in a day without this change).

g7morris commented on June 16, 2024

#206 is a duplicate of this original ticket. It has more server-side information, but the solution proposed there is not as good as this one.

shauntru commented on June 16, 2024

I finally had a chance to review what happens when you turn off Tika during derivative generation. I then compared the resultant TECHMD against the same file with Tika turned on during derivative generation. Not surprisingly, there are no differences for filetypes other than PDF. In the case of PDF, the only difference I saw between Tika being on/off is a single line declaring the version of Tika being used. With that confirmation (that Tika isn't doing much of anything), I recommend commenting out the Tika tool line, as Diego originally suggested. I'm going to put in a pull request for this.
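
A comparison like the one described above can be sketched with Python's standard `difflib`. This is only an illustration of the approach, not the procedure actually used; the two TECHMD snippets are made-up samples and the function name is hypothetical.

```python
import difflib


def techmd_diff(with_tika, without_tika):
    """Return unified-diff lines between two TECHMD XML strings."""
    return list(
        difflib.unified_diff(
            with_tika.splitlines(),
            without_tika.splitlines(),
            fromfile="tika_on",
            tofile="tika_off",
            lineterm="",
        )
    )


# Made-up sample datastreams; real TECHMD output is much longer.
tika_on = '<fits>\n  <tool name="Tika" version="x.y"/>\n  <size>1024</size>\n</fits>'
tika_off = "<fits>\n  <size>1024</size>\n</fits>"

# Keep only added/removed lines, skipping the "---"/"+++" file headers.
changes = [
    line
    for line in techmd_diff(tika_on, tika_off)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(changes)
```

Running this over each pair of files from a test ingest would show at a glance whether anything beyond a tool-version line changes.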

g7morris commented on June 16, 2024

@shauntru Even for OCR generation? I'm surprised, as this issue occurs a lot for me when ingesting files, e.g. newspapers and books (using the Internet Archive book viewer). So I think TIKA is doing a lot, frankly; otherwise, where are the server-killer-sized fragments coming from?

shauntru commented on June 16, 2024

@g7morris I dug into this a little bit more and it looks like TIKA gives some info about the tool used to generate OCR, and possibly some metadata about the OCR job entered within the program that generated the OCR, e.g. the title of the periodical being OCR'ed, but not much else that I've seen. I've tested against a couple dozen files. My point is that I don't see much value added by the data that TIKA does create. I have no idea what is inside those fragments.

shauntru commented on June 16, 2024

@g7morris A question for you: have there been improvements to /tmp garbage collection for ISLE? In the course of testing the fix for this issue, I've found that the /tmp folder is trashing TIKA fragments on its own. While running IMI and monitoring the /tmp folder, I'm seeing the fragments appear and then promptly go away. By the end of the ingest they are all gone. So my question is: has something changed, and if so, is this fix still necessary?

g7morris commented on June 16, 2024

@shauntru I think the fix is still necessary. I've had to set up cron jobs on other ISLE setups to zap all involved /tmp directories, so I'm not seeing what you're describing. Perhaps it happens with small batches? How about if the ingest runs over 1-2 days? ;) That is typically where I see the trouble occur.

From my experience, it appears that the fragments come from the OCR process, specifically objects or derived text areas that TIKA / OCR had "trouble" with, and the fragment itself "points out" that area. I think the fragments are ultimately created as a helpful thing for a human to review and correct. In this context, with a large ingest or ingests over time, this "nice thing" becomes a server / disk killer in disguise. If someone needs it, they can read how to turn it on, right?
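
For anyone who wants the cron-style stopgap mentioned above, a minimal Python sketch follows. The function name, the default glob pattern, and the age threshold are assumptions for illustration only; check what the fragments are actually called on your server before pointing anything like this at /tmp.

```python
import time
from pathlib import Path


def remove_stale_fragments(tmp_dir, pattern="apache-tika-*.tmp",
                           max_age_hours=6.0, dry_run=True):
    """Delete files in tmp_dir matching pattern that are older than max_age_hours.

    The default pattern is an assumed fragment name, not one confirmed in this
    thread. With dry_run=True nothing is deleted; the matching paths are only
    returned so they can be reviewed first.
    """
    cutoff = time.time() - max_age_hours * 3600
    removed = []
    for path in Path(tmp_dir).glob(pattern):
        try:
            if path.is_file() and path.stat().st_mtime < cutoff:
                if not dry_run:
                    path.unlink()
                removed.append(path)
        except FileNotFoundError:
            # Another process cleaned the file up between listing and deletion.
            continue
    return removed
```

Calling `remove_stale_fragments("/tmp")` first as a dry run lists what would go; passing `dry_run=False` from a scheduled job then does the actual cleanup. The age threshold keeps the job from deleting fragments that an ingest in progress may still be using.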

shauntru commented on June 16, 2024

@g7morris That's right! You had explained the contents of the TIKA fragments to me earlier.
And I kind of suspected that the cleanup I was seeing was due to the small size ingest. Thanks for the sanity check. Will make a PR asap.

bseeger commented on June 16, 2024

Should be solved by Islandora-Collaboration-Group/isle-apache#5
