Comments (10)
Is there a reason it would not make sense to modify the line to use "include-exts=..." where we only list extensions for a few basic mimetypes?
from isle.
True, that would also be a solution. I was just thinking about saving the day (I filled 500 GB in a day without this change).
#206 is a duplicate of this original ticket. It does have more server side information but the solution proposed within it is not as good as this.
I finally had a chance to review what happens when you turn off Tika during derivative generation. I then compared the resultant TECHMD against the same file with Tika turned on during derivative generation. Not surprisingly, there are no differences for filetypes other than PDF. In the case of PDF, the only difference I saw between Tika being on/off is a single line declaring the version of Tika being used. With that confirmation (that Tika isn't doing much of anything), I recommend commenting out the Tika tool line, as Diego originally suggested. I'm going to put in a pull request for this.
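The on/off comparison described above could be repeated with a small script. This is only a sketch, assuming the two TECHMD datastreams have already been exported as text; the function name `techmd_diff` and the sample XML are made-up placeholders, not real TECHMD output.

```python
import difflib


def techmd_diff(tika_off: str, tika_on: str) -> list[str]:
    """Return only the lines that were added or removed between two documents."""
    delta = difflib.ndiff(tika_off.splitlines(), tika_on.splitlines())
    # ndiff prefixes added lines with "+ " and removed lines with "- ";
    # unchanged lines ("  ") and hint lines ("? ") are dropped here.
    return [line for line in delta if line.startswith(("+ ", "- "))]


# Placeholder documents standing in for TECHMD generated with Tika off and on.
SAMPLE_OFF = '<techmd>\n  <identity format="PDF"/>\n</techmd>'
SAMPLE_ON = (
    '<techmd>\n'
    '  <identity format="PDF"/>\n'
    '  <tool name="Tika" version="x.y"/>\n'
    '</techmd>'
)

if __name__ == "__main__":
    for line in techmd_diff(SAMPLE_OFF, SAMPLE_ON):
        print(line)
```

An empty result would mean the two datastreams are identical, which is what was reported above for filetypes other than PDF.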
@shauntru Even for OCR generation? I'm surprised, as this issue occurs a lot for me when ingesting files, e.g. newspapers and books (using the Internet Archive book viewer). So I think TIKA is doing a lot, frankly; otherwise, where are the server-killer-sized fragments coming from?
@g7morris I dug into this a little bit more and it looks like TIKA gives some info about the tool used to generate OCR, and possibly some metadata about the OCR job entered within the program that generated the OCR, e.g. the title of the periodical being OCR'ed, but not much else that I've seen. I've tested against a couple dozen files. My point is that I don't see much value added by the data that TIKA does create. I have no idea what is inside those fragments.
@g7morris A question for you: have there been improvements to /tmp garbage collection for ISLE? While testing the fix for this issue, I've found that TIKA fragments in the /tmp folder are being cleaned up on their own. While running IMI and monitoring the /tmp folder, I see the fragments appear and then promptly go away. By the end of the ingest they are all gone. So my question is: has something changed, and if so, is this fix still necessary?
@shauntru I think the fix is still necessary. I've had to set up cron jobs on other ISLE setups to zap all involved /tmp directories, so I'm not seeing what you're describing. Perhaps it only happens with small batches? How about when the ingest runs over 1-2 days? ;) That is typically where I see the trouble occur.
From my experience, the fragments appear to come from the OCR process, specifically objects / derived text areas that TIKA / OCR had "trouble" with; the fragment itself "points out" the problem area. I think the fragments are ultimately created as a nice, helpful thing for a human to review and correct. In this context, with a large ingest or ingests over time, this "nice thing" becomes a server / disk killer in disguise. If someone needs it, they can read how to turn it on, right?
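For anyone who needs a stopgap until the fix lands, the cron-based cleanup mentioned above might look roughly like the sketch below. It is not the actual job used on those ISLE setups: the function name, the `example-fragment-*` filename pattern, and the one-hour age threshold are all assumptions, so check what the fragments are really called in your /tmp before using anything like it.

```python
import time
from pathlib import Path


def remove_stale_files(directory, pattern="example-fragment-*",
                       max_age_seconds=3600, now=None):
    """Delete regular files matching `pattern` that are older than the threshold.

    `pattern` is a hypothetical placeholder; the real fragment names are
    not given in this thread. Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).glob(pattern):
        try:
            # Only touch plain files, and only ones not modified recently,
            # so files belonging to an ingest in progress are left alone.
            if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
                path.unlink()
                removed.append(path)
        except FileNotFoundError:
            # Another process removed the file first; nothing left to do.
            continue
    return removed


if __name__ == "__main__":
    for stale in remove_stale_files("/tmp"):
        print(f"removed {stale}")
```

The age threshold is there so a long-running ingest does not lose files it is still writing; a value that suits a 1-2 day ingest would need to be worked out by testing.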
@g7morris That's right! You had explained the contents of the TIKA fragments to me earlier.
And I kind of suspected that the cleanup I was seeing was due to the small size ingest. Thanks for the sanity check. Will make a PR asap.
Should be solved by Islandora-Collaboration-Group/isle-apache#5