Comments (2)
A workaround for this is to download and install Python 2.7.8; I was able to get Tika working with 2.7.8. After installing, I ran pip install --upgrade --force tika, which should install Tika 1.9.3. After doing so, double-check that there aren't any corrupt temporary copies of the tika server left over from previous downloads over HTTPS that failed because of 2.7.9. I checked in:
c:\users\chrisa~1\appdata\local\temp\
which is where tika-server is downloaded to. Check the jar file; it is likely corrupt. Delete it if necessary (which will cause Python to re-download it for you).
Then try, e.g., parser.from_file again:
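One way to check the jar without opening it by hand: a jar is a zip archive, so the standard zipfile module can tell whether the download is intact. This is a minimal sketch, not part of tika-python; the helper names jar_looks_valid and remove_if_corrupt are made up for illustration, and the jar location is assumed to be the system temp directory, as in the path above.

```python
import os
import zipfile


def jar_looks_valid(path):
    """Return True if path exists and is a readable zip archive (a jar is a zip)."""
    if not os.path.isfile(path) or not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as jar:
            # testzip() returns the name of the first corrupt member, or None.
            return jar.testzip() is None
    except (zipfile.BadZipfile, IOError):
        return False


def remove_if_corrupt(path):
    """Delete path if it exists but is not a valid jar; return True if deleted."""
    if os.path.isfile(path) and not jar_looks_valid(path):
        os.remove(path)
        return True
    return False
```

For example, remove_if_corrupt(os.path.join(tempfile.gettempdir(), 'tika-server.jar')) (after import tempfile) should delete a truncated download and leave a good jar alone.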
>>> parser.from_file('C:\Users\Chris A Mattmann\Desktop\BalanceSheet.pdf')
tika.py: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar to c:\users\chrisa~1\appdata\local\temp\tika-server.jar.
{'content': None, 'metadata': {u'access_permission:can_modify': [u'true'], u'access_permission:extract_content': [u'true'], u'access_permission:assemble_document': [u'true'], u'access_permission:extract_for_accessibility': [u'true'], u'access_permission:fill_in_form': [u'true'], u'pdf:encrypted': [u'false'], u'access_permission:can_print': [u'true'], u'dc:format': [u'application/pdf; version=1.3'], u'access_permission:can_print_degraded': [u'true'], u'access_permission:modify_annotations': [u'true'], u'pdf:PDFVersion': [u'1.3'], u'X-TIKA:parse_time_millis': [u'886'], u'xmpTPg:NPages': [u'0'], u'resourceName': [u'BalanceSheet.pdf'], u'Content-Type': [u'application/pdf'], u'X-Parsed-By': [[u'org.apache.tika.parser.DefaultParser', u'org.apache.tika.parser.pdf.PDFParser']]}}
>>>
Works fine. I will keep trying to find a more permanent fix than installing Python 2.7.8.
from tika-python.
OK, confirmed that I can monkey-patch this in the latest version. I am going to add a check inside the function that downloads the tika jar: it will catch an IOError and, if one occurs, do:
>>> import ssl
>>> if hasattr(ssl, '_create_unverified_context'):
...     ssl._create_default_https_context = ssl._create_unverified_context
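Put together, the check described above might look roughly like this. It is only a sketch of the idea, not the code that went into tika-python: download_with_fallback and its fetch parameter are hypothetical names, and fetch defaults to the standard library's urlretrieve. Note that the patch turns off HTTPS certificate verification for the whole process, not just for this download.

```python
import ssl

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2


def download_with_fallback(url, dest, fetch=urlretrieve):
    """Try a normal download; on IOError, disable HTTPS verification and retry once.

    `fetch` is a hypothetical injection point so the retry logic can be
    exercised without network access.
    """
    try:
        return fetch(url, dest)
    except IOError:
        # Only Pythons that verify certificates by default (2.7.9+) have this hook;
        # on older versions there is nothing to patch, so re-raise.
        if not hasattr(ssl, '_create_unverified_context'):
            raise
        # WARNING: this disables certificate verification process-wide.
        ssl._create_default_https_context = ssl._create_unverified_context
        return fetch(url, dest)
```

The retry happens at most once, so a failure unrelated to certificates still surfaces as the original kind of error on the second attempt.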