kevm / tikaondotnet

Use the Java Tika text extraction library on the .NET platform

Home Page: http://kevm.github.io/tikaondotnet/

License: Apache License 2.0

tikaondotnet's Introduction

Tika on .NET

This project is a simple wrapper around the excellent and robust Tika text extraction Java library. This project produces two NuGet packages:

TikaOnDotNet - A straight IKVM-hosted port of the Java Tika project.

TikaOnDotNet.TextExtractor - Use Tika to extract text from rich documents.

Getting Started

The best way to get started is to:

Add a NuGet dependency to TikaOnDotNet.TextExtractor.
Instantiate a new TextExtractor object and call one of the Extract methods.

Usage

// using TikaOnDotNet.TextExtraction;

var textExtractor = new TextExtractor();

var wordDocContents = textExtractor.Extract(@".\path\to\my favorite word.docx");
var webPageContents = textExtractor.Extract(new Uri("https://google.com"));

Take a look at our tests for more usage examples.

How To Contribute

Have an idea to make this project better? Great! Start out by taking a look at our Contributing Guide.

Having A Problem?

Search the Issues first, as your problem may be a common one. If you don't find your problem, please create an issue. Contributors here will chime in when they can.

tikaondotnet's Issues

Remove hard coded tika version in automation.

Oops #build.fsx

[IKVMcTask(tikaDir.``tika-app-1.12.jar``, "tika-app.dll", Version=release.AssemblyVersion)]

Looks like I forgot to automate one spot. The FileSystemTypeProvider in use does seem to support wildcards. I need to learn better how to find files in FAKE outside the typical guidance.

Something like this, where the result of the file glob is piped in as the first argument of the IKVMcTask:

Target "CompileTikaLib" (fun _ ->
    !! "paket-files/www-us.apache.org/tika-app-*.jar"
    |> IKVMcTask "tika-app.dll", Version=release.AssemblyVersion
    |> IKVMCompile tikaLibDir
)

Rework documentation for new user and developer onboarding experience

Eric Anderson and I were chatting about improving the experience for new users and developers. The following is his advice.

I prefer that README.md is simply a Table of Contents, with light descriptions pointing to more in-depth files. So, in your case it might be something like:

Getting Started

New to the project? MyBadAssProject can be installed quickly and easily with Nuget. Check out the GettingStarted.md for a quick guide to get you up and running.

Contributing

Want to contribute? All the details you need to get started helping out are in Contributing.md

This is the structure I use on client projects at this point. I try to make sure I cover the following topics:

Brief description of the project

How to use it

How to contribute

Any special topics that should be front-page.

In my case, I have a special topic on my current project regarding database structure / migrations.

You should probably have a pointer back to the original Java project documentation

Update Tika to version 1.7

I noticed that Tika 1.7 was released so time to upgrade!

TextExtractor.Extract fails when run asynchronously

When using the Extract Method in an async Task, the method fails with the inner exception message "Thread was being aborted". The stacktrace shows

   at java.io.File..ctor(String pathname)
   at TikaOnDotNet.TextExtractor.Extract(String filePath)

The method then throws a normal TextExtractionException with the message "Extraction of text from the file '{MyFilename}' failed."

When running the same code synchronously, it works perfectly. However, I'd like to do the extraction in a background thread.

Error in IKVM PackageListAttribute constructor when calling AutoDetectParser

Since the latest update through nuget for TikaOnDotNet and IKVM, I have started getting errors. I also get the same error when running the Test Project included with TikaOnDotNet.
I was wondering if you have run into this problem.

After AutoDetectParser is called, I was getting this error:
InnerException {"Method not found: 'Void IKVM.Attributes.PackageListAttribute..ctor(System.String[][])'."} System.Exception {System.MissingMethodException}

I tried with both the 7.x and 8.x versions of IKVM.

I grabbed the IKVMC compiler and compiled the latest tika-app jar file which gives a slightly different constructor error:
InnerException {"Method not found: 'Void IKVM.Attributes.PackageListAttribute..ctor(System.String[])'."} {System.MissingMethodException}

I was wondering if you had any idea on where I should head on this one. I think I’ll pull the IKVM source and consider adding a matching constructor, but that seems like I’m just stabbing around in the dark and could be missing something easy.

I am having the same issue, but built it outside of visual studio 2012.

I also get the 'HEAD' error. You say I am supposed to have 'git' installed? Where do I get 'git'? I don't see any place to download 'git' for .NET. There are 5 or 6 .NET add-ins that reference 'git'. Which one do I use? I have never used git before, except to download zips and use things. I really would like to also use this in a project, but already had an issue with:

The build is referencing a jar file that is x.12 and it receives a 404 error. The jar file is actually x.13. I could fix that in the build file and continue. But I cannot get past the 'git' issue. Please help!

Version ?

Is this library's version linked with the Tika version somehow? If not, it would be nice if it were.

Add appveyor.yml

Secure all environment secrets
Add appveyor.yml
Ensure it works. The last time you tried it we got warnings building tika-app.dll

Update project automation to consume nugets (ripple?)

There is an IKVM nuget

Some OpenJDK assemblies were not copied?

https://sourceforge.net/p/ikvm/bugs/296/

Files are not found for some reason in Extractor class

Split Nuget into TikaOnDotNet helper assembly and tika-app Nugets

Not everyone needs to take a dependency on TikaOnDotNet, which is really just an example of doing general text extraction with tika-app, which is ported over from Java via ikvmc.

"The decryption operation failed" during install-package

Created a command line project.
Install-Package TikaOnDotNet

install-package : The decryption operation failed, see inner exception.
At line:1 char:1

install-package TikaOnDotNet

- CategoryInfo          : NotSpecified: (:) [Install-Package], Exception
- FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PackageManagement.PowerShellCmdlets.InstallPackageCommand

Fix nuget metadata for projectUrl

Verify there is nothing more that I missed.

Build fails when source was downloaded from GitHub and not cloned via Git

Git is only used to inject the current commit's SHA into the generated assembly's metadata. Just an accounting technique if someone forgets to rev the version number.

We should really make Git optional in the build script. This is the place to start:

Target "SetVersions" (fun _ ->
  let commitHash = Information.getCurrentSHA1 "."
  CreateCSharpAssemblyInfo "./SolutionInfo.cs"
        [Attribute.Version release.AssemblyVersion
         Attribute.FileVersion release.AssemblyVersion
         Attribute.Trademark commitHash]
)

Maybe add a local function to wrap getting the SHA in an exception handler that returns an empty string.
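
The wrapper idea above can be sketched roughly as follows. This is only an illustration in C# rather than the FAKE F# used by build.fsx, and the names GitInfo and TryGetCurrentSha are hypothetical; they are not part of FAKE or of this project. It assumes the only goal is to fall back to an empty string when git is missing or the folder is not a repository.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

public static class GitInfo
{
    // Hypothetical helper: returns the current commit SHA, or "" when git is
    // not installed or the directory is not a Git repository (e.g. a zip download).
    public static string TryGetCurrentSha(string repositoryDir)
    {
        try
        {
            var startInfo = new ProcessStartInfo("git", "rev-parse HEAD")
            {
                WorkingDirectory = repositoryDir,
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true
            };

            using (var process = Process.Start(startInfo))
            {
                if (process == null) return string.Empty;

                string output = process.StandardOutput.ReadToEnd();
                process.WaitForExit();

                // A non-zero exit code means git ran but found no repository.
                return process.ExitCode == 0 ? output.Trim() : string.Empty;
            }
        }
        catch (Win32Exception)
        {
            // git could not be started (not installed or not on the PATH).
            return string.Empty;
        }
        catch (InvalidOperationException)
        {
            return string.Empty;
        }
    }
}
```

In the build script the same shape would be a try ... with expression around Information.getCurrentSHA1, with the empty string passed on to Attribute.Trademark.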

This issue is a replacement for what we learned in #52.

javax.xml.parsers.FactoryConfigurationError was unhandled Message=Provider com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl not found Source=IKVM.OpenJDK.XML.API

javax.xml.parsers.FactoryConfigurationError was unhandled
Message=Provider com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl not found
Source=IKVM.OpenJDK.XML.API
StackTrace:
at javax.xml.parsers.DocumentBuilderFactory.newInstance()
at org.apache.tika.mime.MimeTypesReader.read(InputStream )
at org.apache.tika.mime.MimeTypesFactory.create(InputStream inputStream)
at org.apache.tika.mime.MimeTypesFactory.create(URL url)
at org.apache.tika.mime.MimeTypesFactory.create(String filePath)
at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes()
at org.apache.tika.config.TikaConfig..ctor(CompositeParser )
at org.apache.tika.config.TikaConfig..ctor()
at org.apache.tika.config.TikaConfig.getDefaultConfig()
at org.apache.tika.parser.AutoDetectParser..ctor()
at TikaOnDotNet.TextExtractor.Extract(String filePath) in E:\MyLearningLab\KevM-tikaondotnet-v0.2-5-gd63df0e\KevM-tikaondotnet-d63df0e\TikaOnDotnet\TextExtractor.cs:line 43
at testusingtikalib.Program.Main(String[] args) in E:\MyLearningLab\KevM-tikaondotnet-v0.2-5-gd63df0e\KevM-tikaondotnet-d63df0e\testusingtikalib\Program.cs:line 13
InnerException:

Upgrade to Tika 0.10

Extracting data from a mp4 somehow "locks" the file

I'm using Tika to extract data for our intranet. Everything works fine except for mp4 files. I do get a result back, but I can't delete the file after I've done the extraction.

My scenario is that I have the files on a network share and then copy them over to a temporary folder where I delete them after I've extracted the info. But for some reason it fails with mp4 files.

Next release verify GitHub release creation via Appveyor

Appveyor - enable build cache for packages dependent on paket.lock

https://www.appveyor.com/docs/build-cache/

This is not a big deal, but it would make our builds a bit faster.

TikaOnDotNet getting error when calling TikaOnDotNet from another DLL.

I came across your TikaOnDotNet and it works great for me if I call it from my windows application, but when I make the same call to extract from another DLL it fails on the CreateInstance.

Basically the path is: Application ---calls--> Custom DLL ---calls--> TikaOnDotnet

The error I get is

"Provider com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl not found"

It is thrown on the following line of code

TransformerFactory factory = TransformerFactory.newInstance();

Any thoughts or suggestions on getting TikaOnDotNet to work when called from an intermediate DLL?

Thanks

Joe

NuGet package 1.12

Hi,

Are you sure that the TikaOnDotNet assemblies got included in the latest build? I can't seem to find either TikaOnDotNet or TextExtractor anymore...

Reason why IKVM is omitting some methods?

What could be a possible reason for IKVM to omit certain methods? Looking at the class:

org.apache.fontbox.encoding.Encoding, you get this from metadata:

#region Assembly tika-app, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null
// <in-memory assembly>
#endregion

using IKVM.Attributes;
using java.lang;
using java.util;

namespace org.apache.fontbox.encoding
{
    public abstract class Encoding : Object
    {
        public const int NUMBER_OF_MAC_GLYPHS = 258;
        [Signature("Ljava/util/Map<Ljava/lang/String;Ljava/lang/Integer;>;")]
        public static Map MAC_GLYPH_NAMES_INDICES;
        [Signature("Ljava/util/Map<Ljava/lang/Integer;Ljava/lang/String;>;")]
        protected internal Map codeToName;
        [Signature("Ljava/util/Map<Ljava/lang/String;Ljava/lang/Integer;>;")]
        protected internal Map nameToCode;

        [LineNumberTableAttribute(new[] { 159, 174, 232, 160, 78, 203 })]
        public Encoding();

        [Modifiers(Modifiers.Public | Modifiers.Static | Modifiers.Final)]
        public static string[] MAC_GLYPH_NAMES { get; }

        [LineNumberTableAttribute(new[] { 160, 97, 113, 131, 130 })]
        public static string getCharacter(string name);
        public static void __<clinit>();
        [LineNumberTableAttribute(199)]
        [Throws(new[] { "java.io.IOException" })]
        public virtual string getCharacter(int code);
        [LineNumberTableAttribute(new[] { 92, 114, 131, 159, 16 })]
        [Throws(new[] { "java.io.IOException" })]
        public virtual int getCode(string name);
        [LineNumberTableAttribute(new[] { 111, 119, 131, 134 })]
        [Throws(new[] { "java.io.IOException" })]
        public virtual string getName(int code);
        [LineNumberTableAttribute(new[] { 159, 97, 66, 118, 131, 159, 16 })]
        [Throws(new[] { "java.io.IOException" })]
        public virtual string getNameFromCharacter(char c);
        [LineNumberTableAttribute(new[] { 77, 115, 115 })]
        protected internal virtual void addCharacterEncoding(int code, string name);
    }
}

But if we look at the method summary table from the docs (either 1.8.x or 2.0.x), you can see there are methods such as:

Map<Integer,String>	getCodeToNameMap()
Returns an unmodifiable view of the code to name mapping.

I've seen this behaviour on other libs I was tinkering with yesterday, but the methods/classes showed up if I exported the jar differently before running it through IKVM. Looking for your thoughts and thanks in advance.

Build fails when source was downloaded from GitHub and not cloned via Git

And it is very frustrating, to have 9 people who have spent so much time in order to publish something like this, only to have it wasted because no one can use the library as it stands. There is no way to download libraries, so one is left to attempt to compile. And in 30+ years of coding, I've never seen so much difficulty just to get something to compile.

Just to start with:

The build references an old version of the tikalib jar files. This needs to be updated.
When one downloads git on a windows 8.1 machine, git is inexplicably installed in a users\username\appdata\local\github\ (???????) folder! What the heck is up with that? To make matters even more dubious, git attaches a folder named 9w8pwoeropweiurpowisdsoiufs[o or something like this, where git is actually installed. So adding it to a path is not a simple task, especially when the user name has spaces in it. It's just a bad idea to hide an app in a user folder. I have no clue what the thinking is behind this.
After 3 days of trying all kinds of things, including verifying that git IS, in fact, in my path, I can STILL not get this to compile. All I receive is the following error, no matter what I do:

Checking Paket version (downloading latest stable)...
Paket.exe 3.3.6 is up to date.
Paket version 3.3.6.0
0 seconds - ready.
Building project with version: LocalBuild
Shortened DependencyGraph for Target RunTests:
<== RunTests
<== Build
<== CompileTikaLib
<== SetVersions
<== Clean

The resulting target order is:

Clean
SetVersions
CompileTikaLib
Build
RunTests
Starting Target: Clean
Deleting contents of artifacts
Deleting contents of temp
Deleting contents of lib
Finished Target: Clean
Starting Target: SetVersions (==> Clean)
git.exe rev-parse HEAD
Running build failed.
Error:
System.Exception: Could not run "git rev-parse HEAD".
Error: Start of process git.exe failed. The system cannot find the file specified
at [email protected](String message) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
at Fake.Git.CommandHelper.runSimpleGitCommand(String repositoryDir, String command) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
at Fake.Git.Branches.getSHA1(String repositoryDir, String commit) in C:\code\fake\src\app\FakeLib\Git\Branches.fs:line 32
at Fake.Git.Information.getCurrentSHA1(String repositoryDir) in C:\code\fake\src\app\FakeLib\Git\Information.fs:line 60
at [email protected](Unit _arg2)
at Fake.TargetHelper.runSingleTarget(TargetTemplate`1 target) in C:\code\fake\src\app\FakeLib\TargetHelper.fs:line 492

Build Time Report

Target Duration

Clean 00:00:00.0022169
Total: 00:00:00.0875770

Status: Failure

System.Exception: Could not run "git rev-parse HEAD".
Error: Start of process git.exe failed. The system cannot find the file specified
at [email protected](String message) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
at Fake.Git.CommandHelper.runSimpleGitCommand(String repositoryDir, String command) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
at Fake.Git.Branches.getSHA1(String repositoryDir, String commit) in C:\code\fake\src\app\FakeLib\Git\Branches.fs:line 32
at Fake.Git.Information.getCurrentSHA1(String repositoryDir) in C:\code\fake\src\app\FakeLib\Git\Information.fs:line 60
at [email protected](Unit _arg2)

at Fake.TargetHelper.runSingleTarget(TargetTemplate`1 target) in C:\code\fake\src\app\FakeLib\TargetHelper.fs:line 492

C:\downloads\tikaondotnet-master\tikaondotnet-master>git rev-parse HEAD
fatal: Not a git repository (or any of the parent directories): .git

C:\downloads\tikaondotnet-master\tikaondotnet-master>

Can someone please help? I notice that the last comment was made over 3 months ago, referencing, it seems very similar issues, and in that time, nothing has been resolved.

I am NOT compiling this in visual studio, as was specified in your readme. From the command line, I type 'build' in the tika download folder. And no matter if git is in my path or not in my path, I still get the above error.

Also, it seems there are files trying to get compiled that don't even exist in the sources. I'll get nailed for saying this, but it's a complete mess. How can someone feel good about the quality of the conversion, if there are so many errors in just getting it to compile? I realize the work you folks have done in this, but what's the point if no one can actually use it? Isn't THAT the whole point?

Recursive processing with TikaOnDotNet

I am attempting to process nested docs using the example code from: http://wiki.apache.org/tika/RecursiveMetadata.

The key seems to be setting up the context with context.set(Parser.class, parser); since somewhere down in the Tika code it will pull out that class object with context.get(Parser.class).

In .NET, Parser is an interface and has no 'class' property.

What do I use on the .NET side so that the converted Java code works correctly?

Add step to build to convert release notes to HTML

http://tpetricek.github.io/FSharp.Formatting/

Upgrade to Tika 1.4

Reference discrepancy?

When installing the .TextExtractor package with dependencies on an empty project, I get the message:

Assembly 'TikaOnDotNet' with identity 'TikaOnDotNet, Version=1.12.0.0, Culture=neutral, PublicKeyToken=null'
uses 'IKVM.OpenJDK.Core, Version=8.1.5717.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58'
which has a higher version than referenced assembly 
'IKVM.OpenJDK.Core' with identity 'IKVM.OpenJDK.Core, Version=7.2.4630.5, Culture=neutral, PublicKeyToken=13235d27fcbfff58'

When examining the TikaOnDotNet nupkg nuspec, I find that the dependency is in fact <dependency id="IKVM" version="[7.2.4630.5]" />.

I have no weird assemblyRedirects (I actually cleared them just to be sure), and IKVM is installed on the said project with the "correct" version. No other project is using it.

Upgrading the NuGet package manually does it, so this is a bug report.

TikaDotNet does not recurse zip files the same way as direct cli execution of the jar does

Hello,

I had a problem and it took me a while to notice exactly why but when tika is executed from the command line (as a jar or a ikvm wrapped exe) it correctly recurses through zip files. TikaDotNet's TextExtractor Extract method does not.

The issue comes down to how the context is created. Instead of:
...
Class parserClass = parser.GetType();
parseContext.set(parserClass, parser);
...
The correct option (for parity with the java cli usage anyway) would be:

parseContext.set(typeof(org.apache.tika.parser.Parser), parser);

This fundamentally changes the behavior of tikadotnet, so I don't want to just submit a patch or anything, but this will provide parity with the direct Java jar usage if that is desired.

Thanks,
Tim

Slowness parsing Outlook .msg files

Hi,

Tikaondotnet seems to have an issue parsing .msg files. At first it seemed like it was stuck in an infinite loop somewhere, but it does usually return after a while. I downloaded the code and stepped into the TextExtractor class in an attempt to figure it out -- the parse method does return; the issue appears to be with closing the inputStream:

using (var inputStream = streamFactory(metadata))
{
    try
    {
        parser.parse(inputStream, getTransformerHandler(outputWriter), metadata, parseContext);
    }
    finally
    {
        inputStream.close();
    }
}

I've tried several different .msg files, with the same result. Those same files do work via the Tika GUI, which is using the same tika-app-1.13.jar under the hood.

Any idea if there is a way around this? Tikaondotnet is awesome and I'd love to make use of it if there's a way to get the .msg files working. I've attached a (zipped) .msg sample to help reproduce the issue.

Many thanks,
Andrew
emailTest.zip

Update to latest Tika version

Hi,

Can you update to the latest version? Version 1.14 has been out since 19 October.

Thanks.

Move tests to their own .csproj.

build error

I tried to build the source with build.cmd. My source is located at c:\users\harrchen\source\repos\tikaondontnet. However, I got an error pointing to c:\code, which doesn't exist:

System.Exception: Could not run "git rev-parse HEAD".
Error: Start of process git.exe failed. The system cannot find the file specified
at [email protected](String message) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89

TextExtractor.Extract(byte[] data) should accept offset and length

This function should handle byte array offset and length for when the data buffer is larger than the actual document it contains.

Something similar to Encoding.GetString(byte[] bytes, int index, int count)
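
Until such an overload exists, one possible workaround is to copy the relevant window into a right-sized array and pass that to the existing Extract(byte[] data). A minimal sketch, assuming a copy of the document bytes is acceptable; BufferWindow and Slice are hypothetical names, not part of TikaOnDotNet:

```csharp
using System;

public static class BufferWindow
{
    // Hypothetical helper: copies `count` bytes starting at `offset`
    // out of a larger buffer into a new, exactly sized array.
    public static byte[] Slice(byte[] buffer, int offset, int count)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));
        if (offset < 0 || count < 0 || offset > buffer.Length - count)
        {
            throw new ArgumentOutOfRangeException(
                nameof(offset), "offset and count must describe a range inside the buffer.");
        }

        var window = new byte[count];
        Buffer.BlockCopy(buffer, offset, window, 0, count);
        return window;
    }
}
```

A call would then look like textExtractor.Extract(BufferWindow.Slice(data, offset, length)). The cost is one extra copy of the document; a real overload could avoid that by wrapping the range in a stream instead.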

Configuring Tesseract OCR for TikaOnDotNet

The hope here is to get TikaOnDotNet fully configured to access Tesseract OCR for text extraction from images. With Tika .93, support for Tesseract was added, and we are now in the midst of validating the latest release, Tika 1.13.1. A big set of validations centers on Tika's ability to handle certain types of PDF files. It should be noted that TIFF image handling in PDFBox has changed due to licensing issues that are not in compliance with the Apache license.

So here is hoping that if we cannot read it one way, we might be able to read it using another.

The first step has been to extend Kevin's TextExtractor so that metadata can be passed in to assist the parsing. That set of extensions is here:

public static class TikaOnDotNetExtensions
{
    private static TikaConfig config = TikaConfig.getDefaultConfig();
    public static TextExtractionResult Extract(this TextExtractor te, byte[] data, string filePath, string ContentType)
    {
      TextExtractionResult result = te.Extract
        (
          metadata =>
          {
            metadata.add("resourceName", System.IO.Path.GetFileName(filePath));
            metadata.add("FilePath", filePath);
            try
            {
              if (!ContentType.Equals("application/octet-stream", StringComparison.CurrentCultureIgnoreCase))
              {
                metadata.add("Content-Type", ContentType);
              }
              else
              {
                Detector detector = config.getDetector();
                using (org.apache.tika.io.TikaInputStream inputStream = org.apache.tika.io.TikaInputStream.@get(data, metadata))
                {
                  MediaType foundType = detector.detect(inputStream, metadata);
                  if (!foundType.toString().Equals("application/octet-stream", StringComparison.CurrentCultureIgnoreCase))
                  {
                    metadata.add("Content-Type", foundType.toString());
                  }
                }
              }
            }
            catch (Exception ex)
            {
              throw; // rethrow without resetting the original stack trace
            }


            return TikaInputStream.get(data, metadata);
          }
        );

      return result;
    }

    public static TextExtractionResult Extract(this TextExtractor te, byte[] data, string filePath)
    {
      return te.Extract(data, filePath, "application/octet-stream");
    }
}

The next step has been to dump the configuration to confirm how Tika is configured and what changes might need to be made. The dump routine was added to the class above:

    public static string TikaConfigDump()
    {
      StringBuilder retVal = new StringBuilder();

      retVal.AppendFormat("{0}\t{1}\n\n", "Version", (new org.apache.tika.Tika(config)).toString());


      retVal.AppendLine("\nDetectors");

      CompositeDetector configDetector = (CompositeDetector)config.getDetector();
      var detectors = configDetector.getDetectors().toArray();
      foreach (Detector detector in detectors)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)detector).getClass().getName());

        if (detector.GetType() == typeof(CompositeDetector))
        {
          // List the nested detector's own children, not the top-level list again.
          var subDetectors = ((CompositeDetector)detector).getDetectors().toArray();
          foreach (Detector subDetector in subDetectors)
          {
            retVal.AppendFormat("\t\t{0}\n", ((java.lang.Object)subDetector).getClass().getName());
          }
        }
      }

      retVal.AppendLine("\nParsers");

      CompositeParser configParser = (CompositeParser)config.getParser();
      var parsers = configParser.getAllComponentParsers().toArray();
      foreach (Parser parser in parsers)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)parser).getClass().getName());

        var parserTypes = parser.getSupportedTypes(new ParseContext()).toArray();
        foreach (MediaType mediaType in parserTypes)
        {
          retVal.AppendFormat("\t\t{0}\n", mediaType.toString());
        }
      }

      org.apache.tika.language.translate.Translator translator = config.getTranslator();
      if (translator.isAvailable())
      {
        retVal.AppendFormat("Translator {0}\n", ((java.lang.Object)translator).getClass().getName());
      }

      return retVal.ToString();
    }

On my system using the default configuration provided by Kevin you can see the setup below:

Version Apache Tika 1.13

Detectors
org.apache.tika.parser.microsoft.POIFSContainerDetector
org.apache.tika.parser.pkg.ZipContainerDetector
org.gagravarr.tika.OggDetector
org.apache.tika.mime.MimeTypes

Parsers
org.apache.tika.parser.asm.ClassParser
application/java-vm
org.apache.tika.parser.audio.AudioParser
audio/x-wav
audio/basic
audio/x-aiff
org.apache.tika.parser.audio.MidiParser
application/x-midi
audio/midi
org.apache.tika.parser.chm.ChmParser
application/vnd.ms-htmlhelp
application/x-chm
application/chm
org.apache.tika.parser.code.SourceCodeParser
text/x-c++src
text/x-groovy
text/x-java-source
org.apache.tika.parser.crypto.Pkcs7Parser
application/pkcs7-signature
application/pkcs7-mime
org.apache.tika.parser.dif.DIFParser
application/dif+xml
org.apache.tika.parser.dwg.DWGParser
image/vnd.dwg
org.apache.tika.parser.epub.EpubParser
application/x-ibooks+zip
application/epub+zip
org.apache.tika.parser.executable.ExecutableParser
application/x-msdownload
application/x-sharedlib
application/x-elf
application/x-object
application/x-executable
application/x-coredump
org.apache.tika.parser.external.CompositeExternalParser
org.apache.tika.parser.feed.FeedParser
application/atom+xml
application/rss+xml
org.apache.tika.parser.font.AdobeFontMetricParser
application/x-font-adobe-metric
org.apache.tika.parser.font.TrueTypeParser
application/x-font-ttf
org.apache.tika.parser.gdal.GDALParser
application/x-gsc
image/x-ozi
application/x-pds
image/eir
application/x-usgs-dem
application/aaigrid
application/x-bag
application/elas
application/x-rs2
application/x-tsx
application/x-lcp
image/geotiff
application/x-mbtiles
application/x-cappi
application/x-netcdf
application/x-gsag
application/x-epsilon
application/x-ace2
application/jaxa-pal-sar
image/x-pcraster
application/x-msgn
image/arg
application/x-hdf
image/x-mff
application/x-kro
image/x-hdf5-image
image/x-dimap
image/x-srp
image/big-gif
application/x-envi
application/x-cosar
application/x-ntv2
image/bmp
application/x-doq2
application/x-bt
application/x-kml
application/x-gmt
application/x-rst
application/vrt
application/pcisdk
application/x-ctg
application/x-e00-grid
application/x-rik
image/ida
image/x-mff2
application/sdts-raster
application/x-snodas
image/jp2
image/sar-ceos
application/terragen
application/x-wcs
application/leveller
application/x-ingr
application/x-gtx
image/sgi
application/x-pnm
image/raster
application/fits
application/x-r
image/gif
application/x-envi-hdr
application/x-http
application/x-rmf
application/x-ecrg-toc
application/aig
application/x-rpf-toc
image/adrg
application/x-srtmhgt
application/x-generic-bin
application/jdem
image/x-airsar
application/x-webp
application/x-ngs-geoid
application/x-pcidsk
image/x-fujibas
application/x-wms
application/x-map
image/ceos
application/xpm
application/x-zmap
image/envisat
application/x-ers
application/x-doq1
application/x-isis2
application/x-nwt-grd
application/x-ppi
image/ilwis
application/x-isis3
application/x-nwt-grc
application/x-blx
application/gff
application/x-ndf
image/jpeg
application/x-geo-pdf
application/x-l1b
image/fit
application/x-gsbg
application/x-sdat
application/x-ctable2
application/x-grib
application/x-coasp
application/x-dipex
application/grass-ascii-grid
image/fits
application/x-til
application/x-dods
image/png
application/x-gxf
application/x-gs7bg
application/x-cpg
application/x-lan
application/x-xyz
image/bsb
application/x-p-aux
application/dted
application/x-rasterlite
image/nitf
image/hfa
application/x-fast
application/x-los-las
org.apache.tika.parser.geo.topic.GeoParser
application/geotopic
org.apache.tika.parser.geoinfo.GeographicInformationParser
text/iso19139+xml
org.apache.tika.parser.grib.GribParser
application/x-grib2
org.apache.tika.parser.hdf.HDFParser
application/x-hdf
org.apache.tika.parser.html.HtmlParser
text/html
application/vnd.wap.xhtml+xml
application/x-asp
application/xhtml+xml
org.apache.tika.parser.image.BPGParser
image/bpg
image/x-bpg
org.apache.tika.parser.image.ICNSParser
image/icns
org.apache.tika.parser.image.ImageParser
image/png
image/vnd.wap.wbmp
image/bmp
image/x-xcf
image/gif
image/x-icon
image/x-ms-bmp
org.apache.tika.parser.image.PSDParser
image/vnd.adobe.photoshop
org.apache.tika.parser.image.TiffParser
image/tiff
org.apache.tika.parser.image.WebPParser
image/webp
org.apache.tika.parser.iptc.IptcAnpaParser
text/vnd.iptc.anpa
org.apache.tika.parser.isatab.ISArchiveParser
application/x-isatab
org.apache.tika.parser.iwork.IWorkPackageParser
application/vnd.apple.keynote
application/vnd.apple.iwork
application/vnd.apple.numbers
application/vnd.apple.pages
org.apache.tika.parser.jdbc.SQLite3Parser
org.apache.tika.parser.journal.JournalParser
application/pdf
org.apache.tika.parser.jpeg.JpegParser
image/jpeg
org.apache.tika.parser.mail.RFC822Parser
message/rfc822
org.apache.tika.parser.mat.MatParser
application/x-matlab-data
org.apache.tika.parser.mbox.MboxParser
application/mbox
org.apache.tika.parser.mbox.OutlookPSTParser
application/vnd.ms-outlook-pst
org.apache.tika.parser.microsoft.JackcessParser
application/x-msaccess
org.apache.tika.parser.microsoft.OfficeParser
application/x-tika-msoffice-embedded; format=ole10_native
application/msword
application/vnd.visio
application/vnd.ms-project
application/x-tika-msworks-spreadsheet
application/x-mspublisher
application/vnd.ms-powerpoint
application/x-tika-msoffice
application/sldworks
application/x-tika-ooxml-protected
application/vnd.ms-excel
application/vnd.ms-outlook
org.apache.tika.parser.microsoft.OldExcelParser
application/vnd.ms-excel.workspace.3
application/vnd.ms-excel.workspace.4
application/vnd.ms-excel.sheet.2
application/vnd.ms-excel.sheet.3
application/vnd.ms-excel.sheet.4
org.apache.tika.parser.microsoft.TNEFParser
application/vnd.ms-tnef
application/x-tnef
application/ms-tnef
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
application/vnd.ms-word.document.macroenabled.12
application/vnd.ms-excel.addin.macroenabled.12
application/x-tika-ooxml
application/vnd.openxmlformats-officedocument.wordprocessingml.template
application/vnd.ms-powerpoint.addin.macroenabled.12
application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.presentationml.template
application/vnd.ms-powerpoint.slideshow.macroenabled.12
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.ms-powerpoint.presentation.macroenabled.12
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.presentationml.slideshow
application/vnd.ms-excel.template.macroenabled.12
application/vnd.ms-excel.sheet.macroenabled.12
application/vnd.ms-word.template.macroenabled.12
org.apache.tika.parser.mp3.Mp3Parser
audio/mpeg
org.apache.tika.parser.mp4.MP4Parser
video/x-m4v
application/mp4
video/3gpp
video/3gpp2
video/quicktime
audio/mp4
video/mp4
org.apache.tika.parser.netcdf.NetCDFParser
application/x-netcdf
org.apache.tika.parser.ocr.TesseractOCRParser
org.apache.tika.parser.odf.OpenDocumentParser
application/x-vnd.oasis.opendocument.presentation
application/vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.image
application/vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.sun.xml.writer
application/x-vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.spreadsheet
application/x-vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.spreadsheet
application/vnd.oasis.opendocument.image
application/x-vnd.oasis.opendocument.text
application/x-vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.text
application/vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.text-master
application/vnd.oasis.opendocument.presentation
application/x-vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.text-master
org.apache.tika.parser.pdf.PDFParser
application/pdf
org.apache.tika.parser.pkg.CompressorParser
application/zlib
application/x-gzip
application/x-bzip2
application/x-compress
application/x-java-pack200
application/gzip
application/x-bzip
application/x-xz
org.apache.tika.parser.pkg.PackageParser
application/x-tar
application/java-archive
application/x-archive
application/zip
application/x-cpio
application/x-tika-unix-dump
application/x-7z-compressed
org.apache.tika.parser.pkg.RarParser
application/x-rar-compressed
org.apache.tika.parser.pot.PooledTimeSeriesParser
org.apache.tika.parser.rtf.RTFParser
application/rtf
org.apache.tika.parser.txt.TXTParser
text/plain
org.apache.tika.parser.video.FLVParser
video/x-flv
org.apache.tika.parser.xml.DcXMLParser
application/xml
image/svg+xml
org.apache.tika.parser.xml.FictionBookParser
application/x-fictionbook+xml
org.gagravarr.tika.FlacParser
audio/x-oggflac
audio/x-flac
org.gagravarr.tika.OggParser
audio/ogg
application/kate
application/ogg
video/daala
video/x-ogguvs
video/x-ogm
audio/x-oggpcm
video/ogg
video/x-dirac
video/x-oggrgb
video/x-oggyuv
org.gagravarr.tika.OpusParser
audio/opus
audio/ogg; codecs=opus
org.gagravarr.tika.SpeexParser
audio/ogg; codecs=speex
audio/speex
org.gagravarr.tika.TheoraParser
video/theora
org.gagravarr.tika.VorbisParser
audio/vorbis

The next set of steps will be configuring and testing Tesseract prior to integrating it in Tika.

Update Tika to 1.9

Hey, thanks for your work!

Do you plan to update to Tika 1.9? Is the update process involved, or will I be able to do it easily (and submit a PR)?

There is a bug in 1.7 (and 1.8 as well) which causes WordParser to fail on some docs.

Produce TikaOnDotNet Nuget

After the automation gets updated, support publishing a nuget.

Add console usage example

TikaOnDotNet namespace change?

Currently TikaOnDotNet is namespaced to match the Java jar, tika-app. Is this the correct namespace, or should it be changed to TikaOnDotNet?

Where is SolutionInfo.cs?

The project has a link to SolutionInfo.cs, but that file is not part of the source. Is it needed? If so, where can I find it?

Feeding Tika with a stream?

Hello again.

Is there much to gain by using this method, instead of feeding Tika a simple byte array?

public TextExtractionResult Extract(Func<Metadata, InputStream> streamFactory);

If so, how do we use it?

Right now, my code looks like this:

    public List<string> ExtractText(Stream inputStream)
    {
        using (var memoryStream = new MemoryStream())
        {
            inputStream.CopyTo(memoryStream);

            var result = Tika.Extract(memoryStream.ToArray());

            var str = result.Text
                .Replace("\r", string.Empty)
                .Replace("§  ", string.Empty)
                .Split(new string[] { "\n\n\n\n" }, StringSplitOptions.RemoveEmptyEntries)
                .Select(t => t.Replace("\n", " ").Replace("    ", " ").Replace("   ", " ").Replace("  ", " ").Trim())
                .ToList();

            return str;
        }
    }
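
When a Stream is buffered into a byte array before it is handed to Extract, it matters which MemoryStream method produces the array. The following is a minimal sketch using only System.IO; the StreamBuffering class and the ReadAllBytes name are hypothetical helpers, not part of TikaOnDotNet.

```csharp
using System;
using System.IO;

static class StreamBuffering
{
    // Reads a stream to the end and returns exactly the bytes that were read.
    public static byte[] ReadAllBytes(Stream input)
    {
        using (var memoryStream = new MemoryStream())
        {
            input.CopyTo(memoryStream);

            // ToArray() copies only the bytes that were written.
            // GetBuffer() returns the internal buffer, whose length can be
            // larger than memoryStream.Length, so a parser would also see
            // the unused trailing bytes.
            return memoryStream.ToArray();
        }
    }
}
```

The returned array can then be passed to the byte-array overload, for example Tika.Extract(StreamBuffering.ReadAllBytes(inputStream)).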

Add a strong name to the dlls

The dlls are not signed with strong names, which means it's not possible to reference them from other dlls that have a strong name. I was also unable to re-sign them using ILDASM/ILASM, so it's a bit of a pain...

Please sign the dlls with a strong name.

IKVM version of Tika hangs during Word file extracting

I had converted something like 10k .doc/.docx files successfully and then hit this one.

Java Tika converts it in just a few seconds, but the IKVM version hangs forever with high resource usage. I've tried both the current master and the IKVM8/Tika1.9 build.

UPD: If I re-save the file in Word 2013, both Java and IKVM versions are able to extract the text.

UPD2: It eventually blows up w/ OutOfMemory:

Unhandled Exception: TikaOnDotNet.TextExtractionException: Extraction of text from the file 'C:\Users\vorou\Desktop\ftw\input\hang-doc.docx' failed. ---> TikaOnDotNet.TextExtractionException: Extraction failed. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(SchemaTypeLoader stl, InputStream is, SchemaType type, XmlOptions options)
   at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(InputStream jiois, SchemaType type, XmlOptions options)
   at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument.Factory.parse(InputStream is)
   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead()
   at org.apache.poi.POIXMLDocument.load(POIXMLFactory factory)
   at org.apache.poi.xwpf.usermodel.XWPFDocument..ctor(OPCPackage pkg)
   at org.apache.poi.xwpf.extractor.XWPFWordExtractor..ctor(OPCPackage container)
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(OPCPackage pkg)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at TikaOnDotNet.TextExtractor.Extract(Func`2 streamFactory) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 92
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtractor.Extract(Func`2 streamFactory) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 108
   at TikaOnDotNet.TextExtractor.Extract(String filePath) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 51
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtractor.Extract(String filePath) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 60
   at Word2Txt.Program.Main() in C:\Users\vorou\code\Word2Txt\Word2Txt\Program.cs:line 19

UPD3: full procmon output, in case you know what to look for.

Automate tika-app.dll creation

I would really like to automate the creation of tika-app.dll with the latest IKVM and Tika version.

Here is a rough guesstimate of the steps involved:

1. Download latest IKVM tools (is there a nuget for these?)
2. Download latest Tika jar.
3. ikvmc.exe -target:library tika-app-{version}.jar
4. copy result to lib directory
5. run unit tests
6. update version and IKVM dependencies
7. commit
8. publish nuget
   - release.bat
   - ripple publish ${Version} ${ApiKey}

My original blog post on manually doing 1-3
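
The ikvmc.exe step in the list above could be scripted roughly as follows. This is a minimal sketch, not the project's build script: the TikaDllBuilder class and its method names are hypothetical, -target:library comes from the step itself, and -out: is ikvmc's output-file option. It assumes exactly one tika-app-*.jar sits in the download directory, so the Tika version never has to be hard-coded.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

static class TikaDllBuilder
{
    // Finds the tika-app jar by wildcard instead of a fixed version number.
    // Throws InvalidOperationException if zero or several jars match.
    public static string FindTikaJar(string directory)
    {
        return Directory.GetFiles(directory, "tika-app-*.jar").Single();
    }

    // Builds the ikvmc command line: compile the jar into a .NET library.
    public static string BuildIkvmcArguments(string jarPath, string outputDll)
    {
        return $"-target:library -out:\"{outputDll}\" \"{jarPath}\"";
    }

    // Runs ikvmc and returns its exit code (0 is assumed to mean success).
    public static int Compile(string ikvmcPath, string jarPath, string outputDll)
    {
        var startInfo = new ProcessStartInfo(ikvmcPath, BuildIkvmcArguments(jarPath, outputDll))
        {
            UseShellExecute = false
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A caller would pass the path to ikvmc.exe and the jar returned by FindTikaJar, then treat a non-zero exit code as a failed build before moving on to the copy and unit-test steps.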

TextExtractor should be versioned separately

Now that they have been separated, should TextExtractor be in a completely different repo?

The two Nugets should at least have separate release notes and version numbers.

SAXParserFactoryImpl not found

Hi,

I'm getting the following errors when trying an extraction:
Provider com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl not found
Provider com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl not found

I have managed to work around the issue by adding the following before calling new TextExtractor():
com.sun.org.apache.xerces.@internal.jaxp.SAXParserFactoryImpl s = new com.sun.org.apache.xerces.@internal.jaxp.SAXParserFactoryImpl();
com.sun.org.apache.xalan.@internal.xsltc.trax.TransformerFactoryImpl t = new com.sun.org.apache.xalan.@internal.xsltc.trax.TransformerFactoryImpl();

This forces the files to be loaded and all works fine.

I have added the appSettings in the app.config as mentioned in your blog post but it still throws this error.

It's no big deal; I have run it over more than 2k documents and it extracted the text perfectly.
I was just wondering whether I was missing something else in the setup process.

Thank you for making this. It's been a joy to use. It makes extracting text from documents very easy on .net.

The change should be simple enough, and should be fully backwards-compatible:

<dependencies>
  <dependency id="IKVM" version="7.4.5196" />
</dependencies>

(just remove the brackets around the version number)

Thanks.

Problem parsing PPT-file with encoding MacRoman (probably on all macroman office docs)

Hi!

We are using your solution to extract document content in a project. However, we are having problems extracting office documents created on a Mac. We get the following error:

Test method Tests.CoreTests.BinaryExtraction.Tika.TextExtractionUsingTikaTests.CanExtractEmptyPowerpointFromStream threw exception:

Core.Domain.Exceptions.TextExtractionException: 
TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@33203cd ---> 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@33203cd ---> 
java.io.UnsupportedEncodingException: MacRoman

Do you have any idea how this could be solved?

Fresh Nuget Package with Tika 1.11

I see that the pull request was merged in. Any idea when we could see this released via Nuget?

