Comments (48)
Can you try the latest source code that I just pushed to GitHub? I put in a fallback mechanism so that the IFilterReader uses the old IPersistFile interface when the IPersistStream interface fails:
var iPersistStream = iFilter as NativeMethods.IPersistStream;
Exception iPersistStreamException = null;
// IPersistStream is assumed on 64-bit systems
if (iPersistStream != null)
{
    try
    {
        iPersistStream.Load(new IStreamWrapper(stream));
        NativeMethods.IFILTER_FLAGS flags;
        if (iFilter.Init(iflags, 0, IntPtr.Zero, out flags) == NativeMethods.IFilterReturnCode.S_OK)
            return iFilter;
    }
    catch (Exception exception)
    {
        Marshal.ReleaseComObject(iFilter);
        iPersistStreamException = exception;
    }
}
if (iPersistStreamException != null)
{
    if (string.IsNullOrWhiteSpace(fileName))
        throw new IFOldFilterFormat("The IFilter does not support the IPersistStream interface, supply a filename to use the IFilter", iPersistStreamException);
    // If we get here we are probably using an old IFilter, so try to load it the old way
    // ReSharper disable once SuspiciousTypeConversion.Global
    var persistFile = iFilter as IPersistFile;
    if (persistFile != null)
    {
        persistFile.Load(fileName, 0);
        NativeMethods.IFILTER_FLAGS flags;
        if (iFilter.Init(iflags, 0, IntPtr.Zero, out flags) == NativeMethods.IFilterReturnCode.S_OK)
            return iFilter;
    }
}
from ifiltertextreader.
My version is based on the same old utility that you found on CodeProject. I reused most of the code, so I really don't know why my version doesn't work on your server. I just tested my version on a Windows Server 2012 R2 machine and it works without any problems.
I didn't do anything special in the program. The only thing I did was upgrade some interfaces so that they work with the IPersistStream interface. It is really hard for me to guess what is going wrong on your side. Can you post some screenshots?
from ifiltertextreader.
Which Adobe IFilter did you install? I have this one --> http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542
And is your SQL instance 32-bit or 64-bit?
from ifiltertextreader.
I want to understand why it behaves this way. I tested my SQL CLR function with IFilterTextReader and it behaves the same as the demo app.
OK, this is the RTF file:
from ifiltertextreader.
This is the PDF.
And SQL Server works fine with the RTF and PDF (the same?) IFilters.
from ifiltertextreader.
Where did you get the IFilter Tester (the one on the right)? It looks like an old version that I made some time ago.
The most important question now is: is the program running in 32-bit or 64-bit mode?
I ask because a program running in 32-bit mode needs 32-bit IFilters, and a program running in 64-bit mode needs 64-bit IFilters.
from ifiltertextreader.
- PDF IFilter; this is what SQL Server shows:
.pdf 6C337B26-3E38-4F98-813B-FBA18BAB64F5 C:\Windows\system32\glcndFilter.dll 6.2.9200.16451 Microsoft Corporation
(https://support.microsoft.com/en-us/kb/2791465)
- IFilterTester:
Using IFilter in C#
Eyal Post, 19 Mar 2006
http://www.codeproject.com/Articles/13391/Using-IFilter-in-C
- I built the application as is: Any CPU.
But I also use IFilterTextReader.dll in my SQL CLR function in SQL 2012 (EE x64). It behaves the same as the demo app: it gives me the same exception on PDF.
from ifiltertextreader.
Ok, I will test it.
Btw, I found this info:
"c:\windows\system32\glcndFilter.dll which is the default PDF ifilter in 2012 server"
from ifiltertextreader.
I replaced this piece of code and tested:
- rtf: it hangs too
- pdf:
at IFilterTextReader.FilterLoader.LoadAndInitIFilter(Stream stream, String extension, Boolean disableEmbeddedContent, String fileName) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextReader\FilterLoader.cs:line 162
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextReader\FilterReader.cs:line 138
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextViewer\MainForm.cs:line 112
COM object that has been separated from its underlying RCW cannot be used.
I can try to replace the PDF filter with the native Adobe one, but if SQL Server works with glcndFilter.dll then it is interesting why the app does not work.
from ifiltertextreader.
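A note on the trace above: the runtime reports "COM object that has been separated from its underlying RCW cannot be used" when a wrapper is touched after Marshal.ReleaseComObject has released it. The fallback snippet at the top of the thread calls Marshal.ReleaseComObject(iFilter) in the catch block and then casts the same iFilter to IPersistFile, which looks consistent with the failure reported at FilterLoader.cs line 162. The following is a minimal sketch of the ordering only, not the library's code. FakeFilter, FilterLoadStep and LoadWithFallback are hypothetical stand-ins so the example runs without COM.

```csharp
using System;

// Hypothetical stand-in for a COM filter wrapper. Release() plays the role
// of Marshal.ReleaseComObject and makes the object unusable afterwards.
public sealed class FakeFilter
{
    private bool _released;

    public bool IsReleased { get { return _released; } }

    public void Release() { _released = true; }

    public void EnsureUsable()
    {
        if (_released)
            throw new InvalidOperationException("Filter used after release");
    }
}

// Hypothetical delegate type representing one way of loading the filter.
public delegate void FilterLoadStep(FakeFilter filter);

public static class FallbackDemo
{
    // Tries the primary loader, then the fallback. The filter is released
    // only after both loaders have failed, never between the two attempts.
    public static bool LoadWithFallback(FakeFilter filter, FilterLoadStep primary, FilterLoadStep fallback)
    {
        try
        {
            primary(filter);
            return true;
        }
        catch (Exception)
        {
            // Deliberately no Release() here: the fallback still needs the object.
        }

        try
        {
            fallback(filter);
            return true;
        }
        catch (Exception)
        {
            filter.Release();
            return false;
        }
    }

    public static void Main()
    {
        FakeFilter filter = new FakeFilter();
        bool loaded = LoadWithFallback(
            filter,
            delegate(FakeFilter f) { throw new NotSupportedException("stream load failed"); },
            delegate(FakeFilter f) { f.EnsureUsable(); });
        Console.WriteLine("Loaded: " + loaded + ", released: " + filter.IsReleased);
    }
}
```

The design choice is simply to postpone the release until no further attempt will use the object. Whether this alone resolves the PDF case in this thread is not something the sketch can show.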
You can also try removing all the _job code... it is needed when the Adobe IFilters are used:
public MainForm()
{
    InitializeComponent();
    // Add the current process to the sandbox
    _job.AddProcess(Process.GetCurrentProcess().Handle);
}
from ifiltertextreader.
var persistFile = iFilter as IPersistFile;
if (persistFile != null) //<----------FilterLoader.cs:line 162
from ifiltertextreader.
Removed _job call, but the same error...
from ifiltertextreader.
Try removing the complete IPersistStream code below so that only the old interface is used... then it is the same as Eyal's code.
// ReSharper disable once SuspiciousTypeConversion.Global
var iPersistStream = iFilter as NativeMethods.IPersistStream;
// IPersistStream is assumed on 64-bit systems
if (iPersistStream != null)
{
    iPersistStream.Load(new IStreamWrapper(stream));
    NativeMethods.IFILTER_FLAGS flags;
    if (iFilter.Init(iflags, 0, IntPtr.Zero, out flags) == NativeMethods.IFilterReturnCode.S_OK)
        return iFilter;
}
else
Just another question... where are you from? The USA?
from ifiltertextreader.
Also, glcndFilter.dll is the default PDF IFilter for Windows Server 2012 and Windows 8 too.
I have checked on my server that everything is as described here:
https://ryancr.wordpress.com/category/computers-and-internet/windows-8/
b) Default value at HKEY_CLASSES_ROOT\.pdf\PersistentHandler should be {1AA9BF05-9A97-48c1-BA28-D9DCE795E93C}
c) Default value at HKEY_CLASSES_ROOT\CLSID\{1AA9BF05-9A97-48c1-BA28-D9DCE795E93C}\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF} should be {6C337B26-3E38-4F98-813B-FBA18BAB64F5}
d) If you’re running Windows 8x:
Default value at HKEY_CLASSES_ROOT\CLSID\{6C337B26-3E38-4F98-813B-FBA18BAB64F5}\InProcServer32 should be %systemroot%\system32\glcndFilter.dll
In an administrative command prompt, run: regsvr32 %systemroot%\system32\glcndFilter.dll and confirm you get “DllRegisterServer in C:\WINDOWS\system32\glcndFilter.dll succeeded.”
from ifiltertextreader.
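The registry checks quoted above form a chain: the extension points to a persistent handler, the handler points to the IFilter CLSID, and that CLSID points to the DLL. A minimal sketch of walking that chain follows. FilterLookup, ResolveFilterDll and the readDefault delegate are hypothetical names; the delegate stands in for reading a key's default value under HKEY_CLASSES_ROOT (on Windows this could be backed by Microsoft.Win32.Registry), and the sample data in Main is copied from the values quoted above. It covers only the path listed in that comment, not every way an IFilter can be registered.

```csharp
using System;
using System.Collections.Generic;

public static class FilterLookup
{
    // Sub key name under PersistentAddinsRegistered, as quoted in the thread.
    private const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // readDefault returns the default value of a key below HKEY_CLASSES_ROOT,
    // or null when the key does not exist (assumption of this sketch).
    public static string ResolveFilterDll(string extension, Func<string, string> readDefault)
    {
        // Step 1: extension -> persistent handler CLSID
        string handler = readDefault(extension + @"\PersistentHandler");
        if (string.IsNullOrEmpty(handler))
            return null;

        // Step 2: persistent handler -> IFilter CLSID
        string filterClsid = readDefault(@"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IFilterIid);
        if (string.IsNullOrEmpty(filterClsid))
            return null;

        // Step 3: IFilter CLSID -> DLL path
        return readDefault(@"CLSID\" + filterClsid + @"\InProcServer32");
    }

    public static void Main()
    {
        // In-memory stand-in for the registry; key names are case-insensitive.
        Dictionary<string, string> sample = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        sample[@".pdf\PersistentHandler"] = "{1AA9BF05-9A97-48c1-BA28-D9DCE795E93C}";
        sample[@"CLSID\{1AA9BF05-9A97-48c1-BA28-D9DCE795E93C}\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF}"] = "{6C337B26-3E38-4F98-813B-FBA18BAB64F5}";
        sample[@"CLSID\{6C337B26-3E38-4F98-813B-FBA18BAB64F5}\InProcServer32"] = @"%systemroot%\system32\glcndFilter.dll";

        Func<string, string> read = delegate(string key)
        {
            string value;
            return sample.TryGetValue(key, out value) ? value : null;
        };

        Console.WriteLine(ResolveFilterDll(".pdf", read));
    }
}
```

Passing the reader as a delegate keeps the chain logic testable without touching a real registry.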
OK, I removed the code that you wrote.
Now the result is:
- pdf:
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextReader\FilterReader.cs:line 138
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextViewer\MainForm.cs:line 112
There is no IFilter installed for the extension '.pdf'
- rtf: works correctly!
from ifiltertextreader.
Can you replace the IFilterReader constructors with this code? It will tell you whether it is looking for a 32-bit or a 64-bit IFilter:
#region Constructor and Destructor
/// <summary>
/// Creates a TextReader object for the given <paramref name="fileName"/>
/// </summary>
/// <param name="fileName">The file to read</param>
/// <param name="extension">Overrides the file extension of the <paramref name="fileName"/>,
/// the extension is used to determine the <see cref="NativeMethods.IFilter"/> that needs to
/// be used to read the <paramref name="fileName"/></param>
/// <param name="disableEmbeddedContent">When set to <c>true</c> the <see cref="NativeMethods.IFilter"/>
/// doesn't read embedded content, e.g. an attachment inside an E-mail msg file. This parameter defaults to <c>false</c></param>
/// <param name="includeProperties">When set to <c>true</c> the metadata properties of
/// a document are also returned, e.g. the summary properties of a Word document. This parameter
/// defaults to <c>false</c></param>
public FilterReader(string fileName,
                    string extension = "",
                    bool disableEmbeddedContent = false,
                    bool includeProperties = false)
{
    try
    {
        _fileName = fileName;
        _fileStream = File.OpenRead(fileName);
        if (string.IsNullOrWhiteSpace(extension))
            extension = Path.GetExtension(fileName);
        _filter = FilterLoader.LoadAndInitIFilter(_fileStream, extension, disableEmbeddedContent, fileName);
        if (_filter == null)
        {
            if (string.IsNullOrWhiteSpace(extension))
                throw new IFFilterNotFound("There is no " + (Environment.Is64BitProcess ? "64 bits" : "32 bits") +
                                           "IFilter installed for the file '" + Path.GetFileName(fileName) + "'");
            throw new IFFilterNotFound("There is no " + (Environment.Is64BitProcess ? "64 bits" : "32 bits") +
                                       "IFilter installed for the extension '" + extension + "'");
        }
        _includeProperties = includeProperties;
    }
    catch (Exception)
    {
        Dispose();
        throw;
    }
}
/// <summary>
/// Creates a TextReader object for the given <see cref="Stream"/>
/// </summary>
/// <param name="stream">The file stream to read</param>
/// <param name="extension">The extension for the <paramref name="stream"/></param>
/// <param name="disableEmbeddedContent">When set to <c>true</c> the <see cref="NativeMethods.IFilter"/>
/// doesn't read embedded content, e.g. an attachment inside an E-mail msg file. This parameter defaults to <c>false</c></param>
/// <param name="includeProperties">When set to <c>true</c> the metadata properties of
/// a document are also returned, e.g. the summary properties of a Word document. This parameter
/// defaults to <c>false</c></param>
public FilterReader(Stream stream,
                    string extension,
                    bool disableEmbeddedContent = false,
                    bool includeProperties = false)
{
    if (string.IsNullOrWhiteSpace(extension))
        throw new ArgumentException("The extension cannot be empty", "extension");
    _filter = FilterLoader.LoadAndInitIFilter(stream, extension, disableEmbeddedContent);
    if (_filter == null)
        throw new IFFilterNotFound("There is no " + (Environment.Is64BitProcess ? "64 bits" : "32 bits") +
                                   "IFilter installed for the stream with the extension '" + extension + "'");
    _includeProperties = includeProperties;
}
from ifiltertextreader.
Hmm... I build the library for .NET 3.5 because SQL Server requires this version.
.NET 3.5 does not contain Environment.Is64BitProcess,
so I need to replace it somehow.
from ifiltertextreader.
I used this solution to detect 64-bit:
[DllImport("kernel32.dll", SetLastError = true, CallingConvention = CallingConvention.Winapi)]
[return: MarshalAs(UnmanagedType.Bool)]
public static extern bool IsWow64Process([In] IntPtr hProcess, [Out] out bool lpSystemInfo);
private bool Is64Bit()
{
    if (IntPtr.Size == 8 || (IntPtr.Size == 4 && Is32BitProcessOn64BitProcessor()))
    {
        return true;
    }
    else
    {
        return false;
    }
}
private bool Is32BitProcessOn64BitProcessor()
{
    bool retVal;
    IsWow64Process(Process.GetCurrentProcess().Handle, out retVal);
    return retVal;
}
The result is:
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextReader\FilterReader.cs:line 141
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextViewer\MainForm.cs:line 112
There is no 64 bitsIFilter installed for the extension '.pdf'
from ifiltertextreader.
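On the .NET 3.5 replacement for Environment.Is64BitProcess: the Is64Bit helper above also returns true for a 32-bit process running under WOW64, so it reports the bitness of the operating system rather than that of the process. For IFilter loading it is the process that matters, since a 32-bit process needs 32-bit IFilters. A minimal sketch that relies only on IntPtr.Size follows; BitnessInfo and its members are hypothetical names, not part of IFilterTextReader.

```csharp
using System;

public static class BitnessInfo
{
    // IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit process,
    // including a 32-bit process running under WOW64 on 64-bit Windows.
    public static bool Is64BitProcess()
    {
        return IntPtr.Size == 8;
    }

    // Builds the label used in the "There is no ... IFilter installed" message.
    public static string Describe(bool is64BitProcess)
    {
        return is64BitProcess ? "64 bits" : "32 bits";
    }

    public static void Main()
    {
        Console.WriteLine("This process needs " + Describe(Is64BitProcess()) + " IFilters");
    }
}
```

This works on .NET 3.5 and needs no P/Invoke, which may also be convenient inside SQL CLR.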
Well, then you have no 64-bit IFilter... install the Adobe IFilter I mentioned a few posts back.
from ifiltertextreader.
OK, but how does SQL Server work with PDF then?
And what should I do about RTF? Use your latest changes?
from ifiltertextreader.
Yes, use the latest changes...
I have to go now... it is already 7 PM over here.
Bye
from ifiltertextreader.
thank you,
I will try to research.
from ifiltertextreader.
Hi, the research continues...
I downloaded the SearchFilterView utility from this site:
http://www.nirsoft.net
I ran it and it shows me all IFilters, including glcndFilter.dll.
from ifiltertextreader.
Another note:
when I build as 'Any CPU', the app finally gives:
There is no 32 bits IFilter installed for the extension '.pdf'
I have rebuilt it as x64 and receive:
There is no 64 bits IFilter installed for the extension '.pdf'
But I think that is not the cause; it is only the final message.
from ifiltertextreader.
Hi, it seems I can finally read PDF via the standard MS PDF filter, glcndFilter.dll...
from ifiltertextreader.
Could you analyze this and fix the code?
This works for glcndFilter.dll:
http://stackoverflow.com/questions/7313828/using-ifilter-in-c-sharp-and-retrieving-file-from-database-rather-than-file-syst
I also tested it with txt, ppt and docx; it works too.
from ifiltertextreader.
How are you trying to read from the database? Through a SqlDataReader?
from ifiltertextreader.
I am working with your demo app now, not with the database. I added the piece of code from the link above to your demo app, and calling the method from that link works (!) with glcndFilter.dll on W2012 and on W8. By the way, the same problem was reproduced on W8.
I want to get the application working and then go back to my SQLCLR experiments.
from ifiltertextreader.
Simply put, I made this test:
public static NativeMethods.IFilter LoadAndInitIFilter(Stream stream,
                                                       string extension,
                                                       bool disableEmbeddedContent,
                                                       string fileName = "")
{
    string dllName, filterPersistClass;
    FilterTester.ParseIFilter(extension, stream); // <------ that is their function
and received the text from the PDF file.
But naturally their code needs to be reviewed and somehow integrated into your classes, because the NativeMethods overlap, etc.
from ifiltertextreader.
Could you paste your source over here so that I can see what you did?
The only difference that I can see between how they do it on Stack Overflow and how I do it is that I wrap the .NET stream in an IStream wrapper, while on Stack Overflow they just copy everything to memory.
from ifiltertextreader.
I did something very simple:
I created a new class IFileTest and copy-pasted their classes into it.
I think it should probably become a third way in your IFilterLoader class, in addition to the 2 existing ways (IPersistStream, IPersistFile),
because their code does not work as is with other IFilters, only with glcndFilter.
from ifiltertextreader.
It seems this is the key part:
// Copy the content to global memory
byte[] buffer = new byte[s.Length];
s.Read(buffer, 0, buffer.Length);
IntPtr nativePtr = Marshal.AllocHGlobal(buffer.Length);
Marshal.Copy(buffer, 0, nativePtr, buffer.Length);

// Create a COM stream
System.Runtime.InteropServices.ComTypes.IStream comStream;
NativeMethods.CreateStreamOnHGlobal(nativePtr, true, out comStream);

// Load the contents to the iFilter using IPersistStream interface
var persistStream = (IPersistStream)filter;
persistStream.Load(comStream);
from ifiltertextreader.
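One detail in the snippet above: a single Stream.Read call is allowed to return fewer bytes than requested, so the buffer may end up only partly filled for some stream types. A minimal sketch of reading the whole stream before the bytes are handed to Marshal.Copy follows. It uses a loop because Stream.CopyTo is not available on .NET 3.5; StreamBuffering and ReadFully are hypothetical names, not part of IFilterTextReader or the Stack Overflow answer.

```csharp
using System;
using System.IO;

public static class StreamBuffering
{
    // Reads from the current position until Read returns 0 (end of stream)
    // and returns everything that was read as one array.
    public static byte[] ReadFully(Stream source)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            byte[] chunk = new byte[81920];
            int read;
            while ((read = source.Read(chunk, 0, chunk.Length)) > 0)
                buffer.Write(chunk, 0, read);
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        byte[] data = new byte[200000];
        new Random(1).NextBytes(data);
        byte[] copy = ReadFully(new MemoryStream(data));
        Console.WriteLine("Read " + copy.Length + " bytes");
    }
}
```

The returned array can then replace the buffer in the snippet above. It also works for streams that do not support Length.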
By the way, please check this article and messages below about
"Support for PDF file indexing"
http://www.codeproject.com/Articles/31944/Implementing-a-TextReader-to-extract-various-files
It is about Adobe but it looks like the same problem and approach.
And who knows what is inside of glcndFilter.dll ;-)
from ifiltertextreader.
I changed the constructor of the IFilterReader: I added an option so that you can choose to load everything into memory first before passing it to the IFilter. This way you can set it from your own code.
Just get the latest version from GitHub; it includes this change.
from ifiltertextreader.
Any luck?
from ifiltertextreader.
I have tested this version, yes, thank you.
- PDF filter: it works if I use the 'Read into Memory' flag. It seems that was the one right way for glcndFilter. Now I need to think about how to use it in different environments, for example when either the Adobe PDF filter or glcndFilter is installed. Maybe call it and, if there is an exception, repeat the call with the 'Read into Memory' flag.
- RTF still does not work in this version.
On another machine with W7, RTF does not work either; the filter is:
rtffilt.dll RTF Filter 2008.0.7600.16385 (win7_rtm.090713-1255) 2008.0.7600.16385 Microsoft Corporation C:\Windows\system32\rtffilt.dll {2e2294a9-50d7-4fe7-a09f-e6492e185884} {e2403e98-663b-4df6-b234-687789db8560} 7/14/2009 6:53:58 AM
from ifiltertextreader.
For the PDF, just read the registry and see what filter is installed and then use the correct flag (read file into memory).
I still find it strange that you have all these problems... because I can read every file for which I installed an IFilter... I wonder what is different on your server. That includes EML, RTF, DOC(X), XLS, etc...
from ifiltertextreader.
I have tested PDF (glcndFilter) on 2 Windows 8 and 2 Windows Server 2012 machines. The problem reproduced on all of them. Now it works in the new version.
Ah, OK: the RTF problem reproduces on W7 and W2012, BUT only with 3 files. I opened these files with Notepad and saw that they contain plain text, no RTF formatting. That is probably the reason.
I will test the latest version in my SQL CLR function again.
Thank you very much for your help. Your library is the best available today.
from ifiltertextreader.
You wrote: "For the PDF, just read the registry and see what filter is installed"
Maybe it makes sense to have a public function in FilterReader that returns the filename of the IFilter? It detects it internally in any case.
from ifiltertextreader.
I have a very nice class that can detect file types... it was for a project with this kind of issue: wrong extensions, no extensions, etc.
I created a Gist; you can find it over here --> https://gist.github.com/Sicos1977/d968f30e23171b76abaa
For the CheckCompoundFileStorage method to work you need to add this NuGet package --> https://www.nuget.org/packages/CompoundFileStorage/
from ifiltertextreader.
Returning the name of the IFilter gets you into a chicken-and-egg problem. You don't know which flag to set before you have the IFilter name, but you only get the IFilter name after the flag has been set. So it is probably better to leave this outside the IFilterTextReader class. Also, when you have an MSG file with attachments you will hit all kinds of IFilters.
from ifiltertextreader.
Where are you from, if I may ask? Europe?
from ifiltertextreader.
Yes, Germany
from ifiltertextreader.
Nice... the Netherlands over here... so we are almost neighbours :-)
from ifiltertextreader.
yes, right ;-)
I have visited Amsterdam a few times for the IBC exhibition.
from ifiltertextreader.
OK, thanks again.
Btw, I have sent a PDF file (5 MB) to your email. Maybe you can check whether performance can be improved.
On my server your demo app parses it in ~10 secs.
Naturally, that is via the f*g MS PDF filter :-)
from ifiltertextreader.
Sorry, but I can't speed anything up; I have already optimized the code to be as fast as possible. If you use the Adobe IFilters then you have to live with that speed. It's not the fastest IFilter there is. If you need a fast one and are willing to spend a few hundred euros, you can get PDFlib TET. It is very fast but costs a lot.
from ifiltertextreader.
Hi,
I have installed the Adobe PDF filter and tested it on the same PDF file that I sent.
It also takes ~9-10 secs, regardless of the 'Read into memory' flag.
That is OK.
from ifiltertextreader.