modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.

Home Page: https://github.com/modesty/pdf2json

License: Other

JavaScript 5.38% Java 94.53% Shell 0.09%

json pdf pdf-converter pdf-form pdf-text pdf2json pdf2text pdf2form

pdf2json's Issues

Error: stream must have data

Error: stream must have data at error (eval at <anonymous> (/Users/raineroviir/best-scraper/node_modules/pdf2json/lib/pdf.js:60:6), <anonymous>:193:7)

Not sure why this is happening. How do I resolve this? In my console.log output I can see the file loads successfully, but this error comes up right afterwards.

positions of fields

Hi,

Thanks for your library, helps out a lot with some pdf work I'm doing!

I am having a small problem with field positions, and this could be because I'm not familiar enough with the library, but wanted to bring it up just in case not. I've got an image of the pdf and I'm trying to place fields and markers on the pdf image where the fields are in the real pdf, but when I get the x and y positions, the fields seem to be just a little bit off. Here is the graphic:

date and contactName are slightly out of position: date is too far to the right, and contactName is too low. I'm using the following to get their coordinates:

cls.toPixelX(pdfField.x), cls.toPixelY(pdfField.y)

Do I need to convert the pixel values I get in some way? I'm a bit confused because some of the fields are right where I would have expected them, while others are offset in different directions...

Thank you for any insight/help!!

Mark

agenda throws error when working with pdf2json

After updating from 0.7.1 to 1.1.5, I get the following error when I use pdf2json:

TypeError: Cannot read property 'update' of undefined
    at unlockJobs (/home/jons/***/***/node_modules/sails-hook-jobs/node_modules/agenda/lib/agenda.js:319:11)
    at Agenda.stop (/home/jons/***/***/node_modules/sails-hook-jobs/node_modules/agenda/lib/agenda.js:247:14)
    at Sails.stopServer (/home/***/***/node_modules/sails-hook-jobs/index.js:14:12)
    at emitNone (events.js:72:20)
    at Sails.emit (events.js:166:7)
    at Sails.emitter.emit (/home/***/***/node_modules/sails/lib/app/private/after.js:50:11)

agenda is a dependency of sails-hook-jobs.
sails 0.12.3
node 4.4.7
ubuntu 14.04
agenda 0.6.28

Running on Electron "Uncaught Error: No PDFJS.workerSrc specified"

For some reason when I run pdf2json on my electron app I get "Uncaught Error: No PDFJS.workerSrc specified".

FYI: I tried setting workerSrc to pdf.worker.js, but that didn't solve it. It just brings up another error.

encoding issues

I have a PDF file like this:

***同津巴布韦总统穆加贝举行会谈_国内新闻_环球网.pdf

I can get the Chinese characters successfully, but for the text in the following PDF file
1.pdf

I only get

Is this caused by the PDF file encoding or something else?

Offering an ES5 Version

It'd be amazing for projects that aren't using node4 to be able to use this project without needing harmony bindings - would you be able to offer an es5 version of this package?

Pass in options that tell pdf2json what to output

Building off of the 'add PostScript coordinates' idea in #12, perhaps we should add support for an options object passed as a second parameter to loadPDF? This options object could include key-value pairs like coordinates: 'PostScript' or useDictionary: false or excludeTextsProperties: ['clr', 'oc', 'A', 'R.S'] etc. This could also be an easy way to implement #20.

I understand @modesty 's comment in closing #20 that an output format other than what he has produced is not what he had in mind, but surely we can make pdf2json more flexible to support more varied projects. I think it can produce the current output as well as others as different projects may desire. I'm happy to try to tackle this if others think it's a good idea, and I welcome ideas on the best implementation.
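
As a rough illustration of the excludeTextsProperties idea, a similar effect can be approximated today by post-processing the parsed output. The sketch below is not part of pdf2json: `stripTextProperties` and its dotted-path convention ('R.S' meaning property S on every run in R) are assumptions made for illustration only.

```javascript
// Hypothetical helper, not a pdf2json API: removes the listed properties
// from each entry of a page's Texts array. A dotted path such as 'R.S'
// is taken to mean "property S on every text run in R".
function stripTextProperties(texts, excluded) {
  return texts.map(function (text) {
    var copy = Object.assign({}, text);
    // Copy the runs as well, so the input objects are never mutated.
    if (Array.isArray(text.R)) {
      copy.R = text.R.map(function (run) { return Object.assign({}, run); });
    }
    excluded.forEach(function (path) {
      var parts = path.split('.');
      if (parts.length === 1) {
        delete copy[parts[0]];
      } else if (parts[0] === 'R' && Array.isArray(copy.R)) {
        copy.R.forEach(function (run) { delete run[parts[1]]; });
      }
    });
    return copy;
  });
}

var sample = [
  { x: 3.313, y: 6.681, clr: 0, A: 'left', R: [{ T: 'toto2', S: 4, TS: [0, 14, 0, 0] }] }
];
var slim = stripTextProperties(sample, ['clr', 'oc', 'A', 'R.S']);
console.log(JSON.stringify(slim));
```

Doing this outside the parser keeps the current output format intact while letting each project trim what it does not need.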

Cannot read property '0' of undefined when parsing pdf

PDF it fails on: http://www.novasoftware.se/ImgGen/schedulegenerator.aspx?format=pdf&schoolid=60410/nb-no&type=-1&id=2eda&period=&week=21&mode=0&printer=0&colors=32&head=0&clock=0&foot=0&day=0&width=1880&height=371&maxwidth=1880&maxheight=371

Stack trace:

(while reading XRef): TypeError: Cannot read property '0' of undefined
XRefParseException
    at XRefParseExceptionClosure (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:379:34)
    at eval (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:384:3)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:1)
    at Module._compile (module.js:413:34)
    at Object.Module._extensions..js (module.js:422:10)
    at Module.load (module.js:357:32)
    at Function.Module._load (module.js:314:12)
    at Module.require (module.js:367:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/pdfparser.js:8:10)
Error
    at InvalidPDFExceptionClosure (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:330:35)
    at eval (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:334:3)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:1)
    at Module._compile (module.js:413:34)
    at Object.Module._extensions..js (module.js:422:10)
    at Module.load (module.js:357:32)
    at Function.Module._load (module.js:314:12)
    at Module.require (module.js:367:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/pdfparser.js:8:10)

Code:

        var pdfParser = new PDFParser();
        console.log("Downloaded timeschedule.");
        pdfParser.on("pdfParser_dataReady", pdfData => {
            console.log("Got pdf data");
            console.log(pdfData);
        });
        pdfParser.loadPDF("temp.pdf");

Node -v:

v5.11.1

It might be a poorly generated PDF (apparently 2000s consultant work), but other readers support it fine.

"An error occurred while rendering the page" when page contains image.

I forked the repo in order to inspect the exact error:

+ nodeUtil._logN.call(self, 'Error: ' + require('util').inspect(error, null, null));

The problem is:
{
message: 'Image is not defined',
stack: 'ReferenceError: Image is not defined\n at loadJpegStream (eval at (/Users/Tim/EG Server/Source/Engine/eg-exam/node_modules/pdf2json/pdf.js:46:6))'
}

I'm looking into this issue and will open a pull request once I've fixed it. :)

StringifyStream is not defined

I used your stream example but am missing the StringifyStream:

request(pdfUrl).pipe(pdfParser).pipe(new StringifyStream())

how can I define / load it?
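
StringifyStream is not a Node.js built-in, so the caller has to define it. A minimal sketch, assuming the parser emits plain JavaScript objects on its readable side, can be built on Node's `stream.Transform`. The class below is an illustration and may differ from the one the project's own example uses.

```javascript
var Transform = require('stream').Transform;

// Accepts objects on the writable side and emits their JSON text
// on the readable side.
class StringifyStream extends Transform {
  constructor() {
    super({ writableObjectMode: true, readableObjectMode: false });
  }

  _transform(obj, encoding, callback) {
    this.push(JSON.stringify(obj));
    callback();
  }
}
```

With this defined, the line from the question, request(pdfUrl).pipe(pdfParser).pipe(new StringifyStream()), has something to construct; the result can then be piped on to a file or to process.stdout.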

code example

The code example in the README, PDFParser = require("./pdf2json/pdfparser");, no longer works.
It should be PDFParser = require("./pdf2json/PDFParser");

Text X positions Incorrect

The document I am working with is an 11.5 x 16 PDF document. The height I get back from pdf2json is 51.75. Examining the Texts' locations (x, y), and assuming that they are also expressed in page units (PU), the y seems to be correct. However, the x seems to be off for elements located on the right half of the document. For instance, I placed text ("BottomRight") in the bottom right and got back the following coordinates: { x: 193.45312500000003, y: 50.918749999999996 }. Seeing that the document is 11.5 x 16, and the height is 51.75 PU, this would technically make the width 74.25 PU. How is it possible that a text can have a position of 193.45..., with a max PU of 74.25?

```
define(function(require,exports,modules){

var fs        = require('fs'),
    _         = require('underscore-node'),
    PDFParser = require('pdf2json/pdfparser'),
    pdfParser = new PDFParser(),
    pdfutils = require('pdfutils').pdfutils;


var PDF = function(base,file){

    var pdf = this;

    var location = '/Users/dayne/sites/wl/client/products/';

    pdf.base = null;
    pdf.file = null;

    pdf.adors = [];
    pdf.pages = [];

    pdf.init = function(base,file){

        console.log('starting pdf parsing');

        // set base path + file name
        pdf.file = file;
        pdf.base = base;

        // set the bindings
        pdfParser.on("pdfParser_dataReady", _.bind(pdf.initParse, this));
        pdfParser.on("pdfParser_dataError", _.bind(pdf.parseDataError, this));

        // start parsing
        pdfParser.loadPDF(base + file);
    };

    pdf.initParse = function(data){

        // console.log('parsing pdf data');

        pdfutils(pdf.base + pdf.file, function(err,doc){

            // for(var i = 0; i < data.PDFJS.pages.length; i++)
            for(var i = 0; i < 1; i++)
                pdf.pages.push(pdf.parsePage(data.PDFJS.pages[i],doc[i]));

            // console.log(data.PDFJS.pages[0]);
        });
    };

    pdf.parsePage = function(page,doc){

        var parsedPage = {};

        parsedPage.adors  = [];

        parsedPage.ratio  = doc.height / page.Height;
        parsedPage.width  = doc.width;
        parsedPage.height = doc.height;

        for(var i = 0; i < page.Texts.length; i++)
            pdf.findCamelCase( page.Texts[i].R[0].T, page.Texts[i], page.Texts[i].R[0].TS, parsedPage, parsedPage.ratio);

        // TODO:: find solution for this xml parsing (grabbing pictures)...

        // console.log(parsedPage);
        // var meta   = doc.metadata.split('\n');
        // doc[0].asPNG({maxWidth: doc[0].width, maxHeight: doc[0].height }).toFile( pdf.base + 'test.png' )
        return parsedPage;
    };

    pdf.findCamelCase = function(text,textLocation,textData,parsedPage,ratio){
        // TODO :: fix regex to only accept camelcase without spacing...

        text.replace(/[A-Z]([A-Z0-9]*[a-z][a-z0-9]*[A-Z]|[a-z0-9]*[A-Z][A-Z0-9]*[a-z])[A-Za-z0-9]*/g, function(match){

            var t = {};

            // console.log(textLocation.x);
            // console.log(ratio);

            t.text    = text;
            t.size    = textData[1];
            t.bold    = textData[2] == 1;
            t.italics = textData[3] == 1;
            t.position = {
                x: textLocation.x,
                y: textLocation.y
            };

            // console.log(textLocation.x);
            console.log(t.text, t.position);

            parsedPage.adors.push(t);
        });
    };

    pdf.parseDataError = function(err){

        console.log('pdf parse error...',err);
    };

    pdf.init(base,file);
};

return new PDF('/Users/dayne/sites/wl/server/utils/','test.pdf');

});
```

Use with Other JS Servers?

@modesty Cool tool. Just doing a brief walkthrough I didn't really see much that screamed Node.js only.... Do you feel that a lot of this was done such that it would only work in node or do you see it working on other javascript engines such as Rhino with relative ease?

Just asking your opinion based on your in-depth knowledge.

Thanks for putting this out as open source.

Boxsets stays empty

Hi,

I tried to use pdf2json with three different pdfs containing links to other websites.

But when I try, Boxsets comes back empty.

This is my code :

var pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
  for (var i = 0; i < pdfData.formImage.Pages.length; i++) {
    console.log(pdfData.formImage.Pages[i].Boxsets); // why empty? Boxsets??
  }
});

pdfParser.loadPDF(pdf_path);

This is my test PDF: http://www74.zippyshare.com/d/MzUNluNF/7310663/test1.pdf

When I print pdfData.formImage.Pages[i].Boxsets, it is always empty.

This is what I get:

{"Height":52.618,"HLines":[{"x":3.543,"y":10.757,"w":0.814,"l":1.529}],"VLines":[],"Fills":[{"x":0,"y":0,"w":0,"h":0,"clr":1},{"x":0,"y":-0.056,"w":37.25,"h":52.687,"clr":1}],"Texts":[{"x":3.313,"y":6.681,"w":17.597,"sw":null,"clr":0,"A":"left","R":[{"T":"TOTOTOTOTOTOOTOTOTOTOTOTOT","S":4,"TS":[0,14,0,0]}]},{"x":3.313,"y":9.931,"w":2.223,"sw":null,"clr":0,"A":"left","R":[{"T":"toto2","S":4,"TS":[0,14,0,0]}]}],"Fields":[],"Boxsets":[]}
Any idea why?

Store pdf canvas in the output json file

How can I store the PDF canvas in the output JSON file?

thank you.

Texts array empty on OSX not empty on CentOS

Any idea why this might be the case?

pdf2json 0.7.1: parseBuffer() stopping execution instead of gracefully returning via pdfParser_dataError

When parsing certain PDF files that cause errors (perhaps due to ill-formatted content), pdf2json quits program execution rather than gracefully handling the error via pdfParser_dataError.

Unfortunately I can't currently find the PDF that caused this situation for me, but the line that ultimately "crashes" pdf2json is the following line inside display/canvas.js (see also below console error log):

fontObj.spaceWidth = (spaceId >= 0 && isArray(fontObj.widths)) ? fontObj.widths[spaceId] : 250;

Placing this inside a try / catch at least allows pdf2json to return "ok", instead of stopping the program flow entirely.

The error that occurred in my case was:

Error: Required "glyf" or "loca" tables are not found
    at error (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:193:7)
    at Object.Font_checkAndRepair [as checkAndRepair] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:12213:11)
    at Object.Font (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:10756:21)
    at Object.PartialEvaluator_translateFont [as translateFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:8161:14)
    at Object.PartialEvaluator_loadFont [as loadFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:7311:29)
    at Object.PartialEvaluator_handleSetFont [as handleSetFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:7154:23)
    at Object.PartialEvaluator_getOperatorList [as getOperatorList] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:7470:37)
    at Object.eval [as onResolve] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:4345:26)
    at Object.runHandlers (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:864:35)
undefined:40950
h = (spaceId >= 0 && isArray(fontObj.widths)) ? fontObj.widths[spaceId] : 250;
                                                                          ^
TypeError: Cannot assign to read only property 'spaceWidth' of Required "glyf" or "loca" tables are not found
    at Object.CanvasGraphics_setFont [as setFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:40950:104)
    at Object.CanvasGraphics_executeOperatorList [as executeOperatorList] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:40560:27)
    at Object.InternalRenderTask__next [as _next] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:43553:39)
    at Object.InternalRenderTask__continue [as _continue] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:43545:14)
    at Timer.listOnTimeout (timers.js:110:15)

The first "Error" causes some delay, the second "TypeError" then quits the program flow.

Is there an update for pdf2json that can fix this? (If the PDF is ill-formatted and there is no way for pdf2json to parse the file, it should then just exit gracefully.)
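
The second trace suggests that `fontObj` held the error message string rather than a font object at that point, so the property assignment itself threw. As a sketch of the kind of guard described above (an illustration only, not the library's actual fix): `assignSpaceWidth` is a hypothetical name, and `Array.isArray` stands in for pdf.js's own `isArray` helper.

```javascript
// Hypothetical guard around the failing assignment. Returns true when the
// width was set, false when fontObj is not a usable object.
function assignSpaceWidth(fontObj, spaceId) {
  if (fontObj === null || typeof fontObj !== 'object') {
    // For example, an error string left behind by a failed font load.
    return false;
  }
  fontObj.spaceWidth = (spaceId >= 0 && Array.isArray(fontObj.widths))
    ? fontObj.widths[spaceId]
    : 250;
  return true;
}
```

A caller could then skip the affected text or report the problem through pdfParser_dataError instead of letting the TypeError end the process.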

Expose pdf.js getTextContent method for a pdf page

Would it be possible for you to expose the getTextContent method, for example via a Content property, to make it easy to get a page's raw text?

Use Case

The developer needs to generate a PDF via let's say PhantomJS for example.
Inside the PDF file, specific text content needs to be extracted.
When accessing data.Pages via the pdfParser_dataReady callback, the developer could grab a page text Content promise for further processing, instead of dealing with text.R[0].T manipulations(loops, encoding, etc.). pdf2json is invoked from phantomJS via a node.js sub-process.

Proposed Implementation
Add a Content property in pdf.js.

var page = {
    Height: pageParser.height,
    HLines: pageParser.HLines,
    VLines: pageParser.VLines,
    Fills: pageParser.Fills,
    Content: pdfPage.getTextContent(),
    Texts: pageParser.Texts,
    Fields: pageParser.Fields,
    Boxsets: pageParser.Boxsets
};

If there's another approach that deals with funky characters easily without introducing an API add-on, I'd be glad to hear about it.
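
Until something like a Content property exists, the text of each page can be assembled from the existing output. The sketch below assumes the page shape shown in these issues (a Texts array whose entries carry runs in R, each with a URI-encoded T). `extractPageTexts` is a hypothetical helper, the sample data is made up, and no attempt is made to reconstruct lines or reading order.

```javascript
// Returns, for each page, the list of decoded text strings in output order.
function extractPageTexts(pages) {
  return pages.map(function (page) {
    return page.Texts.map(function (text) {
      return text.R.map(function (run) {
        return decodeURIComponent(run.T);
      }).join('');
    });
  });
}

// Made-up sample in the documented shape.
var samplePages = [
  { Texts: [{ R: [{ T: 'Total%3A%20%E2%82%AC5' }] }, { R: [{ T: 'toto2' }] }] }
];
console.log(extractPageTexts(samplePages));
```

Because decodeURIComponent handles the UTF-8 percent-escapes, accented and other non-ASCII characters come back as ordinary strings without manual loops over byte values.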

Problem with CMYK colors

I am having some trouble with CMYK colors. In my example PDF the color becomes #00FF00, but it should be full magenta, so a purple-ish color. I don't think it has to do with the dictionary translating this, but somewhere it's not recognizing my color as CMYK?

I'm not sure. Here is the file anyway:
https://docs.google.com/file/d/0B6YLTkp6bMZPbGlJNE1vV3NpOWc/edit

Also, the latest pdf.js has better support for translating CMYK to RGB, using a LUT for comparison. Maybe this could be implemented.
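
For reference, the simplest device CMYK to RGB mapping computes each RGB channel as 255 × (1 − ink) × (1 − K). This naive formula is not the conversion pdf.js or pdf2json actually performs, nor the lookup-table approach mentioned above, but it shows what full magenta should roughly look like: #FF00FF, which is the exact complement of the reported #00FF00.

```javascript
// Naive CMYK -> RGB conversion; inputs are fractions in the range [0, 1].
function cmykToHex(c, m, y, k) {
  function channel(ink) {
    var value = Math.round(255 * (1 - ink) * (1 - k));
    var hex = value.toString(16).toUpperCase();
    return hex.length === 1 ? '0' + hex : hex;
  }
  return '#' + channel(c) + channel(m) + channel(y);
}

console.log(cmykToHex(0, 1, 0, 0)); // #FF00FF
```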

An error occurred while rendering the page

Hi,

I was looking at this project, tried to render a pdf which renders fine with pdfjs browser version. It gave me error "An error occurred while rendering the page" for most of the pages.

I installed it via npm but had to move pdf2json out of node_modules folder.

Is this a known issue, or am I doing something wrong?

I also have a question: can I cache the canvas generation code with this, so that most of my canvas work happens here and only the rendering happens on the client side? (I don't know whether I should post this here; this is my first issue on GitHub.)

Page unit conversion to PDF points

@RichardLitt and I are also having a little trouble understanding the 'page unit', the coordinate convention, and how that relates to PDF points (8.5" x 11" = 612 x 792 points). Can you provide a little clarification?
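
The sample outputs quoted in these issues hint at a ratio: a Height of 49.5 for what looks like an 11 in page, and 51.75 for a document with an 11.5 in side, both work out to 4.5 page units per inch. Assuming that ratio holds (it is inferred from those samples, not taken from the pdf2json documentation), and given 72 PDF points per inch, a conversion could be sketched as:

```javascript
var PAGE_UNITS_PER_INCH = 4.5; // assumption inferred from sample output
var POINTS_PER_INCH = 72;      // standard size of a PDF point

function pageUnitsToPoints(pu) {
  return pu / PAGE_UNITS_PER_INCH * POINTS_PER_INCH;
}

function pointsToPageUnits(pt) {
  return pt / POINTS_PER_INCH * PAGE_UNITS_PER_INCH;
}

console.log(pageUnitsToPoints(49.5)); // 792
console.log(pointsToPageUnits(612));  // 38.25
```

Under this assumption one page unit equals 16 points. The sketch does not address the coordinate origin or axis direction, which would need confirming separately.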

Missing ) after argument error

Hi guys ,

/usr/local/lib/node_modules/pdf2json/lib/p2jcmd.js:49
fs.writeFile(fieldsTypesPath, JSON.stringify(pJSON), err => {
^^^

SyntaxError: missing ) after argument list

I'm new to Node and JavaScript, and I ran into this problem when I tried to execute pdf2json directly from the shell. What is the "=>" function anyway? All the code that uses "=>" gives me this error, including the examples.
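
For context, `=>` is ES2015 arrow function syntax, which Node.js supports from version 4 onward; an older runtime will typically reject it with a syntax error like the one above. An arrow function used as a callback is broadly equivalent to a regular function expression (the main difference being how `this` is bound), as this small illustration shows:

```javascript
var numbers = [1, 2, 3];

// ES2015 arrow function
var doubledArrow = numbers.map(n => n * 2);

// Equivalent ES5 function expression
var doubledEs5 = numbers.map(function (n) { return n * 2; });

console.log(doubledArrow, doubledEs5);
```

Running `node -v` shows which version is installed.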

Thanks

Crash in 0.4.5 - xmldom tagName error

In the current version, the xmldom module throws a fatal error when parsing a PDF (grayscale table with text only). Previous versions didn't have this issue.

7 Sep 20:50:25 - PDFParser1 -  is about to load PDF file uploads\035361c35f424d6885574aae35eae88b
7 Sep 20:50:25 - PDFJSClass1 - About to load fieldInfo XML : uploads\035361c35f424d6885574aae35eae88b
element parse error: Error: invalid tagName:<
@#[line:4,col:1]
element parse error: Error: invalid tagName:
@#[line:4,col:2]
element parse error: Error: invalid tagName:<
@#[line:4,col:376]
element parse error: Error: invalid tagName:
@#[line:4,col:377]
element parse error: Error: invalid tagName:<
@#[line:7,col:1]
end tag name: Filter /FlateDecode /Length 1613 is not match the current start tagName:undefined
@#[line:7,col:1]

C:\app\node_modules\pdf2json\node_modules\xmldom\dom-parser.js:185
            throw error;
                  ^
end tag name: Filter /FlateDecode /Length 1613 is not match the current start tagName:undefined

Does someone know why R is an array

The README says: 'R': an array of text run, each text run object has two main fields...
But in all my PDFs, every R has a length of at most 1. So what is a text run?

Cannot Read Property Num

'Warning: Unhandled rejection: TypeError: Cannot read property 'num' of undefined' at Obj.RefSetCache_has [as has] .....

The error occurs in base/core/objs.js, I replaced the line with a hack fix for the time being:

has: function RefSetCache_has(ref) { return ('R' + ref.num + '.' + ref.gen) in this.dict; }

after:

has: function RefSetCache_has(ref) { if(ref !== undefined) return ('R' + ref.num + '.' + ref.gen) in this.dict; else return null; }

Does not extract Hyperlink on text

It would be great if a text hyperlink could be exported to JSON. Currently only the text value is exported.

Parsing USPTO forms?

The USPTO uses some kind of form, created by Adobe LiveCycle Designer, that can't be read in any PDF viewer except for Acrobat Reader, Acrobat Professional, and maybe other Adobe products. For example, see the ADS form.

I'm not even sure what format those forms are in, but pdf2json (like all other non-Adobe PDF viewers) doesn't see any data except for the standard message, "If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document."

Is there any chance that pdf2json might be able to parse form data from such forms at some point in future?

Thanks for any input you may have. And thanks for a very useful utility!

Segfaults

I get seg faults with many (the majority) pdfs I tested, eg:

http://www.schroders.com/staticfiles/Schroders/Sites/global/IRpdf/Annual_Report_2007.pdf
http://www.northnorfolk.org/files/Sports_Assoc_001.pdf
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

Running as root does not produce seg fault, but produces no output.

I tried stepping through the execution with node.js debugger, in which case it ran OK and produced output.

pdf2json is referring to xmldom by './../node_modules/xmldom'

You should never refer to a module using a path:

On line 6 of pdf.js:
DOMParser = require('./../node_modules/xmldom').DOMParser

Should be
DOMParser = require('xmldom').DOMParser

'Cause now I'm getting the following error when using pdf2json:

Error: Cannot find module './../node_modules/xmldom'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:362:17)
    at require (module.js:378:17)
    at Object.<anonymous> (/Some/great/path/node_modules/pdf2json/pdf.js:6:17)
    at Module._compile (module.js:449:26)
    at Object.Module._extensions..js (module.js:467:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:362:17)

new version

Can you publish a new version to npm? I'm depending on the ability to parse directly from a buffer and installing from github is a real pain. 🙏

Word wrapping

{
"x": 7.9914062500000025,
"y": 3.984375000000001,
"w": 1292.917,
"clr": 0,
"A": "left",
"R": [
{
"T": "SECTION%3A%20CON",
"S": -1,
"TS": [3, 182, 0, 0]
}
]
},
{
"x": 19.10241171875,
"y": 3.984375000000001,
"w": 984.6790000000001,
"clr": 0,
"A": "left",
"R": [
{
"T": "TACT%20LENS",
"S": -1,
"TS": [3, 182, 0, 0]
}
]
},

This was the "SECTION: CONTACT LENS" string in reference PDF.
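
One way to work around split strings like this in client code is to merge text entries that share a baseline. The sketch below is a heuristic written for this example, not pdf2json behavior: `mergeLines` is a hypothetical helper that groups entries with an identical y, orders them by x, and concatenates their decoded runs without inferring any spacing from the gap between them.

```javascript
// Groups Texts entries by identical y, sorts each group by x, and joins
// the decoded run text. Returns one string per distinct y, top to bottom.
function mergeLines(texts) {
  var byY = {};
  texts.forEach(function (t) {
    (byY[t.y] = byY[t.y] || []).push(t);
  });
  return Object.keys(byY)
    .sort(function (a, b) { return Number(a) - Number(b); })
    .map(function (y) {
      return byY[y]
        .sort(function (a, b) { return a.x - b.x; })
        .map(function (t) {
          return t.R.map(function (run) {
            return decodeURIComponent(run.T);
          }).join('');
        })
        .join('');
    });
}

var fragments = [
  { x: 7.9914062500000025, y: 3.984375000000001, R: [{ T: 'SECTION%3A%20CON' }] },
  { x: 19.10241171875, y: 3.984375000000001, R: [{ T: 'TACT%20LENS' }] }
];
console.log(mergeLines(fragments)); // [ 'SECTION: CONTACT LENS' ]
```

Real documents may need a tolerance on y rather than exact equality, and a rule for when a gap between fragments should become a space.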

Object has no method 'loadPdf'

var PdfParser = require('./pdf2json');
var parser = new PdfParser();
parser.loadPdf('./sample.pdf');

For this simple portion of code I'm getting the mentioned error.

Here is what I get when calling:

console.log(Object.keys(parser));

[ 'domain',
  '_events',
  '_maxListeners',
  'get_id',
  'get_name',
  'context',
  'pdfFilePath',
  'data',
  'PDFJS',
  'parsePropCount',
  'processFieldInfoXML' ]

Lots of logging messages

I'm seeing lots of logging messages like these:

6 Mar 11:18:54 - PDFFont2235 - Default - SymbolicFont - (NJTZCI+Constantia-Bold) : 50::NaN => 2 length = 1
6 Mar 11:18:54 - PDFFont2311 - Default - SymbolicFont - (NJTZCI+Constantia-Bold) : 66::NaN => B length = 1
6 Mar 11:18:54 - PDFPageParser19 - page 19 is rendered successfully.
6 Mar 11:18:54 - PDFJSClass1 - start to parse page:20
6 Mar 11:18:54 - PDFPageParser20 - page 20 is rendered successfully.
6 Mar 11:18:54 - PDFJSClass1 - complete parsing page:20

How can I turn them off?

Thanks

Is there a future path for pdf2json?

First of all this is a really great project, and there is none like it.

But I can't help but notice that the files copied from PDF.js are 2 years old and aging.

In last two years a bunch of work has been done @ PDF.js: https://github.com/mozilla/pdf.js/commits/master/src/core

If this project has to keep up, survive and flourish, there has to be a strategy to keep up to date.

I tried to do this myself, but failed horribly.

Is it possible to make use of the deliverable (combined file) in the pdfjs-dist project: https://github.com/mozilla/pdfjs-dist/tree/master/build ?

Let's discuss ideas around this, even if we don't have sure-fire solutions.

No way for consumer to handle crypto related errors when calling loadPdf()

Currently base/core/crypto.js calls error() on errors. There is no callback for catching these errors, and the errors don't emit to pdfjs_parseDataError.

Here is the PDF causing the error: https://drive.google.com/file/d/0B3yADm5p-GRCRExYVXFjTnhvQ2c/view?usp=sharing

pdf2json Performance over large PDF

Hi All,

I have a PDF file that contains about 500 pages (3.6mb) - I can't post because it contains sensitive data. When I load it up through pdf2json, it takes about 10 minutes to fire the dataReady callback... is this expected?

I am running the Node application on a MacBook Pro (i7, 16 GB) and seriously expected it to be faster.

The PDF contents are of a timetable nature, and all I want to extract are the text strings and their x/y locations, grouped by page.

Does anyone else have performance issues with pdf2json, or does anyone have suggestions for other Node modules to use for this purpose?

Looking forward to some help. I'm happy to answer any questions.

Ta.

PDF files on the web

Hi,

I'm attempting to use the pdf2json utility and got this error:
{ [Error: ENOENT: no such file or directory, open 'http://www.patrick.af.mil/shared/media/document/AFD-070716-028.pdf' ] }
Am I doing something wrong here? Does the file need to be local?
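
The ENOENT error indicates the URL was handed to the file system as if it were a local path. One possible approach, assuming the installed version provides the parseBuffer() method mentioned elsewhere in these issues, is to download the bytes first and pass them to the parser as a Buffer. In this sketch `parseChunks` and `loadRemotePdf` are hypothetical helpers, and `parser` is whatever PDFParser instance the caller has created and attached listeners to.

```javascript
var http = require('http');

// Joins downloaded chunks into one Buffer and hands it to the parser.
// Returns the total number of bytes passed on.
function parseChunks(parser, chunks) {
  var buffer = Buffer.concat(chunks);
  parser.parseBuffer(buffer);
  return buffer.length;
}

// Reads an http:// URL into memory, then parses it. Redirects and HTTP
// status codes are not handled in this sketch.
function loadRemotePdf(parser, url, onError) {
  http.get(url, function (res) {
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () { parseChunks(parser, chunks); });
    res.on('error', onError);
  }).on('error', onError);
}
```

Alternatively, the file can be saved to disk first and loaded by path as usual.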

Errors after parsing are getting eaten by library

I have run into a case where PDF parsing completes successfully, but an error thrown later in my app gets caught within the parser library. The problem is that the library ends up catching the error on line 87 and swallowing it. I am given no indication of what happened and no way to handle it properly in my app.

Here's a script that will demonstrate parsing a pdf, and then intentionally throwing an error.
https://gist.github.com/IanShoe/e92ee20f4862b187f9ae

Content output gives line break on dash -

this is referring to -c, --content option - it's experimental but still needs bug reports too

Everywhere a dash character - appears in the document, it is replaced by a line break before and after itself.

To recreate, I used http://static.e-publishing.af.mil/production/1/af_sg/publication/afi41-210/afi41-210.pdf and command node pdf2json.js -f /home/user/afi...pdf -o /home/user -c on Debian.

Example:

ORIGINAL

If the data is stored on a facility-shared computer drive, the drive or data folder must be locked so unauthorized users are prevented from gaining access to the information.

OUTPUT

If the data is stored on a facility
-
shared computer drive, the drive or ...

Didn't see the issue already listed but if I'm duplicating someone or just using it incorrectly, please feel free to close.

PS - thank you so very much for this code - it's exactly what I've been looking for.
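
Until the content output is fixed, the stray breaks can be stripped in a post-processing step. This is a workaround sketch on the generated text, not a change to pdf2json. `rejoinDashes` is a hypothetical helper, and it would also join a dash that legitimately sits on a line of its own.

```javascript
// Removes the line break inserted before and after a lone dash.
function rejoinDashes(text) {
  return text.replace(/\r?\n-\r?\n/g, '-');
}

var output = 'If the data is stored on a facility\n-\nshared computer drive';
console.log(rejoinDashes(output));
// If the data is stored on a facility-shared computer drive
```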

accept input stream

I'm downloading a PDF from a third-party site and instead of storing, I would like to pipe the stream into pdf2json to retrieve the text. Is this possible yet?
This use case can be easily wrapped around file-centric approach, see http://stackoverflow.com/a/18658613/353337 for a simple example for the nodejs hash function.

Text escaping and granularity

@rkpatel33 and I are trying to use pdf2json to get a better understanding of each word in a .pdf file. We need this to enable word-level highlighting and to be able to run NLP libraries on the text. Currently, running a .pdf file through the command line outputs this:

{
  "formImage": {
    "Transcoder": "[email protected]",
    "Agency": "Microsoft Word - test.docx",
    "Id": {
      "AgencyId": "",
      "Name": "",
      "MC": false,
      "Max": 1,
      "Parent": ""
    },
    "Pages": [
      {
        "Height": 49.5,
        "HLines": [],
        "VLines": [],
        "Fills": [
          {
            "x": 0,
            "y": 0,
            "w": 0,
            "h": 0,
            "clr": 1
          },
          {
            "x": 29.083,
            "y": 7.936,
            "w": 6.105,
            "h": 0.015,
            "clr": 0
          }
        ],
        "Texts": [
          {
            "x": 15.262,
            "y": 4.471,
            "w": 6.573400000000001,
            "clr": 0,
            "A": "left",
            "R": [
              {
                "T": "This%09%0D%20%C2%A0is%09%0D%20%C2%A0a%09%0D%20%C2%A0test%09%0D%20%C2%A0of%09%0D%20%C2%A0",
                "S": -1,
                "TS": [
                  2,
                  53,
                  0,
                  0
                ]
              }
            ]
          },
          {
            "x": 28.827,
            "y": 4.471,
            "w": 2.6036,
            "clr": 0,
            "A": "left",
            "R": [
              {
                "T": "BOLD",
                "S": -1,
                "TS": [
                  2,
                  54,
                  1,
                  0
                ]
              }
            ]
          },
          {
            "x": 34.648,
            "y": 4.471,
            "w": 1.9335000000000002,
            "clr": 0,
            "A": "left",
            "R": [
              {
                "T": "font.",
                "S": -1,
                "TS": [
                  2,
                  53,
                  0,
                  0
                ]
              }
            ]
          }
        ],
        "Fields": [],
        "Boxsets": []
      }
    ],
    "Width": 105.188
  }
}

The text seems to be all lumped together and escaped: This%09%0D%20%C2%A0is%09%0D%20%C2%A0a%09%0D%20%C2%A0test%09%0D%20%C2%A0of%09%0D%20%C2%A0. Is there any way to get each word in its own object with positional properties? The content file also doesn't show the text in its original layout: in the PDF it was all on one line, but now I get:

This    
  is   
  a    
  test 
  of   
  
BOLD

  
font.

  

  

  

  
----------------Page (0) Break----------------

I'm not quite sure if I've misunderstood something. I was hoping for "This is a test of BOLD font."

Any ideas?
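One possible workaround, sketched under the assumption that the "T" values are URI-encoded (which the %09, %0D, %20 and %C2%A0 sequences suggest): decode each run, collapse the mixed whitespace, and join the runs in order. The function names decodeRun and pageText are made up for this example.

```javascript
// Decode one text run: undo the URI escaping, then collapse tabs, carriage
// returns, spaces and non-breaking spaces into single spaces.
function decodeRun(run) {
  return decodeURIComponent(run.T).replace(/\s+/g, " ").trim();
}

// Join all runs of a page in document order, separated by single spaces.
function pageText(page) {
  return page.Texts
    .flatMap((text) => text.R.map(decodeRun))
    .filter((s) => s.length > 0)
    .join(" ");
}

// Trimmed-down copy of the "Texts" array from the output above.
const page = {
  Texts: [
    { x: 15.262, y: 4.471, R: [{ T: "This%09%0D%20%C2%A0is%09%0D%20%C2%A0a%09%0D%20%C2%A0test%09%0D%20%C2%A0of%09%0D%20%C2%A0" }] },
    { x: 28.827, y: 4.471, R: [{ T: "BOLD" }] },
    { x: 34.648, y: 4.471, R: [{ T: "font." }] },
  ],
};

console.log(pageText(page)); // This is a test of BOLD font.
```

This only recovers the sentence. The x and y values stay per text block, so word-level positions would still have to be estimated from them.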

It fails to parse PDF on Windows Server

Hi,

I am running a Node.js application on Windows Server, and it could not parse a PDF file there: no data was returned. The parsing works well when I run the same code on a Windows PC. Is there any reason that pdf2json can't work on Windows Server? I executed the code from the command prompt.

Checkbox status is always false

I tried extracting the fields of a PDF that is already filled out with data. After extracting, I can get all the text fields along with the data saved in each one. But for checkboxes and radio buttons, the checked status is always false. Maybe I missed something?

bounding boxes

Could the documentation explain how to calculate bounding boxes for text items?

Text has x, y and w but no h. I presume that the font size could give you h, but they seem to be in other units. How should I convert?

BTW, what is the "TS" element? Can this help me?
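Going only by what appears in these issues, "TS" looks like a style array: the "BOLD" run in the sample output has TS [2, 54, 1, 0], and the italics report refers to TS[3]. A small sketch that names the entries, assuming the first two are a font face id and a font size (their units are not confirmed here, so this does not settle the height question):

```javascript
// Assumed layout of TS: [fontFaceId, fontSize, bold (1/0), italic (1/0)].
// Only the bold and italic positions are backed by the reports in this list.
function describeStyle(ts) {
  const [fontFaceId, fontSize, bold, italic] = ts;
  return { fontFaceId, fontSize, bold: bold === 1, italic: italic === 1 };
}

console.log(describeStyle([2, 54, 1, 0]));
```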

pdf2json interprets one word as 2 JSON objects

Hi,

I use pdf2json to parse some pdfs, which contain week-tables. However, after I parse the files, there’s a strange behavior - some of the words are separated in the JSON-file as different objects, while they’re actually one word inside of the pdf.

Example:
Files:
plan.pdf
plan.txt (sorry, can't upload JSON-Files)

The word:
„Champignons“ (column: „Mittwoch“, row: 2, line: 3) is interpreted as "Champig" and "nons" (2 JSON Objects)

JSON:
{"x":26.51,"y":16.388,"w":3.932,"sw":0.35678125,"clr":0,"A":"left","R":[{"T":"Champig","S":3,"TS":[0,12,0,0]}]},{"x":28.745,"y":16.388,"w":7.169,"sw":0.35678125,"clr":0,"A":"left","R":[{"T":"nons%20und%20Lauch%20","S":3,"TS":[0,12,0,0]}]}

This issue also occurs in other rows. I suspect that it's caused by the specific PDF structure.

Any ideas how I can fix that?

Thanks for your support!
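A possible post-processing workaround, sketched as a heuristic rather than a fix: glue a text block onto the previous one when both share the same y, start close together, and have no whitespace at the boundary. The function name mergeSplitWords and the maxDx threshold of 3 page units are assumptions chosen for this example, and the rule could wrongly join words that really are separate.

```javascript
// Heuristic merge of text blocks that look like fragments of one word.
// maxDx is an arbitrary guess at how far apart two fragments may start.
function mergeSplitWords(texts, maxDx = 3) {
  const merged = [];
  for (const item of texts) {
    const text = decodeURIComponent(item.R.map((r) => r.T).join(""));
    const prev = merged[merged.length - 1];
    const sameLine = prev !== undefined && prev.y === item.y;
    const closeBy = prev !== undefined && item.x - prev.lastX <= maxDx;
    const noGap = prev !== undefined && !/\s$/.test(prev.text) && !/^\s/.test(text);
    if (sameLine && closeBy && noGap) {
      prev.text += text;   // continue the previous word
      prev.lastX = item.x; // remember where the latest fragment started
    } else {
      merged.push({ x: item.x, y: item.y, lastX: item.x, text });
    }
  }
  return merged;
}

// The two blocks from the JSON above.
const blocks = [
  { x: 26.51, y: 16.388, R: [{ T: "Champig" }] },
  { x: 28.745, y: 16.388, R: [{ T: "nons%20und%20Lauch%20" }] },
];

console.log(mergeSplitWords(blocks)[0].text);
```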

_onPFBdataReady not defined error

var nodeUtil = require("util"),
    fs = require('fs'),
    _ = require('underscore'),
    PDFParser = require("./pdfparser");

var pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataReady", _.bind(_onPFBinDataReady, self));

pdfParser.on("pdfParser_dataError", _.bind(_onPFBinDataError, self));

// var pdfFilePath = _pdfPathBase + folderName + "/" + pdfId + ".pdf";

var pdfFilePath = 'ibpsrrb2012.pdf';
pdfParser.loadPDF(pdfFilePath);

// or call directly with buffer
fs.readFile(pdfFilePath, function (err, pdfBuffer) {
  if (!err) {
    pdfParser.parseBuffer(pdfBuffer);
  }
});

When I run the command npm mypdf.js, I get the following error:
pdfParser.on("pdfParser_dataReady",_.bind<_onPFBinDataReady,self>)

ReferenceError: _onPFBinDataReady is not defined
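The snippet above registers _onPFBinDataReady and _onPFBinDataError (and binds them to self) without defining any of them, which would explain the error. A minimal sketch of the same wiring with the handlers defined; a plain EventEmitter stands in for the parser so the example runs on its own, and the handler names are placeholders.

```javascript
const { EventEmitter } = require("events");

// Stand-in for `new PDFParser()`; only the event names are taken from the
// snippet above.
const pdfParser = new EventEmitter();
const results = [];

// The handlers have to be defined somewhere in scope before the script runs
// the .on() calls; the original snippet never defines them.
function onDataReady(data) {
  results.push({ ok: true, data });
}

function onDataError(err) {
  results.push({ ok: false, err });
}

pdfParser.on("pdfParser_dataReady", onDataReady);
pdfParser.on("pdfParser_dataError", onDataError);

// Simulate the parser finishing.
pdfParser.emit("pdfParser_dataReady", { Pages: [] });
console.log(results.length); // 1
```

Binding with _.bind is only needed if the handlers rely on a particular `this`; plain functions like these can be passed directly.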

Italics not working

Dear Mr. Zhang,

The Italic field (TS[3]) is always zero regardless of whether the text field is Italic or not. After digging in pdffont.js for a bit, I figured out that it's because the value is always the initial value (false) set in the constructor and it is never set anywhere else.

In my case, I corrected the issue by making this very simple change to pdffont.js:

    var _setFaceIndex = function() {
        var fontObj = this.fontObj;

        this.bold = fontObj.bold;
        if (!this.bold) {
            this.bold = this.typeName.indexOf("bold") >= 0 || this.typeName.indexOf("black") >= 0;
        }

        this.italic = fontObj.italic;  // <---- Added this line only

Please note that Bold works as advertised. I notice that you also analyze the typeface name to distinguish between bold and normal text for "pseudobold" fonts. I have not done anything like that for italics, so my change probably won't work for typefaces that are oblique by design rather than by formatting.

I have not forked the project so please accept this issue and code snippet in lieu of a pull request. :)

Yours faithfully,
Riaan

PS. Thanks for the package, it's much appreciated!
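For typefaces that are oblique by design, a name-based fallback modelled on the bold check quoted above might look like the sketch below. The helper looksItalic and the keywords "italic" and "oblique" are assumptions for illustration, not code from pdffont.js.

```javascript
// Hypothetical helper: treat a font as italic if its flag is set or if its
// type name contains a typical italic keyword.
function looksItalic(fontObj, typeName) {
  if (fontObj && fontObj.italic) return true;
  const name = (typeName || "").toLowerCase();
  return name.indexOf("italic") >= 0 || name.indexOf("oblique") >= 0;
}

console.log(looksItalic({ italic: false }, "Helvetica-Oblique")); // true
```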

Color dictionary confusion

Is it possible to implement a feature that returns the actual color instead of an entry from the "dictionary"? It would also be awesome to have access to CMYK colors. Perhaps the dictionaries could be generated dynamically?

Great work in porting pdf.js over to node!

Height and Width confusion

Hello modesty,

I am a bit confused about how the page height and width sizes work. I am using the 1040 test form that comes with the library and am getting a larger Width than Height even though the pages are all portrait. It also seems that the hlines and vlines abide by the page dimensions so I think this is something I am doing wrong.

Any thoughts?

Loading pdfs from a remote server via a stream

I'm trying to load a single PDF from a remote server. Here is my approach:
(I can confirm that if I just pipe the request into a write stream it saves the PDF fine)

var request = require('request');
var pdfParser = require('pdf2json');
var pdfUrl = 'somepdf.pdf'

var pdfPipe = request({url: pdfUrl, encoding:null}).pipe(pdfParser);

pdfPipe.on("pdfParser_dataError", err => console.error(err) );
pdfPipe.on("pdfParser_dataReady", pdf => {
    //let pdf = pdfParser.getMergedTextBlocksIfNeeded();
    console.log(pdfParser.getAllFieldsTypes());
});

However, I'm getting an error:

stream.js:45
  dest.on('drain', ondrain);
       ^

TypeError: dest.on is not a function
    at Request.Stream.pipe (stream.js:45:8)
    at Request.pipe (/Users/zaf/development/minerva-bot/node_modules/request/request.js:1395:34)
    at Object.<anonymous> (/Users/zaf/development/minerva-bot/plugins/exam_module/index.js:9:53)
    at Module._compile (module.js:434:26)
    at Object.Module._extensions..js (module.js:452:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:475:10)
    at startup (node.js:117:18)
    at node.js:951:3

Code constructed from here: http://stackoverflow.com/a/36882510/3779915
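The TypeError is most likely raised because the value returned by require('pdf2json') is passed to pipe() as is: pipe() calls dest.on(...), so it needs a writable stream instance. One way around it, sketched under the assumption that a parser instance offers the parseBuffer method shown in another issue here, is a small adapter stream. The class name BufferSink is made up for this example.

```javascript
const { Writable } = require("stream");

// Hypothetical adapter: a writable stream that buffers the whole PDF and
// passes it to parser.parseBuffer() once the source has ended.
class BufferSink extends Writable {
  constructor(parser) {
    super();
    this.parser = parser; // assumed to be an instance, e.g. new PDFParser()
    this.chunks = [];
  }

  // Called for every incoming chunk; just remember it.
  _write(chunk, encoding, callback) {
    this.chunks.push(chunk);
    callback();
  }

  // Called once after the last chunk, before the "finish" event.
  _final(callback) {
    this.parser.parseBuffer(Buffer.concat(this.chunks));
    callback();
  }
}
```

With this in place, the request would be piped into new BufferSink(parserInstance), and the pdfParser_dataReady and pdfParser_dataError listeners would be attached to the parser instance rather than to the return value of pipe().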