ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library

Home Page: https://docs.ropensci.org/tabulizer

License: Apache License 2.0

R 100.00%

tabula tabular-data pdf java pdf-document r r-package ropensci rstats peer-reviewed

tabulapdf's Issues

extract_tables() doesn't work with the new update

Hi,

All my packages were deleted due to some stupid mistake. After reinstalling (on Windows), extract_tables() doesn't work anymore and gives me this error:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.IllegalArgumentException: Comparison method violates its general contract!

I reinstalled Java, but it gives me the same error. I tried repeating the process on my Mac as well, and it shows the same error. Is there something I'm doing wrong?

annoying warning

Hello,
I get plenty of messages:

There is already a file with this name in the temporary directory. It will be overwritten.

I saw that they come from localize_file(). It seems that the temporary path (which is probably not needed for files that are not loaded from a URL) is built using tempdir(), which explains the above message.
Why not use tempfile() to build a proper temporary path? Or why not skip copying the file when it is not needed?

P.S. tabulizer works great!
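
A minimal sketch of the suggestion above, in base R. The function name `localize_file_sketch()` is hypothetical and this is not the package's actual localize_file(); it only illustrates skipping the copy for local files and using tempfile() for downloads, assuming remote files are recognised by their URL scheme.

```r
# Hypothetical helper, not tabulizer's actual localize_file().
localize_file_sketch <- function(path) {
  is_url <- grepl("^(https?|ftp)://", path)
  if (!is_url) {
    # Local file: use it in place, so nothing is copied or overwritten.
    return(normalizePath(path, mustWork = TRUE))
  }
  # Remote file: tempfile() returns a path that is not currently in use.
  dest <- tempfile(fileext = ".pdf")
  utils::download.file(path, dest, mode = "wb", quiet = TRUE)
  dest
}

# Two calls to tempfile() give different paths, unlike
# file.path(tempdir(), basename(path)), which repeats for the same file name.
a <- tempfile(fileext = ".pdf")
b <- tempfile(fileext = ".pdf")
```

Because each download gets its own path, there would be nothing to overwrite and no reason to print the message.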

Pdf ideas for examples

Scientific papers often have tables and one would surely like to use the area argument.
Bus timetables, e.g. http://www.apsrtc.gov.in/Airport%20Liner%20Timings.pdf or http://www.morbihan.fr/fileadmin/Les_services/Vos_deplacements/Transports_collectifs/Fiches_horaires_TIM/TIM7-Hiver-Printemps-2016.pdf p.3

Password Protection error for PDF's without PW protection

Hi,

I'm having trouble with some PDFs (which work with the tabula browser). When trying to load them with `extract_areas`, the following error is raised:

f <- "path_to_pdf/0101.pdf"
out1 <- extract_areas(f, pages=c(1))

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.io.IOException
In addition: Warning messages:
1: In load_doc(file, password = password) :
  PDF appears to be password protected and no password was supplied.
2: In load_doc(file, password = password) :
  PDF appears to be password protected and no password was supplied.

An example PDF is available here:
http://www.insee.fr/fr/ppp/bases-de-donnees/donnees-detaillees/circo_leg/donnees/0101.pdf
It is not password protected.

Installation fails

Hi,

I followed your instructions to install tabulizer (Windows x64) but the installation always fails:

also installing the dependency ‘png’

leeper/tabulizerjars     leeper/tabulizer 
             "0.1.2"                   NA 
Warning messages:
1: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users\kasus\Documents\R\win-library\3.3" C:\Users\kasus\AppData\Local\Temp\RtmpqIMHaQ/downloaded_packages/png_0.1-7.tar.gz' had status 1 
2: In utils::install.packages(to_install, type = "source", contriburl = contrib,  :
  installation of package ‘png’ had non-zero exit status
3: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users\kasus\Documents\R\win-library\3.3" C:\Users\kasus\AppData\Local\Temp\RtmpqIMHaQ/ghitdrat/src/contrib/tabulizer_0.1.21.tar.gz' had status 1 
4: In utils::install.packages(to_install, type = "source", contriburl = contrib,  :
  installation of package ‘tabulizer’ had non-zero exit status

I tried several things like different paths, different java versions, etc., but all without success. Can you help me out?

Improve Shiny-based `extract_areas()` functionality

Both locate_areas() and extract_areas() optionally use a Shiny interface to identify areas. This could probably be improved because I'm not much of a Shiny expert. Any advice on improvements and new functionality can be pitched here and/or submitted as PRs.

Error installing

I have tried to install in RStudio both

ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")

and

ghit::install_github(c("leeper/tabulizerjars", "leeper/tabulizer"), INSTALL_opts = "--no-multiarch", dependencies = c("Depends", "Imports"))

according to http://stackoverflow.com/questions/39132202/trouble-installing-tabulizer-package-for-r

but the result shows

'leeper/tabulizerjars leeper/tabulizer
"0.1.2" NA
There were 20 warnings (use warnings() to see them)'

There is tabulizerjars in the library but not tabulizer.

When I typed

'install_github("ropensci/tabulizer")'

it shows

Error in read.dcf(file = tmpf) : cannot open the connection
In addition: Warning message:
In read.dcf(file = tmpf) :
cannot open compressed file 'c:\temp\RtmpWQoyv5/ghitdrat/src/contrib/PACKAGES', probable reason 'No such file or directory'

When I typed

ghit::install_github("leeper/tabulizer", INSTALL_opts = "--no-multiarch")
it shows

leeper/tabulizer
NA

Does anyone know how to solve this, please? There is no tabulizer in the library at the moment.

Thank You.

locate areas / extract tables renders unexpected results

I have a small issue with the way locate_areas and extract_tables interact

I use something like:

areas_to_extract<-locate_areas(PDFfile)
extract_tables(PDFfile, area=areas_to_extract)

areas_to_extract is a list of length pages, with each position representing a page.
Positions representing pages that I have specified areas for contain coordinates,
while the pages that I have not indicated an area for, are left empty.

When passing the generated list to extract_tables, empty positions invoke the autodetection algorithm to try to find tables. This seems rather illogical to me, as I had previously reviewed these pages manually to make sure that they in fact do not contain tables.

A possible solution may be that extract_tables skips a page when no area is indicated for it, so that the autodetection is not triggered. I think it would improve efficiency and consistency, and should be fairly easy to implement.
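
Until something like that exists, a possible workaround is to drop the empty positions before calling extract_tables. The sketch below assumes that pages without a selection come back as NULL or zero-length entries; `selected_areas()` is a hypothetical helper and the coordinates are made up.

```r
# Hypothetical helper: keep only the pages for which an area was selected.
selected_areas <- function(areas) {
  has_area <- lengths(areas) > 0   # FALSE for NULL or empty entries
  list(pages = which(has_area), area = areas[has_area])
}

# Example: areas drawn on pages 1 and 3 of a three-page file, page 2 skipped.
areas_to_extract <- list(c(100, 50, 400, 500), NULL, c(120, 50, 380, 500))
sel <- selected_areas(areas_to_extract)

# With tabulizer installed, the call would then look like this:
# extract_tables(PDFfile, pages = sel$pages, area = sel$area, guess = FALSE)
```

Passing only the selected pages, together with guess = FALSE, should keep the autodetection from running on pages that were reviewed and left empty.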

blank images in locate_areas and make_thumbnails

Hi,

I'm having some issues with PDF files that I created from scans (TIFF -> adobe acrobat OCR -> PDF). extract_tables works fine on most of them, but occasionally misses the last row of a table. I therefore tried to get better results with locate_areas. However, the shiny app simply shows an empty image. The same happens when I use make_thumbnails on these files; a blank PNG is created.

Both functions work fine with the demo file that ships with the package, so I suspect that it has something to do with the PDF I created. Do I need to enable particular PDF features when I collate the scanned TIFs?

For completeness' sake, below is my sessionInfo. I also tried to attach a sample PDF, but that currently fails (I suspect because of the ongoing S3 issues). I'll try again later.

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] shiny_1.0.0      tabulizer_0.1.23

loaded via a namespace (and not attached):
 [1] tabulizerjars_0.1.2 R6_2.2.0            htmltools_0.3.5     tools_3.3.2         Rcpp_0.12.9         jsonlite_1.2        digest_0.6.12      
 [8] xtable_1.8-2        httpuv_1.3.3        miniUI_0.1.1        mime_0.5            rJava_0.9-8         png_0.1-7

java.lang.IllegalArgumentException: Comparison method violates its general contract!

I had no issues installing the package and running the code example in the readme, so I know that my installation was successful.

The following attempt blew up:

location <- "http://usda.mannlib.cornell.edu/usda/nass/CropProd//2000s/2002/CropProd-11-12-2002.pdf"
  
out <- extract_tables(location)

The error was :
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.IllegalArgumentException: Comparison method violates its general contract!

here is my sessionInfo():

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulizer_0.1.22

loaded via a namespace (and not attached):
[1] tabulizerjars_0.1.2 tools_3.3.2         rJava_0.9-8         png_0.1-7

memory issues

Dear Tabulizer team,

When extracting hundreds of PDFs, is there a good way to clear memory? The memory use keeps growing and I assume this is due to unreleased objects floating around in the heap.
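
One thing that may help is to process the files in batches and trigger garbage collection between batches, both in R and in the JVM. This is only a sketch: `process_in_batches()` is a hypothetical helper, the batch size is arbitrary, and the JVM treats System.gc() as a request rather than a guarantee.

```r
# Hypothetical helper: apply `fun` to files in batches and clean up in between.
process_in_batches <- function(files, fun, batch_size = 25) {
  batches <- split(files, ceiling(seq_along(files) / batch_size))
  results <- list()
  for (batch in batches) {
    results <- c(results, lapply(batch, fun))
    gc()  # release unreferenced R objects, including references to Java objects
    if (isNamespaceLoaded("rJava")) {
      # Ask the JVM to collect as well (only if rJava is already loaded).
      try(rJava::.jcall("java/lang/System", "V", "gc"), silent = TRUE)
    }
  }
  results
}

# With tabulizer installed, usage could look like:
# tables <- process_in_batches(list_of_files, extract_tables)
```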

Non-western import test extracts two tables instead of one.

> f3 <- "https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf"
> tab3 <- tabulizer::extract_tables(f3, method = "asis")
> tab3
[1] "Java-Object{[[technology.tabula.TableWithRulingLines[x=0.0,y=72.0,w=612.0,h=720.0,bottom=792.000000,right=612.000000], technology.tabula.TableWithRulingLines[x=0.0,y=0.0,w=612.0,h=792.0,bottom=792.000000,right=612.000000]]]}"

Consider adding tabulizerjars to remotes until it's on CRAN?

Handle non-latin encodings

This seems really challenging given the quirkiness of the PDF format, but it is the big issue left to implement from rOpenSci onboarding.

extract_tables() error: java.lang.NoSuchMethodError

I've just installed tabulizer from github. I'm using MacOS Sierra. I also installed the legacy java from the link given on the install instructions: https://support.apple.com/kb/DL1572?locale=en_US.

When I use extract_tables(), I get the following error:

f <- system.file("examples", "data.pdf", package = "tabulizer")
extract_tables(f)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.NoSuchMethodError: java.lang.Integer.compare(II)I
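
Integer.compare(int, int) was added in Java 7, so this error suggests that R is running against an older JVM such as Java 6. A small sketch for checking this follows; `java_major()` is a hypothetical helper that reduces a java.version string to its major version, and the rJava calls are left commented out because they need a working Java installation.

```r
# Hypothetical helper: reduce a java.version string to its major version.
# Old scheme: "1.6.0_65" means Java 6; new scheme: "11.0.2" means Java 11.
java_major <- function(version) {
  parts <- as.integer(strsplit(version, "[._]")[[1]][1:2])
  if (parts[1] == 1L) parts[2] else parts[1]
}

# With rJava installed, the version of the running JVM can be read with:
# rJava::.jinit()
# v <- rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")
# if (java_major(v) < 7) warning("Integer.compare() needs Java 7 or newer")
```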

extract_areas got wrong results for numerical values in pdf table?

I extracted a table from a PDF (text-based, not a scanned image) with

extract_areas(file, encoding="UTF-8")
The interactive selection in the RStudio viewer looks very blurred; see the screenshot:
https://goo.gl/OFvOLn

The data.frame output then contained wrong results, like this:
https://goo.gl/YRnU2d
The number of columns was right and the English characters were right, but the stored values were all wrong.

I cannot figure out what the issue might be. If someone can help verify the problem, the PDF file can be downloaded from:
http://drp.mk/i/0qpZDLvxkW

My R environment:

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
[2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] shiny_0.13.2 tabulizer_0.1.22 magrittr_1.5 data.table_1.9.6

loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 png_0.1-7 digest_0.6.10 mime_0.5
[5] chron_2.3-47 R6_2.1.3 jsonlite_1.0 xtable_1.8-2
[9] git2r_0.15.0 ghit_0.2.12 miniUI_0.1.1 tabulizerjars_0.1.2
[13] Cairo_1.5-9 tools_3.3.1 rsconnect_0.4.3 httpuv_1.3.3
[17] rJava_0.9-8 htmltools_0.3.5

Understanding the error messages

Can you help me understand the following warnings and steps that I need to take to avoid them:

extracted_data_all <- list_of_files %>% lapply(extract_tables, guess = TRUE, method = 'character')
# Fontconfig warning: "/etc/fonts/infinality/conf.d/41-repl-os-win.conf", line 148: Having multiple values in <test> isn't supported and may not work as expected
# Fontconfig warning: "/etc/fonts/infinality/conf.d/41-repl-os-win.conf", line 160: Having multiple values in <test> isn't supported and may not work as expected
# May 31, 2016 3:43:34 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:34 PM org.apache.fontbox.ttf.TrueTypeFont initializeTable
# SEVERE: An error occured when reading table name
# java.io.EOFException
#   at java.io.RandomAccessFile.readUnsignedShort(RandomAccessFile.java:769)
#   at org.apache.fontbox.ttf.RAFDataStream.readUnsignedShort(RAFDataStream.java:118)
#   at org.apache.fontbox.ttf.NamingTable.initData(NamingTable.java:53)
#   at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
#   at org.apache.fontbox.ttf.TrueTypeFont.getNaming(TrueTypeFont.java:114)
#   at org.apache.fontbox.util.FontManager.analyzeTTF(FontManager.java:112)
#   at org.apache.fontbox.util.FontManager.loadFonts(FontManager.java:75)
#   at org.apache.fontbox.util.FontManager.findTTFontname(FontManager.java:290)
#   at org.apache.fontbox.util.FontManager.findTTFont(FontManager.java:326)
#   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getExternalFontFile2(PDTrueTypeFont.java:584)
#   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getawtFont(PDTrueTypeFont.java:510)
#   at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:110)
#   at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:260)
#   at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:499)
#   at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
#   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
#   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
#   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
#   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
#   at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
#   at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
#   at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:93)
#   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
#   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
#   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
#   at java.lang.reflect.Method.invoke(Method.java:606)
#   at RJavaTools.invokeMethod(RJavaTools.java:386)
# 
# May 31, 2016 3:43:35 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:35 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:35 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:36 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:36 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:37 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:37 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:37 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:37 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:38 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:38 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:41 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:41 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:41 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:41 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif,Italic
# May 31, 2016 3:43:42 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif,Italic
# May 31, 2016 3:43:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:42 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif
# May 31, 2016 3:43:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
# INFO: Can't find the specified font Microsoft Sans Serif
# May 31, 2016 3:43:42 PM org.apache.fontbox.util.FontManager findTTFontname
# WARNING: Font not found: Microsoft Sans Serif

Here is the session info

sessionInfo()
# R version 3.3.0 (2016-05-03)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.4 LTS
# 
# locale:
#  [1] LC_CTYPE=en_IN.UTF-8          LC_NUMERIC=C                  LC_TIME=en_IN.UTF-8          
#  [4] LC_COLLATE=en_IN.UTF-8        LC_MONETARY=en_IN.UTF-8       LC_MESSAGES=en_IN.UTF-8      
#  [7] LC_PAPER=en_IN.UTF-8          LC_NAME=en_IN.UTF-8           LC_ADDRESS=en_IN.UTF-8       
# [10] LC_TELEPHONE=en_IN.UTF-8      LC_MEASUREMENT=en_IN.UTF-8    LC_IDENTIFICATION=en_IN.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.6 magrittr_1.5     rlist_0.4.6.1    stringr_1.0.0    dplyr_0.4.3      tabulizer_0.1.14
# 
# loaded via a namespace (and not attached):
#  [1] tabulizerjars_0.1.2 R6_2.1.2            assertthat_0.1      parallel_3.3.0      DBI_0.4-1          
#  [6] tools_3.3.0         Rcpp_0.12.5         stringi_1.1.1       chron_2.3-47        rJava_0.9-8        
# [11] png_0.1-7

Subscript out of bounds error for much the same PDF

Thanks for this awesome package. It works well on all the .pdf-documents I have tried it on. I do, however, have a problem with extract_tables, shown below. You can reproduce this in your RStudio, too.

This works with this pdf in 2015 :

library(tabulizer)
path2pdf <- "/Users/HidetakaKo/Desktop/2015-cookpad.pdf"
out <- extract_tables(path2pdf)
as.data.frame(out[[1]])

This doesn't work with this pdf in 2016 :

library(tabulizer)
path2pdf <- "/Users/HidetakaKo/Desktop/2016-cookpad.pdf"
out <- extract_tables(path2pdf)
as.data.frame(out[[1]])

The format of this .pdf-document is much the same as that of the previous one.

I'm working on a MacAir with
OS X 10.11.6
R 3.3.1
Exploratory Desktop
RStudio Version 0.99.887

A parameter in extract_tables to assume row names and colnames from first row and column

If tables are coming into R as matrices, could the conversion to data.frames be made simpler with a parameter for assuming that the first row or column holds the column or row names?
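
In the meantime, the conversion can be done with a few lines of base R. The helper below is a hypothetical sketch, not part of the package; its name and the `header` / `row_names` arguments are made up for illustration.

```r
# Hypothetical helper: promote the first row and/or column of a character
# matrix to names, then let R guess the column types.
matrix_to_df <- function(m, header = TRUE, row_names = FALSE) {
  if (header) {
    col_names <- m[1, ]
    m <- m[-1, , drop = FALSE]
  } else {
    col_names <- paste0("V", seq_len(ncol(m)))
  }
  df <- as.data.frame(m, stringsAsFactors = FALSE)
  names(df) <- col_names
  if (row_names) {
    rownames(df) <- df[[1]]
    df <- df[-1]
  }
  # Convert "1", "2.5", ... to numeric where every value in a column allows it.
  df[] <- lapply(df, utils::type.convert, as.is = TRUE)
  df
}

m <- rbind(c("id", "name", "score"), c("a", "Ann", "1.5"), c("b", "Bob", "2"))
df <- matrix_to_df(m, header = TRUE, row_names = TRUE)
```

Here `df` has the columns name and score, the row names a and b, and a numeric score column.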

extract_tables error message

I have four PDFs that, from what I can tell, were created by the same source. extract_tables works perfectly for three of them. For the fourth I get the following error message:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.util.NoSuchElementException

extract_areas and extract_text appear to work fine with the pdf in question.

Do any watchers of this repo understand this error message? My googling has not been successful. Is there a specific pdf attribute that I should make sure exists in order for extract_tables to work successfully?

I apologize for not having a reproducible example but the pdfs I'm working with are confidential.

Add vignette

Error using get_n_pages()

Hi, an error as follows occurred when I tried to get the number of pages of a PDF file. I am not sure if it's because of the size of the file, but increasing the JAVA memory didn't help.

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : org.apache.pdfbox.exceptions.WrappedIOException

Thank you!

locate_areas() widget issue

I am getting the following error when using locate_areas() with the native or reduced widgets:

Graphics device does not support event handling...
Entering reduced functionality mode.
Click upper-left and then lower-right corners of area.
Error in try_area_reduced(file = file, dims = dims, area = area, warn = warn) : 
  Graphics device does not support rasterImage() plotting

When searching this error, I was led to this file: try_area_methods.R

Looking on line 74 there is this condition that shows the final line of my error:

if (grDevices::dev.capabilities()[["rasterImage"]] != "no") {
        stop("Graphics device does not support rasterImage() plotting")
    }

There is also this similar condition on line 103 (note the != vs == while producing the same error):

if (grDevices::dev.capabilities()[["rasterImage"]] == "no") {
        stop("Graphics device does not support rasterImage() plotting")
    }

Having checked my grDevices::dev.capabilities(), rasterImage is enabled and so I would think this error would not apply to me. Is the condition from line 74 flipped and causing this error?

I attempted to clone the repo and make the change myself but couldn't figure out how to install locally, so I am pointing it out here.
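
For reference, a sketch of what the check presumably intends: stop only when the device reports no rasterImage support. `check_raster_support()` is a hypothetical stand-alone function, not the code in try_area_methods.R; the capabilities are passed as an argument so the logic can be tried without opening a graphics device.

```r
# Sketch of the intended check: stop only when rasterImage support is "no".
# Other reported values such as "yes" or "non-missing" pass.
check_raster_support <- function(caps = grDevices::dev.capabilities()) {
  if (identical(caps[["rasterImage"]], "no")) {
    stop("Graphics device does not support rasterImage() plotting")
  }
  invisible(TRUE)
}

# A device that supports raster images passes the check.
ok <- check_raster_support(list(rasterImage = "yes"))
```

With the condition written as `!= "no"`, the same call would stop on exactly the devices that do support rasterImage(), which matches the behaviour described above.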

Example use cases, tutorials, and applications

This is a help wanted! issue for anyone to contribute example uses of tabulizer to the package wiki. The idea is to add links to existing blog posts or tutorials, as well as add new examples to the wiki itself that showcase various functionality. Anyone can add an example by editing the wiki directly.

Better options for output

Currently, the output is a character matrix, which kind of makes sense. But there are other options:

List of character matrices (current default)
List of character vectors
- Could be delimited for parsing via read.csv(), etc.
List of data.frames
- This would be nice, but shouldn't be default because some tables won't work well with it if they're not perfectly rectangular. This would also enable automatic variable typing, which would be nice.
Tabula's CSVWriter (implemented but not exposed)
Tabula's TSVWriter
Tabula's JSONWriter

extract_text without trimming

I am not sure this is an issue per se but I think it would be very useful to preserve the spacing of the text without trimming. For example, if something appeared on screen as

" [whitespace....................]hello world gfdaggfdagfda [whitespace....................]"

Right now, I believe tabulizer would yield

" hello world gfdaggfdagfda"

Another example would be

" hello world [whitespace....................] gfdaggfdagfda [whitespace....................]"

tabulizer might yield

" hello world gfdaggfdagfda "

Perhaps there is a way to do this now, but I missed it. Even trying something like extract_tables(guess=FALSE,columns...) won't do the trick because of the aforementioned trimming issue. The only thing I can think of doing is literally creating coordinate by coordinate columns. Like,

extract_tables(file=f,guess=FALSE,pages=1,columns=list(seq(1,900,by=1)))

Perhaps that is the recommended move? But it seems less than ideal, as it is incredibly computationally expensive for what it's doing.

Handling issues related to upgrade to PDFBox 2.0

The tabula-java library is moving to PDFBox 2.0, which will have consequences not only for the tabula API but also for some of the utility functions that tabulizer implements by calling PDFBox classes directly. This is flagged as an issue at tabulizerjars and will likely have numerous consequences for tabulizer. Any help identifying and correcting these issues will be appreciated.

Specifying area and using nospreadsheet=True doesn't work

I have a PDF where guessing (table autodetection) isn't able to find the table, and the table is a stream table (columns separated by whitespace). I used tabula-py, where the code goes like:

df=tabula.read_pdf("sample.pdf",nospreadsheet=True,area=(321.3,49.459,836.719,567.109))
I get an empty dataframe after executing it. But specifying the same area in Tabula on Windows produces the output I want.

Is there any way to make this work?

Multiple table in 1 page

Migrated from ropensci/tabulizerjars#1 (@khun84)

Is there a parameter that I can pass in to extract more than 1 table per page?

I have a pdf page with 2 tables:

table 1 has 2 columns and multiple rows
table 2 has 2 columns and multiple rows, but some of the cells are merged.

I use the extract_tables() function with default parameters and the output only has 1 table (table 1).

What I can think of is to set method = 'asis', but I do not know how to proceed with the resulting Java object. Is there any documentation I can refer to?
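
One approach that may work without touching the Java object is to give one area per table and repeat the page number once per area. The sketch below only builds the arguments; `areas_on_page()` is a hypothetical helper and the coordinates are placeholders that would normally come from locate_areas().

```r
# Hypothetical helper: build the `pages` and `area` arguments for several
# areas on one page. Each area is c(top, left, bottom, right) in points.
areas_on_page <- function(page, areas) {
  list(pages = rep(page, length(areas)), area = areas)
}

# Placeholder coordinates for two tables on page 1.
req <- areas_on_page(1, list(c(50, 40, 300, 560), c(320, 40, 700, 560)))

# With tabulizer installed:
# extract_tables(f, pages = req$pages, area = req$area, guess = FALSE)
```

With guess = FALSE, each area should be extracted as its own table, so the result would be a list with one element per area.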

Loading into R unsuccessful

I followed the instructions and installed Java 6. R keeps throwing the same warning message:
-> ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"))

ropenscilabs/tabulizerjars     ropenscilabs/tabulizer 
                        NA                         NA 
Warning messages:
1: In utils::install.packages(to_install, type = type, repos = repos,  :
  installation of package ‘tabulizerjars’ had non-zero exit status
2: In utils::install.packages(to_install, type = type, repos = repos,  :
  installation of package ‘tabulizer’ had non-zero exit status

Has anyone had the same issue? Any help would be appreciated.
I am using Unix, and my R version is 3.3.2.

Handle Encrypted PDFs

Tabulizer can handle encrypted PDFs through a password. I should expose this functionality for completeness' sake.

All this requires is optionally passing a password argument to the objectextractor constructor here.

Handle .jar files for CRAN

The .jar files are currently 8MB and apparently too big for CRAN. Standard practice is to dump these to a separate, rarely updated package, which we can do, but Tabula is a relatively young library so that may not work quite yet.

"Tabulizer not available for R 3.3.2"

Attempted to run through the tutorial given on DataSciencePlus.com and went to install the Tabulizer package and the R console throws up the following warning:

> install.packages("tabulizer")
Warning in install.packages : package ‘tabulizer’ is not available (for R version 3.3.2)

I did not see any open issues or discussion on the development page regarding this and wanted to bring it to the attention of the maintainers.

Please let me know if you all have any further questions
Javier - javier.ignacio.alonso (at) gmail dot com

PDF page in left and right format

Thanks for contributing this awesome package.
Most of my pages are separated by a blank area in the middle, so that the left and right paragraphs are independent. Interestingly, extract_tables sometimes works well, and sometimes it treats the left and right paragraphs as a single table. In the second case, some columns in the table are combined and it's hard to extract information. I've uploaded test.pdf.

I wonder whether the function could auto-detect this kind of layout, or whether a parameter could specify it. Thank you.
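
As a possible workaround, the two halves can be passed as separate areas. The sketch assumes the gap sits exactly in the middle of the page and that the page width and height in points are known (for example from get_page_dims(), assuming it reports width and height per page); `column_areas()` is a hypothetical helper.

```r
# Hypothetical helper: split a page into a left and a right area.
# Each area is c(top, left, bottom, right) in points.
column_areas <- function(width, height) {
  mid <- width / 2
  list(left = c(0, 0, height, mid), right = c(0, mid, height, width))
}

# US Letter page, 612 x 792 points.
halves <- column_areas(612, 792)

# With tabulizer installed, for page 1 of a file f:
# extract_tables(f, pages = c(1, 1), area = unname(halves), guess = FALSE)
```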

Integrating extract_tables with Shiny-app - no reactivity

Thanks for this awesome package. It works well on all the .pdf-documents I have tried it on. I do however have a problem integrating the extract_tables / extract_text functions with my own Shiny-app.

More specifically the problem is that the fileInput-function to upload files doesn't seem to recognize that a new file has been uploaded. This works instantly with other R-functions like read.csv or pdf_text in the pdftools-library.

This works with pdftools :

library(pdftools)
shinyServer(function(input, output) {
    output$contents <- renderText({

        inFile <- input$file1

        if (is.null(inFile))
            return(NULL)
        pdf_text(inFile$datapath)
    })
})

This doesn't work with tabulizer :

library(shiny);library(tabulizer)
shinyServer(function(input, output) {
    output$contents <- renderText({#renderTable

        inFile <- input$file1

        if (is.null(inFile))
            return(NULL)
       extract_text(inFile$datapath)
        #extract_tables(inFile$datapath)[[1]]
        #read.csv(inFile$datapath, header=input$header, sep=input$sep, 
        #         quote=input$quote)
    })
})

ui.R is the same in both cases.

shinyUI(fluidPage(
    titlePanel("Uploading Files"),
    sidebarLayout(
        sidebarPanel(
            fileInput('file1', 'Choose PDF File',
                      accept=c('.pdf'))#,c("application/pdf","adobe-portable-document-format",".pdf"))
        ),
        mainPanel(
            tableOutput('contents')
        )
    )
))

I'm working on a MacPro with
OS X 10.11.4
R 3.2.3
RStudio Version 0.99.887

Installing tabulizer

I'm pretty well stuck trying to install the package as described in the instructions. After installing chocolatey, then Java, then the ghit package, after running the code as laid out in the instructions:
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")
I get the following error:
Error in build_and_insert(p$pkgname, d, vers, build_args, verbose = verbose) :
Package build for tabulizerjars failed!
In addition: Warning message:
running command '"C:/PROGRA~1/MIE74D~1/MRO-33~1.1/bin/x64/R" CMD build C:\Users\USERNAME\AppData\Local\Temp\Rtmp0mPJjx\tabulizerjars1cd83bee68cf ' had status 1

Any help troubleshooting would be very much appreciated.

is there a limit on the size of the extraction?

I tried to extract 816 pages using extract_tables from a PDF that has a size of 8.2MB. After 10 minutes of running, the following error message popped up:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: Java heap space

Would really appreciate your help!

Thank you!
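
Two things that commonly help with "Java heap space" errors are raising the JVM heap limit before rJava is loaded and extracting the pages in smaller chunks. The sketch below assumes both; the 4g value and the chunk size are examples only, and `page_chunks()` is a hypothetical helper.

```r
# Must run before rJava or tabulizer is loaded; the value is an example only.
options(java.parameters = "-Xmx4g")

# Hypothetical helper: split page numbers into consecutive chunks.
page_chunks <- function(n_pages, chunk_size = 50) {
  pages <- seq_len(n_pages)
  unname(split(pages, ceiling(pages / chunk_size)))
}

chunks <- page_chunks(816, chunk_size = 50)

# With tabulizer installed, each chunk could be extracted separately:
# tables <- lapply(chunks, function(p) extract_tables(f, pages = p))
```

For the 816-page file this gives 17 chunks, the last one covering pages 801 to 816.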

Not loading in R

Hi, I confess I'm not an expert R user but I seem to have some problems in installing Tabulizer in R.

I'm using R Studio and working in a 64bit Windows environment.

I tried loading the package using this line (as I had seen in another thread):

ghit::install_github(c("leeper/tabulizerjars", "leeper/tabulizer"), INSTALL_opts = "--no-multiarch", dependencies = c("Depends", "Imports"))

And that is what I got as an answer:

leeper/tabulizerjars leeper/tabulizer
NA NA
Warning messages:
1: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users...\Documents\R\win-library\3.3" C:\Users...\AppData\Local\Temp\RtmpeML0Qt/ghitdrat/src/contrib/tabulizerjars_0.9.2.tar.gz' had status 1
2: In utils::install.packages(to_install, type = type, repos = repos, :
installation of package ‘tabulizerjars’ had non-zero exit status
3: running command '"C:/PROGRA~1/R/R-33~1.0/bin/x64/R" CMD INSTALL --no-multiarch -l "C:\Users...\Documents\R\win-library\3.3" C:\Users...\AppData\Local\Temp\RtmpeML0Qt/ghitdrat/src/contrib/tabulizer_0.1.24.tar.gz' had status 1
4: In utils::install.packages(to_install, type = type, repos = repos, :
installation of package ‘tabulizer’ had non-zero exit status

Could you help me with that?

Thanks in advance.

Dependencies not available

When I tried to install the package, the following error message appeared:

Warning: dependencies ‘BiocInstaller’, ‘Rcompression’, ‘glmmADMB’, ‘lme4.0’, ‘cacheSweave’, ‘weaver’, ‘graph’, ‘Biobase’, ‘GenomicRanges’, ‘marray’, ‘affy’, ‘limma’, ‘Rcampdf’, ‘Rgraphviz’, ‘tm.lexicon.GeneralInquirer’, ‘ReportingTools’, ‘globaltest’, ‘R2wd’, ‘RDCOMClient’, ‘rhdf5’ are not available

and was unable to continue installation.

Would you please help with this issue? Thank you!

Not loading into R like described

I've seemingly exhausted my limited knowledge but I'm unable to get the package to load into RStudio. I've installed rJava, and updated Java on my PC (Win7). I followed your instructions on installation, but I think I'm either missing a step or not downloading the correct versions of tabula. Any help would be greatly appreciated.

I'm using 3.3.1, if that matters.
Thanks!

add split and merge functions

This might be useful for handling portions of a very large PDF document, or for combining many PDFs into one for use with extract_areas() or extract_tables(), or for some other purposes.

split_pdf()
https://pdfbox.apache.org/docs/1.8.12/javadocs/org/apache/pdfbox/util/Splitter.html#Splitter()

merge_pdfs()
https://pdfbox.apache.org/docs/1.8.12/javadocs/org/apache/pdfbox/util/PDFMergerUtility.html#PDFMergerUtility()

Better rstudio integration for extract_areas function

Hopefully someone from the RStudio team can help with this.

Extract Area around a matching string

Hello,
I have tables in the format

abc : 12345566
cde : 456782
gef : 45345435

where the labels (abc, cde, ...) stay the same and the numbers vary. When I extract a specific area, I get a data frame with 2 columns, which is perfect. My problem, however, is that the tables sometimes split over two pages, depending on the extra lines on the number side, and there is one value, "xyz", which is present only in some tables.

Is there a way to get the area around a search string? That way I would know at which value the table was split onto the second page and, if "xyz" is present, I could change the area accordingly.

Hopefully I am making sense...
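
One way the request could be approached, sketched under stated assumptions: if word positions for a page are available as a data frame (for example from `pdftools::pdf_data()`, which is a separate package and not part of tabulizer), an area can be computed around the first match of a search string. `area_around()` and its padding values are hypothetical, and whether these coordinates line up exactly with what `extract_tables()` or `extract_areas()` expect should be verified on a real file.

```r
# Hypothetical helper: given word positions on one page (columns x, y,
# width, height, text; origin at the top-left, units in points), return
# c(top, left, bottom, right) around the first word equal to `pattern`.
area_around <- function(words, pattern, pad = c(5, 5, 5, 200)) {
  hit <- words[words$text == pattern, , drop = FALSE]
  if (nrow(hit) == 0) return(NULL)  # string not found on this page
  hit <- hit[1, ]
  c(top    = hit$y - pad[1],
    left   = hit$x - pad[2],
    bottom = hit$y + hit$height + pad[3],
    right  = hit$x + hit$width + pad[4])
}

# Synthetic positions standing in for real output, e.g. from
# pdftools::pdf_data("file.pdf")[[1]] (assumed source, placeholder file).
words <- data.frame(
  x = c(50, 50, 50), y = c(100, 115, 130),
  width = c(20, 20, 20), height = c(10, 10, 10),
  text = c("abc", "cde", "gef"),
  stringsAsFactors = FALSE
)

area_around(words, "cde")
area_around(words, "xyz")  # NULL: "xyz" is absent, so keep the default area
```

A `NULL` result could serve as the signal that "xyz" is missing from a given table, so the default area is kept.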

tabulizer is not available (for R version 3.3.2)

I tried installing this R package and it gave the error below.

Installing package into ‘C:/Users/hskir/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘tabulizer’ is not available (for R version 3.3.2)

May I know which R version is supported? Thanks.

Area problems with multiple pages

I keep getting an error trying to use the area parameter with a specified page range:

e.g., using this file and the command:

extract_tables('Lap Analysis.pdf',guess=F,pages=2,area=list(c(178, 10, 800,40)))

I can extract a header, but if I set pages=c(2,3) or remove the pages parameter I get an error:

May 2, 2016 10:55:48 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
   java.util.NoSuchElementException
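
A possible workaround, assuming the error arises because `area` holds a single entry while several pages are requested: supply one area per page by repeating the same coordinates. This is a sketch rather than a confirmed fix; the coordinates and file name are the ones from the command above.

```r
pages <- c(2, 3)
header_area <- c(178, 10, 800, 40)  # coordinates from the report above

# One list entry per requested page, all with the same coordinates.
areas <- rep(list(header_area), length(pages))

# Requires tabulizer and the PDF from the report:
# extract_tables("Lap Analysis.pdf", guess = FALSE, pages = pages, area = areas)
```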

Comprehensive installation instructions

This is a help wanted! issue. There is a wide and somewhat recurring set of installation issues (I've tried to systematically label these issues on GitHub), mostly related to variation in Java and rJava across platforms. If anyone wants to contribute PRs to help document troubleshooting, please submit them around this issue. In particular, the README could benefit from even more detail about installation processes, based around OS:

Add column and area tests

This is documented but not tested. It can be done using the same test PDF file to extract a subset of columns.

How to install from local directory?

Hi I have downloaded the zip file to C:/Users/Public.
My operating system is windows7 64 bit. My version of R is 3.3.2
I have tried ...

> install.packages('C:/Users/Public/tabulizer-master.zip', repos = NULL, type="binary", INSTALL_opts = "--no-multiarch")
> library(`tabulizer-master`)
Error in library(`tabulizer-master`) : 
  there is no package called ‘tabulizer-master’
> library(`tabulizer`)
Error in library(tabulizer) : there is no package called ‘tabulizer’

I have also tried

 ghit::install_github(c("leeper/tabulizerjars", "leeper/tabulizer"), INSTALL_opts = "--no-multiarch", verbose = TRUE) 
Parsing reponame for 'leeper/tabulizerjars'...
Creating local git repository for tabulizerjars in C:\Users\MARSHL~1\AppData\Local\Temp\RtmpqkiGbZ\tabulizerjarsa0c2a8e4674...
Checking out package tabulizerjars to local git repository...
Error in git2r::fetch(gitrepo, name = "github", credentials = credentials) : 
  Error in 'git2r_remote_fetch': failed to send request: A connection with the server could not be established

> devtools::install_github("leeper/tabulizer")
Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
  Couldn't connect to server

Can you please give some instructions for installing tabulizer from a local zip file? Thank you.
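
A zip downloaded from GitHub is a source snapshot rather than a Windows binary package, which may be why `type = "binary"` did not produce an installed package. A minimal sketch of an offline install from such a zip follows. `install_from_zip()` is a hypothetical helper, the `tabulizerjars-master.zip` path is an assumption (the jar package would have to be downloaded the same way), and rJava plus any other dependencies are assumed to be installed already.

```r
# Hypothetical helper: install a package from a GitHub source zip
# (e.g. "tabulizer-master.zip") without network access.
install_from_zip <- function(zip_path, exdir = tempfile("pkgsrc")) {
  stopifnot(file.exists(zip_path))
  utils::unzip(zip_path, exdir = exdir)
  # GitHub zips unpack into a single top-level folder, e.g. "tabulizer-master"
  src_dir <- list.dirs(exdir, recursive = FALSE)[1]
  utils::install.packages(src_dir, repos = NULL, type = "source",
                          INSTALL_opts = "--no-multiarch")
}

# Install the jar dependency first, then tabulizer itself (paths assumed):
# install_from_zip("C:/Users/Public/tabulizerjars-master.zip")
# install_from_zip("C:/Users/Public/tabulizer-master.zip")
# library(tabulizer)  # the package name, not the zip or folder name
```

Note that `library()` takes the package name (`tabulizer`), so `library(tabulizer-master)` fails regardless of whether the install succeeded.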

Java Error

I am encountering a Java error when I try to use tabulizer. Here is the relevant code and the error:

The options command maxes out the memory available to Java.

I'm working on a MacBook Pro (Retina, 13-inch, Mid 2014), 2.8 GHz Intel Core i5, 16 GB 1600 MHz DDR3
R 3.3.2

It seems when scraping a repeated filename, the first file remains cached

I was scraping files with repeated filenames in different folders, and when I execute extract_tables and/or extract_areas, it returns the first instance of the filename.

A workaround that worked for me: load the library again.
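
Another possible workaround, assuming the cause is that the package copies each input into the temporary directory under its base name, so that same-named files collide: copy each PDF to a uniquely named temporary file first and extract from that copy. `unique_copy()` is a hypothetical helper, and the demo files are plain-text placeholders, not real PDFs.

```r
# Hypothetical helper: copy a PDF to a uniquely named temporary file so two
# inputs with the same base name never share a temporary path.
unique_copy <- function(path) {
  tmp <- tempfile(fileext = ".pdf")
  stopifnot(file.copy(path, tmp))
  tmp
}

# Demo with two same-named placeholder files in different folders.
d1 <- tempfile("a"); d2 <- tempfile("b")
dir.create(d1); dir.create(d2)
f1 <- file.path(d1, "report.pdf"); f2 <- file.path(d2, "report.pdf")
writeLines("first", f1); writeLines("second", f2)

copies <- vapply(c(f1, f2), unique_copy, character(1))

# Requires tabulizer; each call then sees a distinct file name:
# tables <- lapply(copies, tabulizer::extract_tables)
```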
