pdfplumber

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.

Currently tested on Python 3.8, 3.9, 3.10, 3.11.

Translations of this document are available in: Chinese (by @hbh112233abc).

To report a bug or request a feature, please file an issue. To ask a question or request assistance with a specific PDF, please use the discussions forum.

👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact Jeremy (for projects of any size or complexity) and/or Samkit (specifically for table extraction).

Table of Contents

Installation

pip install pdfplumber

Command line interface

Basic example

curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf
pdfplumber < background-checks.pdf > background-checks.csv

The output will be a CSV containing info about every character, line, and rectangle in the PDF.

Options

Argument Description
--format [format] csv or json. The json format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes.
--pages [list of pages] A space-delimited, 1-indexed list of pages or hyphenated page ranges. E.g., 1, 11-15, which would return data for pages 1, 11, 12, 13, 14, and 15.
--types [list of object types to extract] Choices are char, rect, line, curve, image, annot, et cetera. Defaults to all available.
--laparams A JSON-formatted string (e.g., '{"detect_vertical": true}') to pass to pdfplumber.open(..., laparams=...).
--precision [integer] The number of decimal places to round floating-point numbers. Defaults to no rounding.
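For instance, the following invocation (filenames hypothetical) extracts only character and line objects from the first three pages as JSON, rounded to two decimal places:

pdfplumber --format json --pages 1-3 --types char line --precision 2 < some-file.pdf > some-file.json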

Python library

Basic example

import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

Loading a PDF

To start working with a PDF, call pdfplumber.open(x), where x can be a:

  • path to your PDF file
  • file object, loaded as bytes
  • file-like object, loaded as bytes

The open method returns an instance of the pdfplumber.PDF class.

To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("file.pdf", password = "test").

To set layout analysis parameters for pdfminer.six's layout engine, pass the laparams keyword argument, e.g., pdfplumber.open("file.pdf", laparams = { "line_overlap": 0.7 }).

Invalid metadata values are treated as a warning by default. If that is not intended, pass strict_metadata=True to the open method and pdfplumber.open will raise an exception if it is unable to parse the metadata.
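Putting those options together, a minimal sketch (file paths hypothetical):

import pdfplumber

# Password-protected PDF, with custom layout-analysis parameters
with pdfplumber.open(
    "protected.pdf",
    password="test",
    laparams={"line_overlap": 0.7, "detect_vertical": True},
) as pdf:
    print(len(pdf.pages))

# Raise an exception on unparseable metadata instead of warning
with pdfplumber.open("odd-metadata.pdf", strict_metadata=True) as pdf:
    print(pdf.metadata)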

The pdfplumber.PDF class

The top-level pdfplumber.PDF class represents a single PDF and has two main properties:

Property Description
.metadata A dictionary of metadata key/value pairs, drawn from the PDF's Info trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera.
.pages A list containing one pdfplumber.Page instance per page loaded.

... and also has the following method:

Method Description
.close() Calling this method calls Page.close() on each page, and also closes the file stream (except in cases when the stream is external, i.e., already opened and passed directly to pdfplumber).
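For example, a quick pass over a document's metadata and pages might look like this (file path hypothetical):

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    print(pdf.metadata.get("Producer"))
    for page in pdf.pages:
        print(page.page_number, page.width, page.height)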

The pdfplumber.Page class

The pdfplumber.Page class is at the core of pdfplumber. Most things you'll do with pdfplumber will revolve around this class. It has these main properties:

Property Description
.page_number The sequential page number, starting with 1 for the first page, 2 for the second, and so on.
.width The page's width.
.height The page's height.
.objects / .chars / .lines / .rects / .curves / .images Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below.

... and these main methods:

Method Description
.crop(bounding_box, relative=False, strict=True) Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values (x0, top, x1, bottom). Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If relative=True, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than as an absolute position. (See Issue #245 for a visual example and explanation.) When strict=True (the default), the crop's bounding box must fall entirely within the page's bounding box.
.within_bbox(bounding_box, relative=False, strict=True) Similar to .crop, but only retains objects that fall entirely within the bounding box.
.outside_bbox(bounding_box, relative=False, strict=True) Similar to .crop and .within_bbox, but only retains objects that fall entirely outside the bounding box.
.filter(test_function) Returns a version of the page with only the .objects for which test_function(obj) returns True.

... and also has the following method:

Method Description
.close() By default, Page objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory.
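As a sketch of how these methods fit together when working through a large PDF (file path, bounding box, and size threshold are arbitrary):

import pdfplumber

with pdfplumber.open("large.pdf") as pdf:
    for page in pdf.pages:
        # Keep only the top half of the page
        top_half = page.crop((0, 0, page.width, page.height / 2))

        # Keep only characters of a reasonable size (plus all non-char objects)
        big_text = top_half.filter(
            lambda obj: obj["object_type"] != "char" or obj["size"] >= 10
        )
        print(page.page_number, len(big_text.chars))

        # Flush the cached layout/objects before moving on
        page.close()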

Additional methods are described in the sections that follow.

Objects

Each instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects, all derived from pdfminer.six PDF parsing. The following properties each return a Python list of the matching objects:

  • .chars, each representing a single text character.
  • .lines, each representing a single 1-dimensional line.
  • .rects, each representing a single 2-dimensional rectangle.
  • .curves, each representing any series of connected points that pdfminer.six does not recognize as a line or rectangle.
  • .images, each representing an image.
  • .annots, each representing a single PDF annotation (cf. Section 8.4 of the official PDF specification for details)
  • .hyperlinks, each representing a single PDF annotation of the subtype Link and having a URI action attribute

Each object is represented as a simple Python dict, with the following properties:

char properties

Property Description
page_number Page number on which this character was found.
text E.g., "z", or "Z" or " ".
fontname Name of the character's font face.
size Font size.
adv Equal to text width * the font size * scaling factor.
upright Whether the character is upright.
height Height of the character.
width Width of the character.
x0 Distance of left side of character from left side of page.
x1 Distance of right side of character from left side of page.
y0 Distance of bottom of character from bottom of page.
y1 Distance of top of character from bottom of page.
top Distance of top of character from top of page.
bottom Distance of bottom of the character from top of page.
doctop Distance of top of character from top of document.
matrix The "current transformation matrix" for this character. (See below for details.)
mcid The marked content section ID for this character if any (otherwise None). Experimental attribute.
tag The marked content section tag for this character if any (otherwise None). Experimental attribute.
ncs TKTK
stroking_pattern TKTK
non_stroking_pattern TKTK
stroking_color The color of the character's outline (i.e., stroke). See docs/colors.md for details.
non_stroking_color The character's interior color. See docs/colors.md for details.
object_type "char"

Note: A character’s matrix property represents the “current transformation matrix,” as described in Section 4.2.2 of the PDF Reference (6th Ed.). The matrix controls the character’s scale, skew, and positional translation. Rotation is a combination of scale and skew, but in most cases can be considered equal to the x-axis skew. The pdfplumber.ctm submodule defines a class, CTM, that assists with these calculations. For instance:

from pdfplumber.ctm import CTM
my_char = pdf.pages[0].chars[3]
my_char_ctm = CTM(*my_char["matrix"])
my_char_rotation = my_char_ctm.skew_x

line properties

Property Description
page_number Page number on which this line was found.
height Height of line.
width Width of line.
x0 Distance of left-side extremity from left side of page.
x1 Distance of right-side extremity from left side of page.
y0 Distance of bottom extremity from bottom of page.
y1 Distance of top extremity from bottom of page.
top Distance of top of line from top of page.
bottom Distance of bottom of the line from top of page.
doctop Distance of top of line from top of document.
linewidth Thickness of line.
stroking_color The color of the line. See docs/colors.md for details.
non_stroking_color The non-stroking color specified for the line’s path. See docs/colors.md for details.
mcid The marked content section ID for this line if any (otherwise None). Experimental attribute.
tag The marked content section tag for this line if any (otherwise None). Experimental attribute.
object_type "line"

rect properties

Property Description
page_number Page number on which this rectangle was found.
height Height of rectangle.
width Width of rectangle.
x0 Distance of left side of rectangle from left side of page.
x1 Distance of right side of rectangle from left side of page.
y0 Distance of bottom of rectangle from bottom of page.
y1 Distance of top of rectangle from bottom of page.
top Distance of top of rectangle from top of page.
bottom Distance of bottom of the rectangle from top of page.
doctop Distance of top of rectangle from top of document.
linewidth Thickness of line.
stroking_color The color of the rectangle's outline. See docs/colors.md for details.
non_stroking_color The rectangle’s fill color. See docs/colors.md for details.
mcid The marked content section ID for this rect if any (otherwise None). Experimental attribute.
tag The marked content section tag for this rect if any (otherwise None). Experimental attribute.
object_type "rect"

curve properties

Property Description
page_number Page number on which this curve was found.
pts A list of (x, top) tuples indicating the points on the curve.
path A list of (cmd, *(x, top)) tuples describing the full path, including (for example) control points used in Bezier curves.
height Height of curve's bounding box.
width Width of curve's bounding box.
x0 Distance of curve's left-most point from left side of page.
x1 Distance of curve's right-most point from left side of the page.
y0 Distance of curve's lowest point from bottom of page.
y1 Distance of curve's highest point from bottom of page.
top Distance of curve's highest point from top of page.
bottom Distance of curve's lowest point from top of page.
doctop Distance of curve's highest point from top of document.
linewidth Thickness of line.
fill Whether the shape defined by the curve's path is filled.
stroking_color The color of the curve's outline. See docs/colors.md for details.
non_stroking_color The curve’s fill color. See docs/colors.md for details.
dash A ([dash_array], dash_phase) tuple describing the curve's dash style. See Table 4.6 of the PDF specification for details.
mcid The marked content section ID for this curve if any (otherwise None). Experimental attribute.
tag The marked content section tag for this curve if any (otherwise None). Experimental attribute.
object_type "curve"

Derived properties

Additionally, both pdfplumber.PDF and pdfplumber.Page provide access to several derived lists of objects: .rect_edges (which decomposes each rectangle into its four lines), .curve_edges (which does the same for curve objects), and .edges (which combines .rect_edges, .curve_edges, and .lines).
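For example, to compare the raw and derived line-like objects on a page (a minimal sketch, reusing the pdf object from the earlier examples):

page = pdf.pages[0]
print(len(page.lines), len(page.rect_edges), len(page.curve_edges), len(page.edges))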

image properties

[To be completed.]

Obtaining higher-level layout objects via pdfminer.six

If you pass the pdfminer.six-handling laparams parameter to pdfplumber.open(...), then each page's .objects dictionary will also contain pdfminer.six's higher-level layout objects, such as "textboxhorizontal".
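A minimal sketch (file path hypothetical; the available keys depend on which layout objects pdfminer.six actually produces for your PDF):

import pdfplumber

with pdfplumber.open("some-file.pdf", laparams={}) as pdf:
    page = pdf.pages[0]
    for box in page.objects.get("textboxhorizontal", []):
        print(box["x0"], box["top"], box.get("text"))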

Visual debugging

pdfplumber's visual debugging tools can be helpful in understanding the structure of a PDF and the objects that have been extracted from it.

Creating a PageImage with .to_image()

To turn any page (including cropped pages) into a PageImage object, call my_page.to_image(). You can optionally pass one of the following keyword arguments:

  • resolution: The desired number of pixels per inch. Default: 72. Type: int.
  • width: The desired image width in pixels. Default: unset, determined by resolution. Type: int.
  • height: The desired image height in pixels. Default: unset, determined by resolution. Type: int.
  • antialias: Whether to use antialiasing when creating the image. Setting to True creates images with less-jagged text and graphics, but with larger file sizes. Default: False. Type: bool.
  • force_mediabox: Use the page's .mediabox dimensions, rather than the .cropbox dimensions. Default: False. Type: bool.

For instance:

im = my_pdf.pages[0].to_image(resolution=150)

From a script or REPL, im.show() will open the image in your local image viewer. But PageImage objects also play nicely with Jupyter notebooks; they automatically render as cell outputs. For example:

[Screenshot: visual debugging in a Jupyter notebook]

Note: .to_image(...) works as expected with Page.crop(...)/CroppedPage instances, but is unable to incorporate changes made via Page.filter(...)/FilteredPage instances.

Basic PageImage methods

Method Description
im.reset() Clears anything you've drawn so far.
im.copy() Copies the image to a new PageImage object.
im.show() Opens the image in your local image viewer.
im.save(path_or_fileobject, format="PNG", quantize=True, colors=256, bits=8) Saves the annotated image as a PNG file. The default arguments quantize the image to a palette of 256 colors, saving the PNG with 8-bit color depth. You can disable quantization by passing quantize=False or adjust the size of the color palette by passing colors=N.

Drawing methods

You can pass explicit coordinates or any pdfplumber PDF object (e.g., char, line, rect) to these methods.

Single-object method Bulk method Description
im.draw_line(line, stroke={color}, stroke_width=1) im.draw_lines(list_of_lines, **kwargs) Draws a line from a line, curve, or a 2-tuple of 2-tuples (e.g., ((x, y), (x, y))).
im.draw_vline(location, stroke={color}, stroke_width=1) im.draw_vlines(list_of_locations, **kwargs) Draws a vertical line at the x-coordinate indicated by location.
im.draw_hline(location, stroke={color}, stroke_width=1) im.draw_hlines(list_of_locations, **kwargs) Draws a horizontal line at the y-coordinate indicated by location.
im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1) im.draw_rects(list_of_rects, **kwargs) Draws a rectangle from a rect, char, etc., or 4-tuple bounding box.
im.draw_circle(center_or_obj, radius=5, fill={color}, stroke={color}) im.draw_circles(list_of_circles, **kwargs) Draws a circle at (x, y) coordinate or at the center of a char, rect, etc.

Note: The methods above are built on Pillow's ImageDraw methods, but the parameters have been tweaked for consistency with SVG's fill/stroke/stroke_width nomenclature.
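For instance, to outline every character on the first page, add a horizontal guide line, and save the result (filename and coordinates are arbitrary):

im = my_pdf.pages[0].to_image(resolution=150)
im.draw_rects(my_pdf.pages[0].chars)              # outline each character
im.draw_hline(100, stroke="red", stroke_width=2)  # guide line at top=100
im.save("debug.png", format="PNG")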

Visually debugging the table-finder

im.debug_tablefinder(table_settings={}) will return a version of the PageImage with the detected lines (in red), intersections (circles), and tables (light blue) overlaid.

Extracting text

pdfplumber can extract text from any given page (including cropped and derived pages). It can also attempt to preserve the layout of that text, as well as to identify the coordinates of words and search queries. Page objects can call the following text-extraction methods:

Method Description
.extract_text(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, layout=False, x_density=7.25, y_density=13, line_dir_render=None, char_dir_render=None, **kwargs) Collates all of the page's character objects into a single string.
  • When layout=False: Adds spaces where the difference between the x1 of one character and the x0 of the next is greater than x_tolerance. (If x_tolerance_ratio is not None, the extractor uses a dynamic x_tolerance equal to x_tolerance_ratio * previous_character["size"].) Adds newline characters where the difference between the doctop of one character and the doctop of the next is greater than y_tolerance.

  • When layout=True (experimental feature): Attempts to mimic the structural layout of the text on the page(s), using x_density and y_density to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. Passing line_dir_render="ttb"/"btt"/"ltr"/"rtl" and/or char_dir_render="ttb"/"btt"/"ltr"/"rtl" will output the lines/characters in a different direction than the default. All remaining **kwargs are passed to .extract_words(...) (see below), the first step in calculating the layout.

.extract_text_simple(x_tolerance=3, y_tolerance=3) A slightly faster but less flexible version of .extract_text(...), using a simpler logic.
.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True) Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the x1 of one character and the x0 of the next is less than or equal to x_tolerance and where the difference between the doctop of one character and the doctop of the next is less than or equal to y_tolerance. (If x_tolerance_ratio is not None, the extractor uses a dynamic x_tolerance equal to x_tolerance_ratio * previous_character["size"].) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing keep_blank_chars to True will mean that blank characters are treated as part of a word, not as a space between words. Changing use_text_flow to True will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments line_dir and char_dir tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The line_dir_rotated and char_dir_rotated arguments are similar, but for text that has been rotated. Passing a list of extra_attrs (e.g., ["fontname", "size"]) will restrict each word to characters that share exactly the same value for each of those attributes, and the resulting word dicts will indicate those attributes. Setting split_at_punctuation to True will enforce breaking tokens at the punctuation specified by string.punctuation; or you can specify the list of separating punctuation by passing a string, e.g., split_at_punctuation='!"&'()*+,.:;<=>?@[]^`{|}~'. Unless you set expand_ligatures=False, ligatures such as ﬁ will be expanded into their constituent letters (e.g., fi).
.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs) Experimental feature that returns a list of dictionaries representing the lines of text on the page. The strip parameter works analogously to Python's str.strip() method, and returns text attributes without their surrounding whitespace. (Only relevant when layout = True.) Setting return_chars to False will exclude the individual character objects from the returned text-line dicts. The remaining **kwargs are those you would pass to .extract_text(layout=True, ...).
.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs) Experimental feature that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. pattern can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If regex is False, the pattern is treated as a non-regex string. If case is False, the search is performed in a case-insensitive manner. Setting main_group restricts the results to a specific regex group within the pattern (the default of 0 means the entire match). Setting return_groups and/or return_chars to False will exclude the lists of the matched regex groups and/or characters from being added (as "groups" and "chars") to the return dicts. The layout parameter operates as it does for .extract_text(...). The remaining **kwargs are those you would pass to .extract_text(layout=True, ...). Note: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page.
.dedupe_chars(tolerance=1) Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within tolerance x/y) as other characters — removed. (See Issue #71 to understand the motivation.)
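Putting a few of these together, a minimal sketch (file path, tolerances, and search pattern are illustrative; it assumes the page contains at least one word):

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]

    text = page.extract_text(x_tolerance=1)
    words = page.extract_words(keep_blank_chars=True, extra_attrs=["size"])
    hits = page.search(r"Total\s+\d+", regex=True, case=False)

    print(text[:100])
    print(words[0]["text"], words[0]["x0"], words[0]["top"], words[0]["size"])
    print(len(hits))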

Extracting tables

pdfplumber's approach to table detection borrows heavily from Anssi Nurminen's master's thesis, and is inspired by Tabula. It works like this:

  1. For any given PDF page, find the lines that are (a) explicitly defined and/or (b) implied by the alignment of words on the page.
  2. Merge overlapping, or nearly-overlapping, lines.
  3. Find the intersections of all those lines.
  4. Find the most granular set of rectangles (i.e., cells) that use these intersections as their vertices.
  5. Group contiguous cells into tables.
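You can inspect the intermediate products of these steps via the TableFinder returned by .debug_tablefinder(...), described below; a minimal sketch:

finder = page.debug_tablefinder()
print(len(finder.edges), len(finder.intersections), len(finder.cells), len(finder.tables))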

Table-extraction methods

pdfplumber.Page objects can call the following table methods:

Method Description
.find_tables(table_settings={}) Returns a list of Table objects. The Table object provides access to the .cells, .rows, and .bbox properties, as well as the .extract(x_tolerance=3, y_tolerance=3) method.
.find_table(table_settings={}) Similar to .find_tables(...), but returns the largest table on the page, as a Table object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.
.extract_tables(table_settings={}) Returns the text extracted from all tables found on the page, represented as a list of lists of lists, with the structure table -> row -> cell.
.extract_table(table_settings={}) Returns the text extracted from the largest table on the page (see .find_table(...) above), represented as a list of lists, with the structure row -> cell.
.debug_tablefinder(table_settings={}) Returns an instance of the TableFinder class, with access to the .edges, .intersections, .cells, and .tables properties.

For example:

pdf = pdfplumber.open("path/to/my.pdf")
page = pdf.pages[0]
page.extract_table()

Click here for a more detailed example.
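To work with the Table objects directly, a sketch (it assumes the page contains at least one table):

tables = page.find_tables()
if tables:
    table = tables[0]
    print(table.bbox)                      # bounding box of the table
    print(len(table.rows), len(table.cells))
    print(table.extract(x_tolerance=3))    # list of lists: row -> cell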

Table-extraction settings

By default, extract_tables uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via the table_settings argument. The possible settings, and their defaults:

{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "text_*": …, # See below
}
Setting Description
"vertical_strategy" Either "lines", "lines_strict", "text", or "explicit". See explanation below.
"horizontal_strategy" Either "lines", "lines_strict", "text", or "explicit". See explanation below.
"explicit_vertical_lines" A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the x coordinate of a line the full height of the page — or line/rect/curve objects.
"explicit_horizontal_lines" A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the y coordinate of a line the full height of the page — or line/rect/curve objects.
"snap_tolerance", "snap_x_tolerance", "snap_y_tolerance" Parallel lines within snap_tolerance points will be "snapped" to the same horizontal or vertical position.
"join_tolerance", "join_x_tolerance", "join_y_tolerance" Line segments on the same infinite line, and whose ends are within join_tolerance of one another, will be "joined" into a single line segment.
"edge_min_length" Edges shorter than edge_min_length will be discarded before attempting to reconstruct the table.
"min_words_vertical" When using "vertical_strategy": "text", at least min_words_vertical words must share the same alignment.
"min_words_horizontal" When using "horizontal_strategy": "text", at least min_words_horizontal words must share the same alignment.
"intersection_tolerance", "intersection_x_tolerance", "intersection_y_tolerance" When combining edges into cells, orthogonal edges must be within intersection_tolerance points to be considered intersecting.
"text_*" All settings prefixed with text_ are then used when extracting text from each discovered table. All possible arguments to Page.extract_text(...) are also valid here.
"text_x_tolerance", "text_y_tolerance" These text_-prefixed settings also apply to the table-identification algorithm when the text strategy is used. I.e., when that algorithm searches for words, it will expect the individual letters in each word to be no more than text_x_tolerance/text_y_tolerance points apart.

Table-extraction strategies

Both vertical_strategy and horizontal_strategy accept the following options:

Strategy Description
"lines" Use the page's graphical lines — including the sides of rectangle objects — as the borders of potential table-cells.
"lines_strict" Use the page's graphical lines — but not the sides of rectangle objects — as the borders of potential table-cells.
"text" For vertical_strategy: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For horizontal_strategy, the same but using the tops of words.
"explicit" Only use the lines explicitly defined in explicit_vertical_lines / explicit_horizontal_lines.

Notes

  • Often it's helpful to crop a page — Page.crop(bounding_box) — before trying to extract the table.

  • Table extraction for pdfplumber was radically redesigned for v0.5.0, and introduced breaking changes.

Extracting form values

Sometimes PDF files can contain forms that include inputs that people can fill out and save. While values in form fields appear like other text in a PDF file, form data is handled differently. If you want the gory details, see page 671 of this specification.

pdfplumber doesn't have an interface for working with form data, but you can access it using pdfplumber's wrappers around pdfminer.

For example, this snippet will retrieve form field names and values and store them in a list:

import pdfplumber
from pdfplumber.utils.pdfinternals import resolve_and_decode, resolve

pdf = pdfplumber.open("document_with_form.pdf")

def parse_field_helper(form_data, field, prefix=None):
    """ appends any PDF AcroForm field/value pairs in `field` to provided `form_data` list

        if `field` has child fields, those will be parsed recursively.
    """
    resolved_field = field.resolve()
    field_name = '.'.join(filter(lambda x: x, [prefix, resolve_and_decode(resolved_field.get("T"))]))
    if "Kids" in resolved_field:
        for kid_field in resolved_field["Kids"]:
            parse_field_helper(form_data, kid_field, prefix=field_name)
    if "T" in resolved_field or "TU" in resolved_field:
        # "T" is a field-name, but it's sometimes absent.
        # "TU" is the "alternate field name" and is often more human-readable
        # your PDF may have one, the other, or both.
        alternate_field_name  = resolve_and_decode(resolved_field.get("TU")) if resolved_field.get("TU") else None
        field_value = resolve_and_decode(resolved_field["V"]) if 'V' in resolved_field else None
        form_data.append([field_name, alternate_field_name, field_value])


form_data = []
fields = resolve(resolve(pdf.doc.catalog["AcroForm"])["Fields"])
for field in fields:
    parse_field_helper(form_data, field)

Once you run this script, form_data is a list containing a three-element list for each form field. For instance, a PDF form with city and state fields might produce output like this:

[
 ['STATE.0', 'enter STATE', 'CA'],
 ['section 2  accident infoRmation.1.0',
  'enter city of accident',
  'SAN FRANCISCO']
]

Thanks to @jeremybmerrill for helping to maintain the form-parsing code above.

Demonstrations

Comparison to other libraries

Several other Python libraries help users to extract information from PDFs. As a broad overview, pdfplumber distinguishes itself from other PDF processing libraries by combining these features:

  • Easy access to detailed information about each PDF object
  • Higher-level, customizable methods for extracting text and tables
  • Tightly integrated visual debugging
  • Other useful utility functions, such as filtering objects via a crop-box

It's also helpful to know what features pdfplumber does not provide:

  • PDF generation
  • PDF modification
  • Optical character recognition (OCR)
  • Strong support for extracting tables from OCR'ed documents

Specific comparisons

  • pdfminer.six provides the foundation for pdfplumber. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.

  • PyPDF2 is a pure-Python library "capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files." It can extract page text, but does not provide easy access to shape objects (rectangles, lines, etc.), table extraction, or visual debugging tools.

  • pymupdf is substantially faster than pdfminer.six (and thus also pdfplumber) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools.

  • camelot, tabula-py, and pdftables all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract.

Acknowledgments / Contributors

Many thanks to the users listed under Contributors below, who've contributed ideas, features, and fixes.

Contributing

Pull requests are welcome, but please submit a proposal issue first, as the library is in active development.

Current maintainers: Jeremy Singer-Vine (@jsvine) and Samkit Jain (@samkit-jain).

Contributors

afriedman412, alexreg, arlyon, asafh, augeos-grosso, austinzy, boblannon-picwell, cheungpat, conitrade-as, danshorstein, dependabot[bot], dhdaines, fristhon, hussainshaikh12, idan-david, jeremybmerrill, jhonatan-lopes, jsfenfen, jsvine, jwilk, lolipopshock, meldonization, oisinmoran, prilkop, ritchiep, samkit-jain, samyak24jain, weartist, yevgnen, yweweler

pdfplumber's Issues

collate chars w/ slight doctop variations

Collate_chars' sort assumes that colinear words have exactly the same doctop. In the wild, there are sometimes variations on this--especially for fillable PDFs (at least I think that's what I'm seeing). I hacked around this with words by mucking with precision, but the variations here are much bigger--1 or 2 pixels.

I'm hacking around this with an optional snap_to_y_grid arg, and something like:

if snap_to_y_grid:
    snap_to_grid = lambda x: round(x/float(snap_to_y_grid))
    chars["doctop"] = chars["doctop"].apply(snap_to_grid)

This alters the underlying dataframe values (maybe?)--which I assume would require a copy or something, or setting a different value. Any thoughts?

Space character missing when using extract_tables()

I am using the following settings when extracting tables from a PDF:

'table_settings': {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_x_tolerance": 15,
}

When using extract_text(x_tolerance=0, y_tolerance=0) on the Page object it preserves the spacing in the PDF file. But, when using extract_tables() with the table_settings mentioned above, no spacing is preserved. It is able to correctly extract the table but there are no spaces in any of the row's text.

For example, the output of extract_text:
PAYPAL POS DEBIT (correct)

The output of extract_tables:
PAYPALPOSDEBIT (incorrect)

'text': '(cid:0)' instead of character

>>> with pdfplumber.open("/Users/edwin/1.pdf") as pdf:
...     first_page = pdf.pages[0]
...     print(first_page.chars[0])
... 

{'adv': Decimal('15.975'), 'fontname': 'SRPUEP+SimSun', 'doctop': Decimal('8.092'), 'y1': Decimal('411.158'), 'bottom': Decimal('23.061'), 'text': '(cid:0)', 'top': Decimal('8.092'), 'object_type': 'char', 'height': Decimal('14.969'), 'width': Decimal('15.975'), 'page_number': 1, 'upright': True, 'y0': Decimal('396.189'), 'x0': Decimal('147.400'), 'x1': Decimal('163.375'), 'size': Decimal('14.969')}

pdf is https://github.com/clear-datacenter/plan/files/524831/1.pdf.zip
you can download and just remove the .zip in the filename to get pdf file

write cropped document to separate pdf

Hi, thanks for the project.
This is the best library so far for working with PDFs, but I couldn't find anything for writing a cropped PDF document to another PDF. How can I write a cropped PDF to a new document? And is the scale for CropBox (or crop) the same as that of the doctop, x0, and bottom of a char extracted from the PDF (i.e., ".chars" of the pdfplumber.Page class)?

Multirows in one Cell

Let's say a table in a PDF contains two columns, so a simple key/value pair can be formed. But in the second column, a few cells have multiple lines (say 5, with the key in the middle row). Then pdfplumber prints [None, value] two times, then prints [key, value] because the key for this row is at position 3, then again prints [None, value] two times. We can scan for the presence of None, join those rows, and use them as the value for this key. If multiple such rows occur, we can still manage. The problem is when two or more such rows are adjacent to each other; then finding the end line of one and the beginning of the other is difficult. The length of content also differs per table, so we can't use a line counter.

TypeError: startswith first arg must be str or a tuple of str, not bytes (Python 3)

With a lot of different PDF documents I get the following error in Python 3:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-35be4098fc3e> in <module>()
----> 1 pdfp.open(files[0])

/usr/local/lib/python3.5/dist-packages/pdfplumber/pdf.py in open(cls, path, **kwargs)
     38     @classmethod
     39     def open(cls, path, **kwargs):
---> 40         return cls(open(path, "rb"), **kwargs)
     41 
     42     def process_page(self, page):

/usr/local/lib/python3.5/dist-packages/pdfplumber/pdf.py in __init__(self, stream, pages, laparams, precision)
     30                 self.metadata[k] = list(map(decode_text, v))
     31             elif isinstance(v, PSLiteral):
---> 32                 self.metadata[k] = decode_text(v.name)
     33             else:
     34                 self.metadata[k] = decode_text(v)

/usr/local/lib/python3.5/dist-packages/pdfplumber/utils.py in decode_text(s)
     63     Adds py3 compatability to pdfminer's version.
     64     """
---> 65     if s.startswith(b'\xfe\xff'):
     66         return six.text_type(s[2:], 'utf-16be', 'ignore')
     67     else:

TypeError: startswith first arg must be str or a tuple of str, not bytes

However, I don't get this error in Python 2, so I think it is a compatibility issue.

checkboxes / LTRects + LTCurves

I've been messing with a pdf that uses checkboxes that are built with an LTRect and (if checked) two LTCurves (see below; objs[44] is the rectangle and 45 and 46 are used to make the lines of an X). (I was expecting that the checkboxes would be embedded as images and it took me a while to realize they weren't--though having them as rect/curves certainly saves space.)

Not really sure how common this is--but figuring out if they are checked is critical to making sense of the form.

I'm hacking this out with PDF miner -- but wondered if you had thoughts on making pdfplumber able to extract this. I think earlier versions supported more types of objects. Although the 'pts' show each point of a curve, I don't actually need them here--just testing if there's an LTCurve inside the LTRect is equivalent to whether the box is checked. I dunno.

>>>page0._objs[44].__class__
<class 'pdfminer.layout.LTRect'>
>>> page0._objs[44].__dict__
{'linewidth': 0.75, 'height': 9.20999999999998, 'width': 9.210000000000008, 'bbox': (78.72, 263.22, 87.93, 272.43), 'y1': 272.43, 'y0': 263.22, 'x0': 78.72, 'x1': 87.93, 'pts': [(87.93, 272.43), (78.72, 272.43), (78.72, 263.22), (87.93, 263.22)]}
>>> page0._objs[45].__class__
<class 'pdfminer.layout.LTCurve'>
>>> page0._objs[45].__dict__
{'linewidth': 0.51, 'height': 9.449999999999989, 'width': 9.450000000000003, 'bbox': (78.6, 263.1, 88.05, 272.55), 'y1': 272.55, 'y0': 263.1, 'x0': 78.6, 'x1': 88.05, 'pts': [(78.6, 272.55), (88.05, 263.1)]}
>>> page0._objs[46].__class__
<class 'pdfminer.layout.LTCurve'>
>>> page0._objs[46].__dict__
{'linewidth': 0.51, 'height': 9.449999999999989, 'width': 9.450000000000003, 'bbox': (78.6, 263.1, 88.05, 272.55), 'y1': 272.55, 'y0': 263.1, 'x0': 78.6, 'x1': 88.05, 'pts': [(88.05, 272.55), (78.6, 263.1)]}

Unable to extract table or text

Hi, I have been trying to extract tables using the extract_tables function which was working well until I updated to the newer version.

Most functions are now returning the same error:
ValueError: Cannot convert None to Decimal.

This error occurred when I tried the functions extract_table, extract_tables, find_tables, extract_text and extract_words. I have not changed the table settings from the default. The pdf I tried this on was https://github.com/jsvine/pdfplumber/blob/master/examples/pdfs/ca-warn-report.pdf

Please let me know what may be causing this error and how it can be worked around.

Extracting filled polygons and saving as new pdf

My task is to separate out straight lines, filled polygons, and text into different PDFs for further analysis.
I have been successful in extracting straight lines by using '_page.edges'. I presume edges are the ones with x0=x1 or y0=y1. Now the filled polygons are to be saved into separate PDFs. Is it possible to separate out the filled polygons and text?
Sample.pdf
Also, I noticed that out of the 5 texts available in the attachment, pdfminer could extract only two correctly. In one particular case, 8524 is extracted as 8254. The degree symbol ° is getting extracted as (cid:176). For this reason, I am thinking of separating out the text and making use of OCR.

PDF fails to init when metadata includes a PostScript literal

When calling pdfplumber.open(), the PDF.__init__ method crashes if the metadata includes a value that is a PSLiteral, rather than a string.

     38     @classmethod
     39     def open(cls, path, **kwargs):
---> 40         return cls(open(path, "rb"), **kwargs)
     41 
     42     def process_page(self, page):

pdfplumber/pdfplumber/pdf.py in __init__(self, stream, pages, laparams, precision)
     31             else:
---> 32                 self.metadata[k] = decode_text(v)
     33         self.device = PDFPageAggregator(rsrcmgr, laparams=self.laparams)
     34         self.interpreter = PDFPageInterpreter(rsrcmgr, self.device)

/pdfplumber/pdfplumber/utils.pyc in decode_text(s)
     40     Adds py3 compatability to pdfminer's version.
     41     """
---> 42     if s.startswith(b'\xfe\xff'):
     43         return six.text_type(s[2:], 'utf-16be', 'ignore')
     44     else:

AttributeError: 'PSLiteral' object has no attribute 'startswith'

word-level font names and heights

Having a font for an entire word helps parsing. A lot. Height also helps some.

I took a crack at this here, with some settings. Defaults also may need adjustment.

If you've got thoughts, @jsvine, lemme know and I can clean this up into a pr. Haven't gotten the testing set up yet.

jsfenfen@847a3bb

Requested setting PDF_MINER_IS_STRICT

When I used the module in Python 3.6, an error occurred:
ImproperlyConfigured: Requested setting PDF_MINER_IS_STRICT, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.

KeyError raised when pdf image object has no bounding box information

image object

{
    'srcsize': (Decimal('210'), Decimal('198')), 
    'height': Decimal('148.600'),
    'object_type': 'image', 
    'bits': Decimal('8'), 
    'width': Decimal('157.550')
}

Error Msg.

a = pdf_page.crop((0, table.bbox[1], page_width, table.bbox[3]))
a.images
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/container.py", line 27, in images
    return self.objects.get("image", [])
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/page.py", line 174, in objects
    self.bbox)
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/utils.py", line 317, in crop_to_bbox
    for k,v in objs.items())
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/utils.py", line 317, in <genexpr>
    for k,v in objs.items())
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/utils.py", line 321, in crop_to_bbox
    scores = n_points_intersecting_bbox(objs, bbox)
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/utils.py", line 281, in n_points_intersecting_bbox
    return list(scores)
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/utils.py", line 280, in <genexpr>
    scores = (obj_inside_bbox_score(obj, bbox) for obj in objs)
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/utils.py", line 230, in obj_inside_bbox_score
    (obj["x0"], obj["top"]),
KeyError: 'x0'

Units for x, y, top, bottom etc?

Thanks so much for this project; I can't for the life of me find another library in Python that has anything similar to your crop / within_bbox feature.

My question is, what are the units on the values of x0, y0, top and bottom? The "boxes" in pdfs like CropBox, MediaBox, TrimBox are all measured from the bottom left of the page and in the standard units of 1 inch=72 points.

I'm trying to use the CropBox of a pdf with the within_bbox feature but can't seem to find any correlation between the units. I realize that pdfplumber has origin (0,0) set to top left but I still can't figure out the size of the units.

Thanks.

Flexibility x and y axis for snap tolerance.

Hi @jsvine ,

I was working with a table where I needed additional snap_tolerance flexibility on the y-axis relative to the x-axis. Therefore, I added snap_x_tolerance and snap_y_tolerance parameters in the same fashion as text_tolerance and intersection_tolerance. I went ahead and submitted a pull request.

Thanks,
Dustin

Error while getting content of crop box

Getting the following error while trying to extract text from a CroppedPage:

Traceback (most recent call last):
  File "D:/workbench/topcoder/cioms-pdf-json/main.py", line 27, in <module>
    print len(box_obj.chars)
  File "C:\Python27\lib\site-packages\pdfplumber\container.py", line 35, in chars
    return self.objects.get("char", [])
  File "C:\Python27\lib\site-packages\pdfplumber\page.py", line 185, in objects
    self.bbox)
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 302, in within_bbox
    for k,v in objs.items())
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 302, in <genexpr>
    for k,v in objs.items())
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 306, in within_bbox
    scores = n_points_intersecting_bbox(objs, bbox)
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 282, in n_points_intersecting_bbox
    return list(scores)
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 281, in <genexpr>
    scores = (obj_inside_bbox_score(obj, bbox) for obj in objs)
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 236, in obj_inside_bbox_score
    score = sum(point_inside_bbox(c, bbox) for c in corners)
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 236, in <genexpr>
    score = sum(point_inside_bbox(c, bbox) for c in corners)
  File "C:\Python27\lib\site-packages\pdfplumber\utils.py", line 226, in point_inside_bbox
    bx0, by0, bx1, by1 = map(decimalize, bbox)
ValueError: need more than 2 values to unpack

Error open PDF

Hi
I installed pdfplumber and ran the first basic example:

import pdfplumber

with pdfplumber.open("C:\Users\office\Desktop\Elasticsearch data\background checks.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

and got an error:
Message File Name Line Position
Traceback
C:\Users\office\Desktop\webpy\ES\Parse PDF to CSV.py 13
open C:\Python27\lib\site-packages\pdfplumber\pdf.py 40
IOError: [Errno 22] invalid mode ('rb') or filename: 'C:\Users\office\Desktop\Elasticsearch data\x08ackground checks.pdf'

Please advise what is wrong?
Tal

extract words drops chars detected after delta x > x_tolerance

Collate words doesn't ever add characters it finds after a delta x greater than tolerance (for a given line of words). In the wild the word 'Klamath' with spaces greater than tolerance was returned as four separate words: ['K','a','a','h'].

Fix is this; I can submit it as a pr, just don't wanna double up since I just sent another for std.out.

jsfenfen@7cf92f9

please take into consideration that 'Rotate' information could be a ref object

>>> pdf = pdfplumber.open(fpath)
>>> page_num = len(pdf.pages)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/pdf.py", line 58, in pages
    p = Page(self, page, page_number=page_number, initial_doctop=doctop)
  File "~/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pdfplumber/page.py", line 19, in __init__
    self.rotation = self.page_obj.attrs.get("Rotate", 0) % 360
TypeError: unsupported operand type(s) for %: 'PDFObjRef' and 'int'

Text Box object

pdfplumber has character level information.

But I'm more interested in a text box - essentially a word - that also has font information.

How can that be achieved?

LTTextLine?

This is more of a question than an issue: do you plan to make PDFMiner's LTTextLine and LTTextBox objects / hierarchy available through pdfplumber? Or is the point of this to make only the raw character positions available so that one can roll one's own layout analysis without PDFMiner getting in the way at all?

Bug in utils.py

When I attempt to open this file using pdfplumber.open(filepath), I get the following error:

File "/Users/Jeff/anaconda/envs/py27/lib/python2.7/site-packages/pdfplumber/utils.py", line 42, in decode_text
    if s.startswith(b'\xfe\xff'):
AttributeError: 'list' object has no attribute 'startswith'

Seems like a list is being passed in when a string is expected?

pdfplumber relies on dead dependency pycrypto

Hi,
I am really interested in trying out pdfplumber on my Windows machine, but I cannot install it with pip, because it fails when trying to install the dependency pycrypto. From what I have found, that project is dead and should be switched to pycryptodome.
Hope this will be fixed.

test cases

Greetings,
I'm trying to extract some data from 1.pdf from http://staff.icar.cnr.it/ruffolo/pdf-trex.htm1 and got the following traceback:

---------------------------------------------------------------------------
PSTypeError                               Traceback (most recent call last)
<ipython-input-24-8d8a1308bd34> in <module>()
      3 
      4 im = p0.to_image()
----> 5 p0.figures

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfplumber/container.py in figures(self)
     29     @property
     30     def figures(self):
---> 31         return self.objects.get("figure", [])
     32 
     33     @property

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfplumber/page.py in objects(self)
     63     def objects(self):
     64         if hasattr(self, "_objects"): return self._objects
---> 65         self._objects = self.parse_objects()
     66         return self._objects
     67 

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfplumber/page.py in parse_objects(self)
    125                     process_object(child)
    126 
--> 127         for obj in self.layout._objs:
    128             process_object(obj)
    129 

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfplumber/page.py in layout(self)
     57     def layout(self):
     58         if hasattr(self, "_layout"): return self._layout
---> 59         self._layout = self.pdf.process_page(self.page_obj)
     60         return self._layout
     61 

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfplumber/pdf.py in process_page(self, page)
     41 
     42     def process_page(self, page):
---> 43         self.interpreter.process_page(page)
     44         return self.device.get_result()
     45 

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py in process_page(self, page)
    832             ctm = (1, 0, 0, 1, -x0, -y0)
    833         self.device.begin_page(page, ctm)
--> 834         self.render_contents(page.resources, page.contents, ctm=ctm)
    835         self.device.end_page(page)
    836         return

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py in render_contents(self, resources, streams, ctm)
    842         logging.info('render_contents: resources=%r, streams=%r, ctm=%r',
    843                      resources, streams, ctm)
--> 844         self.init_resources(resources)
    845         self.init_state(ctm)
    846         self.execute(list_value(streams))

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py in init_resources(self, resources)
    348                         objid = spec.objid
    349                     spec = dict_value(spec)
--> 350                     self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
    351             elif k == 'ColorSpace':
    352                 for (csid, spec) in six.iteritems(dict_value(v)):

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py in get_font(self, objid, spec)
    183             elif subtype == 'TrueType':
    184                 # TrueType Font
--> 185                 font = PDFTrueTypeFont(self, spec)
    186             elif subtype == 'Type3':
    187                 # Type3 Font

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/pdffont.py in __init__(self, rsrcmgr, spec)
    586             widths = list_value(spec.get('Widths', [0]*256))
    587             widths = dict((i+firstchar, w) for (i, w) in enumerate(widths))
--> 588         PDFSimpleFont.__init__(self, descriptor, widths, spec)
    589         if 'Encoding' not in spec and 'FontFile' in descriptor:
    590             # try to recover the missing encoding info from the font file.

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/pdffont.py in __init__(self, descriptor, widths, spec)
    552             strm = stream_value(spec['ToUnicode'])
    553             self.unicode_map = FileUnicodeMap()
--> 554             CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()
    555         PDFFont.__init__(self, descriptor, widths)
    556         return

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/cmapdb.py in run(self)
    281     def run(self):
    282         try:
--> 283             self.nextobject()
    284         except PSEOF:
    285             pass

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/psparser.py in nextobject(self)
    627             elif isinstance(token,PSKeyword):
    628                 logging.debug('do_keyword: pos=%r, token=%r, stack=%r', pos, token, self.curstack)
--> 629                 self.do_keyword(pos, token)
    630             else:
    631                 logging.error('unknown token: pos=%r, token=%r, stack=%r', pos, token, self.curstack)

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/cmapdb.py in do_keyword(self, pos, token)
    317             try:
    318                 ((_, k), (_, v)) = self.pop(2)
--> 319                 self.cmap.set_attr(literal_name(k), v)
    320             except PSSyntaxError:
    321                 pass

/home/iurie/anaconda3/lib/python3.6/site-packages/pdfminer/psparser.py in literal_name(x)
    141     if not isinstance(x, PSLiteral):
    142         if STRICT:
--> 143             raise PSTypeError('Literal required: %r' % x)
    144         else:
    145             name=x

PSTypeError: Literal required: /b'CIDSystemInfo

It would be nice if it could pass all those tests.

Thank you

non-sequential pageid

I processed a 2-page document and was surprised that the pageids were '1' and '20'. Until now they'd always been sequential. Is this expected? Using version 0.4.3 on Python 2.7.11 and this file.

OSError: Cannot load native module 'Crypto.Cipher._raw_ecb'

**The stacktrace is**

Traceback (most recent call last):
  File "tongchengVocationSpider.py", line 67, in <module>
  File "c:\python27\Lib\site-packages\PyInstaller\loader\pyimod03_importers.py", line 389, in load_module
    exec(bytecode, module.__dict__)
  File "pdfplumber\__init__.py", line 1, in <module>
  File "c:\python27\Lib\site-packages\PyInstaller\loader\pyimod03_importers.py", line 389, in load_module
    exec(bytecode, module.__dict__)
  File "pdfplumber\pdf.py", line 6, in <module>
  File "c:\python27\Lib\site-packages\PyInstaller\loader\pyimod03_importers.py", line 389, in load_module
    exec(bytecode, module.__dict__)
  File "pdfminer\pdfdocument.py", line 12, in <module>
  File "c:\python27\Lib\site-packages\PyInstaller\loader\pyimod03_importers.py", line 389, in load_module
    exec(bytecode, module.__dict__)
  File "Crypto\Cipher\__init__.py", line 3, in <module>
  File "c:\python27\Lib\site-packages\PyInstaller\loader\pyimod03_importers.py", line 389, in load_module
    exec(bytecode, module.__dict__)
  File "Crypto\Cipher\_mode_ecb.py", line 46, in <module>
  File "Crypto\Util\_raw_api.py", line 258, in load_pycryptodome_raw_lib
OSError: Cannot load native module 'Crypto.Cipher._raw_ecb': Trying '_raw_ecb.pyd': cannot load library G:\WORKSP1\QUNAER1\qunaer\qunaer\qunaer\dist\TONGCH1\Crypto\Util\..\Cipher\_raw_ecb.pyd: error 0x7e. Additionally, ctypes.util.find_library() did not manage to locate a library called 'G:\WORKSP1\QUNAER1\qunaer\qunaer\qunaer\dist\TONGCH1\Crypto\Util\..\Cipher\_raw_ecb.pyd'
Failed to execute script tongchengVocationSpider

My Python is 2.7.14 and my OS is Windows 10 64-bit.

Heuristic based title detection

Hi, I saw that you perform PDF table detection using an algorithm from the Nurminen thesis.

Just like tables, one of the things that you can extract from a PDF is its title.

Some PDFs have the "title" metadata field, but more often than not, it contains the wrong text.

KDEFileMetadata (written in C++) is one of the projects that used some heuristics to guess the title. You can have a look at the code here.

A feature like this is useful for detecting the titles of research papers, which could then be used to rename the PDF files!

So from a filename like gfs-sosp2003.pdf you can get Google File System - 2003.pdf
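For anyone exploring this, here is a minimal sketch of one possible heuristic (not a pdfplumber feature), assuming a pdfplumber version whose extract_words() accepts the extra_attrs parameter: treat the largest-font text on the first page as the title.

import pdfplumber

def guess_title(path):
    # Heuristic only: the title is often the text set in the largest font
    # on the first page. Real-world PDFs will need more rules than this.
    with pdfplumber.open(path) as pdf:
        words = pdf.pages[0].extract_words(extra_attrs=["size"])
        if not words:
            return None
        max_size = max(w["size"] for w in words)
        # Keep words whose font size is within one point of the maximum.
        return " ".join(w["text"] for w in words if max_size - w["size"] <= 1)

print(guess_title("gfs-sosp2003.pdf"))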

ValueError initializing pages

I get a ValueError: Cannot convert <PDFObjRef:4> to Decimal. when accessing the pages of this pdf with pdfplumber==0.5.5.

I notice the pdfplumber.open call seems to run much quicker on this file compared to other files that don't raise this error.

pdf = pdfplumber.open('Hays TX 11-8-2016+hays+county+total+canvass.pdf')
pdf.pages

⬇️

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-bc19284d68b8> in <module>()
      1 pdf = pdfplumber.open('Hays TX 11-8-2016+hays+county+total+canvass.pdf')
----> 2 pdf.pages

/anaconda3/lib/python3.6/site-packages/pdfplumber/pdf.py in pages(self)
     54             page_number = i+1
     55             if pp != None and page_number not in pp: continue
---> 56             p = Page(self, page, page_number=page_number, initial_doctop=doctop)
     57             self._pages.append(p)
     58             doctop += p.height

/anaconda3/lib/python3.6/site-packages/pdfplumber/page.py in __init__(self, pdf, page_obj, page_number, initial_doctop)
     21 
     22         cropbox = page_obj.attrs.get("CropBox", page_obj.attrs.get("MediaBox"))
---> 23         self.cropbox = self.decimalize(cropbox)
     24 
     25         if self.rotation in [ 90, 270 ]:

/anaconda3/lib/python3.6/site-packages/pdfplumber/page.py in decimalize(self, x)
     39 
     40     def decimalize(self, x):
---> 41         return utils.decimalize(x, self.pdf.precision)
     42 
     43     @property

/anaconda3/lib/python3.6/site-packages/pdfplumber/utils.py in decimalize(v, q)
     87             return Decimal(repr(v))
     88     else:
---> 89         raise ValueError("Cannot convert {0} to Decimal.".format(v))
     90 
     91 def is_dataframe(collection):

ValueError: Cannot convert <PDFObjRef:4> to Decimal.

ValueError: Cannot convert None to Decimal.

I'm having the same problem that this previous thread documented (not resolved there). I'm running the following code, but get the same error with different PDFs.

pdf = pdfplumber.open("ca-warn-report.pdf") print(pdf.pages[0].extract_table())

I'm on a Mac running OS 10.11.6 and Python 3.6.

Any help would be appreciated!

Pycrypto installation conflict with python3.6

Hi, thanks for the excellent package. I am working on a project to perform feature extraction on PDF files, and recently I decided to move my code base from tabula-py to pdfplumber.

My CI service (Jenkins) threw an error while running tests, with the following message:

Running setup.py install for pycrypto: finished with status 'error'

After a bit of investigation I realized that pycrypto has a conflict when installing under Python 3.6. The recommended alternative seems to be pycryptodome, and I was wondering if it would be possible to modify the base requirements here:

base_reqs = [

Thanks and appreciate the excellent work!

Is it a bug?

I ran into this when trying to open a PDF:

Traceback (most recent call last):
  File "D:/P4/y/y/pledge_extraction/main.py", line 49, in read_pdf2
    with pdfplumber.open(file) as pdf:
  File "C:\Users\yyq\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pdfplumber\pdf.py", line 40, in open
    return cls(open(path, "rb"), **kwargs)
  File "C:\Users\yyq\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pdfplumber\pdf.py", line 34, in __init__
    self.metadata[k] = decode_text(v)
  File "C:\Users\yyq\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pdfplumber\utils.py", line 68, in decode_text
    ords = (ord(c) if type(c) == str else c for c in s)
TypeError: 'bool' object is not iterable

It seems that pdfplumber isn't correctly handling boolean values in documents' metadata.
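As an illustration of the kind of guard the metadata loop needs, here is a rough sketch of a defensive wrapper around decode_text (the helper shown in the traceback), assuming it is importable from pdfplumber.utils; non-string values such as booleans are passed through untouched:

from pdfplumber.utils import decode_text

def safe_decode(value):
    # Booleans and numbers in the Info dictionary can't be decoded as text;
    # pass them through unchanged instead of letting decode_text() iterate them.
    if isinstance(value, (bool, int, float)):
        return value
    try:
        return decode_text(value)
    except TypeError:
        return value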

Best approach for text that spans columns?

I'm trying pdfplumber for the first time. I really like it!

Here's my question. I deal with a lot of PDFs like this one. I have figured out how to get pdfplumber to identify the curved boxes, crop the page to each of those boxes, and then run extract_table() on those crops.

It's working great, with one exception. The first line (or lines) is essentially a title/header that spans the columns of the table, so extract_table() will break the first line(s) at each column. E.g.:

[u'MERAMEC R-III SCHOOL', u'BOARD', u'MEMBER', None]

Is there a recommended approach for dealing with content that spans columns, or should I just address it in the logic of my parser?
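One possible approach, sketched here with purely illustrative coordinates (this is not a built-in feature): split each detected box into a title strip and a table strip, run extract_text() on the first and extract_table() on the second.

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:   # hypothetical filename
    page = pdf.pages[0]
    # Placeholder bounds; in practice these come from the curved-box detection.
    x0, top, x1, bottom = 50, 100, 550, 300
    title_height = 14  # roughly one line of text
    title = page.crop((x0, top, x1, top + title_height)).extract_text()
    table = page.crop((x0, top + title_height, x1, bottom)).extract_table()
    print(title)
    print(table)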

`to_image()` method does not take extra arguments

First of all, I'd just like to thank you so much for this incredible tool!

In the README it is stated that the to_image() method takes conversion_kwargs and links to this, which has parameters such as background, yet the only parameter it actually seems to take is resolution. This leads to errors such as to_image() got an unexpected keyword argument 'background' when attempting to change the default background. Am I doing something wrong, or does the documentation not quite match the functionality? If so, is there a workaround? If not, I'd be more than happy to contribute and try to add this functionality.

README

Method Description
.to_image(**conversion_kwargs) Returns an instance of the PageImage class. For more details, see "Visual debugging" below. For conversion_kwargs, see here.

Function

    def to_image(self, resolution=None):
        """
        For conversion_kwargs, see http://docs.wand-py.org/en/latest/wand/image.html#wand.image.Image
        """
        from .display import PageImage, DEFAULT_RESOLUTION
        res = resolution or DEFAULT_RESOLUTION
        return PageImage(self, resolution=res)

Vertical text in tables

Hi, thanks for this project, it's very useful.

One question: can pdfplumber deal with vertical text in headers or data cells? pdfminer has an LTTextLineVertical object, but I cannot see the equivalent in pdfplumber. This leads me to a second question: why did you get rid of pdfminer's LTTextLine objects and keep only the character objects? I guess it's because the LTTextLine objects can span several cells? (pdfminer seems to lack word objects, which would be the right level to use for table extraction.)

Thanks.
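On the first question, one thing worth trying, sketched below (behavior depends on the pdfplumber and pdfminer.six versions in use), is to pass layout-analysis parameters at open time so that vertically-set text is detected, and then filter characters on their upright attribute:

import pdfplumber

with pdfplumber.open("vertical-headers.pdf", laparams={"detect_vertical": True}) as pdf:
    page = pdf.pages[0]
    # Characters from vertically-set text have upright == False.
    vertical = [c for c in page.chars if not c["upright"]]
    print("".join(c["text"] for c in vertical))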

Is there a way to get something similar to pdftotext's layout?

Is there an option similar to pdftotext's -layout flag, which "maintain[s] original physical layout"? I understand that's probably up to the pdfminer engine...

Here's what I mean:

Original PDF

http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf


pdftotext with -layout

$ curl -O http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf
$ pdftotext -f 1 -l 1 -layout 07-1315.pdf -

Output:

                        Official - Subject to Final Review


 1      IN THE SUPREME COURT OF THE UNITED STATES

 2   - - - - - - - - - - - - - - - - - x

 3   MICHAEL A. KNOWLES,                            :

 4   WARDEN,                                        :

 5              Petitioner                          :

 6         v.                                       :        No. 07-1315

 7   ALEXANDRE MIRZAYANCE.                          :

 8   - - - - - - - - - - - - - - - - - x

 9                              Washington, D.C.

10                              Tuesday, January 13, 2009

11

12                  The above-entitled matter came on for oral

13   argument before the Supreme Court of the United States

14   at 1:01 p.m.

15   APPEARANCES:

16   STEVEN E. MERCER, ESQ., Deputy Attorney General, Los

17     Angeles, Cal.; on behalf of the Petitioner.

18   CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf

19     of the Respondent.

20

21

22

23

24

25


                                        1

                           Alderson Reporting Company

pdfminer

import pdfplumber
import requests
fname = "/tmp/whatev.pdf"
resp = requests.get("http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf")
with open(fname, 'wb') as f:
      f.write(resp.content)
pdf = pdfplumber.open(fname)
print(pdf.pages[0].extract_text())

Output:

Official - Subject to Final Review 
1 IN THE SUPREME COURT OF THE UNITED STATES 
2 - - - - - - - - - - - - - - - - - x 
3 MICHAEL A. KNOWLES,               : 
4 WARDEN,                           :
5  Petitioner            :
6  v.                         :  No. 07-1315 
7 ALEXANDRE MIRZAYANCE.             : 
8 - - - - - - - - - - - - - - - - - x
9  Washington, D.C.
10  Tuesday, January 13, 2009
11
12  The above-entitled matter came on for oral 
13 argument before the Supreme Court of the United States 
14 at 1:01 p.m. 
15 APPEARANCES: 
16 STEVEN E. MERCER, ESQ., Deputy Attorney General, Los
17  Angeles, Cal.; on behalf of the Petitioner. 
18 CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf
19  of the Respondent. 
20
21
22
23
24
25
1


Alderson Reporting Company 
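Note that more recent pdfplumber releases add a layout-preserving mode, extract_text(layout=True), which approximates pdftotext's -layout behavior; availability depends on the installed version. A minimal sketch:

import pdfplumber

with pdfplumber.open("07-1315.pdf") as pdf:
    # layout=True attempts to preserve horizontal and vertical positioning.
    print(pdf.pages[0].extract_text(layout=True))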

pdfminer dependency chain?

Installing v0.3.1 on Python 2.7.11 using 'pip install pdfplumber', it freaked out about pdfminer.utils being missing. Maybe the most recent version of pdfminer.six (there's a new version as of 2/2) doesn't explicitly include pdfminer as a dependency? I didn't bother to run it down; I just added pdfminer by hand and it worked.

Collecting pdfplumber
Downloading pdfplumber-0.3.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/var/folders/46/sq_lbdj547v5mrx7xzfldg0h0000gn/T/pip-build-VgxS3_/pdfplumber/setup.py", line 5, in <module>
    version = __import__("pdfplumber").VERSION
  File "/private/var/folders/46/sq_lbdj547v5mrx7xzfldg0h0000gn/T/pip-build-VgxS3_/pdfplumber/pdfplumber/__init__.py", line 1, in <module>
    from pdfplumber.pdf import PDF
  File "/private/var/folders/46/sq_lbdj547v5mrx7xzfldg0h0000gn/T/pip-build-VgxS3_/pdfplumber/pdfplumber/pdf.py", line 1, in <module>
    from pdfplumber.container import Container
  File "/private/var/folders/46/sq_lbdj547v5mrx7xzfldg0h0000gn/T/pip-build-VgxS3_/pdfplumber/pdfplumber/container.py", line 2, in <module>
    from pdfplumber import helpers, utils
  File "/private/var/folders/46/sq_lbdj547v5mrx7xzfldg0h0000gn/T/pip-build-VgxS3_/pdfplumber/pdfplumber/utils.py", line 2, in <module>
    from pdfminer.utils import PDFDocEncoding
ImportError: No module named pdfminer.utils

----------------------------------------

ValueError: Cannot convert <value> to Decimal.

I am encountering this problem when parsing the following source.

This problem occurs after upgrading from 0.5.6 to 0.5.7. Using 0.5.6 results in no errors.

Source example:

import pdfplumber

pdf = pdfplumber.open('2015-12-01.pdf')
for page in pdf.pages:
    header = page.extract_text().split('\n')[0]
    print(header)

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a3d25fbac20f> in <module>()
      3 pdf = pdfplumber.open('2015-12-01.pdf')
      4 for page in pdf.pages:
----> 5     header = page.extract_text().split('\n')[0]
      6     print(header)

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/page.py in extract_text(self, x_tolerance, y_tolerance)
    151         y_tolerance=utils.DEFAULT_Y_TOLERANCE):
    152 
--> 153         return utils.extract_text(self.chars,
    154             x_tolerance=x_tolerance,
    155             y_tolerance=y_tolerance)

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/container.py in chars(self)
     33     @property
     34     def chars(self):
---> 35         return self.objects.get("char", [])
     36 
     37     @property

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/page.py in objects(self)
     63     def objects(self):
     64         if hasattr(self, "_objects"): return self._objects
---> 65         self._objects = self.parse_objects()
     66         return self._objects
     67 

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/page.py in parse_objects(self)
    126 
    127         for obj in self.layout._objs:
--> 128             process_object(obj)
    129 
    130         return objects

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/page.py in process_object(obj)
     99         def process_object(obj):
    100             attr = dict((k, (v if (k in NON_DECIMALIZE or v == None) else d(v)))
--> 101                 for k, v in obj.__dict__.items()
    102                     if k not in IGNORE)
    103 

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/page.py in <genexpr>(.0)
    100             attr = dict((k, (v if (k in NON_DECIMALIZE or v == None) else d(v)))
    101                 for k, v in obj.__dict__.items()
--> 102                     if k not in IGNORE)
    103 
    104             kind = re.sub(lt_pat, "", obj.__class__.__name__).lower()

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/page.py in decimalize(self, x)
     44 
     45     def decimalize(self, x):
---> 46         return utils.decimalize(x, self.pdf.precision)
     47 
     48     @property

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/utils.py in decimalize(v, q)
     75     # If tuple/list passed, bulk-convert
     76     elif isinstance(v, (tuple, list)):
---> 77         return type(v)(decimalize(x, q) for x in v)
     78     # Convert int-like
     79     elif isinstance(v, numbers.Integral):

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/utils.py in <genexpr>(.0)
     75     # If tuple/list passed, bulk-convert
     76     elif isinstance(v, (tuple, list)):
---> 77         return type(v)(decimalize(x, q) for x in v)
     78     # Convert int-like
     79     elif isinstance(v, numbers.Integral):

~/.virtualenvs/project/lib/python3.6/site-packages/pdfplumber/utils.py in decimalize(v, q)
     87             return Decimal(repr(v))
     88     else:
---> 89         raise ValueError("Cannot convert {0} to Decimal.".format(v))
     90 
     91 def is_dataframe(collection):

ValueError: Cannot convert /'P1' to Decimal.

curve objects interfere with rect objects while table extraction

pdfminer is recognising the table as a curve object for some reason. As a result, not all the rows are being extracted. I have tried a few table-settings params, but none of them worked.

Link to single page PDF
Link to debug_tablefinder image

This is the output I get.

[[['Rule Out Myocardial Infarction \n(Revenue Code 0762 with Principal ICD-9-CM Diagnosis \nCodes: 411.0-411.89, 413.0-413.9, 414.00-414.05, 786.50-\n786.59, V71.7) (1)', 'Per Case', '$4,768.00']], [['Echocardiology (Revenue Code: 0483)', 'Per Unit', '$244.00']]]
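One workaround to experiment with, sketched below (not a guaranteed fix): derive explicit line positions from the curve objects' bounding boxes and pass them to the table finder so that the curves behave like ruling lines.

import pdfplumber

with pdfplumber.open("single-page.pdf") as pdf:
    page = pdf.pages[0]
    # Use the outer edges of each curve object as candidate ruling lines.
    v_lines = sorted({c["x0"] for c in page.curves} | {c["x1"] for c in page.curves})
    h_lines = sorted({c["top"] for c in page.curves} | {c["bottom"] for c in page.curves})
    table = page.extract_table({
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": v_lines,
        "explicit_horizontal_lines": h_lines,
    })
    print(table)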

page dimensions

Is there a way to return overall page dimensions? Or alternatively, relative positions (i.e. in fractional terms--so like 0.56 of the page)?

The use case is comparing / finding words at comparable positions in documents that have different sizes due to different prior processing. (This is also required for accurately displaying word positions as overlays on a PDF.) One could get at this by using relative positions (and I guess doctop would be prior pages plus the current relative position). If you captured just relative position, you'd probably also want to add an orientation variable, though I guess that would be determinable based on the letter-box proportions.

Having page_width and page_height on every line in the CSV seems awkward, but it would work. In JSON output one could just add it as a variable outside the rest of the data. Maybe that's the cleanest approach. Do you have any thoughts?
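For reference, page dimensions are exposed as page.width and page.height, so relative positions can be computed on the fly; a minimal sketch:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    for word in page.extract_words():
        # Fractional position of each word within the page.
        rel_x = float(word["x0"]) / float(page.width)
        rel_y = float(word["top"]) / float(page.height)
        print(word["text"], round(rel_x, 3), round(rel_y, 3))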

edge detection algorithm

To reference a note in Anssi Nurminen's master's thesis: "An edge in an image is defined as an above-threshold change in intensity value of neighboring pixels. Choosing a threshold value too high, some of the more subtle visual aids on a page will not be detected, while a threshold value too low can result in a lot of erroneously interpreted edges."
"The edge detection process is divided into four distinct steps that are described in more detail in the following chapters:

  1. Finding horizontal edges.
  2. Finding vertical edges.
  3. Finding crossing edges and creating “snapping points”.
  4. Finding cells (closed rectangular areas)."
Could you point me to where this wonderful tool implements this edge-detection algorithm?
@jsvine

v0.6.0 issue divide by 0 when cropping page

I ran into an issue of divide by 0 when using crop with v0.6.0 (see snippet below).

[screenshot of the error]

It was caused by a 'char' object having either orig['height'] = 0 or orig['width'] = 0.

I was able to fix the issue on my end. See PR.

Thanks,
Dustin

Repeating characters

I'm facing a weird problem wherein characters are repeated when using extract_text() or extract_tables(). For example, SSttaatteemmeenntt ooff AAccccoouunnttss is printed instead of Statement of Accounts.

Sometimes it happens in a portion of the PDF, and sometimes in the whole PDF. When it happens in a portion of the PDF, it is fixable (though not completely) via extract_text(x_tolerance=0, y_tolerance=0), but not when the issue affects the whole PDF. Also, note that I do not face this issue in all PDFs, only in some.

Lines are also repeated. For example:

Year-to-date totals do not reflect any fee or interest refunds
Year-to-date totals do not reflect any fee or interest refunds
you may have received.
you may have received.
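More recent pdfplumber releases add Page.dedupe_chars(), which drops characters duplicated at (nearly) the same position, a common cause of doubled text like the example above; whether it helps depends on the installed version. A minimal sketch:

import pdfplumber

with pdfplumber.open("statement.pdf") as pdf:
    # dedupe_chars() returns a version of the page with near-duplicate
    # characters removed (tolerance is in points).
    deduped = pdf.pages[0].dedupe_chars(tolerance=1)
    print(deduped.extract_text())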

page height is a float in python 2.7?

I'm seeing this with a current version of pdfplumber. It seems to only apply to some pages?

...pdfplumber_env/lib/python2.7/site-packages/pdfplumber/page.py", line 65, in process_object
attr["top"] =h - attr["y1"]
TypeError: unsupported operand type(s) for -: 'float' and 'Decimal'
