invoice-x / invoice2data

Extract structured data from PDF invoices

License: MIT License

Python 99.81% Makefile 0.19%

python data-mining

invoice2data's Introduction

Data extractor for PDF invoices - invoice2data

A command line tool and Python library to support your accounting process.

extracts text from PDF files using different techniques, like pdftotext, text, ocrmypdf, pdfminer, pdfplumber or OCR -- tesseract, or gvision (Google Cloud Vision).
searches for regex in the result using a YAML or JSON-based template system
saves results as CSV, JSON or XML or renames PDF files to match the content.

With the flexible template system you can:

precisely match content of PDF files
use plugins to match line items and tables
define static fields that are the same for every invoice
define custom fields needed in your organisation or process
have multiple regex per field (if layout or wording changes)
define currency
extract invoice-items using the lines-plugin developed by Holger Brunn

Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

flowchart LR

    InvoiceFile[fa:fa-file-invoice Invoicefile\n\npdf\nimage\ntext] --> Input-module(Input Module\n\npdftotext\ntext\npdfminer\npdfplumber\ntesseract\ngvision)

    Input-module --> |Extracted Text| C{keyword\nmatching}

    Invoice-Templates[(fa:fa-file-lines Invoice Templates)] --> C{keyword\nmatching}

    C --> |Extracted Text + fa:fa-file-circle-check Template| E(Template Processing\n apply options from template\nremove accents, replaces etc...)

    E --> |Optimized String|Plugins&Parsers(Call plugins + parsers)

    subgraph Plugins&Parsers

      direction BT

        tables[fa:fa-table tables] ~~~ lines[fa:fa-grip-lines lines]

        lines ~~~ regex[fa:fa-code regex]

        regex ~~~ static[fa:fa-check static]

 

    end

    Plugins&Parsers --> |output| result[result\nfa:fa-file-csv,\njson,\nXML]

 

 click Invoice-Templates https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md

 click result https://github.com/invoice-x/invoice2data#usage

 click Input-module https://github.com/invoice-x/invoice2data#installation-of-input-modules

 click E https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#options

 click tables https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#tables

 click lines https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#lines

 click regex https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#regex

 click static https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#parser-static

Installation

Install pdftotext

If possible get the latest xpdf/poppler-utils version. It's included with macOS Homebrew, Debian and Ubuntu. Without it, pdftotext won't parse tables in PDF correctly.

Install invoice2data using pip

pip install invoice2data

Installation of input modules

A tesseract wrapper is included in auto language mode. It tests your input files against the languages installed on your system. To use it, tesseract and imagemagick need to be installed. tesseract supports multiple OCR engine modes. By default, the engine available on the system is used.

Languages: tesseract-ocr recognizes more than 100 languages. Linux users can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

Usage

Basic usage. Process PDF files and write result to CSV.

invoice2data invoice.pdf
invoice2data invoice.txt
invoice2data *.pdf

Choose any of the following input readers:

pdftotext: invoice2data --input-reader pdftotext invoice.pdf
text: invoice2data --input-reader text invoice.txt
tesseract: invoice2data --input-reader tesseract invoice.pdf
pdfminer.six: invoice2data --input-reader pdfminer invoice.pdf
pdfplumber: invoice2data --input-reader pdfplumber invoice.pdf
ocrmypdf: invoice2data --input-reader ocrmypdf invoice.pdf
gvision: invoice2data --input-reader gvision invoice.pdf (needs GOOGLE_APPLICATION_CREDENTIALS env var)

Choose any of the following output formats:

csv: invoice2data --output-format csv invoice.pdf
json: invoice2data --output-format json invoice.pdf
xml: invoice2data --output-format xml invoice.pdf

Save output file with custom name or a specific folder

invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf

Note: You must specify the output-format in order to create output-name

Specify folder with yml templates. (e.g. your suppliers)

invoice2data --template-folder ACME-templates invoice.pdf

Only use your own templates and exclude built-ins

invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf

Processes a folder of invoices and copies renamed invoices to new folder.

invoice2data --copy new_folder folder_with_invoices/*.pdf

Processes a single file and dumps whole file for debugging (useful when adding new templates in templates.py)

invoice2data --debug my_invoice.pdf

Recognize test invoices: invoice2data invoice2data/test/pdfs/* --debug

Use as Python Library

You can easily add invoice2data to your own Python scripts as library.

from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')

Using in-house templates

from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

templates = read_templates('/path/to/your/templates/')
result = extract_data(filename, templates=templates)

Template system

See invoice2data/extract/templates for existing templates. Just extend the list to add your own. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers. 80-20 rule. For a short tutorial on how to add new templates, see TUTORIAL.md.

Templates are based on YAML or JSON. They define one or more keywords to find the right template, one or more exclude_keywords to further narrow it down, and regular expressions for the fields to be extracted. A field can also be a static value, like the full company name.

Template files are tried in alphabetical order.

We may extend them to feature options to be used during invoice processing.
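The keyword matching described above can be sketched roughly as follows. This is a simplified illustration and not the library's actual code: the function name `template_matches` and the plain-dict template are assumptions made for the example, and plain substring tests stand in for whatever matching the library really performs.

```python
def template_matches(template, text):
    """Return True if every keyword occurs in text and no exclude keyword does."""
    keywords = template.get("keywords", [])
    excluded = template.get("exclude_keywords", [])
    has_all = all(keyword in text for keyword in keywords)
    has_excluded = any(keyword in text for keyword in excluded)
    return has_all and not has_excluded


# Hypothetical template, mirroring the YAML layout used by the project.
aws = {"keywords": ["Amazon Web Services"], "exclude_keywords": ["San Jose"]}
print(template_matches(aws, "Invoice from Amazon Web Services, Seattle"))
print(template_matches(aws, "Amazon Web Services, San Jose office"))
```

The first call matches because the keyword is present and no excluded word appears; the second is rejected by the exclude keyword.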

Example:

issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
exclude_keywords:
- San Jose
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
    start: Detail
    end: \* May include estimated US sales tax
    first_line: ^    (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
    line: (.*)\$(\d+\.\d+)
    skip_line: Note
    last_line: VAT \*\*

The lines package has multiple settings:

start > The pattern where the lines begin. This is typically the header row of the table. This row is not included in the line matching.
end > The pattern denoting where the lines end. Typically some text at the very end or immediately below the table. Also not included in the line matching.
first_line > Optional. This is the primary line item for each entry.
line > If first_line is not provided, this will be used as the primary line pattern. If first_line is provided, this is the pattern for any sub-lines such as line item details.
skip_line > Optional. If first_line is passed, this pattern indicates which sub-lines will be skipped and their data not recorded. This is useful if tables span multiple pages and you need to skip over page numbers or headers that appear mid-table.
last_line > Optional. If first_line is passed, this pattern denotes the final line of the sub-lines and is included in the output data.
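To make the start, end and line settings more concrete, here is a minimal sketch of the idea. It is an assumption-laden simplification: `extract_lines` below is a hypothetical helper written for this example, it ignores first_line, skip_line and last_line, and it does not claim to reproduce the plugin's real behaviour.

```python
import re


def extract_lines(text, start, end, line):
    """Collect named groups from every row between the start and end patterns."""
    start_match = re.search(start, text)
    if start_match is None:
        return []
    remainder = text[start_match.end():]
    end_match = re.search(end, remainder)
    if end_match is None:
        return []
    body = remainder[:end_match.start()]
    rows = []
    for raw in body.splitlines():
        match = re.search(line, raw)
        if match:
            # Each row is built from the named groups of the line pattern.
            rows.append(match.groupdict())
    return rows


sample = """Detail
Hosting    $10.00
Storage    $2.50
* May include estimated US sales tax
"""
rows = extract_lines(
    sample,
    start=r"Detail",
    end=r"\* May include",
    line=r"(?P<description>\w+)\s+\$(?P<price>\d+\.\d+)",
)
print(rows)
```

Neither the start row nor the end row ends up in the result, which matches the description of those two settings.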

⚠️ invoice2data uses a YAML templating system. The YAML templates are loaded with pyyaml, which is a pure Python implementation and therefore rather slow. As an alternative, JSON templates can be used; Python supports them natively.

Performance with YAML templates can be increased by roughly 10x by using libyaml. It can be installed on most Debian-based distributions with: sudo apt-get install libyaml-dev

Development

If you are interested in improving this project, have a look at our developer guide to get you started quickly.

Roadmap and open tasks

integrate with online OCR?
try to 'guess' parameters for new invoice formats.
can apply machine learning to guess new parameters?
advanced table parsing with camelot

Maintainers

Contributors

Harshit Joshi: As Google Summer of Code student.
Holger Brunn: Add support for parsing invoice items.

Used By

Odoo, OCA module account_invoice_import_invoice2data

Related Projects

OCR-Invoice (FOSS | C#)
DeepLogic AI (Commercial | SaaS)
Docparser (Commercial | Web Service)
A-PDF (Commercial)
PDFdeconstruct (Commercial)
CVision (Commercial)

invoice2data's Issues

Cannot convert to CSV

File "/home/harshit/anaconda3/lib/python3.6/site-packages/invoice2data-0.2.73-py3.6.egg/invoice2data/main.py", line 97, in main
    invoices_to_csv(output, csv_output_name)
NameError: name 'csv_output_name' is not defined

Need to replace csv_output_name with args.csv_output_name

Extract from JSON

Hello,

Is there any way to use JSON or hOCR as input?

Thanks for your help and your work!

Manu

Warn on version-mismatch with xpdf

It's great to be able to use the latest improvements of xpdf (option -table), but I find it a bit hard to oblige all users to use the latest 3.04 version. Even with the latest version of Ubuntu and on Debian Sid, xpdf is version 3.03 (which is a bit strange, because xpdf 3.04 was released in May 2014), so I cannot easily propose a backport via a PPA.

So I propose to auto-detect the version of pdftotext: if the version is 3.04 or later, we add the -table option; otherwise, we don't use that option. And, in the README, we encourage users to upgrade to the latest version of xpdf/pdftotext to benefit from the latest improvements.

@manuelRiel What do you think ? Do we really need to oblige ALL our users to use version 3.04 ?

If you are OK with my idea, I'll propose a PR to autodetect the version and adapt the arguments.

[regression] bad handling of decimal separator

@m3nu Following your changes of January 23, the amounts with "," as decimal separator are badly converted to float. For example, pdftotext extracts "27,80" and this is converted to "2780.0" !

When I look at the code in main.py:

locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
[...]
output[k] = locale.atof(res_find[0])

So I understand that the conversion from string to float is always done with the en_US locale, so it doesn't support the comma as decimal separator. What was your idea when you wrote this code?

My idea would be to add 2 additional options in the YAML files: decimal_separator and thousand_separator. We can't just define a language parameter and use it, because, for example, some French invoices don't respect the official French decimal separator (comma) and thousand separator (space).
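The proposal could look roughly like the sketch below. `parse_amount` is a hypothetical helper written for illustration; only the two option names come from the issue text, and the defaults are assumptions.

```python
def parse_amount(raw, decimal_separator=".", thousand_separator=","):
    """Convert a string such as '1 234,56' to a float using explicit separators."""
    cleaned = raw.strip()
    if thousand_separator:
        # Drop grouping characters first so they cannot be mistaken for decimals.
        cleaned = cleaned.replace(thousand_separator, "")
    cleaned = cleaned.replace(decimal_separator, ".")
    return float(cleaned)


print(parse_amount("27,80", decimal_separator=",", thousand_separator=" "))
print(parse_amount("1 234,56", decimal_separator=",", thousand_separator=" "))
print(parse_amount("1,234.56"))
```

Because the separators are passed explicitly, the result no longer depends on which locales are installed on the machine.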

Add tests

This PDF invoice doesn't match a known template of the invoice2data lib.

I have installed xpdftotext like this:
http://support.sphiderpro.eu/knowledgebase.php?article=13

then I have installed invoice2data like this:
https://github.com/m3nu/invoice2data

then I have installed the module account_invoice_import in Odoo 8

After trying to upload some pdf like the examples pdf, I get this error:

This PDF invoice doesn't match a known template of the invoice2data lib.

Allow the user to choose the CSV file

I think the CSV output should be a command argument, to allow the user to choose where to put it and, if no argument is given, to not generate it.

Refactor for better modularity

I took a first shot at restructuring to make the three processing steps more modular. I see three main steps that have interchangeable modules.

input: Read plain text from somewhere. Could be a PDF, website, ERP system. Currently only pdftotext works.
extract: Get fields using templates. I moved the line extraction feature into a subclass. There could be other subclasses for different ways to find fields. Templates should be able to choose from multiple extraction plugins. Still needs some refactoring to achieve this.
output: Save results to a file or somewhere else. Currently supports csv. json will be added soon.

Those options can be combined from the CLI tool

invoice2data --input-reader tesseract --output-format json FILE.PDF

or the main extract_data function

extract_data(invoicefile, input_module=pdftotext):

I tried to keep it all backwards-compatible. The few tests we have still work.

Please let me know your feedback and then we'll merge the refactoring. Code is in a separate branch called v2. @hbrunn @alexis-via @duskybomb @rsenwar

Other changes done or planned (todo list):

Add developer guide to help people start quickly. DEVELOP.md
Organize different ways to extract fields as subclasses.
Allow templates to choose from different plugins/extraction steps.
More tests. Especially for CLI tool. #8
Add JSON output module. #34
Update Python versions we test on. Currently up to 3.6
Add new contributors to README file.

How extract Total in 2 lines ?

Hello !

Can you have a look ?

Many Thanks !

PDF lib with layout support

This may be helpful: https://github.com/JonathanLink/PDFLayoutTextStripper

does this mean that templates is not properly configured?

File ".../lib/python2.7/site-packages/invoice2data/template.py", line 110, in matches_input
if all([keyword in optimized_str for keyword in self['keywords']]):
TypeError: coercing to Unicode: need string or buffer, NoneType found

Raise exception when date parsing fails

On this line of code:
https://github.com/m3nu/invoice2data/blob/master/invoice2data/template.py#L137

when date parsing fails, we log an error and we return None. But that's a problem when invoice2data is used by another software as a python lib because the software cannot display a good error message "date parsing fails ..." to the user.

How to handle Value which are below mentions

There are invoices where the mention and value are paired as
INVOICE #
xxxxxxxxx (x= any digit 0-9)
and there may be a lot of other noise in front of this. Can invoice2data handle this?
My regex is working but not matching

regex used is INVOICE\s#\s+.*?\s+(\d+)
this works in regex101 and matches with two groups; however, it fails to fetch the data ...

How can one handle these scenarios

read_template codec error

Hi,

I 've started and get the following error with
cloudns.yml ( I think from russian letters):

I 've no experience with Python. How can I switch the locale?

My Environment:
Win10 64 Home
Active Perl 3.6

C:\Python36\lib\site-packages\invoice2data-0.2.53-py3.6.egg\invoice2data\templates com.cloudns.yml
Traceback (most recent call last):
File "C:\Users\PapaNetz.p2\pool\plugins\org.python.pydev_5.7.0.201704111357\pysrc\pydevd.py", line 1546, in
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Users\PapaNetz.p2\pool\plugins\org.python.pydev_5.7.0.201704111357\pysrc\pydevd.py", line 982, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Users\PapaNetz.p2\pool\plugins\org.python.pydev_5.7.0.201704111357\pysrc_pydev_imps_pydev_execfile.py", line 25, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:\Users\PapaNetz\git\invoice2data\invoice2data\main.py", line 96, in
main()
File "C:\Users\PapaNetz\git\invoice2data\invoice2data\main.py", line 80, in main
templates += read_templates(pkg_resources.resource_filename('invoice2data', 'templates'))
File "C:\Python36\lib\site-packages\invoice2data-0.2.53-py3.6.egg\invoice2data\template.py", line 39, in read_templates
tpl = ordered_load(open(os.path.join(path, name)).read())
File "C:\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 53: character maps to <undefined>
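The traceback shows the template being opened with the platform default encoding (cp1252 on this Windows machine). A common remedy, sketched here under the assumption that the template files are stored as UTF-8, is to pass the encoding explicitly. `read_template_text` is a hypothetical helper, not a function of the library.

```python
import os
import tempfile


def read_template_text(path):
    """Read a template file as UTF-8 regardless of the platform default encoding."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Demonstration with a temporary file containing Cyrillic text.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.yml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("issuer: Пример\n")
    content = read_template_text(path)

print(content == "issuer: Пример\n")
```

Switching the system locale is not needed with this approach, since the decoding no longer depends on it.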

Missing setup.py

"Normal" users are used to install a Python module via:

sudo python ./setup.py install

We're lacking this file (and maybe others) to make it a standard python module.

Extract fields with line-plugin

Hello,

Here is my first attempt to get data from a PDF invoice. Some things are working, some I don't understand.

Here is my template file:

# -*- coding: utf-8 -*-
issuer: Legallais
fields:
  amount: Montant total net TTC\n\s+\d+\s+\d+.\d{2}\s+\d+.\d{2}\s+\d+.\d{2}\s+(\d+.\d{2})
  amount_untaxed: Montant total net TTC\n\s+\d+\s+(\d+.\d{2})
  date: 138896\s+(\d{2}/\d{2}/\d{2})
  date_due: Date limite de paiement le (\d+ .+ \d{4})
  invoice_number: FACTURE Ndeg (\d\s\d{3}\s\d{3})
  static_vat: FR 20 563 820 489
keywords:
  - FR 20 563 820 489
  - FACTURE
  - Eur
options:
  currency: EUR
  date_formats:
    - '%d %B %Y'
  languages:
    - fr
  decimal_separator: '.'
  remove_accents: True
lines:
  start: Code\s+Designation
  end: Nos réf.+
  line: (?P\d+)\s+(?P\w+)\s+(?P\d+)\s+(?P\w+)\s+(?P<P.U.H.T.>\d+)\s+(?P\w+\s+\d)\s+(?P<R%>\d+)\s+(?P<P.U.net>\d+.\d{2})\s+(?P\d+.\d{2})
  first_line: ^\s+(?P\d+)\s+(?P\w+)
  last_line: ^\s+Nos réf.+


1- the only data I get is a small CSV table :
date,desc,amount

09/01/2016,Invoice 6 002 393 from Legallais,89.32
How do I get more data. In debug mode I see invoice2data is getting much more than that :

INFO:invoice2data.main:{'amount_untaxed': 74.43, 'currency': 'EUR', 'amount': 89.32, 'date': datetime.datetime(2016, 1, 9, 0, 0), 'invoice_number': '6 002 393', 'vat': 'FR 20 563 820 489', 'desc': 'Invoice 6 002 393 from Legallais'}
What if, for example I want my amount_untaxed in the CSV file ?

How do I order my fields in the CSV file ?
2- invoice2data doesn't recognize my lines, what did I do wrong?

The debug output is in attachment.
lines are shown as :

Code Designation Quantite UV P.U.H.T. UF T R% P.U.net Montant H.T.

Livraison ndeg 2489766 du 06/01/16

542241 AGRAPHEUSE CLOUEUSE T50RED 1 PI 108.23 PI 1 60.00 43.29 43.29

106582 AGRAFE T50 8MM B1250 1 BT 6.16 BT 1 60.00 2.46 2.46

106617 AGRAFE T50 14MM B1250 1 BT 8.20 BT 1 60.00 3.28 3.28

106603 AGRAFE T50 10MM B1250 1 BT 7.12 BT 1 60.00 2.85 2.85

REMISE EXCEPTIONNELLE 1 -2.59 PI 1 Nt -2.59 -2.59

Nos ref : 2489766 du 06/01/16 - V/ commande : TOTAL LIVRAISON 49.29
Livraison ndeg 2496744 du 08/01/16

400498 AEROSOL SUPER GLISSE BOIS 6190 3 PI 20.96 PI 1 60.00 8.38 25.14

Nos ref : 2496744 du 07/01/16 - V/ commande : atelier TOTAL LIVRAISON 25.14
3- How can I ask invoice2data to cope with this kind of layout, where the invoice has a subtotal for each sale order?
4- In France (as must be the case in other countries) we have several taxes. At the end of the invoice there's a tax table giving the tax base amount, tax rate and tax amount.
Is invoice2data able to handle this tax table?
Thanks for your help
Regards
Vart

inv2data-DebugOutput.txt

Can't extract single line data from invoices

I'm able to extract all the data from invoice PDF except lines part.
Below is the line template. I've tested the regex on https://regex101.com/ and it works fine.

Template

lines:
  start: RETOUR
  end: Totaal Overzicht
  line: ^(?P<INVOICE_QUANTITY>\d+.\S)(?P<INVOICE_ID>\d+.\s)(?P<DESCRIPTION>\w+.*\s)(?P<GROSS_UNUSED>\w+.*\s)(?P<PRICE>\w+.*\s)(?P<AMOUNT>\w+.*\s)(?P<TAX_PERCENTAGE>\S+)

No line in result.

But if I add simple parsing
line: (.*\d\S)
then I get empty results
'lines': [{}, {}, {}, {}, {}, {}, {}, {}, {}],

Is it because of regex? All the sample templates either don't have lines or have multi line with first_line, last_line
Can someone push me right direction or give an example?
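One possible explanation for the empty dictionaries, assuming the lines plugin builds each row from the named groups of the match (an assumption about the plugin's internals, not something confirmed in this thread): a pattern with only unnamed groups matches, but has no named groups to report. The sample row and group names below are made up for the demonstration.

```python
import re

row = "2 12345 Widget 10.00 8.00 16.00 21%"

# Pattern with an unnamed group only: it matches, but groupdict() is empty.
unnamed = re.search(r"(.*\d\S)", row)
# Pattern with named groups: each name becomes a key in the row.
named = re.search(r"(?P<qty>\d+)\s+(?P<item_id>\d+)\s+(?P<description>\w+)", row)

print(unnamed.groupdict())
print(named.groupdict())
```

If that assumption holds, giving every captured value a name such as `(?P<qty>...)` would be the first thing to try.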

Processing a folder

I tried to process a folder of invoices to a single output CSV using command line but it gives me an error
"invoice2data: error: argument input_files: can't open '.pdf': [Errno 22] Invalid argument: '.pdf'"
I'm using windows 10.

The command line I used is
C:\Users\IT\Desktop\test>invoice2data *.pdf
All my PDF are in the folder test.
Please help me resolve?

Code used:

'''
Created on 15 Aug 2017
Code used:
@author: it
'''
import test_ui
import invoice2data
import logging
from PyQt5 import QtWidgets, QtGui, QtCore
import subprocess
import os
from invoice2data.main import extract_data

class MainWindow(QtWidgets.QMainWindow, test_ui.Ui_MainWindow):
    def __init__(self, parent=None):
        super(MainWindow, self).__init__(parent)
        self.setupUi(self)

if __name__ == "__main__":

    os.chdir("C:\\Users\\it\\Desktop\\test")
    os.system("cmd.exe /c invoice2data *.pdf")
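Unlike most Unix shells, cmd.exe does not expand wildcards, so the program receives the literal string `*.pdf`. A workaround, sketched here with a hypothetical `build_command` helper, is to expand the pattern in Python and pass the resulting file names explicitly.

```python
import glob
import os


def build_command(folder, pattern="*.pdf"):
    """Expand the wildcard in Python and return an argument list for subprocess."""
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    return ["invoice2data"] + files


command = build_command(".")
print(command)
# To run it: subprocess.run(command, check=True), after checking that files were found.
```

Passing a list to subprocess also avoids quoting problems with file names that contain spaces.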

Extract production information from line between FirstLine and LastLine of one product

example of one product details in invoice given below

Serla Kostea Siivousliina 32 kpl
221443 92537 1 TU 14.160 TU 24 0.00
/ 14.16

How should I extract the middle line info?
I have written a regular expression for FirstLine which covers the first two lines in the regex101 editor, but it won't work with the extract_lines() function because it extracts text line by line.

firstline: (?P<desc>.+?)\s+(\d+)\s+(\d+)\s+(?P<qty>\d+\s+TU)\s+(?P<unitprice>\d+.\d+(:?\s+TU))\s+(?P<taxpercent>\d+(.\d+)?)\s+(?P<discount>\d+.\d+)

lastline: \s+/\s+(?P<totalamount>\d+.\d+)

Crash when de_DE locale is not installed

I am trying to use invoice2data on my Ubuntu 15.10:

% python -m invoice2data.main invoice2data/test/pdfs --debug
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/alexis/new_boite/dev/invoice2data/invoice2data/main.py", line 67, in <module>
    res = extract_data(join(args.file_folder, f))
  File "/home/alexis/new_boite/dev/invoice2data/invoice2data/main.py", line 34, in extract_data
    output[k] = str2date(raw_date)
  File "invoice2data/date_parser.py", line 33, in str2date
    with setlocale(l):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "invoice2data/date_parser.py", line 18, in setlocale
    yield locale.setlocale(locale.LC_ALL, name)
  File "/usr/lib/python2.7/locale.py", line 579, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting
zsh: exit 1     python -m invoice2data.main invoice2data/test/pdfs --debug

As I am a French user, my default locale is French:

% locale
LANG=fr_FR.UTF-8
LANGUAGE=fr_FR
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=

If I run "locale-gen", I see that Ubuntu generates all fr_* and en_* locales.

The crash is linked to the fact that you try to set the locale de_DE. In date_parser.py line 32, if I remove 'de_DE.UTF-8' from the list, I will manage to parse the 2 first invoices, but it will crash later on:

% python -m invoice2data.main invoice2data/test/pdfs --debug
{'date': datetime.datetime(2014, 8, 3, 0, 0), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': datetime.datetime(2015, 1, 28, 0, 0), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/alexis/new_boite/dev/invoice2data/invoice2data/main.py", line 67, in <module>
    res = extract_data(join(args.file_folder, f))
  File "/home/alexis/new_boite/dev/invoice2data/invoice2data/main.py", line 38, in extract_data
    output[k] = re.findall(v, str)[0]
IndexError: list index out of range
zsh: exit 1     python -m invoice2data.main invoice2data/test/pdfs --debug
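The first crash could be avoided by skipping locales that are not installed instead of letting `locale.Error` propagate. The following is a minimal sketch of that idea; `try_locale` is a hypothetical name and this is not the project's actual date parser.

```python
import locale
from contextlib import contextmanager


@contextmanager
def try_locale(name):
    """Temporarily switch LC_ALL to name; yield False if it is not installed."""
    saved = locale.setlocale(locale.LC_ALL)
    try:
        locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        # Locale missing on this system: signal it and leave settings untouched.
        yield False
        return
    try:
        yield True
    finally:
        locale.setlocale(locale.LC_ALL, saved)


for candidate in ("de_DE.UTF-8", "fr_FR.UTF-8", "C"):
    with try_locale(candidate) as available:
        print(candidate, "available" if available else "skipped")
```

The second crash (`IndexError`) has a different cause: `re.findall` returned an empty list, so the result needs to be checked before indexing it.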

Could not find a version that satisfies the requirement dateparser (from invoice2data==0.2.59)

While installing invoice2data 0.2.59 I got the error
"Could not find a version that satisfies the requirement dateparser (from invoice2data==0.2.59)". I get the same with PDFminer.six. When I remove them from the requirements file it works. Have I missed out any part of the installation?

ResourceWarning: unclosed file

I was running tests on CLI and received this warning

/home/harshit/gsoc/gsoc-invoice2data/invoice2data/invoice2data/test/test.py:36: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/harshit/gsoc/gsoc-invoice2data/invoice2data/invoice2data/test/pdfs/2014-08-03 SALES Amazon Web Services  aws.amazon.coUS.pdf' mode='r' encoding='UTF-8'>
  self._run_test_on_folder(folder)])
/home/harshit/anaconda3/lib/python3.6/unittest/case.py:605: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/harshit/gsoc/gsoc-invoice2data/invoice2data/invoice2data/test/pdfs/2014-08-03 SALES Amazon Web Services  aws.amazon.coUS.pdf' mode='r' encoding='UTF-8'>
  testMethod()
.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK

The test script is:

https://github.com/duskybomb/invoice2data/blob/patch-1/test.py

machine-learning-approach for invoice2data

@m3nu I have implemented the project you provided on GitHub regarding invoice2data, but is there any machine learning approach to it? You have provided the regex features in YAML manually; how can I approach these things with machine learning? Thank you.

Add Tutorial for template creation and best practice

and not really understanding how to go about it. A data file I am trying to work with looks like:-
(As this is a public forum, I have changed the supplier name to "Supplier" and the customer name to "Customer")
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Supplier Name
Carterton
Oxfordshire
OX18 3EZ

Invoice

Page 1

Tel:- (01993) xxxxxx Fax: - (01993) xxxxxx
Email:- [email protected]
VAT Reg No: xxx xxxx xx
Customer
Nr Lechlade, Glos
GL7 3QS

55574
11/05/2016

BPOO1

Quantity Details
1.00 Job 25326. To supply only 6no M12x150
8.8 bzp bolts as required.

Terms strictly 28 days from date of invoice
Goods remain the property of Supplier
until payment has been received in full.
BACS Details (HSBC)
Sort Code 40-16-46
Account 11055860

Unit Price

Net Amt

VAT %

VAT

4.70

20.00

0.94

Total Net Amount

4.70

Carriage Net

0.00

Total VAT Amount

0.94

Invoice Total

5.64

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Most of the bits of text seem to be on separate lines, so do I have to tell it to step over whitespace, or does it do that for me?

Is there a description of the pattern matching formats?

Thanks in advance
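As a partial answer to the whitespace question: in Python regular expressions `\s+` matches any run of whitespace, including newlines, so a label and its value can sit on different lines. The snippet below is only a sketch on a shortened copy of the text above; whether the same patterns work inside a template depends on how the extracted text is preprocessed.

```python
import re

extracted = "55574\n11/05/2016\n\nBPOO1\n\nInvoice Total\n\n5.64\n"

# \s+ bridges the blank lines between the label and the amount.
total = re.search(r"Invoice Total\s+(\d+\.\d{2})", extracted)
# A five-digit number directly followed by a date on the next line.
number = re.search(r"(\d{5})\s+\d{2}/\d{2}/\d{4}", extracted)

print(total.group(1))
print(number.group(1))
```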

Improve error message when pdftotext is missing

Currently, when pdftotext is not installed, we get this error:

% python -m invoice2data.main ~/new_boite/pdf-invoices/test/sfr-facture.pdf
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/alexis/new_boite/dev/invoice2data/invoice2data/main.py", line 96, in <module>
    main()
  File "/home/alexis/new_boite/dev/invoice2data/invoice2data/main.py", line 84, in main
    res = extract_data(f.name, templates=templates)
  File "/home/alexis/new_boite/dev/invoice2data/invoice2data/main.py", line 23, in extract_data
    extracted_str = pdftotext.to_text(invoicefile).decode('utf-8')
  File "invoice2data/pdftotext.py", line 13, in to_text
    stdout=subprocess.PIPE).communicate()
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1340, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

But this error message doesn't give a clue about the cause of the problem. We should try to improve that if possible I think.
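One way to produce a clearer message is to check for the executable before launching it. This is a sketch only: the `binary` parameter and the message wording are assumptions, and `to_text` here is a stand-alone illustration, not the project's function.

```python
import shutil
import subprocess


def to_text(path, binary="pdftotext"):
    """Run pdftotext on path, failing early with a clear message if it is missing."""
    if shutil.which(binary) is None:
        raise EnvironmentError(
            f"{binary} not found on PATH. Install xpdf or poppler-utils first."
        )
    result = subprocess.run(
        [binary, "-layout", path, "-"], stdout=subprocess.PIPE, check=True
    )
    return result.stdout
```

`shutil.which` searches the same PATH that `subprocess` uses, so the check also helps on Windows when the Poppler folder has not been added to PATH.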

Add extended tests that also validates data extracted.

Currently we run tests without looking at the data returned.

There could be some validated dicts that are compared with output data from a few files.

--template-folder dir should be an addition to the standard templates, not a replacement

I have added support for --template-folder when using invoice2data via Odoo. But, when we use --template-folder xxx, the templates are read from the xxx directory, but the native templates of invoice2data are ignored.

I think that, when we use --template-folder xxx, these templates in xxx should be used in addition to the native templates. What do you think ? If you agree, I can try to find time to submit a patch.

Export to UBL or EDIFACT

Hello,

I want to start collaborating in this project. As I looked at https://github.com/m3nu/invoice2data/issues/75, there is an idea of new feature that involves export to UBL.

I want to know if there are enough benefits to this feature or if I should collaborate in other demand that is more important.

Any advice or information about the UBL feature will be appreciated.

Export multiple output formats, like json, csv

Hi,

I know we can use invoice2data directly as a lib in python, but it would be great to make the output parsable. With a --json parameter, for instance.

Thanks, this app is awesome.

Typo in TUTORIAL.md

The line:

issuer: The name of the invoice issuer. Can the company name and country.

looks incomplete. There should be a word in between Can the

The need for templates

I haven't played with invoice2data yet, but I wondered whether you have to have templates?

Is there any way one could maintain a db of possible terms for each distinguishable entity eg. Invoice number = invoice no

In that way it could test all terms until it found a match?

Missing licence information

I didn't find any information about the "licence" of this great "invoice2data" lib. Did I miss something ?

I think it's important to choose a free software licence and display it clearly.

Match language in addition to tax number

Ran into an issue with your fr35433115904.yml Scaleway template. First it's nice to be able to use other people's templates. But you are matching the French version. I guess each template needs one word under keywords to match the right version?

regular expression ^ and $

Hello !

Me again, but a little alone !

I have trouble using ^ and $ in regular expressions, although I have validated the expression in another context.

any help ?

Thanks !

Windows support

Although I downloaded Poppler for Windows, I still receive the same error message:

File "C:\Python\Python36\lib\site-packages\invoice2data\in_pdftotext.py", line 19, in to_text
raise EnvironmentError('pdftotext not installed. Can be downloaded from https://poppler.freedesktop.org/')
OSError: pdftotext not installed. Can be downloaded from https://poppler.freedesktop.org/

Any clues?

Cannot use extract_data() directly

OK, I found the time to investigate the issue. After the commits of this week-end, it still works well when you use invoice2data from the command line, but it doesn't work any more when you use it as a python lib.

In the README file, it says:

from invoice2data import extract_data

result = extract_data('path/to/my/file.pdf')

This doesn't work any more, because extract_data has one additional arg. This additional arg must be the result of a call to the read_templates() method, which is done in the main() method before calling extract_data().

I don't know what you have in mind about this. I see two options:

we update the README to explain that users have to call read_templates() first and then call extract_data()... which is a bit more work for our lib users than before.
we adapt the code to call read_templates() inside extract_data() if the second arg is empty
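The second option might look like the sketch below. Both function bodies are stand-ins written for illustration, not the library's code; only the names `extract_data`, `read_templates` and the `templates` keyword come from this thread.

```python
def read_templates():
    """Stand-in for the library's template loader (hypothetical body)."""
    return [{"issuer": "Example", "keywords": ["Example"]}]


def extract_data(invoice_file, templates=None):
    """Load the default templates when the caller does not pass any."""
    if templates is None:
        templates = read_templates()
    # Stand-in for the real extraction step: report what would be used.
    return {"file": invoice_file, "template_count": len(templates)}
```

With a default of `None`, the one-argument call from the README keeps working, while callers that already load templates themselves can still pass them in.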

Pip install fails

macbookpro:~ sander$ pip install invoice2data
Collecting invoice2data
  Using cached invoice2data-0.2.47.tar.gz
Collecting pytesseract (from invoice2data)
  Using cached pytesseract-0.1.6.tar.gz
Collecting pillow (from invoice2data)
  Using cached Pillow-4.0.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting pyyaml (from invoice2data)
  Using cached PyYAML-3.12.tar.gz
Collecting dateparser (from invoice2data)
  Using cached dateparser-0.6.0-py2.py3-none-any.whl
Collecting unidecode (from invoice2data)
  Using cached Unidecode-0.04.20-py2.py3-none-any.whl
Collecting pdfminer.six (from invoice2data)
  Using cached pdfminer.six-20160614.zip
Collecting olefile (from pillow->invoice2data)
  Using cached olefile-0.44.zip
Collecting tzlocal (from dateparser->invoice2data)
  Using cached tzlocal-1.3.tar.gz
Requirement already satisfied: pytz in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from dateparser->invoice2data)
Requirement already satisfied: python-dateutil in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from dateparser->invoice2data)
Collecting ruamel.yaml (from dateparser->invoice2data)
  Using cached ruamel.yaml-0.13.14.tar.gz
    Complete output from command python setup.py egg_info:
    /var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/tmp_ruamel_J1xp6L/test_ruamel_yaml.c:6:8: warning: explicitly assigning value of variable of type 'yaml_parser_t' (aka 'struct yaml_parser_s') to itself [-Wself-assign]
    parser = parser;  /* prevent warning */
    ~~~~~~ ^ ~~~~~~
    /var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/tmp_ruamel_J1xp6L/test_ruamel_yaml.c:6:10: warning: variable 'parser' is uninitialized when used here [-Wuninitialized]
    parser = parser;  /* prevent warning */
             ^~~~~~
    /var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/tmp_ruamel_J1xp6L/test_ruamel_yaml.c:5:1: note: variable 'parser' is declared here
    yaml_parser_t parser;
    ^
    2 warnings generated.
    /var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/tmp_ruamel_J1xp6L/test_ruamel_yaml.c:6:8: warning: explicitly assigning value of variable of type 'yaml_parser_t' (aka 'struct yaml_parser_s') to itself [-Wself-assign]
    parser = parser;  /* prevent warning */
    ~~~~~~ ^ ~~~~~~
    /var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/tmp_ruamel_J1xp6L/test_ruamel_yaml.c:6:10: warning: variable 'parser' is uninitialized when used here [-Wuninitialized]
    parser = parser;  /* prevent warning */
             ^~~~~~
    /var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/tmp_ruamel_J1xp6L/test_ruamel_yaml.c:5:1: note: variable 'parser' is declared here
    yaml_parser_t parser;
    ^
    2 warnings generated.
    sys.argv ['-c', 'egg_info', '--egg-base', 'pip-egg-info']
    test compiling test_ruamel_yaml
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/pip-build-VrgoKR/ruamel.yaml/setup.py", line 854, in <module>
        main()
      File "/private/var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/pip-build-VrgoKR/ruamel.yaml/setup.py", line 843, in main
        setup(**kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 111, in setup
        _setup_distribution = dist = klass(attrs)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 272, in __init__
        _Distribution.__init__(self,attrs)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 287, in __init__
        self.finalize_options()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 326, in finalize_options
        ep.require(installer=self.fetch_build_egg)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 2385, in require
        reqs = self.dist.requires(self.extras)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 2617, in requires
        dm = self._dep_map
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 2606, in _dep_map
        if invalid_marker(marker):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 1424, in is_invalid_marker
        cls.evaluate_marker(text)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 1549, in _markerlib_evaluate
        env = cls._translate_metadata2(_markerlib.default_environment())
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 1537, in _translate_metadata2
        for key, value in env
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 1536, in <genexpr>
        (key.replace('.', '_'), value)
    ValueError: too many values to unpack
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/kl/c7vdfbm92sz3j_frpf8m5tfm0000gn/T/pip-build-VrgoKR/ruamel.yaml/

Adding tesseract to OCR and extract

I am trying to get this module to work with image invoices too with pytesseract.
But I am facing a problem that the structure is not preserved when image is OCRed.

Hence some information is lost and I can't run a regex over it. Let's say I have an image of an Amazon invoice with 2 orders on it. Each order description can span 2 or more lines, and the corresponding amount sits in the same row as the order. After applying OCR, the output is the 1st line of the 1st order + price + 2nd line of the 1st order ... and so on.

I have tried the image-to-data options to retrieve coordinates and put the text into separate boxes. But with this approach, different invoices have different spacing, so I am not able to build proper boxes.

Can anyone suggest a method for recognizing such images?
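One possible direction, sketched under assumptions: group word boxes into rows by vertical position, then split each row into cells wherever the horizontal gap is large. The function name and the `min_gap` and `row_tolerance` values are hypothetical and would need tuning per layout; the key names mirror the word-level fields (`text`, `left`, `top`, `width`, `height`) that pytesseract's `image_to_data` reports.

```python
def rows_and_cells(words, row_tolerance=0.5, min_gap=40):
    """Group OCR word boxes into rows, then split each row into cells."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"] + w["height"] / 2):
        centre = word["top"] + word["height"] / 2
        # Same row if the centre is close to the centre of the row's first word.
        if rows and abs(centre - rows[-1]["centre"]) <= row_tolerance * word["height"]:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})

    table = []
    for row in rows:
        cells, previous_right = [], None
        for word in sorted(row["words"], key=lambda w: w["left"]):
            if previous_right is None or word["left"] - previous_right > min_gap:
                cells.append([])  # large horizontal gap: start a new cell
            cells[-1].append(word["text"])
            previous_right = word["left"] + word["width"]
        table.append([" ".join(cell) for cell in cells])
    return table
```

Rows that contain only a description cell can then be appended to the description of the preceding row that also has a price cell.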

AttributeError

Hey guys,

I'm trying to create a template for an invoice with the following setup:

DEBUG:root:Date parsing: languages=[] date_formats=[]
DEBUG:root:Float parsing: decimal separator=,
DEBUG:root:keywords=['DeutschePostAG']
DEBUG:root:{'remove_whitespace': True, 'remove_accents': False, 'lowercase': False, 'currency': 'EUR', 'date_formats': [], 'languages': [], 'decimal_separator': ',', 'replace': []}
DEBUG:root:field=amount | regexp=Rechnungsbetrag(\W+)?(inklusive(\W+)?Umsatzsteuer)(\W+)?(\d+.\d+,\d+)(\W+)?EUR

The result is:

DEBUG:root:res_find=[('', '', '', '30.277,88', '')]

But I get the following error:

Traceback (most recent call last):
File "/usr/local/bin/invoice2data", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/invoice2data/main.py", line 84, in main
res = extract_data(f.name, templates=templates)
File "/usr/local/lib/python3.6/site-packages/invoice2data/main.py", line 40, in extract_data
return t.extract(optimized_str)
File "/usr/local/lib/python3.6/site-packages/invoice2data/template.py", line 151, in extract
amount_pipe = res_find[0].replace(self.options['decimal_separator'], '|')
AttributeError: 'tuple' object has no attribute 'replace'

Thanks for your help.
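The tuple in `res_find` is what Python's `re.findall` returns when a pattern contains more than one capturing group. The sketch below reproduces that behaviour and shows a variant that captures only the amount. The sample text is an assumption (the invoice line after whitespace removal), and the second pattern is one possible rewrite, not the project's fix.

```python
import re

# Sample text is an assumption: the invoice line after whitespace removal.
text = "RechnungsbetraginklusiveUmsatzsteuer30.277,88EUR"

# Capturing groups everywhere: findall returns one tuple per match.
capturing = (
    r"Rechnungsbetrag(\W+)?(inklusive(\W+)?Umsatzsteuer)"
    r"(\W+)?(\d+.\d+,\d+)(\W+)?EUR"
)
tuple_result = re.findall(capturing, text)

# Only the amount is captured: findall returns plain strings.
non_capturing = r"Rechnungsbetrag\W*inklusive\W*Umsatzsteuer\W*(\d+\.\d+,\d+)\W*EUR"
string_result = re.findall(non_capturing, text)
```

With a single capturing group, `string_result[0]` is a string, so a later `.replace()` call on it works.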

How about using machine learning to extract entities from invoices?

The below library is good at performing extraction from invoices.
https://github.com/mit-nlp/MITIE

If you think it is interesting, I would be glad to collaborate. Thanks

Adding a search feature

Hello @m3nu, how about adding a search feature that finds invoices across many PDF files based on keywords such as name, amount, or other user input? This would be easy to add and very useful. I have already implemented it in my local repo.

Why is OCR for scanned pdf not working at the moment

Allow custom data fields to be retrieved

Would it be possible to improve French EDF Entreprises template to retrieve energy consumption?

How to extract from a particular position using regex ?

I want to extract the Ship To details alone, but I am able to extract only the entire text from Ship to till the end of the specified text or line using (SHIP TO :(.*\n){5}) expression.
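A sketch of one way to capture only the address block: keep the label outside the capturing group and stop at the first blank line. The sample text and the assumption that a blank line ends the address are hypothetical.

```python
import re

# Hypothetical extracted text; a blank line is assumed to end the address.
text = (
    "BILL TO : ACME Corp\n"
    "SHIP TO :\n"
    "John Doe\n"
    "12 Main St\n"
    "Springfield\n"
    "\n"
    "Items\n"
)

# The label sits outside the group; [^\n]+ refuses empty lines, so the
# repetition ends at the first blank line after the label.
pattern = r"SHIP TO :[ \t]*\n?((?:[^\n]+\n?)+)"
match = re.search(pattern, text)
ship_to_lines = match.group(1).strip().splitlines() if match else []
```

If the address always has a fixed number of lines, replacing the final `+` with a bounded count such as `{1,5}` limits how far the match can run.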

Extraction of field from invoice

Hello, I was wondering if there is any way to extract the name or the address from the "buyer" field of this invoice using invoice2data.
inv4.pdf

pip install fails

Here is the error:

odoo@ns348518:~/erp$ sudo pip install invoice2data
[sudo] password for odoo: 
Downloading/unpacking invoice2data
  Downloading invoice2data-0.2.0.tar.gz (358kB): 358kB downloaded
  Running setup.py (path:/tmp/pip_build_root/invoice2data/setup.py) egg_info for package invoice2data
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_root/invoice2data/setup.py", line 3, in <module>
        version = open('VERSION').read().strip()
    IOError: [Errno 2] No such file or directory: 'VERSION'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/tmp/pip_build_root/invoice2data/setup.py", line 3, in <module>

    version = open('VERSION').read().strip()

IOError: [Errno 2] No such file or directory: 'VERSION'

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/invoice2data
Storing debug log for failure in /home/odoo/.pip/pip.log

Renamed files name structure

It would be great if the filename structure for renamed files could be defined in the template.
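A sketch of how a template-defined pattern might be applied, using Python's `str.format`. The function name `build_filename` and the pattern syntax are assumptions, not an existing invoice2data option.

```python
import datetime


def build_filename(pattern, fields):
    """Fill a filename pattern such as '{date:%Y-%m-%d} {invoice_number}.pdf'."""
    name = pattern.format(**fields)
    # Replace path separators so a field value cannot escape the target folder.
    return name.replace("/", "_").replace("\\", "_")


fields = {
    "date": datetime.date(2014, 5, 7),
    "invoice_number": "30064443",
    "desc": "Invoice 30064443 from QualityHosting",
}
filename = build_filename("{date:%Y-%m-%d} {invoice_number} {desc}.pdf", fields)
```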

Structure templates in yml files and allow multiple files

First, the current template.py is really a data structure rather than Python code. It should be in JSON or, even better, YAML. Next, it should be possible to add separate templates, depending on the current project.

Trouble Installing Poppler/pdftotext

Hi,

I'm trying to use this as a lib on Python 3.6 on Windows. However, I'm having trouble installing pdftotext. When I do

pip install pdftotext

It first gives me the following error message:

pdftotext/pdftotext.cpp(4): fatal error C1083: Cannot open include file: 'poppler/cpp/poppler-document.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit status 2

After Googling for a while, I downloaded a Poppler Windows binary and added the directory to the INCLUDE environment variable. Now I'm getting the following error message:

LINK : fatal error LNK1181: cannot open input file 'poppler-cpp.lib'
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit status 1181

I suppose I should add the directory containing 'poppler-cpp.lib' to the LIB environment variable. But I couldn't find it anywhere.

Any help is appreciated!

Thanks,

Brainstorming new features

After looking at the available literature, here are some ideas for new features:

web interface and positional fields, like docparser[1] does. This means supporting different field types. Regex, positional, etc.
Export to UBL, which is a standardized version of XML, instead of plain XML [2]
choose and classify fields with ML. [3, 4, 5]
testing and improvements to OCR module.
GUI to choose folders and output options.
GUI to select invoice fields.
Test on more Python versions. Some problems with Unicode on Windows.
Follow Python API checklist [6]

1: https://docparser.com/
2: https://en.wikipedia.org/wiki/Universal_Business_Language
3: https://medium.com/tradeshift-engineering/scaling-up-machine-learning-algorithm-for-form-recognition-bd09b319e14a
4: https://arxiv.org/pdf/1708.07403.pdf
5: http://cs229.stanford.edu/proj2016/report/LiuWanZhang-UnstructuredDocumentRecognitionOnBusinessInvoice-report.pdf
6: http://python.apichecklist.com/
