PDF Text Extractor

A command-line tool that extracts text from PDF files and generates text statistics for a given list of keywords. Based on Apache PDFBox.

User Guide

  1. Download or build the latest release of pdf-extractor-{version}.jar (the JAR)

  2. Move the JAR to a folder of your choice

  3. Prepare the paths (relative or absolute) for the following:

    1. The {keywords_file}, e.g. keywords/keywords.txt
      • Must be a plain-text file; any file extension is accepted
    2. A {pdf_folder} that contains the PDF files to extract text from, e.g. pdf/
      • Only files with a .pdf extension will be processed
    3. An {output_file} path, e.g. output/output.xlsx
      • The file name must end with .xlsx
  4. Open Terminal or Command Prompt and navigate to the folder that contains the JAR

  5. Run the JAR with the following command:

    java -jar pdf-extractor-{version}.jar --keyword-file-path {keywords_file} --pdf-folder-path {pdf_folder} --output-file-path {output_file} --parallel --case-sensitive
    • Mandatory flags
      • --keyword-file-path: path of {keywords_file}
      • --pdf-folder-path: path of {pdf_folder}
      • --output-file-path: path of {output_file}
    • Optional (but important) flags
      • --parallel: enables parallel processing
        • if this flag is not set, the program uses sequential processing
      • --case-sensitive: enables case-sensitive matching
        • if this flag is not set, the program converts both the keywords and the extracted text to lower case before comparing
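
The effect of the two optional flags can be illustrated with a small, self-contained Java sketch. It is not the tool's actual source: the class name `FlagSketch`, the method names, and the choice to count non-overlapping substring occurrences are assumptions made for illustration. Only the lower-casing behaviour and the sequential/parallel switch come from the flag descriptions above.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names and counting rule are assumptions, not the tool's code.
public class FlagSketch {

    // Counts non-overlapping occurrences of one keyword in one text.
    static int countOccurrences(String text, String keyword, boolean caseSensitive) {
        // Mirrors the documented behaviour when --case-sensitive is not set:
        // both the keyword and the text are lower-cased before comparing.
        String haystack = caseSensitive ? text : text.toLowerCase(Locale.ROOT);
        String needle = caseSensitive ? keyword : keyword.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Counts one keyword in every text and returns the counts in input order.
    static List<Integer> countInAll(List<String> texts, String keyword,
                                    boolean caseSensitive, boolean parallel) {
        // Mirrors --parallel: a parallel stream when set, a sequential one otherwise.
        Stream<String> stream = parallel ? texts.parallelStream() : texts.stream();
        return stream.map(text -> countOccurrences(text, keyword, caseSensitive))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> texts = List.of(
                "Apache PDFBox reads PDF files.",
                "pdfbox is a Java library; PDFBox is open source.");
        System.out.println(countInAll(texts, "PDFBox", true, false));  // prints [1, 1]
        System.out.println(countInAll(texts, "PDFBox", false, true));  // prints [1, 2]
    }
}
```

In this sketch the parallel and sequential runs return the same counts in the same order; only the way the work is scheduled differs.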

An Example

java -jar pdf-extractor-2.0.0.jar --keyword-file-path "keywords/keywords.txt" --pdf-folder-path "pdf/" --output-file-path "output/output.xlsx"
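
Before running a command like the one above, it can help to check that the paths meet the requirements from the User Guide. The following Java sketch shows one way to do that. The class `InputCheckSketch` and its methods are hypothetical and not part of pdf-extractor, and the case-sensitive `.pdf` and `.xlsx` checks are an assumption about how strictly the extensions are matched.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical pre-flight check; not part of pdf-extractor itself.
public class InputCheckSketch {

    // True when the file name ends with ".pdf" (assumed to be case-sensitive).
    static boolean hasPdfExtension(Path file) {
        Path name = file.getFileName();
        return name != null && name.toString().endsWith(".pdf");
    }

    // True when the output file name ends with ".xlsx".
    static boolean isValidOutputPath(String outputFile) {
        return outputFile.endsWith(".xlsx");
    }

    // Lists regular files directly inside the folder that have a .pdf extension.
    static List<Path> listPdfFiles(Path pdfFolder) throws IOException {
        try (Stream<Path> entries = Files.list(pdfFolder)) {
            return entries.filter(Files::isRegularFile)
                          .filter(InputCheckSketch::hasPdfExtension)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults match the example paths used in this guide.
        Path pdfFolder = Path.of(args.length > 0 ? args[0] : "pdf/");
        String outputFile = args.length > 1 ? args[1] : "output/output.xlsx";

        if (!isValidOutputPath(outputFile)) {
            System.err.println("Output file name must end with .xlsx: " + outputFile);
            return;
        }
        if (!Files.isDirectory(pdfFolder)) {
            System.err.println("Not a folder: " + pdfFolder);
            return;
        }
        // Print the files that would qualify for processing.
        listPdfFiles(pdfFolder).forEach(System.out::println);
    }
}
```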

Dependencies

  1. org.apache.commons:commons-lang3
  2. org.apache.pdfbox:pdfbox
  3. org.apache.poi:poi
  4. org.apache.poi:poi-ooxml
  5. org.javatuples:javatuples

Contributors

jameshskoh
