benchmarks's Introduction

PDF Library Benchmarks

This benchmark covers reading digitally created ("pure") PDF files. It does not cover scanned documents or documents to which OCR was applied.

Benchmarking machine

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz

Input Documents

#   Name          File Size   Pages
1   2201.00214    2.4MiB      22
2   GeoTopo-book  5.1MiB      117
3   2201.00151    1.5MiB      12
4   1707.09725    7.0MiB      134
5   2201.00021    2.6MiB      10
6   2201.00037    2.9MiB      33
7   2201.00069    14.7MiB     15
8   2201.00178    2.3MiB      16
9   2201.00201    1.3MiB      9
10  1602.06541    2.9MiB      16
11  2201.00200    284.8KiB    7
12  2201.00022    1.1MiB      11
13  2201.00029    797.6KiB    12
14  1601.03642    1004.9KiB   8

Libraries

Name          Last PyPI Release  License                          Version   Dependencies
Borb          2023-06-23         AGPL/Commercial                  2.1.16
pypdfium2     2023-07-04         Apache-2.0 or BSD-3-Clause       4.18.0    PDFium (Foxit/Google)
pdfminer.six  2022-11-05         MIT/X                            20221105
pdfplumber    2023-07-29         MIT                              0.10.2    pdfminer.six
pdfrw         2017-09-18         MIT                              0.4
pdftotext     -                  GPL                              0.86.1    build-essential libpoppler-cpp-dev pkg-config python3-dev
PyMuPDF       2023-08-24         GNU AFFERO GPL 3.0 / Commercial  1.23.1    MuPDF
pypdf         2023-08-26         BSD 3-Clause                     3.15.4
Tika          2023-01-01         Apache v2                        2.6.0     Apache Tika

Text Extraction Speed

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 PyMuPDF 0.1s 0.4s 0.2s 0.2s 0.2s 0.0s 0.1s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s
2 pypdfium2 0.2s 1.9s 0.2s 0.2s 0.2s 0.0s 0.1s 0.1s 0.1s 0.0s 0.1s 0.0s 0.0s 0.0s 0.0s
3 pdftotext 0.3s 0.8s 1.0s 0.3s 0.8s 0.1s 0.2s 0.2s 0.1s 0.0s 0.1s 0.1s 0.1s 0.0s 0.0s
4 Tika 1.1s 12.9s 0.9s 0.6s 0.4s 0.1s 0.3s 0.2s 0.1s 0.1s 0.1s 0.1s 0.1s 0.0s 0.0s
5 pypdf 2.6s 18.7s 4.8s 5.3s 2.3s 0.7s 0.9s 0.4s 0.5s 0.3s 0.6s 0.5s 0.4s 0.4s 0.2s
6 pdfminer.six 4.5s 26.0s 12.9s 8.0s 4.6s 1.3s 2.1s 1.0s 1.2s 0.8s 1.5s 0.9s 0.9s 0.6s 0.6s
7 pdfplumber 6.7s 41.7s 10.9s 11.5s 8.4s 2.4s 4.3s 2.0s 1.9s 1.9s 2.7s 1.8s 1.7s 1.0s 1.2s
8 Borb 34.7s 111.2s 105.0s 1.4s 87.2s 21.1s 7.4s 83.5s 16.4s 20.3s 5.4s 3.4s 18.8s 3.2s 2.1s

Image Extraction Speed

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 PyMuPDF 0.5s 0.3s 0.5s 0.0s 1.7s 0.4s 0.0s 3.2s 0.4s 0.4s 0.1s 0.0s 0.3s 0.2s 0.0s
2 pypdf 2.8s 16.4s 2.1s 0.8s 9.2s 1.1s 0.0s 6.7s 0.9s 0.9s 0.4s 0.0s 0.7s 0.2s 0.1s
3 pdfminer.six 6.5s 31.8s 13.7s 9.2s 24.0s 1.5s 2.3s 1.5s 1.4s 0.9s 1.5s 0.9s 1.0s 0.6s 0.5s

Watermarking Speed

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 PyMuPDF 0.0s 0.0s 0.1s 0.0s 0.1s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s 0.0s
2 pdfrw 0.1s 0.0s 0.4s 0.0s 0.3s 0.1s 0.1s 0.1s 0.1s 0.1s 0.1s 0.0s 0.1s 0.0s 0.0s
3 pypdf 0.4s 0.6s 1.7s 0.4s 0.9s 0.2s 0.3s 0.4s 0.3s 0.2s 0.3s 0.1s 0.2s 0.0s 0.2s

Watermarking File Size

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 pdfrw 3.4MB 2.5MB 5.7MB 1.6MB 7.3MB 2.7MB 3.1MB 15.4MB 2.4MB 1.3MB 3.0MB 0.3MB 1.1MB 0.8MB 1.0MB
2 pypdf 3.5MB 2.5MB 5.7MB 1.6MB 7.3MB 2.7MB 3.1MB 15.4MB 2.4MB 1.3MB 3.0MB 0.3MB 1.1MB 0.8MB 1.0MB
3 PyMuPDF 3.7MB 2.7MB 6.8MB 1.7MB 8.5MB 2.8MB 3.4MB 15.5MB 2.5MB 1.4MB 3.2MB 0.3MB 1.2MB 0.9MB 1.1MB

Text Extraction Quality

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 pypdfium2 98% 99% 97% 94% 99% 98% 96% 99% 98% 99% 99% 98% 98% 99% 99%
2 pypdf 97% 98% 93% 94% 98% 98% 96% 97% 98% 99% 99% 98% 98% 98% 99%
3 PyMuPDF 97% 98% 96% 93% 97% 98% 96% 98% 98% 98% 98% 97% 97% 98% 99%
4 Tika 96% 99% 98% 92% 97% 98% 96% 93% 97% 98% 93% 98% 93% 98% 96%
5 pdftotext 93% 96% 93% 91% 94% 92% 96% 96% 96% 97% 83% 94% 96% 96% 79%
6 pdfminer.six 90% 95% 79% 86% 92% 86% 93% 95% 93% 92% 92% 93% 86% 98% 86%
7 pdfplumber 75% 94% 84% 61% 97% 61% 93% 61% 89% 57% 59% 67% 59% 98% 67%
8 Borb 45% 70% 79% 0% 40% 48% 92% 0% 64% 51% 41% 55% 43% 0% 53%
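
The Average column can be re-derived from the per-document scores. The sketch below is only a sanity check, not the benchmark's own code: `parse_row` is a hypothetical helper, and it assumes the average is a plain mean rounded to a whole percent. Because the per-document values are themselves rounded, other rows could differ by a point.

```python
def parse_row(line):
    """Split one flattened table row into (library, average, per-document scores)."""
    _rank, library, average, *scores = line.split()
    to_pct = lambda cell: int(cell.rstrip("%"))
    return library, to_pct(average), [to_pct(cell) for cell in scores]

# Rows copied verbatim from the table above.
rows = [
    "1 pypdfium2 98% 99% 97% 94% 99% 98% 96% 99% 98% 99% 99% 98% 98% 99% 99%",
    "2 pypdf 97% 98% 93% 94% 98% 98% 96% 97% 98% 99% 99% 98% 98% 98% 99%",
    "8 Borb 45% 70% 79% 0% 40% 48% 92% 0% 64% 51% 41% 55% 43% 0% 53%",
]

for row in rows:
    library, listed, scores = parse_row(row)
    # Assumption: plain arithmetic mean over the 14 documents, rounded to whole percent.
    recomputed = round(sum(scores) / len(scores))
    print(f"{library}: listed {listed}%, recomputed {recomputed}%")
```

For these three rows the recomputed value matches the listed average.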

benchmarks's People

Contributors

mara004, martinthoma, mqq-marek, ruddyscent

benchmarks's Issues

pdfrw vs pypdf page extraction & merge

Test run on Python 3.8, Windows 7:

  • I picked four arbitrary page numbers (4, 6, 8 and 9).
  • From each PDF file listed in the benchmark, I extracted those pages (where available).
  • I then created a new PDF from the extracted pages, repeating them between 1 and 5 times, to check how well pdfrw and pypdf optimize the size of output PDFs that contain repeated content. The output PDFs therefore have up to 4 x 5 = 20 pages.
  • I measured the elapsed time and the output file sizes.
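
The page-selection step described above can be sketched in isolation. This is an illustration only: `select_pages` is a hypothetical helper, not part of pdfrw or pypdf, and it assumes page numbers are 1-based, so a page whose number equals the page count is kept.

```python
def select_pages(base_pages, repeats, total_pages):
    """Repeat the 1-based page numbers `repeats` times and keep those that exist."""
    wanted = list(base_pages) * repeats
    return [p for p in wanted if 1 <= p <= total_pages]

# A 12-page document has all four pages; an 8-page document lacks page 9.
for repeats in range(1, 6):
    full = select_pages([4, 6, 8, 9], repeats, total_pages=12)
    short = select_pages([4, 6, 8, 9], repeats, total_pages=8)
    print(repeats, len(full), len(short))
```

With five repeats the 12-page document yields the maximum of 20 output pages, while the 8-page document yields 15.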

As far as I recall, my initial code also deleted the original bookmarks/annotations from the PDFs, but I removed that part for simplicity and left comments pointing to where I had read about it.

Code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

def fsize(filepath):
	import os
	finfo = os.stat(filepath)
	fsize = finfo.st_size
	KB = "%.2f" % (fsize/1024)
	return([fsize,KB])
	
#@profile
def createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=None, pageslist=None, destpdf=None, debug=False):
	""" <https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py>
		"""
	from pdfrw import PdfWriter,PdfReader
	#import pdfrw_bookmarks # code from https://github.com/pmaupin/pdfrw/issues/52#issuecomment-271190546
	pages = PdfReader(sourcepdf).pages
	totalpages = len(pages)
	outdata = PdfWriter(destpdf)
	for p in pageslist:
		if p<=totalpages:  # p is 1-based (pages[p-1] below), so the last page is p == totalpages
			if debug: print("pdfrw ",p)
			#pdfrw_pageannots(pages[p-1])
			outdata.addpage(pages[p-1])
	outdata.write()

#@profile
def createpdf_from_sourcepdf_pages_pypdf(sourcepdf=None, pageslist=None, destpdf=None, debug=False, compress=False):
	""" Generate destpdf with list of certain pages taken from sourcepdf. 
		- <https://pypdf2.readthedocs.io/en/stable/user/merging-pdfs.html>
		- SO [Extract specific pages of PDF and save it with Python](https://stackoverflow.com/a/51885963/710788)
		"""
	from pypdf import PdfWriter,PdfReader
	fsource = open(sourcepdf, "rb")
	merger = PdfWriter()
	totalpages = len(PdfReader(fsource).pages)
	for p in pageslist:
		if p<=totalpages:  # p is 1-based, so the last page is p == totalpages
			if debug: print("pypdf ",p)
			# add 1-based page p; pages=(start, stop) is a 0-based, half-open range:
			merger.append(fileobj=fsource, pages=(p-1,p))
	if compress: # Compress the data
		for page in merger.pages:
			page.compress_content_streams()  # This is CPU intensive!
	# Write to an output PDF document
	output = open(destpdf, "wb")
	merger.write(output)
	# Close file descriptors, including the source file opened above
	merger.close()
	output.close()
	fsource.close()

#from memory_profiler import profile
#@profile
def pypdf_vs_pdfrw():
	""" [performance comparative](https://github.com/pmaupin/pdfrw/issues/232#issuecomment-1436153435) between two packages:
		- pdfrw
		- pypdf
		"""
	print(datetime.now() - startTime, " before comparing")
	pdfurls = [
		"https://arxiv.org/pdf/2201.00151.pdf",
		"https://arxiv.org/pdf/1707.09725.pdf",
		"https://arxiv.org/pdf/2201.00021.pdf",
		"https://arxiv.org/pdf/2201.00037.pdf",
		"https://arxiv.org/pdf/2201.00069.pdf",
		"https://arxiv.org/pdf/2201.00178.pdf",
		"https://arxiv.org/pdf/2201.00201.pdf",
		"https://arxiv.org/pdf/1602.06541.pdf",
		"https://arxiv.org/pdf/2201.00200.pdf",
		"https://arxiv.org/pdf/2201.00022.pdf",
		"https://arxiv.org/pdf/2201.00029.pdf",
		"https://arxiv.org/pdf/1601.03642.pdf",
		]
	import requests,os
	pdfrw_Tsize = 0
	pdfrw_Ttime = 0
	pypdf_Tsize = 0
	pypdf_Ttime = 0
	for pdfurl in pdfurls:
		sourcepdf = pdfurl.split("/")[-1]
		if not os.path.exists(sourcepdf):
			response = requests.get(pdfurl, headers=None, params=None)
			if response.status_code == 200:
				with open(sourcepdf, 'wb') as f:
					f.write(response.content)
			else:
				print(response.status_code)
				print("COULDN'T DOWNLOAD  '{}' FILE:\n".format(pdfurl))
		if not os.path.exists(sourcepdf):
			print("\n","-_"*40,"\n\nSKIPPING '{}' FILE:\n".format(sourcepdf))
		else:
			print("\n","-_"*40,"\n\nTESTING WITH '{}' FILE:\n".format(sourcepdf))
			for i in range(1,6):
				pageslist=[4,6,8,9]*i #*5 eats all my memory when using pypdf with large pdf files
				print("-"*50,"\npageslist:",pageslist)
				start=datetime.now()
				destpdf=sourcepdf+"_pdfrw-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
				createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf); 
				pdfrw_t = round((datetime.now() - start).total_seconds(),3)
				pdfrw_s = fsize(destpdf)
				pdfrw_Ttime += pdfrw_t
				pdfrw_Tsize += pdfrw_s[0]
				print("pdfrw: {} KB output size, took {} seconds".format(pdfrw_s[1],pdfrw_t))
				start=datetime.now()
				destpdf=sourcepdf+"_pypdf-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
				createpdf_from_sourcepdf_pages_pypdf(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf);
				pypdf_t = round((datetime.now() - start).total_seconds(),3)
				pypdf_s = fsize(destpdf)
				pypdf_Ttime += pypdf_t
				pypdf_Tsize += pypdf_s[0]
				print("pypdf: {} KB output size, took {} seconds".format(pypdf_s[1],pypdf_t))
				print("pypdf_time / pdfrw_time = {} ratio".format(round(pypdf_t/pdfrw_t, 2)))
				print("pypdf_size / pdfrw_size = {} ratio".format(round(pypdf_s[0]/pdfrw_s[0], 2)))
			
	import pdfrw,pypdf
	print("-_"*40)
	print("\n pdfrw.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
		pdfrw.__version__, pdfrw_Tsize/1024/1024, pdfrw_Ttime))
	print("\n pypdf.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
		pypdf.__version__, pypdf_Tsize/1024/1024, pypdf_Ttime))

if __name__ == "__main__":
	import sys
	from datetime import datetime
	startTime = datetime.now()
	print("START: ",startTime)
	pypdf_vs_pdfrw()
	endTime = datetime.now()
	print("\nEND: ",endTime)
	print("\nTOTAL TIME: ",endTime-startTime)

OUTPUT:

START:  2023-07-01 22:06:17.718288
0:00:00  before comparing

 -_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_

TESTING WITH '2201.00151.pdf' FILE:

--------------------------------------------------
pageslist: [4, 6, 8, 9]
pdfrw: 591.29 KB output size, took 0.109 seconds
pypdf: 660.78 KB output size, took 0.499 seconds
pypdf_time / pdfrw_time = 4.58 ratio
pypdf_size / pdfrw_size = 1.12 ratio
--------------------------------------------------
(... LINES DELETED TO AVOID TOO LONG OUTPUT ...)
--------------------------------------------------
pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 130.20 KB output size, took 0.047 seconds
pypdf: 836.60 KB output size, took 1.031 seconds
pypdf_time / pdfrw_time = 21.94 ratio
pypdf_size / pdfrw_size = 6.43 ratio
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_

 pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.47 seconds

 pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 108.14 seconds

END:  2023-07-01 22:08:11.767827

TOTAL TIME:  0:01:54.049539
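
For an overall comparison, the accumulated totals can be turned into ratios the same way the script does per file. The numbers below are copied from the output above; the variable names are only illustrative.

```python
# Totals copied from the run output (pdfrw 0.5.0 vs pypdf 3.2.0).
pdfrw_total = {"size_mb": 50.73, "seconds": 4.47}
pypdf_total = {"size_mb": 193.77, "seconds": 108.14}

time_ratio = round(pypdf_total["seconds"] / pdfrw_total["seconds"], 2)
size_ratio = round(pypdf_total["size_mb"] / pdfrw_total["size_mb"], 2)

print("pypdf_time / pdfrw_time = {} ratio".format(time_ratio))
print("pypdf_size / pdfrw_size = {} ratio".format(size_ratio))
```

In this run pypdf took roughly 24 times as long as pdfrw overall and produced close to 4 times as much output.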
