PDF Library Benchmarks

This benchmark is about reading pure PDF files, that is, neither scanned documents nor documents that have had OCR applied.

Benchmarking machine

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz

Input Documents

#	Name	File Size	Pages
1	2201.00214	2.4MiB	22
2	GeoTopo-book	5.1MiB	117
3	2201.00151	1.5MiB	12
4	1707.09725	7.0MiB	134
5	2201.00021	2.6MiB	10
6	2201.00037	2.9MiB	33
7	2201.00069	14.7MiB	15
8	2201.00178	2.3MiB	16
9	2201.00201	1.3MiB	9
10	1602.06541	2.9MiB	16
11	2201.00200	284.8KiB	7
12	2201.00022	1.1MiB	11
13	2201.00029	797.6KiB	12
14	1601.03642	1004.9KiB	8

Libraries

Name	Last PyPI Release	License	Version	Dependencies
Borb	2023-06-23	AGPL/Commercial	2.1.16
pypdfium2	2023-07-04	Apache-2.0 or BSD-3-Clause	4.18.0	PDFium (Foxit/Google)
pdfminer.six	2022-11-05	MIT/X	20221105
pdfplumber	2023-07-29	MIT	0.10.2	pdfminer.six
pdfrw	2017-09-18	MIT	0.4
pdftotext	-	GPL	0.86.1	build-essential libpoppler-cpp-dev pkg-config python3-dev
PyMuPDF	2023-08-24	GNU AFFERO GPL 3.0 / Commercial	1.23.1	MuPDF
pypdf	2023-08-26	BSD 3-Clause	3.15.4
Tika	2023-01-01	Apache v2	2.6.0	Apache Tika

Text Extraction Speed

#	Library	Average	1	2	3	4	5	6	7	8	9	10	11	12	13	14
1	PyMuPDF	0.1s	0.4s	0.2s	0.2s	0.2s	0.0s	0.1s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s
2	pypdfium2	0.2s	1.9s	0.2s	0.2s	0.2s	0.0s	0.1s	0.1s	0.1s	0.0s	0.1s	0.0s	0.0s	0.0s	0.0s
3	pdftotext	0.3s	0.8s	1.0s	0.3s	0.8s	0.1s	0.2s	0.2s	0.1s	0.0s	0.1s	0.1s	0.1s	0.0s	0.0s
4	Tika	1.1s	12.9s	0.9s	0.6s	0.4s	0.1s	0.3s	0.2s	0.1s	0.1s	0.1s	0.1s	0.1s	0.0s	0.0s
5	pypdf	2.6s	18.7s	4.8s	5.3s	2.3s	0.7s	0.9s	0.4s	0.5s	0.3s	0.6s	0.5s	0.4s	0.4s	0.2s
6	pdfminer.six	4.5s	26.0s	12.9s	8.0s	4.6s	1.3s	2.1s	1.0s	1.2s	0.8s	1.5s	0.9s	0.9s	0.6s	0.6s
7	pdfplumber	6.7s	41.7s	10.9s	11.5s	8.4s	2.4s	4.3s	2.0s	1.9s	1.9s	2.7s	1.8s	1.7s	1.0s	1.2s
8	Borb	34.7s	111.2s	105.0s	1.4s	87.2s	21.1s	7.4s	83.5s	16.4s	20.3s	5.4s	3.4s	18.8s	3.2s	2.1s
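The "Average" column is consistent with the arithmetic mean of the 14 per-document timings. A harness along the following lines could gather such numbers. This is a minimal sketch and not the benchmark's actual code: `time_extraction`, `extract_fn` and `fake_extract` are hypothetical names, and the stand-in extractor would be replaced by a real library call.

```python
import time
from statistics import mean


def time_extraction(extract_fn, documents):
    """Time extract_fn on each document; return (per-document seconds, average)."""
    timings = []
    for doc in documents:
        start = time.perf_counter()
        extract_fn(doc)  # the extracted text is discarded; only the duration matters
        timings.append(time.perf_counter() - start)
    return timings, mean(timings)


def fake_extract(doc):
    # Hypothetical stand-in for a real extraction call on one document
    return doc.upper()


timings, average = time_extraction(fake_extract, ["a.pdf", "b.pdf", "c.pdf"])
print(["%.1fs" % t for t in timings], "average %.1fs" % average)
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock, which matters when many timings round to 0.0s as they do in the table.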

Image Extraction Speed

#	Library	Average	1	2	3	4	5	6	7	8	9	10	11	12	13	14
1	PyMuPDF	0.5s	0.3s	0.5s	0.0s	1.7s	0.4s	0.0s	3.2s	0.4s	0.4s	0.1s	0.0s	0.3s	0.2s	0.0s
2	pypdf	2.8s	16.4s	2.1s	0.8s	9.2s	1.1s	0.0s	6.7s	0.9s	0.9s	0.4s	0.0s	0.7s	0.2s	0.1s
3	pdfminer.six	6.5s	31.8s	13.7s	9.2s	24.0s	1.5s	2.3s	1.5s	1.4s	0.9s	1.5s	0.9s	1.0s	0.6s	0.5s

Watermarking Speed

#	Library	Average	1	2	3	4	5	6	7	8	9	10	11	12	13	14
1	PyMuPDF	0.0s	0.0s	0.1s	0.0s	0.1s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s	0.0s
2	pdfrw	0.1s	0.0s	0.4s	0.0s	0.3s	0.1s	0.1s	0.1s	0.1s	0.1s	0.1s	0.0s	0.1s	0.0s	0.0s
3	pypdf	0.4s	0.6s	1.7s	0.4s	0.9s	0.2s	0.3s	0.4s	0.3s	0.2s	0.3s	0.1s	0.2s	0.0s	0.2s

Watermarking File Size

#	Library	Average	1	2	3	4	5	6	7	8	9	10	11	12	13	14
1	pdfrw	3.4MB	2.5MB	5.7MB	1.6MB	7.3MB	2.7MB	3.1MB	15.4MB	2.4MB	1.3MB	3.0MB	0.3MB	1.1MB	0.8MB	1.0MB
2	pypdf	3.5MB	2.5MB	5.7MB	1.6MB	7.3MB	2.7MB	3.1MB	15.4MB	2.4MB	1.3MB	3.0MB	0.3MB	1.1MB	0.8MB	1.0MB
3	PyMuPDF	3.7MB	2.7MB	6.8MB	1.7MB	8.5MB	2.8MB	3.4MB	15.5MB	2.5MB	1.4MB	3.2MB	0.3MB	1.2MB	0.9MB	1.1MB
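The input sizes are listed in MiB while the output sizes here are in MB, so the two tables are not directly comparable. The sketch below converts between them, assuming MiB means 2^20 bytes and MB means 10^6 bytes (the source does not define either unit); `mib_to_mb` is a hypothetical helper name.

```python
MIB = 1024 ** 2   # bytes per mebibyte (assumed meaning of "MiB")
MB = 1000 ** 2    # bytes per megabyte (assumed meaning of "MB")


def mib_to_mb(mib):
    """Convert a size in MiB to MB."""
    return mib * MIB / MB


# Input sizes of documents 1 and 7 from the Input Documents table
for name, mib in [("2201.00214", 2.4), ("2201.00069", 14.7)]:
    print(name, round(mib_to_mb(mib), 1), "MB")
```

Under these assumptions the two inputs come to about 2.5 MB and 15.4 MB, which is close to the pdfrw and pypdf output sizes listed for those documents.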

Text Extraction Quality

#	Library	Average	1	2	3	4	5	6	7	8	9	10	11	12	13	14
1	pypdfium2	98%	99%	97%	94%	99%	98%	96%	99%	98%	99%	99%	98%	98%	99%	99%
2	pypdf	97%	98%	93%	94%	98%	98%	96%	97%	98%	99%	99%	98%	98%	98%	99%
3	PyMuPDF	97%	98%	96%	93%	97%	98%	96%	98%	98%	98%	98%	97%	97%	98%	99%
4	Tika	96%	99%	98%	92%	97%	98%	96%	93%	97%	98%	93%	98%	93%	98%	96%
5	pdftotext	93%	96%	93%	91%	94%	92%	96%	96%	96%	97%	83%	94%	96%	96%	79%
6	pdfminer.six	90%	95%	79%	86%	92%	86%	93%	95%	93%	92%	92%	93%	86%	98%	86%
7	pdfplumber	75%	94%	84%	61%	97%	61%	93%	61%	89%	57%	59%	67%	59%	98%	67%
8	Borb	45%	70%	79%	0%	40%	48%	92%	0%	64%	51%	41%	55%	43%	0%	53%
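The table does not say how the percentages are computed. One plausible way to score extracted text against a reference is a string-similarity ratio, sketched below with the standard library's `difflib`. The choice of metric and the name `extraction_quality` are assumptions made for illustration, not the benchmark's documented method.

```python
from difflib import SequenceMatcher


def extraction_quality(extracted, reference):
    """Similarity of extracted text to a reference, as a fraction in [0, 1]."""
    # autojunk=False disables the popularity heuristic, which can distort long texts
    return SequenceMatcher(None, extracted, reference, autojunk=False).ratio()


reference = "PDF Library Benchmarks"
perfect = extraction_quality(reference, reference)
empty = extraction_quality("", reference)
partial = extraction_quality("PDF Libary Benchmarks", reference)
print("{:.0%} {:.0%} {:.0%}".format(perfect, empty, partial))
```

Under this metric an identical extraction scores 1.0 and an empty extraction scores 0.0, with everything else in between.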

pdfrw vs pypdf page extraction & merge

Test run on Python 3.8, Windows 7:

I took 4 arbitrary page numbers (pages 4, 6, 8 and 9).
For each of the PDF files listed in the benchmark, I extracted those pages (where available).
Then I created a new PDF from the extracted pages, repeating them between 1 and 5 times to check how well pdfrw and pypdf optimize the size of PDFs that contain repetitive information. The output PDFs therefore have up to 4 x 5 = 20 pages.
I measured the time taken and the output sizes.

As I recall, my initial code also deleted the original bookmarks/annotations from the PDFs, but I removed that part for simplicity and left a comment pointing to where I had read about it.

Code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

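def fsize(filepath):
	""" Return [size in bytes, size in KB formatted to two decimals]. """
	import os
	nbytes = os.stat(filepath).st_size  # renamed so it no longer shadows the function name
	KB = "%.2f" % (nbytes/1024)
	return [nbytes, KB]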
	
#@profile
def createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=None, pageslist=None, destpdf=None, debug=False):
	""" Generate destpdf with the listed pages (1-based) taken from sourcepdf.
		- <https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py>
		"""
	from pdfrw import PdfWriter,PdfReader
	#import pdfrw_bookmarks # code from https://github.com/pmaupin/pdfrw/issues/52#issuecomment-271190546
	pages = PdfReader(sourcepdf).pages
	totalpages = len(pages)
	outdata = PdfWriter(destpdf)
	for p in pageslist:
		# p is 1-based, so the last page (p == totalpages) is valid too
		if p<=totalpages:
			if debug: print("pdfrw ",p)
			#pdfrw_pageannots(pages[p-1])
			outdata.addpage(pages[p-1])
	outdata.write()

#@profile
def createpdf_from_sourcepdf_pages_pypdf(sourcepdf=None, pageslist=None, destpdf=None, debug=False, compress=False):
	""" Generate destpdf with list of certain pages (1-based) taken from sourcepdf. 
		- <https://pypdf2.readthedocs.io/en/stable/user/merging-pdfs.html>
		- SO [Extract specific pages of PDF and save it with Python](https://stackoverflow.com/a/51885963/710788)
		"""
	from pypdf import PdfWriter,PdfReader
	fsource = open(sourcepdf, "rb")
	merger = PdfWriter()
	totalpages = len(PdfReader(fsource).pages)
	for p in pageslist:
		# p is 1-based, so the last page (p == totalpages) is valid too
		if p<=totalpages:
			if debug: print("pypdf ",p)
			# add page p, i.e. the 0-based half-open range (p-1, p):
			merger.append(fileobj=fsource, pages=(p-1,p))
	if compress: # Compress the data
		for page in merger.pages:
			page.compress_content_streams()  # This is CPU intensive!
	# Write to an output PDF document
	output = open(destpdf, "wb")
	merger.write(output)
	# Close File Descriptors (including the source file, which was previously left open)
	merger.close()
	output.close()
	fsource.close()

#from memory_profiler import profile
#@profile
def pypdf_vs_pdfrw():
	""" [performance comparative](https://github.com/pmaupin/pdfrw/issues/232#issuecomment-1436153435) between two packages:
		- pdfrw
		- pypdf
		"""
	print(datetime.now() - startTime, " before comparing")
	pdfurls = [
		"https://arxiv.org/pdf/2201.00151.pdf",
		"https://arxiv.org/pdf/1707.09725.pdf",
		"https://arxiv.org/pdf/2201.00021.pdf",
		"https://arxiv.org/pdf/2201.00037.pdf",
		"https://arxiv.org/pdf/2201.00069.pdf",
		"https://arxiv.org/pdf/2201.00178.pdf",
		"https://arxiv.org/pdf/2201.00201.pdf",
		"https://arxiv.org/pdf/1602.06541.pdf",
		"https://arxiv.org/pdf/2201.00200.pdf",
		"https://arxiv.org/pdf/2201.00022.pdf",
		"https://arxiv.org/pdf/2201.00029.pdf",
		"https://arxiv.org/pdf/1601.03642.pdf",
		]
	import requests,os
	pdfrw_Tsize = 0
	pdfrw_Ttime = 0
	pypdf_Tsize = 0
	pypdf_Ttime = 0
	for pdfurl in pdfurls:
		sourcepdf = pdfurl.split("/")[-1]
		if not os.path.exists(sourcepdf):
			response = requests.get(pdfurl, headers=None, params=None)
			if response.status_code == 200:
				with open(sourcepdf, 'wb') as f:
					f.write(response.content)
			else:
				print(response.status_code)
				print("COULDN'T DOWNLOAD  '{}' FILE:\n".format(pdfurl))
		if not os.path.exists(sourcepdf):
			print("\n","-_"*40,"\n\nSKIPPING '{}' FILE:\n".format(sourcepdf))
		else:
			print("\n","-_"*40,"\n\nTESTING WITH '{}' FILE:\n".format(sourcepdf))
			for i in range(1,6):
				pageslist=[4,6,8,9]*i #*5 eats all my memory when using pypdf with large pdf files
				print("-"*50,"\npageslist:",pageslist)
				start=datetime.now()
				destpdf=sourcepdf+"_pdfrw-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
				createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
				pdfrw_t = round((datetime.now() - start).total_seconds(),3)
				pdfrw_s = fsize(destpdf)
				pdfrw_Ttime += pdfrw_t
				pdfrw_Tsize += pdfrw_s[0]
				print("pdfrw: {} KB output size, took {} seconds".format(pdfrw_s[1],pdfrw_t))
				start=datetime.now()
				destpdf=sourcepdf+"_pypdf-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
				createpdf_from_sourcepdf_pages_pypdf(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
				pypdf_t = round((datetime.now() - start).total_seconds(),3)
				pypdf_s = fsize(destpdf)
				pypdf_Ttime += pypdf_t
				pypdf_Tsize += pypdf_s[0]
				print("pypdf: {} KB output size, took {} seconds".format(pypdf_s[1],pypdf_t))
				# Guard the time ratio: a coarse timer can report 0.0 seconds for pdfrw
				print("pypdf_time / pdfrw_time = {} ratio".format(round(pypdf_t/pdfrw_t, 2) if pdfrw_t else "n/a"))
				print("pypdf_size / pdfrw_size = {} ratio".format(round(pypdf_s[0]/pdfrw_s[0], 2)))
			
	import pdfrw,pypdf
	print("-_"*40)
	print("\n pdfrw.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
		pdfrw.__version__, pdfrw_Tsize/1024/1024, pdfrw_Ttime))
	print("\n pypdf.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
		pypdf.__version__, pypdf_Tsize/1024/1024, pypdf_Ttime))

if __name__ == "__main__":
	# datetime and startTime are module-level globals that pypdf_vs_pdfrw() relies on
	from datetime import datetime
	startTime = datetime.now()
	print("START: ",startTime)
	pypdf_vs_pdfrw()
	endTime = datetime.now()
	print("\nEND: ",endTime)
	print("\nTOTAL TIME: ",endTime-startTime)

OUTPUT:

START:  2023-07-01 22:06:17.718288
0:00:00  before comparing

 -_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_

TESTING WITH '2201.00151.pdf' FILE:

--------------------------------------------------
pageslist: [4, 6, 8, 9]
pdfrw: 591.29 KB output size, took 0.109 seconds
pypdf: 660.78 KB output size, took 0.499 seconds
pypdf_time / pdfrw_time = 4.58 ratio
pypdf_size / pdfrw_size = 1.12 ratio
--------------------------------------------------
(... LINES DELETED TO AVOID TOO LONG OUTPUT ...)
--------------------------------------------------
pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 130.20 KB output size, took 0.047 seconds
pypdf: 836.60 KB output size, took 1.031 seconds
pypdf_time / pdfrw_time = 21.94 ratio
pypdf_size / pdfrw_size = 6.43 ratio
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_

 pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.47 seconds

 pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 108.14 seconds

END:  2023-07-01 22:08:11.767827

TOTAL TIME:  0:01:54.049539
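The script prints per-file ratios but not the overall ones. They follow directly from the accumulated totals in the output above; the short calculation below only restates that arithmetic, and the variable names are chosen for illustration.

```python
# Accumulated totals copied from the output above
pdfrw_total = {"size_mb": 50.73, "seconds": 4.47}
pypdf_total = {"size_mb": 193.77, "seconds": 108.14}

time_ratio = round(pypdf_total["seconds"] / pdfrw_total["seconds"], 2)
size_ratio = round(pypdf_total["size_mb"] / pdfrw_total["size_mb"], 2)

print("pypdf_time / pdfrw_time = {} ratio".format(time_ratio))  # 24.19
print("pypdf_size / pdfrw_size = {} ratio".format(size_ratio))  # 3.82
```

In this run, with pdfrw 0.5.0 and pypdf 3.2.0, pypdf took roughly 24 times as long overall and produced roughly 3.8 times as much output.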
