klippa-app / go-pdfium
Easy to use PDF library using Go and PDFium
License: MIT License
Create a package POC that can be used by external Go programs.
I see that access to "structured" text (character by character, with various kinds of style and position metadata) is available via GetPageTextStructured.
Is something similar available for retrieving drawn paths (lines and rects, ideally with analogous style metadata like fill and stroke colors)?
Thanks!
Web assembly implementation no longer builds after upgrade to go-pdfium v1.12.0
Results in the following error:
# github.com/klippa-app/go-pdfium/webassembly
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:515:27: i.worker.Instance.FPDFAnnot_AddFileAttachment undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_AddFileAttachment)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:529:27: i.worker.Instance.FPDFAnnot_AddInkStroke undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_AddInkStroke)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:543:27: i.worker.Instance.FPDFAnnot_AppendAttachmentPoints undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_AppendAttachmentPoints)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:557:27: i.worker.Instance.FPDFAnnot_AppendObject undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_AppendObject)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:571:27: i.worker.Instance.FPDFAnnot_CountAttachmentPoints undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_CountAttachmentPoints)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:585:27: i.worker.Instance.FPDFAnnot_GetAP undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_GetAP)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:599:27: i.worker.Instance.FPDFAnnot_GetAttachmentPoints undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_GetAttachmentPoints)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:613:27: i.worker.Instance.FPDFAnnot_GetBorder undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_GetBorder)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:627:27: i.worker.Instance.FPDFAnnot_GetColor undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_GetColor)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:641:27: i.worker.Instance.FPDFAnnot_GetFileAttachment undefined (type *implementation_webassembly.PdfiumImplementation has no field or method FPDFAnnot_GetFileAttachment)
../../go/pkg/mod/github.com/klippa-app/go-pdfium@v1.12.0/webassembly/generated.go:641:27: too many errors
I've used the example from the readme documentation:
main.go
package renderer

import (
	"log"
	"time"

	"github.com/klippa-app/go-pdfium"
	"github.com/klippa-app/go-pdfium/webassembly"
)

// Be sure to close pools/instances when you're done with them.
var pool pdfium.Pool
var instance pdfium.Pdfium

func init() {
	// Declare err before its first use so both assignments below compile.
	var err error

	// Init the PDFium library and return the instance to open documents.
	// You can tweak these configs to your need. Be aware that workers can use quite some memory.
	pool, err = webassembly.Init(webassembly.Config{
		MinIdle:  1, // Makes sure that at least x workers are always available
		MaxIdle:  1, // Makes sure that at most x workers are ever available
		MaxTotal: 1, // Maximum amount of workers in total, allows the amount of workers to grow when needed, items between total max and idle max are automatically cleaned up, while idle workers are kept alive so they can be used directly.
	})
	if err != nil {
		log.Fatal(err)
	}

	instance, err = pool.GetInstance(time.Second * 30)
	if err != nil {
		log.Fatal(err)
	}
}
go.mod
module pdfium-webassembly-test
go 1.22
require github.com/klippa-app/go-pdfium v1.12.0
require (
github.com/google/uuid v1.6.0 // indirect
github.com/jolestar/go-commons-pool/v2 v2.1.2 // indirect
github.com/tetratelabs/wazero v1.7.1 // indirect
golang.org/x/net v0.24.0 // indirect
golang.org/x/text v0.14.0 // indirect
)
Set up a well-documented way of testing the package and implement CI that runs all tests before merging to development/master.
Validating errors by their message is unsafe and error-prone.
Standardized errors (i.e. errors of a certain type) are easier to use and easier to compare.
Implement a set of standardized errors, some of which are accessible externally and some only internally.
Nice to meet you, and thank you for making a good library.
What should we do if we want to use an alternative font when the font in the PDF is not found?
We are seeing characters go missing when generating a bitmap from a PDF because the font is not found.
0 is a valid value if the meta tag doesn't exist, but that shouldn't make GetMetaData completely error out.
I read your README and tried to run it, but it failed.
Can you supply a complete demo?
Thank you.
https://github.com/bblanchon/pdfium-binaries has a WebAssembly version.
Go is very capable of running WebAssembly. For example, wazero can run wasm with no cgo.
Why?
One pdfium for all targets (web, desktop, server, etc.)
No cgo.
Easy to debug using Chrome: https://blog.noops.land/debugging-webAssembly-from-go-sources-in-chrome-devtools
Anyone interested in exploring this architecture?
Hi folks!
I'm struggling to debug my app, which loads a document the ReadSeeker way. My document is stored on a web server and I use Range requests to read only the bytes that I need from the remote file. It works fine until I try to render a page in pixels or DPI.
Every read is fine while loading the document, getting the page count, and getting the first page number. When it comes to rendering the page to an image, the first Read(p []byte) calls are fine, but after a specific read on the document (a fairly large []byte, a few KB), pdfium triggers a segmentation violation.
I thought the go-pdfium go_read_seeker_cb function was faulty, but when I open the file locally and pass it to OpenDocument as a ReadSeeker, it works!
I think my ReadSeeker implementation is missing something related to memory management that I don't understand. I have a few ideas, such as the GC destroying the bytes before pdfium has a chance to use them, but I don't know...
Have you already experienced this kind of error? How do you debug it?
Thanks a lot!
SIGSEGV: segmentation violation
PC=0x7ff9720b78d7 m=4 sigcode=128
signal arrived during cgo execution
goroutine 50 [syscall]:
runtime.cgocall(0xd2bc20, 0xc00001bc18)
/home/yann/.gvm/gos/go1.21.4/src/runtime/cgocall.go:157 +0x4b fp=0xc00001bbf0 sp=0xc00001bbb8 pc=0x41348b
github.com/klippa-app/go-pdfium/internal/implementation_cgo._Cfunc_FPDF_LoadPage(0x7ff920000d00, 0x0)
_cgo_gotypes.go:3598 +0x4c fp=0xc00001bc18 sp=0xc00001bbf0 pc=0xca7fac
github.com/klippa-app/go-pdfium/internal/implementation_cgo.(*PdfiumImplementation).loadPage.func1(0x100f820?, {0xc000124918?, 0x0?})
/home/yann/GO/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_cgo/page.go:48 +0x5c fp=0xc00001bc58 sp=0xc00001bc18 pc=0xce647c
github.com/klippa-app/go-pdfium/internal/implementation_cgo.(*PdfiumImplementation).loadPage(0xc0005c8100, {0xc000124918?, 0x0?})
/home/yann/GO/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_cgo/page.go:48 +0x245 fp=0xc00001bca0 sp=0xc00001bc58 pc=0xce6345
github.com/klippa-app/go-pdfium/internal/implementation_cgo.(*PdfiumImplementation).getPageSize(0xc00001bd88?, {0xc000124918?, 0x0?})
/home/yann/GO/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_cgo/render.go:31 +0x1d fp=0xc00001bcd8 sp=0xc00001bca0 pc=0xce6bfd
github.com/klippa-app/go-pdfium/internal/implementation_cgo.(*PdfiumImplementation).calculateRenderImageSize(0x0?, {0xc000124918?, 0x0?}, 0x2e2, 0x420)
/home/yann/GO/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_cgo/render.go:173 +0x2b fp=0xc00001bd28 sp=0xc00001bcd8 pc=0xce7cab
github.com/klippa-app/go-pdfium/internal/implementation_cgo.(*PdfiumImplementation).RenderPageInPixels(0xc0005c8100, 0xc00022a390)
/home/yann/GO/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_cgo/render.go:219 +0x13a fp=0xc00001bdf0 sp=0xc00001bd28 pc=0xce7f5a
github.com/klippa-app/go-pdfium/single_threaded.(*pdfiumInstance).RenderPageInPixels(0x150d560?, 0xc00011a020?)
/home/yann/GO/pkg/mod/github.com/klippa-app/[email protected]/single_threaded/generated.go:6129 +0xb4 fp=0xc00001be48 sp=0xc00001bdf0 pc=0xd23454
github.com/org/repo/internal/pdf_test.TestPDFCPUPerf(0xc0000e6680)
/home/yann/00_Projects/repo/internal/pdf/pdf_public_test.go:248 +0x58e fp=0xc00001bf70 sp=0xc00001be48 pc=0xd2726e
testing.tRunner(0xc0000e6680, 0x11958c8)
/home/yann/.gvm/gos/go1.21.4/src/testing/testing.go:1595 +0xff fp=0xc00001bfc0 sp=0xc00001bf70 pc=0x5337ff
testing.(*T).Run.func1()
/home/yann/.gvm/gos/go1.21.4/src/testing/testing.go:1648 +0x25 fp=0xc00001bfe0 sp=0xc00001bfc0 pc=0x534785
runtime.goexit()
/home/yann/.gvm/gos/go1.21.4/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00001bfe8 sp=0xc00001bfe0 pc=0x47ce41
created by testing.(*T).Run in goroutine 1
/home/yann/.gvm/gos/go1.21.4/src/testing/testing.go:1648 +0x3ad
Write tests in a file called [file name here]_test.go.
These tests should validate that the internal functionality of pdfium returns the correct values for known correct inputs (and errors on known incorrect inputs).
Write tests in a file called [file name here]_test.go.
These tests should validate that the outbound functionality of pdfium calls the correct internal functions with the correct values.
The internal functions themselves should not be validated; however, the return values should be logical and true to life (use mocking to replace the internal functions).
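A minimal sketch of the mocking approach, assuming the outbound layer reaches the internal layer through an interface. The names renderer, Service, RenderFirstPage and fakeRenderer are hypothetical; in a real _test.go file the checks in main would be assertions on a *testing.T.

```go
package main

import "fmt"

// renderer is a hypothetical interface for the internal layer.
type renderer interface {
	RenderPage(page, dpi int) ([]byte, error)
}

// Service is a hypothetical outbound type that delegates to the internal layer.
type Service struct {
	internal renderer
}

// RenderFirstPage is the outbound call under test.
func (s *Service) RenderFirstPage(dpi int) ([]byte, error) {
	return s.internal.RenderPage(0, dpi)
}

// fakeRenderer replaces the internal layer: it records the arguments it
// receives and returns a canned, realistic-looking result.
type fakeRenderer struct {
	gotPage, gotDPI int
	result          []byte
}

func (f *fakeRenderer) RenderPage(page, dpi int) ([]byte, error) {
	f.gotPage, f.gotDPI = page, dpi
	return f.result, nil
}

func main() {
	fake := &fakeRenderer{result: []byte{0x89, 'P', 'N', 'G'}}
	svc := &Service{internal: fake}

	out, err := svc.RenderFirstPage(200)
	if err != nil {
		panic(err)
	}

	// The test verifies the arguments passed inward, not the rendering itself.
	fmt.Println("page:", fake.gotPage, "dpi:", fake.gotDPI, "bytes:", len(out))
}
```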
It should be possible to set the number of workers in some sort of config.
Currently only a single worker is supported; a more flexible way of setting the number of workers is required.
Research and set up a guide for which code lives where.
Restructure the current code according to the guide.
Try to add the current code as a package in an existing package.
Currently a lot of functions have to be created whenever a new method is added to the actual pdfium implementation.
When a new method is added to the Pdfium struct in pdfium/internal/subprocess, we also have to add the following to make it work:
interface Document and struct pdfiumDocument in pdfium/pdfium.go
interface Pdfium and struct PdfiumRPCServer in pdfium/internal/commons/pdfium-plugin.go
Since the refactors, a lot of the methods in the actual implementation work the same:
There is still some stuff that doesn't work the same, like Ping, Close and OpenDocument; those need to be changed first.
My proposal would be to generate the implementations in pdfium/pdfium.go and pdfium/internal/commons/pdfium-plugin.go, since they just contain boilerplate code to talk between the main process and the subprocess.
All implementations can be the same for every boilerplate method, so it should be quite simple to generate implementations based on the exposed methods on the Pdfium struct in pdfium/internal/subprocess. Something like this could be one of the templates:
func (g *PdfiumRPC) {{ .MethodName }}(request *requests.{{ .MethodName }}) (*responses.{{ .MethodName }}, error) {
	resp := &responses.{{ .MethodName }}{}
	err := g.client.Call("Plugin.{{ .MethodName }}", request, resp)
	if err != nil {
		return nil, err
	}
	return resp, nil
}
How do I choose the right parameters for multithreading, and what improvement can I expect from it?
MinIdle: 2, // Makes sure that at least x workers are always available
MaxIdle: 4, // Makes sure that at most x workers are ever available
MaxTotal: 5,
I understand what they mean now, but I don't understand how to choose them correctly for optimal use.
Hello.
I am trying to create a new PDF document and create a one page PDF with one image.
I have created a page in the wasm version of pdfium and inserted an image, but it just shows a blank page and no image.
The procedure is as follows
I am trying to do the above, but it is not working properly, so I would appreciate some sample code.
Using file readers in OpenDocument is not thread safe.
It causes a concurrent read and write to a map.
This is the map that is being accessed from multiple places at once:
var FileReaders = map[uint32]*FileReaderRef{}
Expected behavior:
OpenDocument working with a Reader as input in a multithreaded WASM configuration.
Current workaround:
Read the file into memory and pass it as a byte array.
Possible solution:
A sync.Mutex for accessing this map, or the usage of sync.Map.
sync.Mutex is probably the best option here, because it keeps the type safety.
Error logs:
fatal error: concurrent map read and map write
goroutine 77 [running]:
github.com/klippa-app/go-pdfium/webassembly/imports.FPDF_FILEACCESS_CB.Call({}, {0x2000?, 0xc00d336b10?}, {0x100f71bd0?, 0xc000176a20?}, {0xc00bcd74c0, 0x176a01?, 0x101a5caa0?})
/Users/user/go/pkg/mod/github.com/klippa-app/[email protected]/webassembly/imports/callbacks.go:32 +0xc7
github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).execWasmFunction(0xc00bcce500, {0x100f6c788?, 0xc000136748?}, 0xc00bcce500?)
/Users/user/go/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:1007 +0x1b2
github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).call(0xc00bcce500, {0x100f6c788, 0xc000136748}, {0xc000abea20?, 0x10?, 0xc000101b00?}, {0x0, 0x0, 0x0})
/Users/user/go/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:758 +0x2c5
github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).Call(0xc000176a20?, {0x100f6c788?, 0xc000136748?}, {0xc000abea20?, 0x10?, 0x101a5c708?})
/Users/user/go/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:712 +0xd9
github.com/klippa-app/go-pdfium/internal/implementation_webassembly.(*PdfiumImplementation).OpenDocument(0xc00574c240, 0xc003998690)
/Users/user/go/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_webassembly/implementation.go:255 +0x753
github.com/klippa-app/go-pdfium/webassembly.(*pdfiumInstance).OpenDocument(0xc00048d9e0?, 0x0?)
/Users/user/go/pkg/mod/github.com/klippa-app/[email protected]/webassembly/generated.go:6031 +0xca
fatal error: concurrent map writes
goroutine 29 [running]:
github.com/klippa-app/go-pdfium/internal/implementation_webassembly.(*PdfiumImplementation).CreateFileAccessReader(0xc0011e3560, 0x19ccba, {0x100f69370?, 0xc00d29c000})
/Users/user/go/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_webassembly/data.go:51 +0x210
github.com/klippa-app/go-pdfium/internal/implementation_webassembly.(*PdfiumImplementation).OpenDocument(0xc0011e3560, 0xc00d1d09f0)
/Users/user/go/pkg/mod/github.com/klippa-app/[email protected]/internal/implementation_webassembly/implementation.go:248 +0x696
github.com/klippa-app/go-pdfium/webassembly.(*pdfiumInstance).OpenDocument(0xc00042db00?, 0x0?)
/Users/user/go/pkg/mod/github.com/klippa-app/[email protected]/webassembly/generated.go:6031 +0xca
I am using this and figured it might be of interest since it has shared goals.
https://github.com/benoitkugler/pdf
It's not like other Go PDF packages because it's a low-level PDF parser and builder.
https://github.com/go-text/typesetting is a 100% Go replacement for HarfBuzz. It's heavily used by Gio and Fyne, but can be used with the PDF parsing too due to its ability to handle fonts and their myriad complexities.
move the source code from klippa to this package
Hi,
I tested and found that the GetPageTextStructured function returns results with repeated areas.
For example, the following:
Here is a specific error:
{
"left": 36.88567352294922,
"top": 215.13877868652344,
"right": 87.23951721191406,
"bottom": 207.6760711669922,
"text": "hibernators ("
},
{
"left": 87.57408905029297,
"top": 214.72274780273438,
"right": 92.2285385131836,
"bottom": 208.967529296875,
"text": "5)"
},
{
"left": 92.16873168945312,
"top": 215.13877868652344,
"right": 96.52155303955078,
"bottom": 207.6760711669922,
"text": "5)."
},
Create a new release whenever the C library publishes a release.
Using a CRON action (running every 24 hours?), generate a release with the new C release if it passes all tests.
I'm using macOS and I'm trying to avoid installing pdfium globally, but I'm facing the following problem:
dyld[23643]: Library not loaded: ./libpdfium.dylib
Referenced from: <5A04DF34-A570-3BAD-864E-E1F0BF481CD4> /private/var/folders/c0/klrf3qzn37gd3zfgqsfnfxjc0000gr/T/go-build3974641135/b001/exe/main
Reason: tried: './libpdfium.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS./libpdfium.dylib' (no such file), './libpdfium.dylib' (no such file), '/usr/local/lib/libpdfium.dylib' (no such file), '/usr/lib/libpdfium.dylib' (no such file, not in dyld cache), '/Users/diego/code/service/cmd/worker/libpdfium.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/diego/code/service/cmd/worker/libpdfium.dylib' (no such file), '/Users/diego/code/service/cmd/worker/libpdfium.dylib' (no such file), '/usr/local/lib/libpdfium.dylib' (no such file), '/usr/lib/libpdfium.dylib' (no such file, not in dyld cache)
signal: abort trap
I'm executing the command like this:
export DYLD_LIBRARY_PATH=/Users/diego/code/service/misc/pdfium/darwin-arm64/lib
export LD_LIBRARY_PATH=/Users/diego/code/service/misc/pdfium/darwin-arm64/lib
export PKG_CONFIG_PATH=/Users/diego/code/service/misc/pdfium/darwin-arm64
go run main.go
I tried LD_LIBRARY_PATH and DYLD_LIBRARY_PATH; both failed. If I copy the library into the same folder as the package main, it works, but I'm really trying to avoid that because other developers are using Linux. Any idea how to fix this problem?
Android fails to use webassembly: instance, err = pool.GetInstance(time.Second * 30) fails.
Hello,
I just want to ask whether there is any plan to include XFA/V8 functionality, as the original issue mentioned in the README was fixed some years ago:
bblanchon/pdfium-binaries#62
Regards
When we return font information for rectangles returned by pdfium (added in #22), we have to look up the first char of the rectangle to get the font information. The lookup of the first char uses a tolerance to find the char (and we need this tolerance to find it for every rectangle). Currently the tolerance is set to 5 (points); we have to research whether that is a good value. Perhaps we could also start lower and raise the tolerance until we find something.
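The "start lower and raise the tolerance" idea could be sketched as follows. The lookupFunc type and fakeLookup are hypothetical stand-ins for the real character lookup, and the start, step and maximum values are arbitrary examples rather than recommendations.

```go
package main

import (
	"fmt"
	"math"
)

// lookupFunc is a hypothetical stand-in for a character lookup at a
// position with a tolerance; it returns the char index, or -1 if no
// character was found.
type lookupFunc func(x, y, tolerance float64) int

// findCharWithGrowingTolerance tries increasing tolerances, beginning
// at start and growing by step, until a character is found or max is
// exceeded. It also reports which tolerance was needed.
func findCharWithGrowingTolerance(lookup lookupFunc, x, y, start, step, max float64) (index int, used float64, ok bool) {
	for tol := start; tol <= max; tol += step {
		if idx := lookup(x, y, tol); idx >= 0 {
			return idx, tol, true
		}
	}
	return -1, 0, false
}

// fakeLookup pretends there is a single character (index 42) at
// position (10, 10).
func fakeLookup(x, y, tol float64) int {
	if math.Abs(x-10) <= tol && math.Abs(y-10) <= tol {
		return 42
	}
	return -1
}

func main() {
	// The query point is 2.5 points away on the x axis.
	idx, tol, ok := findCharWithGrowingTolerance(fakeLookup, 12.5, 10, 1, 1, 5)
	fmt.Println(idx, tol, ok)
}
```

Logging the tolerance that was actually needed across a set of real documents would also give data for deciding whether 5 points is a sensible upper bound.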
Setup
The setup always runs cp /opt/pdfium/lib/libpdfium.so /usr/lib/libpdfium.so,
but on macOS the library is named libpdfium.dylib instead of libpdfium.so.
For some documents combined with a specific page width, the RenderPageInPixels function results in a "module closed with exit_code(1)" error.
I have tried setting WithDebugInfoEnabled(true) on the wazero.NewRuntimeConfig() with a debug build of pdfium.wasm, but I am unable to get any specific error details.
Attached is a sample document in which the error can be reproduced.
Width: 2600px error
Width: 2900px works
Test code:
package main

import (
	"math"
	"os"
	"time"

	"github.com/klippa-app/go-pdfium/requests"
	"github.com/klippa-app/go-pdfium/webassembly"
)

func main() {
	pool, err := webassembly.Init(webassembly.Config{
		MinIdle:  1,
		MaxIdle:  1,
		MaxTotal: 1,
	})
	// Check the Init error before it is overwritten by the next call.
	checkError(err)

	pdfBytes, err := os.ReadFile("test.pdf")
	checkError(err)

	instance, err := pool.GetInstance(time.Second * 5)
	checkError(err)
	defer instance.Close()

	doc, err := instance.OpenDocument(&requests.OpenDocument{
		File: &pdfBytes,
	})
	checkError(err)
	defer instance.FPDF_CloseDocument(&requests.FPDF_CloseDocument{
		Document: doc.Document,
	})

	pageRequest := requests.Page{
		ByIndex: &requests.PageByIndex{
			Document: doc.Document,
			Index:    0,
		},
	}

	pageWidth, err := instance.FPDF_GetPageWidth(&requests.FPDF_GetPageWidth{Page: pageRequest})
	checkError(err)
	pageHeight, err := instance.FPDF_GetPageHeight(&requests.FPDF_GetPageHeight{Page: pageRequest})
	checkError(err)

	// Scale the height to keep the page's aspect ratio at the target width.
	width := 2600
	ratio := float64(width) / pageWidth.Width
	height := int(math.Floor(pageHeight.Height * ratio))

	pageRender, err := instance.RenderPageInPixels(&requests.RenderPageInPixels{
		Page:        pageRequest,
		Width:       width,
		Height:      height,
		RenderFlags: 0,
	})
	checkError(err)
	pageRender.Cleanup()
}

func checkError(err error) {
	if err != nil {
		panic(err)
	}
}
Compiling on 32-bit Windows 7 fails with the following errors:
fpdf_formfill.go:366:2: size declared and not used
fpdf_formfill.go:367:15: array length 1 << 50 - 1 (untyped int constant 1125899906842623) must be integer
fpdf_save.go:35:13: array length 1 << 50 - 1 (untyped int constant 1125899906842623) must be integer
fpdfview.go:581:13: array length 1 << 50 - 1 (untyped int constant 1125899906842623) must be integer
fpdfview.go:573:2: size declared and not used
implementation.go:59:15: array length 1 << 50 - 1 (untyped int constant 1125899906842623) must be integer
render.go:364:15: cannot use 0xFFFFFFFF (untyped int constant 4294967295) as int value in assignment (overflows)
Add an easy way to update the C package both in the image and locally.
Add tests that call the package as it would be called in real-life scenarios.
The package should have all the functionality it has in normal use.