openlabs / docker-wkhtmltopdf-aas

wkhtmltopdf in a docker container as a web service.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

docker-wkhtmltopdf-aas's Introduction

docker-wkhtmltopdf-aas

wkhtmltopdf in a docker container as a web service.

This image is based on the wkhtmltopdf container.

Running the service

Run the container with docker run, binding its ports to the host. The web service is exposed on port 80 in the container.

docker run -d -P openlabs/docker-wkhtmltopdf-aas

The container now runs as a daemon.

Find the port that the container is bound to:

docker port 071599a1373e 80

where 071599a1373e is the container ID that docker assigned when docker run was executed in the previous command.

Note the public port number that docker binds to.

Using the webservice

There are multiple ways to generate a PDF from HTML using the service.

Uploading an HTML file

This is a convenient way to use the service from command line utilities like curl.

curl -X POST -vv -F 'file=@path/to/local/file.html' http://<docker-host>:<port>/ -o path/to/output/file.pdf

where:

docker-host is the hostname or address of the docker host running the container
port is the public port to which the container is bound.

JSON API

If you are planning to use this service from your application, it might be more convenient to use the JSON API that the service provides.

Here is an example using Python requests (written for Python 2, since str.encode('base64') does not exist in Python 3):

import json
import requests

url = 'http://<docker_host>:<port>/'
data = {
    'contents': open('/file/to/convert.html').read().encode('base64'),
}
headers = {
    'Content-Type': 'application/json',    # This is important
}
response = requests.post(url, data=json.dumps(data), headers=headers)

# Save the response contents to a file
with open('/path/to/local/file.pdf', 'wb') as f:
    f.write(response.content)

Here is another example in Python, this time passing options to wkhtmltopdf. When passing settings, omit the double dash "--" at the start of each option name. For documentation on the available options, visit http://wkhtmltopdf.org/usage/wkhtmltopdf.txt

import json
import requests

url = 'http://<docker_host>:<port>/'
data = {
    'contents': open('/file/to/convert.html').read().encode('base64'),
    'options': {
        # Omitting the "--" at the start of the option
        'margin-top': '6',
        'margin-left': '6',
        'margin-right': '6',
        'margin-bottom': '6',
        'page-width': '105mm',
        'page-height': '40mm'
    }
}
headers = {
    'Content-Type': 'application/json',    # This is important
}
response = requests.post(url, data=json.dumps(data), headers=headers)

# Save the response contents to a file
with open('/path/to/local/file.pdf', 'wb') as f:
    f.write(response.content)
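
To illustrate the option-naming convention above, here is a minimal sketch of how an options dictionary could be turned into wkhtmltopdf command-line arguments. The helper name options_to_args is hypothetical, and the service's own app.py may build its command differently.

```python
def options_to_args(options):
    """Turn {'margin-top': '6'} into ['--margin-top', '6'] (hypothetical helper)."""
    args = []
    for name, value in options.items():
        args.append('--' + name)   # re-add the "--" that the client omitted
        args.append(str(value))
    return args


print(options_to_args({'margin-top': '6', 'page-width': '105mm'}))
```

This is why the keys in the request use the bare option names: the leading dashes are an argument-building detail rather than part of the name.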

TODO

Implement conversion of URLs to PDF
Add documentation on passing options to the service
Add curl example for JSON api
Explain more gunicorn options

Bugs and questions

The development of the container takes place on GitHub. If you have a question or a bug report, you can file it as a GitHub issue.

Authors and Contributors

This image was built at Openlabs.

Professional Support

This image is professionally supported by Openlabs. If you are looking for on-site teaching or consulting support, contact our sales and support teams.

docker-wkhtmltopdf-aas's Issues

Added _ping endpoint for health check

We need a health check so that, for example, these containers can work with Amazon ALB.

I added my work in #24.
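
As an illustration only (this is not the code from #24), a health-check route can be added in front of any WSGI application with a small wrapper. The name ping_middleware and the response body are assumptions made for this sketch.

```python
def ping_middleware(app):
    """Wrap a WSGI app so that any request to /_ping answers 200 directly."""
    def wrapped(environ, start_response):
        if environ.get('PATH_INFO') == '/_ping':
            body = b'pong'
            start_response('200 OK', [
                ('Content-Type', 'text/plain'),
                ('Content-Length', str(len(body))),
            ])
            return [body]
        # Every other path is handled by the wrapped application.
        return app(environ, start_response)
    return wrapped
```

A load balancer can then poll /_ping without triggering a PDF conversion.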

timeout at 30 seconds

On larger PDFs, we receive [CRITICAL] WORKER TIMEOUT (pid:762) at 30 seconds. The pid changes each time, of course. If I remove some of the HTML, the build time drops below 30 seconds and it works. So I believe this is a genuine timeout caused by the larger amount of HTML.

Research indicates that this is a gunicorn timeout. Explaining how to set gunicorn options is on the list of to-do's at https://github.com/openlabs/docker-wkhtmltopdf-aas. Can you explain how to set the gunicorn timeout? Or do you have any other suggestions?

Thanks,
-Mikey
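
gunicorn's worker timeout defaults to 30 seconds and is controlled by its timeout setting (the --timeout command-line flag, or timeout in a config file). How this image starts gunicorn is not shown here, so the file below is only a sketch: the file name and values are assumptions, and the container's start command would have to load the file.

```python
# gunicorn.conf.py (hypothetical file; load it with `gunicorn -c gunicorn.conf.py ...`)

# Seconds a worker may stay silent before the master kills and restarts it.
# gunicorn's default is 30, which matches the WORKER TIMEOUT described above.
timeout = 120

# Number of worker processes; the value here is only an example.
workers = 2
```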

OpenLabs link to adult site.

The OpenLabs link at the bottom redirects to a porn site. Please fix.

Doesn't appear to work at all

When uploading test2.html, a file with the HTML contents at https://gist.github.com/mdedetrich/6dd23c5848922e5686dc, I get the following error:

[2015-09-10 07:14:20 +0000] [12] [ERROR] Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/sync.py", line 130, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/sync.py", line 171, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python2.7/dist-packages/werkzeug/wrappers.py", line 290, in application
    return f(*args[:-2] + (request,))(*args[-2:])
  File "/app.py", line 48, in application
    if options:
UnboundLocalError: local variable 'options' referenced before assignment

And when trying to post, I get the following error

ValueError: Invalid control character at: line 1 column 78 (char 77)
[2015-09-10 07:12:13 +0000] [12] [ERROR] Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/sync.py", line 130, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/sync.py", line 171, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python2.7/dist-packages/werkzeug/wrappers.py", line 290, in application
    return f(*args[:-2] + (request,))(*args[-2:])
  File "/app.py", line 33, in application
    payload = json.loads(request.data)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)

In general, I can't get it to work at all, even with different HTML data.
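
app.py is not reproduced here, but both tracebacks are consistent with a handler that assigns options on only one code path and passes the request body straight to json.loads. The sketch below shows one defensive shape for that parsing. The function parse_request and its arguments are hypothetical, not the service's actual code.

```python
import base64
import json


def parse_request(content_type, body):
    """Return (html_bytes, options) from a request body.

    Hypothetical helper: `options` always gets a value, so a later
    `if options:` cannot raise UnboundLocalError.
    """
    options = {}                      # default for every code path
    if content_type != 'application/json':
        return body, options          # treat anything else as raw HTML
    try:
        payload = json.loads(body)
    except ValueError as exc:         # malformed JSON, e.g. raw control characters
        raise ValueError('request body is not valid JSON: %s' % exc)
    html = base64.b64decode(payload['contents'])
    options = payload.get('options') or {}
    return html, options
```

On the client side, the "Invalid control character" error usually means the JSON string contains unescaped newlines or tabs; building the body with a JSON encoder rather than by hand avoids that.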

Please add a version tag in dockerhub

This is a brilliant project. Saved me a bunch of time implementing my own version.

It would be great to be able to fix the version in stone when using in my project, just so I can be sure it will continue to work until I'm ready to update it. For this reason would you be able to add version number tags to this and the project it relies on? At the moment I am checking out the repo and building myself.

Really appreciate your work on this project.

Rob

PHP example

Thought it might be useful for someone to see a PHP example.

class PdfGenerator {

private $serviceUrl;

private $servicePort;

// ...

private function getHtml()
{
  // ... generates the HTML to convert to PDF
}

private function generatePdf()
{
        $data = array(
            'contents' => base64_encode($this->getHtml()),
            'options' => array(
               // ...
            ),
        );
        $dataString = json_encode($data);
        $headers = array(
            'Content-Type: application/json',
            'Content-Length: ' . strlen($dataString),
        );

        $ch = curl_init();

        curl_setopt($ch, CURLOPT_URL, $this->serviceUrl);
        curl_setopt($ch, CURLOPT_PORT, $this->servicePort);
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
        curl_setopt($ch, CURLOPT_POSTFIELDS, $dataString);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

        $result = curl_exec($ch);
        curl_close($ch);

        return $result;
}
}

Invalid PDF due to "WORKER TIMEOUT"

When rendering a large PDF (containing about 40 images, each about 500KB), most of the time an invalid file is produced which cannot be opened by the PDF viewer. Apparently this is due to an internal timeout (see log output below). Is there a way to increase this timeout?

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
[2016-06-29 16:13:48 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:107)
[2016-06-29 16:13:48 +0000] [107] [INFO] Worker exiting (pid: 107)
[2016-06-29 16:13:48 +0000] [123] [INFO] Booting worker with pid: 123

Tested for concurrency?

In the past, running multiple copies of wkhtmltopdf concurrently had issues: there were threading problems, and some named pipes were at the same fs path across multiple wkhtmltopdf processes. (I found out after an ugly incident involving customers getting others' PDFs in a parallelized batch run.)

wkhtmltopdf seems to have had many version bumps since then, but nothing I've read from the commits screams out that this issue has been fixed.

In openlabs/docker-wkhtmltopdf-aas the gunicorn WSGI daemon seems to fork on request, so if the concurrency issue still exists in wkhtmltopdf then this service exports the problem to the service's users.

In my use case, I needed to substitute the version of wkhtmltopdf shipped with openlabs/docker-wkhtmltopdf-aas with a statically linked copy of wkhtmltopdf 0.10.0rc2, because the PDF output from identical HTML had changed over the years due to webkit HTML rendering fixes. (I have legacy HTML that would be a massive PITA to change.)

As I know at least my version of wkhtmltopdf (0.10.0rc2) has concurrency issues, I'm treating docker as an isolation mechanism rather than simply a deployment helper. I have 20 identical containers running with a home-made HTTP load-balancing proxy sitting in front of them. It hands off (unmodified) requests to available containers and makes subsequent requests wait until workers become available (by simply blocking on the HTTP response).
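
The hand-off described above could be sketched as follows. This is not the proxy from this report: BackendPool and the addresses are hypothetical, and the sketch only shows the "one request per container, others wait" bookkeeping, not the HTTP forwarding.

```python
import queue


class BackendPool:
    """Hand out backend URLs one at a time; callers block until one is free."""

    def __init__(self, backends):
        self._free = queue.Queue()
        for url in backends:
            self._free.put(url)

    def acquire(self, timeout=None):
        # Blocks while every container is busy; raises queue.Empty on timeout.
        return self._free.get(timeout=timeout)

    def release(self, url):
        self._free.put(url)


pool = BackendPool(['http://10.0.0.1:80/', 'http://10.0.0.2:80/'])
backend = pool.acquire()
try:
    pass  # forward the unmodified request to `backend` here
finally:
    pool.release(backend)
```

Because each container only ever receives one request at a time, any cross-process interference inside wkhtmltopdf stays confined to a single conversion.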

wkhtmltopdf.sh: No such file or directory

I get a "no such file" error when a request is being processed:

bash: /usr/local/bin/wkhtmltopdf.sh: No such file or directory
2014-09-02 06:30:14 [13] [ERROR] Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/sync.py", line 93, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/sync.py", line 134, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python2.7/dist-packages/werkzeug/wrappers.py", line 286, in application
    return f(*args[:-2] + (request,))(*args[-2:])
  File "/app.py", line 57, in application
    execute(' '.join(args))
  File "/usr/local/lib/python2.7/dist-packages/executor/__init__.py", line 92, in execute
    raise ExternalCommandFailed(msg % (shell.returncode, command))
ExternalCommandFailed: External command failed with exit code 127! (command: /usr/local/bin/wkhtmltopdf.sh /tmp/tmpnTO3Fm.html /tmp/tmpnTO3Fm.html.pdf)

how do I put fonts in?

This is a cool project! But everything comes out as Helvetica. Is there any easy way to install more fonts in the container?

License file missing

Can you please add the full license text to this repository? app.py says it is licensed under "BSD" and refers to "LICENSE" for details, but that file does not exist.

Link to a fork that is active

Thanks for creating this image, it's useful.

But it's getting very outdated, and it seems you might be too busy to maintain it.

So maybe you could link to a fork that's more actively maintained in the readme?

This one seems most popular, but really any active one is good.

Temp file not flushed before read by wkhtmltopdf

The temp file used to store the posted source HTML is not flushed before being read by the wkhtmltopdf process. This results in an incomplete view of the uploaded HTML file being rendered. In the case of small files it always results in an empty PDF being generated.

For example, the following command will unexpectedly produce an empty PDF document (assuming app listening on port 8090):

$ echo '<html><body><h1>Hello World!</h1></body></html>' \
  | curl -X POST -F file=@- http://localhost:8090/ > output.pdf
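
The general fix for this class of bug is to flush the temp file before another process opens it by name. The sketch below assumes a POSIX system and uses a hypothetical helper, write_html_for_converter; it is not the service's actual code.

```python
import tempfile


def write_html_for_converter(html_bytes):
    """Write HTML to a named temp file and make it visible to other processes."""
    source = tempfile.NamedTemporaryFile(suffix='.html')
    source.write(html_bytes)
    source.flush()   # without this, small payloads may still sit in Python's buffer
    return source    # keep the object alive; the file is deleted when it is closed


source = write_html_for_converter(b'<html><body><h1>Hello World!</h1></body></html>')
# A separate reader (standing in for wkhtmltopdf) now sees the full content.
with open(source.name, 'rb') as reader:
    print(len(reader.read()))
source.close()
```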

pages

Hi! I love your work. I am trying to make page numbers alternately appear at the bottom left for odd pages and bottom right for even pages. Is there a way to do that?
What I have so far is this...

wkhtmltopdf_path = 'C:/Program Files/wkhtmltopdf/bin/wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
options = {
    'footer-center': '~ [page] of [topage] ~',

JSON example does not work in Python3 due to encoding differences

In Python 3, encoding to base64 is done with base64.b64encode, which returns a bytes object. json.dumps cannot serialize bytes, so the example produces an error.

Attempting to read the file as utf-8 doesn't produce an error, but the resulting PDF is garbled because I assume wkhtmltopdf is expecting a base64 encoded HTML string? Also giving it the --encoding utf-8 option still produces a garbled PDF.

Basically I can't figure out how to get the JSON API working in Python3.
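
One approach that should work in Python 3, assuming the service expects the contents field to hold base64-encoded HTML as the README example implies, is to decode the base64 bytes to an ASCII string before calling json.dumps. The helper name build_payload is hypothetical.

```python
import base64
import json


def build_payload(html_text, options=None):
    """Return the JSON request body as a str, ready for requests.post(data=...)."""
    encoded = base64.b64encode(html_text.encode('utf-8'))  # bytes in Python 3
    data = {'contents': encoded.decode('ascii')}           # str, so json.dumps accepts it
    if options:
        data['options'] = options
    return json.dumps(data)


body = build_payload('<html><body><h1>Hello</h1></body></html>')
```

The resulting body can be posted with the Content-Type: application/json header, exactly as in the README example.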