Down is a utility tool for streaming, flexible and safe downloading of remote
files. It can use open-uri + Net::HTTP
or HTTP.rb as the backend HTTP
library.
gem "down"
The primary method is Down.download
, which downloads the remote file into a
Tempfile:
require "down"
tempfile = Down.download("http://example.com/nature.jpg")
tempfile #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20150925-55456-z7vxqz.jpg>
The returned Tempfile has #content_type
and #original_filename
attributes
determined from the response headers:
tempfile.content_type #=> "image/jpeg"
tempfile.original_filename #=> "nature.jpg"
When you're accepting URLs from an outside source, it's a good idea to limit
the filesize (because attackers want to give a lot of work to your servers).
Down allows you to pass a :max_size
option:
Down.download("http://example.com/image.jpg", max_size: 5 * 1024 * 1024) # 5 MB
# raises Down::TooLarge
What is the advantage over simply checking size after downloading? Well, Down
terminates the download very early, as soon as it gets the Content-Length
header. And if the Content-Length
header is missing, Down will terminate the
download as soon as the downloaded content surpasses the maximum size.
Down.download
and Down.open
will automatically detect and apply HTTP basic
authentication from the URL:
Down.download("http://user:[email protected]")
Down.open("http://user:[email protected]")
There are a lot of ways in which a download can fail:
- Response status was 4xx or 5xx
- Domain was not found
- Timeout occurred
- URL is invalid
- ...
Down attempts to unify all of these exceptions into one Down::NotFound
error
(because this is what actually happened from the outside perspective). If you
want to retrieve the original error raised, in Ruby 2.1+ you can use
Exception#cause
:
begin
Down.download("http://example.com")
rescue Down::Error => exception
exception.cause #=> #<Timeout::Error>
end
Down has the ability to retrieve content of the remote file as it is being
downloaded. The Down.open
method returns a Down::ChunkedIO
object which
represents the remote file on the given URL. When you read from it, Down
internally downloads chunks of the remote file, but only how much is needed.
remote_file = Down.open("http://example.com/image.jpg")
remote_file.size # read from the "Content-Length" header
remote_file.read(1024) # downloads and returns first 1 KB
remote_file.read(1024) # downloads and returns next 1 KB
remote_file.eof? #=> false
remote_file.read # downloads and returns the rest of the file content
remote_file.eof? #=> true
remote_file.close # closes the HTTP connection and deletes the internal Tempfile
By default the downloaded content is internally cached into a Tempfile
, so
that when you rewind the Down::ChunkedIO
, it continues reading the cached
content that it had already retrieved.
remote_file = Down.open("http://example.com/image.jpg")
remote_file.read(1*1024*1024) # downloads, caches, and returns first 1MB
remote_file.rewind
remote_file.read(1*1024*1024) # reads the cached content
remote_file.read(1*1024*1024) # downloads the next 1MB
If you want to save on IO calls and on disk usage, and don't need to be able to
rewind the Down::ChunkedIO
, you can disable caching downloaded content:
Down.open("http://example.com/image.jpg", rewindable: false)
You can also yield chunks directly as they're downloaded via #each_chunk
, in
which case the downloaded content is not cached into a file regardless of the
:rewindable
option.
remote_file = Down.open("http://example.com/image.jpg")
remote_file.each_chunk { |chunk| ... }
remote_file.close
You can access the response status and headers of the HTTP request that was made:
remote_file = Down.open("http://example.com/image.jpg")
remote_file.data[:status] #=> 200
remote_file.data[:headers] #=> { ... }
remote_file.data[:response] # returns the response object
Note that Down::NotFound
error will automatically be raised if response
status was 4xx or 5xx.
The Down.open
method uses Down::ChunkedIO
internally. However,
Down::ChunkedIO
is designed to be generic, it can wrap any kind of streaming.
Down::ChunkedIO.new(...)
:chunks
– anEnumerator
which retrieves chunks:size
– size of the file if it's known (returned by#size
):on_close
– called when streaming finishes or IO is closed:data
- custom data that you want to store (returned by#data
):rewindable
- whether to cache retrieved data into a file (defaults totrue
):encoding
- force content to be returned in specified encoding (defaults to ASCII-8BIT)
Here is an example of wrapping streaming MongoDB files:
require "down/chunked_io"
mongo = Mongo::Client.new(...)
bucket = mongo.database.fs
content_length = bucket.find(_id: id).first[:length]
stream = bucket.open_download_stream(id)
io = Down::ChunkedIO.new(
size: content_length,
chunks: stream.enum_for(:each),
on_close: -> { stream.close },
)
Then open-uri + Net::HTTP is the default backend, loaded by requiring down
or down/net_http
:
require "down"
# or
require "down/net_http"
Down.download
is implemented as a wrapper around open-uri, and fixes some of
open-uri's undesired behaviours:
- open-uri returns
StringIO
for files smaller than 10KB, andTempfile
otherwise, butDown.download
always returns aTempfile
- open-uri doesn't give any extension to the returned
Tempfile
, butDown.download
adds the extension from the URL - ...
Since open-uri doesn't expose support for partial downloads, Down.open
is
implemented using Net::HTTP
directly.
Down.download
turns off open-uri's following redirects, as open-uri doesn't
have a way to limit the maximum number of hops, and implements its own. By
default maximum of 2 redirects will be followed, but you can change it via the
:max_redirects
option:
Down.download("http://example.com/image.jpg") # 2 redirects allowed
Down.download("http://example.com/image.jpg", max_redirects: 5) # 5 redirects allowed
Down.download("http://example.com/image.jpg", max_redirects: 0) # 0 redirects allowed
Both Down.download
and Down.open
support a :proxy
option, where you can
specify a URL to an HTTP proxy which should be used when downloading.
Down.download("http://example.com/image.jpg", proxy: "http://proxy.org")
Down.open("http://example.com/image.jpg", proxy: "http://user:[email protected]")
Any additional options passed to Down.download
will be forwarded to
open-uri, so you can for example add basic authentication or a timeout:
Down.download "http://example.com/image.jpg",
http_basic_authentication: ['john', 'secret'],
read_timeout: 5
Down.open
accepts :ssl_verify_mode
and :ssl_ca_cert
options with the same
semantics as in open-uri, and any options with String keys will be interpreted
as request headers, like with open-uri.
Down.open("http://example.com/image.jpg", {"Authorization" => "..."})
The HTTP.rb backend can be used by requiring down/http
:
gem "http", "~> 2.1"
gem "down"
require "down/http"
Some features that give the HTTP.rb backend an advantage over open-uri + Net::HTTP include:
- Correct URI parsing with Addressable::URI
- Proper support for streaming downloads (
#download
and now reuse#open
) - Proper support for SSL
- Chaninable HTTP client builder API for setting default options
- Persistent connections
- Auto-inflating compressed response bodies
- ...
You can change the default HTTP::Client
to be used in all download requests
via Down::Http.client
:
# reuse Down's default client
Down::Http.client = Down::Http.client.timeout(read: 3).feature(:auto_inflate)
Down::Http.client.default_options.merge!(ssl_context: ctx)
# or set a new client
Down::Http.client = HTTP.via("proxy-hostname.local", 8080)
All additional options passed to Down::Download
and Down.open
will be
forwarded to HTTP::Client#request
:
Down.download("http://example.org/image.jpg", headers: {"Accept-Encoding" => "gzip"})
If you prefer to add options using the chainable API, you can pass a block:
Down.open("http://example.org/image.jpg") do |client|
client.timeout(read: 3)
end
Down::Http.client
is stored in a thread-local variable, so using the HTTP.rb
backend is thread safe.
- MRI 2.2
- MRI 2.3
- MRI 2.4
- JRuby
The test suite runs the http://httpbin.org/ server locally, and uses it to test downloads. Httpbin is a Python package which is run with GUnicorn:
$ pip install gunicorn httpbin
Afterwards you can run tests with
$ rake test