
s3gof3r's Introduction


s3gof3r provides fast, parallelized, pipelined streaming access to Amazon S3. It includes a command-line interface: gof3r.

It is optimized for high speed transfer of large objects into and out of Amazon S3. Streaming support allows for usage like:

  $ tar -czf - <my_dir/> | gof3r put -b <s3_bucket> -k <s3_object>    
  $ gof3r get -b <s3_bucket> -k <s3_object> | tar -zx

Speed Benchmarks

On an EC2 instance, gof3r can exceed 1 Gbps for both puts and gets:

  $ gof3r get -b test-bucket -k 8_GB_tar | pv -a | tar -x
  Duration: 53.201632211s
  [ 167MB/s]
  

  $ tar -cf - test_dir/ | pv -a | gof3r put -b test-bucket -k 8_GB_tar
  Duration: 1m16.080800315s
  [ 119MB/s]

These tests were performed on an m1.xlarge EC2 instance with a virtualized 1 Gigabit ethernet interface. See Amazon EC2 Instance Details for more information.

Features

  • Speed: Especially for larger S3 objects, where parallelism can be exploited, s3gof3r will saturate the bandwidth of an EC2 instance. See the benchmarks above.

  • Streaming Uploads and Downloads: As the examples above illustrate, streaming allows the gof3r command-line tool to be used with linux/unix pipes. This allows the data to be transformed in parallel as it is uploaded to or downloaded from S3.

  • End-to-end Integrity Checking: s3gof3r calculates the md5 hash of the stream in parallel while uploading and downloading (illustrated in the sketch after this list). On upload, a file containing the md5 hash is saved in S3, and it is checked against the calculated md5 on download. On upload, the content-md5 of each part is also calculated and sent with the header to be checked by AWS. s3gof3r additionally checks the 'hash of hashes' returned by S3 in the ETag field on completion of a multipart upload. See the S3 API Reference for details.

  • Retry Everything: Every HTTP request and every part is retried on both uploads and downloads. Requests to S3 frequently time out, especially under high load, so retrying is essential to completing large uploads or downloads.

  • Memory Efficiency: Memory used to upload and download parts is recycled. For an upload or download with the default concurrency of 10 and part size of 20 MB, the maximum memory usage is less than 300 MB. Memory footprint can be further reduced by reducing part size or concurrency.
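To make the integrity-checking idea above concrete, here is a small, self-contained sketch (an illustration of the approach, not s3gof3r's internal code): the md5 is computed by io.TeeReader while the bytes stream to their destination, so hashing does not require an extra pass over the data.

package main

import (
    "crypto/md5"
    "fmt"
    "io"
    "os"
    "strings"
)

func main() {
    src := strings.NewReader("example object data") // stands in for the upload or download stream
    h := md5.New()

    // TeeReader feeds every byte to the hash as it is copied to the
    // destination, so the md5 is computed in parallel with the transfer.
    if _, err := io.Copy(os.Stdout, io.TeeReader(src, h)); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("\nmd5: %x\n", h.Sum(nil))
}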

Installation

s3gof3r is written in Go and requires Go 1.5 or later. It can be installed with go get, which downloads and compiles it from source. To install the command-line tool, gof3r, set GO15VENDOREXPERIMENT=1 in your environment and run:

$ go get github.com/rlmcpherson/s3gof3r/gof3r

To install just the package for use in other Go programs:

$ go get github.com/rlmcpherson/s3gof3r
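A typical streaming upload with the package looks roughly like the following. This is a minimal sketch based on the exported API (EnvKeys, New, Bucket, PutWriter); the bucket and key names are placeholders, and the godoc linked below is authoritative.

package main

import (
    "io"
    "log"
    "os"

    "github.com/rlmcpherson/s3gof3r"
)

func main() {
    // Keys from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY; on an EC2 instance
    // with an IAM role, s3gof3r.InstanceKeys() can be used instead.
    keys, err := s3gof3r.EnvKeys()
    if err != nil {
        log.Fatal(err)
    }

    s3 := s3gof3r.New("", keys)               // "" selects the default S3 domain
    b := s3.Bucket("my-bucket")               // placeholder bucket name
    w, err := b.PutWriter("my/key", nil, nil) // default headers and config
    if err != nil {
        log.Fatal(err)
    }
    if _, err := io.Copy(w, os.Stdin); err != nil { // stream stdin to S3
        log.Fatal(err)
    }
    if err := w.Close(); err != nil { // Close completes the multipart upload
        log.Fatal(err)
    }
}

Downloads work the same way in reverse via b.GetReader, which returns a reader (and the response headers) to stream from.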

Release Binaries

To try the latest release of the gof3r command-line interface without installing Go, download the statically-linked binary for your architecture from GitHub Releases.

gof3r (command-line interface) usage:

  To stream up to S3:
     $  <input_stream> | gof3r put -b <bucket> -k <s3_path>
  To stream down from S3:
     $ gof3r get -b <bucket> -k <s3_path> | <output_stream>
  To upload a file to S3:
     $ gof3r cp <local_path> s3://<bucket>/<s3_path>
  To download a file from S3:
     $ gof3r cp s3://<bucket>/<s3_path> <local_path>

Set AWS keys as environment variables:

  $ export AWS_ACCESS_KEY_ID=<access_key>
  $ export AWS_SECRET_ACCESS_KEY=<secret_key>

gof3r also supports IAM role-based keys from EC2 instance metadata. If these are available and environment variables are not set, they are used automatically.

Examples:

$ tar -cf - /foo_dir/ | gof3r put -b my_s3_bucket -k bar_dir/s3_object -m x-amz-meta-custom-metadata:abc123 -m x-amz-server-side-encryption:AES256
$ gof3r get -b my_s3_bucket -k bar_dir/s3_object | tar -x    

See the gof3r man page for complete usage.

Documentation

s3gof3r package: see the godoc for API documentation.

gof3r CLI: godoc and gof3r man page

Have a question? Ask it on the s3gof3r Mailing List

s3gof3r's People

Contributors

ajschumacher, aybabtme, bgdavidx, bogdansorlea, cce, cyberdelia, hectcastro, licentious, mathiasaerts, mhoglan, rlmcpherson, subramanyamchitti, wolfd, xsleonard


s3gof3r's Issues

gof3r dial tcp: i/o timeout on fedora 21

When running gof3r on Fedora 21 (whether the 0.4.10 bin or the locally compiled 0.4.11) I receive a dial tcp: i/o timeout on both get and put. The same command with the same credentials works on CentOS 6 & 7. Below is a sample command:

$ cat output | pv | gof3r put -b sqlbackup.wts.edu -k s3backup.test

This results in the following:

18.6KiB 0:00:00 [5.99MiB/s] [    <=>                               ]
gof3r error: Post https://s3.amazonaws.com/sqlbackup.wts.edu/s3backup.test?uploads: dial tcp: i/o timeout

A sample failed download looks as follows:

gof3r error: Get https://s3.amazonaws.com/sqlbackup.wts.edu/nathan.wts.edu_backup_20150811035201.xbcrypt: dial tcp: i/o timeout

I know that the credentials are not at fault, because s3cmd works flawlessly with them. I also know that I am using the correct key on the download.

Any thoughts?

expose respError

I'm using s3gof3r as a library and want to use *s3gof3r.respErr to get the status code out. I'd even take an interface with something like StatusCode() int, though being able to access the other attributes would be cool too.
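For illustration, something like the following is what is being asked for; the StatusCoder interface and the fake error type here are hypothetical, not part of the current s3gof3r API.

package main

import "fmt"

// StatusCoder is the kind of interface the request describes; the name is
// made up for this sketch.
type StatusCoder interface {
    StatusCode() int
}

// fakeRespErr stands in for the unexported s3gof3r response error type.
type fakeRespErr struct{ code int }

func (e fakeRespErr) Error() string   { return fmt.Sprintf("S3 error: %d", e.code) }
func (e fakeRespErr) StatusCode() int { return e.code }

func main() {
    var err error = fakeRespErr{code: 403}
    if sc, ok := err.(StatusCoder); ok {
        fmt.Println("status:", sc.StatusCode()) // status: 403
    }
}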

I realize this is a cli app first and foremost, but it works pretty well as a package. If you're open to it, I'd be more than happy to file pull requests to improve the s3gof3r package. Some things I've wanted:

  • Some admin stuff (delete files or buckets)
  • StatsD integration through an interface.

EDIT: I hit tab then enter before I was done typing :)

Glacier support

Any considerations to add support for amazon glacier?

Edit: Solved with lifecycle rule in aws.

Multipart upload causes panic: runtime error: slice bounds out of range

We have been experiencing a deterministic failure on multipart upload against Riak CS. We are not sure how to fix this, as we don't have solid Go experience, so we are just reporting it here. See [1] for the complete stack trace and [2] for what we are doing in detail. It also seems that commit 7138fa9 [3] introduced this issue, as we don't see it with v0.4.3 but do see it in versions after [3].

We found a workaround: increasing the part size by one byte, to -s 5242881, avoids this issue. Hope it helps.

nil pointer crash

Got this error after running for 5 days in production:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4fac57]

goroutine 265104500 [running]:
github.com/rlmcpherson/s3gof3r.(*putter).Close(0xc21c8d9a20, 0x0, 0x0)
/home/ec2-user/go/bin/src/github.com/rlmcpherson/s3gof3r/putter.go:255 +0xd17
main.func·006(0x0, 0x0, 0x0, 0x0)

Seems like an S3 failure? The offending line of code is below:

https://github.com/rlmcpherson/s3gof3r/blob/master/putter.go#L255

defer checkClose(resp.Body, &err)

streaming anomaly

I am streaming to S3 like this: gof3r put -b trice-app-files -k bucketPath -m x-amz-server-side-encryption:AES256

When bucketPath is just a file name (going straight into the top-level bucket), it all works fine.

However, when bucketPath is a path (e.g., 2014/09/01/fileName), it fails with the following error:
gof3r error: 403: "The request signature we calculated does not match the signature you provided. Check your key and signing method."

Any help you can provide is greatly appreciated. I love the tool, btw.

buckets with period in their name cause certs to fail validation incorrectly

s3gof3r appears to fail if the bucket name contains periods. It is common to use hostnames as bucket names but this seems to trigger some hostname/certificate verification problems wherein buckets named some.hostname.com don't match the wildcard cert presented by s3.amazonaws.com which appears to be *.s3.amazonaws.com.

Disabling name checks by modifying http_client.go to add a TLSClientConfig block works around this, but it is insecure:

func ClientWithTimeout(timeout time.Duration) *http.Client {
        transport := &http.Transport{
                TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // insecure: skips certificate name checks
        }
        return &http.Client{Transport: transport} // rest of the original function omitted for brevity
}

config handling in CLI

get and put both have these two lines which I don't think are doing what you want:
conf := new(s3gof3r.Config)
conf = s3gof3r.DefaultConfig

The first allocates memory and returns a pointer; the second changes conf to point at DefaultConfig instead. Any modifications to conf therefore change DefaultConfig. Which is OK, I suppose, because it's only used once anyway.
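For reference, a small sketch of the aliasing and one way to avoid it, assuming DefaultConfig is a *Config with fields like Concurrency as used elsewhere in the package:

package main

import "github.com/rlmcpherson/s3gof3r"

func main() {
    conf := new(s3gof3r.Config)  // allocates a zero Config...
    conf = s3gof3r.DefaultConfig // ...then discards it; conf now aliases DefaultConfig,
    _ = conf                     // so any write through conf mutates the package default.

    // Copying the struct by value yields an independent config that is safe to modify:
    c := *s3gof3r.DefaultConfig
    c.Concurrency = 5 // example tweak; does not touch DefaultConfig
    _ = c
}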

Log formatting

It looks like gof3r logging is pretty sloppy in a number of places. I was testing it recently for large-ish uploads and it does great, but it's logging raw structs in many different places instead of friendlier messages. Would you accept a patch to clean up logging?

Copy command to copy s3 objects to and from files and between buckets.

Add a copy command, 'cp', with positional arguments that allow easily copying S3 objects to files, files to S3 objects, and S3 objects to other S3 objects. Example commands:

gof3r cp s3://<bucket>/<key_path>  <local_file_path>

gof3r cp s3://<bucket1>/<key_path1> s3://<bucket2>/<key_path2>

0.3.4 does not build

Hi,

I've tried to install following the documentation example:

$go get github.com/rlmcpherson/s3gof3r/gof3r

github.com/rlmcpherson/s3gof3r/gof3r

/home/hrez/src/gocode/src/github.com/rlmcpherson/s3gof3r/gof3r/get.go:10: import /home/hrez/src/gocode/pkg/linux_amd64/github.com/rlmcpherson/s3gof3r.a: not a package file

go version go1.2.2 linux/amd64

$file /home/hrez/src/gocode/pkg/linux_amd64/github.com/rlmcpherson/s3gof3r.a
/home/hrez/src/gocode/pkg/linux_amd64/github.com/rlmcpherson/s3gof3r.a: current ar archive

$ar -t /home/hrez/src/gocode/pkg/linux_amd64/github.com/rlmcpherson/s3gof3r.a
__.PKGDEF
go.6

I tried go 1.3 and go 1.1.2 with the same result.
What am I doing wrong? Granted I know little about golang builds.

Support path style addressing for buckets

Hello =)

When testing a project that uses goamz (or any S3 lib that tests using a local stub server), a local server is created and shut down. This local server uses a randomly assigned port and its endpoint looks like http://127.0.0.1:53374.

When using virtual hosted style endpoints (the current method of this package), requests pointing to this fake server will have the form:

https://mybucket.127.0.0.1:53374/

which will result in an error from the TCP layer:

dial tcp: lookup mybucket.127.0.0.1: no such host

The solution that goamz takes is to address buckets using the path-style endpoint:

https://127.0.0.1:53374/mybucket/keyname

Or in general:

# the current way
https://mybucket.s3.amazonaws.com/keyname
# becomes
https://s3.amazonaws.com/mybucket/keyname

My proposal is either to replace the current method with path-style addressing, or at least to make it optional.

I'm opening this issue first, if you agree with this I'll work on a PR.

See http://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html

pass headers to all http requests for encryption

If I add the necessary headers for SSE-C (server-side encryption with customer-provided keys) to uploads to S3, I get the error: 400: "The multipart upload initiate requested encryption. Subsequent part requests must include the appropriate encryption parameters."

It doesn't work with either the cp or the put command.

Encryption seems to work with SSE-S3 (x-amz-server-side-encryption) but not with SSE-KMS, since that depends on Signature Version 4.

'gof3r rm' does not remove md5 files

By default the gof3r cp command creates .md5 files in S3. But when removing a file from S3 with the gof3r rm command, the .md5 files are left. Is this intentional? A flag to also delete the .md5 files would be nice (or delete the .md5 files by default and have a flag to leave them / not check).

Progress output

Any plans to implement some kind of progress output? Streaming some very large files, and having no indication of progress is unsettling at times—especially when getting put errors and having retries performed.

Will there be more features?

I was wondering if you intend to add anything like a mkdir command or a way to set folders/files to public from gof3r.

Support for EC2 instance profiles

Looking at the code, I don't see support for pulling down credentials from the EC2 metadata service if they are provided. Seeding AWS access keys to instances is not really a recommended way of authorizing these applications inside of the cloud, although environment variable support should be available for the user's workstation.

ssl?

Question: how do we specify whether we want SSL or not? Thanks.

Add ini file support

Add support for a ~/.gof3r configuration file in ini format that can include default values for commands.

OBOE (off by one error) when determining if all chunks have been read

The code for determining whether all chunks have been read has an off-by-one error that causes not all bytes to be read when the number of bytes in the chunk is 1 byte more than the number of bytes read after the copy:

        if g.cIdx >= g.rChunk.size-1 { // chunk complete
            g.sp.give <- g.rChunk.b
            g.chunkID++
            g.rChunk = nil
        }

This will occur if the chunk size is a multiple of the default byte buffer size (32 KiB, 32 * 1024) that io.Copy uses, plus 1 byte. When the copy exits, g.cIdx is updated with the number of bytes read, which makes the variable a 1-based count (as opposed to a 0-based index into the buffer). The chunk size is a length, which is also 1-based, so there is no need to offset by 1.

When this is encountered, the goroutines end up in an infinite select. On the next iteration of the loop, the check g.bytesRead == g.contentLen fails because there is 1 byte remaining in the chunk, so the code proceeds to call g.nextChunk() since g.rChunk was cleared out earlier. The goroutine then blocks in the select in nextChunk(), and there are no more chunks.
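A self-contained toy of the off-by-one, following the 1-based-count reasoning above (not the actual getter code): with a chunk one byte larger than io.Copy's 32 KiB buffer, the >= size-1 check declares the chunk complete while one byte is still unread, whereas comparing against size does not.

package main

import "fmt"

func main() {
    const size = 32*1024 + 1 // chunk one byte larger than io.Copy's 32 KiB buffer
    cIdx := 32 * 1024        // bytes read from the chunk after one full copy

    buggy := cIdx >= size-1 // reports the chunk complete with 1 byte still unread
    fixed := cIdx >= size   // correctly reports the chunk as incomplete

    fmt.Println("buggy check says complete:", buggy) // true
    fmt.Println("fixed check says complete:", fixed) // false
}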

goroutine trace from pprof

goroutine 43 [select, 83 minutes]:
github.com/rlmcpherson/s3gof3r.(*getter).nextChunk(0xc2083705a0, 0x44c2d4, 0x0, 0x0)
    /go/src/github.com/rlmcpherson/s3gof3r/getter.go:244 +0x2d9
github.com/rlmcpherson/s3gof3r.(*getter).Read(0xc2083705a0, 0xc21b1e0000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
    /go/src/github.com/rlmcpherson/s3gof3r/getter.go:207 +0x182
io.Copy(0x7feef9a5e5a8, 0xc208064800, 0x7feef9a6e320, 0xc2083705a0, 0x180000, 0x0, 0x0)
    /usr/local/go/src/pkg/io/io.go:353 +0x1f3
project/dbimport.(*bucketUtils).concat(0xc21b2071c0, 0xc208242000, 0x5a, 0x80, 0xc21b64ab40, 0x40, 0x0, 0x0)
    /go/src/project/main.go:217 +0x7ef
project/dbimport.processS3Files(0xc2080ec1e0, 0xc2081ec660, 0xc2083ce2c0, 0x7fffc6ee6e1f, 0xf, 0xc20821ff50, 0x2a, 0xc208269200, 0x5a, 0x5a, ...)
    /go/src/project/main.go:792 +0x17a3
project/dbimport.func·007(0xc2083cc2a0)
    /go/src/project/main.go:622 +0x13b6


goroutine 93481 [select]:
github.com/rlmcpherson/s3gof3r.func·002()
    /go/src/github.com/rlmcpherson/s3gof3r/pool.go:42 +0x6cd
created by github.com/rlmcpherson/s3gof3r.bufferPool
    /go/src/github.com/rlmcpherson/s3gof3r/pool.go:68 +0x15a

Can be reproduced with any file that is 1 byte larger than a multiple of 32KiB, such as in my case 1081345 (32 * 1024 * 33 + 1); Default PartSize of 20MiB

The last couple iterations of the loop with some debugs

...
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:207: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:228: g.cIdx: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:229: g.rChunk.size: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:233: n: 32768
s3gof3r2015/03/06 23:37:16 getter.go:234: bytesRead: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:238: bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:244: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:246: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:207: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:228: g.cIdx: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:229: g.rChunk.size: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:233: n: 32768
s3gof3r2015/03/06 23:37:16 getter.go:234: bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:238: bytesRead: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:244: g.chunkID: 1
s3gof3r2015/03/06 23:37:16 getter.go:248: g.rChunk.id: nil
s3gof3r2015/03/06 23:37:16 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 1
s3gof3r2015/03/06 23:37:16 getter.go:209: g.rChunk.id: nil
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:271: ------nextChunk select------

Results in infinite select loop. File on disk is missing 1 byte (the last byte of file).

Can also be reproduced by using any PartSize that is a multiple of 32 KiB plus 1. Below is an example of a file that succeeded with the default 20 MiB PartSize but fails with a PartSize of 131073 (32 * 1024 * 4 + 1); it also shows a multipart download, which the example above did not.

...
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:207: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 0
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 32768
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1212416
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:228: g.cIdx: 32768
s3gof3r2015/03/06 23:55:15 getter.go:229: g.rChunk.size: 95436
s3gof3r2015/03/06 23:55:15 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:233: n: 32768
s3gof3r2015/03/06 23:55:15 getter.go:234: bytesRead: 1212416
s3gof3r2015/03/06 23:55:15 getter.go:238: bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:244: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:246: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:207: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 0
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 65536
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:228: g.cIdx: 65536
s3gof3r2015/03/06 23:55:15 getter.go:229: g.rChunk.size: 95436
s3gof3r2015/03/06 23:55:15 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:233: n: 29900
s3gof3r2015/03/06 23:55:15 getter.go:234: bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:238: bytesRead: 1275084
s3gof3r2015/03/06 23:55:15 getter.go:244: g.chunkID: 10
s3gof3r2015/03/06 23:55:15 getter.go:248: g.rChunk.id: nil
s3gof3r2015/03/06 23:55:15 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 10
s3gof3r2015/03/06 23:55:15 getter.go:209: g.rChunk.id: nil
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 29900
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 95436
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1275084
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:271: ------nextChunk select------

Results in infinite select loop. File on disk will be missing 1 byte for every chunk.

Pull request incoming which addresses the issue and adds a couple of guards:

  • If bytes read does not equal content length and all chunks have been processed, error
  • If bytes read is greater than content length, error
    • This should not occur as golang uses a LimitedReader up to the content length, but for completeness should be here

For more robustness, there could be a timeout on the select in nextChunk(): if for any reason the routine gets in there and there are no more chunks, it will block forever. We cannot rely on the underlying HTTP connection going away and triggering a close, because the data has already been read into memory. I did consider having the worker() function close g.readCh, but that would not break out of the select (it would if the code ranged over the channel). I have not thought this fully through, but I feel something can be done to signal that no more chunks will arrive on the channel because the workers are gone (see the sketch below).
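Purely as an illustration of that signaling idea (not the s3gof3r code): a dispatcher can close a separate done channel once all workers have exited, and the consumer's select can wait on both channels so it never blocks forever.

package main

import (
    "fmt"
    "sync"
)

func main() {
    readCh := make(chan int)
    done := make(chan struct{})

    var wg sync.WaitGroup
    for w := 0; w < 3; w++ {
        wg.Add(1)
        go func(id int) { // workers deliver chunks, then exit
            defer wg.Done()
            readCh <- id
        }(w)
    }
    go func() { // close done only after every worker has finished
        wg.Wait()
        close(done)
    }()

    for {
        select {
        case id := <-readCh:
            fmt.Println("got chunk from worker", id)
        case <-done:
            fmt.Println("workers gone, no more chunks coming")
            return
        }
    }
}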

Question: Linux binary doesn't work

Hi, it seems to me that the released Linux binary doesn't work. I compiled it myself (GOOS=linux go build) and got a much bigger binary that did work.

'#' in S3 key/"path"

In an S3 bucket, I have a folder like so: #foo

I am trying to upload a file to it, e.g. put a file with local pathname /tmp/bar into the key #foo/bar

$my_bucket_name contains only alpha-numerics and the - character.

[notroot@aio14 ~]$ touch /tmp/bar
[notroot@aio14 ~]$ cat /tmp/bar | gof3r put -b "$my_bucket_name" -k \#foo/bar
gof3r error: 412: "At least one of the pre-conditions you specified did not hold"
[notroot@aio14 ~]$ gof3r cp /tmp/bar s3://"$my_bucket_name"/#foo/bar
gof3r error: 400: "A key must be specified"
[notroot@aio14 ~]$ 
[notroot@aio14 ~]$ gof3r cp /tmp/bar s3://"$my_bucket_name"/foo/bar
duration: 770.022417ms
[notroot@aio14 ~]$ cat /tmp/bar | gof3r put -b "$my_bucket_name" -k foo/bar
duration: 760.235401ms

Behavior w.r.t. backpressure

We've seen some situations where gof3r is using unexpectedly large amounts of memory (e.g., hundreds of megs of RSS). This seems odd, since there's nothing that's inherently memory-intensive in its functionality as I understand it. I thought the issue might be buffering: does gof3r provide backpressure when reading stdin is a lot faster than the S3 upload? Does it provide backpressure when writing to stdout is a lot slower than the S3 download?

HTTP HEAD request should be used instead of GET for initial request

To determine the number of chunks, the content-length of the object is needed. Currently an HTTP GET is performed, and the body of the response is never read, just to obtain this header. An HTTP HEAD would return the same information without including the payload. This would also eliminate unnecessary TCP packets that carry segments of the payload but are never read.

I think there is an opportunity to optimize further by having the initial request be an HTTP GET with the byte range of the first chunk. The HTTP spec says that if the byte range exceeds the content-length, the content-length is used. The initial response used to get the content length for calculating chunk totals could then be reused, and there would be no need for a second HTTP GET. We could also determine that, if no additional chunks are needed, no workers need to be started and processing can continue with the current response. This is bigger work and should probably be moved into its own feature/issue, or this issue can stay open to track it.
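A minimal sketch of the HEAD-based variant described above; the URL is a placeholder, and a real client would also need to sign the request:

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    url := "https://s3.amazonaws.com/example-bucket/example-key" // placeholder object URL

    resp, err := http.Head(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // The Content-Length header is available without transferring any of the
    // payload, so the chunk count can be computed before the first ranged GET.
    fmt.Println("content length:", resp.ContentLength)
}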

requested signature is invalid with PathStyle true

When using s3gof3r to download or upload to/from a bucket with credentials provided, I get a 403 from AWS:

The request signature we calculated does not match the signature you provided. Check your key and signing method.

better support for endpoint routing

Getting a lot of this:

gof3r error: 307: "Please re-send this request to the specified temporary endpoint. Continue to use the original request endpoint for future requests."

It'd be nice to have a little more debugging information. It seems that gof3r could handle the 307 as well.

It doesn't seem that the endpoint directive is really being picked up for use.

support simple ACLs

It's currently possible to add -m x-amz-acl:public-read but it would be nicer & easier to have a command-line option for these pretty frequent use-cases.

Add ability to continue broken transfers

Often when transferring large objects, the connection breaks or there is an I/O timeout and I have to start the transfer all over again. It would be great if I could just continue the transfer when such an issue occurs.

gof3r: poor error message on failure

I ran into another logging-related issue. I haven't changed anything in my application, but all of a sudden, I'm getting

Error:  400: "The XML you provided was not well-formed or did not validate against our published schema"

That in itself is not necessarily a problem--I haven't changed the gof3r binary I'm using so I suspect I'm doing something wrong--but the error message here is pretty opaque--I as the user am certainly not providing any XML.

AWS Access Key ID does not exist in records

I got this message, "The AWS Access Key Id you provided does not exist in our records."
And what I typed was this, " ./gof3r put -b testbucket1 -k /root/gof3r_0.4.9_linux_amd64/bbb "

I was trying to put a file to a Japanese object storage service that is compatible with S3.
I succeeded in putting a file to it using an S3 command-line tool.
The API key doesn't seem to be read correctly from the environment variables.

Would you be so kind as to teach me how to fix this problem?
Where do I change the endpoint, access key, and secret key?
Also, is the gof3r command the same as the Go package?

gof3r error: Get : 301 response missing Location header

Sadly, I cannot use this tool, and it seems there is no additional debug information available.

$ AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=xxx gof3r get -b <bucket> -k <key>
gof3r error: Get : 301 response missing Location header

Some environment information (I am on vagrant)

$ go version
go version go1.3 linux/amd64
$ uname -a
Linux precise64 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

memory leak?

Hi,

Can something be done about memory consumption?
I ran "gof3r put" with default options on what turned out to be a 671 GB object.
It took 416m56.254s, and gof3r memory utilization grew in the end to 21 GB virtual and 14 GB resident.

gof3r version 0.4.5

Quick make deb script

Hi, I just wanted to share a quick script that makes a deb and deploys it to a public repo; maybe someone will find it useful...
:)

VERSION := 0.4.9

all: clean build deb

build:
    mkdir -p packaging/root/usr/local/bin
    curl -L https://github.com/rlmcpherson/s3gof3r/releases/download/v$(VERSION)/gof3r_$(VERSION)_linux_amd64.tar.gz | tar xvz --strip-components=1
    mv gof3r packaging/root/usr/local/bin/s3gof3r
clean:
    find . -name "*.out" -delete
    rm -rf packaging

deb:
    fpm -s dir -t deb -n s3gof3r -v $(VERSION) -p packaging/s3gof3r.deb \
        --deb-priority optional --category admin \
        --force \
        --deb-compression bzip2 \
        -a amd64 \
        packaging/root/=/

deploy:
    package_cloud push dz0ny/opensource/any/any packaging/s3gof3r.deb

Reduce get memory usage with fixed q_wait_threshold

Currently, if the size of q_wait exceeds concurrency, no new chunks are sent to the workers to download. This limits unbounded heap growth, but setting the threshold to concurrency is fairly arbitrary.

From my testing in EC2, changing the threshold to a fixed value of 2 has no impact on get performance, but does reduce memory usage to bring it closer to that of a put. With the default settings of 20 MB chunk size and concurrency of 10, the memory usage tops out at ~300 MB due to allocation of 15 20MB slices.

Nil key request results in infinite select

If a nil key, which results in the string "", is requested, an infinite select will occur: the workers exit because initChunks() closes the g.getCh channel without putting anything on it.

The for loop in initChunks() will not execute due to the negative content length

for i := int64(0); i < g.contentLen; {
...
}

Content length is set to -1 by golang if the HTTP response is deemed to be chunked transfer encoding or EOF close response.

When a nil key is requested, the resulting HTTP request is to the bucket URL only, which results in AWS returning a listing of the bucket as a streamed XML response. The streamed response will not contain any content-length header, and golang will set the content-length to -1 in the response.

Since the content length must be known to execute multipart GET requests, it can be assumed that the client does not support chunked transfer-encoded responses or EOF-close responses, and that all responses must contain a content-length.

Incoming pull request to address the issue. Two guards are put in place:

  • If content-length of response is -1, error
  • If path requested is "" (nil for string), error

It feels like a similar fix could be added, as discussed at the end of #55: if the workers closed the channel when exiting, or sent some other notification that they have finished (maybe a new channel?), then nextChunk() would exit.
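A minimal sketch of the two guards, with made-up function and error names (the actual pull request may differ):

package main

import (
    "errors"
    "fmt"
    "net/http"
)

// checkGetPreconditions is a hypothetical helper illustrating the two guards.
func checkGetPreconditions(path string, resp *http.Response) error {
    if path == "" {
        return errors.New("no key given")
    }
    if resp.ContentLength == -1 {
        // golang sets ContentLength to -1 for chunked or EOF-close responses,
        // which the multipart getter cannot split into ranged requests.
        return errors.New("response missing content-length")
    }
    return nil
}

func main() {
    resp := &http.Response{ContentLength: -1}
    fmt.Println(checkGetPreconditions("", resp))         // no key given
    fmt.Println(checkGetPreconditions("some/key", resp)) // response missing content-length
}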
