wal-g / wal-g Goto Github PK

View Code? Open in Web Editor NEW

3.1K 61.0 447.0 100.1 MB

Archival and Restoration for databases in the Cloud

License: Other

Go 89.38% Makefile 0.49% Shell 8.93% Dockerfile 0.66% Gherkin 0.55%

go golang postgresql postgres backup archiving s3 aws aws-s3 azure

wal-g's People

Contributors

Stargazers

Watchers

Forkers

sehrope willglynn emrul fdr tchen0123 bithavoc eydunn decibel cuulee blunney1 x4m db4u david-dever-23-box matisojka skyformat99 aevin1387 kubedb jnicholls bdpreom adamdecaf kamabe ng-pe ferhatelmas antoinefink scukerman currantlabs davidfetter kcollasarundell nickbruun cyberdem0n tekliner tinsane weiboyiyou joe2hpimn banno snero phil-arena iketutg evansmurithi afefelov mynom savichev-igor dirbaio rogervaas g0djan paramadeep bn0ir bixu timescale alijkl intfrr forging2012 lubaoyilang mbrukman iamrz1 codelingobot friendlydan cjhubert zeronetto kristinaetc femaref levshinmisha tools-env awesomegolang akulin defntvdm zarix908 eliyy4hu emuhedo burningmarshmallow anovosiolov alsoedit tonyfresher jellymary mialinx giobim barbaniahra heikoonnebrink gainanist secret9 admirallll kadukm reshke van10000 khanova dragonsmith munakoiso ilicmilan jkylin pdxmolab tri0l kruftik dyson yangmain romutchio daniilkurlovich alexsvalov99 alexnguyen91 tbartelmess niquola

wal-g's Issues

Logs for upload retry errors

It's a bit hard to debug what the issue is with failed uploads, the following error doesn't tell much:

postgres_1  | 2017/08/19 11:54:20 upload: failed to upload 'pg_xlog/000000010000000000000091'. Restarting in 1.75 seconds

A more useful output would be:

2017/08/19 11:56:27 upload: failed to upload 'pg_xlog/000000010000000000000091': RequestError: send request failed
postgres_1  | caused by: Put https://backups.s3.amazonaws.com/wal-g/wal_005/000000010000000000000091.lz4: x509: certificate signed by unknown authority. Restarting in 2.02 seconds
service_1   | Insert

upload.go

if multierr, ok := e.(s3manager.MultiUploadFailure); ok {
				log.Printf("upload: failed to upload '%s' with UploadID '%s'. Restarting in %0.2f seconds", path, multierr.UploadID(), et.wait)
			} else {
				log.Printf("upload: failed to upload '%s': %s. Restarting in %0.2f seconds", path, e.Error(), et.wait)
			}

Crash when fetching

Unexpected EOF

github.com/katie31/wal-g.(*FileTarInterpreter).Interpret
	/home/mapi/var/go/src/github.com/katie31/wal-g/tar.go:55
github.com/katie31/wal-g.extractOne
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:37
github.com/katie31/wal-g.ExtractAll.func1.2
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:106
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:2197
extractOne: Interpret failed
github.com/katie31/wal-g.extractOne
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:39
github.com/katie31/wal-g.ExtractAll.func1.2
	/home/mapi/var/go/src/github.com/katie31/wal-g/extract.go:106
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:2197
/home/mapi/lib/util.rb:4:in `r': unhandled exception

Missing WAL

Hi,

I'm running into an issue with a missing WAL file. Currently I have postgres archiving with both wal-e and wal-g

archive_command = '/usr/bin/envdir /etc/wal-e.d/writer /usr/local/bin/wal-e wal-push %p && /usr/bin/envdir /etc/wal-g.d/writer /usr/local/bin/wal-g wal-push %p'

However when restoring from wal-g, postgres reports an error:

2018/02/05 22:18:06 WAL-prefetch file:  000000010000B6A90000008E
2018/02/05 22:18:06 Archive '000000010000B6A90000008E' does not exist.
2018-02-05 22:18:06.792 UTC [15164] LOG:  restored log file "000000010000B6A90000008D" from archive
2018-02-05 22:18:07.580 UTC [15164] FATAL:  WAL ends before end of online backup
2018-02-05 22:18:07.580 UTC [15164] HINT:  All WAL generated while online backup was taken must be available at recovery.

In the wal-g log it does in fact seem to skip right over 000000010000B6A90000008E

BUCKET: database
SERVER: wal-g/db3
WAL PATH: wal-g/db3/wal_005/000000010000B6A90000008D.lz4

BUCKET: database
SERVER: wal-g/db3
WAL PATH: wal-g/db3/wal_005/000000010000B6A90000008F.lz4

The wal-e log does show this segment:

Feb  5 04:12:41 db3 wal_e.worker.upload INFO     
MSG: begin archiving a file#012        
DETAIL: Uploading "pg_wal/000000010000B6A90000008D" to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo".#012        
STRUCTURED: time=2018-02-05T10:12:41.865887-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo prefix=wal-e/db3/ seg=000000010000B6A90000008D state=begin

Feb  5 04:12:41 db3 wal_e.worker.upload INFO     
MSG: begin archiving a file#012        
DETAIL: Uploading "pg_wal/000000010000B6A90000008E" to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo".#012        
STRUCTURED: time=2018-02-05T10:12:41.869823-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo prefix=wal-e/db3/ seg=000000010000B6A90000008E state=begin

Feb  5 04:12:42 db3 wal_e.worker.upload INFO     
MSG: completed archiving to a file#012        
DETAIL: Archiving to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo" complete at 14886.7KiB/s.#012        
STRUCTURED: time=2018-02-05T10:12:42.557404-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008D.lzo prefix=wal-e/db3/ rate=14886.7 seg=000000010000B6A90000008D state=complete

Feb  5 04:12:42 db3 wal_e.worker.upload INFO     
MSG: completed archiving to a file#012        
DETAIL: Archiving to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo" complete at 11440.3KiB/s.#012        
STRUCTURED: time=2018-02-05T10:12:42.792245-00 pid=15524 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008E.lzo prefix=wal-e/db3/ rate=11440.3 seg=000000010000B6A90000008E state=complete

Feb  5 04:12:44 db3 wal_e.worker.upload INFO     
MSG: begin archiving a file#012        
DETAIL: Uploading "pg_wal/000000010000B6A90000008F" to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo".#012        
STRUCTURED: time=2018-02-05T10:12:44.996389-00 pid=15545 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo prefix=wal-e/db3/ seg=000000010000B6A90000008F state=begin

Feb  5 04:12:46 db3 wal_e.worker.upload INFO     
MSG: completed archiving to a file#012        
DETAIL: Archiving to "s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo" complete at 5754.38KiB/s.#012        
STRUCTURED: time=2018-02-05T10:12:46.528526-00 pid=15545 action=push-wal key=s3://database/wal-e/db3/wal_005/000000010000B6A90000008F.lzo prefix=wal-e/db3/ rate=5754.38 seg=000000010000B6A90000008F state=complete

I do not see any errors reported in the log.

Is there anything I can provide that would help figure out why the segment is missing?

Thanks!

backup-push runtime error

Hi,

I'm getting the following error when attempting a backup-push. Any ideas on what might be wrong? Please let me know if I can provide any more information.

Thanks!

/usr/bin/envdir /etc/wal-g.d/writer /usr/local/bin/wal-g backup-push /var/lib/postgresql/9.2/main
BUCKET: database
SERVER: wal-g/db3-00:25:90:f5:33:aa
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/wal-g/wal-g.ParseLsn(0x0, 0x0, 0x3, 0x3, 0xc38620)
        /home/travis/gopath/src/github.com/wal-g/wal-g/timeline.go:29 +0x223
github.com/wal-g/wal-g.(*Bundle).StartBackup(0xc4201519e0, 0xc420012700, 0xc4201d6ed0, 0x26, 0x26, 0x433b2e, 0xc4200001a0, 0x200000003, 0xc4200001a0, 0xc420020000)
        /home/travis/gopath/src/github.com/wal-g/wal-g/connect.go:76 +0x361
github.com/wal-g/wal-g.HandleBackupPush(0xc42017c900, 0x1c, 0xc420116e70, 0xc42017c6c0)
        /home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:549 +0x318
main.main()
        /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:107 +0x6db

Surprising UX followed by panic

I have just added wal-g to one of my machines, but I don't entirely understand how to use it. Here is a transcript from a first time user:

[root@hydra:~]# wal-g
Please choose a command:
  backup-fetch  fetch a backup from S3
  backup-push   starts and uploads a finished backup to S3
  wal-fetch     fetch a WAL file from S3
  wal-push      upload a WAL file to S3

[root@hydra:~]# wal-g backup-push
Please choose a command:
  backup-fetch  fetch a backup from S3
  backup-push   starts and uploads a finished backup to S3
  wal-fetch     fetch a WAL file from S3
  wal-push      upload a WAL file to S3

[root@hydra:~]# wal-g backup-push --help
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/wal-g/wal-g.Configure(0x10, 0xc42016e540, 0x0, 0x18)
        /tmp/nix-build-wal-g-0.1.2.drv-0/go/src/github.com/wal-g/wal-g/upload.go:77 +0xa6e
main.main()
        /tmp/nix-build-wal-g-0.1.2.drv-0/go/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:83 +0x1f6

I didn't expect the second command (wal-g backup-push) to work, but I did expect it to inform me about the path. It didn't, so I called with --help which caused some kind of panic and crash.

Need to strip more from the archive path

I was giving wal-g backup+restore a try and I think I spotted a difference in behavior from WAL-E:

While it is true most tar archives box all archive contents in a directory (e.g. postgres.tar.gz would untar to the directory postgres), not so for Postgres backups, because there is no sensible/idiomatic directory name in which a database directory is contained.

Thus the member data/pg_hba.conf may be better as merely pg_hba.conf.

If not for this, I think the whole round trip from archive to restore is working!

wal-g should support other PG* environment variables

PostgreSQL client tools support a common set of environment variables. wal-g should respect them.

Similar to #5, wal-g is being smarter than it needs to be, reading PGHOST and creating a pgx.ConnConfig. wal-g should instead use pgx.ParseEnvLibpq(), which requires less code and supports more features.

DecompressLzo: write to pipe failed

Versions

CentOS 7.3
wal-g v0.1.2
wal-e 1.0.3 (creator of source basebackup)

Problem

Two attempts to backup-fetch a ~1TB basebackup have resulted in wal-g failing with the following stack trace:

base/16417/12983_vm
base/16417/27620292
base/16417/10323582
base/16417/10324516
base/16417/33825612_fsm
2017/08/29 20:07:43 DecompressLzo: write to pipe failed
github.com/wal-g/wal-g.DecompressLzo
        /home/travis/gopath/src/github.com/wal-g/wal-g/decompress.go:126
github.com/wal-g/wal-g.tarHandler
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:66
github.com/wal-g/wal-g.ExtractAll.func2.2
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:138
runtime.goexit
        /home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/asm_amd64.s:2197
ExtractAll: lzo decompress failed
github.com/wal-g/wal-g.tarHandler
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:68
github.com/wal-g/wal-g.ExtractAll.func2.2
        /home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:138
runtime.goexit
        /home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/asm_amd64.s:2197

In both cases, wal-g appeared to be near the end of the restore (over 1TB of data was written to the restore directory) and failed with the same trace. After inspecting the restore and attempting to start postgres, I can confirm that the restore is indeed incomplete.

The basebackup was taken with wal-e 1.0.3, which was also able to restore the same backup without any issues.

Crash with disappearing base backup segment

Hi,

I ran into a crash recently during a backup-push. The server happened to be running pg_repack at the same time, and I'm guessing that's the cause of this since it creates and drops lot of temporary tables.

I'm not sure if this is something wal-g could/should handle, but I thought I'd report it to see if it could recover from this particular error.

Thanks!

/base/16400/94197891.8
2018/02/11 03:17:12 lstat /var/lib/postgresql/10/main/base/16400/94197891.9: no such file or directory
TarWalker: walk failed
github.com/wal-g/wal-g.(*Bundle).TarWalker
	/home/travis/gopath/src/github.com/wal-g/wal-g/walk.go:33
github.com/wal-g/wal-g.(*Bundle).TarWalker-fm
	/home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:570
path/filepath.walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:372
path/filepath.walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:376
path/filepath.walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:376
path/filepath.Walk
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/path/filepath/path.go:398
github.com/wal-g/wal-g.HandleBackupPush
	/home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:570
main.main
	/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:107
runtime.main
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/runtime/proc.go:185
runtime.goexit
	/home/travis/.gimme/versions/go1.8.5.linux.amd64/src/runtime/asm_amd64.s:2197

S3 only

Will the scope of WAL-G remain S3 only, or should it grow support for OpenStack Swift and Azure like WAL-E ?

Support alternative S3 implementations

Additionally, in order to support other non-official S3 implementations such as https://minio.io/ we would need to be able to setup a few custom settings:

Endpoint:         aws.String(os.Getenv("AWS_ENDPOINT")),
S3ForcePathStyle: aws.Bool(os.Getenv("AWS_S3_FORCE_PATH_STYLE") == "true"),

Example:

WALE_S3_PREFIX: s3://backups/wal-g
AWS_ENDPOINT: http://minio:9000

S3ForcePathStyle is required so the aws go sdk doesn't try to call http://backups.s3.amazonaws.com/wal-g and instead reach the minio server using sub path resources as in http://minio:9000/backups/wal-g.

Source https://github.com/minio/cookbook/blob/master/docs/aws-sdk-for-go-with-minio.md

Any plans to implement backup-list?

basebackup_005 and wal_005

What is the reason for these folders being named as such? Is there intention to have these folders actually be dynamically named based on part of the WAL prefix, in order to better organize the objects rather than dumping them all into the same logical location? If yes, then I will rephrase this issue's title to suggest implementing this behavior.

backup-fetch fails with "Interpret: copy failed"

Two servers failed to boot up and restore their databases last night. These two servers were booting in different AWS Regions, restoring from the same S3 backup. They both failed on exactly the same file, which indicates that we pushed a corrupt backup in some way.

The backup-push from the DB master had been run by hand, and there were no errors in the log output.

Master Server Details:

Ubuntu 14.04
Postgres 9.6.6
WAL-G 0.1.7

DB Size: ~1.8TB

Server 1 Failure

Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655160
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655162
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655164
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 2018/04/05 10:56:19 unexpected EOF
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: Interpret: copy failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.(*FileTarInterpreter).Interpret
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/tar.go:86
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:51
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: extractOne: Interpret failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:53
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Info: Class[Wal_g::Db_restore]: Unscheduling all events on Class[Wal_g::Db_restore]

Server 2 Failure

Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655162
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: /base/16400/187655164
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 2018/04/05 06:56:25 unexpected EOF
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: Interpret: copy failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.(*FileTarInterpreter).Interpret
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/tar.go:86
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:51
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: extractOne: Interpret failed
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.extractOne
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:53
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: github.com/wal-g/wal-g.ExtractAll.func2.3
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/gopath/src/github.com/wal-g/wal-g/extract.go:156
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: runtime.goexit
Notice: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: 	/home/travis/.gimme/versions/go1.8.7.linux.amd64/src/runtime/asm_amd64.s:2197
Info: Class[Wal_g::Db_restore]: Unscheduling all events on Class[Wal_g::Db_restore]
STDERR> Error: /Stage[main]/Wal_g::Db_restore/Exec[wal_g::db_restore]/returns: change from 'notrun' to ['0'] failed: '/usr/local/bin/wal-g-restore.sh' returned 1 instead of one of [0]

[UI BUG] Command 'delete' is not supported

~ # wal-g
Please choose a command:
  backup-fetch	fetch a backup from S3
  backup-push	starts and uploads a finished backup to S3
  backup-list	prints available backups
  wal-fetch	fetch a WAL file from S3
  wal-push	upload a WAL file to S3
  delete	clear old backups and WALs
~ # wal-g delete
Command 'delete' is unsupported by WAL-G.

~ #

Looks like there's no case for usage of the delete command:
https://github.com/wal-g/wal-g/blob/master/cmd/wal-g/main.go#L69

Decompress1X panic crash

While fetching a backup (not a wal segment). Commit: 96e46ad

panic: Decompress1X

goroutine 14 [running]:
github.com/katie31/extract.Decompress(0x935060, 0xc42000e088, 0x9354e0, 0xc42029c200)
	/home/mapi/var/go/src/github.com/katie31/extract/lzo.go:96 +0x747
github.com/katie31/extract.ExtractAll.func1.1(0x934fa0, 0xc420015ca0, 0xc42000e088)
	/home/mapi/var/go/src/github.com/katie31/extract/decomex.go:66 +0xbf
created by github.com/katie31/extract.ExtractAll.func1
	/home/mapi/var/go/src/github.com/katie31/extract/decomex.go:68 +0x113
/home/mapi/lib/util.rb:4:in `r': unhandled exception
	from /home/mapi/lib/postgres_installer.rb:104:in `fetchdb'

Postscript:

I think these two branches should be swapped, i.e. an error should be printed before doing the output-length check. Typical Go convention the output of a multi-valued-return that includes an error may still allow an output result value to be returned in an undefined state.

https://github.com/katie31/extract/blob/96e46adc0722e20462be8e0e1ac96bd84f1792b5/lzo.go#L95-L100

PSPS:

I swapped the order and gave things another try:

...
base/13290/2605
base/13290/2610_vm
base/13290/2601_fsm
base/13290/2685
base/13290/2618
base/13290/3164
base/13290/3079
base/13290/13136
base/13290/3079_vm
base/13290/3118_vm
base/13290/1255_vm
panic: EOF

Support backups located in the root of a bucket

It looks like backups are expected to be located in subfolders inside an S3 bucket, but our WAL-E backups are located in the root - it'd be great if this were supported too! Currently, a lot of panic: runtime error: index out of range messages are thrown when trying to hack around this by changing the locations in the configuration.

S3 Object Listing Does Not Paginate

Calls such as GetBackups and GetWals do not paginate across S3's maximum object return list size of 1000. These parts of the code base—especially the GetWals call—should be refactored to call ListObjectsV2Pages rather than ListObjectsV2. The consequence of this today is that deleting old backups may only clear up to the oldest 1000 WAL segments, but none further.

Permission denied for function pg_start_backup

I am trying to get backup-push working but keep hitting a permissions problem.

$ sudo -u postgres /usr/local/bin/wal-g-wrapper backup-push /var/lib/postgresql/9.6/main
BUCKET: mybucket
SERVER: db
2017/09/20 13:14:50 ERROR: permission denied for function pg_start_backup (SQLSTATE 42501)
QueryFile: start backup failed
github.com/wal-g/wal-g.StartBackup
	/home/travis/gopath/src/github.com/wal-g/wal-g/connect.go:36
main.main
	/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:280
runtime.main
	/home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/proc.go:185
runtime.goexit
	/home/travis/.gimme/versions/go1.8.3.linux.amd64/src/runtime/asm_amd64.s:2197

The file wal-g-wrapper sets up the env variables and then calls wal-g:

$ cat /usr/local/bin/wal-g-wrapper
#!/bin/bash
#
# Passes all arguments through to wal-g with correct env variables.

export WALE_S3_PREFIX=s3://mybucket/db

export AWS_ACCESS_KEY_ID=<redacted>
export AWS_SECRET_ACCESS_KEY=<redacted>
export AWS_REGION=eu-west-2

export PGUSER=myuser
export PGPASSWORD=thepassword
export PGDATABASE=mydatabase

/usr/local/bin/wal-g "$@"

The myuser postgresql user/role has permission for the mydatabase database. Initially it didn't have any role attributes so I gave it replication thinking that would solve the permission problem. It didn't.

So then I gave it the superuser attribute and backup-push was able to run successfully.

Should the replication attribute alone be sufficient for backup-push? If so, how can I get it to work?

This is with WAL-G v0.1.2 and postgresql 9.6 on Ubuntu 16.04.

Thanks!

Tablespace Support

Does WAL-G support tablespace backup , It seems WAL-G didn't pick the data from tablespace and while restoration it didn't ask any tablespace details(RESTORE SPEC) like --restore-spec, which WAL-E asks while using backup-fetch command.

And upon starting the restored cluster throws the error of missing tablespace.

Are we missing anything here ..?

Any pointers in this direction will be appreciated.

WAL-G should support Google Cloud Storage

It would be wonderful if WAL-G would support storage to Google Cloud Storage.
For reference, Google Cloud Storage using Go is documented here: https://cloud.google.com/storage/docs/reference/libraries
https://godoc.org/cloud.google.com/go/storage

wal-g should be less smart about AWS credentials

wal-g requires the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in order to prepare an aws.Config that uses them. This allows wal-g to return an error message early if it's not configured, but this approach breaks three other methods of supplying credentials which I actually use:

AWS_PROFILE pointing to something defined in ~/.aws/{config,credentials},
EC2 instances with dynamic credentials from an IAM role, and
Containers with dynamic credentials, both via IAM roles for tasks inside EC2 and iam-docker outside EC2

The AWS SDK for Go enables all of these by default, in addition to allowing configuration via AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Removing the environment check and leaving Credentials unspecified will preserve the current behavior and re-enable all the other ways that typical AWS tools search for credentials.

wal-push failed, but exited with a 0 code?

We got into a nasty situation yesterday when a wal-g wal-push failed, threw some log output (below), and then seems to have exited with a 0 code which allowed Postgres to clean up the failed WAL log.

Host Details:

Ubuntu 14.04
Postgres 9.6.6
WAL-G 0.1.7

archive_command = 'envdir /etc/wal-g /usr/local/bin/wal-g wal-push %p'

Log:

WAL PATH: us1/xxx/wal_005/000000020000D96A00000039.lz4
BUCKET: xxx.com
SERVER: us1/xxxx=
2018/04/04 20:54:21 upload: failed to upload 'pg_xlog/000000020000D96A0000003A': SerializationError: failed to decode S3 XML error response
	status code: 400, request id: 342A75FE7FA0F3D6, host id: 77+fUhrRM9zyLPA/OFYJxqEHOeryiTI4zwVzLOrz7U0LtU4eazY8uw+dLSo2gnocSAnj5Q3Dbng=
caused by: unexpected EOF. Restarting in 1.02 seconds
WAL PATH: us1/xxx/wal_005/000000020000D96A0000003A.lz4
BUCKET: xxx.com
SERVER: us1/xxx

Finally, here's a snapshot of the file as it was uploaded to S3 .. note, its 0 bytes:

$ aws s3 ls s3://xxx.com/us1/xxx/wal_005/000000020000D96A0000003A.lz4
2018-04-04 13:54:23          0 000000020000D96A0000003A.lz4

AWS_REGION should be optional

Right now, the user has to specify WALE_S3_PREFIX and AWS_REGION. However, WALE_S3_PREFIX indicates an S3 bucket, S3 buckets have globally unique names, and the s3:GetBucketLocation API can take a S3 bucket name and return its region. AWS_REGION should therefore be optional, with wal-g able to determine the bucket's region automatically.

Backup folder is not properly deleted during a delete operation

When deleting a backup, the backup's folder is specified as the object key name, which is not an actual object and therefore will not match nor delete any objects.

https://github.com/wal-g/wal-g/blob/master/commands.go#L175

This implementation must be refactored to first query the objects located under that backup folder, and then pass those object identifiers to the delete function.

backup-list doesn't show backups

If backups stored in the root directory of the bucket, the 'backup-list' command shows "No backups found". Replacing backups to subdirectory is solving the problem.

postgres@pg0:~$ aws s3 ls s3://production-backups/
                           PRE basebackups_005/
                           PRE wal_005/
                           PRE walg/
postgres@pg0:~$ aws s3 ls s3://production-backups/walg/
                           PRE basebackups_005/
                           PRE wal_005/
postgres@pg0:~$
postgres@pg0:~$
postgres@pg0:~$ export WALE_S3_PREFIX=s3://production-backups/
postgres@pg0:~$ wal-g backup-list
BUCKET: production-backups
SERVER: 
2018/01/16 13:12:48 No backups found
postgres@pg0:~$
postgres@pg0:~$
postgres@pg0:~$ export WALE_S3_PREFIX=s3://production-backups/walg
postgres@pg0:~$ wal-g backup-list
BUCKET: production-backups
SERVER: walg
name    last_modified   wal_segment_backup_start
base_0000000100000001000000D4   2018-01-16T13:06:28Z    0000000100000001000000D4

Support other cloud providers

We are using wal-g in our project https://github.com/kubedb . We would like to support Google Cloud Store, Azure Blob Store and Openstack Swift as supported backends. This issue is intended to discuss the general design for this.

We have contributed this type of changes to other tools, example: restic: https://github.com/restic/restic/pulls?utf8=%E2%9C%93&q=is%3Apr+diptadas . The general pattern was to extract a Backend interface and then plugin cloud provider specific implementation based on some flag. We are willing to sponsor cloud provider accounts that can be used to run e2e tests via travis.

If this sounds good, this is process I propose:

Refactor the S3 implementation to extract the interface and merge that pr.
Create a separate pr for each of the 3 other providers.

What do you think? @fdr @x4m

cc: @aerokite @diptadas

Support for postgresql 9.5

wal-g/connect.go

Line 34 in a2f3a70

 err := conn.QueryRow("SELECT file_name FROM pg_xlogfile_name_offset(pg_start_backup($1, true, false))", backup).Scan(&name) 

Hi, we would like to test wal-g, but it seem, PG 9.6 is only supported as it uses function pg_start_backup from 9.6 with 3 arguments, where as in 9.5 is only 2:
https://www.postgresql.org/docs/9.5/static/functions-admin.html

Deletion Failures

Seeing errors such as this when deleting a large backlog of WALs (this is the first time I've tested multi-page WAL deletion, i.e. more than 1000 objects):

2017/12/05 22:32:45 Unable to delete WALS before base_000000010000006A00000073MalformedXML: The XML you provided was not well-formed or did not validate against our published schema
	status code: 400, request id: E783694080458AC0, host id: p7AI/ZTMbb9aeRdsouupEC0ziU4w6Gy2H3AK6EWdJbpwo8tFnabJr81OGfbaf1frDUbvdgYYqog=

wal-g crashes when datadir is symlink.

Hi,
wal-g crashes when it's trying to push backup and datadir is symlink.
It's not a critical issue, but it seems to me it's better to return error (or walk to destination dir) instead of crash.
wal-g version 0.1.3

postgres@pg0:~$ ls -l /var/lib/postgresql/9.6/main
lrwxrwxrwx 1 postgres postgres 16 Nov 21 07:19 /var/lib/postgresql/9.6/main -> /data/postgresql
postgres@pg0:~$ wal-g backup-push /var/lib/postgresql/9.6/main
BUCKET: production-backups
SERVER: 
Walking ...
Starting part 1 ...

Finished writing part 1.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x89766b]

goroutine 1 [running]:
github.com/wal-g/wal-g.(*Bundle).HandleSentinel(0xc4201579e0, 0x0, 0x0)
        /home/travis/gopath/src/github.com/wal-g/wal-g/upload.go:281 +0x3b
github.com/wal-g/wal-g.HandleBackupPush(0x7ffe8fd6be04, 0x1c, 0xc42011cfc0, 0xc420188660)
        /home/travis/gopath/src/github.com/wal-g/wal-g/commands.go:555 +0x796
main.main()
        /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:107 +0x6db
postgres@pg0:~$
postgres@pg0:~$
postgres@pg0:~$ wal-g backup-push /data/postgresql/
BUCKET: production-backups
SERVER: 
Walking ...
Starting part 1 ...

PG_VERSION
backup_label.old
base
base/1
....

AWS_SECURITY_TOKEN is not really optional

Readme states AWS_SECURITY_TOKEN is optional, I keep getting:

postgres_1  | 2017/08/18 20:44:54 FATAL: Did not set the following environment variables:
postgres_1  | AWS_SECURITY_TOKEN

archive file has wrong size

Hi, we are currently testing wal-g with a small database that writes an entry every minute. When we try to restore the DB, we sometimes get this error:

2018-02-10 05:48:59.478 EST [25962] FATAL:  archive file "000000010000000000000028" has wrong size: 8388608 instead of 16777216

If we start over (rm -rf the data dir, backup-fetch, then recovery), we sometimes manage to fully restore the db, sometimes we get another similar error:

2018-02-10 06:01:29.671 EST [27419] FATAL:  archive file "000000010000000000000027" has wrong size: 8388608 instead of 16777216

The relevant part of the postgresql.conf:

wal_level = logical
archive_mode = on
archive_command = 'envdir /etc/wal-g.d/env /usr/local/bin/wal-g wal-push %p'
archive_timeout = 60

The recovery.conf file when we try to restore the db on a secondary cluster:

restore_command = 'envdir /etc/wal-g.d/env /usr/local/bin/wal-g wal-fetch "%f" "%p"'

We are not using GPG and we only declare basic environment variables:

ls -l /etc/wal-g.d/env/
total 12
-rwxr-x--- 1 root postgres 13 Feb  9 04:32 AWS_REGION
-rwxr-x--- 1 root postgres 16 Feb  9 04:53 PGHOST
-rwxr-x--- 1 root postgres 39 Feb  9 04:32 WALE_S3_PREFIX

[FEATURE-REQUEST] swift support?

is there roadmap with swift storage support for backups?

Environment Variables for Connection with S3

Hi,

We are using Softlayer S3 storage & which works on Swift API. I have successfully setup WAL-E using S3 storage by defining WALE_S3_PREFIX, SWIFT_AUTH_VERSION, SWIFT_PASSWORD, SWIFT_USER & SWIFT_AUTHURL under /etc/wal.e.d/env.

I have installed the wal-g & moved it to /usr/bin but when I am executing wal-push it is giving an error "FATAL: Did not set the following environment variables:WALE_S3_PREFIX".
I tried setting up this environment variable which I had setup for WAL-E implementation in bashrc file but after that I am getting below error:
goroutine 1 [running]:
github.com/wal-g/wal-g.Configure(0x10, 0xc420156620, 0x0, 0x18)
/home/travis/gopath/src/github.com/wal-g/wal-g/upload.go:77 +0xa6e
main.main()
/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:83 +0x1f6

Request you to please let me know if i missed any step or i need to make some changes in my environment variable for WAL-G to connect with S3 storage on Softlayer.

Postgres10 support

When I try to run wal-g backup-push I get this error:

ERROR:  function pg_xlogfile_name_offset(pg_lsn) does not exist at character 23
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
STATEMENT:  SELECT file_name FROM pg_xlogfile_name_offset(pg_start_backup($1, true, false))

wal-g version: 0.1.2
PostgreSQL version: 10.1

Looks like it's something similar to this issue here: wal-e/wal-e#339

Recreate folder structure during backup-fetch

Seems like this commit is incomplete.
I've encountered problem with absent pg_logical\snapshots and pg_logical\mappings.
Getting errors
[ 2017-12-04 15:29:14.792 MSK ,,,402290,58P01 ]:ERROR: could not open directory "pg_logical/snapshots": No such file or directory
until manual mkdir

There are two possible options - add empty marker files to tar or save empty folders to json sentinel.

Issues with parallel backup-push

I've been testing out the parallel backup-push that is currently in master, and have run into a few breaking scenarios.

First, and foremost, if the backup doesn't crash it never finishes. It gets through the last files, and seems to freeze while waiting for the last few to upload. Example:

...
/pg_xlog
/recovery.done
/server.crt
/server.key
Finished writing part 78.

Secondly, I've run into a couple of crashes, details of which can be found in this gist. I've hit the "Wait Group" exception the vast majority of the time, and the "concurrent map iteration and map write" a handful of times.

I've not yet been able to track down the cause of any of these issues, but I haven't spent a lot of time looking.

create restore checkpoint right before every backup

Hi.

I was wondering if it makes sense to create a checkpoint right before each backup, so in case we're using a backup combined with recovery.conf then we can use recovery_target_name so it doesn't restore beyond the checkpoint for the backup but it's still able to fetch all the wal logs necessary to reach the point of recovery via restore_command.

It should be as easy as calling pg_create_restore_point('base_<wal-id>') once we compose the name of wal log for the backup.

This way when you list all backups with backup-list, you know you can use the name of the backup as recovery_target_name to perform PITR.

Let me know if this makes sense, I can send a PR for it.

Should detect when `archive_command` is not set right..

We are doing some development of new puppet code for managing databases, and in my testing I hadn't yet set archive_command to anything on our dbmaster. Obviously we need to do that .. but if you miss it, and you try to execute a wal-g backup-push, it just hangs near the end:

/pg_subtrans
/pg_tblspc
/pg_twophase
/pg_xlog
/postgresql.auto.conf
/postgresql.conf
Finished writing part 1.
Starting part 2 ...
/global/pg_control
Finished writing part 2.
<hang is here>

Digging into the Postgres logs, we saw:

2018-01-05 08:00:02.232 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,3,"SELECT",2018-01-05 08:00:02 UTC,3/1674,0,LOG,00000,"duration: 124.497 ms  execute <unnamed>: SELECT case when pg_is_in_recovery() then '' else (pg_xlogfile_name_offset(lsn)).file_name end, lsn::text, pg_is_in_recovery() FROM pg_start_backup($1, true, false) lsn","parameters: $1 = '2018-01-05 08:00:02.106987155 +0000 UTC'",,,,,,,,""
2018-01-05 08:01:02.103 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,4,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:02:02.171 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,5,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (120 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:04:02.305 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,6,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (240 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:08:02.570 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,7,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (480 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:16:03.100 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,8,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (960 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:32:04.156 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,9,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (1920 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 09:04:06.275 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,10,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (3840 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 10:08:10.511 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,11,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (7680 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 12:16:18.992 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,12,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (15360 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""

Adding archive_command = "/bin/true" and HUPing postgres solved the issue. However, it seems that wal-g should have some way to detect when its in this hung state and get out of it with a useful error.

Deleting backups leaves sentinel file

Apologies if this is me misunderstanding what is actually correct behaviour, but when I run this:

$ wal-g delete retain FULL 1 --confirm
2018/03/25 11:45:20 base_000000010000000000000008 skipped         
2018/03/25 11:45:20 base_000000010000000000000004 will be deleted

The backup base_000000010000000000000004 correctly disappears from my S3 bucket. However the sentinel file base_000000010000000000000004_backup_stop_sentinel.json remains, meaning that the deleted backup shows up in wal-g backup-list even though it no longer exists:

$ wal-g backup-list    
name    last_modified   wal_segment_backup_start                                                       
base_000000010000000000000004   2018-03-25T11:44:39Z    000000010000000000000004                       
base_000000010000000000000008   2018-03-25T11:45:02Z    000000010000000000000008

Manually deleting the sentinel file stops the actually-deleted backup from showing up.

Is this intended behaviour? I found it a bit misleading, so thought it was worth checking. Thanks!

prefetch failed

Hi,
I've restored a backup and during recovery, WAL segments are copied as usual but I also see messages WAL-prefetch failed: no such file or directory. What is the cause?

recovery.conf

standby_mode = 'on'
restore_command = '. /etc/wal-g/env && wal-g wal-fetch "%f" "%p"'
recovery_target = 'immediate'

backup-push ignores set WALG_S3_STORAGE_CLASS

backup-push ignores set WALG_S3_STORAGE_CLASS and keeps uploading files as STANDARD.
wal-push works as expected.

Refactoring

Hi!
There are few places that I want to refactor, e.g.:

Reorganize main()
https://github.com/wal-g/wal-g/blob/master/cmd/wal-g/main.go#L92
Handle absence of backups in backup-fetch LATEST gracefully, not in panic
https://github.com/wal-g/wal-g/blob/master/backup.go#L94
Rename TarBall.SetSize() to TarBall.AddSize()
https://github.com/wal-g/wal-g/blob/master/structs.go#L84

et c.

How do you think, is it feasible to create such PRs? Will they be reviewed or do they interfere with your plans on product development?

On WAL-PUSH, WAL-G sometimes doesn't get the same log files as WAL-E?

Ok, this is a strange one. I am trying to migrate from WAL-E to WAL-G, but for a number of reasons, we have to maintain both for now. So we are dual-writing backups and WAL logs to two different S3 paths.

To do that, we run two backups each night (one at 8AM UTC, one at 12AM UTC). We also set our archive_command to:

archive_command = '/mnt/postgres-scripts/wale/wale_archive_command.sh "%p" && /usr/local/bin/wal-g.sh wal-push %p'

The behavior we see thats strange is that our wale_archive_command.sh script will run and seemingly get one WAL file passed in, but then upload two WAL files in its execution:

+ WAL=pg_xlog/000000010000CC0A000000D2
+++ dirname /mnt/postgres-scripts/wale/wale_archive_command.sh
++ cd /mnt/postgres-scripts/wale
++ pwd  
+ ENV=/mnt/postgres-scripts/wale/env.sh
+ . /mnt/postgres-scripts/wale/env.sh
++ set +x
AWS Credentials Sourced: /mnt/postgres-scripts/wale/aws.sh
Configured WAL-E Bucket: s3://company.com/us1/company @ us-west-2
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-push".
        STRUCTURED: time=2018-01-22T03:03:12.757110-00 pid=82794
wal_e.worker.upload INFO     MSG: begin archiving a file
        DETAIL: Uploading "pg_xlog/000000010000CC0A000000D2" to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo".
        STRUCTURED: time=2018-01-22T03:03:12.800269-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo prefix=us1/company/ seg=000000010000CC0A000000D2 state=begin
wal_e.worker.upload INFO     MSG: begin archiving a file
        DETAIL: Uploading "pg_xlog/000000010000CC0A000000D3" to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo".
        STRUCTURED: time=2018-01-22T03:03:12.818554-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo prefix=us1/company/ seg=000000010000CC0A000000D3 state=begin
wal_e.worker.upload INFO     MSG: completed archiving to a file
        DETAIL: Archiving to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo" complete at 16356.5KiB/s.
        STRUCTURED: time=2018-01-22T03:03:13.513893-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D2.lzo prefix=us1/company/ rate=16356.5 seg=000000010000CC0A000000D2 state=complete
wal_e.worker.upload INFO     MSG: completed archiving to a file
        DETAIL: Archiving to "s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo" complete at 18033.6KiB/s.
        STRUCTURED: time=2018-01-22T03:03:13.528637-00 pid=82794 action=push-wal key=s3://company.com/us1/company/wal_005/000000010000CC0A000000D3.lzo prefix=us1/company/ rate=18033.6 seg=000000010000CC0A000000D3 state=complete

Meanwhile, our wal-g.sh script will only upload the first file:

BUCKET: company.com
SERVER: us1/company/wal-g    
WAL PATH: us1/company/wal-g/wal_005/000000010000CC0A000000D2.lz4

I don't understand the behavior at all. It feels like the wale_archive_command.sh script is being invoked with two WAL files.. but if it was, the extended output we have should show that in the WAL=... line. However, its pretty clear that the script runs once, and yet wal-e sees two files and uploads them in order.

Meanwhile, the wal-g.sh script seems so simple that I can't imagine I'm doing anything wrong there.

Any thoughts on what could be happening? Here are our scripts, just for your reference:

wale_archive_command.sh

#!/bin/bash -x
WAL=$1
ENV="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/env.sh"
. $ENV || fail "Could not find ${ENV}. Exiting."
wal-e wal-push "$WAL"

wal-g.sh

envdir /etc/wal-g /usr/local/bin/wal-g $*

Some LZOP archives don't work

I'm still taking this apart but it can be related to #22.

I have this backup that has hundreds of archives, but one of those archives causes a systematic crash, whereas lzop seems fine with it. There's not much to do but for me to go through it with a fine-tooth comb, but, FYI.

cc @x4m

runtime error: slice bounds out of range

Hello,

I'm interested in trying out wal-g. Currently I'm trying to restore an existing backup that was created by wal-e and I'm getting an error:

envdir /etc/wal-e.d/reader /usr/local/bin/wal-g backup-fetch /var/lib/postgresql/9.2/main LATEST
BUCKET: database-backups
SERVER: path/to/backup
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
main.main()
	/home/travis/gopath/src/github.com/wal-g/wal-g/cmd/wal-g/main.go:136 +0x1890

These are the environment variables in /etc/wal-e.d/reader/:

ls -1 /etc/wal-e.d/reader/
AWS_ACCESS_KEY_ID
AWS_REGION
AWS_SECRET_ACCESS_KEY
WALE_S3_PREFIX

Any idea what I'm doing wrong?

Thanks!

File encryption

Will wal-g support GPG encryption as wal-e does ? It's a main feature when storing on public clouds sensitive datas.

backup-push to size-limited bucket

@istarling recently have reported the interesting problem. When there is not enough space in bucket WAL-G retries every file many times.
I think that best solution should be to remove existing retry infrastructure in favor of AWS SDK built-in retries.
This may interfere with #74. @tamalsaha , how do you think, if I'll do that in a week or two I won't make a problem for your implementations of #74?

Google Summer of Code 2018 with PostgreSQL

Hi!
If you are a student and wish to contribute to WAL-G I've created GSoC project under the umbrella of PostgreSQL.
Feel free to contact me on the matter.

Also, if you are doing graduation project I have some ideas for you too. And surely, you can combine both graduation project and GSoC project (with different, but related topics).

If this is important, I'm Associated Professor at Ural Federal University, Russia. I've about 8 years of advising student researchers in the field of Computer Science and Software Engineering. I hold Ph.D. in Theoretical Informatics (but projects are going to be 100% practical).

WAL prefetch

@fdr , I'll create an issue here to keep the discussion on implementation details open.

Brief description

I'm working now on WAL-prefetch feature. Postgres asks WAL-fetch command only when it has nothing else to do. This is not very performant design, we can download WALs that probably will be needed when Postgres is busy with a replay of what's in hand already.

Details

In the email conversation, @fdr stated that we should prefetch 8 files ahead of what was asked by Postgres.

I have few more questions:

Should we trigger prefetch after fetch, or start prefetch along with fetch? I propose triggering prefetch just before returning.
Current wal-fetch command before the start is doing the following:
a. Check if there is .lzo from WAL-E
b. Check if there is .lz4 from WAL-G
c. Start downloading whatever is there
I propose changing this behavior towards more optimistic path:
a. Start downloading .lz4
b. If failed - start downloading .lzo
c. If failed - return failure

wal-g / wal-g Goto Github PK

wal-g's People

Contributors

Stargazers

Watchers

Forkers

wal-g's Issues

Versions

Problem

Server 1 Failure

Server 2 Failure

wale_archive_command.sh

wal-g.sh

Brief description

Details

Recommend Projects

Recommend Topics

Recommend Org