Giter Club home page Giter Club logo

Comments (9)

aknuds1 avatar aknuds1 commented on May 29, 2024 2

@seizethedave what I'm wondering upon reading the code is whether what we need to do is to modify Ingester.LabelValues such that it only closes the querier (returned from db.Querier) after LabelValuesResponse has been marshaled. I believe tsdb.Block.Close waits until all pending readers are done, and closing the block index reader decrements the pending reader count. Not sure how we can eventually customize the LabelValuesResponse marshaling, to take care of the closing. WDYT?

from mimir.

seizethedave avatar seizethedave commented on May 29, 2024

per the backtrace:

  • the byte slice {0xc137890000, 0x2d6a64, 0x2d6a64} is heap allocated and is ~2.83 MiB.
  • the sigsegv signal indicates
    • address= 0x7976968b824e: yes, this is definitely a wild memory address in the ~120 TiB neighborhood.
    • code=0x1: the address isn't in the mapped space for this process.
  • is that giant address in the source or destination of the memmove call? seems more likely that it's the slice of source strings. Is there something fancy going on in there?
  • Ah. The values are backed by the index file which is mmapped. The strings' byte slices are pointing to memory mapped regions. Note the comment in there:
    // It is not safe to use the return value beyond the lifetime of the byte slice
    // passed into the Reader.
    
  • There's nothing preventing the file from being unmapped while the gRPC/proto machinery marshals the string slice to the caller. This is probably the bug. There are other RPCs in Ingester that would suffer from this same thing. LabelValuesCardinality, etc.

from mimir.

aknuds1 avatar aknuds1 commented on May 29, 2024

Which Mimir version is this?

from mimir.

aknuds1 avatar aknuds1 commented on May 29, 2024

@dimitarvdimitrov this might be of interest to you. I think we discussed whether the use of yoloString would be safe in index.Reader.LabelValues?

from mimir.

seizethedave avatar seizethedave commented on May 29, 2024

Yeah, that is the bug. In one of the crashes, there's a goroutine running ingester's Compact() loop, which calls deleteBlocks, which closes compacted blocks, which had just called munmap before this process crashed.

2024-04-09 06:51:51.918	goroutine 3773562 [runnable]:
2024-04-09 06:51:51.918	syscall.Syscall6(0xc092f90ac0?, 0x11?, 0x470100?, 0x0?, 0x0?, 0x7975a8d5bbc8?, 0xa?)
2024-04-09 06:51:51.919		$GOROOT/src/syscall/syscall_linux.go:91 +0x30 fp=0xc0014dedd8 sp=0xc0014ded50 pc=0x4de230
2024-04-09 06:51:51.919	syscall.openat(0x797712069a38?, {0xc092f90ac0?, 0xc000098400?}, 0xc0953b4060?, 0x0)
2024-04-09 06:51:51.919		$GOROOT/src/syscall/zsyscall_linux_amd64.go:83 +0x8d fp=0xc0014dee50 sp=0xc0014dedd8 pc=0x4da9ad
2024-04-09 06:51:51.919	syscall.Open(...)
2024-04-09 06:51:51.919		$GOROOT/src/syscall/syscall_linux.go:272
2024-04-09 06:51:51.919	os.open({0xc092f90ac0?, 0x2f41bc0?}, 0xc0b8ce0001?, 0x14deef8?)
2024-04-09 06:51:51.919		$GOROOT/src/os/file_open_unix.go:15 +0x2b fp=0xc0014dee88 sp=0xc0014dee50 pc=0x507f0b
2024-04-09 06:51:51.919	os.openFileNolog({0xc092f90ac0, 0x11}, 0x0, 0x0)
2024-04-09 06:51:51.919		$GOROOT/src/os/file_unix.go:272 +0x8a fp=0xc0014deed0 sp=0xc0014dee88 pc=0x50938a
2024-04-09 06:51:51.919	os.OpenFile({0xc092f90ac0, 0x11}, 0x0, 0x5384ae0?)
2024-04-09 06:51:51.919		$GOROOT/src/os/file.go:334 +0x3e fp=0xc0014def08 sp=0xc0014deed0 pc=0x5074be
2024-04-09 06:51:51.919	os.Open(...)
2024-04-09 06:51:51.919		$GOROOT/src/os/file.go:314
2024-04-09 06:51:51.919	os.removeAll({0xc092f90ac0, 0x3d})
2024-04-09 06:51:51.919		$GOROOT/src/os/removeall_at.go:38 +0x19a fp=0xc0014defc0 sp=0xc0014def08 pc=0x50afda
2024-04-09 06:51:51.919	os.RemoveAll(...)
2024-04-09 06:51:51.919		$GOROOT/src/os/path.go:67
2024-04-09 06:51:51.919	github.com/prometheus/prometheus/tsdb.(*DB).deleteBlocks(0xc000933900, 0xc0014df448?)
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/prometheus/prometheus/tsdb/db.go:1752 +0x448 fp=0xc0014df188 sp=0xc0014defc0 pc=0x1acd828
2024-04-09 06:51:51.919	github.com/prometheus/prometheus/tsdb.(*DB).reloadBlocks(0xc000933900)
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/prometheus/prometheus/tsdb/db.go:1596 +0xaa5 fp=0xc0014dfb48 sp=0xc0014df188 pc=0x1acbea5
2024-04-09 06:51:51.919	github.com/prometheus/prometheus/tsdb.(*DB).compactOOOHead(0xc000933900, {0x3a6c568?, 0x540c3e0?})
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/prometheus/prometheus/tsdb/db.go:1321 +0xf3 fp=0xc0014dfc98 sp=0xc0014dfb48 pc=0x1ac9993
2024-04-09 06:51:51.919	github.com/prometheus/prometheus/tsdb.(*DB).Compact(0xc000933900, {0x3a6c568, 0x540c3e0})
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/prometheus/prometheus/tsdb/db.go:1277 +0x68d fp=0xc0014dfdf8 sp=0xc0014dfc98 pc=0x1ac904d
2024-04-09 06:51:51.919	github.com/grafana/mimir/pkg/ingester.(*userTSDB).Compact(...)
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/grafana/mimir/pkg/ingester/user_tsdb.go:181
2024-04-09 06:51:51.919	github.com/grafana/mimir/pkg/ingester.(*Ingester).compactBlocks.func1({0xc000e12a30?, 0xc001475001?}, {0xc00143117b, 0x6})
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/grafana/mimir/pkg/ingester/ingester.go:2919 +0x44a fp=0xc0014dfee8 sp=0xc0014dfdf8 pc=0x2211fea
2024-04-09 06:51:51.919	github.com/grafana/dskit/concurrency.ForEachUser.func1()
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/grafana/dskit/concurrency/runner.go:45 +0x132 fp=0xc0014dffe0 sp=0xc0014dfee8 pc=0x1856192
2024-04-09 06:51:51.919	runtime.goexit()
2024-04-09 06:51:51.919		$GOROOT/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0014dffe8 sp=0xc0014dffe0 pc=0x471f61
2024-04-09 06:51:51.919	created by github.com/grafana/dskit/concurrency.ForEachUser in goroutine 3814
2024-04-09 06:51:51.919		/drone/src/vendor/github.com/grafana/dskit/concurrency/runner.go:36 +0x126

general solutions:

  1. offer variants of the methods in tsdb's index.Reader that return copied strings.
  2. add the notion of a handle that callers can obtain from the index. This would also probably require us to write a custom gRPC-go codec, because we'd need to release the handle after Marshal is called. And there'd be the complexity of making ingester+tsdb block have a "closing" state where munmap doesn't happen until all handles are closed. (And new RPCs don't see "closing" blocks.)
  3. others?

from mimir.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.