
Comments (18)

gconnell commented on May 14, 2024

Hey, yew,

What you're saying sounds 100% accurate. That said, we haven't experienced this as a bottleneck, probably since stenographer doesn't restart much, so an initial startup cost isn't noticed in our environment. Could you define a "non-negligible amount of time"? I do worry that we might hit our watchdog limit of 5 minutes (https://github.com/google/stenographer/blob/master/thread/thread.go#L131), which could cause a crash-loop on start.

Steno does need to know about files before it can serve requests, but maybe this could be more parallelized? Also, does the server currently accept a request, then block until ready? Or does it reject the request (maybe because it hasn't yet opened its port)? We might fix it to do the former if it currently does the latter, so a request would hang but eventually succeed.

--Graeme

from stenographer.

yew011 commented on May 14, 2024

@gconnell thanks for the reply,

it may take up to 15 minutes to finish recovery of a disk with 30K index files,

the watchdog won't time out, since we add 1 min for each new file; see here
https://github.com/google/stenographer/blob/master/thread/thread.go#L135

also, it is not parallel: in env.go we call t.SyncFiles() one thread at a time, see here
https://github.com/google/stenographer/blob/master/env/env.go#L258-L262
and each call of t.SyncFiles() holds the thread mutex, which prevents any
lookup on that thread.

yes, the server will still accept a request, but it blocks until ready, or until
hitting the 15-minute HTTP handling timeout,

Thanks,
Alex Wang,


gconnell commented on May 14, 2024

Thanks for the followup, Alex.

Let me see if I can do some quick parallelization work that might speed things up... stay tuned.


yew011 commented on May 14, 2024

@gconnell also, we hit this pretty badly since we use four spinning disks,
one for storing index files and three for storing packets,

and even if I parallelize this part, https://github.com/google/stenographer/blob/master/env/env.go#L258-L262
we are still limited by the throughput of the one spinning disk...

so, I would also like to know: do you store index files and packet files on the same disk,
or on separate disks?

Thanks~


gconnell commented on May 14, 2024

I've done some work in the 'syncfiles' branch which might help with this... would you be able to pull it down and take a look? See the top three commits here: https://github.com/google/stenographer/commits/syncfiles


yew011 commented on May 14, 2024

@gconnell thanks so much for putting up the improvement; it looks good to me, with two comments:

  1. It seems we no longer Reset the watchdog when opening each file... so if there are a
     ton of index files on one disk, we may still trigger the watchdog timeout. From the
     description of Reset(), I think it is okay to call it from multiple goroutines, so maybe
     we can call Reset() right after passing newFileThrottle <- true...
  2. The following code in Lookup() also needs to be adjusted:

         for _, file := range t.getSortedFiles() {
             files = append(files, t.files[file])
         }

     If Lookup() is called during packet/index file recovery, t.files[file] can be
     nil, so we want to guard it with an if statement.

Also, I'm wondering if we could indicate in the HTTP response sent to stenoread that
packet/index file recovery has not finished yet... e.g. by using the HTTP Warning
header?


gconnell commented on May 14, 2024

Point 1 is easily fixed; I've committed it to the branch.
I'm not sure why 2 is required. The write to t.files in syncFilesWithDisk and trackNewFile is protected by t.mu.Lock(), and Lookup is protected by t.mu.RLock(). Also, I'm not seeing anywhere that t.files[x] is set to nil.

Re: HTTP response, I think the best bet is still to block incoming requests until the first sync is complete, then best-effort any future syncs. I've committed that as well to syncfiles.


yew011 commented on May 14, 2024

@gconnell thanks for adding the watchdog reset,

for my second comment, I was wrong: I somehow confused getSortedFiles() with
listPacketFilesOnDisk() and thought it would search for all files on disk... so, nothing to
worry about there,

as for taking care of the HTTP response, I'm good with that; I'd just like the log level to be
higher,


gconnell commented on May 14, 2024

Sounds good. Could you run some tests on your end to make sure the startup time is in fact sane now? If it is, I'll push this to master and you can send pull requests for any additional small stuff (changing loglevel, etc).


yew011 commented on May 14, 2024

Sure, I'll try that today and post results here ASAP~


yew011 commented on May 14, 2024

@gconnell one more issue,

@@ -260,9 +265,15 @@ func (d *Env) removeOldFiles() {
 }

 func (d *Env) syncFiles() {
+       var wg sync.WaitGroup
        for _, t := range d.threads {
-               t.SyncFiles()
+               wg.Add(1)
+               go func() {
+                       t.SyncFiles()
+                       wg.Done()
+               }()
        }
+       wg.Wait()
 }

we need to do the t := t trick again inside the for loop; otherwise
every goroutine will refer to the last element of d.threads,


gconnell commented on May 14, 2024


yew011 commented on May 14, 2024

@gconnell I still get aborts from the Watchdog for syncing files with disk...
even though I moved the fido.Reset(time.Minute) inside the critical section
guarded by t.mu.Lock()... still trying to figure it out,

I'm still not fluent in golang, and want to confirm whether we need to worry
about any possible race when resetting the watchdog timer,

thanks,


gconnell commented on May 14, 2024


yew011 commented on May 14, 2024

I'll do that once I'm back at my setup,

but my guess: I used 3 7200rpm disks for storing packets and 1 7200rpm disk for
storing the index... when this recovery happens, there are 32 (throttle) * 3 (thread modules)
simultaneous IO reads from the index disk, and even though we Reset(1min) the timer,
requests from the same thread module can still be starved for more than 1 min?...

so, I'm wondering if we can Reset() the timer only after we fully recover each block/index file?
I'll also see if this fixes the aborts,

thanks,


yew011 commented on May 14, 2024

@gconnell sorry for this delayed update,

I tried calling debug.PrintStack() before the log.Fatalf() here,
but it does not show any meaningful trace during the abort, only two lines
like the following:

<omitted>/google/stenographer/base/base.go:452 (0x4af3a8)
<omitted>/go/src/runtime/asm_amd64.s:1721 (0x45f431)

I think this is basically caused by contention on the disk
which stores the index files for all three packet disks...

i saw logs like this:

Nov 30 07:35:07 ev004 stenographer[28060]: 2016/11/30 07:35:07 Blockfile opening: "/tmp/stenographer338850325/PKT4/148037305476175
Nov 30 07:35:08 ev004 stenographer[28060]: 2016/11/30 07:35:07 opening index "/tmp/stenographer338850325/IDX4/1480373054761754"
Nov 30 07:36:48 ev004 stenographer[28060]: 2016/11/30 07:36:48 new blockfile "/tmp/stenographer338850325/PKT4/1480373054761754"

basically indicating a lot of contention... I'm fine with the current implementation;
perhaps I'll customize it a bit more for my specific use,

please feel free to merge into master,

thanks,
alex wang,


gconnell commented on May 14, 2024


yew011 commented on May 14, 2024

@gconnell changing it to 4 helps with the contention and starvation,

also, I found that we must call Reset() inside the mutex t.mu.Lock()...
otherwise, I always got the watchdog timeout triggered even after
syncFilesWithDisk() returns... this is weird; I tried to read the golang
runtime code to understand it but couldn't figure it out... it seemed that
the timer was registered in the runtime heap multiple times...

thanks,

