
Comments (18)

gconnell commented on May 14, 2024

Hey, yew,

What you're saying sounds 100% accurate. That said, we haven't experienced this as a bottleneck, probably since stenographer doesn't restart much, so an initial startup cost isn't noticed in our environment. Could you define a "non-negligible amount of time"? I do worry that we might hit our watchdog limit of 5 minutes (https://github.com/google/stenographer/blob/master/thread/thread.go#L131), which could cause a crash-loop on start.

Steno does need to know about files before it can serve requests, but maybe this could be more parallelized? Also, does the server currently accept a request, then block until ready? Or does it reject the request (maybe because it hasn't yet opened its port)? We might fix it to do the former if it currently does the latter, so a request would hang but eventually succeed.

--Graeme

from stenographer.

yew011 commented on May 14, 2024

@gconnell thanks for the reply,

it may take up to 15 minutes to finish recovery of a disk with 30K index files,

the watchdog won't time out, since we add 1 min for each new file; see here
https://github.com/google/stenographer/blob/master/thread/thread.go#L135

also, it is not parallel: in env.go we call t.SyncFiles() one thread at a time, see here
https://github.com/google/stenographer/blob/master/env/env.go#L258-L262
and each call of t.SyncFiles() holds the thread mutex, which prevents any
lookup on that thread.

yes, the server will still accept a request, but it blocks until ready, or until
hitting the 15-minute HTTP handling timeout,

Thanks,
Alex Wang,


gconnell commented on May 14, 2024

Thanks for the followup, Alex.

Let me see if I can do some quick parallelization work that might speed things up... stay tuned.


yew011 commented on May 14, 2024

@gconnell also, we hit this pretty badly since we use four spinning disks,
one for storing index files and three for storing packets,

and even if I parallelize this part, https://github.com/google/stenographer/blob/master/env/env.go#L258-L262
we are still limited by the throughput of the one spinning disk...

so, I would also like to know: do you store index files and packet files on the same disk,
or on separate disks?

Thanks~


gconnell commented on May 14, 2024

I've done some work in the 'syncfiles' branch which might help with this... would you be able to pull it down and take a look? See the top three commits here: https://github.com/google/stenographer/commits/syncfiles


yew011 commented on May 14, 2024

@gconnell thanks so much for putting up the improvement; it looks good to me, with two comments:

  1. It seems we no longer Reset the watchdog when opening each file... so if there are a
     ton of index files on one disk, we may still trigger the watchdog timeout. From the
     description of Reset(), I think it is okay to call it from multiple goroutines, so maybe
     we can call Reset() right after passing newFileThrottle <- true...
  2. The following code in Lookup() also needs to be adjusted:

         for _, file := range t.getSortedFiles() {
             files = append(files, t.files[file])
         }

     If Lookup() is called during packet/index file recovery, t.files[file] can be
     nil, so we want to guard it with an if statement.

Also, I'm wondering if we could indicate in the HTTP response sent to stenoread that
packet/index file recovery has not finished yet... e.g. by using the HTTP Warning
header?


gconnell commented on May 14, 2024

Point 1 is easily fixed; I've committed it to the branch.
I'm not sure why 2 is required. The write to t.files in syncFilesWithDisk and trackNewFile is protected by t.mu.Lock(), and Lookup is protected by t.mu.RLock(). Also, I'm not seeing anywhere that t.files[x] is set to nil.

Re: HTTP response, I think the best bet is still to block incoming requests until the first sync is complete, then best-effort any future syncs. I've committed that as well to syncfiles.


yew011 commented on May 14, 2024

@gconnell thanks for adding the watchdog reset,

for my second comment, I was wrong: I somehow confused getSortedFiles() with
listPacketFilesOnDisk() and thought it would search for all files on disk... so, nothing to
worry about there,

as for taking care of the HTTP response, I'm good with that; I'd just like the log level to be
higher,


gconnell commented on May 14, 2024

Sounds good. Could you run some tests on your end to make sure the startup time is in fact sane now? If it is, I'll push this to master and you can send pull requests for any additional small stuff (changing loglevel, etc).


yew011 commented on May 14, 2024

Sure, I'll try that today and post results here ASAP~


yew011 commented on May 14, 2024

@gconnell one more issue,

@@ -260,9 +265,15 @@ func (d *Env) removeOldFiles() {
 }

 func (d *Env) syncFiles() {
+       var wg sync.WaitGroup
        for _, t := range d.threads {
-               t.SyncFiles()
+               wg.Add(1)
+               go func() {
+                       t.SyncFiles()
+                       wg.Done()
+               }()
        }
+       wg.Wait()
 }

we need to do the t := t trick again inside the for loop; otherwise
every goroutine will refer to the last element of d.threads,


gconnell commented on May 14, 2024


yew011 commented on May 14, 2024

@gconnell I still get aborts from the Watchdog for syncing files with disk...
even though I moved the fido.Reset(time.Minute) inside the critical section
guarded by t.mu.Lock()... still trying to figure it out,

I'm still not fluent in golang, and want to confirm whether we need to worry
about any possible race when resetting the watchdog timer,

thanks,


gconnell commented on May 14, 2024


yew011 commented on May 14, 2024

I'll do that once I'm back at my setup,

but my guess: I used 3 7200rpm disks for storing packets and 1 7200rpm disk for
storing the index... when this recovery happens, there are 32 (throttle) * 3 (thread modules)
simultaneous IO reads from the index disk, and even though we Reset(1min) the timer,
requests from the same thread module can still be starved for more than 1 min?...

so, I'm wondering if we can Reset() the timer only after we fully recover each block/index file?
I'll also see if this fixes the aborts,

thanks,


yew011 commented on May 14, 2024

@gconnell sorry for this delayed update,

I tried calling debug.PrintStack() before the log.Fatalf() here,
but it does not show any meaningful trace during the abort, only two lines
like the following:

<omitted>/google/stenographer/base/base.go:452 (0x4af3a8)
<omitted>/go/src/runtime/asm_amd64.s:1721 (0x45f431)

I think this is basically caused by contention on the disk
which stores the index files for all three packet disks...

i saw logs like this:

Nov 30 07:35:07 ev004 stenographer[28060]: 2016/11/30 07:35:07 Blockfile opening: "/tmp/stenographer338850325/PKT4/148037305476175
Nov 30 07:35:08 ev004 stenographer[28060]: 2016/11/30 07:35:07 opening index "/tmp/stenographer338850325/IDX4/1480373054761754"
Nov 30 07:36:48 ev004 stenographer[28060]: 2016/11/30 07:36:48 new blockfile "/tmp/stenographer338850325/PKT4/1480373054761754"

basically indicating a lot of contention... I'm fine with the current implementation;
perhaps I'll customize it a bit more for my specific use,

please feel free to merge into master,

thanks,
alex wang,


gconnell commented on May 14, 2024


yew011 commented on May 14, 2024

@gconnell changing it to 4 helps with the contention and starvation,

also, I found that we must call Reset() inside the mutex t.mu.Lock()...
otherwise, I always got the watchdog timeout triggered even after
syncFilesWithDisk() returns... this is weird; I tried to read the golang
runtime code to understand it but couldn't figure it out... it seemed that
the timer was registered in the runtime heap multiple times...

thanks,

