Comments (18)
Hey, yew,
What you're saying sounds 100% accurate. That said, we haven't experienced this as a bottleneck, probably since stenographer doesn't restart much, so an initial startup cost isn't noticed in our environment. Could you define a "non-negligible amount of time"? I do worry that we might hit our watchdog limit of 5 minutes (https://github.com/google/stenographer/blob/master/thread/thread.go#L131), which could cause a crash-loop on start.
Steno does need to know about files before it can serve requests, but maybe this could be more parallelized? Also, does the server currently accept a request, then block until ready? Or does it reject the request (maybe because it hasn't yet opened its port)? We might fix it to do the former if it currently does the latter, so a request would hang but eventually succeed.
--Graeme
from stenographer.
@gconnell thx for the reply,
it may take up to 15 minutes to finish recovery of a disk with 30K index files,
the watchdog won't timeout, since we add 1 min for each new file, see here
https://github.com/google/stenographer/blob/master/thread/thread.go#L135
also, it is not parallel, since in the env.go
, we call it one by one, see here
https://github.com/google/stenographer/blob/master/env/env.go#L258-L262
and in each call of t.SyncFiles()
the thread mutex is held which prevent any
lookup on that thread.
yes, the server will still accept a request, but block until ready/or reaching the
15 minutes http handling timeout,
Thx,
Alex Wang,
from stenographer.
Thanks for the followup, Alex.
Let me see if I can do some quick parallelization work that might speed things up... stay tuned.
from stenographer.
@gconnell also, we hit this pretty badly since we use four spin disks,
one for storing indexfiles and three for storing packets,... @@
and even though i parallelize this part, https://github.com/google/stenographer/blob/master/env/env.go#L258-L262
we are still limited by the throughput of one spin disk...
so, would also like to know if you store indexfile and packetfile on the same disk,
or you use separate disks?
Thanks~
from stenographer.
I've done some work in the 'syncfiles' branch which might help with this... would you be able to pull it down and take a look? See the top three commits here: https://github.com/google/stenographer/commits/syncfiles
from stenographer.
@gconnell thx so much for putting up the improvement, looks good to me, two comments:
- seems that we do not
Reset
watchdog for opening each file anymore... so, if there are
ton of index files on one disk, we may still trigger the watchdog timeout... from the
description ofReset()
, i think it is okay to call it from multiple go routines, so maybe
we can callReset()
after passingnewFileThrottle <- true
... - the following code in the
Lookup()
also needs to be adjusted:
for _, file := range t.getSortedFiles() {
files = append(files, t.files[file])
}
if Lookup()
is called during the packet/index file recovery, the
t.files[file]
can be nil
, so we want to guard it with if
statement.
Also, i'm thinking if we could also try indicating the packet/index file recovery
not finishing in the HTTP response sent to stenoread
... e.g. like using the
HTTP Warning field?
from stenographer.
Point 1 is easily fixed, I've committed it to the branch.
M'm not sure why 2 is required? The write to t.files
in syncFilesWithDisk
and trackNewFile
is protected with a t.mu.Lock()
, and Lookup is protected with a t.mu.RLock()
. Also, I'm not seeing anywhere that t.files[x]
is set to nil.
Re: HTTP response, I think the best bet is still to block incoming requests until the first sync is complete, then best-effort any future syncs. I've committed that as well to syncfiles.
from stenographer.
@gconnell thx for adding the watchdog reset,
for my second comment, i was wrong, i somehow confused the getSortedFiles()
with
listPacketFilesOnDisk()
and thought it will search for all files on disk... so, nothing to
worry about, there,
the for taking care of the HTTP response, I'm good with that, just like the log level to be
higher,
from stenographer.
Sounds good. Could you run some tests on your end to make sure the startup time is in fact sane now? If it is, I'll push this to master and you can send pull requests for any additional small stuff (changing loglevel, etc).
from stenographer.
Sure, I'll try that today,~ and post it here asap~
from stenographer.
@gconnell one more issue,
@@ -260,9 +265,15 @@ func (d *Env) removeOldFiles() {
}
func (d *Env) syncFiles() {
+ var wg sync.WaitGroup
for _, t := range d.threads {
- t.SyncFiles()
+ wg.Add(1)
+ go func() {
+ t.SyncFiles()
+ wg.Done()
+ }()
}
+ wg.Wait()
}
we need to do the t := t
trick again, inside the for
loop otherwise
it will all refer to the last element in d.threads
,
from stenographer.
from stenographer.
@gconnell i still get abortion from the Watchdog for syncing files with disk
...
even though i move the fido.Reset(time.Minute)
inside the critical section
of t.mu.Lock()
... still trying to figure it out,
still not competent with golang, and want to confirm if we need to worry
about any possible race with resetting the watchdog timer,
thanks,
from stenographer.
from stenographer.
i'll do that once back to my setup,
but guessing that i used 3 7200rpm disks for storing pkts and 1 7200rpm disk for
storing index... and when this recovery happens, there will be 32 (throttle) * 3 (thread module)
simultaneous IO read from the index disk, and when though we Reset(1min)
the timer,
requests from the same thread module... can still be starved for more than 1min?...
so, i'm thinking if we can Reset()
timer after we fully recover the block/index file?
i'll also see if this can fix the abortion,
thanks,
from stenographer.
@gconnell sorry for this delayed update,
i tried to debug.PrintStack()
before the log.Fatalf()
here,
but it does not really show any meaningful try during the abortion, only two lines
like below:
<omitted>/google/stenographer/base/base.go:452 (0x4af3a8)
<omitted>/go/src/runtime/asm_amd64.s:1721 (0x45f431)
i think this is basically caused by the contention of accessing the disk
which stores index files for three packet disks...
i saw logs like this:
Nov 30 07:35:07 ev004 stenographer[28060]: 2016/11/30 07:35:07 Blockfile opening: "/tmp/stenographer338850325/PKT4/148037305476175
Nov 30 07:35:08 ev004 stenographer[28060]: 2016/11/30 07:35:07 opening index "/tmp/stenographer338850325/IDX4/1480373054761754"
Nov 30 07:36:48 ev004 stenographer[28060]: 2016/11/30 07:36:48 new blockfile "/tmp/stenographer338850325/PKT4/1480373054761754"
basically indicating a lot of contentions,... i'm fine with the current implementation
perhaps i'll customize it a bit more for my specific use,
please feel free to merge into master,
thanks,
alex wang,
from stenographer.
from stenographer.
@gconnell changing to 4
is help with the contention and starvation,
also, i found that we must call the Reset()
inside the mutex t.mu.Lock()
...
otherwise, i always got timeout triggered on watchdog even after the
syncFilesWithDisk()
returns... this is so weird, i tried to check the golang
runtime code to understand but couldn't figure it out,... seemed that
the timer was registered to the runtime heap multiple times...
thanks,
from stenographer.
Related Issues (20)
- Ubuntu 18.04 Seccomp Failures HOT 11
- gRPC support for stenoread HOT 7
- Feature Request: Utility to dump offline stenographer sensor data to pcap HOT 4
- [Ubuntu] Stenotype keeps crashing HOT 9
- Prometheus metrics support
- Stenographer will not start on Ubuntu 16.04LTS HOT 1
- Temporary directory created before dropping privileges
- Wishlist: Community ID indexed flow extraction
- VLAN tags HOT 3
- multiple threads writing to same disk causes issues with disk cleanup HOT 10
- No such device exception seen while running install.sh on Ubuntu 18.04 HOT 1
- Export the packet through the stenoread and it will automatically break in 15 minutes HOT 1
- rpm created stenographer doesn't run HOT 3
- Notifying about index completion HOT 1
- Decapsulate ERSPAN
- Watchdog failure stenotype abort after 2 minutes (Debian 10) HOT 15
- BlockFile.AllPackets() fail when using non-standard blocksize
- Not able to set stenotype packet directory HOT 1
- Non-pretty results from output file open failure HOT 3
- PCAP-over-IP client in stenotype
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from stenographer.