fishrock123 / bob
🏗 binary data "streams+" via data producers, data consumers, and pull flow.
License: MIT License
The current status is:
Moving sinks / sources to npm modules:
Next up:
Previous status - 30/3/2018 (2018 week 13): #9
Due to potential conflicts with iterator#next(), perhaps this part of the protocol should be renamed. Any thoughts? Maybe something like give()?
I'm really not convinced the binding APIs are very good. It might be better to have a helper function like streamline(a, b, c) - ideally so that you can do streamline(streamline(a, b, c), d) too.
Gonna try to work on a PoC module this week...
Binding is ugly and I bet it is the least comprehensible part right now. Ideally I guess it would look something like Stream(source, transform, sink).
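The helper idea above could be sketched roughly like this. Purely hypothetical: only the streamline name and the bindSource-style binding come from this discussion; the stand-in components are illustrative.

```javascript
// Hypothetical: one way streamline(a, b, c) could bind a chain of
// source -> transforms -> sink, returning the tail component so that
// streamline(streamline(a, b, c), d) also composes.
function streamline (...components) {
  // Each component binds to the one upstream of it; bindSource returns
  // the component itself, so the reduction yields the tail of the chain.
  return components.reduce((upstream, component) =>
    component.bindSource(upstream))
}

// Minimal stand-in components, just enough to show the binding shape.
const component = name => ({
  name,
  bindSource (source) { this.source = source; return this }
})

const [a, b, c, d] = ['a', 'b', 'c', 'd'].map(component)
const tail = streamline(streamline(a, b, c), d)
console.log(tail.name, tail.source.name) // d c
```

Because the helper returns the bound tail, the result of one streamline() call is itself bindable, which is what makes the nested form work.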
Completed moving sinks / sources to npm modules:
Next up:
Previous status - 1/6/2018 (Berlin Collab Summit): #11
I think it would be useful to soon have a voice meeting to discuss various unresolved conversations.
Things that should be talked about:
ccing everyone who has shown interest so far:
@jasnell, @Raynos, @mcollina, @mafintosh, @benjamingr
Here is my presentation I gave at the collab summit at Node+JS Interactive 2018: https://fishrock123.github.io/nodejs-collab-summit-2018
Ok so, a lot of stuff has happened since the last update (15/11/2018 - November).
An unfortunately incomplete list of actions since then:
- Stream() composition (#31)
- start(cb) required for Sinks (#32)
- AssertionSink & AssertionSource in bob-streams (#33)
- crc-transform as proof for an internal live coding demo
- offset to pull() (#23)

I think Stream() would be a good place for this, what does everyone else think?
We should also be sure to handle cases such as nodejs/node#28194
@benjamingr Mentioned Deno was discussing their approach to streams. We should check it out.
So I've been realizing a couple related things...
A big one is that I think protocol enforcement needs to be done via classes / helpers somehow, but at the same time I don't think it should be 100% necessary. (Heck it isn't even in streams3)
A more formal specification than just the reference would be useful, in the same way it would have been useful to have that kind of thing for streams3.
This should also allow the project, at the spec and "core classes" level, to be moved into the node org without having to drag every sub-module in.
Completed sink / source / "duplex" npm modules:
Next up:
Previous status - Progress 5/10/2018 - October: #13
The last update was fairly large (23/07/2019 - July, #40), but this one is much smaller.
Notably, I'm no longer employed and paid to continue this kind of work around Node, and I don't really do this as my hobby or have much default motivation to continue.
I did merge a pull request that adds WritableSource and ReadableSink for streams3 interoperability: 57a78d1
I am supposed to present this initiative's status again at the Montreal collab summit. I am not quite sure what will come out of that and it may end up being more of a post-mortem.
Moving out from #23 (comment)
It seems that newer network protocols like QUIC desire multiple chunks of data to be in-flight at once (besides considering re-sending).
This probably violates these two core design ideas:
- One-to-one: The protocol assumes a one-to-one relationship between producer and consumer.
- In-line errors and EOF: Errors, data, and EOF ("end") should flow through the same call path.
It may also unleash zalgo? lol.
Anyways, I think it is possible to still keep things simple and "pretend" that things are multiplexed, by doing slightly more waiting at the network sink end. I'm not really sure that perf would be considerably impacted in most cases?
Edit: See #30 (comment) for updated thoughts.
From my collab summit presentation (#16), here are the open questions I presented at the end:
Not really super into moving stuff into an official repo atm, but once some more of the status items have been completed (see #2) we may want to, so as to get some more traction, as forming a "team" may help there.
idk if that would mean actually moving the repo or just its contents
this module implements a stop method:
https://github.com/Fishrock123/bob/blob/master/reference-extension-stop.js
but it doesn't notify the source in any way. If the source is using a heavy resource (e.g. a file descriptor), the source needs to know the sink has stopped so that it can close the resource.
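A minimal sketch of what a source-notifying stop() might look like. All class and method names here are illustrative, not the actual bob API; the point is only the propagation.

```javascript
// Sketch: a stop() that propagates to the source, so a source holding a
// heavy resource (e.g. a file descriptor) knows it can release it.
class FdSource {
  constructor () { this.closed = false }
  bindSink (sink) { this.sink = sink; return this }
  stop () {
    // Release the underlying resource once the sink stops pulling.
    this.closed = true
  }
}

class StoppableSink {
  bindSource (source) {
    this.source = source.bindSink(this)
    return this
  }
  stop () {
    // Notify the source instead of silently abandoning it.
    if (this.source && typeof this.source.stop === 'function') {
      this.source.stop()
    }
  }
}

const source = new FdSource()
const sink = new StoppableSink().bindSource(source)
sink.stop()
console.log(source.closed) // true
```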
For errors, any negative value should be considered an error, with multiple negative values permitted. That would obviously mean not using an enum for status, and instead using constants... e.g.
#define BOB_ERR_{whatever} -{some integer}
#define BOB_ERR_END 0
#define BOB_ERR_CONTINUE 1
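The same convention translated to JavaScript might look like this. The constant names mirror the #defines above; the concrete negative value is an illustrative placeholder.

```javascript
// "Any negative value is an error" as plain constants, not an enum.
const BOB_ERR_END = 0
const BOB_ERR_CONTINUE = 1
const BOB_ERR_UNKNOWN = -1 // illustrative; multiple negative codes permitted

function isError (status) {
  return status < 0
}

console.log(isError(BOB_ERR_UNKNOWN)) // true
console.log(isError(BOB_ERR_END), isError(BOB_ERR_CONTINUE)) // false false
```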
See #35 (review)
The current status is:
Next up:
Previous status - 15/1/2018 (2018 week 3): #2
The current status is:
I'm thinking of taking a swing at doing a C++ fs after the js-only zlib transform is working.
I'm excited to see this development as I am a heavy user of the pull-stream ecosystem for ETL processing. This approach feels and reads extremely similar, but with obvious gains to be made by making it natively supported by node. Do these two efforts align (or differ) in any way? Is bob expected to support existing pull-stream patterns so as to benefit the variety of libraries already available on npm? Could it?
For reference: https://github.com/pull-stream/pull-stream
Heya @mcollina & @mafintosh - I think this is the next required step here, since we'd need this kinda thing to be able to do anything in node core anyways.
I've created a repo for this at https://github.com/Fishrock123/bob-streams3 and invited you both as collaborators. It contains some basic bits but no functioning code.
There are a good bit of docs and examples lying around here & in the linked-to module repos, please holler if you need any of my help.
Slides at: https://fishrock123.github.io/nodejs-collab-summit-montreal-2019/#0
Issue for the summit was https://github.com/openjs-foundation/summit/issues?q=is%3Aissue+is%3Aclosed
So, I finally profiled this on my linux box (macOS is useless because of ___channel_get_opt, good luck).
I have documented the results so far in performance.md. I only really tried doing a very large file and have not yet made cases that make many small streams.
The results are looking good. The HDD is the limiting factor of my linux system, and the profiles show file copying has ~7x less CPU time in JS, and zlib transform has ~33% less CPU time in JS. 🔥 (C++ time does not seem significantly affected for either case.)
It could expose StdioSink(fd) with shortcuts to fd 1/2 via StdoutSink and StderrSink.
Might be better done outside of this repo.
I have added a section to the readme in this repo about "extensions": API Extension Reference
So far this has seemed the best way to deal with possible optional additional APIs, such as an explicit start, or a stop for handling timeouts.
Essentially an extension of #53 with a different focus.
There are a number of reasons (#53, #52, #30, etc) for why some kind of state management that does not need to be reimplemented by every stream would be useful.
There are two primary ways to deal with this (that I can think of):
Of course, if we inherit from a class, the obvious thing to do would be to integrate the verify transform into said class, so that guarantees are, at the absolute minimum, upheld not by convention but rather by code.
Completed sink / source / "duplex" npm modules:
Next up:
Previous status - Progress 23/7/2018 - July: #12
Some kind of construct flow would be very useful for a couple of significant reasons:
Arguments for doing it all inline could be getting pretty long (and very variable), not even counting the first point:
pull(status_type, error, buffer, size, offset), maybe?
Idk, maybe separating this all out into multiple flows would be better, similar to Streams3 but just sans the dreaded EventEmitter.
- construct(...) -> ready(...)
- pull(...) -> give(...)
- destroy(error) -> destroy(error)
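A rough sketch of a source implementing those three separate flows. Only the method names come from the list above; the argument shapes, the stub sink, and the data returned are assumptions.

```javascript
// Sketch: construct/ready for setup, pull/give for data, destroy/destroy
// for teardown - three call paths instead of one overloaded pull().
class SketchSource {
  bindSink (sink) { this.sink = sink; return this }

  construct (options) {
    // Allocation hints, opening resources, etc. would happen here,
    // and readiness is signaled out-of-band from the data path.
    this.sink.ready(null)
  }

  pull (error, buffer) {
    // Data flows back on the separate give() path.
    this.sink.give(null, 'hello', 5)
  }

  destroy (error) {
    // Teardown mirrors destroy back to the sink.
    this.sink.destroy(error)
  }
}

const events = []
const sink = {
  ready: () => events.push('ready'),
  give: (err, data, size) => events.push('give:' + size),
  destroy: () => events.push('destroy')
}
const source = new SketchSource().bindSink(sink)
source.construct({})
source.pull(null, null)
source.destroy(null)
console.log(events.join(',')) // ready,give:5,destroy
```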
Very related to nodejs/node#29314
Currently, the bob sink calls this.source.pull() and then the source calls this.sink.next(). However, if the source calls back synchronously, and the sink calls pull again synchronously, and the stream moves enough data, this could cause a stack overflow.
I worked around this in pull-stream with this (ugly) code:
https://github.com/pull-stream/pull-stream/blob/master/sinks/drain.js#L12-L37
Basically, it checks if its next was called sync (i.e. if the last call to pull hasn't returned yet) and if so, falls out to a loop that calls pull again. If pull() returns before next is called, then the source is async, so exit the loop. This is the most complicated part of pull-streams; bob streams will need a thing like this too. You can use setImmediate too, but that's actually more overhead than the loop, and the loop means that a completely sync stream can stay completely sync.
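The guard described above, sketched in a bob-style sink. Class and status names are illustrative, not the real implementation; the point is the sync-detection loop that keeps stack depth constant.

```javascript
// If next() is called synchronously while pull() is still on the stack,
// flag a loop for another pull instead of recursing.
class DrainSink {
  constructor () {
    this.chunks = []
    this.looping = false
    this.more = false
  }
  bindSource (source) {
    this.source = source.bindSink(this)
    return this
  }
  start () {
    this.requestPull()
  }
  requestPull () {
    if (this.looping) {
      // next() came back sync: ask the loop below for another pull.
      this.more = true
      return
    }
    this.looping = true
    this.more = true
    while (this.more) {
      this.more = false
      this.source.pull(null)
    }
    this.looping = false
  }
  next (status, error, data) {
    if (status === 'end') return
    this.chunks.push(data)
    this.requestPull() // async case re-enters the loop; sync case just flags it
  }
}

// A fully synchronous source: it calls next() before pull() returns.
class CountingSource {
  constructor (n) { this.n = n; this.i = 0 }
  bindSink (sink) { this.sink = sink; return this }
  pull (error) {
    if (this.i < this.n) this.sink.next('continue', null, this.i++)
    else this.sink.next('end', null, null)
  }
}

const drain = new DrainSink().bindSource(new CountingSource(100000))
drain.start()
console.log(drain.chunks.length) // 100000, with no stack overflow
```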
push-stream solves this in a much simpler way: sinks have a paused property, which the source can check before it calls write. A sync source can just loop until the sink pauses, then wait until resume is called. This means it uses less stack memory.
Hmm, that wouldn't work with bob streams because of the way the sink allocates the buffer...
I'm not really sure about that, though. (And also forbidding object streams - but let's not discuss that in this issue; the stack overflow problem is more important.)
could do it, might be good for exposure? Idk.
need to do this... currently it's really just done via the C++ passthrough
The current status is:
Next up:
Previous status - 16/2/2018 (2018 week 7): #5
I like being explicit in intent - I feel it would be better to not include default code examples that automatically start from sinks, and rather just always include start(). (And no longer have start be an "extension".)
I probably have enough material, could be good for visibility?
As much as I'd like to avoid it, it seems like some kind of hint system would be useful for telling who should allocate buffers and of what size.
Ideally, this would be done out of the regular pull flow (to avoid passing like 7 arguments every time). So, probably going to be connected to another (yet to be made) issue about doing some kind of "construct" flow...
See #30 (comment)
One piece of contention in the current design of the sink API is who allocates the buffer. If I have data already in buffers that needs to be written to a socket or disk, it doesn't make sense for the sink to allocate a write buffer and tell me to copy values into it.
One thing I like about node.js streams is composability. It's easy to compose pipelines that mix binary and object streams. For example, parsing a csv file, transforming the rows (objects), then writing back to a file. When raw performance isn't important (it often isn't) then node.js streams are pretty great.
Bob is binary-only, so what will replace object streams? If the answer is "async iterators", how does one compose a pipeline as easily as x.pipe(y).pipe(z)? And if each of those is an async iterator, wouldn't that hurt throughput, as you can no longer transform multiple objects in one tick?
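Not part of bob, but for context: one common answer to the composition part of this question, sketched with async generators. The compose() helper and the toy transforms are purely illustrative.

```javascript
// compose() chains transforms much like x.pipe(y).pipe(z) chains streams.
async function * source () {
  yield 1; yield 2; yield 3
}

async function * double (iter) {
  for await (const x of iter) yield x * 2
}

async function * addOne (iter) {
  for await (const x of iter) yield x + 1
}

// compose(src, t1, t2) ~= t2(t1(src()))
const compose = (src, ...transforms) =>
  transforms.reduce((iter, t) => t(iter), src())

async function main () {
  const out = []
  for await (const x of compose(source, double, addOne)) out.push(x)
  console.log(out.join(',')) // 3,5,7
}
main()
```

The throughput concern stands, though: each value crosses an await boundary per stage, which is exactly the per-tick overhead the question raises.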
From the collab summit, after talking with @jasnell extensively about QUIC, I think it would be useful to have pull support an offset to ask the source to read from (the source may choose to still return whatever data it chooses).
This should allow for ACKs in a network implementation (i.e. if you need to ACK, you just request the last / desired offset again).
It also would support a level of content-addressability. A sink or downstream intermediate could be the one to inform a file source of where to read from.
Relatedly, this may also make disk-based sources require less state...
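For illustration, a pull that accepts an offset might look like this. Names, signature, and chunk size are assumptions; per the comment above, a real source may still choose to return whatever data it wants.

```javascript
// Sketch: a pull() taking an offset, so a sink can "ACK" by simply
// re-requesting the last offset it successfully handled.
class ReplayableSource {
  constructor (data) { this.data = data }
  bindSink (sink) { this.sink = sink; return this }
  pull (error, buffer, offset = 0) {
    // This toy source honors the offset; a disk-based source could seek,
    // needing less of its own state.
    const chunk = this.data.slice(offset, offset + 4)
    this.sink.next(chunk.length ? 'continue' : 'end', null, chunk, offset)
  }
}

const received = []
const sink = { next: (status, err, chunk, offset) => received.push(offset + ':' + chunk) }
const source = new ReplayableSource('abcdefgh').bindSink(sink)
source.pull(null, null, 0)
source.pull(null, null, 4)
source.pull(null, null, 4) // "ACK"-style replay of the same offset
console.log(received.join('|')) // 0:abcd|4:efgh|4:efgh
```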
e.g. require('bob/status'), require('bob/verify'), require('bob/stream')...
Would make maintaining the spec-ness of this repo much easier.
The npm modules (fs-source, etc.) have automated tests, but this repo never has...
The current status is:
c++
Next up:
Previous status - 13/3/2018 (2018 week 11): #9