attic-labs / noms
The versioned, forkable, syncable database
License: Apache License 2.0
See #59 for context.
@rafael-atticlabs already started this.
$ ./xml2noms
Usage of ./xml2noms:
-aws-store-auth-from-env=false: creates the aws-based chunkstore from authorization found in the environment. This is typically used in production to get keys from IAM profile. If not specified, then -aws-store-key and aws-store-secret must be specified instead
-aws-store-bucket="": s3 bucket to create an aws-based chunkstore in
-aws-store-dynamo-table="noms-root": dynamodb table to store the root of the aws-based chunkstore in
-aws-store-key="": aws key to use to create the aws-based chunkstore
-aws-store-region="us-west-2": aws region to put the aws-based chunkstore in
-aws-store-secret="": aws secret to use to create the aws-based chunkstore
-dataset-id="": dataset id to store data for
-file-store="": directory to use for a file-based chunkstore
-file-store-root="root": filename which holds the root ref in the filestore
-httptest.serve="": if non-empty, httptest.NewServer serves on this address and blocks
-memory-store=false: use a memory-based (ephemeral, and private to this application) chunkstore
It should tell you that it crawls a dir and how to specify that dir.
Current thinking is that, for the MLB pitcher heatmap demo, we want to use a generic json importer to suck in the data and then have a "distiller" app that post-processes the data into some more structured, application-specific form. It seems like the most logical way to build something like this is to enable each app to have its own view of the data, and the simplest way to do THAT is to allow each app to have its own root set.
See also clients/json_importer and #29.
XML is interesting because it is so rich compared to json and csv (attributes, namespaces, entities, etc). We probably want to capture all of that into some noms equivalent of the xml dom.
This seems like a lower priority than json and csv to me, but we can increase the priority if we run into some important app that needs it.
We want two nodes to be able to work with the same datastore offline easily. Like git, the way that they catch each other up will be through push and pull operations.
noms pull http://noms.io/cmasone/mlb/xml
would tell the node at noms.io that I'm currently at hash XYZ of the "mlb/xml" dataset in the "cmasone" datastore and I wish to update to head. noms.io would figure out what refs I'm missing and send them all down to me, including the new root.
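A minimal sketch of that negotiation, using hypothetical LocalStore/RemoteStore interfaces rather than the real noms API:

package datas

import "fmt"

type Ref string

// RemoteStore and LocalStore are hypothetical stand-ins for whatever the
// server and client ends of pull end up looking like.
type RemoteStore interface {
    // RefsMissingFrom returns the refs the client lacks, plus the new root,
    // given the client's current head of the dataset.
    RefsMissingFrom(dataset string, clientHead Ref) (missing []Ref, newRoot Ref)
    Get(r Ref) []byte
}

type LocalStore interface {
    Head(dataset string) Ref
    Put(r Ref, chunk []byte)
    SetHead(dataset string, root Ref)
}

// Pull tells the remote where we are, copies down everything we're missing,
// and then advances our head to the remote's root.
func Pull(local LocalStore, remote RemoteStore, dataset string) {
    missing, newRoot := remote.RefsMissingFrom(dataset, local.Head(dataset))
    for _, r := range missing {
        local.Put(r, remote.Get(r))
    }
    local.SetHead(dataset, newRoot)
    fmt.Printf("pulled %d chunks; %s is now at %s\n", len(missing), dataset, newRoot)
}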
Currently it becomes invalid, but other ChunkStores simply reset themselves and could be used again.
In retrospect, it's a bit weird to have them in some random file in the package. We should just have a convention of using, e.g., _gen.go or something and having the command in there.
Even after #11 is implemented, we still have to do more work in the encoding path, otherwise we end up re-expanding everything when trying to save. For example:
v, _ := ReadValue(<bigval>, cs)
v = v.Append(Int32(42))
// At this point, we've only read one ref
// This line ends up reading all the refs because it walks the tree in order to serialize :-/.
r, _ := WriteValue(v, cs)
Both in the explorer demo and in the heatmap demo.
We currently do one get per value, which gets slow.
Blocking #52
Once we have various aspects of the bits tuned the way we like (see #52), we should add some kind of continuous benchmarking so that we know when we regress it. There is a 'benchmark' thing built into go testing, but glancing at it, it's not clear how you're supposed to integrate it into your workflow.
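The built-in hook is go test's benchmark support; a minimal sketch of what one could look like (buildLargeList is a hypothetical fixture), run with go test -bench=.:

package types

import "testing"

func BenchmarkListAppend(b *testing.B) {
    l := buildLargeList() // hypothetical fixture that constructs a big List
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        l.Append(Int32(42))
    }
}

The open part is recording the numbers from run to run so a regression actually shows up.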
Heatmap outputs data in a structure that should be easy for pitchmap to consume now!
My ocd is kicking in.
Currently, DataStore always branches the repo if there is a conflict. For some use cases it is better to just return failure if optimistic locking failed, and let the caller retry. See references to this bug in the code where this is a known problem.
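A rough sketch of the fail-and-retry flavor, with a hypothetical ErrOptimisticLockFailed and Commit signature rather than the current auto-branching behavior:

package datas

import "errors"

// ErrOptimisticLockFailed is hypothetical; today DataStore branches instead
// of returning an error.
var ErrOptimisticLockFailed = errors.New("head moved during commit")

type Value interface{}

type DataStore interface {
    Head() Value
    Commit(newRoot Value) error
}

// commitWithRetry rebuilds the new root on top of the latest head until the
// optimistic commit succeeds, instead of branching on conflict.
func commitWithRetry(ds DataStore, build func(head Value) Value) error {
    for {
        err := ds.Commit(build(ds.Head()))
        if err == nil {
            return nil
        }
        if !errors.Is(err, ErrOptimisticLockFailed) {
            return err
        }
        // Someone else committed first; loop and rebuild on the new head.
    }
}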
Currently you feed nomgen loosely-typed maps and lists. We should use nomgen to create go types to feed to it instead.
Just because it's fun.
A lot of data in the world is csv or tsv. It would be good to have generic support for this.
An interesting question is how to represent csv in noms types. Assuming the column names are known, the first thing that comes to mind is a list of maps. But that would have a problem: it wouldn't enforce that all the maps are the same shape, so it's hard to do things like graphs.
What we really want is List<T> where T is a struct with fields corresponding to the columns of the csv. We can easily create a StructDef dynamically using the column names as the fields, but with the current generated Go code it is not easy to dynamically create instances of such a struct. Perhaps there should be a way to.
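A hedged sketch of that dynamic path, with placeholder StructDef/Struct types standing in for whatever the real noms API ends up being:

package csvimport

import (
    "encoding/csv"
    "io"
)

// StructDef and Struct are placeholders for the eventual dynamic types API;
// only the shape of the idea matters here.
type StructDef struct{ Fields []string }

type Struct struct{ Values map[string]string }

func NewStructDef(fields []string) StructDef { return StructDef{Fields: fields} }

// NewInstance builds one struct instance from a csv row, pairing values with
// the column names captured in the StructDef.
func (d StructDef) NewInstance(row []string) Struct {
    s := Struct{Values: map[string]string{}}
    for i, f := range d.Fields {
        s.Values[f] = row[i]
    }
    return s
}

// Import reads the header to build a StructDef, then turns each row into an
// instance of that struct -- the List<T> shape described above.
func Import(r io.Reader) ([]Struct, error) {
    cr := csv.NewReader(r)
    header, err := cr.Read()
    if err != nil {
        return nil, err
    }
    def := NewStructDef(header)
    var out []Struct
    for {
        row, err := cr.Read()
        if err == io.EOF {
            return out, nil
        }
        if err != nil {
            return nil, err
        }
        out = append(out, def.NewInstance(row))
    }
}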
E.g., Map<Foo, Bar>. This will help with #54.
We want to split large pieces of data up into chunks. Initially, we'll do this for blobs, but probably want to expand beyond that.
We'll use a rolling hash function to figure out where to chunk things, and will need to deal with chunks that are still too big.
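A minimal sketch of rolling-hash boundary detection, using a simple additive hash over a fixed window (the real chunker would likely use something stronger, and still needs a cap for chunks that come out too big):

package chunker

const (
    window = 64
    mask   = 0xFFF // a boundary roughly every 4 KiB on average
)

// boundaries returns the offsets at which the rolling hash says one chunk
// should end and the next begin.
func boundaries(data []byte) []int {
    var cuts []int
    var hash uint32
    for i, b := range data {
        hash += uint32(b)
        if i >= window {
            hash -= uint32(data[i-window])
        }
        if hash&mask == mask {
            cuts = append(cuts, i+1)
            hash = 0
        }
    }
    return cuts
}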
A blob value will really just contain some list of Refs to the chunks that make it up, and the Ref of a blob will be some kind of hash-of-hashes thing. This does introduce some complexity to the Reader() of a blob, since it now needs to find and decode all the chunks that make up the blob. Aaron proposes a "Resolver" interface that all composite types will have a pointer to, which looks like this:
type Resolver interface {
    Resolve(r ref.Ref) Value
}
Our "structs" are currently just helpers around loosely-typed maps. We need to actually encode structs in order to enable various other features -- such as commit-time validation of structs, search by struct type, etc.
I'm not sure exactly what form this should take. The first thing that springs to mind in the current JSON encoding is something like:
{"sha1-xyz": <current map encoding>}
{"List<sha1-xyz>": <current list encoding>}
{"Set<List<sha1-syz>>" ...
... but there is likely something better.
The idea is to add a new chunk store that takes two chunks stores, one acting as "caching" which is first consulted in the case of reads. For puts we always put to the "backing" store.
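A sketch of the shape this could take, assuming a minimal Get/Put ChunkStore interface (the real interface has more to it):

package chunks

type Ref string

type ChunkStore interface {
    Get(r Ref) ([]byte, bool)
    Put(r Ref, data []byte)
}

// cachingStore consults the cache first on reads, falls back to the backing
// store (populating the cache on the way), and always writes to backing.
type cachingStore struct {
    cache, backing ChunkStore
}

func (cs cachingStore) Get(r Ref) ([]byte, bool) {
    if data, ok := cs.cache.Get(r); ok {
        return data, true
    }
    data, ok := cs.backing.Get(r)
    if ok {
        cs.cache.Put(r, data)
    }
    return data, ok
}

func (cs cachingStore) Put(r Ref, data []byte) {
    cs.backing.Put(r, data)
}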
We need some. Probably in connection with #67.
Each user of nomgen (e.g., dataset/gen/gen.go) is specifying an -o flag that is getting passed in from the rungen.go file. I think it would be simpler to just have dataset/gen/gen.go hardcode that path instead of having a flag. We don't need two layers of indirection.
Because it uses both s3 and dynamo as part of its impl
So like:
I was originally thinking that the interfaces isolated the guts of the impl, and that there might be different implementations of each interface (e.g., a chunked implementation).
But there's no need to maintain multiple implementations at the same time, and isolating the guts can be done just with normal private fields.
This will simplify the types package somewhat, which is getting a bit hairy.
It's easy to accidentally use assert.Equals() on two values. This will sometimes work, but it depends on whether the internal caches inside the two values are populated or not. Instead, we should be calling Value::Equals().
It should be possible to use some Go interface trickery to make assert.Equals() automatically delegate to Value::Equals() but it is complicated by the fact that the dbg package would then depend on types, which already depends on dbg.
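In the meantime, a small helper like this avoids the trap without creating the dependency cycle (the local Value interface just stands in for types.Value):

package types_test

import "testing"

// Value stands in for types.Value; only Equals matters for this helper.
type Value interface {
    Equals(other Value) bool
}

// assertValueEquals compares via Value.Equals rather than assert.Equals, so
// the result doesn't depend on whether internal caches happen to be populated.
func assertValueEquals(t *testing.T, expected, actual Value) {
    t.Helper()
    if !expected.Equals(actual) {
        t.Errorf("values differ: %v != %v", expected, actual)
    }
}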
Google has started reducing the visibility of this feature. Capturing it into noms could be a great thing for some people who are worried about losing that data.
Right now we are testing on just one game's worth of data. I think we should be able to comfortably suck in all the data from all of history on one macbook and display it in the UI. Let's step up through increasingly large chunks (day, month, year, forever) and use that to drive improvements throughout the stack.
This bug will be complete when:
Currently, only ::Ref() commits the write. I believe this is different than other io.Writer impls in the Go stdlib. For example File will flush the write on Close (I think).
If this is indeed a guarantee in other io.Writer impls, then it should be for us too. And then I guess we'd want an Abandon() method to actually abandon halfway through a write.
See clients/json_importer as an example.
Since csv is a table, we should import it into List<T>. Not sure what the best way to determine T is ... maybe look at some existing csv data, but ideas include:
If I run it repeatedly in clients/flickr, it seems to keep changing its mind about what order it defines types in.
We don't really need it, right? xml2noms can do the same thing when combined with wget, but xml2noms does more!
See Camlistore for a start.
It should feel very declarative and direct to write to noms, but instead it feels super frustrating.
We need to write a codegen system that generates immutable Go structs and collections from definitions.
A simpler short-term solution might be to write reflection code that can decode a Go struct from a types.Value. But this has the disadvantage that we won't have strongly-typed collections or lazy collections.
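A rough sketch of that reflection approach, decoding from a plain map standing in for types.Value (flat fields only, matched by name, nothing lazy or strongly typed):

package decode

import (
    "fmt"
    "reflect"
)

// FromMap copies matching keys of m into the exported fields of the struct
// pointed to by out. Eager and loosely typed, as described above.
func FromMap(m map[string]interface{}, out interface{}) error {
    v := reflect.ValueOf(out)
    if v.Kind() != reflect.Ptr || v.Elem().Kind() != reflect.Struct {
        return fmt.Errorf("out must be a pointer to a struct")
    }
    s := v.Elem()
    t := s.Type()
    for i := 0; i < t.NumField(); i++ {
        field := s.Field(i)
        if !field.CanSet() {
            continue // unexported field
        }
        val, ok := m[t.Field(i).Name]
        if !ok {
            continue
        }
        fv := reflect.ValueOf(val)
        if fv.Type().AssignableTo(field.Type()) {
            field.Set(fv)
        }
    }
    return nil
}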
We should have some. I wonder if the fact that we are actually a DAG with a single root helps us at all.
We probably want something simple for v0 that runs periodically on commit, but it's a lowish priority.
Should take in a datastore and an input dataset by name
Currently the datasets are stored in a Set and we iterate over the set to find the dataset with the correct ID. We should just use a Map with the ID as the key.
Particularly for s3store this is important, but can also be important depending on where the temp directory is in the FileStore implementation.
We end up typing it a lot, so shorter would be better. "ds", or "datas" as an ungrammatical joke and analog to "chunks"?
We currently eagerly load all values recursively.
While this is sometimes what you want, it frequently isn't. For example right now, when someone asks DataStore for the current roots, we will read the entire database into memory.
It requires some thought as to where to put the gradual expansion.
I'm sure this will be a number of things, and not sure how to tell when it's done. But the goal is for the code to feel very direct and declarative. Reacty.
The number this shell script outputs each time it runs increases. I expect it to be constant, since we should only be looking at the previous commit and the new commit each time.
./counter <deets> | grep Getting | sort | uniq | wc -l