attic-labs / noms
The versioned, forkable, syncable database
License: Apache License 2.0
See #59 for context.
@rafael-atticlabs already started this.
$ ./xml2noms
Usage of ./xml2noms:
-aws-store-auth-from-env=false: creates the aws-based chunkstore from authorization found in the environment. This is typically used in production to get keys from IAM profile. If not specified, then -aws-store-key and aws-store-secret must be specified instead
-aws-store-bucket="": s3 bucket to create an aws-based chunkstore in
-aws-store-dynamo-table="noms-root": dynamodb table to store the root of the aws-based chunkstore in
-aws-store-key="": aws key to use to create the aws-based chunkstore
-aws-store-region="us-west-2": aws region to put the aws-based chunkstore in
-aws-store-secret="": aws secret to use to create the aws-based chunkstore
-dataset-id="": dataset id to store data for
-file-store="": directory to use for a file-based chunkstore
-file-store-root="root": filename which holds the root ref in the filestore
-httptest.serve="": if non-empty, httptest.NewServer serves on this address and blocks
-memory-store=false: use a memory-based (ephemeral, and private to this application) chunkstore
It should tell you that it crawls a dir and how to specify that dir.
Current thinking is that, for the MLB pitcher heatmap demo, we want to use a generic json importer to suck in the data and then have a "distiller" app that post-processes the data into some more structured, application-specific form. It seems like the most logical way to build something like this is to enable each app to have its own view of the data, and the simplest way to do THAT is to allow each app to have its own root set.
See also clients/json_importer and #29.
XML is interesting because it is so rich compared to json and csv (attributes, namespaces, entities, etc). We probably want to capture all of that into some noms equivalent of the xml dom.
This seems like a lower priority than json and csv to me, but we can increase the priority if we run into some important app that needs it.
We want two nodes to be able to work with the same datastore offline easily. Like git, the way that they catch each other up will be through push and pull operations.
noms pull http://noms.io/cmasone/mlb/xml
would tell the node at noms.io that I'm currently at hash XYZ of the "mlb/xml" dataset in the "cmasone" datastore and I wish to update to head. noms.io would figure out what refs I'm missing and send them all down to me, including the new root.
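A minimal sketch of that negotiation, using hypothetical LocalStore/RemoteStore interfaces rather than the real noms API:

package datas

import "fmt"

type Ref string

// RemoteStore and LocalStore are hypothetical stand-ins for whatever the
// server and client ends of pull end up looking like.
type RemoteStore interface {
    // RefsMissingFrom returns the refs the client lacks, plus the new root,
    // given the client's current head of the dataset.
    RefsMissingFrom(dataset string, clientHead Ref) (missing []Ref, newRoot Ref)
    Get(r Ref) []byte
}

type LocalStore interface {
    Head(dataset string) Ref
    Put(r Ref, chunk []byte)
    SetHead(dataset string, root Ref)
}

// Pull tells the remote where we are, copies down everything we're missing,
// and then advances our head to the remote's root.
func Pull(local LocalStore, remote RemoteStore, dataset string) {
    missing, newRoot := remote.RefsMissingFrom(dataset, local.Head(dataset))
    for _, r := range missing {
        local.Put(r, remote.Get(r))
    }
    local.SetHead(dataset, newRoot)
    fmt.Printf("pulled %d chunks; %s is now at %s\n", len(missing), dataset, newRoot)
}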
Currently it becomes invalid, but other ChunkStores simply reset themselves and could be used again.
In retrospect, it's a bit weird to have them in some random file in the package. We should just have a convention of using, e.g., _gen.go or something and having the command in there.
Even after #11 is implemented, we still have to do more work in the encoding path, otherwise we end up re-expanding everything when trying to save. For example:
v, _ := ReadValue(<bigval>, cs)
v = v.Append(Int32(42))
// At this point, we've only read one ref
// This line ends up reading all the refs because it walks the tree in order to serialize :-/.
r, _ := WriteValue(v, cs)
Both in the explorer demo and in the heatmap demo.
We currently do one get per value, which gets slow.
Blocking #52
Once we have various aspects of the bits tuned the way we like (see #52), we should add some kind of continuous benchmarking so that we know when we regress it. There is a 'benchmark' thing built into go testing, but glancing at it, it's not clear how you're supposed to integrate it into your workflow.
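The built-in hook is go test's benchmark support; a minimal sketch of what one could look like (buildLargeList is a hypothetical fixture), run with go test -bench=.:

package types

import "testing"

func BenchmarkListAppend(b *testing.B) {
    l := buildLargeList() // hypothetical fixture that constructs a big List
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        l.Append(Int32(42))
    }
}

The open part is recording the numbers from run to run so a regression actually shows up.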
Heatmap outputs data in a structure that should be easy for pitchmap to consume now!
My ocd is kicking in.
Currently, DataStore always branches the repo if there is a conflict. For some use cases it is better to just return failure if optimistic locking failed, and let the caller retry. See references to this bug in the code where this is a known problem.
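A rough sketch of the fail-and-retry flavor, with a hypothetical ErrOptimisticLockFailed and Commit signature rather than the current auto-branching behavior:

package datas

import "errors"

// ErrOptimisticLockFailed is hypothetical; today DataStore branches instead
// of returning an error.
var ErrOptimisticLockFailed = errors.New("head moved during commit")

type Value interface{}

type DataStore interface {
    Head() Value
    Commit(newRoot Value) error
}

// commitWithRetry rebuilds the new root on top of the latest head until the
// optimistic commit succeeds, instead of branching on conflict.
func commitWithRetry(ds DataStore, build func(head Value) Value) error {
    for {
        err := ds.Commit(build(ds.Head()))
        if err == nil {
            return nil
        }
        if !errors.Is(err, ErrOptimisticLockFailed) {
            return err
        }
        // Someone else committed first; loop and rebuild on the new head.
    }
}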
Currently you feed nomgen loosely-typed maps and lists. We should use nomgen to create go types to feed to it instead.
Just because it's fun.
A lot of data in the world is csv or tsv. It would be good to have generic support for this.
An interesting question is how to represent csv in noms types. Assuming the column names are known, the first thing that comes to mind is a list of maps. But that would have a problem: it wouldn't enforce that all the maps are the same shape, so it's hard to do things like graphs.
What we really want is List<T> where T is a struct with fields corresponding to the columns of the csv. We can easily create a StructDef dynamically using the column names as the fields, but with the current generated Go code it is not easy to dynamically create instances of such a struct. Perhaps there should be a way to.
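A hedged sketch of that dynamic path, with placeholder StructDef/Struct types standing in for whatever the real noms API ends up being:

package csvimport

import (
    "encoding/csv"
    "io"
)

// StructDef and Struct are placeholders for the eventual dynamic types API;
// only the shape of the idea matters here.
type StructDef struct{ Fields []string }

type Struct struct{ Values map[string]string }

func NewStructDef(fields []string) StructDef { return StructDef{Fields: fields} }

// NewInstance builds one struct instance from a csv row, pairing values with
// the column names captured in the StructDef.
func (d StructDef) NewInstance(row []string) Struct {
    s := Struct{Values: map[string]string{}}
    for i, f := range d.Fields {
        s.Values[f] = row[i]
    }
    return s
}

// Import reads the header to build a StructDef, then turns each row into an
// instance of that struct -- the List<T> shape described above.
func Import(r io.Reader) ([]Struct, error) {
    cr := csv.NewReader(r)
    header, err := cr.Read()
    if err != nil {
        return nil, err
    }
    def := NewStructDef(header)
    var out []Struct
    for {
        row, err := cr.Read()
        if err == io.EOF {
            return out, nil
        }
        if err != nil {
            return nil, err
        }
        out = append(out, def.NewInstance(row))
    }
}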
E.g., Map<Foo, Bar>. This will help with #54.
We want to split large pieces of data up into chunks. Initially, we'll do this for blobs, but probably want to expand beyond that.
We'll use a rolling hash function to figure out where to chunk things, and will need to deal with chunks that are still too big.
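A minimal sketch of rolling-hash boundary detection, using a simple additive hash over a fixed window (the real chunker would likely use something stronger, and still needs a cap for chunks that come out too big):

package chunker

const (
    window = 64
    mask   = 0xFFF // a boundary roughly every 4 KiB on average
)

// boundaries returns the offsets at which the rolling hash says one chunk
// should end and the next begin.
func boundaries(data []byte) []int {
    var cuts []int
    var hash uint32
    for i, b := range data {
        hash += uint32(b)
        if i >= window {
            hash -= uint32(data[i-window])
        }
        if hash&mask == mask {
            cuts = append(cuts, i+1)
            hash = 0
        }
    }
    return cuts
}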
A blob value will really just contain some list of Refs to the chunks that make it up, and the Ref of a blob will be some kind of hash-of-hashes thing. This does introduce some complexity to the Reader() of a blob, since it now needs to find and decode all the chunks that make up the blob. Aaron proposes a "Resolver" interface that all composite types will have a pointer to, which looks like this:
type Resolver interface {
    Resolve(r ref.Ref) Value
}
Our "structs" are currently just helpers around loosely-typed maps. We need to actually encode structs in order to enable various other features -- such as commit-time validation of structs, search by struct type, etc.
I'm not sure exactly what form this should take. The first thing that springs to mind in the current JSON encoding is something like:
{"sha1-xyz": <current map encoding>}
{"List<sha1-xyz>": <current list encoding>}
{"Set<List<sha1-syz>>" ...
... but there is likely something better.
The idea is to add a new chunk store that takes two chunks stores, one acting as "caching" which is first consulted in the case of reads. For puts we always put to the "backing" store.
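A sketch of the shape this could take, assuming a minimal Get/Put ChunkStore interface (the real interface has more to it):

package chunks

type Ref string

type ChunkStore interface {
    Get(r Ref) ([]byte, bool)
    Put(r Ref, data []byte)
}

// cachingStore consults the cache first on reads, falls back to the backing
// store (populating the cache on the way), and always writes to backing.
type cachingStore struct {
    cache, backing ChunkStore
}

func (cs cachingStore) Get(r Ref) ([]byte, bool) {
    if data, ok := cs.cache.Get(r); ok {
        return data, true
    }
    data, ok := cs.backing.Get(r)
    if ok {
        cs.cache.Put(r, data)
    }
    return data, ok
}

func (cs cachingStore) Put(r Ref, data []byte) {
    cs.backing.Put(r, data)
}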
We need some. Probably in connection with #67.
Each user of nomgen (e.g., dataset/gen/gen.go) is specifying an -o flag that is getting passed in from the rungen.go file. I think it would be simpler to just have dataset/gen/gen.go hardcode that path instead of having a flag. We don't need two layers of indirection.
Because it uses both s3 and dynamo as part of its impl
So like:
I was originally thinking that the interfaces isolated the guts of the impl, and that there might be different implementations of each interface (e.g., a chunked implementation).
But there's no need to maintain multiple implementations at the same time, and isolating the guts can be done just with normal private fields.
This will simplify the types package somewhat, which is getting a bit hairy.
It's easy to accidentally use assert.Equals() on two values. This will sometimes work, but it depends on whether the internal caches inside the two values are populated or not. Instead, we should be calling Value::Equals().
It should be possible to use some Go interface trickery to make assert.Equals() automatically delegate to Value::Equals() but it is complicated by the fact that the dbg package would then depend on types, which already depends on dbg.
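In the meantime, a small helper like this avoids the trap without creating the dependency cycle (the local Value interface just stands in for types.Value):

package types_test

import "testing"

// Value stands in for types.Value; only Equals matters for this helper.
type Value interface {
    Equals(other Value) bool
}

// assertValueEquals compares via Value.Equals rather than assert.Equals, so
// the result doesn't depend on whether internal caches happen to be populated.
func assertValueEquals(t *testing.T, expected, actual Value) {
    t.Helper()
    if !expected.Equals(actual) {
        t.Errorf("values differ: %v != %v", expected, actual)
    }
}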
Google has started reducing the visibility of this feature. Capturing it into noms could be a great thing for some people who are worried about losing that data.
Right now we are testing on just one game's worth of data. I think we should be able to comfortably suck in all the data from all of history on one macbook and display it in the UI. Let's step up through increasingly large chunks (day, month, year, forever) and use that to drive improvements throughout the stack.
This bug will be complete when:
Currently, only ::Ref() commits the write. I believe this is different than other io.Writer impls in the Go stdlib. For example File will flush the write on Close (I think).
If this is indeed a guarantee in other io.Writer impls, then it should be for us too. And then I guess we'd want an Abandon() method to actually abandon halfway through a write.
See clients/json_importer as an example.
Since csv is a table, we should import it into List<T>. Not sure what the best way to determine T is ... maybe look at some existing csv data, but ideas include:
If I run it repeatedly in clients/flickr, it seems to keep changing its mind about what order it defines types in.
We don't really need it, right? xml2noms can do the same thing when combined with wget, but xml2noms does more!
See Camlistore for a start.
It should feel very declarative and direct to write to noms, but instead it feels super frustrating.
We need to write a codegen system that generates immutable Go structs and collections from definitions.
A simpler short-term solution might be to write reflection code that can decode a Go struct from a types.Value. But this has the disadvantage that we won't have strongly-typed collections or lazy collections.
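A rough sketch of that reflection approach, decoding from a plain map standing in for types.Value (flat fields only, matched by name, nothing lazy or strongly typed):

package decode

import (
    "fmt"
    "reflect"
)

// FromMap copies matching keys of m into the exported fields of the struct
// pointed to by out. Eager and loosely typed, as described above.
func FromMap(m map[string]interface{}, out interface{}) error {
    v := reflect.ValueOf(out)
    if v.Kind() != reflect.Ptr || v.Elem().Kind() != reflect.Struct {
        return fmt.Errorf("out must be a pointer to a struct")
    }
    s := v.Elem()
    t := s.Type()
    for i := 0; i < t.NumField(); i++ {
        field := s.Field(i)
        if !field.CanSet() {
            continue // unexported field
        }
        val, ok := m[t.Field(i).Name]
        if !ok {
            continue
        }
        fv := reflect.ValueOf(val)
        if fv.Type().AssignableTo(field.Type()) {
            field.Set(fv)
        }
    }
    return nil
}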
We should have some. I wonder if the fact that we are actually a DAG with a single root helps us at all.
We probably want something simple for v0 that runs periodically on commit, but it's a lowish priority.
Should take in a datastore and an input dataset by name
Currently the datasets are stored in a Set and we iterate over the set to find the dataset with the correct ID. We should just use a Map with the ID as the key.
Particularly for s3store this is important, but can also be important depending on where the temp directory is in the FileStore implementation.
We end up typing it a lot, so shorter would be better. "ds", or "datas" as an ungrammatical joke and analog to "chunks"?
We currently eagerly load all values recursively.
While this is sometimes what you want, it frequently isn't. For example right now, when someone asks DataStore for the current roots, we will read the entire database into memory.
It requires some thought as to where to put the gradual expansion.
I'm sure this will be a number of things, and not sure how to tell when it's done. But the goal is for the code to feel very direct and declarative. Reacty.
The number this shell script outputs each time it runs increases. I expect it to be constant, since we should only be looking at the previous commit and the new commit each time.
./counter <deets> | grep Getting | sort | uniq | wc -l