
Noms

Warning - This project is not active

Noms is not being maintained. You shouldn't use it, except maybe for fun or research.

If you are interested in something like Noms, you probably want Dolt (https://github.com/dolthub/dolt), a fork of this project that is actively maintained.

Send me (aaron at aaronboodman.com) a message if you have questions.


Use Cases  |  Setup  |  Status  |  Documentation  |  Contact


Welcome

Noms is a decentralized database philosophically descended from the Git version control system.

Like Git, Noms is:

  • Versioned: By default, all previous versions of the database are retained. You can trivially track how the database evolved to its current state, easily and efficiently compare any two versions, or even rewind and branch from any previous version.
  • Synchronizable: Instances of a single Noms database can be disconnected from each other for any amount of time, then later reconcile their changes efficiently and correctly.

Unlike Git, Noms is a database, so it also:

  • Primarily stores structured data, not files and directories (see: the Noms type system)
  • Scales well to large amounts of data and concurrent clients
  • Supports atomic transactions (a single instance of Noms is CP, but Noms is typically run in production backed by S3, in which case it is "effectively CA")
  • Supports efficient indexes (see: Noms prolly-trees)
  • Features a flexible query model (see: GraphQL)

A Noms database can reside within a file system or in the cloud:

  • The (built-in) NBS ChunkStore implementation provides two back-ends that persist Noms databases: one stores data in a file system, the other in an S3 bucket.

Finally, because Noms is content-addressed, it yields a very pleasant programming model.

Working with Noms is declarative. You don't INSERT new data, UPDATE existing data, or DELETE old data. You simply declare what the data ought to be right now. If you commit the same data twice, it will be deduplicated because of content-addressing. If you commit almost the same data, only the part that is different will be written.
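
As a toy illustration of why that works (illustrative Go only; the store, put, and the choice of SHA-1 here are not the actual Noms implementation): a value's address is the hash of its content, so writing the same bytes twice is a no-op.

package main

import (
  "crypto/sha1"
  "fmt"
)

// store is a hypothetical chunk store keyed by content hash.
var store = map[[sha1.Size]byte][]byte{}

// put writes data only if its hash is not already present.
func put(data []byte) [sha1.Size]byte {
  h := sha1.Sum(data)
  if _, ok := store[h]; !ok {
    store[h] = data
  }
  return h
}

func main() {
  a := put([]byte(`{"name": "noms"}`))
  b := put([]byte(`{"name": "noms"}`)) // deduplicated: same address
  fmt.Println(a == b, len(store))      // prints: true 1
}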


Use Cases

Because Noms is very good at sync, it makes a decent basis for rich, collaborative, fully-decentralized applications.

Mobile Offline-First Database

Embed Noms into mobile applications to make it easier to build offline-first, fully synchronizing mobile apps.


Install

  1. Download the latest release:
  2. Unzip it somewhere and add it to your $PATH.
  3. Verify Noms is installed correctly:
$ noms version
format version: 7.18
built from <developer build>

Run

Import some data:

go install github.com/attic-labs/noms/samples/go/csv/csv-import
curl 'https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv?accessType=DOWNLOAD' > /tmp/data.csv
csv-import /tmp/data.csv /tmp/noms::nycdemo

Explore:

noms show /tmp/noms::nycdemo

Should show:

struct Commit {
  meta: struct Meta {
    date: "2017-09-19T19:33:01Z",
    inputFile: "/tmp/data.csv",
  },
  parents: set {},
  value: [  // 236 items
    struct Row {
      countAmericanIndian: "0",
      countAsianNonHispanic: "3",
      countBlackNonHispanic: "21",
      countCitizenStatusTotal: "44",
      countCitizenStatusUnknown: "0",
      countEthnicityTotal: "44",
...

Status

Nobody is working on this right now. You shouldn't rely on it unless you're willing to take over development yourself.

Major Open Issues

These are the major things you'd probably want to fix before relying on this for most systems.


Learn More About Noms

For the decentralized web: The Decentralized Database

Learn the basics: Technical Overview

Tour the CLI: Command-Line Interface Tour

Tour the Go API: Go SDK Tour


Contact Us

Interested in using Noms? Awesome! We would be happy to work with you to help understand whether Noms is a fit for your problem. Reach out at:

Licensing

Noms is open source software, licensed by Attic Labs, Inc. under the Apache License, Version 2.0.


Issues

Make assert.Equals() call Value::Equals()

It's easy to accidentally use assert.Equals() on two values. This will sometimes work, but it depends on whether the internal caches inside the two values are populated or not. Instead, we should be calling Value::Equals().

It should be possible to use some Go interface trickery to make assert.Equals() automatically delegate to Value::Equals() but it is complicated by the fact that the dbg package would then depend on types, which already depends on dbg.
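
A sketch of what that delegation could look like, assuming the assert helper were allowed to import types (which is exactly the dependency cycle noted above; the signature of Value.Equals is taken from the issue, everything else is hypothetical):

package assert

import (
  "reflect"

  "github.com/attic-labs/noms/types"
)

// Equals delegates to Value.Equals when both sides are Noms values,
// and falls back to reflect.DeepEqual otherwise. Sketch only: this
// compiles only if this package may import types, which is the
// circular-dependency problem described above.
func Equals(expected, actual interface{}) bool {
  if e, ok := expected.(types.Value); ok {
    if a, ok := actual.(types.Value); ok {
      return e.Equals(a)
    }
  }
  return reflect.DeepEqual(expected, actual)
}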

Using types.* is very verbose and frustrating

It should feel very declarative and direct to write to noms, but instead it feels super frustrating.

We need to write a codegen system that generates immutable Go structs and collections from definitions.

A simpler short-term solution might be to write reflection code that can decode a Go struct from a types.Value. But this has the disadvantage that we won't have strongly-typed collections or lazy collections.
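
A minimal sketch of that reflection approach, using a plain map[string]interface{} as a stand-in for types.Value (a real decoder would walk Noms values instead):

package decode

import (
  "fmt"
  "reflect"
)

// Decode copies entries from m into the struct pointed to by out,
// matching map keys to exported field names.
func Decode(m map[string]interface{}, out interface{}) error {
  v := reflect.ValueOf(out)
  if v.Kind() != reflect.Ptr || v.Elem().Kind() != reflect.Struct {
    return fmt.Errorf("out must be a pointer to a struct")
  }
  s := v.Elem()
  for key, val := range m {
    f := s.FieldByName(key)
    if !f.IsValid() || !f.CanSet() {
      continue // unknown or unexported field; skip
    }
    rv := reflect.ValueOf(val)
    if !rv.Type().AssignableTo(f.Type()) {
      return fmt.Errorf("field %s: cannot assign %T", key, val)
    }
    f.Set(rv)
  }
  return nil
}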

Setup benchmarking system

Once we have the various aspects of the system tuned the way we like (see #52), we should add some kind of continuous benchmarking so that we know when we regress. Go's testing package has built-in benchmark support, but at a glance it's not clear how you're supposed to integrate it into your workflow.
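
The mechanism itself is simple to invoke; a sketch with an arbitrary stand-in workload (hashing a 1 MiB chunk), run with go test -bench=. -benchmem:

package noms_test

import (
  "crypto/sha1"
  "testing"
)

// BenchmarkChunkHash times hashing a 1 MiB chunk. Continuous
// benchmarking would record numbers like this per commit and flag
// regressions; the open question is the surrounding workflow.
func BenchmarkChunkHash(b *testing.B) {
  data := make([]byte, 1<<20)
  b.SetBytes(int64(len(data)))
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    sha1.Sum(data)
  }
}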

Remove xml_importer

We don't really need it, right? xml2noms can do the same thing when combined with wget, and xml2noms does more!

MLB demo

Right now we are testing on just 1 game worth of data. I think we should be able to comfortably suck in all the data from all of history on one MacBook and display it in the UI. Let's step up through increasingly large chunks (day, month, year, forever) and use that to drive improvements throughout the stack.

This bug will be complete when:

  • xml2noms on the entire directory of mlb data takes "some reasonable amount of time". I don't know how long this should be, but my spidey sense says "< ~2x the time it takes to cp -R the entire directory?". @cmasone-attic, @arv, @rafael-atticlabs - please override me here if you have a better idea how long is "reasonable" for this to take.
  • running heatmap should take < 25% the time it takes to cp -R the entire xml2noms dir (because we're only reading a tiny bit of it?)
  • loading the initial ui and switching pitchers should be "fast" for normal definitions of ui speediness.

Re-introduce the notion of per-application roots

Current thinking is that, for the MLB pitcher heatmap demo, we want to use a generic json importer to suck in the data and then have a "distiller" app that post-processes the data into a more structured, application-specific form. It seems like the most logical way to build something like this is to enable each app to have its own view of the data, and the simplest way to do THAT is to allow each app to have its own root set.

datasets should be a map

Currently, datasets are stored as a Set, and we iterate over the set to find the dataset with the correct ID. We should just use a Map with the ID as the key, as sketched below.

@cmasone-attic
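
A sketch of the before and after (Dataset and its fields are hypothetical stand-ins for the real type):

package datas

// Dataset is a hypothetical stand-in for the real dataset type.
type Dataset struct {
  ID   string
  Head string
}

// Before: scan a set-like slice for the dataset with the right ID.
func findDataset(datasets []Dataset, id string) (Dataset, bool) {
  for _, d := range datasets {
    if d.ID == id {
      return d, true
    }
  }
  return Dataset{}, false
}

// After: a Map keyed by ID makes the lookup direct.
func lookupDataset(datasets map[string]Dataset, id string) (Dataset, bool) {
  d, ok := datasets[id]
  return d, ok
}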

Generic xml import

Lots of data out there is XML. You should be able to dump it into noms and work with it.

Seems like we'd have to come up with some kind of type family that is essentially XML DOM. See also issue #19 and issue #20.

Marking this milestone:later since it's not clear what the requirements are exactly.

ChunkStore::Writer::Close() should also commit the write

Currently, only ::Ref() commits the write. I believe this is different from other io.Writer impls in the Go stdlib. For example, File will flush the write on Close (I think).

If this is indeed a guarantee in other io.Writer impls, then it should be for us too. And then I guess we'd want an Abandon() method to actually abandon halfway through a write.
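
A sketch of the proposed contract as a hypothetical interface (the real writer's API may differ):

package chunks

import "io"

// ChunkWriter sketches the proposed behavior: Close commits the
// pending chunk, matching io.Writer impls that flush on Close, and
// Abandon discards a half-finished write without committing.
type ChunkWriter interface {
  io.Writer
  Close() error
  Abandon()
}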

Need design sketch: GC

We should have some. I wonder if the fact that we are actually a DAG with a single root helps us at all.

We probably want something simple for v0 that runs periodically on commit, but it's a lowish priority.

Generic xml importer

See also clients/json_importer and #29.

XML is interesting because it is so rich compared to json and csv (attributes, namespaces, entities, etc). We probably want to capture all of that into some noms equivalent of the xml dom.

This seems like a lower priority than json and csv to me, but we can increase the priority if we run into some important app that needs it.

Reduce boilerplate in uses of nomgen.go

Each user of nomgen (e.g., dataset/gen/gen.go) is specifying an -o flag that is getting passed in from the rungen.go file. I think it would be simpler to just have dataset/gen/gen.go hardcode that path instead of having a flag. We don't need two layers of indirection.

xml2noms usage unclear

$ ./xml2noms 
Usage of ./xml2noms:
  -aws-store-auth-from-env=false: creates the aws-based chunkstore from authorization found in the environment. This is typically used in production to get keys from IAM profile. If not specified, then -aws-store-key and aws-store-secret must be specified instead
  -aws-store-bucket="": s3 bucket to create an aws-based chunkstore in
  -aws-store-dynamo-table="noms-root": dynamodb table to store the root of the aws-based chunkstore in
  -aws-store-key="": aws key to use to create the aws-based chunkstore
  -aws-store-region="us-west-2": aws region to put the aws-based chunkstore in
  -aws-store-secret="": aws secret to use to create the aws-based chunkstore
  -dataset-id="": dataset id to store data for
  -file-store="": directory to use for a file-based chunkstore
  -file-store-root="root": filename which holds the root ref in the filestore
  -httptest.serve="": if non-empty, httptest.NewServer serves on this address and blocks
  -memory-store=false: use a memory-based (ephemeral, and private to this application) chunkstore

It should tell you that it crawls a dir and how to specify that dir.

Add a read through/caching store

The idea is to add a new chunk store that wraps two chunk stores: a "caching" store, which is consulted first on reads, and a "backing" store, which all puts go to.
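
A minimal sketch of that behavior, against a hypothetical stripped-down ChunkStore interface:

package chunks

// ChunkStore is a hypothetical stripped-down interface for this sketch.
type ChunkStore interface {
  Get(ref string) (chunk []byte, ok bool)
  Put(ref string, chunk []byte)
}

// readThroughStore consults cache first on reads, filling it from
// backing on a miss; writes always go to backing.
type readThroughStore struct {
  cache   ChunkStore
  backing ChunkStore
}

func (s readThroughStore) Get(ref string) ([]byte, bool) {
  if c, ok := s.cache.Get(ref); ok {
    return c, true
  }
  c, ok := s.backing.Get(ref)
  if ok {
    s.cache.Put(ref, c) // populate the cache on a miss
  }
  return c, ok
}

func (s readThroughStore) Put(ref string, chunk []byte) {
  s.backing.Put(ref, chunk)
}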

Get rid of the String/flatString, List/flatList, etc... abstraction

So like:

  • Remove string.go
  • Rename flat_string.go to string.go
  • Repeat for the others

I was originally thinking that the interfaces isolated the guts of the impl, and that there might be different implementations of each interface (e.g., a chunked implementation).

But there's no need to maintain multiple implementations at the same time, and isolating the guts can be done just with normal private fields.

This will simplify the types package somewhat, which is getting a bit hairy.

Implement chunking for blobs, at least

We want to split large pieces of data up into chunks. Initially, we'll do this for blobs, but probably want to expand beyond that.

We'll use a rolling hash function to figure out where to chunk things, and will need to deal with chunks that are still too big.

A blob value will really just contain some list of Refs to the chunks that make it up, and the Ref of a blob will be some kind of hash-of-hashes thing. This does introduce some complexity to the Reader() of a blob, since it now needs to find and decode all the chunks that make up the blob. Aaron proposes a "Resolver" interface that all composite types will have a pointer to, which looks like this:

type Resolver interface {
  Resolve(r ref.Ref) Value
}
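
And for the boundary-finding step mentioned above, a minimal sketch of content-defined chunking; the windowed-sum "rolling hash", the mask, and the size cap are all placeholders for whatever we actually choose:

package chunker

const (
  window  = 64        // bytes contributing to the rolling value
  mask    = 1<<12 - 1 // ~4 KiB average chunk size
  maxSize = 1 << 16   // split anyway if a chunk grows too big
)

// Chunk splits data where the rolling value matches the mask, so
// boundaries depend on content rather than on byte offsets.
func Chunk(data []byte) [][]byte {
  var chunks [][]byte
  var sum uint32
  start := 0
  for i, b := range data {
    sum += uint32(b)
    if i >= window {
      sum -= uint32(data[i-window]) // slide the window forward
    }
    if sum&mask == mask || i-start+1 >= maxSize {
      chunks = append(chunks, data[start:i+1])
      start = i + 1
    }
  }
  if start < len(data) {
    chunks = append(chunks, data[start:]) // trailing partial chunk
  }
  return chunks
}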

Lazy loading/decoding of data

We currently eagerly load all values recursively.

While this is sometimes what you want, it frequently isn't. For example, right now, when someone asks DataStore for the current roots, we will read the entire database into memory.

It requires some thought as to where to put the gradual expansion.

Encoding needs to be aware of lazy-loading

Even after #11 is implemented, we still have to do more work in the encoding path, otherwise we end up re-expanding everything when trying to save. For example:

v, _ := ReadValue(<bigval>, cs)
v = v.Append(Int32(42))
// At this point, we've only read one ref

// This line ends up reading all the refs because it walks the tree in order to serialize :-/.
r, _ := WriteValue(v, cs)

G+ stream importer

Google has started reducing the visibility of this feature. Capturing it into noms could be a great thing for some people who are worried about losing that data.

Implement generic csv import

A lot of data in the world is csv or tsv. It would be good to have generic support for this.

An interesting question is how to represent csv in noms types. Assuming the column names are known, the first thing that comes to mind is a list of maps. But that would have a problem: it wouldn't enforce that all the maps are the same shape, so it's hard to do things like graphs.

What we really want is List<T> where T is a struct with fields corresponding to the columns of the csv.

We can easily create a StructDef dynamically using the column names as the fields. But it is not easy with the current generated Go code to dynamically create instances of such a struct. Perhaps there should be a way.
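
A stdlib-only sketch of the shape in question, materializing each row as a record keyed by column name; this is deliberately not the Noms type system, just the dynamic analogue of List<T>:

package main

import (
  "encoding/csv"
  "fmt"
  "io"
  "strings"
)

func main() {
  r := csv.NewReader(strings.NewReader("name,count\nalice,3\nbob,7\n"))

  header, _ := r.Read() // the first row supplies the field names of T
  var rows []map[string]string
  for {
    rec, err := r.Read()
    if err == io.EOF {
      break
    } else if err != nil {
      break // a real importer would report the error
    }
    row := map[string]string{}
    for i, col := range header {
      row[col] = rec[i]
    }
    rows = append(rows, row)
  }
  fmt.Println(rows) // [map[count:3 name:alice] map[count:7 name:bob]]
}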

Generic csv importer

See clients/json_importer as an example.

Since csv is a table, we should import it into List<T>. Not sure what the best way to determine T is ... maybe look at some existing csv data, but ideas include:

  • If it's common for the first row to be the headers, have a -first-row-is-header flag on the importer that we use to get the field names of T.
  • Take a flag that either references or defines a type to use as T.

Encode structs

Our "structs" are currently just helpers around loosely-typed maps. We need to actually encode structs in order to enable various other features -- such as commit-time validation of structs, search by struct type, etc.

I'm not sure exactly what form this should take. The first thing that springs to mind in the current JSON encoding is something like:

{"sha1-xyz": <current map encoding>}
{"List<sha1-xyz>": <current list encoding>}
{"Set<List<sha1-syz>>" ...

... but there is likely something better.

Implement generic json import

Lots of data out there is JSON. You should be able to dump it into noms and work with it.

Seems like we'd have to import it as just loosely typed lists and maps. Leaving this as milestone:later since it's not very clear what form this should take right now.

See also #19 and #21.

noms pull

We want two nodes to be able to work with the same datastore offline easily. Like git, the way that they catch each other up will be through push and pull operations.

noms pull http://noms.io/cmasone/mlb/xml would tell the node at noms.io that I'm currently at hash XYZ of the "mlb/xml" dataset in the "cmasone" datastore and I wish to update to head. noms.io would figure out what refs I'm missing and send them all down to me, including the new root.
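
A sketch of that negotiation over hypothetical types (the real sync protocol differs): walk the remote graph down from its head, skip anything the local store already has, and copy the rest.

package pull

// Ref and Store are hypothetical stand-ins for this sketch.
type Ref string

type Store interface {
  Has(r Ref) bool
  Get(r Ref) (chunk []byte, children []Ref)
  Put(r Ref, chunk []byte)
}

// Pull copies every chunk reachable from remoteHead that local is
// missing, then returns the new head for the dataset. If local
// already has a ref, everything beneath it is assumed present too,
// which is what makes the catch-up efficient.
func Pull(local, remote Store, remoteHead Ref) Ref {
  queue := []Ref{remoteHead}
  for len(queue) > 0 {
    r := queue[0]
    queue = queue[1:]
    if local.Has(r) {
      continue
    }
    chunk, children := remote.Get(r)
    local.Put(r, chunk)
    queue = append(queue, children...)
  }
  return remoteHead
}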
