
dat's Introduction

More info on active projects and modules at dat-ecosystem.org


Dat

npm install -g dat

Use dat command line to share files with version control, back up data to servers, browse remote files on demand, and automate long-term data preservation.

dat was the first application built on the Hypercore Protocol, and it drove the protocol's architectural design through iterative development between 2014 and 2017. A large community has grown around it.

Have questions? Join our chat via IRC or Gitter:

  • #dat IRC channel on freenode
  • datproject/discussions on Gitter

Thanks to our financial supporters!

OpenCollective


Installation

Dat can be used as a command line tool or a javascript library:

  • Install the $ dat CLI to use in the command line.
  • require('dat') - dat-node, a library for downloading and sharing dat archives in javascript apps.

Installing the $ dat command line tool

The recommended way to install dat is through the single-file binary distribution, using the one-line install command below. The binary includes a copy of Node and dat packaged inside a single file, so you only have to download one file to start sharing data, with no other dependencies needed on your system:

wget -qO- https://raw.githubusercontent.com/datproject/dat/master/download.sh | bash

Next version

Try the next version of dat! This version (14.0.0) is not compatible with older versions (13.x and below) and requires Node v12.

npm install -g dat@next

Maintainers wanted!

NPM Prerequisites

  • Node: You'll need to install Node.js before installing Dat. Dat needs Node version 4 or above and npm installed. You can run node -v to check your version.
  • npm: npm is installed with node. You can run npm -v to make sure it is installed.

Once you have npm ready, install dat from npm with the --global, -g option, npm install -g dat.

Getting started

What is Dat?

Share, backup, and publish your filesystem. You can turn any folder on your computer into a dat. Dat scans your folder, allowing you to:

  • Track your files with automatic version history.
  • Share files with others over a secure peer to peer network.
  • Automate live backups to external HDs or remote servers.
  • Publish and share files with a built-in HTTP server.

Dat allows you to focus on the fun work without worrying about moving files around. Secure, distributed, fast.

Desktop Applications

Rather not use the command line? Check out these options:

  • Beaker Browser - An experimental p2p browser with built-in support for the Hypercore Protocol.
  • Dat Desktop - A desktop app to manage multiple dats on your desktop machine.

JS Library

Add Dat to your package.json, npm install dat --save. Dat exports the dat-node API via require('dat'). Use it in your javascript applications! Dat Desktop and Dat command line both use dat-node to share and download dats.

Full API documentation is available in the dat-node repository on Github.

We have Dat installed, let's use it!

Dat's unique design works wherever you store your data. You can create a new dat from any folder on your computer.

A dat is some files from your computer and a .dat folder. Each dat has a unique dat:// link. With your dat link, other users can download your files and live sync any updates.

Sharing Data

You can start sharing your files with a single command. Unlike git, you do not have to initialize a repository first; dat share (or simply dat) will do that for you:

dat <dir>

Use dat to create a dat and sync your files from your computer to other users. Dat scans your files inside <dir>, creating metadata in <dir>/.dat. Dat stores the public link, version history, and file information inside the dat folder.

dat sync and dat share are aliases for the same command.

Downloading Data

dat dat://<link> <download-dir>

Use dat to download files from a remote computer that is sharing files with Dat. This downloads the files from dat://<link> to your <download-dir>. The command exits after the download completes, but you can update the files later: use dat pull to download new changes once, or dat sync to live sync changes.

dat clone is an alias for the same command.


Misc Commands

A few other highlights. Run dat help to see the full usage guide.

  • dat create or dat init - Create an empty dat and dat.json file.
  • dat log ~/data/dat-folder/ or dat log dat://<key> - view the history and metadata information for a dat.

Quick Demos

To get started using Dat, you can try downloading a dat and then sharing a dat of your own.

Download Demo

We made a demo folder just for this exercise. Inside the demo folder is a dat.json file and a gif. We shared these files via Dat and now you can download them with our dat key!

Similar to git, you can download somebody's dat by running dat clone <link>. You can also specify the directory:

❯ dat clone dat://778f8d955175c92e4ced5e4f5563f69bfec0c86cc6f670352c457943666fe639 ~/Downloads/dat-demo
dat v13.5.0
Created new dat in /Users/joe/Downloads/dat-demo/.dat
Cloning: 2 files (1.4 MB)

2 connections | Download 614 KB/s Upload 0 B/s

dat sync complete.
Version 4

This will download our demo files to the ~/Downloads/dat-demo folder. These files are being shared by a server over Dat (to ensure high availability) but you may connect to any number of users also hosting the content.

You can also view the files online: datbase.org/778f8d955175c92e4ced5e4f5563f69bfec0c86cc6f670352c457943666fe639. datbase.org can download files over Dat and display them over HTTP as long as someone is hosting them. The website temporarily caches data for any visited links (do not view your dat on datbase.org if you do not want us to cache your data).

Sharing Demo

Dat can share files from your computer to anywhere. If you have a friend going through this demo with you, try sharing with them! If not, we'll see what we can do.

Find a folder on your computer to share. The folder can contain anything; Dat can handle all sorts of files (and works with really big folders too!).

First, create a new dat inside that folder. The dat create command also walks you through making a dat.json file:

❯ dat create
Welcome to dat program!
You can turn any folder on your computer into a Dat.
A dat is a folder with some magic.

This will create a new (empty) dat. Dat will print a link; share this link to give others access to view your files.

Once we have our dat, run dat <dir> to scan your files and sync them to the network. Share the link with your friend to instantly start downloading files.

Bonus HTTP Demo

Dat makes it really easy to share live files on an HTTP server. This is a cool demo because we can also see how version history works! Serve dat files over HTTP with the --http option. For example, dat --http serves your files as an HTTP website with live reloading and version history! This even works for dats you're downloading (add the --sparse option to only download files you select via HTTP). The default HTTP port is 8080.

Hint: Use localhost:8080/?version=10 to view a specific version.
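
A minimal session might look like this (a sketch, not verbatim CLI output; it serves the current folder's dat and fetches it with curl, where the ?version query string follows the hint above):

# serve the dat in the current folder over HTTP (default port 8080)
dat --http

# in another terminal: view the latest listing, then the listing at version 10
curl http://localhost:8080/
curl "http://localhost:8080/?version=10"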

Get started using Dat today with the share and clone commands or read below for more details.

Usage

The first time you run a command, a .dat folder is created to store the dat metadata. Once a dat is created, you can run all the commands inside that folder, similar to git.

Dat keeps secret keys in the ~/.dat/secret_keys folder. These are required to write to any dats you create.

Creating a dat & dat.json

dat create [<dir>]

The create command prompts you to make a dat.json file and creates a new dat. Import the files with sync or share.

Optionally bypass Title and Description prompt:

dat create --title "MY BITS" --description "are ready to synchronize! 😎"

Optionally bypass dat.json creation:

dat create --yes
dat create -y

Sharing

The quickest way to get started sharing files is with the dat (or dat share) command:

❯ dat 
dat://3e830227b4b2be197679ff1b573cc85e689f202c0884eb8bdb0e1fcecbd93119
Sharing dat: 24 files (383 MB)

0 connections | Download 0 B/s Upload 0 B/s

Importing 528 files to Archive (165 MB/s)
[=-----------------------------------------] 3%
ADD: data/expn_cd.csv (403 MB / 920 MB)
dat [<dir>] [--no-import] [--no-watch]

Start sharing your dat archive over the network. It will import new or updated files since you last ran create or sync. Dat watches files for changes and imports updated files.

  • Use --no-import to not import any new or updated files.
  • Use --no-watch to not watch the directory for changes. --import must be true for --watch to work (see the example below).
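
For example, combining these flags (a sketch using only the options above):

# share ~/data once, importing changed files but not watching afterwards
dat ~/data --no-watch

# serve ~/data exactly as previously imported, without importing or watching
dat ~/data --no-import --no-watch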

Ignoring Files

By default, Dat will ignore any files listed in a .datignore file, similar to git. Each entry should be on its own line. Dat also ignores all hidden folders and files. Pattern wildcards (/*.png) and directory wildcards (/**/cache) are supported.
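
For example, a .datignore might look like this (a hypothetical sketch; the entries are just placeholders):

/*.png
/**/cache
tmp
node_modules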

Selecting Files

By default, Dat will download all files. If you only want to download a subset, you can create a .datdownload file listing the files and folders to download. Each entry should be on its own line.
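
For example, a .datdownload might look like this (hypothetical paths; the exact path format may vary by dat version):

data/
results/summary.csv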

Downloading

Start downloading by running the clone command. This creates a folder, downloads the content and metadata, and creates a .dat folder inside. Once you've started the download, you can resume it at any time.

dat <link> [<dir>] [--temp]

Clone a remote dat archive to a local folder. This will create a folder with the key name if no folder is specified.

Downloading via dat.json key

You can also clone using a dat.json file. This is useful when combining Dat and git, for example. To clone a dat, specify the path to a folder containing a dat.json:

git clone git@github.com:joehand/dat-clone-sparse-test.git
dat ./dat-clone-sparse-test

This will download the dat specified in the dat.json file.
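
For this to work, the dat.json has to point at the dat you want to download. A minimal sketch might look like the following, assuming the archive key lives in a url field (check a dat.json generated by dat create for the exact fields your version uses; the key below is the demo key from earlier):

{
  "title": "dat-clone-sparse-test",
  "description": "example dataset",
  "url": "dat://778f8d955175c92e4ced5e4f5563f69bfec0c86cc6f670352c457943666fe639"
}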

Updating Downloaded Archives

Once a dat is cloned, you can run either dat pull or dat sync in the folder to update the archive.

dat pull [<dir>]

Download latest files and keep connection open to continue updating as remote source is updated.

Shortcut commands

  • dat <link> <dir> will run dat clone for new dats or resume the existing dat in <dir>
  • dat <dir> is the same as running dat sync <dir>

Key Management & Moving dats

dat keys provides a few commands to help you move or backup your dats.

Writing to a dat requires the secret key, stored in the ~/.dat folder. You can export and import these keys between dats. First, clone your dat to the new location:

  • (original) dat share
  • (duplicate) dat clone <link>

Then transfer the secret key:

  • (original) dat keys export - copy the secret key printed out.
  • (duplicate) dat keys import - this will prompt you for the secret key; paste it in here (see the example session below).
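
An example session might look roughly like this (a sketch; the printed key is elided):

# on the original machine
cd ~/data/my-dat
dat keys export
# copy the long key that gets printed

# on the duplicate machine
cd ~/Downloads/my-dat
dat keys import
# paste the secret key at the prompt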

Troubleshooting

We've provided some troubleshooting tips based on issues users have seen. Please open an issue or ask us in our chat room if you need help troubleshooting and your problem is not covered here.

If you have trouble sharing/downloading in a directory with a .dat folder, try deleting it and running the command again.

Check Your Dat Version

Knowing the version is really helpful if you run into any bugs, and will help us troubleshoot your issue.

Check your Dat version:

dat -v

You should see the Dat semantic version printed, e.g. 14.0.0.

Installation Issues

Node & npm

To use the Dat command line tool you will need to have node and npm installed. Make sure those are installed correctly before installing Dat. You can check the version of each:

node -v
npm -v

Global Install

The -g option installs Dat globally, allowing you to run it as a command. Make sure you installed with that option.

  • If you receive an EACCES error, read this guide on fixing npm permissions.
  • If you receive an EACCES error, you may also install Dat with sudo: sudo npm install -g dat.
  • Have other installation issues? Let us know, you can open an issue or ask us in our chat room.

Debugging Output

If you are having trouble with a specific command, run it with the DEBUG environment variable set to dat (and optionally also dat-node). This will help us debug any issues:

DEBUG=dat,dat-node dat dat://<link> dir

Networking Issues

Networking capabilities vary widely with each computer, network, and configuration. Whenever you run Dat there are several steps to share or download files with peers:

  1. Discovering Peers
  2. Connecting to Peers
  3. Sending & Receiving Data

When things are working, Dat will show Connected to 1 peer after connecting. If you never see a peer connected, your network may be restricting discovery or connections.
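
If your version of the CLI includes it, the doctor command can help diagnose connectivity problems; run it on both machines and follow the prompts (mentioned here as a tip, not guaranteed to be present in every release):

dat doctor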

JS API

You can use Dat in your javascript application:

var Dat = require('dat')

Dat('/data', function (err, dat) {
  // use dat
})

Read more about the JS usage provided via dat-node.
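
For a slightly fuller sketch of the same API (based on the dat-node documentation; see its repository for authoritative usage):

var Dat = require('dat')

Dat('/data', function (err, dat) {
  if (err) throw err

  // import the files in /data into the archive
  dat.importFiles()

  // share the archive over the peer-to-peer network
  dat.joinNetwork()

  // the dat link others can use to clone it
  console.log('dat://' + dat.key.toString('hex'))
})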

For Developers

Please see guidelines on contributing before submitting an issue or PR.

This command line library uses dat-node to create and manage the archives and networking. If you'd like to build your own Dat application that is compatible with this command line tool, we suggest using dat-node.

Installing from source

Clone this repository and in a terminal inside of the folder you cloned run this command:

npm link

This should add a dat command to your PATH. Now you can run the dat command to try it out.

The contribution guide also has more tips on our development workflow.

  • npm run test to run tests
  • npm run auth-server to run a local auth server for testing

License

BSD-3-Clause

dat's People

Contributors

bmpvieira, chuesler, clkao, da2x, ekg, finnp, greenkeeper[bot], jeroen, jjjjw, joehand, joyrexus, juliangruber, karissa, konklone, mafintosh, max-mapper, ninabreznik, okdistribute, optikfluffel, paulfitz, pkafei, podviaznikov, ralphtheninja, rangermauve, rgbkrk, sethvincent, shama, tmcw, waldoj, wking


dat's Issues

Dat help malfunctioning

Dat doesn't like being asked for help.

~   
❯ dat help
Command not found: help

Usage: dat <command> [<args>]

Enter 'dat help' for help


~   
❯ dat -h
Usage: dat <command> [<args>]

Enter 'dat help' for help


~   
❯ 

enable replication by separating mutable from immutable sections

You really can't do effective p2p/master-master replication without having immutable data - but on the other hand, users are used to just editing data - mutable data - just editing text files or fields.

Another tradeoff is that often mutable data is more performant - you don't have to track and recreate state from a chain of immutable appends - you just overwrite what you want to change.

Now - look at git - internally git is immutable - it's based around a content-addressable store of blobs, but unless the user investigates deeply, they don't encounter that. They just check out a snapshot into the mutable section, aka the working directory, where they can work on it with an ordinary text editor.

I think the data schema dat currently has is great for a performant mutable section, but I think it will be difficult to use it for anything more than master-slave replication. It also doesn't have the security of a hash-tree based system. However, it would allow you to make a bunch of edits and then combine them into a commit, much like git.

Then you could send something like a patch file, and apply it on top of another table - this would just look like a csv with some metadata at the top.

Community Data

The intro text to wikidata.org hits the high-level:

Wikidata is a free knowledge base that can be read and edited by humans and machines alike. It is for data what Wikimedia Commons is for media files: it centralizes access to and management of structured data, such as interwiki references and statistical information. Wikidata contains data in every language supported by the MediaWiki software.

After some conversations with folks involved with WikiData, I can see its role on the web in a more sensitive light. There is an on-going movement (formal and informal) to address the observable decline in user base (and dwindling conversion of new users into returning editors) across Wikimedia projects, especially the English Wikipedia.

However, the trajectory of Wikimedia projects is certainly on the rise--its aggregate dataset (semi-structured text and structured-ish data alike) continues to grow. There are 48,794,118 good articles and 20,054,661 images across 200 Wikimedia projects.

WikiData exemplifies another transition for the community-driven knowledge project that sprouted from the English language Wikipedia. The future of the sum of all human knowledge is human- and programmatically-accessible, community-managed, and open-licensed.

The opportunity of WikiData is a basic, but large layer for data that matters to people.

The community around Wikipedia grew out of natural inclinations and interests. As a result, the majority of editors who generate this content are not nearly as diverse a group as we should require. This produces errors of omission, limited views and perspective, and a propagation of some elements of some cultures, but not many other elements, and not many other entire cultures. For instance, there is an English-centric focus of Wikimedia work, and the gender imbalance both in coverage and editors is clearly unacceptable.

We have a chance to be mindful of the prospect of WikiData, and I think even embrace it, plan to work with it, to integrate, and make read-write relationships. We must do this, more than anything, because there are few alternatives for constructing and maintaining data that prioritize community values and common good.

Perhaps dat does not necessarily merge, even at its beginning, with WikiData, but I think the example and its principles should be some good food for thought!

serve multiple dat repositories from a single server

I'm working on an open source Github-for-dat project called DatRepo. The idea is that users will be able to upload, download, search and share datasets easily.

My question is whether there is a vision/roadmap for serving multiple dat repositories from a single machine. I can think of at least three different implementations.

  1. The most trivial approach would be to run a separate instance of the dat server on a unique port for each repository. This has the big advantage of simplicity. However, for a server with hundreds or thousands of repositories I'm not sure that this will scale very well, despite node's light footprint.
  2. A more complex solution would be to have a single dat server per machine, and calls to "dat serve" would simply register the current dat directory with the server in a unique namespace. So "dat serve --namespace my_data" could activate the routes "localhost:6461/my_data/_package", etc. This approach seems scalable and in the ethos of dat to me, and it's my preferred solution.
  3. Alternatively, dat could mimic git's paradigm and not mount a server on the remote at all, instead just piggybacking on SSH and having the local dat client connect to the remote filesystem and synchronize that way. I don't think that this is a good idea.

Are any of these, or an alternative, likely to become part of the dat protocol? Is there a reason why supporting multiple repositories on a single machine is a bad idea?

Plugin overview?

I was wondering if there's an up-to-date overview of the plugins dat currently supports. It'd be nice to know what transformations are already being worked on and what still needs building.

Dat crashes permanently on writing JSON payload.

apex:xx ks$ dat init
Initialized dat store at /private/tmp/xx/.dat
apex:xx ks$ echo '{"hello": "world"}' | dat --json
{"_rev":"1-183575eea1905ee592773f88f96f04a2","_id":"2013-11-18T13:01:10.707Z-77dd86b9"}
apex:xx ks$ echo '{"hello": "world"}' | dat --json

/usr/local/lib/node_modules/dat/lib/commands.js:434
        var key = primaryKeys[i].toString()
                               ^
TypeError: Cannot call method 'toString' of undefined
        at checkRows (/usr/local/lib/node_modules/dat/lib/commands.js:434:32)
        at Stream.onWrite (/usr/local/lib/node_modules/dat/lib/commands.js:468:7)
        at Stream.stream.write (/usr/local/lib/node_modules/dat/node_modules/through/index.js:26:11)
        at Stream.ondata (stream.js:51:26)
        at Stream.EventEmitter.emit (events.js:95:17)
        at drain (/usr/local/lib/node_modules/dat/node_modules/through/index.js:36:16)
        at Stream.stream.queue.stream.push (/usr/local/lib/node_modules/dat/node_modules/through/index.js:45:5)
        at Stream.batch (/usr/local/lib/node_modules/dat/node_modules/byte-stream/index.js:33:23)
        at Stream.stream.write (/usr/local/lib/node_modules/dat/node_modules/through/index.js:26:11)
        at Stream.ondata (stream.js:51:26)

Debugging information:

  • ( primaryKeys, primaryKeys[0], i ) => [ undefined ] undefined 0

Error using dat init --remote=http://oaklandcrime.dathub.org

Working through some of the examples in usage.md and came across this error:

dat init --remote=http://oaklandcrime.dathub.org

/usr/local/lib/node_modules/dat/lib/meta.js:95
schemas.map(function(schema) {
^
TypeError: Object has no method 'map'
at ConcatStream.cb (/usr/local/lib/node_modules/dat/lib/meta.js:95:13)
at ConcatStream.end (/usr/local/lib/node_modules/dat/node_modules/concat-stream/index.js:45:21)
at Transform.onend (/usr/local/lib/node_modules/dat/node_modules/level-version/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:487:10)
at Transform.g (events.js:180:16)
at Transform.EventEmitter.emit (events.js:117:20)
at /usr/local/lib/node_modules/dat/node_modules/level-version/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:924:16
at process._tickCallback (node.js:415:13)

dat-data.com website

Is there any way I can help you with the dat-data.com website?
I'm not sure I can help contribute to the library at this stage, but I'd love to stay involved.

Latest version of a repository

Is there a method to get the latest version of a whole repository, meaning only the latest revision of every row?

I tried to use the dat.createReadStream and dat.get functions, but I was only able to choose the revision for single rows.

So basically I am looking for a dat cat for only the latest revisions. Am I overlooking something, or is this not possible?

idea: layers and column sets

I've recently had to work with some data that seems very much dat-data.

  1. There is a publicly available dataset that is published periodically.
  2. It contains some known errors that are (manually) fixed each time an update is pulled.
  3. That data is combined with another data source, essentially adding extra columns.

Given my data-replication-aware viewpoint, here is how I think you could represent this to make it easy to replicate:

Instead of mixing it all into one spreadsheet and versioning everything, what about layers like in Photoshop? You could have the first import as a spreadsheet on its own, and then have the fixes as another layer laid on top. This table could be very sparse - only rows with updates would be stored, and then only columns that have changes (probably leave nulls in the spaces).

It may be better to overlay the fixes, because then you can still receive updates to the original data.

Then, if you pulled in data from another source, you could set that beside this first 2-layer table (call this a "column set"); it would add a few columns, but share the public key.

Now, you could expose these three combined tables as another table that could be exported, etc.

By separating the bits that change independently, you can greatly simplify replication - replicate the base bulk data, and then the edits. Replicate different sources (which update at different times, independently) as separate things. Instead of having a single commit value (version), you'd have a vector of versions (one for each layer/column set).

how does this sound?

Moving the computation, not the data

This may be completely off base for your project, but I thought I would throw it out there.

What do you think about trying to move computations closer to the data? For example, if I am maintaining a dataset with ocean temperatures, my cluster could also accept a computational payload to run on the data locally. Moving a result across the wire would be much easier than moving the entire dataset.

This of course opens a whole new set of problems, but the problems may be worth the reward.

Dataset licensing metadata

Given that a big part of the use case for dat is enabling easier use, re-mixing, and re-use of the data, it would be great to bake in a convention for declaring data licensing in a machine readable way, similar to node's package.json's license property. Hopefully this will allow both for greater opportunities for data discovery and also create more awareness around licensing issues from both the publisher and consumer end. In npm for example, there is a great community convention around using liberal software licensing like bsd or mit. For non-technical communities, a great example is Flickr, which has long supported adding Creative Commons metadata to photos at upload time and to their advanced search query builder.

/cc @vthunder

CORS support for API endpoints

Doesn't look like CORS headers are added to responses - adding them would make it simpler to implement client-side-only applications that touch dat.

dat init creates useless package.json file

I'm not sure if it's just me, but every time I do dat init, a 'package.json' file is created in the initialized directory with the following contents:

{
  "name": "prompt",
  "version": "0.0.0",
  "license": "BSD"
}

It's obviously not a big deal; I just thought I might as well document this issue.

keep total row count

This should be pretty easy: we just need to hook into anywhere that does CRUD and keep the current row count up to date. We currently don't do this at all.

It would be useful to have around, e.g. for showing accurate progress bars on 'dat clone' from the _changes feed. Right now we don't actually know how many total rows there are.

develop a beta version

Just an update: the Knight Foundation are taking their sweet time and haven't started paying me yet. When they get their paperwork etc. done I'll start working on this; I have a 6-month window that starts whenever they are ready.

Add request to package.json

I did "npm install dat -g" and the request module was missing.

npm install dat -g
...
dat
module.js:340
    throw err;
          ^
Error: Cannot find module 'request'
...

npm install request -g
...
dat
Usage: /usr/local/bin/dat <command> [<args>]

common data structure for tabular data

Today I was looking for an npm module that could give me a linear regression. Of course, there were several, but none of them agreed on the input format... array per column, or should it be an array of rows?

To get good code reuse, it's necessary to have shared structures - like streams, or callbacks. levelup and voxel.js both build on a shared base that is consistent.

I'm thinking, it's probably gotta be a stream of rows.

Where, probably, each row is an array... and maybe there's some way to look up headers? Then we could have other modules that create subtables, computed columns, or aggregates.

While this isn't necessarily part of dat, it would be good to have something that works well with dat, since dat is already aiming to be modular.

Error: Cannot find module 'read-installed'

Error log from terminal (I'm on OSX Mavericks):

Chewbacca:foo $ dat init
module.js:340
    throw err;
    ^
Error: Cannot find module 'read-installed'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/usr/local/lib/node_modules/dat/lib/backend.js:2:21)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)

I imagine this is user error so I'd love any help you could provide.

Having problem on installation with leveldb

I get the following error message when installing dat:

npm ERR! [email protected] install: `node-gyp rebuild`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the [email protected] install script.
npm ERR! This is most likely a problem with the leveldown-hyper package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     node-gyp rebuild
npm ERR! You can get their info via:
npm ERR!     npm owner ls leveldown-hyper
npm ERR! There is likely additional logging output above.

npm ERR! System Darwin 13.0.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "-g" "dat"
npm ERR! cwd /Users/yalamber
npm ERR! node -v v0.10.23
npm ERR! npm -v 1.3.17
npm ERR! code ELIFECYCLE
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /Users/yalamber/npm-debug.log
npm ERR! not ok code 0

TODO for dat alpha (first stable version)

Most of my work over the last 2 months has been on low-level stuff; this is a todo list of things that need to get done before I can ship v1:

  • authentication (https://github.com/maxogden/auth-socket, http basic auth for starters)
  • push replication (right now only pull is implemented, waiting on auth)
  • CLI commands should fall back to networked protocol if DB is locked (e.g. if a server is running)
  • use protobuf for on-disk storage format
  • solidify schema semantics - e.g. are sparse objects allowed, should schema be explicit/implicit
  • make sure dat installs on windows okay. closed via c7ace7f
  • make sure there's progress feedback on all operations (import, push, pull)
  • generic encode/decode API #85
  • blob/attachment storage support
  • binary push/pull/clone #110
  • make the dat api (mostly) levelup compatible
  • dat init should prompt + make a dat.json file (@mafintosh is working on this)
  • integrate with transformer (data transformations) #93
  • store remotes (@mafintosh)
  • basic web GUI with that lets you CRUD (https://github.com/maxogden/dat-editor)
  • JS API documentation (https://github.com/maxogden/dat/blob/master/docs/api.md)
  • HTTP API documentation
  • binary readStream (@mafintosh)
  • load backends from dat.json on startup (#125)
  • run listen hook from dat.json (#125)
  • skim blobs (@maxogden)
  • stats API (@mafintosh)
  • review stream destroy/cleanup
  • review error handling
  • usage docs and guides (#132) (@maxogden)
  • rewrite readme (@maxogden)
  • new website (dat-ecosystem-archive/dat-data.com#6) (@maxogden)
  • review stdout/stderr output (@maxogden)
  • add docs for transformations to js-api (@maxogden)
  • gzip replication streams (@mafintosh)
  • read existing dat.json during dat init (dont delete it) (#149)
  • abtract out skim-blobs into a generic dat-remote-blobs module (@mafintosh)
  • decide on how to pass CLI args to dat constructor (9e9dd2d#commitcomment-7419170)
  • make a 'deploy dat on heroku' button demo that works with postgres + sqldown (@maxogden)

(I'll check off boxes as I complete them)

pilot dataset/use cases

describe and link to datasets that I can use as a way to pilot/test out dat!

via @loleg:

  • extracting financial data from 26 Swiss cantons and their 2500 administrational units (cities etc) - for instance, over the weekend I hacked on a command line PDF converter which I just posted to Scraperwiki
  • scraping the national law database into a reasonably parseable format that allows semantic annotation (i.e. NLP + legal ontology) and visualization, one of several ideas for a legal hackday I'm organizing
  • collating personal data from participants in an open (as in, the door is ajar) loyalty program and packaging it up for social science researchers in a way that the corporates get the oversight they need to not shut us down
  • creating an activity feed of the dozens of projects we've started in our Open Data community, combining social/site metrics with GitHub, wiki, forum updates
  • setting up a grassroots (p2p?) datastore that people can chuck ideas into / collaborate à la (the now defunct) BuzzData, build their APIs against before moving somewhere more stable, basically the idea for http://datalets.ch -- must be easy enough to use to bring into schools, teach people coding with

others:

How does the source data fit into the bigger picture?

Hey, everyone. I'm pretty excited about the goals of dat. I'd like to ask a few questions. I'm curious how the original source data fits into the bigger picture after it has been "imported/synced" with dat. First of all, is the data actually copied into the dat store or is it only identifiers?

As an example, I have a simple csv file: teams.csv

team, location
Trail Blazers, "Portland, OR"
Warriors, "Oakland, CA"
Bulls, "Chicago, IL"
Spurs, "San Antonio, TX"

And now I initialize a dat store, and import/sync it with dat.

dat┋ ᴥ dat init
Initialized empty dat store at /Users/justin/dev/me/data/dat/.dat
dat┋ ᴥ cat teams.csv | dat --csv         
{"_rev":1,"_seq":2,"_id":"2013-10-29T21:52:28.205Z-77e98a13"}
{"_rev":1,"_seq":4,"_id":"2013-10-29T21:52:28.208Z-eeb8bc52"}
{"_rev":1,"_seq":5,"_id":"2013-10-29T21:52:28.209Z-83f08f9f"}
{"_rev":1,"_seq":6,"_id":"2013-10-29T21:52:28.210Z-32fe4272"}

And here what it looks like in dat

dat┋ ᴥ dat dump
{"key":"ÿdÿ2013-10-29T21:52:28.205Z-77e98a13ÿrÿ01ÿsÿ02","value":"\rTrail Blazers\u000f \"Portland, OR\""}
{"key":"ÿdÿ2013-10-29T21:52:28.208Z-eeb8bc52ÿrÿ01ÿsÿ04","value":"\bWarriors\u000e \"Oakland, CA\""}
{"key":"ÿdÿ2013-10-29T21:52:28.209Z-83f08f9fÿrÿ01ÿsÿ05","value":"\u0005Bulls\u000e \"Chicago, IL\""}
{"key":"ÿdÿ2013-10-29T21:52:28.210Z-32fe4272ÿrÿ01ÿsÿ06","value":"\u0005Spurs\u0012 \"San Antonio, TX\""}
{"key":"ÿdÿmetaÿrÿ01ÿsÿ01","value":"{\"_id\":\"meta\",\"created\":\"2013-10-29T21:51:41.125Z\",\"columns\":[],\"_rev\":1,\"_seq\":1}"}
{"key":"ÿdÿmetaÿrÿ02ÿsÿ03","value":"{\"_id\":\"meta\",\"_rev\":2,\"_seq\":3,\"created\":\"2013-10-29T21:51:41.125Z\",\"columns\":[\"team\",\" location\"]}"}
{"key":"ÿsÿ01","value":"1,meta,1"}
{"key":"ÿsÿ02","value":"2,2013-10-29T21:52:28.205Z-77e98a13,1"}
{"key":"ÿsÿ03","value":"3,meta,2"}
{"key":"ÿsÿ04","value":"4,2013-10-29T21:52:28.208Z-eeb8bc52,1"}
{"key":"ÿsÿ05","value":"5,2013-10-29T21:52:28.209Z-83f08f9f,1"}
{"key":"ÿsÿ06","value":"6,2013-10-29T21:52:28.210Z-32fe4272,1"}
dat┋ ᴥ dat cat
{"_id":"2013-10-29T21:52:28.205Z-77e98a13","_rev":1,"_seq":2,"team":"Trail Blazers"," location":" \"Portland, OR\""}
{"_id":"2013-10-29T21:52:28.208Z-eeb8bc52","_rev":1,"_seq":4,"team":"Warriors"," location":" \"Oakland, CA\""}
{"_id":"2013-10-29T21:52:28.209Z-83f08f9f","_rev":1,"_seq":5,"team":"Bulls"," location":" \"Chicago, IL\""}
{"_id":"2013-10-29T21:52:28.210Z-32fe4272","_rev":1,"_seq":6,"team":"Spurs"," location":" \"San Antonio, TX\""}

It appears as if dat has copied the data. Does this mean if I have a 2gb source file, I'll end up with near 4gb of data on disk when it's synced with dat?

Also, what's the process after the initial import? Can I edit the original CSV file (say, add a line) and "sync" it?

Security and authentication for distributed data

I have been doing some thinking recently about security and authentication for distributed databases, particularly when the database is not behind a server but shared on disk, in an offline environment. I am interested in what existing thinking there is behind this and any implementations of a security model.

Global read, restricted write: for several use cases, in a Wikipedia/Openstreetmap type scenario, read access is not a problem but we may want to restrict editing to specific users or moderators. In a fully versioned system, users could sign any edits with a public/private key pair based on a hash of the database object edited. You could implement an index that would only look at any edits made by a group of trusted users, based on a collection of trusted public keys.
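
As a rough illustration of the "restricted write" idea above, here is a minimal sketch in Node (using the built-in crypto module's ed25519 support; this is not an existing dat feature, just the concept of signing edits and checking them against trusted public keys):

const crypto = require('crypto')

// a trusted editor generates a key pair; the public key is shared with readers
const { publicKey, privateKey } = crypto.generateKeyPairSync('ed25519')

// sign an edit: hash the edited database object, then sign the hash
function signEdit (editedObject, privateKey) {
  const hash = crypto.createHash('sha256').update(JSON.stringify(editedObject)).digest()
  return crypto.sign(null, hash, privateKey)
}

// anyone can verify an edit against a collection of trusted public keys
function isTrustedEdit (editedObject, signature, trustedPublicKeys) {
  const hash = crypto.createHash('sha256').update(JSON.stringify(editedObject)).digest()
  return trustedPublicKeys.some(function (key) {
    return crypto.verify(null, hash, key, signature)
  })
}

const sig = signEdit({ id: 1, name: 'example' }, privateKey)
console.log(isTrustedEdit({ id: 1, name: 'example' }, sig, [publicKey])) // true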

Private read: this is trickier if anyone has physical access to the files. Here I was thinking of encrypting database objects with public/private keys like PGP email security. How to manage groups though? Like PGP, we could create a random key to encrypt each database object, and then encrypt that with each user in the access group's public key, attached to the DB object. That would make group membership difficult to manage: every object would need to have a new key added if we added a new user. The second option would be a keystore in the DB of keys for each access group which are used to encrypt objects, and each group key is encrypted with each user in the group's public key. The security issue here is that any user could decrypt the key and share it, and give access to all the objects to anyone else, but this is the same as passwords. Changing the keys would be very difficult though: every object would need to be re-encrypted in every copy of the database.

Indexing would be an issue, but a local synced copy of the database would not necessarily need to be encrypted in the same way: it could store an unencrypted version of everything the user has access to within an encrypted volume and index locally.

The ultimate aim of this is a security model that would work without a central server with security through encryption when everyone potentially has access to the data being shared. Have others been thinking about this and tried to implement something similar?

When I say database I am thinking about static CSV/JSON files, git stores, leveldb disk stores, and couchDB amongst others.

I should add that I am not a security expert in any way and I am new to all this, so please excuse any incorrectly used terminology or concepts.

Fields get mixed up

When I use an empty repo and import first a record without a primary key and then one with a primary key, the fields seem to get mixed up. Note in the 'dat cat' output how the value for author ends up in the field 'hello'. I was doing this on Windows, so that could cause trouble. Then again, these operations do not look very platform-specific.

    >dat init
    >echo {"hello": "world"} | dat --json
    >dat cat
    {"_id":"2014-04-09T06:54:07.124Z-bc6adc19","_rev":"1-183575eea1905ee592773f88f96f04a2","hello":"world"}

    >type ..\books.json
    {"isbn":12345,"title":"dat for Dummies","author":"Someone or other"}
    {"isbn":12346,"title":"dat unleashed","author":"Someone else"}
    >type ..\books.json | dat --json --primary=isbn
    {"_id":"12345","_rev":"1-965def4a5d75da65947a3eecf6092a5b"}
    {"_id":"12346","_rev":"1-6a7979db45fcfc39a1b414e91f10e40d"}

    >dat cat
    {"_id":"12345","_rev":"1-965def4a5d75da65947a3eecf6092a5b","hello":"Someone or other","isbn":"12345","title":"dat for Dummies"}
    {"_id":"12346","_rev":"1-6a7979db45fcfc39a1b414e91f10e40d","hello":"Someone else","isbn":"12346","title":"dat unleashed"}
    {"_id":"2014-04-09T06:54:07.124Z-bc6adc19","_rev":"1-183575eea1905ee592773f88f96f04a2","hello":"world"}

    >dat dump
    {"key":"config","value":"{\"columns\":[\"hello\",\"author\",\"isbn\",\"title\"]}"}
    {"key":"ÿcÿ12345","value":"01-965def4a5d75da65947a3eecf6092a5b"}
    {"key":"ÿcÿ12346","value":"01-6a7979db45fcfc39a1b414e91f10e40d"}
    {"key":"ÿcÿ2014-04-09T06:54:07.124Z-bc6adc19","value":"01-183575eea1905ee592773f88f96f04a2"}
    {"key":"ÿdÿ12345ÿ01-965def4a5d75da65947a3eecf6092a5b","value":"\u0010Someone or other\u0000\u000512345\u000fdat for Dummies"}
    {"key":"ÿdÿ12346ÿ01-6a7979db45fcfc39a1b414e91f10e40d","value":"\fSomeone else\u0000\u000512346\rdat unleashed"}
    {"key":"ÿdÿ2014-04-09T06:54:07.124Z-bc6adc19ÿ01-183575eea1905ee592773f88f96f04a2","value":"\u0005world"}
    {"key":"ÿsÿ01","value":"[1,\"2014-04-09T06:54:07.124Z-bc6adc19\",\"1-183575eea1905ee592773f88f96f04a2\"]"}
    {"key":"ÿsÿ02","value":"[2,\"12345\",\"1-965def4a5d75da65947a3eecf6092a5b\"]"}
    {"key":"ÿsÿ03","value":"[3,\"12346\",\"1-6a7979db45fcfc39a1b414e91f10e40d\"]"}

swapping/installing new backends dynamically

If it's possible to make the backends 'hot swappable' we should do it. It could be behind an admin API. The main goal here is the use case where you deploy dat using the default backend, leveldown, and later decide you want to run leveldown-hyper for faster clones. (leveldown-hyper only runs on Linux/Mac, so it isn't the default.)

right now the backend code assumes a dat.close() and new Dat() cycle will happen, but we can probably remove this requirement

Synchronization

(It might be related to #5.) I'm having some doubts about sync, and it may be because this is just starting and I don't know much. 😄

From README:

dat will provide a pluggable API for synchronization so that plugins can be written to export and import remote data efficiently from existing databases.

  1. Should we define sync before, after, or at the same time as defining transformations?
  2. How can we datize a project which we don't own? This is probably going to be a common case, considering N sources.
  3. How should it behave for non-versioning databases?
  4. Considering pagination and no source versioning, how can we know whether old rows have had an update?
  5. Considering no pagination and no source versioning, will it have to "download" the complete source to know what changed?
  6. Either way (following from 3, 4 and 5), how will we define the "pluggability" of the API that dat will provide?

Idea: dat as a data retrieval tool

The idea is rather vague and high-level, I would like to discuss it with you.

Almost all scientific experiments require input data. Consider a script that counts words in a large collection of text. In the example below the SWDA is used:

$ count-words --corpus /PATH_TO_SWDA/
100000

However, it's the responsibility of the user to

  • obtain the corpus
  • put it in the right place
  • make sure that the corpus files are not changed

To solve the issues, it would be nice to have the following interface:

$ count-words --corpus dat://some.domain/SWDA
100000

Then I expect that the files are

  • pulled from the server if they are absent
  • checked for integrity

Under the hood, I would expect the SWDA files to be put somewhere the script can access them.

Is it something dat could support? Are there other tools that solve the issue?

Error: Cannot find module 'ldjson-stream'

Vanilla install appears to be missing a dependency. After installing the missing module it works fine.

~/foo: dat init

module.js:340
    throw err;
    ^
Error: Cannot find module 'ldjson-stream'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/Users/mwillhite/Projects/dat/lib/commands.js:19:11)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)

Consider atomic rather than row data

A fundamental issue for any "git for data" is the core data structure used to store the data itself. It appears that you intend to use rows as your fundamental unit of data and are thinking of data as being "tabular". I may have misunderstood your plans.

While this is perfectly reasonable and even normal I would like to suggest that you consider using "atomic" data as your fundamental unit. For example with numerical data your unit of data would be a single number that is indexed to uniquely identify it rather than an entire row of numbers with a corresponding set of indexes.

While this may seem like a minor change, it has many implications, both positive (power) and negative (complexity), as the entire system is built on top of this data structure. Git, at least as it appears to the user, is built on files and in particular lines of files (diffs). While this works great for code it is terrible for data. Change a single number and the entire row is flagged as changed, leaving you to search for the change. Add a column and the entire file is changed. Even if all the data is exactly the same but formatting is changed slightly, it is as if you are starting from scratch. At least for me this is what makes git unusable for data versioning (yes, I know there are configuration settings that help a bit).

If dat is indeed fundamentally row based it will still suffer from many of the problems that make git so poorly suited for data. If instead dat thought of data as atomic units of indivisible information (a number being the cleanest example), data versioning would only consist of tracking changes (i.e. additions, deletions or updates) to the data itself. This could potentially solve all of these issues by breaking the dependence on the format (i.e. representation) of the data.

This is a quite complex issue with many subtle and not so subtle implications - a full discussion would be many pages. I hope this rather abstract comment will trigger some consideration of how best to both structure and think about the underlying data that dat will store. I would be glad to help work thru the many issues involved and perhaps at the end of the day rows are the best way to go.

This is an important but difficult project. Thanks for undertaking it. You have my full support. I hope I can be of help.

Use the Redis protocol

Lately I've been trying to optimize raw tabular file parsing + dat import speed.

A command line pipe chain like this would be ideal (because unix philosophy):

curl http://some-website.com/huge_data.csv | bcsv | dat

(where huge_data.csv is a CSV larger than RAM and bcsv is the binary-csv, but the concept of piping data into dat will be useful for any streamable tabular data format)

The problem lies in the pipe between bcsv and dat. By default bcsv parses rows out of raw CSV data. This means a CSV like this:

a,b,c
1,2,3
"Once upon 
a time",5,6
7,8,9

Gets turned into 4 rows, like this:

a,b,c
1,2,3
"Once upon 
a time",5,6
7,8,9

Note that if I were to simply split on newlines then I would end up with 5 rows, which is incorrect (due to CSV rules regarding double quoting and newlines).

When bcsv parses a line it emits a buffer (binary chunk of memory) containing the data for the line. On the command line in most Unixes when you pipe one process into another the buffers get, well, buffered. This means that when you do process.stdout.write(buffer) twice, you might only receive a single buffer on process.stdin in the receiver (depending on the size of the buffers and the OS pipe buffer size, etc).

So all of the parsing that bcsv did is basically useless if you rely on unix pipes to transfer buffers between processes.

This is where the redis protocol comes in. It's a lightweight framing protocol, meaning if you want to send 4 chunks of data, you first say 'there are 4 items', then 'here's the first one, it is 24 bytes long', etc. Framed protocols are a nice alternative to e.g. newline based protocols, which require that you scan every byte of the incoming data (checking for newlines) to know when to terminate. Here's a trivial redis protocol example:

*2
$5
hello
$2
hi

If bcsv serialized its buffers into the redis protocol, then dat could parse the buffers efficiently on the receiving end. I haven't implemented or benchmarked this, but I'm already quite confident that it won't add too much computational overhead/slowdown.
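
To make the framing concrete, here is a minimal sketch of encoding row buffers into this kind of frame (illustrative only; the real serialization would live in bcsv/dat):

// encode an array of row Buffers as a RESP-style multi-bulk frame:
// *<count>\r\n, then $<length>\r\n<bytes>\r\n for each row
function encodeRows (rows) {
  var parts = [Buffer.from('*' + rows.length + '\r\n')]
  rows.forEach(function (row) {
    parts.push(Buffer.from('$' + row.length + '\r\n'))
    parts.push(row)
    parts.push(Buffer.from('\r\n'))
  })
  return Buffer.concat(parts)
}

console.log(encodeRows([Buffer.from('hello'), Buffer.from('hi')]).toString())
// prints the *2 / $5 / hello / $2 / hi frame shown above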

Some notes:

  • The "redis protocol" is just the framing logic, the actual redis commands are different. The protocol is agnostic of the commands.
  • Not inventing my own framing protocol would be awesome. And there are lots of tools that already speak the redis protocol
  • There are already a couple of pretty fast pure JS redis protocol serializers and parsers
  • The protocol itself defines both request and reply semantics, so it might be useful for the dat replication API (haven't thought about this enough yet to comment on it)

Feedback/ideas welcome!

After init, dat help hangs

I can run dat help with no issues when there's no repo, but when there is one it hangs:

$ dat init
Initialized dat store at /private/var/folders/9n/d624fmq16g1_p5cs4wlfd8j80000gn/T/tmp.lEJ6uXUAp4/.dat
$ dat help
<prints help message, hangs>

This is with 2.9.0 installed from npm.

Missing dependency 'read-installed'

After a fresh install via npm I get the following error

➜  dat  dat init

module.js:340
    throw err;
    ^
Error: Cannot find module 'read-installed'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/usr/local/lib/node_modules/dat/lib/backend.js:2:21)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)

createWriteStream options

for the supported input formats I have the following options implemented so far in .createWriteStream:

  • primary - specify primary key (default is _id, gets generated for you if not exists)
  • format - specify input format (csv, json, objects, buff)
  • overwrite - whether or not to overwrite existing rows by primary key
  • columns - specify column names at import time; otherwise they will be taken from package.json, or parsed out of the data if the data has headers (e.g. csvs do, buff doesn't)

I'm thinking it would be nice if:

  • primary can take a list of primary keys that will be used as a composite key (maybe _id is just a hash of the primary columns)

what other table loading options would be nice to have?

Adherence to W3C PROV standard

The whole process of dealing with data in dat should conform to the PROV [1] standard by W3C.
Transformation and syncing in dat should conform to the PROV Data Model [2].
Data history should be made exportable into PROV XML [3] and RDF [4] formats to enhance interoperability and ease the documentation of provenance metadata.

I think dat has the potential to be the essential tool to handle data and at the same time automatically track provenance in a standardized way.

[1] http://www.w3.org/TR/prov-overview/
[2] http://www.w3.org/TR/prov-dm/
[3] http://www.w3.org/TR/prov-xml/
[4] http://www.w3.org/TR/prov-o/

Dat installation fails on windows

When installing Dat using Node v0.10.25 on Windows 7, I received the following errors when running npm install and npm link in the dat directory:
https://gist.github.com/psgs/10334618

According to line 23 of the first error (gyp ERR! stack Error: Can't find Python executable "python", you can set the PYTHON env variable.) and line 17 of the second error, the error may be due to Python not being added as an environment variable.

On line 41 of the first error (npm ERR! This is most likely a problem with the leveldown package,) and line 35 of the second error, the error may be due to the leveldown package.

I tried creating a PYTHON environment variable, then ran npm install again and the following error occurred:
https://gist.github.com/psgs/10334836

This issue may be related to issue #29

.tsv or .dsv as input formats?

I'm working with a corpus of annotated speech utterances. The transcribed utterances often contain commas and nested quoting. As a result, we typically use tabs as value separators when dumping from sqlite. I was hoping to test out dat as a vehicle for replicating and syncing our dataset among a small set of editors, but I'm having a hard time figuring out how to import TSV data.

usage.md indicates that the CLI includes an optional -d or --delimiter option for specifying CSV line delimiters, but nothing for delimiting values. Since dat relies on binary-csv for CSV parsing I'm guessing this could be supported. If so, would you be open to replacing the --delimiter option with --separator and --newline options, so that these two types of delimiters could be specified/distinguished?

In sum, it would be great if dat could support arbitrary delimiter-separated values. I'd be happy to venture into a first-ever pull request if you're open to the idea.

what if csv had a header with metadata?

For the stuff I've been thinking about recently, like diffing, merging and joining CSV, it would be way easier if CSV had a simple header - not just column names. I know an hcsv file could not be opened in Excel, although it would be very easy to remove the header.

Just something like this - a line of --- to demarcate the header?

----------
HEADER
----------
CSV

Then you could have a patch file, and the header would just say which version (hash) of the previous file it patched. You could have units for each column. You could have types for each column. You could specify which combination of columns makes the primary key. You could also have this as a separate file and then load that metadata as a CLI option in all the tools, but sometimes it would be much simpler to just have a header in the CSV stream.

The header could be human readable, and you could easily have unix tools to parse it out.

You could also put JSON inside the header - although it might be better to use INI instead, just because that has a similar legacy to CSV, so it fits better...
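
A sketch of what such a header might look like, using the INI-ish idea above (purely illustrative; every field name and value here is a made-up placeholder):

----------
patches = <hash of the previous version of the file>
primary = isbn, edition
units.price = USD
types.isbn = integer
----------
isbn,edition,title,price
12345,2,dat for Dummies,20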

Installing dat 4.1.1 from npm bails on folder-backup@'^2.0.0' dependency

I'm trying to install it on OS X Mavericks. From my npm-debug.log:

1143 error Error: No compatible version found: folder-backup@'^2.0.0'
1143 error Valid install targets:
1143 error ["0.0.1","0.0.2","1.0.0","2.0.0"]
1143 error     at installTargetsError (/usr/local/Cellar/node/0.10.5/lib/node_modules/npm/lib/cache.js:685:10)

I expected ^2.0.0 to match 2.0.0, which is obviously the latest release. Doth npm protest too much, or what?
