
couchimport's Introduction

couchimport

Introduction

When populating CouchDB databases, the source data is often a file of JSON documents, or structured CSV/TSV data exported from another database.

couchimport is designed to help import such data into CouchDB efficiently. Simply pipe a file full of JSON documents into couchimport, telling it the URL and database to send the data to.

Note: couchimport used to handle the CSV to JSON conversion itself, but this is now handled by csvtojsonlines, keeping this package smaller and easier to maintain. The couchimport 1.x line is the last to support CSV/TSV natively - from 2.0 onwards, couchimport is only for pouring JSONL files into CouchDB.

Also note: the companion CSV export utility (couchexport) is now hosted at couchcsvexport.

Installation

Install using npm or another Node.js package manager:

npm install -g couchimport

Usage

couchimport can read JSON docs (one per line) either from stdin, e.g.

cat myfile.json | couchimport

or from a file whose name is passed as the last command-line parameter:

couchimport myfile.json

couchimport's configuration parameters can be stored in environment variables or supplied as command line arguments.

Configuration - environment variables

Simply set the COUCH_URL environment variable, e.g. for a hosted Cloudant database:

export COUCH_URL="https://myusername:mypassword@myhost.cloudant.com"

and define the name of the CouchDB database to write to by setting the COUCH_DATABASE environment variable e.g.

export COUCH_DATABASE="mydatabase"

Then simply pipe the JSONL data into couchimport:

cat mydata.jsonl | couchimport

Configuration - command-line options

Alternatively, supply the --url and --database options as command-line parameters:

couchimport --url "http://user:password@localhost:5984" --database "mydata" mydata.jsonl

or pipe the data in via stdin:

cat mydata.jsonl | couchimport --url "http://user:password@localhost:5984" --database "mydata" 

Handling CSV/TSV data

We can use another package, csvtojsonlines, to convert CSV/TSV files into a JSONL stream acceptable to couchimport:

# CSV file ----> JSON lines ---> CouchDB
cat transactions.csv | csvtojsonlines --delimiter ',' | couchimport --db ledger

Generating random data

couchimport can be paired with datamaker to generate any amount of sample data:

# template ---> datamaker ---> 100 JSON docs ---> couchimport ---> CouchDB
echo '{"_id":"{{uuid}}","name":"{{name}}","email":"{{email true}}","dob":"{{date 1950-01-01}}"}' | datamaker -f json -i 100 | couchimport --db people
written {"docCount":100,"successCount":1,"failCount":0,"statusCodes":{"201":1}}
written {"batch":1,"batchSize":100,"docSuccessCount":100,"docFailCount":0,"statusCodes":{"201":1},"errors":{}}
Import complete

or with the template as a file:

cat template.json | datamaker -f json -i 10000 | couchimport --db people

Understanding errors

If we get an HTTP 4xx/5xx response, we know that none of the documents in that request were written to the database. But because couchimport writes data in bulk, a bulk request can return an HTTP 201 response even when not all of the documents were written - some of the document ids may already have been in the database. So the couchimport output includes counts of the documents that were written successfully and the number that failed, together with a tally of the HTTP response codes and individual document error messages:

e.g.

written {"batch":10,"batchSize":1,"docSuccessCount":4,"docFailCount":6,"statusCodes":{"201":10},"errors":{"conflict":6}}

The log line above shows that after the tenth batch of writes, 4 documents have been written successfully and 6 have failed. There were six "conflict" errors, meaning a clash of document id or id/rev combination.
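
For context, each batch is sent to CouchDB as a single bulk request, and the response body reports a per-document outcome. A bulk response containing a mixture of successes and conflicts typically looks something like this (the ids and revs are illustrative):

[
  { "ok": true, "id": "doc1", "rev": "1-967a00dff5e02add41819138abb3284d" },
  { "id": "doc2", "error": "conflict", "reason": "Document update conflict." }
]

This is how docSuccessCount, docFailCount and the errors tally in the log line are derived.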

Parallel writes

Older versions of couchimport supported having multiple HTTP requests in flight at any one time, but the new, simplified couchimport does not. To achieve the same thing, simply split your file of JSON docs into smaller pieces and run multiple couchimport jobs:

# split the large file into files of 1m lines each
# this will create files xaa, xab, xac etc
split -l 1000000 massive.txt
# find all files starting with x and using xargs,
# spawn a maximum of 2 processes at once running couchimport,
# one for each file
find . -name "x*" | xargs -t -I % -P 2 couchimport --db test %
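
The same effect can be achieved from Node.js with the programmatic API described below - a minimal sketch, assuming the split files xaa, xab and xac exist in the current directory and that running several imports concurrently is acceptable for your CouchDB instance:

const fs = require('fs')
const couchimport = require('couchimport')

const main = async () => {
  // one couchimport job per split file, all running concurrently
  const files = ['xaa', 'xab', 'xac']
  await Promise.all(files.map((f) => couchimport({
    url: 'http://user:password@localhost:5984',
    database: 'test',
    rs: fs.createReadStream(f)
  })))
}

main()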

Environment variables reference

  • COUCH_URL - the url of the CouchDB instance (required, or to be supplied on the command line)
  • COUCH_DATABASE - the database to deal with (required, or to be supplied on the command line)
  • COUCH_BUFFER_SIZE - the number of records written to CouchDB per bulk write (defaults to 500, not required)
  • IAM_API_KEY - to authenticate using IBM IAM, set IAM_API_KEY to your API key; a bearer token will then be used in the HTTP requests.

Command-line parameters reference

You can also configure couchimport using command-line parameters:

  • --help - show help
  • --url/-u - the url of the CouchDB instance (required, or to be supplied in the environment)
  • --database/--db/-d - the database to deal with (required, or to be supplied in the environment)
  • --buffer/-b - the number of records written to CouchDB per bulk write (defaults to 500, not required)

Using programmatically

In your project, add couchimport to the dependencies of your package.json or run npm install --save couchimport. In your code, require the library with

const couchimport = require('couchimport')

and set your options in an object whose keys are the same as the command-line parameters:

e.g.

const fs = require('fs')
const opts = { url: "http://localhost:5984", database: "mydb", rs: fs.createReadStream('myfile.json') }
await couchimport(opts)

Note: rs is the read stream from which data is read (default: stdin) and ws is the write stream to which output is written (default: stdout).
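
For example, a minimal sketch that reads documents from a file and sends couchimport's progress output to a log file instead of stdout (the file names are illustrative):

const fs = require('fs')
const couchimport = require('couchimport')

const main = async () => {
  await couchimport({
    url: 'http://localhost:5984',
    database: 'mydb',
    rs: fs.createReadStream('myfile.jsonl'), // where the documents are read from
    ws: fs.createWriteStream('import.log') // where the output is written
  })
}

main()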

couchimport's People

Contributors

assafmo, benjspriggs, dependabot[bot], glynnbird, gr2m, greenkeeperio-bot, jason-cooke, jdfitzgerald, jkryspin, lornajane, micmath, rajrsingh, smeans, terichadbourne


couchimport's Issues

Getting Killed after importing about 20k records

Hi,
I'm trying to import a fairly big data set (~3 million entries) by using couchimport, but the process always gets killed for some reason.
Here's my output:

root@leb-01122233:~# cat dailydump.txt | couchimport
******************
 COUCHIMPORT - configuration
   {"COUCH_URL":"http://****:****@127.0.0.1:5984","COUCH_DATABASE":"torrents","COUCH_TRANSFORM":null,"COUCH_DELIMETER":"|"}
******************
Written 500  ( 500 )
Written 500  ( 1000 )
...
Written 500  ( 21500 )
Written 500  ( 22000 )
Written 500  ( 22500 )
Written 500  ( 23000 )
Killed

any ideas?

Limit the requests per second this library makes

When using the Cloudant Lite plan on Bluemix there is a rate limit imposed on customers. Users who exceed that rate of API calls will start receiving HTTP 429 replies.

When importing large data sets, it's best to stick to a maximum API call rate (say 5 per second) to avoid the 429 responses.

Getting writefail events even for successful document inserts

Almost there -- and looking good, except I'm getting a writefail for every document insert, when they're succeeding?

Writefail
{ id: '01532024366550100050901',
rev: '3-d8fe0cefe3cdc9914d3040d425ea84ff' }
Writefail
{ id: '01532329456550100051600',
rev: '3-0321a1818f40d1bbf5a07412718bc15c' }
Writefail
{ id: '01532420316550100015440',
rev: '2-fe828a1c6bfd16ee3963d6b415e58454' }
{ documents: 0, failed: 500, total: 0, totalfailed: 6500 }

But, I checked in Cloudant ... the document inserts are definitely succeeding.

Approach to capture errors when running from the command line?

I'll start by saying that couchimport is blazing fast when importing documents into Cloudant, no complaints there!

I'm calling couchimport from the command line to import large JSON files containing 60k documents per file, or so. Is there a way I can get the utility to tell me which documents failed to write, for example, due to revision conflict errors?

cat inv_1.json | couchimport --db invoices --type json --jsonpath "docs.*"

end event listener registration error

Hi all, I've been working on a tool to identify instances of events registered to the wrong object in uses of some JavaScript event-driven APIs, as part of a research project.
The tool flagged line 66 in includes/preview.js, on the registration of the “end” event.

The reason I believe this is indicative of an error is as follows (from looking at the nodejs http API documentation).
The return of agent.get is an http.ClientRequest. But, “end” is an event on a readable stream, and http.ClientRequest is a writable stream.

Since the argument to the callback passed into agent.get is an http.IncomingMessage, which is a readable stream, then my guess is that the listener for “end” maybe should be registered on this variable instead.
Specifically, I would guess the code should instead be

 agent.get(u, function (rs) {
    rs.on('data', function (d) {
      b = Buffer.concat([b, d])
      if (b.length > 10000) {
        rs.destroy()
        alldone()
      }
    });
    rs.on('end', alldone); // this registration has been moved
  }).on('error', alldone)

Thanks!

"file not found" when using IAM_KEY authentication

Fresh install of couchimport.
Verified all runs well when using id and password authentication.

Trying to use IAM_KEY auth.

  1. export IAM_API_KEY=<value from my cloudant service credentials 'apikey'>
  2. run couchimport

get the following:

couchimport

url : "https://6f0e3c7d-3b09-4fd0-b253-c26d43892ac6-bluemix.cloudantnosqldb.appdomain.cloud"
database : "unlocodes_data"
delimiter : "\t"
buffer : 500
parallelism : 1
type : "jsonl"

Error: ENOENT: no such file or directory, open '/Users/kbiegert/.ccurl/keycache.json'
at Object.openSync (fs.js:457:3)
at Object.readFileSync (fs.js:359:35)
at Object.init (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/node_modules/ccurllib/index.js:21:20)
at Object.getToken (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/includes/iam.js:5:14)
at module.exports (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/includes/writer.js:20:7)
at Object.importStream (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/app.js:22:49)
at Object.<anonymous> (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/bin/couchimport.bin.js:48:15)
at Module._compile (internal/modules/cjs/loader.js:1157:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1177:10)
at Module.load (internal/modules/cjs/loader.js:1001:32) {
errno: -2,
syscall: 'open',
code: 'ENOENT',
path: '/Users/kbiegert/.ccurl/keycache.json'
}
events.js:298
throw er; // Unhandled 'error' event
^

Error [ERR_METHOD_NOT_IMPLEMENTED]: The _transform() method is not implemented
at Transform._transform (_stream_transform.js:166:6)
at Transform._read (_stream_transform.js:191:10)
at Transform._write (_stream_transform.js:179:12)
at doWrite (_stream_writable.js:441:12)
at writeOrBuffer (_stream_writable.js:425:5)
at Transform.Writable.write (_stream_writable.js:316:11)
at Transform.ondata (_stream_readable.js:714:22)
at Transform.emit (events.js:321:20)
at addChunk (_stream_readable.js:294:12)
at readableAddChunk (_stream_readable.js:275:11)
Emitted 'error' event on Transform instance at:
at errorOrDestroy (internal/streams/destroy.js:108:12)
at Transform.onerror (_stream_readable.js:746:7)
at Transform.emit (events.js:321:20)
at errorOrDestroy (internal/streams/destroy.js:108:12)
at onwriteError (_stream_writable.js:456:5)
at onwrite (_stream_writable.js:483:5)
at Transform.afterTransform (_stream_transform.js:98:3)
at Transform._transform (_stream_transform.js:166:3)
at Transform._read (_stream_transform.js:191:10)
at Transform._write (_stream_transform.js:179:12) {
code: 'ERR_METHOD_NOT_IMPLEMENTED'
}

Confused about JSONPath parameter

Want to create a test database.
Production database is ~ 3M rows (Cloudant)
I query out my random sample with a _view, and use ?include_docs=true with ccurl piped to a file.

Now I got JSON like this:

{"total_rows":10000,"offset":0,"rows":[
   {"id":" 61001",
    "key":" 61001",
    "value":1,
    "doc": {
       "_id":"61001",
       "_rev":"7-7b34fbeadcb40c5c7034cd3628da5d7c",
       "field1":""}
   },

etc.

So, to import that back up into a new Cloudant database, doing this:

export COUCH_DATABASE="new-db-small"
export COUCH_PARALLELISM=10
export COUCH_FILETYPE=json

export COUCH_JSON_PATH="rows.*.doc"
cat ./data/tickets_small.json | couchimport

and I'm not finding the documents in the source json. I tried:
  • rows.* - was wrong, got the id, key, value, doc properties
  • .rows[*].doc - which should work if it was a JSONPath per Stefan Goessner, I think
  • and a bunch of trial and error.
So, what am I missing? I'm certain you've done exactly this, considering the output of a view.

Commas not escaped on export

When using couchexport with a , as a delimiter, the commas inside the fields are not escaped, and therefore the CSV is broken. Usually CSV files wrap fields containing commas in double quotes.

Using couchimport v. 0.7.0

Export custom data from object

{ "_id": "d4ebc3b9397ba3faaacde2bfb80089e7", "_rev": "2-69b7ea5b6dd7495ff3d26a0fc5630825", "personal": { "firstName": "Dan", "lastName": "Denney", "dob": "12/26/87", "gender": "male" }, "work": { "workTitle": "Software Engineer", "workCompany": "Code School" }, "social": { "website": "dandenney.com", "emailId": "[email protected]", "twitter": "dandenney1", "facebook": "dandenney" }, "address": { "houseNumber": "52", "streetName": "main street", "city": "lodz", "country": "poland" } }

Want to export simple object with attributes firstName, lastName & company
We created transform.js
// example transformation function
// -- remove leading and trailing quotes
var x = function(doc) {
  doc.firstName = doc.personal.firstName;
  doc.lastName = doc.personal.lastName;
  doc.company = doc.work.workCompany;
  delete doc.work;
  delete doc.social;
  delete doc.address;
  return doc;
}
module.exports = x;

Command used
couchexport --url http://localhost:5984 --database tomtomdemo --delimiter "," --transform "D:\Places\optimus\Tools\couchDbTools\transform.js" > test_1.csv

Result: not as expected
_id,personal,work,social,address
d4ebc3b9397ba3faaacde2bfb80089e7,[object Object],[object Object],[object Object],[object Object]

Not getting the writefail event as expected

I'm passing in a file I know should be causing document conflicts, but I'm receiving the written event instead of the writefail. Any thoughts?

couchimport.importFile(jsonFile, opts, function(err, data) {
  console.log("Imported file: " + jsonFile, err, data);
}).on("writefail", function(data) {
  console.log(data);
  console.log("Writefail");
}).on("written", function(data) {
  console.log(data);
  console.log("Written");
}).on("writeerror", function(data) {
  console.log(data);
  console.log("WriteError");
});

{ documents: 2, total: 2 }
Written
Imported file: /dev/shm/SOE/inv_test.json null { total: 2 }

IMPORT BIGDATA

I'm trying to load a large amount of data (a 100GB CSV file), using the --parallelism and --buffer parameters, but I cannot improve the loading time.
Can you help me solve this, or how can I make it consume more RAM?

Crashing Consistently

Loading a JSON file to Cloudant. Any file throws an error around the 117,000 record mark.

Posting the console dump here:

<--- Last few GCs --->

35952 ms: Scavenge 1399.3 (1458.1) -> 1399.3 (1458.1) MB, 1.1 / 0 ms (+ 1.0 ms in 1 steps since last GC) [allocation failure] [incremental marking delaying mark-sweep].
36882 ms: Mark-sweep 1399.3 (1458.1) -> 1398.9 (1458.1) MB, 930.4 / 0 ms (+ 1.0 ms in 1 steps since start of marking, biggest step 1.0 ms) [last resort gc].
37793 ms: Mark-sweep 1398.9 (1458.1) -> 1398.9 (1458.1) MB, 910.4 / 0 ms [last resort gc].

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x30eacf1e3ac1
2: write [/usr/local/lib/node_modules/couchimport/node_modules/jsonparse/jsonparse.js:~80] [pc=0x38b956c216c0](this=0x1cf13944ecc1 <a Parser with map 0x288e820a8131>,buffer=0xdf746b4d31 <an Uint8Array with map 0x288e82005759)
3: /* anonymous */ [/usr/local/lib/node_modules/couchimport/node_modules/JSONStream/index.js:~20] [pc=0x38b956ca4417] (this=0x1cf13944ee19 <a Stream with map 0x2...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Abort trap: 6

Add a version switch

Allow users to check the tool is installed without actually running the import piece. I expected to be able to do couchimport --version to check that the tool was working and what version I had. Could we add this?
