
couchimport's Introduction

couchimport

Introduction

When populating CouchDB databases, the source data is often a file of JSON documents, or structured CSV/TSV data exported from another database.

couchimport is designed to help import such data into CouchDB efficiently. Simply pipe a file full of JSON documents into couchimport, telling it the URL and database to send the data to.

Note: couchimport used to handle the CSV to JSON conversion itself, but this is now handled by csvtojsonlines, keeping this package smaller and easier to maintain. The couchimport 1.x line is the last to support CSV/TSV natively - from 2.0 onwards, couchimport is only for pouring JSONL files into CouchDB.

Also note: the companion CSV export utility (couchexport) is now hosted at couchcsvexport.

Installation

Install using npm or another Node.js package manager:

npm install -g couchimport

Usage

couchimport can read JSON docs (one per line) either from stdin, e.g.

cat myfile.json | couchimport

or from a file whose name is passed as the last command-line parameter:

couchimport myfile.json

couchimport's configuration parameters can be stored in environment variables or supplied as command line arguments.

Configuration - environment variables

Simply set the COUCH_URL environment variable, e.g. for a hosted Cloudant database:

export COUCH_URL="https://myusername:mypassword@myhost.cloudant.com"

and define the name of the CouchDB database to write to by setting the COUCH_DATABASE environment variable e.g.

export COUCH_DATABASE="mydatabase"

Then simply pipe the JSONL data into couchimport:

cat mydata.jsonl | couchimport

Configuration - command-line options

Alternatively, supply the --url and --database options as command-line parameters:

couchimport --url "http://user:password@localhost:5984" --database "mydata" mydata.jsonl

or pipe the data in via stdin:

cat mydata.jsonl | couchimport --url "http://user:password@localhost:5984" --database "mydata" 

Handling CSV/TSV data

We can use another package, csvtojsonlines, to convert CSV/TSV files into a JSONL stream acceptable to couchimport:

# CSV file ----> JSON lines ---> CouchDB
cat transactions.csv | csvtojsonlines --delimiter ',' | couchimport --db ledger

Generating random data

couchimport can be paired with datamaker to generate any amount of sample data:

# template ---> datamaker ---> 100 JSON docs ---> couchimport ---> CouchDB
echo '{"_id":"{{uuid}}","name":"{{name}}","email":"{{email true}}","dob":"{{date 1950-01-01}}"}' | datamaker -f json -i 100 | couchimport --db people
written {"docCount":100,"successCount":1,"failCount":0,"statusCodes":{"201":1}}
written {"batch":1,"batchSize":100,"docSuccessCount":100,"docFailCount":0,"statusCodes":{"201":1},"errors":{}}
Import complete

or with the template as a file:

cat template.json | datamaker -f json -i 10000 | couchimport --db people

Understanding errors

If we get an HTTP 4xx/5xx response, we know that none of the documents in that request were written to the database. But because couchimport writes data in bulk, a bulk request can return an HTTP 201 response even when not all of the documents were written - some of the document ids may already have been in the database. So the couchimport output includes counts of the documents that were written successfully and the number that failed, together with a tally of the HTTP response codes and individual document error messages:

e.g.

written {"batch":10,"batchSize":1,"docSuccessCount":4,"docFailCount":6,"statusCodes":{"201":10},"errors":{"conflict":6}}

The log line above shows that after the tenth batch of writes, 4 documents have been written successfully and 6 have failed. There were six "conflict" errors, meaning a clash of document id or id/rev combination.
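
For context, each batch is sent to CouchDB as a single bulk request, and the response body reports a per-document outcome. A bulk response containing a mixture of successes and conflicts typically looks something like this (the ids and revs are illustrative):

[
  { "ok": true, "id": "doc1", "rev": "1-967a00dff5e02add41819138abb3284d" },
  { "id": "doc2", "error": "conflict", "reason": "Document update conflict." }
]

This is how docSuccessCount, docFailCount and the errors tally in the log line are derived.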

Parallel writes

Older versions of couchimport supported having multiple HTTP requests in flight at any one time, but the new, simplified couchimport does not. To achieve the same thing, simply split your file of JSON docs into smaller pieces and run multiple couchimport jobs:

# split the large file into files of 1m lines each
# this will create files xaa, xab, xac etc
split -l 1000000 massive.txt
# find all files starting with x and using xargs,
# spawn a maximum of 2 processes at once running couchimport,
# one for each file
find . -name "x*" | xargs -t -I % -P 2 couchimport --db test %
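
The same effect can be achieved from Node.js with the programmatic API described below - a minimal sketch, assuming the split files xaa, xab and xac exist in the current directory and that running several imports concurrently is acceptable for your CouchDB instance:

const fs = require('fs')
const couchimport = require('couchimport')

const main = async () => {
  // one couchimport job per split file, all running concurrently
  const files = ['xaa', 'xab', 'xac']
  await Promise.all(files.map((f) => couchimport({
    url: 'http://user:password@localhost:5984',
    database: 'test',
    rs: fs.createReadStream(f)
  })))
}

main()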

Environment variables reference

  • COUCH_URL - the url of the CouchDB instance (required, or to be supplied on the command line)
  • COUCH_DATABASE - the database to deal with (required, or to be supplied on the command line)
  • COUCH_BUFFER_SIZE - the number of records written to CouchDB per bulk write (defaults to 500, not required)
  • IAM_API_KEY - to authenticate using IBM IAM, set IAM_API_KEY to your API key; a bearer token will then be used in the HTTP requests.

Command-line parameters reference

You can also configure couchimport using command-line parameters:

  • --help - show help
  • --url/-u - the url of the CouchDB instance (required, or to be supplied in the environment)
  • --database/--db/-d - the database to deal with (required, or to be supplied in the environment)
  • --buffer/-b - the number of records written to CouchDB per bulk write (defaults to 500, not required)

Using programmatically

In your project, add couchimport to the dependencies of your package.json or run npm install --save couchimport. In your code, require the library with

const couchimport = require('couchimport')

and set your options in an object whose keys are the same as the command-line parameters:

e.g.

const fs = require('fs')
const opts = { url: "http://localhost:5984", database: "mydb", rs: fs.createReadStream('myfile.json') }
await couchimport(opts)

Note: rs is the read stream from which data is read (default: stdin) and ws is the write stream to which output is written (default: stdout).
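
For example, a minimal sketch that reads documents from a file and sends couchimport's progress output to a log file instead of stdout (the file names are illustrative):

const fs = require('fs')
const couchimport = require('couchimport')

const main = async () => {
  await couchimport({
    url: 'http://localhost:5984',
    database: 'mydb',
    rs: fs.createReadStream('myfile.jsonl'), // where the documents are read from
    ws: fs.createWriteStream('import.log') // where the output is written
  })
}

main()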

couchimport's People

Contributors

assafmo, benjspriggs, dependabot[bot], glynnbird, gr2m, greenkeeperio-bot, jason-cooke, jdfitzgerald, jkryspin, lornajane, micmath, rajrsingh, smeans, terichadbourne


couchimport's Issues

Getting Killed after importing about 20k records

Hi,
I'm trying to import a fairly big data set (~3 million entries) by using couchimport, but the process always gets killed for some reason.
Here's my output:

root@leb-01122233:~# cat dailydump.txt | couchimport
******************
 COUCHIMPORT - configuration
   {"COUCH_URL":"http://****:****@127.0.0.1:5984","COUCH_DATABASE":"torrents","COUCH_TRANSFORM":null,"COUCH_DELIMETER":"|"}
******************
Written 500  ( 500 )
Written 500  ( 1000 )
...
Written 500  ( 21500 )
Written 500  ( 22000 )
Written 500  ( 22500 )
Written 500  ( 23000 )
Killed

any ideas?

Limit the requests per second this library makes

When using the Cloudant Lite plan on Bluemix there is a rate limit imposed on customers. Users who exceed that rate of API calls will start receiving HTTP 429 replies.

When importing large data sets, it's best to stick to a maximum API call rate (say 5 per second) to avoid the 429 responses.

Getting writefail events even for successful document inserts

Almost there -- and looking good, except I'm getting a writefail for every document insert, when they're succeeding?

Writefail
{ id: '01532024366550100050901',
rev: '3-d8fe0cefe3cdc9914d3040d425ea84ff' }
Writefail
{ id: '01532329456550100051600',
rev: '3-0321a1818f40d1bbf5a07412718bc15c' }
Writefail
{ id: '01532420316550100015440',
rev: '2-fe828a1c6bfd16ee3963d6b415e58454' }
{ documents: 0, failed: 500, total: 0, totalfailed: 6500 }

But, I checked in Cloudant ... the document inserts are definitely succeeding.

Approach to capture errors when running from the command line?

I'll start by saying that couchimport is blazing fast when importing documents into Cloudant, no complaints there!

I'm calling couchimport from the command line to import large JSON files containing 60k documents per file, or so. Is there a way I can get the utility to tell me which documents failed to write, for example, due to revision conflict errors?

cat inv_1.json | couchimport --db invoices --type json --jsonpath "docs.*"

end event listener registration error

Hi all, I've been working on a tool to identify instances of events registered to the wrong object in uses of some JavaScript event-driven APIs, as part of a research project.
The tool flagged line 66 in includes/preview.js, on the registration of the “end” event.

The reason I believe this is indicative of an error is as follows (from looking at the nodejs http API documentation).
The return of agent.get is an http.ClientRequest. But, “end” is an event on a readable stream, and http.ClientRequest is a writable stream.

Since the argument to the callback passed into agent.get is an http.IncomingMessage, which is a readable stream, then my guess is that the listener for “end” maybe should be registered on this variable instead.
Specifically, I would guess the code should instead be

 agent.get(u, function (rs) {
    rs.on('data', function (d) {
      b = Buffer.concat([b, d])
      if (b.length > 10000) {
        rs.destroy()
        alldone()
      }
    });
    rs.on('end', alldone); // this registration has been moved
  }).on('error', alldone)

Thanks!

"file not found" when using IAM_KEY authentication

Fresh install of couchimport.
Verified all runs well when using id and password authentication.

Trying to use IAM_KEY auth.

  1. export IAM_API_KEY=<value from my cloudant service credentials 'apikey'>
  2. run couchimport

get the following:

couchimport

url : "https://6f0e3c7d-3b09-4fd0-b253-c26d43892ac6-bluemix.cloudantnosqldb.appdomain.cloud"
database : "unlocodes_data"
delimiter : "\t"
buffer : 500
parallelism : 1
type : "jsonl"

Error: ENOENT: no such file or directory, open '/Users/kbiegert/.ccurl/keycache.json'
at Object.openSync (fs.js:457:3)
at Object.readFileSync (fs.js:359:35)
at Object.init (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/node_modules/ccurllib/index.js:21:20)
at Object.getToken (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/includes/iam.js:5:14)
at module.exports (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/includes/writer.js:20:7)
at Object.importStream (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/app.js:22:49)
at Object.<anonymous> (/usr/local/Cellar/node/13.8.0/lib/node_modules/couchimport/bin/couchimport.bin.js:48:15)
at Module._compile (internal/modules/cjs/loader.js:1157:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1177:10)
at Module.load (internal/modules/cjs/loader.js:1001:32) {
errno: -2,
syscall: 'open',
code: 'ENOENT',
path: '/Users/kbiegert/.ccurl/keycache.json'
}
events.js:298
throw er; // Unhandled 'error' event
^

Error [ERR_METHOD_NOT_IMPLEMENTED]: The _transform() method is not implemented
at Transform._transform (_stream_transform.js:166:6)
at Transform._read (_stream_transform.js:191:10)
at Transform._write (_stream_transform.js:179:12)
at doWrite (_stream_writable.js:441:12)
at writeOrBuffer (_stream_writable.js:425:5)
at Transform.Writable.write (_stream_writable.js:316:11)
at Transform.ondata (_stream_readable.js:714:22)
at Transform.emit (events.js:321:20)
at addChunk (_stream_readable.js:294:12)
at readableAddChunk (_stream_readable.js:275:11)
Emitted 'error' event on Transform instance at:
at errorOrDestroy (internal/streams/destroy.js:108:12)
at Transform.onerror (_stream_readable.js:746:7)
at Transform.emit (events.js:321:20)
at errorOrDestroy (internal/streams/destroy.js:108:12)
at onwriteError (_stream_writable.js:456:5)
at onwrite (_stream_writable.js:483:5)
at Transform.afterTransform (_stream_transform.js:98:3)
at Transform._transform (_stream_transform.js:166:3)
at Transform._read (_stream_transform.js:191:10)
at Transform._write (_stream_transform.js:179:12) {
code: 'ERR_METHOD_NOT_IMPLEMENTED'
}

Confused about JSONPath parameter

Want to create a test database.
Production database is ~ 3M rows (Cloudant)
I query out my random sample with a _view, and use ?include_docs=true with ccurl piped to a file.

Now I got JSON like this:

{"total_rows":10000,"offset":0,"rows":[
   {"id":" 61001",
    "key":" 61001",
    "value":1,
    "doc": {
       "_id":"61001",
       "_rev":"7-7b34fbeadcb40c5c7034cd3628da5d7c",
       "field1":""}
   },

etc.

So, to import that back up into a new Cloudant database, doing this:

export COUCH_DATABASE="new-db-small"
export COUCH_PARALLELISM=10
export COUCH_FILETYPE=json

export COUCH_JSON_PATH="rows.*.doc"
cat ./data/tickets_small.json | couchimport

and I'm not finding the documents in the source json. I tried:
  • rows.* - was wrong, got the id, key, value, doc properties
  • .rows[*].doc - which should work if it was a JSONPath per Stefan Goessner, I think
  • and a bunch of trial and error.
So, what am I missing? I'm certain you've done exactly this, considering the output of a view.

Commas not escaped on export

When using couchexport with a , as a delimiter, the commas inside the fields are not escaped, and therefore the CSV is broken. Usually CSV files wrap fields containing commas in double quotes.

Using couchimport v. 0.7.0

Export custom data from object

{ "_id": "d4ebc3b9397ba3faaacde2bfb80089e7", "_rev": "2-69b7ea5b6dd7495ff3d26a0fc5630825", "personal": { "firstName": "Dan", "lastName": "Denney", "dob": "12/26/87", "gender": "male" }, "work": { "workTitle": "Software Engineer", "workCompany": "Code School" }, "social": { "website": "dandenney.com", "emailId": "[email protected]", "twitter": "dandenney1", "facebook": "dandenney" }, "address": { "houseNumber": "52", "streetName": "main street", "city": "lodz", "country": "poland" } }

Want to export simple object with attributes firstName, lastName & company
We created transform.js
// example transformation function
// -- remove leading and trailing quotes
var x = function(doc) {
  doc.firstName = doc.personal.firstName;
  doc.lastName = doc.personal.lastName;
  doc.company = doc.work.workCompany;
  delete doc.work;
  delete doc.social;
  delete doc.address;
  return doc;
}
module.exports = x;

Command used
couchexport --url http://localhost:5984 --database tomtomdemo --delimiter "," --transform "D:\Places\optimus\Tools\couchDbTools\transform.js" > test_1.csv

Result: not as expected
_id,personal,work,social,address
d4ebc3b9397ba3faaacde2bfb80089e7,[object Object],[object Object],[object Object],[object Object]

Not getting the writefail event as expected

I'm passing in a file I know should be causing document conflicts, but I'm receiving the written event instead of the writefail. Any thoughts?

couchimport.importFile(jsonFile, opts, function(err, data) {
  console.log("Imported file: " + jsonFile, err, data);
}).on("writefail", function(data) {
  console.log(data);
  console.log("Writefail");
}).on("written", function(data) {
  console.log(data);
  console.log("Written");
}).on("writeerror", function(data) {
  console.log(data);
  console.log("WriteError");
});

{ documents: 2, total: 2 }
Written
Imported file: /dev/shm/SOE/inv_test.json null { total: 2 }

IMPORT BIGDATA

I'm trying to load a large amount of data (a 100GB CSV file), using the --parallelism and --buffer parameters, but I cannot improve the loading time.
Can you help me solve this, or how can I make it consume more RAM?

Crashing Consistently

Loading a JSON file to Cloudant. Any file throws an error around the 117,000 record mark.

Posting the console dump here:

<--- Last few GCs --->

35952 ms: Scavenge 1399.3 (1458.1) -> 1399.3 (1458.1) MB, 1.1 / 0 ms (+ 1.0 ms in 1 steps since last GC) [allocation failure] [incremental marking delaying mark-sweep].
36882 ms: Mark-sweep 1399.3 (1458.1) -> 1398.9 (1458.1) MB, 930.4 / 0 ms (+ 1.0 ms in 1 steps since start of marking, biggest step 1.0 ms) [last resort gc].
37793 ms: Mark-sweep 1398.9 (1458.1) -> 1398.9 (1458.1) MB, 910.4 / 0 ms [last resort gc].

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x30eacf1e3ac1
2: write [/usr/local/lib/node_modules/couchimport/node_modules/jsonparse/jsonparse.js:~80] [pc=0x38b956c216c0](this=0x1cf13944ecc1 <a Parser with map 0x288e820a8131>,buffer=0xdf746b4d31 <an Uint8Array with map 0x288e82005759)
3: /* anonymous */ [/usr/local/lib/node_modules/couchimport/node_modules/JSONStream/index.js:~20] [pc=0x38b956ca4417] (this=0x1cf13944ee19 <a Stream with map 0x2...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Abort trap: 6

Add a version switch

Allow users to check the tool is installed without actually running the import piece. I expected to be able to do couchimport --version to check that the tool was working and what version I had. Could we add this?
