luposlip / nd-db

Clojure library exposing newline delimited files as lightning fast databases

License: Apache License 2.0

database document clojure csv edn json ndnippy newline-delimited

nd-db

[com.luposlip/nd-db "0.9.0-beta9"]

Newline Delimited (read-only) Databases!

Clojure library that treats lines in newline delimited (potentially humongous) files as simple (thus lightning fast) databases.

nd-db currently works with JSON documents in .ndjson files, and EDN documents in .ndedn. It also supports binary nippy encoded EDN documents.

Usage

A very tiny test database resides in resources/test/test.ndjson.

It contains the following 3 documents, which have "id" as their unique ID:

{"id":1, "data": ["some", "semi-random", "data"]}
{"id":222, "data": 42}
{"id":333333,"data": {"datakey": "datavalue"}}

Create Database

Since version 0.2.0 you need to create a database var before you can use the database. Behind the scenes this creates an index for you in a background thread.

(def db
  (nd-db.core/db
    :id-fn #(Integer. ^String (second (re-find #"^\{\"id\":(\d+)" %)))
    :filename "resources/test/test.ndjson"))

If you want a default :id-fn created for you, use the :id-name together with :id-type and/or :source-type. Both :id-type and :source-type can be :string or :integer. :id-type is the target type of the indexed ID, whereas :source-type is the type in the source .ndjson database file. :source-type defaults to :id-type, and :id-type defaults to :string:

(def db
  (nd-db.core/db
    {:id-name "id"
     :id-type :integer
     :filename "resources/test/test.ndjson"}))

EDN

If you want to read a database of EDN documents, just use :doc-type :edn. Please note that the standard :id-name and :id-type parameters don't (as of v0.3.0) work with EDN, hence you need to implement the :id-fn accordingly.

Nippy

Nippy can also be used. Since this is a binary format, you'd probably start out with a .ndjson or .ndedn file, and convert it to .ndnippy via the nd-db.convert namespace. "Why?" you ask. Because of speed. Especially for big documents (10s to 100s of KBs), the parsing makes a huge difference.

Because the nippy-serialized documents are "just" EDN, you can simply give a path for the ID with the :id-path parameter. Or of course use the mighty :id-fn instead.

There's a sample resources/test/test.ndnippy database representing the same data as the .ndjson and .ndedn samples.

Query Single Document

With a reference to the db you can query the database.

To find the data for the document with ID 222, you can perform a query-single:

(nd-db.core/q db 222)

Query Multiple Documents

You can also perform multiple queries at once. This is ideal for a pipelined scenario, since the return value is a lazy seq:

(nd-db.core/q db [333333 1 77])

It keeps!

NB: The above query for multiple documents returns only 2 documents, since there is no document with ID 77. This is a design decision, as the documents themselves still contain the ID.

In a pipeline you'll be able to give lots of IDs to q, and filter down on documents that are actually represented in the database.

If you want to have an option to return nil in this case, let me know by creating an issue (or a PR).
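The "keep" behaviour described above can be sketched with plain Clojure. This is only an illustration: a map stands in for the database, and the names docs-by-id and query-many are assumptions, not part of nd-db.

```clojure
;; Sketch: a plain map stands in for the database index plus documents.
;; `docs-by-id` and `query-many` are illustrative names, not nd-db API.
(def docs-by-id
  {1      {"id" 1, "data" ["some" "semi-random" "data"]}
   222    {"id" 222, "data" 42}
   333333 {"id" 333333, "data" {"datakey" "datavalue"}}})

(defn query-many [ids]
  ;; `keep` is lazy and drops the nil produced by the unknown ID 77
  (keep docs-by-id ids))

(map #(get % "id") (query-many [333333 1 77]))
;; => (333333 1)
```

Because each returned document still carries its own ID, the caller can tell which of the requested IDs were found.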

The ID function

The ID function adds "unlimited" flexibility in how you uniquely identify each document. You can choose a single attribute, or a composite. It's entirely up to you when you implement your ID function.

In the example above, the value of "id" is used as a unique ID to build up the database index.
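As a rough sketch of a composite ID, the function below joins two attributes into one string. The field names "country" and "id" and the name composite-id are assumptions made for illustration; they do not occur in the sample database.

```clojure
;; Hypothetical composite ID function. The field names "country" and "id"
;; are assumptions for illustration; they are not in the sample database.
(defn composite-id [^String line]
  (let [country (second (re-find #"\"country\":\"([^\"]+)\"" line))
        id      (second (re-find #"\"id\":(\d+)" line))]
    (str country "/" id)))

(composite-id "{\"id\":7,\"country\":\"DK\"}")
;; => "DK/7"
```

A function like this would be passed as :id-fn, and the same composite value would then be used when querying.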

Parsing JSON documents

If you use very large databases, it makes sense to think about performance in your ID function. In the above example a regular expression is used to find the value of "id", since this is faster than parsing JSON objects to EDN and querying them as maps.
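A minimal sketch of the regular-expression approach, using only clojure.core. It applies the same pattern as the :id-fn shown earlier to the three sample lines; the name line->id and the inlined sample-lines vector are illustrative.

```clojure
;; Sketch: extract the ID with a regex instead of parsing the whole JSON line.
;; `line->id` is an illustrative name, not part of nd-db.
(defn line->id [^String line]
  (Integer. ^String (second (re-find #"^\{\"id\":(\d+)" line))))

;; The three documents from resources/test/test.ndjson, inlined as strings
(def sample-lines
  ["{\"id\":1, \"data\": [\"some\", \"semi-random\", \"data\"]}"
   "{\"id\":222, \"data\": 42}"
   "{\"id\":333333,\"data\": {\"datakey\": \"datavalue\"}}"])

(mapv line->id sample-lines)
;; => [1 222 333333]
```

Note that this pattern only matches when "id" is the first key of the document, which is the trade-off for not parsing the full line.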

Return value

Furthermore the return value of the function is (almost) the only thing being stored in memory. Because of that you should opt for as simple data values as possible. In the above example this is the reason for the parsing to Integer instead of keeping the String value.

Also note that the return value is the same value you should use to query the database, which is why the inputs to q are Integer instances.

Refer to the test for more details.

Laziness!

Since v0.9.0 both the index and the documents can be retrieved in a truly lazy fashion.

Getting those lazy seqs is possible because v0.9.0 introduces a new meta+index file format that allows lazy traversal of the index.

The downside of supporting laziness is the size of the meta+index files, which in my tested scenarios has grown by 100%. For example, the meta+index file for a 16.8GB .ndnippy database containing ~300k huge documents (200-300KB each in raw JSON/EDN form) has grown from ~5MB to ~10MB.

This is not a problem at all in real life, since when you need the realized in-memory index (for ad-hoc querying by ID), it still consumes the same amount of memory as before (in the above example ~3MB).

Lazy IDs

Here's an example of how to get and use the lazy IDs:

(with-open [r (nd-db.index/reader my-db)]
  (->> r
       nd-db.core/lazy-ids
       (drop 100000)
       (take 100)
       (sort >)
       first))

Lazy Documents

NB: This currently (v0.9.0) only works for .ndnippy databases!

Getting the documents contained in a nippy-based database is just as simple as getting the IDs:

(with-open [r (nd-db.index/reader my-db)]
  (->> r
       (nd-db.core/lazy-docs my-db)
       (drop 100000)
       (take 100)
       (sort-by :some-value >)
       first))

The above example spends ~1ms on my laptop (mbp m1 pro) per dropped document, which isn't a lot. But if you don't actually need (most of) the documents, it's much faster to use the IDs as entry, like in this example:

(with-open [r (nd-db.index/reader my-db)]
  (->> r
       nd-db.core/lazy-ids
       (drop 100000)
       (take 100)
       (nd-db.core/q my-db)
       (sort-by :some-value >)
       first))

Persisting the database index

From v0.5.0 the generated index will be persisted to the temporary system folder on disk. This is a huge benefit if you need to use the same database multiple times, after throwing away the reference to the parsed database, since it takes much less time to read in the index as compared to parsing the database file.

For small files (like the sample databases found in this repository) it doesn't really make a difference. But for huge files, it makes an immense difference. The bigger the database, the bigger the individual documents and the more complex the parsing of these documents is (to find the unique ID), the bigger the difference. For a database file of 4.7GB the difference is 47s vs 90ms, or ~500 times faster!!

If you want to keep the serialized meta+index file (*.nddbmeta) between system reboots, you should move it to another folder. You do that by passing :index-folder to the db function.

If for some reason you don't want to persist the index - e.g. there's no storage attached to a docker container or serverless system - you can inhibit the persistence by setting param :index-persist? to false.

From v0.9.0 onwards the index is generated in parallel. This cuts two thirds of the processing time.

For more information on these and other parameters, see the source code for the db function in the core namespace.

Real use case: Verified Twitter Accounts

To test with a real database, download all verified Twitter users from here: https://files.pushshift.io/twitter/TU_verified.ndjson.xz

Put the file somewhere, e.g. path/to/TU_verified.ndjson, and run the following in a REPL:

(time
 (def katy-gaga-gates-et-al
   (doall
    (nd-db.core/q
     (nd-db.core/db {:id-name "screen_name"
                     :id-type :string
                     :filename "path/to/TU_verified.ndjson"})
     ["katyperry" "ladygaga" "BillGates" "ByMikeWilson"]))))

Performance

The extracted .ndjson file is 513 MB (297,878 records).

On my laptop the initial build of the index takes around 3 seconds, and the subsequent query of the above 3 verified Twitter users takes around 1 millisecond (specs: Intel® Core™ i7-8750H CPU @ 2.20GHz × 6 cores with 31.2 GB RAM, SSD).

In real usage scenarios, I've used 2 databases simultaneously of sizes 1.6 GB and 43.0 GB, with no problem or performance penalties at all (except for the relatively small size of the in-memory indexes of course). Indexing the biggest database of 43GB took less than 2 minutes (NB: This is with a single core, and BEFORE 0.4.0).

Since the database uses random disk access, an SSD speeds up the database significantly.
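To illustrate why random access matters, here is a self-contained sketch of an offset-based index over a newline delimited file. It assumes an index that maps each ID to a byte offset and byte length; nd-db's real index format may differ, and build-index and read-doc are illustrative names only.

```clojure
(import '(java.io File RandomAccessFile))

;; Sketch under assumptions: the index maps ID -> [byte-offset byte-length].
;; This shows the idea only; nd-db's real index format may differ.
(defn build-index [lines id-fn]
  (loop [[l & more] lines, offset 0, idx {}]
    (if (nil? l)
      idx
      (let [n (count (.getBytes ^String l "UTF-8"))]
        ;; + 1 accounts for the newline that terminates each document
        (recur more (+ offset n 1) (assoc idx (id-fn l) [offset n]))))))

(defn read-doc [^File f [offset length]]
  (with-open [raf (RandomAccessFile. f "r")]
    (let [buf (byte-array length)]
      (.seek raf offset)       ; jump straight to the document
      (.readFully raf buf)     ; read exactly one line's worth of bytes
      (String. buf "UTF-8"))))

(def lines ["{\"id\":1}" "{\"id\":222, \"data\": 42}"])
(def tmp (doto (File/createTempFile "sketch" ".ndjson") (.deleteOnExit)))
(spit tmp (apply str (map #(str % "\n") lines)))

(def idx
  (build-index lines #(Long/parseLong (second (re-find #"\"id\":(\d+)" %)))))

(read-doc tmp (idx 222))
;; returns the raw line of the document with ID 222
```

Each query is a seek plus a short read, which is cheap on an SSD and comparatively slow on a spinning disk.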

Update

On a MacBook Pro M1 Pro with 32 GB memory, the querying takes around 0.5 ms!

Another update with version 0.9.0-beta4

On the same M1 Pro as mentioned above, the index creation takes around 1 second, and querying the 3 documents, 1000 times, takes less than 0.3ms!

Copyright & License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

nd-db's Issues

Don't use double path separators

When specifying e.g. the :index-folder with a trailing path separator, make sure to remove it before persisting it.

The same goes for leading path separators for db :filenames.

Always expect no trailing/leading path separators ;)
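One possible way to normalize the separators, as a sketch. The helper names are hypothetical and not part of nd-db; the idea is simply to strip trailing separators from the folder and leading separators from the filename before joining them.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers; the names are illustrative and not part of nd-db.
(defn strip-trailing-separators [^String path]
  (str/replace path #"[/\\]+$" ""))

(defn strip-leading-separators [^String path]
  (str/replace path #"^[/\\]+" ""))

(defn join-path [folder filename]
  ;; exactly one separator ends up between folder and filename
  (str (strip-trailing-separators folder) "/" (strip-leading-separators filename)))

(join-path "/tmp/indexes/" "/db.nddbmeta")
;; => "/tmp/indexes/db.nddbmeta"
```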

Consider Concise Encoding

https://github.com/kstenerud/concise-encoding/blob/master/cte-specification.md

Historical db versions

Accept last-line parameter to db, to read historical versions
Refactor index to index (sequentially) for every line in db
Always persist index if not persisted previously
Stop reading persisted index after n lines and use that for db query
Make sure historical db is read-only

Append (write) documents

Right now (as of v0.4.0) the nd-db library only supports using newline delimited files as read only databases.

It should be quite simple to write new documents to the database, by appending only. In this way it will work as sort of a transaction log. To begin with probably without the notion of transaction time.

When writing data, the index should (of course) be automatically updated.

Support CSV/TSV

Add support for CSV and TSV data formats.

Will be implemented by storing the columns (as keywords in a vector) in the metadata. The index name will be based on the selected key.

Since (get data k) is ~3x faster than (get-in data [k]) for single keys, the current :id-path should be modified to accept single keys, i.e. :id-path :some-id.

Columns will be parsed using a default column parser upon reading the document. The document will be returned as a map with columns as keys.

There's no need to use separator as part of the index name, as the separator is part of the content hash.

Use default col-parser, but allow custom
Generate id-fn that can be used to get document ID per row
Generate index
Generate metadata for CSV/TSV (columns, id-key and separator - latter shouldn't be part of serialized filename, since it's already represented by the content hash)
Return document from CSV/TSV rows
Make sure lazy-ids work as expected (lazy-docs currently only work with ndnippy)
Put parse-doc fn specific for :doc-type into the db upon initialization (but don't serialize it!) -> speedup querying
Using :id-path [:kw] should parse to same :id-fn as when using :id-path :kw
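The plan above could look roughly like the following sketch. All names (id-path->id-fn, parse-header, row->doc) are assumptions, and the row splitting is deliberately naive: it does not handle quoted fields containing the separator.

```clojure
(require '[clojure.string :as str])

;; Sketch: all names here are assumptions based on the description above.
(defn id-path->id-fn [id-path]
  (let [path (if (sequential? id-path) id-path [id-path])]
    (if (= 1 (count path))
      (let [k (first path)] #(get % k))   ; single key: plain `get`
      #(get-in % path))))                 ; nested path: `get-in`

(defn- sep-pattern [^String sep]
  (re-pattern (java.util.regex.Pattern/quote sep)))

(defn parse-header [^String header sep]
  ;; columns are stored as keywords in a vector
  (mapv keyword (str/split header (sep-pattern sep))))

(defn row->doc [columns ^String row sep]
  ;; limit -1 keeps trailing empty columns
  (zipmap columns (str/split row (sep-pattern sep) -1)))

(def columns (parse-header "id,name,amount" ","))
(def doc (row->doc columns "42,Ada,10" ","))

((id-path->id-fn :id) doc)   ;; => "42"
((id-path->id-fn [:id]) doc) ;; => "42"
```

Both :id-path :id and :id-path [:id] resolve to the same single-key lookup, which matches the last item in the list.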

Optimize metafile naming

Using :id-fn for generating db value should consider :idx-id for naming the nddbmeta file

nd-db metadata v2

Right now (v0.8.0) lazy-docs simply use the document IDs from the database index, to create a lazy seq of documents. This works out fine. But the drawback is, that the initial index has to be in memory.

The index is stored binary, as part of an EDN document. All of this has to be parsed into memory, before the index can be used to get the lazy sequence of database documents.

A more lazy way of reading the database documents, would be to store the index entries as separate lines of text in the .nddbmeta files. Then a line-seq over the index file would remove the need of reading all of the index into memory first.

This doesn't matter for small databases, but for huge databases with millions of documents, it'll make a huge difference to not store all of the index in memory "just" to go through it one by one.
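A small sketch of the line-per-entry idea, using only the standard library. The exact line format is an assumption made for illustration: metadata as EDN on the first line, then one EDN vector of ID and position data per line.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Sketch of the proposed layout; the exact line format is an assumption:
;; line 1 holds general metadata, every following line holds one index entry.
(def meta-file
  (doto (java.io.File/createTempFile "sketch" ".nddbmeta") (.deleteOnExit)))

(spit meta-file
      (str (pr-str {:version "sketch"}) "\n"
           (pr-str [1 [0 8]]) "\n"
           (pr-str [222 [9 22]]) "\n"))

(defn first-n-ids [file n]
  (with-open [rdr (io/reader file)]
    (->> (line-seq rdr)
         rest                                 ; skip the metadata line
         (map (comp first edn/read-string))   ; parse one entry at a time
         (take n)
         doall)))                             ; realize before the reader closes

(first-n-ids meta-file 1)
;; => (1)
```

Only the lines actually consumed are read and parsed, so the rest of the index never has to be held in memory.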

"Since the processing order doesn't matter to me why don't we just make a line-seq over the database file itself?" you ask.
"Good question" I say.

The answer lies in the potential need for parallelizing your workload (i.e. processing pipeline) via clojure.core.reducers or other means. To properly parallelize, you need to have the full set of input data realized first (i.e. stuffed into a vector). You'll probably run out of memory in a snap, if you try to stuff millions of huge documents into memory.

Instead you read only the IDs from the index, stuff them into a vector, and parallelize by querying the database, ID by ID. This will work, but if you have a lot of CPU cores and a slow hard disk, the read speed of the disk will probably become the bottleneck. You need a snappy SSD!

On the other hand, if you don't parallelize, you can easily just line-seq over the nd-db database file itself, and do your document processing one by one.
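The ID-based parallelization could be sketched like this with clojure.core.reducers. Here fetch-doc is a stand-in for a real query by ID (it just builds a small map so the example is self-contained), and the :amount key is an assumption.

```clojure
(require '[clojure.core.reducers :as r])

;; Sketch: `fetch-doc` stands in for a query by ID against the database;
;; here it just builds a small map so the example is self-contained.
(defn fetch-doc [id]
  {:id id :amount (* 2 id)})

;; Only the IDs are realized up front, never the documents themselves
(def ids (vec (range 1 10001)))

(defn total-amount [ids]
  ;; r/fold splits the vector into chunks and reduces them in parallel
  (r/fold + (r/map (comp :amount fetch-doc) ids)))

(total-amount ids)
;; => 100010000
```

The vector of IDs is small compared to the documents, so memory stays bounded while the work is spread over the available cores.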

The result of this task should be to:

(BREAKING) change the format of the .nddbmeta files, to contain the general meta data on the first line, and the database index on the following lines (one entry per line)
lazy-docs with a nd-db parameter, realizes the :index and returns a lazy seq of the docs
a new nd-db.index/reader can be used to read the :index lazily (instead of having to realize it fully before returning the lazy seq of documents). But this only for the new v0.9.0+ versions of the .nddbmeta files. This is ideal for huge databases with millions of documents.

NB: This new version of the lazy-docs using the index reader only works for .ndnippy files (at least for now)!

Here's an example on how to use the new lazy-docs, where it's up to the caller to use with-open on the nd-db.index/reader:

;; get top 10 docs after dropping a lot, filtering and sorting the rest:
(with-open [r (nd-db.index/reader nd-db)]
  (->> nd-db
       (lazy-docs r)
       (drop 1000000)
       (filter (comp pos? :amount))
       (sort-by :priority)
       (take 10)))

make lazy-docs work for eager (legacy) .nddbmeta
make lazy-docs work for lazy (v0.9.0+) .nddbmeta
make a new lazy-ids for just returning the IDs of the documents (will be based on the eagerly realized :index for legacy .nddbmeta, and be truly lazy based on the new laziness-friendly .nddbmeta for v0.9.0+)
lazy-ids should be passed the needed Reader (from with-open form)
make a conversion function, to convert from legacy to v0.9.0+ .nddbmeta
make it possible to use lazy-docs without realizing full :index of the database (either by not realizing :index until it's required by e.g. nd-db.core/q, or by calling lazy-docs and lazy-ids without the database value, but with the same parameters as used to initialize the db value)

nd-db index transferability

Situation:
I'm using a 100GB large .ndjson file, for which it takes several minutes to generate (and persist) an index (.nddbmeta).

When moving the database to another computer, I copy the .nddbmeta together with the .ndjson database.

Problem is - the other machine (an Amazon Linux instance running in EC2) doesn't recognize the pre-generated .nddbmeta as the correct one (generated on a MBP m1), so it generates a new one.

TODO:

nd-db seems to not generate the same sha ID for the same database on the 2 machines, thus generating a new one
On Amazon Linux the generated index doesn't work properly (the parsing of the document seems to yield wrong start/length data!?)

NB: nd-db has previously been tested on different variants of Ubuntu, plus on Mac, without any issues.

Old (pre-v0.9.0) index failing

Fix issue with old indexes, failing in _parse-db returning a non db? compliant result.

Support compressed databases

Currently one of the databases I'm using is around 85GB uncompressed.
Compressed with i.e. xz it's only around 2GB.

When having a lot of databases lying around for different purposes, it really takes a toll on the remaining free space on the harddrive.

Personally, support for compressed database files would literally free up 100s of GBs on my drive, and the indexing/querying performance shouldn't suffer much.

The individual lines representing documents are actually compressed. But for small documents, there's a lot of duplication across the documents when they get base64 encoded.

The main issue right now is that when reading compressed files in an efficient way, we have to use a BufferedInputStream. So even though we can count the number of uncompressed bytes read from the stream, the count of compressed bytes read is not accurate.

This can probably be mitigated by using a seekable inputstream on top of the compressed one. Then we can index based on the uncompressed bytes count. The downside to this is, that we may need to keep an open seekable inputstream as long as we keep the database value, to not have to open and close a pipe of streams for every query.
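Indexing by uncompressed byte count can be sketched as follows. This is an assumption-laden illustration: it uses gzip from the JDK rather than xz (which needs an external library), works on an in-memory byte array, and only covers building the offsets, not seeking into the compressed stream afterwards.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream)
        '(java.util.zip GZIPInputStream GZIPOutputStream))

;; Sketch: gzip stands in for xz, and all names are illustrative.
(defn gzip-bytes [^String s]
  (let [baos (ByteArrayOutputStream.)]
    (with-open [gz (GZIPOutputStream. baos)]
      (.write gz (.getBytes s "UTF-8")))
    (.toByteArray baos)))

(defn uncompressed-line-offsets
  "Returns [start length] pairs per line, counted in UNCOMPRESSED bytes."
  [^bytes compressed]
  (with-open [in (GZIPInputStream. (ByteArrayInputStream. compressed))]
    (loop [pos 0, start 0, acc []]
      (let [b (.read in)]
        (cond
          ;; end of stream: keep a final line that lacks a trailing newline
          (= b -1) (if (> pos start) (conj acc [start (- pos start)]) acc)
          ;; newline (byte 10): close the current line, start the next one
          (= b 10) (recur (inc pos) (inc pos) (conj acc [start (- pos start)]))
          :else    (recur (inc pos) start acc))))))

(uncompressed-line-offsets (gzip-bytes "ab\ncde\n"))
;; => [[0 2] [3 3]]
```

Reading byte by byte keeps the sketch short; a real implementation would read in larger chunks and still need a seekable layer to make use of these offsets.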

I'm open to other ideas! :)

Auto-persist index to temp filesystem
Optionally store this index somewhere else via configurable environment variable
Default indexing-function should be based on regular expression, since that is often much more performant than parsing a document and querying that afterwards
When using regular expression for indexing-function, the string representation of this should be used to name the persisted index. In this way the same newline delimited database can be persisted multiple times based on different indexing functions.
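Naming the persisted index after the regular expression could be sketched like this. The file naming scheme, the ".idx" suffix and the function names are hypothetical; the point is only that the string representation of the pattern is hashed, since two regex literals are never equal by value in Clojure even when their source text is identical.

```clojure
(import 'java.security.MessageDigest)

;; Sketch: the naming scheme and the ".idx" suffix are hypothetical.
(defn sha1-hex [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-1")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

(defn index-filename [db-filename id-regex]
  ;; (str id-regex) yields the pattern's source text, which is stable
  ;; across runs, unlike the identity of the Pattern object itself
  (str db-filename "." (subs (sha1-hex (str id-regex)) 0 8) ".idx"))

(= (index-filename "db.ndjson" #"\"id\":(\d+)")
   (index-filename "db.ndjson" #"\"id\":(\d+)"))
;; => true
```

With a scheme like this, indexing the same database with a different regular expression yields a different file name, so both indexes can coexist.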