
nd-db's Issues

Auto-serialize index

To save on startup time when initializing the same newline-delimited file multiple times with the same indexing function, nd-db should auto-persist the index itself to temporary filesystem storage (optionally somewhere else, to preserve it across system reboots).

  1. Auto-persist index to temp filesystem
  2. Optionally store this index somewhere else via configurable environment variable
  3. The default indexing function should be based on a regular expression, since that is often much more performant than parsing a document and querying it afterwards
  4. When using a regular expression as the indexing function, its string representation should be used to name the persisted index. That way the same newline-delimited database can have multiple persisted indices, one per indexing function (see the sketch below).
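
A minimal sketch of point 4, assuming a hypothetical index-file-path helper (not part of the current nd-db API) that derives the persisted-index filename from the database path and the regex's string representation:

(require '[clojure.string :as str])

;; Hypothetical helper: derive a persisted-index path in the system temp folder
;; from the database filename and the indexing regex, so different regexes
;; yield different persisted indices for the same database.
(defn index-file-path
  [db-filename ^java.util.regex.Pattern index-regex]
  (let [tmp (System/getProperty "java.io.tmpdir")
        ;; hashing keeps the filename short while staying deterministic
        id  (hash [db-filename (str index-regex)])]
    (str tmp "/" (str/replace db-filename #"[/\\]" "_") "_" id ".nddbmeta")))

(comment
  (index-file-path "/data/my.ndjson" #"\"id\":\"(\w+)\"")
  ;; => something like "/tmp/_data_my.ndjson_-1404328558.nddbmeta"
  )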

nd-db index transferability

Situation:
I'm using a 100 GB .ndjson file, for which it takes several minutes to generate (and persist) an index (.nddbmeta).

When moving the database to another computer, I copy the .nddbmeta together with the .ndjson database.

The problem is that the other machine (an Amazon Linux instance running in EC2) doesn't recognize the pre-generated .nddbmeta (generated on an M1 MacBook Pro) as the correct one, so it generates a new one.

TODO:

  • nd-db seems not to generate the same SHA ID for the same database on the two machines, and therefore generates a new index
  • On Amazon Linux the generated index doesn't work properly (the parsing of the document seems to yield wrong start/length data!?)

NB: nd-db has previously been tested on different variants of Ubuntu, plus on Mac, without any issues.

nd-db metadata v2

Right now (v0.8.0), lazy-docs simply uses the document IDs from the database index to create a lazy seq of documents. This works out fine, but the drawback is that the initial index has to be in memory.

The index is stored in binary form as part of an EDN document. All of this has to be parsed into memory before the index can be used to get the lazy sequence of database documents.

A lazier way of reading the database documents would be to store the index entries as separate lines of text in the .nddbmeta files. Then a line-seq over the index file would remove the need to read the whole index into memory first.
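
As a rough sketch of that idea (assuming a v0.9.0+ layout with the general metadata as EDN on the first line and one EDN index entry per following line; the function name is illustrative, not the actual nd-db API):

(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

;; Returns a lazy seq of parsed index entries; the caller owns the reader and
;; must keep it open (e.g. via with-open) while consuming the seq.
(defn lazy-index-entries
  [^java.io.BufferedReader rdr]
  (->> (line-seq rdr)
       (drop 1)                ;; skip the general metadata line
       (map edn/read-string)))

(comment
  (with-open [rdr (io/reader "my-db.nddbmeta")]
    (doall (take 5 (lazy-index-entries rdr)))))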

This doesn't matter for small databases, but for huge databases with millions of documents, it'll make a huge difference not to keep the whole index in memory "just" to go through it one by one.

"Since the processing order doesn't matter to me why don't we just make a line-seq over the database file itself?" you ask.
"Good question" I say.

The answer lies in the potential need for parallelizing your workload (i.e. a processing pipeline) via clojure.core.reducers or other means. To properly parallelize, you need to have the full set of input data realized first (i.e. stuffed into a vector). You'll probably run out of memory in a snap if you try to stuff millions of huge documents into memory.

Instead you read only the IDs from the index, stuff them into a vector, and parallelize by querying the database ID by ID. This will work, but if you have a lot of CPU cores and a slow hard disk, the read speed of the disk will probably become the bottleneck. You need a snappy SSD!
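
As a hedged sketch of that pattern with clojure.core.reducers - assuming nd-db.core/q can be called with the db value and a single document ID (the function is referenced elsewhere in this issue, but the exact signature is an assumption here):

(require '[clojure.core.reducers :as r]
         '[nd-db.core :as nd-db])

;; ids must be a fully realized vector for the fold to actually parallelize.
(defn process-in-parallel
  [db ids process-fn]
  (->> ids
       (r/map #(process-fn (nd-db/q db %)))
       (r/foldcat)))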

On the other hand, if you don't parallelize, you can easily just line-seq over the nd-db database file itself, and do your document processing one by one.

The result of this task should be to:

  1. (BREAKING) Change the format of the .nddbmeta files to contain the general metadata on the first line, and the database index on the following lines (one entry per line)
  2. lazy-docs, given an nd-db parameter, realizes the :index and returns a lazy seq of the docs
  3. A new nd-db.index/reader can be used to read the :index lazily (instead of having to realize it fully before returning the lazy seq of documents). This only works for the new v0.9.0+ versions of the .nddbmeta files, and is ideal for huge databases with millions of documents.

NB: This new version of the lazy-docs using the index reader only works for .ndnippy files (at least for now)!

Here's an example of how to use the new lazy-docs, where it's up to the caller to wrap the nd-db.index/reader in with-open:

;; get top 10 docs after dropping a lot, filtering and sorting the rest:
(with-open [r (nd-db.index/reader nd-db)]
  (->> nd-db
       (lazy-docs r)
       (drop 1000000)
       (filter (comp pos? :amount))
       (sort-by :priority)
       (take 10)))
  • make lazy-docs work for eager (legacy) .nddbmeta
  • make lazy-docs work for lazy (v0.9.0+) .nddbmeta
  • make a new lazy-ids for just returning the IDs of the documents (will be based on the eagerly realized :index for legacy .nddbmeta, and be truly lazy based on the new laziness-friendly .nddbmeta for v0.9.0+)
  • lazy-ids should be passed the needed Reader (from with-open form)
  • make a conversion function, to convert from legacy to v0.9.0+ .nddbmeta
  • make it possible to use lazy-docs without realizing the full :index of the database (either by not realizing :index until it's required by e.g. nd-db.core/q, or by calling lazy-docs and lazy-ids without the database value, but with the same parameters as used to initialize the db value)

Don't use double path separators

When specifying e.g. the :index-folder with a trailing path separator, make sure to remove it before persisting.

The same goes for leading path separators for db :filenames.

Always expect no trailing/leading path separators ;)
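
A small sketch of the intended normalization (the join-path helper is hypothetical, not part of the nd-db API):

(require '[clojure.string :as str])

;; Joins folder and filename with exactly one separator, regardless of whether
;; the folder had a trailing or the filename a leading separator.
(defn join-path
  [folder filename]
  (str (str/replace folder #"[/\\]+$" "")
       "/"
       (str/replace filename #"^[/\\]+" "")))

(comment
  (join-path "/tmp/index-folder/" "/my-db.ndjson")
  ;; => "/tmp/index-folder/my-db.ndjson"
  )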

Support CSV/TSV

Add support for CSV and TSV data formats.

This will be implemented by storing the columns (as keywords in a vector) in the metadata. The index name will be based on the selected key.

Since (get data k) is ~3x faster than (get-in data [k]) for single keys, the current :id-path should be modified to also accept single keys, e.g. :id-path :some-id.

Columns will be parsed using a default column parser upon reading the document. The document will be returned as a map with columns as keys.

There's no need to use the separator as part of the index name, since the separator is already represented by the content hash.
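
A hedged sketch of how a CSV/TSV row could become a document and how a single-key :id-fn could be generated from the metadata (the metadata shape {:cols ... :separator ... :id-key ...} and the helper names are assumptions for illustration):

(require '[clojure.string :as str])

;; Default column parsing: split the line on the separator and zip the values
;; with the column keywords from the metadata.
(defn row->doc
  [{:keys [cols separator]} line]
  (zipmap cols (str/split line (re-pattern (java.util.regex.Pattern/quote separator)))))

;; Single-key lookup via (get doc k) rather than (get-in doc [k]).
(defn id-fn-for
  [{:keys [id-key] :as metadata}]
  (fn [line] (get (row->doc metadata line) id-key)))

(comment
  (let [metadata {:cols [:id :amount] :separator "," :id-key :id}]
    ((id-fn-for metadata) "abc,42"))) ;; => "abc"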

  • Use default col-parser, but allow custom
  • Generate id-fn that can be used to get document ID per row
  • Generate index
  • Generate metadata for CSV/TSV (columns, id-key and separator; the latter shouldn't be part of the serialized filename, since it's already represented by the content hash)
  • Return document from CSV/TSV rows
  • Make sure lazy-ids works as expected (lazy-docs currently only works with ndnippy)
  • Put parse-doc fn specific for :doc-type into the db upon initialization (but don't serialize it!) -> speedup querying
  • Using :id-path [:kw] should parse to the same :id-fn as using :id-path :kw

Lazy seq of database contents

To iterate over everything in the database, it'd be nice to have a function that just returns a lazy seq of its contents.

Historical db versions

  • Accept last-line parameter to db, to read historical versions
  • Refactor index to index (sequentially) for every line in db
  • Always persist index if not persisted previously
  • Stop reading the persisted index after n lines and use that for db queries (see the sketch after this list)
  • Make sure historical db is read-only
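
A minimal sketch of the "stop after n lines" idea, given a (lazily read) seq of persisted index entries as [id [offset length]] pairs, oldest first (names are illustrative):

;; Build the query index from only the first last-line entries.
(defn historical-index
  [index-entries last-line]
  (into {} (take last-line index-entries)))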

Append (write) documents

Right now (as of v0.4.0) the nd-db library only supports using newline-delimited files as read-only databases.

It should be quite simple to write new documents to the database by appending only. In this way it will work as a sort of transaction log, to begin with probably without the notion of transaction time.

When writing data, the index should (of course) be automatically updated.
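
A hedged sketch of what an append could look like (the db shape with an :index atom and the serialize-fn/id-fn parameters are assumptions for illustration, not the actual nd-db internals):

(require '[clojure.java.io :as io])

;; Appends the serialized document as a new line at the end of the database
;; file and extends the index with the new [offset length] entry.
(defn append-doc!
  [{:keys [filename index]} serialize-fn id-fn doc]
  (let [line   (serialize-fn doc)                ;; single-line string representation
        offset (.length (io/file filename))]     ;; current end of file, in bytes
    (with-open [w (io/writer filename :append true)]
      (.write w (str line "\n")))
    (swap! index assoc (id-fn doc) [offset (count (.getBytes line "UTF-8"))])))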

Support compressed databases

Currently one of the databases I'm using is around 85 GB uncompressed.
Compressed with e.g. xz it's only around 2 GB.

With a lot of databases lying around for different purposes, it really takes a toll on the remaining free space on the hard drive.

For me personally, supporting compressed database files would free up hundreds of GBs on my drive, and the indexing/querying performance shouldn't suffer much.

The individual lines representing documents are actually compressed. But for small documents, there's a lot of duplication across documents when they get base64 encoded.

The main issue right now is that to read compressed files efficiently, we have to use a BufferedInputStream. So even though we can count the number of uncompressed bytes read from the stream, the count of compressed bytes read is not accurate.

This can probably be mitigated by using a seekable input stream on top of the compressed one. Then we can index based on the uncompressed byte count. The downside is that we may need to keep an open seekable input stream for as long as we keep the database value, to avoid opening and closing a pipe of streams for every query.
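
One possible direction, sketched with gzip from the JDK purely for illustration (xz would need a library such as XZ for Java, and the helper name is hypothetical): index by uncompressed byte offsets, and on query skip to that offset in a decompressing stream - either freshly opened as below, or kept open together with the db value as discussed above.

(require '[clojure.java.io :as io])
(import '(java.util.zip GZIPInputStream))

;; Reads `length` uncompressed bytes starting at uncompressed `offset`.
;; skipNBytes/readNBytes require JDK 12+.
(defn read-doc-at
  [compressed-file offset length]
  (with-open [in (GZIPInputStream. (io/input-stream compressed-file))]
    (.skipNBytes in (long offset))
    (String. (.readNBytes in (int length)) "UTF-8")))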

I'm open to other ideas! :)

ndedn index issue (v0.9.0)

Fix the issue when indexing .ndedn - perhaps we always use the ndnippy byte size when writing the lazy index, instead of the json or csv one?
