scottrogowski / mongita
"Mongita is to MongoDB as SQLite is to SQL"
License: BSD 3-Clause "New" or "Revised" License
mongita.errors.InvalidName: Collection cannot be named 'Nerve_Tibial.v8.egenes_ann_query_res.vcf'.
It seems to me that prohibiting non-letter symbols in collection names is superfluous. Incidentally, MongoDB does not have this restriction.
wrong: mongoose_types.insert_many([{'name': 'Meercat', 'not_into', 'Snakes'},{'name': 'Yellow mongoose': 'eats': 'Termites'}])
desc: the argument is not a valid Python dict
wrong: list(coll.find({'weight': {'$gt': 1}))
coll.delete_one({'name': 'Meercat'})
desc: the variable coll does not exist; it should be mongoose_types
I'm experiencing weird issues:
the API throws a KeyError exception when accessing a dict, and after that the db is erased.
It seems to be a thread-safety issue? Reads and writes can't happen at the same time.
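If racing reads and writes are indeed the cause, one caller-side workaround is to serialize all database access through a lock. A minimal sketch (the lock and wrapper below are my own illustration, not part of mongita's API):

```python
import threading

_db_lock = threading.Lock()

def locked(op, *args, **kwargs):
    # Serialize every read/write so operations cannot interleave across threads.
    with _db_lock:
        return op(*args, **kwargs)
```

Callers would then write e.g. `locked(collection.insert_one, {'k': 1})` instead of calling `insert_one` directly.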
Having a collection like this
[
{
"_id": 1,
"results": [
{
"product": "abc",
"score": 10
},
{
"product": "xyz",
"score": 5
}
]
},
{
"_id": 2,
"results": [
{
"product": "abc",
"score": 8
},
{
"product": "xyz",
"score": 7
}
]
},
{
"_id": 3,
"results": [
{
"product": "abc",
"score": 7
},
{
"product": "xyz",
"score": 8
}
]
},
{
"_id": 4,
"results": [
{
"product": "abc",
"score": 7
},
{
"product": "def",
"score": 8
}
]
}
]
and a query like this
list(db.collection.find({
"results.product": "xyz"
}))
produces no result
[]
when running it against mongodb the query finds the elements correctly
[
{
"_id": 1,
"results": [
{
"product": "abc",
"score": 10
},
{
"product": "xyz",
"score": 5
}
]
},
{
"_id": 2,
"results": [
{
"product": "abc",
"score": 8
},
{
"product": "xyz",
"score": 7
}
]
},
{
"_id": 3,
"results": [
{
"product": "abc",
"score": 7
},
{
"product": "xyz",
"score": 8
}
]
}
]
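For reference, MongoDB's dotted-path semantics say that a path segment applied to an array matches if it matches any element of that array. A pure-Python sketch of the rule (my own illustration, not mongita's internal matching code):

```python
def dotted_match(value, parts, target):
    # "results.product" == "xyz" matches a document when any element of the
    # `results` array has product == "xyz".
    if not parts:
        return value == target
    if isinstance(value, list):
        return any(dotted_match(item, parts, target) for item in value)
    if isinstance(value, dict) and parts[0] in value:
        return dotted_match(value[parts[0]], parts[1:], target)
    return False

# e.g. dotted_match(doc, "results.product".split('.'), "xyz")
```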
CC: @dgutson
I am trying to install mongita on Python 3.9 on a Windows machine and am getting the following error. Any ideas?
$ pip install mongita
Collecting mongita
Using cached mongita-1.0.0.tar.gz (33 kB)
ERROR: Command errored out with exit status 3221225477:
command: 'D:\anaconda\envs\py39\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\len_w\AppData\Local\Temp\pip-install-ee8rj6s8\mongita\setup.py'"'"'; file='"'"'C:\Users\len_w\AppData\Local\Temp\pip-install-ee8rj6s8\mongita\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn'
cwd: C:\Users\len_w\AppData\Local\Temp\pip-install-ee8rj6s8\mongita
Complete output (11 lines):
running egg_info
creating C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info
writing C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info\PKG-INFO
writing dependency_links to C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info\dependency_links.txt
writing requirements to C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info\requires.txt
writing top-level names to C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info\top_level.txt
writing manifest file 'C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info\SOURCES.txt'
reading manifest file 'C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'LICENSE,'
writing manifest file 'C:\Users\len_w\AppData\Local\Temp\pip-pip-egg-info-m1bl3ksn\mongita.egg-info\SOURCES.txt'
----------------------------------------
ERROR: Command errored out with exit status 3221225477: python setup.py egg_info Check the logs for full command output.
Segmentation fault
In pymongo, using replace_one() with upsert=True will create a new document if it doesn't exist, and always use an ID provided in the filter if available. It appears that mongita will only use the provided ID if the document already exists. Otherwise, it creates a new ID via bson.ObjectId().
Here's an example to reproduce this:
from mongita import MongitaClientDisk
from pymongo import MongoClient

def test_ids(client):
    collection = client['test_db']['test_collection']
    collection.replace_one(
        {'_id': 'id_from_filter'},
        replacement={'key': 'value'},
        upsert=True,
    )
    doc = collection.find_one({'_id': 'id_from_filter'})
    print(f'Fetched document by ID: {doc}')
    print('All IDs:')
    for d in collection.find({}):
        print(d['_id'])

print('pymongo\n----------')
test_ids(MongoClient())
print('\nmongita\n----------')
test_ids(MongitaClientDisk())
Output:
pymongo
----------
Fetched document by ID: {'_id': 'id_from_filter', 'key': 'value'}
All IDs:
id_from_filter
mongita
----------
Fetched document by ID: None
All IDs:
6356e86fd4d2dac326e38371
I believe it comes down to this section in Collection.__find_one_id():
Lines 823 to 826 in 0bc8e57
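For comparison, MongoDB's documented upsert rule can be sketched in a few lines (a hypothetical helper for illustration, not the code at the linked lines):

```python
def upserted_id(filter_doc, replacement):
    # MongoDB upsert rule: an exact (non-operator) _id in the filter wins;
    # next, an _id supplied in the replacement; otherwise generate a new one.
    if '_id' in filter_doc and not isinstance(filter_doc['_id'], dict):
        return filter_doc['_id']
    if '_id' in replacement:
        return replacement['_id']
    return 'new-object-id'  # stand-in for bson.ObjectId()
```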
@scottrogowski, this is a really nice project. The name is awesome too!
I have not had the chance yet to really give it a spin, however I think the benchmarks can be improved a bit.
I believe the SQLite performance comparison can be improved if you look at the cost of serializing a dict to JSON on insertion.
I think this is where most of the CPU cycles in the row insertion are consumed, which makes SQLite look so bad ...
def _to_sqlite_row(doc):
    doc['_id'] = str(doc['_id'])
    return (doc['_id'], doc['name'], doc['dt'], doc['count'], doc['city'],
            doc['content'], doc['percent'],
            json.dumps(doc['dict'], default=json_util.default))
Thanks for publishing this nifty little project!
In the SQLite schema you forgot to declare id as a primary key, or even to add an index on it. Thus all of your find_one() calls have O(n) performance.
I'm struggling to believe that this was unintentional, especially since you did bother to add indexes on other columns. Surely when you saw the jaw-droppingly bad numbers for SQLite in "get 1000 docs by ID", you would have investigated?
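For what it's worth, declaring the id column PRIMARY KEY gives SQLite an implicit index, turning id lookups from full table scans into indexed lookups. A standalone sketch (the table name and columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# PRIMARY KEY creates an implicit index, so WHERE id = ? is no longer O(n).
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO docs VALUES (?, ?)", ("a1", "hello"))
row = conn.execute("SELECT content FROM docs WHERE id = ?", ("a1",)).fetchone()
```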
Would it be preferable to use mongita memory over redis for caching? The reason being that I'm already using mongodb as my main database and implementing mongita as a cache layer seems to be the easiest route. How's the performance of it compared to redis?
While it is possible to use MongoEngine 0.22.1 with mongita-1.1.0 MongitaClientMemory:
import pymongo
import mongita
# This works
pymongo.MongoClient = mongita.MongitaClientMemory
import mongoengine
mongoengine.connect(host='c:/temp/mongita')
It does not work with MongitaClientDisk:
import pymongo
import mongita
# This fails
pymongo.MongoClient = mongita.MongitaClientDisk
import mongoengine
mongoengine.connect(host='c:/temp/mongita')
The MongitaClientDisk constructor always fails.
Not clear if this is just a version mis-match with mongoengine or not.
The error also affects the unit tests for mongita.
The problem seems to be in mongita_client.py in the MongitaClientDisk constructor where it
invokes:
disk_engine.DiskEngine.create(host)
This is because, according to:
https://pymongo.readthedocs.io/en/stable/api/pymongo/mongo_client.html
the host parameter to MongoClient() may be a list rather than a string. Since the
DiskEngine.create() factory method expects a string, it reports the error:
"unhashable type: list"
The simple fix is to test in the MongitaClientDisk constructor in mongita_client.py
whether the parameter is a list and, if so, pluck the first element. Since the
default value for mongoengine is 'localhost', this is worth checking for and replacing
with DEFAULT_STORAGE_DIR.
While we are at it, a check for the existence of the parent directory where the
database is to be located is worthwhile. This leaves us with:
def __init__(self, host=DEFAULT_STORAGE_DIR, **kwargs):
    host = host or DEFAULT_STORAGE_DIR
    if isinstance(host, list):  # pymongo allows a list of hosts
        host = host[0]
    if host == 'localhost':
        host = DEFAULT_STORAGE_DIR
    if not os.path.exists(os.path.dirname(host)):
        raise NotADirectoryError(os.path.dirname(host))
    self.engine = disk_engine.DiskEngine.create(host)
I'd be happy to generate a pull request, but I'd like to know that the pre-existing unit tests work on some system before
requesting a pull for a fix that may be out of date.
I see that it is written in the docs:
It is not process-safe
But the particular case is not clear. What about creating one client that works with a single collection, as one would with MongoDB? Is that a safe approach?
Mongita's Collection.index_information() returns a List[Dict], and should instead return a MutableMapping[str, Any].
See https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.index_information
My suggested implementation:
def index_information(self):
    ret = {'_id_': {'key': [('_id', 1)]}}
    metadata = self._Collection__get_metadata()
    for idx in metadata.get('indexes', {}).values():
        ret[idx['_id']] = {'key': [(idx['key_str'], idx['direction'])]}
    return ret
That's an awesome tool, and really very close to MongoDB. Is there any feature equivalent to mongodump and mongorestore (or mongoexport and mongoimport)? If not, it would be a useful addition to the module. Thanks.
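In the meantime, a rough dump/restore can be built on find() and insert_many(). A sketch (my own helpers, using a mongoexport-style JSON Lines format; it does not handle non-JSON BSON types such as ObjectId or datetime):

```python
import json

def dump_collection(coll, path):
    # Write one JSON document per line.
    with open(path, 'w') as f:
        for doc in coll.find({}):
            f.write(json.dumps(doc) + '\n')

def restore_collection(coll, path):
    # Read the JSON Lines file back and bulk-insert the documents.
    with open(path) as f:
        docs = [json.loads(line) for line in f]
    if docs:
        coll.insert_many(docs)
```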
In my PR #40, I suggest a change to lift the limitation preventing users from using bson.Binary data as an identifier.
In my project I need to store UUID data as _id; my ODM (beanie) turns them into bson.Binary. It seems a pretty legitimate behavior in mongo.
see: https://www.mongodb.com/docs/manual/reference/bson-types/#binary-data
First off, I love this library. It's one of the main reasons I made the jump from NodeJS with NEDB to Python. However, I'm trying to implement pagination on my Flask site, and the cursor.skip method would make life a little easier for me. Thank you for taking the time for implementing this feature. I appreciate all you do.
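Until cursor.skip() lands, pagination can be emulated by materializing the cursor and slicing. A sketch (the helper is my own; inefficient for large collections, but workable for a small Flask site):

```python
def paginate(cursor, page, page_size=10):
    # Materialize the cursor and slice out the requested 1-indexed page.
    docs = list(cursor)
    start = (page - 1) * page_size
    return docs[start:start + page_size]
```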
Mongita explicitly depends on pymongo <=4.
Since mongita relies only on pymongo's bson support, would it be complex to make mongita compatible with pymongo 4?
Hi!
I am not sure if it's against the design goals of the library, but it would be very useful to be able to provide custom Python callback functions to the _doc_matches_slow_filters code path of the find() method. That means a custom function which takes a single document and returns True or False depending on whether it should be included in the output. As far as I understand, this should be easy to implement, as that's what the non-indexed code path is basically doing anyway.
Using that, it would also be easy to work around operators which are not implemented yet.
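The proposal boils down to filtering with an arbitrary predicate. A sketch of the non-indexed path with a user callback (hypothetical; this is not an existing mongita parameter):

```python
def find_with_callback(docs, predicate):
    # Keep each document for which the user-supplied predicate returns True.
    return [doc for doc in docs if predicate(doc)]
```

Usage would look like `find_with_callback(collection.find({}), lambda doc: doc['n'] > 2)`.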
This is a really awesome lib. I wanted to see if I am able to start a db, and use it with mongoengine. I'm not sure if it requires an actual connection.
It is not clear whether collections are compressed. If so, what algorithm is used by default? Can the user specify a preferred algorithm and compression level?
An example from the MongoDB world:
create_collection(name, storageEngine={'wiredTiger': {'configString': 'block_compressor=zstd'}})
Hi there, really nice project you've got here! I'm using mongita as an embedded DB in one of my projects and it's great, but I've started making a few tweaks I thought you might be interested in?
Firstly, I see the on-disk storage engine currently has an infinitely growing cache, which naturally leads to memory leaks with large read/write cycles. I've had a go at writing a simple limited cache that should work as a drop-in replacement (#36). Limiting the cache size to below the benchmarking set size will obviously have a negative impact on performance, but when set larger (or to infinite) the change has limited impact.
Secondly, I think one of the current major bottlenecks for performance is the copy.deepcopy() calls on insertion and retrieval. For insertion, I'm fairly certain this can be replaced with a simple shallow copy, as all that's changed is the addition of the _id field? I've made a PR to test this out (#37) and all seems to work fine. On my system, the increase in insertion performance with the benchmarking set is ~50%.
For retrieval it looks like things are more complicated. Currently, the returned record is copied regardless of whether it's fetched from cache or from disk, but of course the record returned from disk is already unique so the copy is wasted. I don't see any easy way to change this at present without changing some other internals, most probably collection.__find_ids(). Is there a reason this function couldn't return the actual documents rather than just the IDs? It's already gone to the bother of fetching the records, so it seems wasteful to discard them only to retrieve them again later?
Cheers, and I hope you don't mind the comments!
According to https://docs.mongodb.com/manual/reference/operator/query/in/#mongodb-query-op.-in
If the field holds an array, then the $in operator selects the documents whose field holds an array that contains at least one element that matches a value in the specified array (for example, <value1>, <value2>, and so on).
mongita raises an error in this case:
Traceback (most recent call last):
File "test.py", line 57, in test_mongita
result = list(col.find({"names": {"$in": ["asd", "qwe"]}}))
File "Python38\lib\site-packages\mongita\cursor.py", line 56, in __iter__
for el in self._gen():
File "Python38\lib\site-packages\mongita\collection.py", line 870, in __find
for doc_id in self.__find_ids(filter, sort, limit, metadata=metadata):
File "Python38\lib\site-packages\mongita\collection.py", line 845, in __find_ids
if doc and _doc_matches_slow_filters(doc, slow_filters):
File "Python38\lib\site-packages\mongita\collection.py", line 193, in _doc_matches_slow_filters
if _doc_matches_agg(doc_v, query_ops):
File "Python38\lib\site-packages\mongita\collection.py", line 143, in _doc_matches_agg
if doc_v not in query_val:
TypeError: unhashable type: 'list'
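The documented behavior fits in a few lines of Python (my own illustration of the $in rule, not mongita's _doc_matches_agg):

```python
def matches_in(doc_value, candidates):
    # $in: if the field holds an array, match when ANY of its elements is
    # among the candidates; otherwise compare the scalar value itself.
    if isinstance(doc_value, list):
        return any(v in candidates for v in doc_value)
    return doc_value in candidates
```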
Hey Scott
First, thanks a lot for the package; I've found it really useful.
I'm having some issues ANDing together search conditions when trying to find something in a collection 'assessment'. It looks like using $and fails. Example:
test = [{'name': 'foo'}, {'name': 'bar'}]
db['assessment'].delete_many({})
db['assessment'].insert_many(test)
print(db['assessment'].count_documents({'name': {'$in': ['foo']}}))
db['assessment'].count_documents({'$and': [{'name': 'foo'}]})
Any thoughts appreciated!
Hi,
Very nice work with the mongita project! I have just started experimenting with it.
I have found that if I write to a collection in one process and read from it in another, the values when reading will not be updated without re-instantiating the MongitaClient. From the behaviour, it seems like the full db is loaded into memory? Is there a function I can use to refresh?
I was testing this with MongitaClientDisk.
If I open two instances of the same db and collection in the same process, the changes to the collection are reflected immediately.
Best regards.
First of all, thank you for this library.
I love it.
Would love to see $push implemented.
https://docs.mongodb.com/manual/reference/operator/update/push/
If I have a chance I'll make an attempt at it this week.
What modules would I need to touch for this?
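The core of $push is small. A sketch of the update semantics per the linked docs (my own, and it ignores modifiers like $each and $slice):

```python
def apply_push(doc, field, value):
    # $push appends to an array field, creating the array if it is absent.
    doc.setdefault(field, []).append(value)
    return doc
```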
You should mention in the README where the databases / collections are stored physically. I had to use strace to figure out they are in ~/.mongita under Linux.
Edit: how can we store the data in the current (project) directory (similarly to SQLite)?
Hello, I was wondering if we can access a mongita db created on the disk of one machine on another machine. We have RPyC connection between those 2 machines if it helps. (We don't want to use an actual mongoDB because we can't install mongo on them). Thanks!
When trying to create more than one index via create_indexes, a MongitaNotImplementedError occurs. I understand that implementing index intersection is a difficult task, but I really hope that it will be in Mongita someday.
Hi, I get the following error when trying to import mongita in python 3.6.
Traceback (most recent call last):
File "test.py", line 3, in <module>
from mongita import MongitaClientDisk
File "/home/chris/GitWS/tinydb_sqlite/venv/lib/python3.6/site-packages/mongita/__init__.py", line 1, in <module>
from . import collection
File "/home/chris/GitWS/tinydb_sqlite/venv/lib/python3.6/site-packages/mongita/collection.py", line 39, in <module>
re.Pattern: b'\n',
AttributeError: module 're' has no attribute 'Pattern'
Looks like 3.6 does not expose re.Pattern. Other projects have had the same issue: https://github.com/getsentry/responses/pull/196/files, beetbox/beets#2978
To continue to support 3.6, you may have to add a compatibility shim in an appropriate place.
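The snippet elided above is presumably a shim like the one the linked projects use, deriving the pattern type at runtime instead of referencing re.Pattern directly:

```python
import re

# On older Pythons where re.Pattern is not exposed, the compiled-pattern
# type can be derived from an actual compiled pattern instead.
Pattern = type(re.compile(''))
```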
I've dumped my Mongo database into a local folder and am now working with it via mongita.
PyMongo has functionality where I can retrieve documents from the database with specific fields only. For example:
mongo_client.db.col.find({}, {"_id": 1})
This line returns a cursor from which I will get only the "_id" field. I've tried something similar with mongita:
mongita_client.db.col.find({}, {"_id": 1})
It raises the following error:
mongita.errors.MongitaError: Unsupported sort parameter format. See the docs.
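For reference, an inclusion projection like {"_id": 1} just whitelists fields. A pure-Python sketch of that behavior (my own illustration, not mongita's find() signature):

```python
def project(doc, projection):
    # MongoDB-style inclusion projection: keep the requested fields,
    # plus _id unless it is explicitly excluded.
    keep = {k for k, v in projection.items() if v}
    if projection.get('_id', 1):
        keep.add('_id')
    return {k: v for k, v in doc.items() if k in keep}
```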