mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to NumPy arrays, Parquet files, and pandas DataFrames in one line of code.

Home Page: https://mongo-arrow.readthedocs.io

License: Apache License 2.0

Languages: Python 81.10%, Shell 1.12%, Makefile 0.25%, Batchfile 0.31%, CSS 1.05%, Cython 16.17%
Topics: apache-arrow, arrow, mongodb, numpy-arrays, pandas-dataframe, parquet-files, python

mongo-arrow's Introduction

mongo-arrow

Tools for using Apache Arrow with MongoDB

Apache Arrow

We use Apache Arrow to offer fast, easy conversion of MongoDB query result sets to the numerical data formats popular among developers, including NumPy arrays, pandas DataFrames, Parquet files, and CSV.

We chose Arrow for this because of its unique set of characteristics:

  • language-independent
  • columnar memory format for flat and hierarchical data
  • organized for efficient analytic operations on modern hardware like CPUs and GPUs
  • zero-copy reads for lightning-fast data access without serialization overhead

It is simple and fast, and from our perspective Apache Arrow is ideal for processing and transporting large datasets in high-performance applications.

As reference points for our implementation, we also took a look at BigQuery’s Pandas integration, pandas methods to handle JSON/semi-structured data, the Snowflake Python connector, and Dask.DataFrame.

How it Works

Our implementation relies upon a user-specified data schema to marshal query result sets into tabular form. Example:

from datetime import datetime

from pymongoarrow.api import Schema

schema = Schema({"_id": int, "amount": float, "last_updated": datetime})

You can install PyMongoArrow on your local machine using pip:

$ python -m pip install pymongoarrow

You can export data from MongoDB to a pandas DataFrame with a single call:

df = production.invoices.find_pandas_all({"amount": {"$gt": 100.00}}, schema=schema)

Since PyMongoArrow can automatically infer the schema from the first batch of data, this can be further simplified to:

df = production.invoices.find_pandas_all({"amount": {"$gt": 100.00}})
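
For context, here is a complete end-to-end sketch of the flow described above. It assumes a local MongoDB deployment with a "production" database and an "invoices" collection (both names are illustrative); find_pandas_all() is attached to Collection objects by pymongoarrow.monkey.patch_all():

from datetime import datetime

from pymongo import MongoClient
from pymongoarrow.api import Schema
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all() and friends to pymongo Collection objects

client = MongoClient()
production = client.production
schema = Schema({"_id": int, "amount": float, "last_updated": datetime})

# With an explicit schema...
df = production.invoices.find_pandas_all({"amount": {"$gt": 100.00}}, schema=schema)
# ...or letting PyMongoArrow infer the schema from the first batch of results.
df = production.invoices.find_pandas_all({"amount": {"$gt": 100.00}})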

Final Thoughts

This library is in the early stages of development, so it's possible the API may change in the future; we definitely want to continue expanding it. We welcome your feedback as we continue to explore and build this tool.

mongo-arrow's People

Contributors

agolin95, bcwarner, blink1073, caseyclements, dependabot[bot], janosh, juliusgeo, noahstapp, prashantmital, shaneharvey, sibbiii, techbelle


mongo-arrow's Issues

TypeError: <lambda>() takes 0 positional arguments but 1 was given

When I try to create a Schema from the following dict: {'_id': <class 'bson.objectid.ObjectId'>, 'date': <class 'datetime.datetime'>, 'ticker': <class 'str'>, 'pca_0': <class 'float'>, 'y': <class 'float'>}
I get the error: TypeError: <lambda>() takes 0 positional arguments but 1 was given
In types.py (https://github.com/mongodb-labs/mongo-arrow/blob/main/bindings/python/pymongoarrow/types.py) I see there is a dict that converts Python types to BSON types. I built my own version of that dict, identical except that I removed the 'lambda:' from each key-value pair, and that solved the problem. I am just reporting this in case it helps someone, or in case it is an actual bug.
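
A minimal sketch of the schema construction described in this report (whether it raises the TypeError depends on the installed pymongoarrow version):

from datetime import datetime

from bson import ObjectId
from pymongoarrow.api import Schema

# The same field-to-type mapping as in the report above.
schema = Schema({"_id": ObjectId, "date": datetime, "ticker": str, "pca_0": float, "y": float})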

Does the Schema support the pyarrow.string() type?

  import pyarrow
  from pymongoarrow.api import Schema, find_pandas_all
  from datetime import datetime
  from pymongo import MongoClient

  client = MongoClient()
  client.db.data.delete_many({})
  client.db.data.insert_many([
      {'_id': 1, 'amount': 21, 'mac':'eee', 'last_updated': datetime(2020, 12, 10, 1, 3, 1)},
      {'_id': 2, 'amount': 16,  'mac':'ddd', 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
      {'_id': 3, 'amount': 3,  'mac':'eeeeee', 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},
      {'_id': 4, 'amount': 0, 'mac':'aaa',  'last_updated': datetime(2021, 2, 25, 3, 50, 31)}])

  schema = Schema({'_id': pyarrow.int32(), 'amount': pyarrow.float64(), 'mac': pyarrow.string(), 'last_updated': datetime})
  df = find_pandas_all(client.db.data, {'amount': {'$gt': 5}}, schema=schema)

Schema({'_id': pyarrow.int32(), 'amount': pyarrow.float64(), 'mac': pyarrow.string(), 'last_updated': datetime})

    df = find_pandas_all(client.db.data, {'amount': {'$gt': 5}}, schema=schema)
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/api.py", line 142, in find_pandas_all
    find_arrow_all(collection, query, schema=schema, **kwargs))
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/api.py", line 59, in find_arrow_all
    schema, codec_options=collection.codec_options)
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/context.py", line 53, in from_schema
    str_type_map = _get_internal_typemap(schema.typemap)
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/types.py", line 70, in _get_internal_typemap
    assert len(internal_typemap) == len(typemap)
AssertionError
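
For reference, a hedged sketch of how a string field can be declared on releases that support string types (the AssertionError above came from a release without that support); client is the MongoClient from the snippet above:

import pyarrow
from datetime import datetime
from pymongoarrow.api import Schema, find_pandas_all

schema = Schema({
    '_id': pyarrow.int32(),
    'amount': pyarrow.float64(),
    'mac': pyarrow.string(),  # or simply the Python built-in str
    'last_updated': datetime,
})
df = find_pandas_all(client.db.data, {'amount': {'$gt': 5}}, schema=schema)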

0.6.2 pypi release is missing source distribution

Hi,

I'm on an aarch64 platform and when trying to update to the latest version I'm getting:

$ uname -a
Linux 7dc3eb8d2f89 5.15.49-linuxkit #1 SMP PREEMPT Tue Sep 13 07:51:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

$ python -m pip install pymongoarrow==0.6.2
ERROR: Could not find a version that satisfies the requirement pymongoarrow==0.6.2 (from versions: 0.1.1, 0.2.0, 0.3.0, 0.4.0.dev0, 0.4.0, 0.5.0, 0.5.1)
ERROR: No matching distribution found for pymongoarrow==0.6.2

Looking at the PyPI downloads page, it seems the latest release is missing the source distribution:

[screenshot: files available for the 0.6.2 release on PyPI]

compared to the 0.5.1 release:

[screenshot: files available for the 0.5.1 release on PyPI]

Could you add a source distribution, please?

Best,
Wiktor

pymongoarrow does not return nested fields

I'm trying out pymongoarrow to fetch a large dataset. With regular pymongo I'm able to get all the fields that I need, but with pymongoarrow no field with nested values (e.g. a dict) seems to be returned. Did I miss something?
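
For what it's worth, newer pymongoarrow releases can decode nested documents when the schema declares them explicitly; a hedged sketch with illustrative field and collection names:

import pyarrow as pa
from pymongoarrow.api import Schema, find_pandas_all

# Nested documents are declared as a nested mapping of field names to types.
schema = Schema({
    '_id': pa.int64(),
    'address': {'city': pa.string(), 'zip': pa.string()},
})
df = find_pandas_all(collection, {}, schema=schema)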

Java version

Is there any plan to add a Java version of this library?

Casting timestamp in find_pandas_all()

Hi, I'm facing this issue when trying to turn my mongo collection into a pandas DataFrame using the find_pandas_all() function:

from datetime import datetime
from bson import ObjectId
import pyarrow
from pymongoarrow.api import Schema

authors_pyarrow = Schema({"_id": ObjectId, "first_name": pyarrow.string(), "last_name": pyarrow.string(), "date_of_birth": datetime})
df = production_db.author.find_pandas_all({}, schema=authors_pyarrow)
print(df.head())

This library has not been compiled

With a fresh install of pymongoarrow, running Python 3.10.5, I get the following:

C:\Users\prokie>pip list
Package      Version
------------ -------
numpy        1.22.4
pip          22.1.2
pyarrow      7.0.0
pymongo      4.1.1
pymongoarrow 0.4.0
setuptools   62.4.0

C:\Users\prokie>python
Python 3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pymongoarrow
C:\Users\prokie\AppData\Roaming\Python\Python310\site-packages\pymongoarrow\__init__.py:33: UserWarning: This library has not been compiled
  warnings.warn("This library has not been compiled")

What am I missing?

[documentation update request / feature request] `write` supported types

Update after the issue was created

I'm sorry I bothered you.

I assume I must not use polars.Enum or I must cast it to another type before calling the write function.


Duplicate issues

I did not find an open issue with a roadmap to document workarounds or add a built-in support for this.

Something related: #109, #35

Preface

This is a feature request to make the write function usable by supporting valid polars.DataFrame types.

An alternative option is updating the documentation with an example that shows how to deal with the natively unsupported types.

TO-DO

  • Enum

    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/opt/pysetup/demo/test_polars.py", line 130, in <module>
        write(mongo_database().polars_places, df)
      File "/opt/pysetup/.venv/lib/python3.12/site-packages/pymongoarrow/api.py", line 450, in write
        _validate_schema(tabular.schema.types)
      File "/opt/pysetup/.venv/lib/python3.12/site-packages/pymongoarrow/types.py", line 327, in _validate_schema
        raise ValueError(msg)
    ValueError: Unsupported data type "dictionary<values=large_string, indices=uint32, ordered=" in schema
    
  • "uint64"

    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/opt/pysetup/demo/test_polars.py", line 131, in <module>
        write(mongo_database().polars_places, df)
      File "/opt/pysetup/.venv/lib/python3.12/site-packages/pymongoarrow/api.py", line 450, in write
        _validate_schema(tabular.schema.types)
      File "/opt/pysetup/.venv/lib/python3.12/site-packages/pymongoarrow/types.py", line 327, in _validate_schema
        raise ValueError(msg)
    ValueError: Unsupported data type "uint64" in schema
    
    
    

Versions

pip show pymongoarrow

  • Version: 1.4.0.dev0

pip show polars-lts-cpu

  • 0.20.23
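
Until such types are supported natively, one possible workaround is to cast the offending polars columns before calling write(); a sketch with illustrative column names:

import polars as pl
from pymongoarrow.api import write

# Cast the dtypes that write() rejects into supported ones.
df = df.with_columns(
    pl.col("category").cast(pl.Utf8),  # Enum -> plain string
    pl.col("count").cast(pl.Int64),    # uint64 -> signed int64
)
write(collection, df)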

Support for other data types?

In roughly what timeframe do you expect to add support for reading other data types beyond int, float, and date?

Support for Tool

Hi, I am considering using this library to serialize data from MongoDB and integrate it with APIs. Will this tool continue to be developed for some time, or is it more exploratory?

Bug: find_arrow_all in version 1.0.1 returns wrong schema for nested bson.ObjectId while bson.ObjectId on root level works as documented

Hi,

Let me thank you for the new 1.0.1 release and for making bson.ObjectId an extension type.
This new feature seems to work for all keys at the root level, but not for ones nested within objects.

Steps to reproduce:

    collection = ...
    obj_id = bson.ObjectId()

    collection.insert_one(
        {'_id': obj_id,
         'id1': obj_id,
         'obj': {'id2': obj_id}
         })

    print(pymongoarrow.api.find_arrow_all(collection, {}))
    print(pymongoarrow.api.find_arrow_all(collection, {}).to_pandas())

The output is:

    _id: extension<arrow.py_extension_type<ObjectIdType>>
    id1: extension<arrow.py_extension_type<ObjectIdType>>
    obj: struct<id2: fixed_size_binary[12]>
           child 0, id2: fixed_size_binary[12]
                        _id                       id1                                            obj
0  649c7842aa528cb1069843d2  649c7842aa528cb1069843d2  {'id2': b'd\x9cxB\xaaR\x8c\xb1\x06\x98C\xd2'}

Note that id1 is of type arrow.py_extension_type and id2 of type fixed_size_binary[12]. It seems that nested ObjectIds are treated as they were before version 1.0.0, so maybe something was overlooked here in the latest upgrade.

Defining the schema manually leads to the same output.

Thanks,
Sebastian

Can `find_pandas_all` treat a list of structs as a nested dataframe?

I have a mongo document which has a list field containing child documents.

Pandas data frames can be nested. And PyArrow has Table and RecordBatch types.

I would like to avoid having to call pandas.json_normalize on the child list and instead have find_pandas_all return a nested dataframe directly.

Would it be possible to use Table or RecordBatch type in the schema to get this behaviour?
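
Until/unless that is supported, a hedged sketch of the current approach: declare the list-of-struct field in the schema and flatten it afterwards (field names are illustrative):

import pyarrow as pa
from bson import ObjectId
from pymongoarrow.api import Schema, find_pandas_all

schema = Schema({
    '_id': ObjectId,
    'items': pa.list_(pa.struct([pa.field('sku', pa.string()),
                                 pa.field('qty', pa.int64())])),
})
df = find_pandas_all(collection, {}, schema=schema)
# Each row of df['items'] is then a sequence of dicts; pandas.json_normalize
# or df.explode('items') can flatten it when a truly nested view is needed.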

Ability to query _id as string if it is of type ObjectId (e.g. "63fcb5aa5e1d7530a517dc44")

Hi,

Sorry if I have overlooked something in the documentation. I am trying to get the _id field as a string if it is of type ObjectId.

Using schema = pymongoarrow.api.Schema( {'_id': bson.ObjectId} ) and then aggregate_to_df_with_schema(...) works well, but results in a fixed_size_binary[12], which is fine, except that sometimes a string in the form of e.g. "63fcb5aa5e1d7530a517dc44" is needed.

I've tried schema = pymongoarrow.api.Schema( {'_id': pyarrow.string()} ) but the result is null only.

The workaround I have found so far is to use pandas df[key].apply(bytes.hex).astype('string'), which works fine for non-null values, but some convenience function or documented way to get a string would be appreciated.
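
A sketch of that workaround, assuming the _id column comes back as the 12 raw ObjectId bytes (fixed_size_binary[12]):

# Convert the raw 12-byte ObjectId values to their 24-character hex form,
# e.g. "63fcb5aa5e1d7530a517dc44"; note this fails on null values.
df['_id'] = df['_id'].apply(bytes.hex).astype('string')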

Many thanks for providing this great library,
it works well so far,
Sebastian

undefined symbol: _ZN5arrow6StatusC1ENS_10StatusCodeERKSs with airflow 2.8.1

I'm reproducing a bug in airflow with the docker-compose method to run airflow2.8.1 with python 3.11 ( https://airflow.apache.org/docs/apache-airflow/2.8.1/howto/docker-compose/index.html#fetching-docker-compose-yaml ).

I'm creating a requirements.txt with the following packages :

pymongo==4.6.1
pymongo[srv]==4.6.1
pymongoarrow==1.2.0
pandas==2.1.4

After starting the airflow services, each task containing a pymongoarrow reference returns the following error:

from pymongoarrow.monkey import patch_all
/home/airflow/.local/lib/python3.11/site-packages/pymongoarrow/__init__.py:27: UserWarning: Could not find compiled pymongoarrow.lib extension, please install from source or report the following traceback on the issue tracker:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/pymongoarrow/__init__.py", line 25, in <module>
    from pymongoarrow.lib import libbson_version
ImportError: /home/airflow/.local/lib/python3.11/site-packages/pymongoarrow/lib.cpython-311-aarch64-linux-gnu.so: undefined symbol: _ZN5arrow6StatusC1ENS_10StatusCodeERKSs

  warnings.warn(

Dataframe is all NaT and None after loading

I was trying mongo-arrow to load a dataset from MongoDB. It loads only the selected columns, which saves space, but the resulting dataframe contains only NaT and None values. Is this a common issue, and how can I fix it?
Thanks in advance

df = collection.find_pandas_all(
    {'prop.Start': {'$gte': start_date, '$lte': end_date}},
    schema=Schema({
        'prop.Start': datetime,
        'prop.Name': str,
        '_id.objectId': str,
    }))
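
One workaround to consider (a sketch, reusing the field names and date variables from the snippet above): project the nested fields to top-level names in an aggregation pipeline and read them with a flat schema, since dotted names in a find schema may not be resolved as nested lookups:

from datetime import datetime

from bson import ObjectId
from pymongoarrow.api import Schema, aggregate_pandas_all

pipeline = [
    {'$match': {'prop.Start': {'$gte': start_date, '$lte': end_date}}},
    {'$project': {'prop_start': '$prop.Start', 'prop_name': '$prop.Name'}},
]
df = aggregate_pandas_all(
    collection,
    pipeline,
    schema=Schema({'_id': ObjectId, 'prop_start': datetime, 'prop_name': str}),
)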

ARROW-175 Bug: nested data seems to be decoded even if not in schema

Hi,

I found another issue with version 1.0.1 that is a little bit tricky,
so I spent some time creating a minimal example:

Assume you have a collection with some fields that fit a certain schema.
Now you add a new key that is not in the schema, so this new key should be ignored when aggregate_arrow_all is called:

    collection.insert_one(
        {'_id': 0,
         'new_key': 1000000000
         })

    collection.insert_one(
        {'_id': 1,
         'new_key': 1.0e50
         })

    schema = pymongoarrow.api.Schema(
        {'_id': pa.int32()
         })

    df = pymongoarrow.api.aggregate_arrow_all(collection, [{'$sort': {'_id': 1}}], schema=schema)

all works as expected. So far so good. This means that aggregate_arrow_all does not break even if new fields with contradicting data types are added to the collection.

Now the same example with a nested new key:

    temp_collection._collection.insert_one(
        {'_id': 0,
         'obj': {'a': 1,
                 'new_key': 1000000000}
         })

    temp_collection._collection.insert_one(
        {'_id': 1,
         'obj': {'a': 1,
                 'new_key': 1.0e50}
         })

    schema = pymongoarrow.api.Schema(
        {'_id': pa.int32(),
         'obj': {'a': pa.int32()}
         })

    df = pymongoarrow.api.aggregate_arrow_all(collection, [{'$sort': {'_id': 1}}], schema=schema)

This raises the error

pymongoarrow\lib.pyx:284: in pymongoarrow.lib.process_raw_bson_stream
OverflowError: Python int too large to convert to C long
pymongoarrow\lib.pyx:507: OverflowError

But there is no reason for this error, as new_key should not even be decoded since it is not in the schema.

Ps.: I could remove new_key with a projection if I knew that it existed (which I do not), or define all the fields I have in the schema with a projection (which leads to crazy long projections when importing 30+ fields). Interestingly, at the root level it works flawlessly; the issue only arises with nested keys, so maybe this really is a bug.

Ps.: If I can help by running more tests, please come back to me; I am happy to contribute if I can.

Thanks,
Sebastian

Trouble reading documents with empty embedded arrays

Goal:
Trying to read a mongo document with an embedded object containing an empty array to a pyarrow table, then write it out as a parquet file.

Expected result:
Parquet file created

Actual Result:
Getting an error from pymongoarrow when creating the pyarrow.Table. Interestingly, reading the same document from mongo directly and using pyarrow.json to create the table works fine. Embedded objects with non-empty arrays, of course, work fine with pymongoarrow.

Steps to reproduce:

from pymongo import MongoClient

import pymongoarrow.api as pmaapi

import pyarrow.parquet as papq
import pyarrow.json as pajson

import io
import json
import bson


client = MongoClient()
collection = client.testdb.data
collection.drop()

client.testdb.data.insert_many([
    { '_id': 1, 'foo':  { 'bar': ['1','2'] } },
    { '_id': 2, 'foo':  { 'bar': [] } }
])

# get document out of mongo, put it in a file and read it with pyarrow and write it to parquet
doc1 = client.testdb.data.find_one({'_id': 1})
string1 = bson.json_util.dumps(doc1, indent = 2) 
file1 = io.BytesIO(bytes(string1, encoding='utf-8'))
papatable1 = pajson.read_json(file1)
print(str(papatable1))
papq.write_table(papatable1, 'pyarrow' + str(1) + '.parquet')

# read document with pymongoarrow and write it to parquet
pmapatable1 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 1}})
print(str(pmapatable1))
papq.write_table(pmapatable1, 'pymongoarrow' + str(1) + '.parquet')



doc2 = client.testdb.data.find_one({'_id': 2})
string2 = bson.json_util.dumps(doc2, indent = 2) 
file2 = io.BytesIO(bytes(string2, encoding='utf-8'))
papatable2 = pajson.read_json(file2)
print(str(papatable2))
papq.write_table(papatable2, 'pyarrow' + str(2) + '.parquet')

pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
papq.write_table(pmapatable2, 'pymongoarrow' + str(2) + '.parquet')

produces

$ python repro.py
pyarrow.Table
_id: int64
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int32
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int64
foo: struct<bar: list<item: null>>
  child 0, bar: list<item: null>
      child 0, item: null
----
_id: [[2]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: null>
[0 nulls]]
Traceback (most recent call last):
  File "/workspaces/vscode-python/pymongoarrow/repro.py", line 45, in <module>
    pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/Envs/pma1/lib/python3.11/site-packages/pymongoarrow/api.py", line 112, in find_arrow_all
    process_bson_stream(batch, context)
  File "pymongoarrow/lib.pyx", line 159, in pymongoarrow.lib.process_bson_stream
  File "pymongoarrow/lib.pyx", line 246, in pymongoarrow.lib.process_raw_bson_stream
  File "pymongoarrow/lib.pyx", line 133, in pymongoarrow.lib.extract_document_dtype
  File "pymongoarrow/lib.pyx", line 108, in pymongoarrow.lib.extract_field_dtype
  File "pyarrow/types.pxi", line 4452, in pyarrow.lib.list_
TypeError: List requires DataType or Field

FWIW, for the three parquet files which are produced, duckdb shows the following...

D select * from 'pyarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pymongoarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int32 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pyarrow2.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar integer[]) │
├───────┼───────────────────────┤
│     2 │ {'bar': []}           │
└───────┴───────────────────────┘
D 

Versions:

Python 3.11.8 (main, Mar 12 2024, 11:41:52) [GCC 12.2.0] on linux
Successfully installed dnspython-2.6.1 numpy-1.26.4 packaging-23.2 pandas-2.2.2 pyarrow-15.0.2 pymongo-4.7.1 pymongoarrow-1.3.0 python-dateutil-2.9.0.post0 pytz-2024.1 six-1.16.0 tzdata-2024.1

bson_iter_type(): precondition failed: iter->raw on find_pandas_all()

Hi, I compiled the head of the repository yesterday to test and ran into the above-mentioned error.

The standard tests (python -m pytest) work fine.
I also imported a number of sample mongo datasets, and running find_pandas_all against them works fine as well.

However when I try to test against my production database :
df = find_pandas_all(client.test.dashboards, {})

I get the following error :
/tmp/mongo-c-driver-20230103-6603-ptzsc0/mongo-c-driver-1.23.2/src/libbson/src/bson/bson-iter.c:477 bson_iter_type(): precondition failed: iter->raw

I couldn't figure out with python -v where exactly it failed; is there an easier way in the library to show the exact row/datatype it failed on?

Is a self-contained installation possible?

Hi,

I am using this package in a Docker environment with a multistage build, i.e. I am building all Python dependencies into a virtual environment in a build stage and copying it to the final production image.

I was hoping pymongoarrow would be self-contained, i.e. that it would not require any additional apt-get installs in the production image.

However, when attempting to use pymongoarrow in the production image I am getting:

/opt/venv/lib/python3.9/site-packages/pymongoarrow/__init__.py:34: UserWarning: Could not find compiled pymongoarrow.lib extension, please install from source or report the following traceback on the issue tracker:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/pymongoarrow/__init__.py", line 32, in <module>
    from pymongoarrow.lib import libbson_version
ImportError: libbson-1.0.so.0: cannot open shared object file: No such file or directory

  warnings.warn(
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)

(...)

    from pymongoarrow.api import Schema
  File "/opt/venv/lib/python3.9/site-packages/pymongoarrow/api.py", line 26, in <module>
    from pymongoarrow.context import PyMongoArrowContext
  File "/opt/venv/lib/python3.9/site-packages/pymongoarrow/context.py", line 16, in <module>
    from pymongoarrow.lib import (
ImportError: libbson-1.0.so.0: cannot open shared object file: No such file or directory

Here are the contents of the pymongoarrow site-packages directory:

$ ls -l /opt/venv/lib/python3.9/site-packages/pymongoarrow/
total 68624
-rw-r--r-- 1 user user     1606 Nov 23 17:16 __init__.py
drwxr-xr-x 2 user user      267 Nov 23 17:16 __pycache__
-rw-r--r-- 1 user user    13902 Nov 23 17:16 api.py
-rw-r--r-- 1 user user     3456 Nov 23 17:16 context.py
-rw-r--r-- 1 user user     1553 Nov 23 17:16 errors.py
-rwxr-xr-x 1 user user  2400424 Nov 23 17:16 lib.cpython-39-aarch64-linux-gnu.so
-rw-r--r-- 1 user user    13466 Nov 23 17:16 lib.pyx
-rw-r--r-- 1 user user     1232 Nov 23 17:16 libarrow.pxd
-rwxr-xr-x 1 user user 56618568 Nov 23 17:16 libarrow.so.900
-rwxr-xr-x 1 user user  2251680 Nov 23 17:16 libarrow_python.so.900
-rw-r--r-- 1 user user     3960 Nov 23 17:16 libbson.pxd
-rwxr-xr-x 1 user user  8915624 Nov 23 17:16 libparquet.so.900
-rw-r--r-- 1 user user     1484 Nov 23 17:16 monkey.py
-rw-r--r-- 1 user user      945 Nov 23 17:16 result.py
-rw-r--r-- 1 user user     2397 Nov 23 17:16 schema.py
-rw-r--r-- 1 user user     4465 Nov 23 17:16 types.py
-rw-r--r-- 1 user user      637 Nov 23 17:16 version.py

I need to install the libbson library explicitly with

sudo apt-get update && sudo apt-get install libbson-1.0-0

then everything is working as expected.

Is this a bug or a feature?

Best,
Wiktor

Documentation should describe advantages over DataFrame constructor (of Pandas)

Converting the output of the pymongo "find()" method to a Pandas DataFrame can be done directly by the DataFrame constructor.

The output of the "find()" method is a Python list containing Python dictionary objects, and this kind of data collection can be directly handled by the DataFrame constructor.

Moreover, the Pandas DataFrame constructor can already handle data of all Python types (particularly lists and dictionaries).

In view of the above, there should be some discussion of the need for this library, and any advantages it may eventually have over the Pandas DataFrame constructor should be documented.
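
For illustration, the two approaches side by side (a sketch; collection is any pymongo Collection). The DataFrame constructor materializes every document as a Python dict first, whereas find_pandas_all decodes the raw BSON stream into Arrow columns, which is where this library aims to add value:

import pandas as pd
from pymongoarrow.api import Schema, find_pandas_all

# Plain PyMongo + the pandas DataFrame constructor:
df1 = pd.DataFrame(list(collection.find({'amount': {'$gt': 100}})))

# PyMongoArrow:
df2 = find_pandas_all(collection, {'amount': {'$gt': 100}},
                      schema=Schema({'_id': int, 'amount': float}))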

ARROW-134 bson.errors.InvalidDocument: cannot encode object: <NA>, of type: <class 'pandas._libs.missing.NAType'>

Package version: 0.6.3
PyMongo version: 4.3.2

Pretty exciting library for us. We use pandas DataFrames heavily and store data in MongoDB.

I am inserting data from pandas dataframes into MongoDB. Many of them contain integer columns with nullable values, which are encoded with pandas' nullable integer type that uses the pandas.NA value (not numpy.nan) to represent nulls.

Here is a simple reproduction of the issue:

import pandas as pd
from pymongoarrow.api import write

from src import db


db = db.get_college_db("qweqwe", "historic")

df = pd.DataFrame({
    "one": [1, 2, 3, None],
})
df["one"] = df["one"].astype("Int64")

print(df.dtypes)

write(db.collection, df)

Raises error:
bson.errors.InvalidDocument: cannot encode object: <NA>, of type: <class 'pandas._libs.missing.NAType'>
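
One possible workaround sketch until write() handles pandas nullable extension dtypes: convert pd.NA to plain None and insert with PyMongo directly:

# Cast to object so pd.NA can be replaced by None, then insert with PyMongo;
# None is encoded as a BSON null.
records = df.astype(object).where(df.notna(), None).to_dict('records')
db.collection.insert_many(records)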

Compatibility with PyMongo 4.0?

I have seen in the documentation that PyMongoArrow is not compatible with PyMongo >= 4.0.
Is that still true for PyMongoArrow 0.2.0?

Thanks in advance !

AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'

Hi,

I'm trying to load a list of nested objects (a list of structs in pyarrow). I tried both pymongoarrow 0.7.0 and commit 25a8832, and both result in AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'.
In case it matters, I am installing with pip.

EDIT: the issue turned out to be a list of float32 within a struct -> once I changed the schema to use float64 things work as expected.
rtfm with more attention if you don't want to waste hours 🤦

It would be nice if the library failed earlier with a more human-friendly message, such as "the pa.float32 type is not supported".

here is my original issue

among other things (simple types) that work as expected, here is the (simplified) object that gets parsed correctly (list of structs containing simple types only)

{
    "_id" : ObjectId("someId"),
    "parent" : {
        "child" : [ 
            {
                "fieldA" : "valueA"
            }, 
            {
                "fieldA" : "valueB"
            }
        ]
    }
}

note that I am escaping the dots with underscores by projecting the nested fields in an aggregation pipeline, so that my actual input to pymongoarrow looks like:

{
    "_id" : ObjectId("someId"),
    "parent_child" :  [ 
        {
            "fieldA" : "valueA"
        }, 
        {
            "fieldA" : "valueB"
        }
    ]
}

and here is the corresponding schema definition

parent_child_fields = [
    pa.field("fieldA", pa.string())
]
parent_child_schema = pa.list_(pa.struct(parent_child_fields))

schema_dict = {
    "_id": ObjectId,
    "parent_child": parent_child_schema
}

schema = Schema(schema_dict)

Finally, note that this is a simplified example; in reality I have more fields (also nested) that I would like to include in parent_child_schema as nested structs.

Is this scenario supported?
What am I missing?

Here is the full trace:
File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/api.py:193, in aggregate_pandas_all(collection, pipeline, schema, **kwargs)
    175 def aggregate_pandas_all(collection, pipeline, *, schema=None, **kwargs):
    176     """Method that returns the results of an aggregation pipeline as a
    177     :class:`pandas.DataFrame` instance.
    178 
   (...)
    191       An instance of class:`pandas.DataFrame`.
    192     """
--> 193     return _arrow_to_pandas(aggregate_arrow_all(collection, pipeline, schema=schema, **kwargs))

File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/api.py:118, in aggregate_arrow_all(collection, pipeline, schema, **kwargs)
    100 def aggregate_arrow_all(collection, pipeline, *, schema=None, **kwargs):
    101     """Method that returns the results of an aggregation pipeline as a
    102     :class:`pyarrow.Table` instance.
    103 
   (...)
    116       An instance of class:`pyarrow.Table`.
    117     """
--> 118     context = PyMongoArrowContext.from_schema(schema, codec_options=collection.codec_options)
    120     if pipeline and ("$out" in pipeline[-1] or "$merge" in pipeline[-1]):
    121         raise ValueError(
    122             "Aggregation pipelines containing a '$out' or '$merge' stage are "
    123             "not supported by PyMongoArrow"
    124         )

File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/context.py:97, in PyMongoArrowContext.from_schema(cls, schema, codec_options)
     95 elif builder_cls == ListBuilder:
     96     arrow_type = schema.typemap[fname]
---> 97     builder_map[encoded_fname] = ListBuilder(arrow_type, tzinfo)
     98 elif builder_cls == BinaryBuilder:
     99     subtype = schema.typemap[fname].subtype

File pymongoarrow/lib.pyx:806, in pymongoarrow.lib.ListBuilder.__cinit__()

File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()

File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()

File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()

File pymongoarrow/lib.pyx:718, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:806, in pymongoarrow.lib.ListBuilder.__cinit__()

File pymongoarrow/lib.pyx:719, in pymongoarrow.lib.get_field_builder()

AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'

thank you,
wiktor

How to define data types for UUID and array columns

I have a schema that contains a coordinate array and a UUID column, and I couldn't figure out a way to define the schema properly; Schema returns a ValueError. I tried bson.binary.Binary for the UUID and list for the array.

schema = Schema({...})
collection.find_pandas_all({}, schema=schema)

Here is an example data:

{
  "id": UUID,
  "data": {"type": "Point", "coordinates": [6.083531, 52.53661]}
}
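
A hedged sketch for the GeoJSON-style part of that document: the coordinates array can be declared as a list of doubles inside a nested mapping; how best to map the UUID column is less clear (it may need to be read as binary or converted to a string in an aggregation stage first):

import pyarrow as pa
from pymongoarrow.api import Schema, find_pandas_all

# Declare only the nested "data" document; field names match the example above.
schema = Schema({
    'data': {'type': pa.string(), 'coordinates': pa.list_(pa.float64())},
})
df = find_pandas_all(collection, {}, schema=schema)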
