fuyb1992 / es_pandas


Read, write and update large scale pandas DataFrame with Elasticsearch

License: MIT License

Python 100.00%

pandas elasticsearch large-scale

es_pandas's Introduction

es_pandas

Read, write and update large scale pandas DataFrame with ElasticSearch.

Requirements

This package should work on Python 3 (>= 3.4). Elasticsearch should be version 5.x, 6.x or 7.x.

Installation

The package is hosted on PyPI and can be installed with pip:

pip install es_pandas

Deprecation Notice

Support for Elasticsearch 5.x will be deprecated in a future version.

Usage

import time

import pandas as pd

from es_pandas import es_pandas


# Information about the es cluster
es_host = 'localhost:9200'
index = 'demo'

# create es_pandas instance
ep = es_pandas(es_host)

# Example data frame
df = pd.DataFrame({'Num': range(100000)})
df['Alpha'] = 'Hello'
# pd.datetime was removed from pandas; pd.Timestamp.now() is the replacement
df['Date'] = pd.Timestamp.now()

# init template if you want
doc_type = 'demo'
ep.init_es_tmpl(df, doc_type)

# Example of writing data to es, using the template created above
ep.to_es(df, index, doc_type=doc_type, thread_count=2, chunk_size=10000)

# set use_index=True if you want to use DataFrame index as records' _id
ep.to_es(df, index, doc_type=doc_type, use_index=True, thread_count=2, chunk_size=10000)

# delete records from es
ep.to_es(df.iloc[5000:], index, doc_type=doc_type, _op_type='delete', thread_count=2, chunk_size=10000)

# Update doc by doc _id
df.iloc[:1000, 1] = 'Bye'
df.iloc[:1000, 2] = pd.Timestamp.now()
ep.to_es(df.iloc[:1000, 1:], index, doc_type=doc_type, _op_type='update')

# Example of read data from es
df = ep.to_pandas(index)
print(df.head())

# return certain fields in es
heads = ['Num', 'Date']
df = ep.to_pandas(index, heads=heads)
print(df.head())

# set certain columns dtype
dtype = {'Num': 'float', 'Alpha': object}
df = ep.to_pandas(index, dtype=dtype)
print(df.dtypes)

# infer dtype from es template
df = ep.to_pandas(index, infer_dtype=True)
print(df.dtypes)

# use the query_sql parameter if you want to query with SQL

# Example of writing data to es with pandas.io.json
ep.to_es(df, index, doc_type=doc_type, use_pandas_json=True, thread_count=2, chunk_size=10000)
print('write es doc with pandas.io.json finished')

es_pandas's Issues

Is there any way to force pushing all columns as strings?

While importing, pandas converts phone numbers to float, so converting them to string adds .0 at the end.
I decided to check every line for a trailing .0 and strip it if present, but now importing is 100x slower.
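
One possible workaround, sketched under the assumption that the phone numbers are whole numbers that pandas parsed as floats: cast through the nullable Int64 dtype before converting to string, or read the column as text in the first place. Both are vectorized, so they avoid a per-line check. The column name phone and the sample data are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample: the missing value makes pandas parse 'phone' as float.
csv_text = "name,phone\nann,5551234567\nbob,\n"

# Option 1: repair an already-loaded float column in one vectorized step.
df = pd.read_csv(io.StringIO(csv_text))
phones_fixed = df["phone"].astype("Int64").astype("string")

# Option 2: avoid the float conversion by reading every column as text.
df_text = pd.read_csv(io.StringIO(csv_text), dtype=str)
```

Missing values stay missing in both variants instead of turning into the string 'nan'.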

User credentials

Can I pass a username and password to the es connection?
Thanks

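Support loading data from multiple indices with the same mapping into one DataFrame

For indices that share the same mapping, such as data stored by date, it should be possible to load the data from several indices into a single DataFrame.
For example: index-2022-01, index-2022-02, index-2022-03 ...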
df = ep.to_pandas('index-2022*',...)
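
Until wildcard support is confirmed, a client-side workaround is to load each index separately and concatenate the results. The sketch below takes the loader as a parameter (in practice something like ep.to_pandas). The names load_indices and fake_loader are hypothetical and not part of es_pandas.

```python
import pandas as pd


def load_indices(loader, index_names, **kwargs):
    """Load each index with `loader` and stack the results into one DataFrame."""
    frames = [loader(name, **kwargs) for name in index_names]
    return pd.concat(frames)


# Stand-in for ep.to_pandas so the sketch runs without a cluster.
def fake_loader(index_name):
    return pd.DataFrame({"Num": [1, 2]}, index=[index_name + "-a", index_name + "-b"])


combined = load_indices(fake_loader, ["index-2022-01", "index-2022-02"])
```

This costs one round of requests per index, so it only makes sense for a moderate number of indices.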

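The document count of an index should not be computed when progress is not shown

If progress is not displayed, the number of documents in the index should not be counted.
This saves overhead, especially when an index holds a huge number of documents or when there are many indices.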

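Why is _id auto-generated on the first write?

On the first write, es generates the _id randomly. When I later want to update, how can I use it to identify a record uniquely? (Querying the records first and then updating works, but is sometimes unnecessary.)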
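
One way to avoid random ids is to derive a deterministic _id from the columns that identify a row, set it as the DataFrame index, and write with use_index=True as shown in the README. The sketch below only builds the index. The helper name add_stable_id and the choice of key columns are assumptions for illustration.

```python
import hashlib

import pandas as pd


def add_stable_id(df, key_columns):
    """Return a copy of df indexed by a SHA-1 hash of the key columns."""
    joined = df[key_columns].astype(str).apply("|".join, axis=1)
    ids = joined.map(lambda s: hashlib.sha1(s.encode("utf-8")).hexdigest())
    return df.set_index(ids.rename("_id"))


df = pd.DataFrame({"Num": [1, 2], "Alpha": ["Hello", "Bye"]})
df_with_id = add_stable_id(df, ["Num"])
# Later, against a live cluster:
# ep.to_es(df_with_id, index, doc_type=doc_type, use_index=True)
```

Because the id depends only on the key columns, rebuilding it from the same data yields the same _id, so an update can target the document without querying first.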

ModuleNotFoundError: No module named 'progressbar'

progressbar2 is not being installed when installing the package

How to implement authentication support through the library?

_op_type='update' not working

Running the command below does not update the records in Elasticsearch.

ep.to_es(df.iloc[:1000, 1:], index, doc_type=doc_type, _op_type='update')

N/A% (0 of 1000) | | Elapsed Time: 0:00:00 ETA: --:--:--
1000

Unable to upload array as value

(Edited)
I'm having the following failure trying to upload a value with an array.

>>> response = ep.to_es(df, index='myindex', _op_type='update')
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

It seems that the serialize function fails because pd.isna returns an array when its input is an array.
Could you please consider wrapping the pd.isna output in np.all so that it always produces a single boolean and arrays can be processed?
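
A minimal sketch of the suggested check, assuming the goal is one boolean per value. The helper name is_missing is hypothetical and this is not the library's actual serialize code.

```python
import numpy as np
import pandas as pd


def is_missing(value):
    """True if value is NA, or if it is array-like and every element is NA."""
    return bool(np.all(pd.isna(value)))


scalar_nan = is_missing(float("nan"))
full_array = is_missing(np.array([1.0, 2.0]))
partly_nan = is_missing(np.array([1.0, np.nan]))
```

Note that np.all of an empty array is True, so an empty array would count as missing under this definition. Whether that is desirable is a design choice.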

to_pandas sql error

Hi there,
I am doing a very simple sql query like this:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers.errors import BulkIndexError
import time
import pandas as pd
from es_pandas import es_pandas


import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# create es_pandas instance
es = es_pandas(es_url,verify_certs=False,ssl_show_warn=False)

es.to_pandas(index='priam_unified_host-2021-05-03', query_sql='select top 10 * from day-2021-05-03 WHERE EventID=4688')

I get this error:

TypeError: search() got an unexpected keyword argument 'query_sql'

AttributeError: module 'progressbar' has no attribute '__version__'

Summary

In a Python 3.6 virtual environment on Ubuntu 14.04.5 LTS, after installing es_pandas and progressbar2, I get the error "AttributeError: module 'progressbar' has no attribute '__version__'" when trying to run:
from es_pandas import es_pandas

Details

root@ns502245:~# source p36/bin/activate
(p36) root@ns502245:~# pip install progressbar2
Requirement already satisfied: progressbar2 in ./p36/lib/python3.6/site-packages (3.50.0)
Requirement already satisfied: six in ./p36/lib/python3.6/site-packages (from progressbar2) (1.13.0)
Requirement already satisfied: python-utils>=2.3.0 in ./p36/lib/python3.6/site-packages (from progressbar2) (2.4.0)
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(p36) root@ns502245:~# python
Python 3.6.9 (default, Nov 19 2019, 14:10:59)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from es_pandas import es_pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/p36/lib/python3.6/site-packages/es_pandas/__init__.py", line 1, in <module>
    from .es_pandas import es_pandas
  File "/root/p36/lib/python3.6/site-packages/es_pandas/es_pandas.py", line 7, in <module>
    if not progressbar.__version__.startswith('3.'):
AttributeError: module 'progressbar' has no attribute '__version__'
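
A version check that does not rely on a module-level __version__ attribute can query the installed distribution metadata instead. This is a sketch, assuming Python 3.8+ (for importlib.metadata) and that the distribution is named progressbar2. The helper installed_version is hypothetical and not what es_pandas does.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


pb_version = installed_version("progressbar2")
# True only when progressbar2 is installed and its major version is at least 3.
has_progressbar2_v3_plus = (
    pb_version is not None and int(pb_version.split(".")[0]) >= 3
)
```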

Disable progress bar on to_es

Is there a way to disable the progress bar when writing a DataFrame to ES?

Thanks

Version check fails using SNAPSHOT

The version of my Elasticsearch instance ends with SNAPSHOT, which causes initialization to fail.
Version:

7.9.1-SNAPSHOT

Error I'm getting

ValueError: invalid literal for int() with base 10: '1-SNAPSHOT'
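
The error comes from calling int() on a component that still carries the suffix. A tolerant parser can read only the leading numeric parts. The function parse_es_version below is a hypothetical sketch, not the library's code.

```python
import re


def parse_es_version(version_string):
    """Extract (major, minor, patch) from strings such as '7.9.1-SNAPSHOT'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_string)
    return tuple(int(part) for part in match.groups())


snapshot_version = parse_es_version("7.9.1-SNAPSHOT")
release_version = parse_es_version("6.8.0")
```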

to_es error with show_progress=False

With version 0.17, to_es raises an error when show_progress=False:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/elasticsearch/helpers/__init__.py", line 304, in parallel_bulk
    for result in pool.imap(
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 144, in _helper_reraises_exception
    raise ex
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 388, in _guarded_task_generation
    for i, x in enumerate(iterable):
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/elasticsearch/helpers/__init__.py", line 58, in _chunk_actions
    for action, data in actions:
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/es_pandas/es_pandas.py", line 136, in rec_to_actions
    bar.update(i)
TypeError: update() takes 1 positional argument but 2 were given
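
The traceback suggests that the object used when progress is disabled has an update() method that accepts no value. One common fix is a no-op stand-in that mirrors the small part of the progress bar interface the caller uses. NullBar and iterate_with_progress below are hypothetical and not es_pandas code.

```python
class NullBar:
    """No-op progress bar: accepts the same calls and does nothing."""

    def update(self, value=None):
        pass

    def finish(self):
        pass


def iterate_with_progress(items, bar):
    """Yield items while reporting the position to any bar-like object."""
    for i, item in enumerate(items):
        bar.update(i)
        yield item
    bar.finish()


rows = list(iterate_with_progress(["a", "b", "c"], NullBar()))
```

With this shape the calling code does not need to branch on show_progress at every update.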

How to implement pagination support through the library ?

How can pagination be implemented, specifically the from/size parameters, given that the query_rule parameter does not accept from and size?

In to_pandas, set_index should be called after the dtype is applied; otherwise resetting the type of '_id' through dtype fails

Current code:

    df = pd.DataFrame(self.get_source(anl, show_progress=show_progress, count=count)).set_index('_id')
    if infer_dtype:
        dtype = self.infer_dtype(index, df.columns.values)
    if len(dtype):
        df = df.astype(dtype)
    return df

Proposed change:

    df = pd.DataFrame(self.get_source(anl, show_progress=show_progress, count=count))
    if infer_dtype:
        dtype = self.infer_dtype(index, df.columns.values)
    if len(dtype):
        df = df.astype(dtype)
    df = df.set_index('_id')   # call set_index just before returning
    return df
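
The ordering problem can be reproduced with plain pandas, independent of Elasticsearch. In the sketch below the records and the dtype mapping are made-up sample data. Once '_id' has become the index it is no longer a column, so astype rejects it as a key.

```python
import pandas as pd

records = [{"_id": "1", "Num": "10"}, {"_id": "2", "Num": "20"}]
dtype = {"_id": "int64", "Num": "float64"}

# Casting after set_index fails: '_id' is no longer a column.
try:
    pd.DataFrame(records).set_index("_id").astype(dtype)
    cast_after_index_failed = False
except KeyError:
    cast_after_index_failed = True

# Casting first and indexing afterwards applies the dtype to '_id' as well.
df = pd.DataFrame(records).astype(dtype).set_index("_id")
```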

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'request does not support [_source]')

The second call of to_pandas with the default value of query_rule raises an exception. The default value of query_rule should be None, and the method should set the actual default internally when the user does not pass a query_rule.
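
The described behavior is consistent with a mutable default argument being modified in place, although the library source is not shown here, so that is an assumption. The two functions below are hypothetical and only illustrate the pitfall and the None-default fix that the report proposes.

```python
def build_query_buggy(heads=(), query_rule={"query": {"match_all": {}}}):
    # The default dict is created once and shared by every call.
    if heads:
        query_rule["_source"] = list(heads)
    return query_rule


def build_query_fixed(heads=(), query_rule=None):
    # Build a fresh default per call and copy user input before modifying it.
    query = {"query": {"match_all": {}}} if query_rule is None else dict(query_rule)
    if heads:
        query["_source"] = list(heads)
    return query


build_query_buggy(heads=["Num"])
leaked = "_source" in build_query_buggy()      # first call changed the default

build_query_fixed(heads=["Num"])
clean = "_source" not in build_query_fixed()   # each call starts fresh
```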

sql_query fetch size

Hi there,
which parameter should I pass to set the fetch size? For reference:

https://www.elastic.co/guide/en/elasticsearch/reference/current/sql-translate.html

POST /_sql/translate
{
  "query": "SELECT * FROM library ORDER BY page_count DESC",
  "fetch_size": 10
}

Something wrong when you run template code

I just ran pip install es_pandas and installed the other packages, including progressbar2 (>3), but it does not work.

I get the following error message:
Incorrect version of progerssbar package, please do pip install progressbar2
However, the version that Python detects is that of the python_utils package. After I fixed that, the following error appears:
TypeError: __init__() got an unexpected keyword argument 'max_value'