
es_pandas's Introduction

es_pandas


Read, write, and update large-scale pandas DataFrames with Elasticsearch.

Requirements

This package requires Python 3 (>=3.4) and Elasticsearch 5.x, 6.x, or 7.x.

Installation

The package is hosted on PyPI and can be installed with pip:

pip install es_pandas

Deprecation Notice

Support for Elasticsearch 5.x will be deprecated in a future version.

Usage

import time

import pandas as pd

from es_pandas import es_pandas


# Information about the es cluster
es_host = 'localhost:9200'
index = 'demo'

# create es_pandas instance
ep = es_pandas(es_host)

# Example data frame
df = pd.DataFrame({'Num': [x for x in range(100000)]})
df['Alpha'] = 'Hello'
df['Date'] = pd.Timestamp.now()  # pd.datetime is not available in recent pandas releases

# optionally create an index template from the DataFrame
doc_type = 'demo'
ep.init_es_tmpl(df, doc_type)

# Example of writing data to es, using the template created above
ep.to_es(df, index, doc_type=doc_type, thread_count=2, chunk_size=10000)

# set use_index=True if you want to use DataFrame index as records' _id
ep.to_es(df, index, doc_type=doc_type, use_index=True, thread_count=2, chunk_size=10000)

# delete records from es
ep.to_es(df.iloc[5000:], index, doc_type=doc_type, _op_type='delete', thread_count=2, chunk_size=10000)

# Update docs by their _id
df.iloc[:1000, 1] = 'Bye'
df.iloc[:1000, 2] = pd.Timestamp.now()
ep.to_es(df.iloc[:1000, 1:], index, doc_type=doc_type, _op_type='update')

# Example of read data from es
df = ep.to_pandas(index)
print(df.head())

# return certain fields in es
heads = ['Num', 'Date']
df = ep.to_pandas(index, heads=heads)
print(df.head())

# set certain columns dtype
dtype = {'Num': 'float', 'Alpha': object}
df = ep.to_pandas(index, dtype=dtype)
print(df.dtypes)

# infer dtype from es template
df = ep.to_pandas(index, infer_dtype=True)
print(df.dtypes)

# use query_sql parameter if you want to do query in sql

# Example of writing data to es with pandas.io.json
ep.to_es(df, index, doc_type=doc_type, use_pandas_json=True, thread_count=2, chunk_size=10000)
print('write es doc with pandas.io.json finished')

es_pandas's People

Contributors

elestudent, fuyb1992, nachtsky1077, xuehh, zhangbk920209


es_pandas's Issues

Any way to force pushing all columns as strings?

While importing, pandas turns phone numbers into floats, so converting them to strings adds .0 at the end.
I decided to check for .0 on every line and strip it if present, but now importing is 100x slower.
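A minimal sketch of a vectorized alternative to stripping .0 line by line, using only pandas. The helper name to_clean_strings is hypothetical, and the approach assumes the column holds whole numbers (it raises if a value has a fractional part):

```python
import pandas as pd


def to_clean_strings(series):
    """Convert a whole-number column to strings without a trailing '.0'."""
    # Nullable Int64 keeps missing values as <NA> instead of forcing float.
    as_int = pd.to_numeric(series, errors='coerce').astype('Int64')
    # The pandas 'string' dtype renders integers without a decimal part.
    return as_int.astype('string')


# Phone numbers that pandas parsed as float because of the missing value.
phones = pd.Series([5551234567.0, None, 5559876543.0])
cleaned = to_clean_strings(phones)
print(cleaned)
```

If the data comes from a CSV file, passing dtype=str to pandas.read_csv avoids the float conversion in the first place.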

_op_type='update' not working

Running the command below does not update the records in Elasticsearch.

ep.to_es(df.iloc[:1000, 1:], index, doc_type=doc_type, _op_type='update')

N/A% (0 of 1000) | | Elapsed Time: 0:00:00 ETA: --:--:--
1000

Unable to upload array as value

(Edited)
I get the following failure when trying to upload a value that is an array.

>>> response = ep.to_es(df, index='myindex', _op_type='update')
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

It seems that the serialize function fails because pd.isna returns an array when its input is an array.
Could you please consider wrapping the pd.isna output in np.all so that it always produces a single boolean and arrays can be processed?
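A minimal sketch of the suggested wrapper, assuming nothing about the library's actual serialize function. The helper name is_missing is hypothetical:

```python
import numpy as np
import pandas as pd


def is_missing(value):
    """Return one boolean for scalars and array-like values alike."""
    # pd.isna returns an array for array-like input; np.all collapses it,
    # so a value counts as missing only if every element is missing.
    # Caveat: an empty array counts as missing, because np.all of an
    # empty array is True.
    return bool(np.all(pd.isna(value)))


# Using pd.isna directly in an `if` raises the ValueError reported above
# when the value is an array; the wrapper does not.
print(is_missing(np.array([1.0, 2.0])))
print(is_missing(float('nan')))
```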

to_pandas sql error

Hi there,
I am doing a very simple sql query like this:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers.errors import BulkIndexError
import time
import pandas as pd
from es_pandas import es_pandas


import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# create es_pandas instance
es = es_pandas(es_url,verify_certs=False,ssl_show_warn=False)

es.to_pandas(index='priam_unified_host-2021-05-03', query_sql='select top 10 * from day-2021-05-03 WHERE EventID=4688')

I get this error:

TypeError: search() got an unexpected keyword argument 'query_sql'

AttributeError: module 'progressbar' has no attribute '__version__'

Summary

In a Python 3.6 virtual environment on Ubuntu 14.04.5 LTS, after installing es_pandas and progressbar2, I get the error "AttributeError: module 'progressbar' has no attribute '__version__'" when running:
from es_pandas import es_pandas

Details

root@ns502245:~# source p36/bin/activate
(p36) root@ns502245:~# pip install progressbar2
Requirement already satisfied: progressbar2 in ./p36/lib/python3.6/site-packages (3.50.0)
Requirement already satisfied: six in ./p36/lib/python3.6/site-packages (from progressbar2) (1.13.0)
Requirement already satisfied: python-utils>=2.3.0 in ./p36/lib/python3.6/site-packages (from progressbar2) (2.4.0)
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(p36) root@ns502245:~# python
Python 3.6.9 (default, Nov 19 2019, 14:10:59)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from es_pandas import es_pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/p36/lib/python3.6/site-packages/es_pandas/__init__.py", line 1, in <module>
    from .es_pandas import es_pandas
  File "/root/p36/lib/python3.6/site-packages/es_pandas/es_pandas.py", line 7, in <module>
    if not progressbar.__version__.startswith('3.'):
AttributeError: module 'progressbar' has no attribute '__version__'
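A hedged sketch of a more defensive version check. The helper module_major_version is hypothetical and is not part of es_pandas; the stand-in objects below only imitate a module with and without a __version__ attribute:

```python
import types


def module_major_version(module):
    """Return the major version as a string, or None if __version__ is absent."""
    version = getattr(module, '__version__', None)
    if version is None:
        return None
    return str(version).split('.')[0]


# Stand-ins for illustration only, not the real progressbar module.
with_version = types.SimpleNamespace(__version__='3.50.0')
without_version = types.SimpleNamespace()

print(module_major_version(with_version))
print(module_major_version(without_version))
```

With a check like this, a missing attribute can be reported with a clear message instead of surfacing as an AttributeError at import time.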

Version check fails using SNAPSHOT

The version of my Elasticsearch instance ends with SNAPSHOT, which causes initialization to fail.
Version:

7.9.1-SNAPSHOT

Error I'm getting

ValueError: invalid literal for int() with base 10: '1-SNAPSHOT'
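A minimal sketch of version parsing that tolerates suffixes such as -SNAPSHOT. The function name parse_es_version is hypothetical, and this is not the library's actual version check:

```python
import re


def parse_es_version(version):
    """Extract (major, minor, patch) from strings like '7.9.1-SNAPSHOT'."""
    # Match only the leading numeric components and ignore any suffix.
    match = re.match(r'(\d+)\.(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError('Unrecognized Elasticsearch version: {!r}'.format(version))
    return tuple(int(part) for part in match.groups())


major, minor, patch = parse_es_version('7.9.1-SNAPSHOT')
print(major, minor, patch)
```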

to_es error with show_progress=False

With version 0.17, to_es raises an error when show_progress=False:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/elasticsearch/helpers/__init__.py", line 304, in parallel_bulk
    for result in pool.imap(
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 144, in _helper_reraises_exception
    raise ex
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 388, in _guarded_task_generation
    for i, x in enumerate(iterable):
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/elasticsearch/helpers/__init__.py", line 58, in _chunk_actions
    for action, data in actions:
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/es_pandas/es_pandas.py", line 136, in rec_to_actions
    bar.update(i)
TypeError: update() takes 1 positional argument but 2 were given
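The traceback shows that, with progress disabled, the object bound to bar has an update() that accepts no arguments. One possible fix, sketched under the assumption that a placeholder object stands in for the real progress bar (this has not been checked against the es_pandas source), is a null object whose methods accept and ignore any arguments. NullProgressBar and iterate_with_progress are hypothetical names:

```python
class NullProgressBar:
    """Placeholder progress bar whose methods accept and ignore any arguments."""

    def start(self, *args, **kwargs):
        return self

    def update(self, *args, **kwargs):
        pass

    def finish(self, *args, **kwargs):
        pass


def iterate_with_progress(items, bar):
    """Yield items while reporting progress to any bar-like object."""
    bar.start()
    for i, item in enumerate(items):
        bar.update(i)  # works with a real bar and with the null object
        yield item
    bar.finish()


result = list(iterate_with_progress(['a', 'b', 'c'], NullProgressBar()))
print(result)
```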

In to_pandas, set_index should be called after setting dtype; otherwise resetting the type of '_id' via dtype fails

    df = pd.DataFrame(self.get_source(anl, show_progress=show_progress, count=count)).set_index('_id')
    if infer_dtype:
        dtype = self.infer_dtype(index, df.columns.values)
    if len(dtype):
        df = df.astype(dtype)
    return df

    df = pd.DataFrame(self.get_source(anl, show_progress=show_progress, count=count))
    if infer_dtype:
        dtype = self.infer_dtype(index, df.columns.values)
    if len(dtype):
        df = df.astype(dtype)
    df = df.set_index('_id')   # call set_index just before returning
    return df
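A minimal sketch of the proposed ordering using plain pandas. The helper records_to_frame is hypothetical and stands in for the relevant part of to_pandas; the records are made-up sample data:

```python
import pandas as pd


def records_to_frame(records, dtype=None):
    """Build a DataFrame, apply dtypes, and only then move '_id' to the index."""
    df = pd.DataFrame(records)
    if dtype:
        # '_id' is still a regular column here, so it can be cast as well.
        df = df.astype(dtype)
    return df.set_index('_id')


records = [{'_id': '1', 'Num': 10}, {'_id': '2', 'Num': 20}]
df = records_to_frame(records, dtype={'_id': 'int64', 'Num': 'float'})
print(df.index.dtype, df['Num'].dtype)
```

If set_index ran first, '_id' would no longer be a column, and pandas would reject the '_id' key in the dtype mapping.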

Something wrong when you run template code

I just ran pip install es_pandas and installed the other packages, including progressbar2 (>3), but it doesn't work.

I get the following error message:
Incorrect version of progerssbar package, please do pip install progressbar2
It turned out that the version Python detects is that of the python_utils package. After I fixed that, the following error appears:
TypeError: __init__() got an unexpected keyword argument 'max_value'
