Comments (23)
From this description, it sounds like the issue is at the psycopg2 layer rather than with sqlalchemy-redshift. If you can configure psycopg2 to have reasonable behavior for your case, we can likely figure out how to expose that through sqlalchemy, but if it's a psycopg2 limitation, that would need to be addressed upstream.
from sqlalchemy-redshift.
Sounds like issue #100 to me.
@graingert this really seems like abnormal behaviour for cursors. I know the psycopg driver isn't native to Redshift, but cursors are still very important. I can't go to production with a Python app because of this bug.
I will answer ASAP, thanks.
@graingert for the Redshift connection I tried both:
- raw SQL commands
begin;
DECLARE curname cursor FOR
SELECT * FROM schema.table;
fetch forward 10 from curname;
fetch next from curname;
close curname;
commit;
- the server-side cursors that psycopg2 advises
import psycopg2

command_text = """
SELECT * FROM schema.table;
"""
conn = psycopg2.connect(host="HOST", dbname="dbname", password="****", user="user", port="5439")
cursor = conn.cursor('curname')  # named => server-side cursor
cursor.itersize = 100
cursor.execute(command_text)
for item in cursor:
    print(item)
Both allocate memory server-side before sending the actual response. So if my table contains a hundred million rows, it materializes the whole result before returning anything, even if I set itersize up to 1000. That's really abnormal to me.
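For reference, the raw SQL approach above can be wrapped in Python so psycopg2 sends explicit DECLARE/FETCH statements instead of using a named cursor. This is a minimal sketch, not tested against Redshift; the function names and defaults are my own, and only the statement-building part runs without a server. (As the later comments show, Redshift still materializes the cursor result set on the leader node, so this changes where the limit bites rather than removing it.)

```python
def cursor_statements(curname, query, batch_size):
    """Yield the raw SQL needed to page through `query` via an explicit cursor."""
    yield "BEGIN;"
    yield "DECLARE {0} CURSOR FOR {1};".format(curname, query)
    yield "FETCH FORWARD {0} FROM {1};".format(batch_size, curname)  # repeat until empty
    yield "CLOSE {0};".format(curname)
    yield "COMMIT;"


def iter_rows(conn, query, batch_size=1000, curname="curname"):
    """Yield rows in batches through an explicit cursor on a psycopg2 connection."""
    cur = conn.cursor()  # plain (unnamed) cursor; we drive DECLARE/FETCH ourselves
    cur.execute("BEGIN;")
    cur.execute("DECLARE {0} CURSOR FOR {1}".format(curname, query))
    try:
        while True:
            cur.execute("FETCH FORWARD {0} FROM {1}".format(batch_size, curname))
            rows = cur.fetchall()  # only this batch is held client-side
            if not rows:
                break
            for row in rows:
                yield row
    finally:
        cur.execute("CLOSE {0}".format(curname))
        cur.execute("COMMIT;")
        cur.close()
```

Usage would be `for row in iter_rows(conn, "SELECT * FROM schema.table"): ...` with `conn` from `psycopg2.connect(...)`.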
When I used native JDBC connection, then evth was good.
@Rustem sorry, it's not clear what you mean. When you use part 2 (using curname), does it run without taking up all your memory?
How about when using .execution_options(stream_results=True)?
@Profiler.profile
def test_core_fetchmany_w_streaming(n):
    """Load Core result rows using fetchmany/streaming."""
    with engine.connect() as conn:
        result = conn.execution_options(stream_results=True).\
            execute(Customer.__table__.select().limit(n))
        while True:
            chunk = result.fetchmany(10000)
            if not chunk:
                break
            for row in chunk:
                data = row['id'], row['name'], row['description']
I think I tried this one; I will try it again and share my results here.
@Rustem when you use part 2 (psycopg2, using curname) does it run without taking up all your memory?
It works for Postgres, @graingert. I am trying it with Redshift now.
I am looking through the dialect code and see that the execution option called stream_results
just reuses the psycopg2 cursor. However, the way it should work is to use the native cursor SQL commands. It seems psycopg2 follows https://www.postgresql.org/docs/9.2/static/plpgsql-cursors.html, but Redshift cursors have a different syntax for operating on them. As soon as I profile this on Redshift I will update my comment. There are two options now: either extend the dialect, or keep it as is if it works.
btw, why don't you try something like redshift-odbc? http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-windows.html
So it does not work for now; with 600 million records it takes too long.
The AWS console shows the query running for minutes with nothing updated.
Eventually it raised an exception server-side:
Traceback (most recent call last):
  File "redshift_amazon.py", line 16, in <module>
    result = conn.execution_options(stream_results=True).execute(command_text)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 906, in execute
    return self._execute_text(object, multiparams, params)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 1054, in _execute_text
    statement, parameters
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
    result = context._setup_crud_result_proxy()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/default.py", line 822, in _setup_crud_result_proxy
    result = self.get_result_proxy()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 472, in get_result_proxy
    return _result.BufferedRowResultProxy(self)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 497, in __init__
    self._init_metadata()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1106, in _init_metadata
    self.__buffer_rows()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1128, in __buffer_rows
    self.__rowbuffer = collections.deque(self.cursor.fetchmany(size))
psycopg2.InternalError: exceeded the maximum size allowed for the total set of cursor data: 16000MB.
I used the following code to test:
import sqlalchemy as sa

command_text = """
SELECT * from lineorder
"""
url = 'redshift+psycopg2://host:5439/dbname'
engine = sa.create_engine(
    url,
    poolclass=sa.pool.QueuePool,
    max_overflow=10,
    pool_size=int(5),
    pool_recycle=10000,
    pool_timeout=300)

with engine.connect() as conn:
    result = conn.execution_options(stream_results=True).execute(command_text)
    print("connected")
    while True:
        chunk = result.fetchmany(50)
        if not chunk:
            break
        for row in chunk:
            # data = row['p_partkey'], row['p_name']
            print(row)
@Rustem when you use part 2 (psycopg2, using curname) does it run without taking up all your memory?
@graingert no, same situation. It allocates memory on the server.
memory on server
on the server or the client?
@graingert on the server
It will raise client-side if you use a basic cursor.
However, when I test with a JDBC connection (pure Java code), everything works fine, and I observe that it does not use cursors at all; it just streams the result. According to the server query log, I see only SELECT * FROM lineorder, yet the console feeds data immediately.
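To make the client-side vs server-side distinction in this exchange concrete, here is a toy model, not real psycopg2 and with all names invented: a plain cursor materializes every row on the client at execute() time, while a server-side cursor pulls only one batch per round trip. (In the Redshift case above, the server additionally materializes the cursor's full result set on the leader node, which the toy does not model.)

```python
class ClientSideCursor:
    """Toy unnamed cursor: the whole result set lands in client memory."""
    def __init__(self, rows):
        self._rows = rows
        self.buffer = []

    def execute(self):
        self.buffer = list(self._rows)   # all rows buffered here at once

    def fetchmany(self, n):
        # fetchmany only pages through the already-full local buffer
        out, self.buffer = self.buffer[:n], self.buffer[n:]
        return out


class ServerSideCursor:
    """Toy named cursor: only `itersize` rows cross the wire per round trip."""
    def __init__(self, rows, itersize):
        self._rows = iter(rows)
        self.itersize = itersize

    def fetch_batch(self):
        batch = []
        for _ in range(self.itersize):
            try:
                batch.append(next(self._rows))
            except StopIteration:
                break
        return batch
```

With the client-side cursor, memory peaks at the full table size the moment execute() returns; with the server-side one, the client never holds more than one batch.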
@Rustem I'm not interested in Java code. Can you demonstrate Python code that would not trigger the issue you're having? If not, it's not an issue in sqlalchemy-redshift.
@graingert then maybe you can show me working code that executes a cursor over huge tables. I think I will raise the issue on psycopg2: https://github.com/psycopg/psycopg2/issues/new. Then what is the real intention of building this lib?
then maybe you can show me working code that executes a cursor over huge tables.
This project has very little to do with raw psycopg2 interaction. It's just used to generate the SQL from generic(-ish) sqlalchemy expressions.
closing in favour of psycopg/psycopg2#553