Comments (23)

jklukas commented on July 22, 2024

From this description, it sounds like the issue is at the psycopg2 layer rather than with sqlalchemy-redshift. If you can configure psycopg2 to have reasonable behavior for your case, we can likely figure out how to expose that through sqlalchemy, but if it's a psycopg2 limitation, that would need to be addressed upstream.
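
For reference, psycopg2's server-side (named) cursor behaviour is already reachable through SQLAlchemy in two ways; a minimal sketch, with the URL and table name as placeholders:

import sqlalchemy as sa

# Engine-wide: every SELECT on this engine uses a psycopg2 named cursor.
engine = sa.create_engine(
    "redshift+psycopg2://user:****@host:5439/dbname",
    server_side_cursors=True)

with engine.connect() as conn:
    # Per-statement opt-in instead:
    result = conn.execution_options(stream_results=True).execute(
        sa.text("SELECT * FROM schema.table"))
    for row in result:
        pass  # rows arrive in chunks rather than all at once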

graingert commented on July 22, 2024

Sounds like issue #100 to me.

Rustem commented on July 22, 2024

@graingert This really seems like abnormal behaviour for the cursors. I know the psycopg2 driver is not native to Redshift, but cursors are still very important. I can't go to production with a Python app due to this bug.

graingert commented on July 22, 2024

Rustem commented on July 22, 2024

I will answer ASAP, thanks.

Rustem commented on July 22, 2024

@graingert For the Redshift connection I tried both:

1. Raw SQL cursor commands:

BEGIN;
DECLARE curname CURSOR FOR
    SELECT * FROM schema.table;
FETCH FORWARD 10 FROM curname;
FETCH NEXT FROM curname;
CLOSE curname;
COMMIT;

2. The server-side (named) cursor that psycopg2 advises:

import psycopg2

command_text = """
    SELECT * FROM schema.table;
"""
conn = psycopg2.connect(host="HOST", dbname="dbname", password="****",
                        user="user", port="5439")
cursor = conn.cursor('curname')  # naming the cursor makes it server-side
cursor.itersize = 100
cursor.execute(command_text)
for item in cursor:
    print(item)

They both allocate memory server-side before sending any response. So if my table contains a hundred million rows, the whole result is materialized before anything comes back, even if I set itersize up to 1000. That's really abnormal to me.

When I used a native JDBC connection, everything was fine.

graingert commented on July 22, 2024

@Rustem Sorry, it's not clear what you mean: when you use approach 2 (the named curname cursor), does it run without taking up all your memory?

How about when using .execution_options(stream_results=True)?

# Adapted from SQLAlchemy's examples/performance suite; Profiler, engine,
# and Customer are defined by that harness, not in this snippet.
@Profiler.profile
def test_core_fetchmany_w_streaming(n):
    """Load Core result rows using fetchmany/streaming."""

    with engine.connect() as conn:
        result = conn.execution_options(stream_results=True).\
            execute(Customer.__table__.select().limit(n))
        while True:
            chunk = result.fetchmany(10000)
            if not chunk:
                break
            for row in chunk:
                data = row['id'], row['name'], row['description']

Rustem commented on July 22, 2024

I think I tried this one; I will try it again and share my results here.

graingert commented on July 22, 2024

@Rustem When you use approach 2 (psycopg2 with the named cursor), does it run without taking up all your memory?

Rustem commented on July 22, 2024

It works for Postgres, @graingert. I am trying it with Redshift now.

Rustem commented on July 22, 2024

I am looking through the dialect code and I see that the execution option called stream_results just reuses the psycopg2 named cursor. However, the way it should work is to use the native cursor SQL commands. It seems psycopg2 targets https://www.postgresql.org/docs/9.2/static/plpgsql-cursors.html, but Redshift cursors have a different syntax for operating on them. As soon as I profile it on Redshift I will update my comment. There are two options now: either extend the dialect, or keep it as it is if it works.
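
As far as I can tell, psycopg2's named cursor is implemented with plain SQL-level DECLARE/FETCH, which Redshift accepts as well; a rough sketch of the statements it issues (statement text approximate, names illustrative):

import psycopg2

conn = psycopg2.connect(host="HOST", dbname="dbname", user="user",
                        password="****", port="5439")
cur = conn.cursor()  # plain cursor used to drive the cursor SQL by hand

# Roughly what conn.cursor('curname') with itersize = 100 does under the hood:
cur.execute('DECLARE "curname" CURSOR WITHOUT HOLD FOR SELECT * FROM schema.table')
while True:
    cur.execute('FETCH FORWARD 100 FROM "curname"')
    rows = cur.fetchall()
    if not rows:
        break
cur.execute('CLOSE "curname"')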

Rustem commented on July 22, 2024

BTW, why don't you try something like the Redshift ODBC driver? http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-windows.html

Rustem commented on July 22, 2024

So it does not work for now: with 600 million records it takes too long. This picture from the AWS console covers several minutes, and nothing gets updated.

(screenshot: profiling_redshift_1)

Eventually it raised a server-side exception:

Traceback (most recent call last):
  File "redshift_amazon.py", line 16, in <module>
    result = conn.execution_options(stream_results=True).execute(command_text)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 906, in execute
    return self._execute_text(object, multiparams, params)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 1054, in _execute_text
    statement, parameters
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
    result = context._setup_crud_result_proxy()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/default.py", line 822, in _setup_crud_result_proxy
    result = self.get_result_proxy()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 472, in get_result_proxy
    return _result.BufferedRowResultProxy(self)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 497, in __init__
    self._init_metadata()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1106, in _init_metadata
    self.__buffer_rows()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1128, in __buffer_rows
    self.__rowbuffer = collections.deque(self.cursor.fetchmany(size))
psycopg2.InternalError: exceeded the maximum size allowed for the total set of cursor data: 16000MB.

I used the following code to test:

import sqlalchemy as sa

command_text = """
    SELECT * FROM lineorder
"""
url = "redshift+psycopg2://host:5439/dbname"
engine = sa.create_engine(
    url,
    poolclass=sa.pool.QueuePool,
    max_overflow=10,
    pool_size=5,
    pool_recycle=10000,
    pool_timeout=300)

with engine.connect() as conn:
    result = conn.execution_options(stream_results=True).execute(command_text)
    print("connected")
    while True:
        chunk = result.fetchmany(50)
        if not chunk:
            break
        for row in chunk:
            # data = row['p_partkey'], row['p_name']
            print(row)

graingert commented on July 22, 2024

@Rustem When you use approach 2 (psycopg2 with the named cursor), does it run without taking up all your memory?

Rustem commented on July 22, 2024

@graingert No, same situation: it allocates memory on the server.

graingert commented on July 22, 2024

memory on the server

On the server or the client?

Rustem avatar Rustem commented on July 22, 2024

@graingert On the server.

Rustem commented on July 22, 2024

It will raise client-side if you use a basic (unnamed) cursor.
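
In other words, with a basic unnamed cursor psycopg2 transfers and buffers the whole result set on the client at execute() time, so the failure just moves sides; a minimal sketch (connection details as above):

import psycopg2

conn = psycopg2.connect(host="HOST", dbname="dbname", user="user",
                        password="****", port="5439")
cur = conn.cursor()                     # unnamed = client-side cursor
cur.execute("SELECT * FROM lineorder")  # entire result set is transferred and
                                        # buffered in client memory right here
row = cur.fetchone()                    # fetch* only reads from that buffer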

Rustem commented on July 22, 2024

However, when I test with a JDBC connection (pure Java code), everything works fine, and I observe that it does not use cursors; it just streams the result. According to the server query log, I see only SELECT * FROM lineorder, but the console starts feeding data immediately.

graingert commented on July 22, 2024

@Rustem I'm not interested in Java code. Can you demonstrate Python code that does not trigger the issue you're having? If not, it's not an issue in sqlalchemy-redshift.

Rustem commented on July 22, 2024

@graingert Then maybe you can show me working code that can run a cursor over huge tables. I think I will raise the issue on psycopg2: https://github.com/psycopg/psycopg2/issues/new. Then what is the real intention of building this lib?

graingert commented on July 22, 2024

Then maybe you can show me working code that can run a cursor over huge tables.

This project has very little to do with raw psycopg2 interaction. It's just used to generate the SQL from generic(-ish) sqlalchemy expressions.
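
One cursor-free pattern that does stay in Python is keyset pagination over a sort column; a sketch, where lineorder and lo_orderkey are illustrative names and lo_orderkey is assumed unique:

import psycopg2

# Fetch fixed-size pages ordered by a unique key, so neither a server-side
# cursor nor leader-node materialization is needed.
conn = psycopg2.connect(host="HOST", dbname="dbname", user="user",
                        password="****", port="5439")
cur = conn.cursor()
last_key, page_size = 0, 10000
while True:
    cur.execute(
        "SELECT * FROM lineorder WHERE lo_orderkey > %s "
        "ORDER BY lo_orderkey LIMIT %s",
        (last_key, page_size))
    rows = cur.fetchall()
    if not rows:
        break
    last_key = rows[-1][0]  # assumes lo_orderkey is the first column
    for row in rows:
        pass  # process row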

graingert commented on July 22, 2024

Closing in favour of psycopg/psycopg2#553.
