Comments (23)
From this description, it sounds like the issue is at the psycopg2 layer rather than with sqlalchemy-redshift. If you can configure psycopg2 to have reasonable behavior for your case, we can likely figure out how to expose that through sqlalchemy, but if it's a psycopg2 limitation, that would need to be addressed upstream.
from sqlalchemy-redshift.
Sounds like issue #100 to me.
@graingert this really seems like abnormal behaviour for cursors. I know the psycopg driver isn't native to Redshift, but cursors are still very important. I can't go to production with a Python app because of this bug.
I will answer ASAP, thanks.
@graingert for the Redshift connection I tried both:
- raw SQL commands
begin;
DECLARE curname cursor FOR
SELECT * FROM schema.table;
fetch forward 10 from curname;
fetch next from curname;
close curname;
commit;
- the server-side cursors that psycopg2 advises
import psycopg2

command_text = """
SELECT * FROM schema.table;
"""
conn = psycopg2.connect(host="HOST", dbname="dbname", password="****", user="user", port="5439")
cursor = conn.cursor('curname')  # named => server-side cursor
cursor.itersize = 100
cursor.execute(command_text)
for item in cursor:
    print(item)
Both allocate memory server-side before sending the actual response. So if my table contains a hundred million rows, it materializes the whole result before returning anything, even if I set itersize up to 1000. That's really abnormal to me.
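For reference, the raw SQL approach above can be wrapped in Python so psycopg2 sends explicit DECLARE/FETCH statements instead of using a named cursor. This is a minimal sketch, not tested against Redshift; the function names and defaults are my own, and only the statement-building part runs without a server. (As the later comments show, Redshift still materializes the cursor result set on the leader node, so this changes where the limit bites rather than removing it.)

```python
def cursor_statements(curname, query, batch_size):
    """Yield the raw SQL needed to page through `query` via an explicit cursor."""
    yield "BEGIN;"
    yield "DECLARE {0} CURSOR FOR {1};".format(curname, query)
    yield "FETCH FORWARD {0} FROM {1};".format(batch_size, curname)  # repeat until empty
    yield "CLOSE {0};".format(curname)
    yield "COMMIT;"


def iter_rows(conn, query, batch_size=1000, curname="curname"):
    """Yield rows in batches through an explicit cursor on a psycopg2 connection."""
    cur = conn.cursor()  # plain (unnamed) cursor; we drive DECLARE/FETCH ourselves
    cur.execute("BEGIN;")
    cur.execute("DECLARE {0} CURSOR FOR {1}".format(curname, query))
    try:
        while True:
            cur.execute("FETCH FORWARD {0} FROM {1}".format(batch_size, curname))
            rows = cur.fetchall()  # only this batch is held client-side
            if not rows:
                break
            for row in rows:
                yield row
    finally:
        cur.execute("CLOSE {0}".format(curname))
        cur.execute("COMMIT;")
        cur.close()
```

Usage would be `for row in iter_rows(conn, "SELECT * FROM schema.table"): ...` with `conn` from `psycopg2.connect(...)`.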
When I used native JDBC connection, then evth was good.
@Rustem sorry, it's not clear what you mean. When you use part 2 (using curname), does it run without taking up all your memory?
How about when using .execution_options(stream_results=True)?
@Profiler.profile
def test_core_fetchmany_w_streaming(n):
    """Load Core result rows using fetchmany/streaming."""
    with engine.connect() as conn:
        result = conn.execution_options(stream_results=True).\
            execute(Customer.__table__.select().limit(n))
        while True:
            chunk = result.fetchmany(10000)
            if not chunk:
                break
            for row in chunk:
                data = row['id'], row['name'], row['description']
I think I tried this one; I will try it again and share my results here.
@Rustem when you use part 2 (psycopg2, using curname) does it run without taking up all your memory?
It works for Postgres, @graingert. I am trying it with Redshift now.
I am looking through the dialect code and see that the execution option called stream_results
just reuses the psycopg2 cursor. However, the way it should work is to use the native cursor SQL commands. It seems psycopg2 follows https://www.postgresql.org/docs/9.2/static/plpgsql-cursors.html, but Redshift cursors have a different syntax for operating on them. As soon as I profile this on Redshift I will update my comment. There are two options now: either extend the dialect, or keep it as is if it works.
btw, why don't you try something like redshift-odbc? http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-windows.html
So it does not work for now; with 600 million records it takes too long.
The AWS console shows the query running for minutes with nothing updated.
Eventually it raised an exception server-side:
Traceback (most recent call last):
  File "redshift_amazon.py", line 16, in <module>
    result = conn.execution_options(stream_results=True).execute(command_text)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 906, in execute
    return self._execute_text(object, multiparams, params)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 1054, in _execute_text
    statement, parameters
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
    result = context._setup_crud_result_proxy()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/default.py", line 822, in _setup_crud_result_proxy
    result = self.get_result_proxy()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 472, in get_result_proxy
    return _result.BufferedRowResultProxy(self)
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 497, in __init__
    self._init_metadata()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1106, in _init_metadata
    self.__buffer_rows()
  File "/Users/rustem/Sites/toptal/SparkETL/env/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1128, in __buffer_rows
    self.__rowbuffer = collections.deque(self.cursor.fetchmany(size))
psycopg2.InternalError: exceeded the maximum size allowed for the total set of cursor data: 16000MB.
I used the following code to test:
import sqlalchemy as sa

command_text = """
SELECT * from lineorder
"""
url = 'redshift+psycopg2://host:5439/dbname'
engine = sa.create_engine(
    url,
    poolclass=sa.pool.QueuePool,
    max_overflow=10,
    pool_size=int(5),
    pool_recycle=10000,
    pool_timeout=300)

with engine.connect() as conn:
    result = conn.execution_options(stream_results=True).execute(command_text)
    print("connected")
    while True:
        chunk = result.fetchmany(50)
        if not chunk:
            break
        for row in chunk:
            # data = row['p_partkey'], row['p_name']
            print(row)
@Rustem when you use part 2 (psycopg2, using curname) does it run without taking up all your memory?
@graingert no, same situation. It allocates memory on the server.
memory on server
on the server or the client?
@graingert on the server
It will raise client-side if you use a basic cursor.
However, when I test with a JDBC connection (pure Java code), everything works fine, and I observe that it does not use cursors at all; it just streams the result. According to the server query log, I see only SELECT * FROM lineorder, yet the console feeds data immediately.
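To make the client-side vs server-side distinction in this exchange concrete, here is a toy model, not real psycopg2 and with all names invented: a plain cursor materializes every row on the client at execute() time, while a server-side cursor pulls only one batch per round trip. (In the Redshift case above, the server additionally materializes the cursor's full result set on the leader node, which the toy does not model.)

```python
class ClientSideCursor:
    """Toy unnamed cursor: the whole result set lands in client memory."""
    def __init__(self, rows):
        self._rows = rows
        self.buffer = []

    def execute(self):
        self.buffer = list(self._rows)   # all rows buffered here at once

    def fetchmany(self, n):
        # fetchmany only pages through the already-full local buffer
        out, self.buffer = self.buffer[:n], self.buffer[n:]
        return out


class ServerSideCursor:
    """Toy named cursor: only `itersize` rows cross the wire per round trip."""
    def __init__(self, rows, itersize):
        self._rows = iter(rows)
        self.itersize = itersize

    def fetch_batch(self):
        batch = []
        for _ in range(self.itersize):
            try:
                batch.append(next(self._rows))
            except StopIteration:
                break
        return batch
```

With the client-side cursor, memory peaks at the full table size the moment execute() returns; with the server-side one, the client never holds more than one batch.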
@Rustem I'm not interested in Java code. Can you demonstrate Python code that would not trigger the issue you're having? If not, it's not an issue in sqlalchemy-redshift.
@graingert then maybe you can show me working code that executes a cursor over huge tables. I think I will raise the issue on psycopg2: https://github.com/psycopg/psycopg2/issues/new. Then what is the real intention of building this lib?
then maybe you can show me working code that executes a cursor over huge tables.
This project has very little to do with raw psycopg2 interaction. It's just used to generate the SQL from generic(-ish) sqlalchemy expressions.
closing in favour of psycopg/psycopg2#553