
sqlalchemy-redshift's Introduction

sqlalchemy-redshift

Amazon Redshift dialect for SQLAlchemy.

Installation

The package is available on PyPI:

pip install sqlalchemy-redshift

Warning

This dialect requires either redshift_connector or psycopg2 to work properly. It does not declare either as a hard requirement; instead, it relies on you to select the distribution you need:

  • psycopg2 - standard distribution of psycopg2; requires compilation, so a few system dependencies are needed to build it
  • psycopg2-binary - already compiled distribution (no system dependencies are required)
  • psycopg2cffi - pypy compatible version

See Psycopg2's binary install docs for more context on choosing a distribution.

Usage

The DSN format is similar to that of regular Postgres:

>>> import sqlalchemy as sa
>>> sa.create_engine('redshift+psycopg2://username@examplecluster.redshift.amazonaws.com:5439/database')
Engine(redshift+psycopg2://username@examplecluster.redshift.amazonaws.com:5439/database)

See the RedshiftDDLCompiler documentation for details on Redshift-specific features the dialect supports.
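
For illustration, a table definition along the lines that documentation describes might look like the sketch below. The redshift_* table arguments and the encode/identity info keys come from the RedshiftDDLCompiler docs; check that they apply to the version you have installed before relying on them.

import sqlalchemy as sa

metadata = sa.MetaData()

# Column-level encodings and IDENTITY are declared through each column's
# info dict; distribution and sort keys are dialect-specific table arguments.
user = sa.Table(
    'user', metadata,
    sa.Column('id', sa.Integer, primary_key=True, info={'identity': (0, 1)}),
    sa.Column('name', sa.String(128), info={'encode': 'lzo'}),
    redshift_diststyle='KEY',
    redshift_distkey='id',
    redshift_sortkey='id',
)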

Running Tests

Tests are run via tox and can be invoked with the following command:

$ tox

However, this will not run integration tests unless the following environment variables are set:

  • REDSHIFT_HOST
  • REDSHIFT_PORT
  • REDSHIFT_USERNAME
  • PGPASSWORD (this is the Redshift instance password)
  • REDSHIFT_DATABASE
  • REDSHIFT_IAM_ROLE_ARN

Note that the IAM role specified will need to be associated with the Redshift cluster and have the correct permissions to create and drop databases and tables. Exporting these environment variables in your shell and running tox will run the integration tests against a real Redshift instance. Exercise caution when running these tests against a production instance.

Continuous Integration (CI)

Project CI is built using AWS CodePipeline and CloudFormation. Please see the ci/ folder and included README.txt for details on how to spin up the project's CI.

Releasing

To perform a release, you will need to be an admin for the project on GitHub and on PyPI. Contact the maintainers if you need that access.

You will need to have a ~/.pypirc with your PyPI credentials and also the following settings:

[zest.releaser]
create-wheels = yes

To perform a release, run the following:

python -m venv ~/.virtualenvs/dist
workon dist
pip install -U pip setuptools wheel
pip install -U tox zest.releaser
fullrelease  # follow prompts, use semver-ish versions.

The releaser will handle updating version data on the package and in CHANGES.rst along with tagging the repo and uploading to PyPI.

sqlalchemy-redshift's People

Contributors

bbayles, binarymatt, bouk, brooke-white, cpcloud, dargueta, djgoku, eeshugerman, ewengillies, f0rk, gmcrocetti, graingert, grayan, haleemur, jasonamyers, jcampbell, jkeelan, jklukas, jleclanche, ls-aron-kyle, matthewgdv, mbrunthaler, mheilman, mjschultz, mpatek, ms32035, oyiptong, solackerman, tenzer, tirkarthi

sqlalchemy-redshift's Issues

Add a CONTRIBUTING.md

For most PRs, we require tests, documentation, and CHANGES.rst notes. Sometimes we ask for these upfront, but sometimes we forget until later in the review, and it delays merging good changes.

We should provide a CONTRIBUTING.md both as a checklist for those doing review to make sure they aren't forgetting things, and for contributors so that they don't feel like requirements are popping up out of nowhere.

cc @graingert

how to use CopyCommand

I have a COPY command that works in SQL Workbench/J. Mirroring it, I created a CopyCommand object. What do I do next to run the copy command against Redshift?

I see that there's a
sqlalchemy_redshift.commands.visit_copy_command(element, compiler, **kw)
method. Looks like I'll have element = my_copy_command_object; is there more documentation on the compiler argument?

It'd be super if there's an end-to-end example somewhere, thanks a lot.
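
For what it's worth, you normally don't call visit_copy_command yourself; the compiler dispatches to it when the CopyCommand element is executed on a connection. A minimal sketch, with placeholder table, bucket, and credential values:

import sqlalchemy as sa
from sqlalchemy_redshift.commands import CopyCommand

engine = sa.create_engine('redshift+psycopg2://username:password@examplecluster:5439/database')
metadata = sa.MetaData()
target = sa.Table('my_table', metadata, sa.Column('id', sa.Integer))

copy = CopyCommand(
    to=target,
    data_location='s3://my-bucket/my-prefix/',
    access_key_id='...',
    secret_access_key='...',
    format='CSV',
)

with engine.connect() as conn:
    conn.execute(copy)  # the dialect compiles this via visit_copy_command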

Can't generate migrations with alembic against redshift cluster

Pretty long traceback, due to us wrapping alembic in a click-based command:

Traceback (most recent call last):
  File "/home/gsliwinski/.virtualenvs/x/bin/dw", line 11, in <module>
    load_entry_point('dw', 'console_scripts', 'dw')()
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/gsliwinski/dw/src/dw/scripts/decorators.py", line 11, in call_with_app_context
    return cmd(*args, **kwargs)
  File "/home/gsliwinski/dw/src/dw/scripts/alembic_migrations.py", line 137, in autogenerate
    alembic_ctx = command.revision(alembic_config(click_ctx.obj['DB']), autogenerate=True, message=message)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/command.py", line 176, in revision
    script_directory.run_env()
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/script/base.py", line 425, in run_env
    util.load_python_file(self.dir, 'env.py')
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/util/pyfiles.py", line 93, in load_python_file
    module = load_module_py(module_id, path)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/util/compat.py", line 75, in load_module_py
    mod = imp.load_source(module_id, path, fp)
  File "alembic/env.py", line 88, in <module>
    run_migrations_online()
  File "alembic/env.py", line 82, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/runtime/environment.py", line 836, in run_migrations
    self.get_context().run_migrations(**kw)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/runtime/migration.py", line 321, in run_migrations
    for step in self._migrations_fn(heads, self):
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/command.py", line 156, in retrieve_migrations
    revision_context.run_autogenerate(rev, context)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/api.py", line 415, in run_autogenerate
    self._run_environment(rev, migration_context, True)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/api.py", line 451, in _run_environment
    autogen_context, migration_script)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/compare.py", line 22, in _populate_migration_script
    _produce_net_changes(autogen_context, upgrade_ops)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/compare.py", line 38, in _produce_net_changes
    schemas = set(inspector.get_schema_names())
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 158, in get_schema_names
    info_cache=self.info_cache)
  File "<string>", line 2, in get_schema_names
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 54, in cache
    ret = fn(self, con, *args, **kw)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/dialects/postgresql/base.py", line 2349, in get_schema_names
    ).columns(nspname=sqltypes.Unicode))
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1405, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
TypeError: 'dict' object does not support indexing

redshift-ca-bundle.crt is out of date

After Amazon issued a new SSL root certificate for Redshift last week, I started getting connection failures with "sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL error: certificate verify failed".

I followed the instructions at http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-transitioning-to-acm-certs.html and copied the new https://s3.amazonaws.com/redshift-downloads/redshift-ca-bundle.crt to ~/.postgresql/root.crt , which is the default location according to postgresql docs.

However, for the client to actually check the certificate, I had to provide the file path:

redshift+psycopg2://username:password@mycluster.redshift.amazonaws.com:5439/mydb?sslmode=verify-ca&sslrootcert=/home/mydir/.postgresql/root.crt

I guess this is because the module provides its own root certificate file, which is now out of date:

def create_connect_args(self, *args, **kwargs):
        """
        Build DB-API compatible connection arguments.

        Overrides interface
        :meth:`~sqlalchemy.engine.interfaces.Dialect.create_connect_args`.
        """
        default_args = {
            'sslmode': 'verify-full',
            'sslrootcert': pkg_resources.resource_filename(
                __name__,
                'redshift-ca-bundle.crt'
            ),
        }
        cargs, cparams = super(RedshiftDialect, self).create_connect_args(
            *args, **kwargs
        )
        default_args.update(cparams)
        return cargs, default_args
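
If the bundled certificate is stale, the SSL parameters can be overridden per engine. Since connect_args passed to create_engine are applied after the dialect's defaults, something like the following (hypothetical host and path) should take precedence:

import sqlalchemy as sa

engine = sa.create_engine(
    'redshift+psycopg2://username:password@mycluster.redshift.amazonaws.com:5439/mydb',
    connect_args={
        'sslmode': 'verify-ca',
        'sslrootcert': '/home/mydir/.postgresql/root.crt',
    },
)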

Cannot connect to Redshift due to hostname error

(psycopg2.OperationalError) server certificate for "ec2-[IP address].compute-1.amazonaws.com" does not match host name "[Hostname].redshift.amazonaws.com"

The same connection URI was fine until last Friday. I've checked that both hostnames correspond to the same IP address. If I set 'sslmode': 'prefer' in dialect.py, the connection is successful.

According to https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html,

SSL support in Amazon Redshift is strictly for encrypting the connection between your client and your cluster; it should not be relied on for authenticating the server. To authenticate the server, install the public key (.pem file) for the SSL certificate on your client and use the key to connect to your clusters.

Perhaps sslmode: require should be the default?

RelationKey.unquote only works when schema is None

Hi all,

In our database, we have a relation named quote sitting inside a schema called platform.

Because the word quote is a Redshift keyword, the string representation of a dialect.RelationKey corresponding to this relation is platform."quote". This is all good.

The class RelationKey has an unquote method, which returns its string representation with any surrounding "" quotes removed. Unfortunately, this method removes the quotes correctly only when the schema is None. In our case, the code is used with a non-None schema (equal to platform). What we see is that unquote erroneously does not remove the quotes, since it only triggers when str(RelationKey) both starts and ends with quotes.

I think a more comprehensive behaviour for the unquote method would be to apply unquoting to both the schema name AND the relation name, so that unquoting platform."quote" yields platform.quote.

This bug prevents sqlalchemy from reflecting our platform schema, ending with:

File "/home/vlasisva/.local/lib/python3.5/site-packages/sqlalchemy_redshift/dialect.py", line 630, in _get_redshift_relation
raise sa.exc.NoSuchTableError(key)
sqlalchemy.exc.NoSuchTableError: platform."quote"

Happy to elaborate, if above is not clear.
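
A rough sketch of the behaviour proposed above (not the library's code): strip the surrounding double quotes from each dotted part independently, so that platform."quote" unquotes to platform.quote.

def unquote_relation_key(key):
    # Remove surrounding double quotes from both schema and relation names.
    def strip_quotes(part):
        if part.startswith('"') and part.endswith('"'):
            return part[1:-1]
        return part
    return '.'.join(strip_quotes(part) for part in key.split('.'))

assert unquote_relation_key('platform."quote"') == 'platform.quote'
assert unquote_relation_key('"quote"') == 'quote'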

Model instance overwritten after commit.

Hi, I'm using the latest version 7691341 of the repo with sa v. 1.0.12. I'm noticing some strange behavior if I define a model with an identity column. Here's an example:

import os

import sqlalchemy as sa
from sqlalchemy.ext import declarative

Base = declarative.declarative_base()


class IdentityExample(Base):
    __tablename__ = 'identity_example'

    id = sa.Column(sa.Integer, primary_key=True, info={'identity': (0, 1)})
    name = sa.Column(sa.String)


engine = sa.create_engine(os.environ['REDSHIFT_URL'])
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
RedshiftSession = sa.orm.sessionmaker()
session = RedshiftSession(bind=engine)

example_1 = IdentityExample(name='Example #1')
session.add(example_1)
print("EXAMPLE 1 BEFORE: ", example_1.id, example_1.name)
session.commit()
print("EXAMPLE 1 AFTER: ", example_1.id, example_1.name)

example_2 = IdentityExample(name='Example #2')
session.add(example_2)
print("EXAMPLE 2 BEFORE: ", example_2.id, example_2.name)
session.commit()
print("EXAMPLE 2 AFTER: ", example_2.id, example_2.name)

I'd expect to see output like this:

EXAMPLE 1 BEFORE:  None Example #1
EXAMPLE 1 AFTER:  0 Example #1
EXAMPLE 2 BEFORE:  None Example #2
EXAMPLE 2 AFTER:  1 Example #2

Instead I see this:

EXAMPLE 1 BEFORE:  None Example #1
EXAMPLE 1 AFTER:  0 Example #1
EXAMPLE 2 BEFORE:  None Example #2
EXAMPLE 2 AFTER:  0 Example #1

Am I making some mistake, or is this a bug? Thanks!

Constraint information is not being cached

In RedshiftDialect, there are three methods, _get_all_relation_info, _get_all_column_info, and _get_all_constraint_info, which query information about the whole cluster. These methods are called by several other methods which pass their info_cache argument down to one of the above methods. These methods, in turn, are called by other methods which, in most cases, pass down all of their keyword arguments. The one exception is _get_redshift_constraints - this method is never called with the keyword arguments of its caller, and therefore never gets the info_cache object that was passed in by SQLAlchemy. The result of this is that constraint information is never cached. When running something like alembic which does a lot of reflection, this can slow things down considerably because the dialect winds up querying information about the entire database every time alembic asks for anything.
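
A hypothetical, simplified illustration of the caching pattern described above (not the dialect's actual code); run_expensive_pg_catalog_query is a stand-in for the cluster-wide query:

def run_expensive_pg_catalog_query(connection):
    # Stand-in for the cluster-wide constraint query.
    return 'constraint info for the whole cluster'

def _get_all_constraint_info(connection, info_cache=None, **kw):
    if info_cache is not None and 'constraints' in info_cache:
        return info_cache['constraints']  # cache hit, no query issued
    result = run_expensive_pg_catalog_query(connection)
    if info_cache is not None:
        info_cache['constraints'] = result
    return result

def get_pk_constraint(connection, table_name, schema=None, **kw):
    # Forwarding **kw keeps info_cache flowing down. Dropping it, as the issue
    # describes for _get_redshift_constraints, defeats the cache and re-runs
    # the cluster-wide query for every reflected table.
    return _get_all_constraint_info(connection, **kw)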

autoincrement=True Column() kwarg does not work

# SQLAlchemy==1.0.10, sqlalchemy-redshift==0.4.0, psycopg2==2.6
import sqlalchemy as sa
from sqlalchemy.schema import CreateTable

engine = sa.create_engine('redshift+psycopg2://example')
user = sa.Table(
    'user', 
    sa.MetaData(), 
    # http://docs.sqlalchemy.org/en/latest/core/metadata.html#sqlalchemy.schema.Column.params.autoincrement
    sa.Column('id', sa.Integer, primary_key=True, autoincrement=True),
)

col = user.columns.get('id')
print(col.expression.autoincrement)
# >>> True

print(CreateTable(user).compile(engine))
# CREATE TABLE user (
#         id INTEGER NOT NULL
#         PRIMARY KEY (id)
# )

The ddl_compiler.get_column_default_string() method comes back None. I'm unable to figure out whether this is a lack of documentation and there is an easy solution or a deeper problem with the DDL for redshift.

Thanks in advance!
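
One thing worth trying, assuming the dialect's IDENTITY support applies to the installed version: declare the seed/step through the column's info dict instead of autoincrement=True, e.g.:

import sqlalchemy as sa
from sqlalchemy.schema import CreateTable

user = sa.Table(
    'user',
    sa.MetaData(),
    sa.Column('id', sa.Integer, primary_key=True, info={'identity': (0, 1)}),
)

engine = sa.create_engine('redshift+psycopg2://example')
print(CreateTable(user).compile(engine))
# expected to emit something like: id INTEGER IDENTITY(0,1) NOT NULL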

Cut a release

The last release was in April, and there have been a whole mass of patches made since then.

Migration plan for moving to the sqlalchemy-redshift org

There are some details we'll need to deal with when this project moves to the https://github.com/sqlalchemy-redshift org.

  • Documentation
    • Make sure we're using sqlalchemy-redshift and sqlalchemy_redshift consistently everywhere
    • Make sure any references to the repo are correct
  • Read the Docs
    • Change the url of docs to sqlalchemy-redshift (it's currently inverted)
    • Set up a webhook on the repo so docs will build automatically.
  • PyPI

I wanted to make sure to document thoughts as I have them. Are there other things we'll need to keep in mind?

NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:redshift.psycopg2

I am trying to connect to Redshift from my Python code. My installed pip packages:

psycopg2==2.6.1
redshift-sqlalchemy==0.4.1
SQLAlchemy==1.0.9
and my virtual machine (Ubuntu) has:

libpq-dev
python-psycopg2
But I am still getting

engine = create_engine('redshift+psycopg2://{}:{}@{}'.format(username, password, url))
  File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/engine/__init__.py", line 386, in create_engine
    return strategy.create(*args, **kwargs)
  File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 51, in create
    entrypoint = u._get_entrypoint()
  File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/engine/url.py", line 131, in _get_entrypoint
    cls = registry.load(name)
  File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 205, in load
    (self.group, name))
NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:redshift.psycopg2
I tried SQLAlchemy version 0.8.0, but it doesn't work either.
With the same config I am able to run this from my laptop (macOS), but not on Linux, so I guess some packages are still missing? Any suggestion will be appreciated, thanks!
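
If the entry point isn't being picked up on the Linux box, one workaround (assuming the package itself imports there) is to register the dialect explicitly, reusing the username/password/url variables from the snippet above. Note the module path differs between the old redshift-sqlalchemy package and the newer sqlalchemy-redshift one:

import sqlalchemy as sa
from sqlalchemy.dialects import registry

# For redshift-sqlalchemy 0.4.x the dialect lives in redshift_sqlalchemy.dialect;
# newer sqlalchemy-redshift releases use sqlalchemy_redshift.dialect instead.
registry.register('redshift.psycopg2', 'redshift_sqlalchemy.dialect', 'RedshiftDialect')

engine = sa.create_engine('redshift+psycopg2://{}:{}@{}'.format(username, password, url))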

Can't load plugin: sqlalchemy.dialects:psycopg2

Hey,

I just wanted to give it a spin. But I get a NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:psycopg2. I'm running Python 3.6.1 and installed SQLAlchemy 1.1.9 and sqlalchemy-redshift 0.6.0.

From there I just followed the docs and used my custom connection string to create the engine. However, I'm not sure if this is actually a bug in this project.

SQL_IDENTIFIER_RE issue

I have encountered table names that look like schema.table.name, and reflection is breaking because SQL_IDENTIFIER_RE.findall(key) returns a list of length 3 in dialect._get_schema_and_relation.

can't execute query when passing tuple parameter

I'm trying to execute a basic query and pass a tuple parameter into it, like:

engine = create_engine(app.config['REDSHIFT_DATABASE_URI'], echo=True)
conn = engine.connect()
s = text("select distinct t1.hh_id, t2.mri_id from msa t1 left outer join (select hh_id, mri_id from msa where mri_id in :code_param) t2 on t1.hh_id = t2.hh_id")
mriset = conn.execute(s, code_param = tuple(codes)).fetchall()

Here is an error:

2015-11-11 22:49:37,373 INFO sqlalchemy.engine.base.Engine select distinct t1.hh_id, t2.mri_id from msa t1 left outer join (select hh_id, mri_id from msa where mri_id in %(code_param)s) t2 on t1.hh_id = t2.hh_id
2015-11-11 22:49:37,374 INFO sqlalchemy.engine.base.Engine {'code_param': (-207426274, -205158201, -174737541)}
2015-11-11 22:49:37,681 INFO sqlalchemy.engine.base.Engine ROLLBACK

It works fine against a Postgres database, though.

Do you have any clue why?
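
One approach for IN clauses with textual SQL on SQLAlchemy 1.2+ is an expanding bind parameter; the sketch below mirrors the column and parameter names from the snippet above and reuses its engine and codes variables:

import sqlalchemy as sa
from sqlalchemy import text

s = text(
    "select distinct t1.hh_id, t2.mri_id from msa t1 "
    "left outer join (select hh_id, mri_id from msa "
    "where mri_id in :code_param) t2 on t1.hh_id = t2.hh_id"
).bindparams(sa.bindparam('code_param', expanding=True))

with engine.connect() as conn:
    mriset = conn.execute(s, {'code_param': list(codes)}).fetchall()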

SSL causing connection to fail

When I try to connect using redshift+psycopg2 I end up with this error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) sslmode value "verify-full" invalid when SSL support is not compiled in
What do I need to do to actually connect to my redshift instance? I'm able to connect fine using postgresql+psycopg2

Add support for AWS token/key credentials

sqlalchemy-redshift does not appear to support authenticating with an AWS token/secret key. I did find something that looks like it might in commands.py, but there are no docs describing how to supply the credentials if it does exist.

_get_column_info() missing 1 required positional argument: 'comment' in SQLAlchemy 1.2.0

column_info = super(RedshiftDialect, self)._get_column_info(

This breaks, because this was added:

http://docs.sqlalchemy.org/en/latest/changelog/changelog_12.html#change-3f38508932257c303017b1276f285c39

When it's called from get_columns():

column_info = self._get_column_info(
    name=col.name, format_type=col.format_type,
    default=col.default, notnull=col.notnull, domains=domains,
    enums=[], schema=col.schema, encode=col.encode)

Because comment is missing from there.

Problem with [waits] command for concurrent write queries

I'm trying to reproduce an example from the Redshift documentation using:
python 3.4
sqlalchemy-redshift==0.3.1
Here is an example from the Redshift docs (http://goo.gl/V2PuZh):

begin;
[waits]
copy listing from ;
end;

But it throws an exception:

ERROR - [RedShift] Programming Error: (psycopg2.ProgrammingError) syntax error at or near "["
LINE 1: begin; [waits] copy my_schema.my_table (userid, flag,...
               ^
[SQL: "begin; [waits] copy my_schema.my_table (userid, flag, campid, condition, ts) from 's3://mybucket/mydata' credentials 'aws_access_key_id=<my_aws_secret_key>;aws_secret_access_key=<my_aws_secret_access_key>' delimiter ','; end;"]

It works fine without the [waits] operator. Is there any workaround so I can use [waits]?

3rd reflection regression

As I mentioned a couple weeks ago, I got another regression, here's the minimal schema test case:

create table foo.domain(
id integer identity(1,1) not null unique);

create table foo.bar(
domain_id integer not null,
foreign key(domain_id) references foo.domain(id))

(I almost got it working last time, but then life intervened and I haven't touched this code since :( )

This is quite weird, since redshiftmeta.reflect(only=['domain'], schema='foo') works just fine, but redshiftmeta.reflect(only=['bar'], schema='foo') fails with

*** NoSuchTableError: foo."domain"

Also, the same setup with foo.domain2 works just fine... so I'd guess that there's some special-cased/hardcoded handling for things named domain, though grepping briefly through the code I don't find anything suspicious.

Unable to reflect non-default schemas

I'm new to SQLAlchemy, so apologies if this is obvious.

I'm not able to reflect the non-default schemas:

When I run:

 meta = sa.MetaData(schema="booker")
 meta.reflect(bind=engine)

I get

OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
 [SQL: '
        SELECT
          c.relkind,
          n.oid as "schema_oid",
          n.nspname as "schema",
          c.oid as "rel_oid",
          c.relname,
          CASE c.reldiststyle
            WHEN 0 THEN \'EVEN\' WHEN 1 THEN \'KEY\' WHEN 8 THEN \'ALL\' END
            AS "diststyle",
          c.relowner AS "owner_id",
          u.usename AS "owner_name",
          TRIM(TRAILING \';\' FROM pg_catalog.pg_get_viewdef(c.oid, true))
            AS "view_definition",
          pg_catalog.array_to_string(c.relacl, \'
\') AS "privileges"
        FROM pg_catalog.pg_class c
             LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
             JOIN pg_catalog.pg_user u ON u.usesysid = c.relowner
        WHERE c.relkind IN (\'r\', \'v\', \'m\', \'S\', \'f\')
          AND n.nspname !~ \'^pg_\'
        ORDER BY c.relkind, n.oid, n.nspname;
        ']

ps. This seems similar to apache/superset#217

Hosted documentation

It'd be swell to have some documentation, which for now could consist of answers to questions like "How is this different from the postgresql dialect?" rtfd.org + Sphinx would be my tools of choice here, but others may have thoughts on this.

column info doesn't register in alembic autogenerated migrations

class Example(Model):
    id = Column(types.Integer(), primary_key=True, info={'encode': 'lzo'})
    name = Column(types.Unicode(255), info={'encode': 'lzo'})

I wasn't sure if this was an alembic bug/setting or not, but given that identity columns and column encodings matter specifically for Redshift tables, I would have expected them to be included in the generated migrations.

Add a persistent reflection cache option

As discussed in #101, reflection on Redshift can be time-consuming because the per-query overhead is high and SQLAlchemy issues new queries for every table you need to reflect.

I'd love to find a way to "freeze" the schema cache and have it persist while you reflect multiple tables. We already do bulk queries to get info about all tables on reflection, but that info gets thrown away after each table is reflected.

One possibility is to introduce a RedshiftDialect._global_reflection_cache member that would normally be None and modify the reflection.cache decorator to check for it. We'd need to be very intentional about how this gets used.

It might be nice to wrap it up in a context manager. You could write something like:

with RedshiftDialect.caching_schema():
    # reflect a bunch of tables here, utilizing the persistent cache

# Reflection is back to normal now that we've left the caching_schema context

I've never used alembic. I wonder if there's some way we could hook this into alembic so that migrations could be faster.
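
A rough sketch of what the proposed context manager might look like; the _global_reflection_cache attribute and caching_schema name are the hypothetical API floated above, not existing code:

import contextlib

class RedshiftDialect:  # stand-in for the real dialect class
    _global_reflection_cache = None

    @classmethod
    @contextlib.contextmanager
    def caching_schema(cls):
        # A modified reflection.cache decorator would consult this dict
        # instead of the per-call info_cache while the block is active.
        cls._global_reflection_cache = {}
        try:
            yield
        finally:
            cls._global_reflection_cache = None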

Redshift does not support sequences

The dialect needs to specify that sequences are not supported. Otherwise sqlalchemy tries to reference a nonexistent sequence and you get an error such as the following:
(sqlalchemy.exc.ProgrammingError) (psycopg2.ProgrammingError) relation "tablename_model_id_seq" does not exist [SQL: 'select nextval(\'"tablename_model_id_seq"\')'] [SQL: u'INSERT INTO tablename (model_id, model_name) VALUES (%(model_id)s, %(model_name)s,)']

Redshift reserved words are not escaped

I got an error trying to create a table that had a column called "tag". The list of reserved words are here: http://docs.aws.amazon.com/redshift/latest/dg/r_pg_keywords.html

I fixed it in a very dirty way:

import sqlalchemy.dialects.postgresql.base as ps

REDSHIFT_RESERVED_WORDS = set("""aes128,aes256,all,allowoverwrite,analyse,
analyze,and,any,array,as,asc,authorization,backup,between,binary,blanksasnull,
both,bytedict,bzip2,case,cast,check,collate,column,constraint,create,
credentials,cross,current_date,current_time,current_timestamp,current_user,
current_user_id,default,deferrable,deflate,defrag,delta,delta32k,desc,disable,
distinct,do,else,emptyasnull,enable,encode,encrypt,encryption,end,except,
explicit,false,for,foreign,freeze,from,full,globaldict256,globaldict64k,grant,
group,gzip,having,identity,ignore,ilike,in,initially,inner,intersect,into,is,
isnull,join,leading,left,like,limit,localtime,localtimestamp,lun,luns,lzo,lzop,
minus,mostly13,mostly32,mostly8,natural,new,not,notnull,null,nulls,off,offline,
offset,oid,old,on,only,open,or,order,outer,overlaps,parallel,partition,percent,
permissions,placing,primary,raw,readratio,recover,references,respect,rejectlog,
resort,restore,right,select,session_user,similar,some,sysdate,system,table,tag,
tdes,text255,text32k,then,timestamp,to,top,trailing,true,truncatecolumns,union,
unique,user,using,verbose,wallet,when,where,with,without""".split(","))

ps.PGIdentifierPreparer.reserved_words = ps.PGIdentifierPreparer.reserved_words.union(REDSHIFT_RESERVED_WORDS)

It would be good if the Redshift dialect supported this out of the box.

Question: errors from queries involving pg_catalog tables running while things are being dropped

Hello! The impetus for the investigation that led to issue #101 is a class of errors that we've been seeing that seem to be due to concurrent access to pg_catalog tables. We have a test which makes sure that our alembic migrations are up to date. This test necessarily does a lot of reflection. We've been seeing this test fail intermittently with errors like these:

1

E       sqlalchemy.exc.InternalError: (psycopg2.InternalError) could not find tuple for constraint 839833
E        [SQL: '\n        SELECT\n          n.nspname as "schema",\n          c.relname as "table_name",\n          t.contype,\n          t.conname,\n          t.conkey,\n          a.attnum,\n          a.attname,\n          pg_catalog.pg_get_constraintdef(t.oid, true) as condef,\n          n.oid as "schema_oid",\n          c.oid as "rel_oid"\n        FROM pg_catalog.pg_class c\n        LEFT JOIN pg_catalog.pg_namespace n\n          ON n.oid = c.relnamespace\n        JOIN pg_catalog.pg_constraint t\n          ON t.conrelid = c.oid\n        JOIN pg_catalog.pg_attribute a\n          ON t.conrelid = a.attrelid AND a.attnum = ANY(t.conkey)\n        WHERE n.nspname !~ \'^pg_\'\n        ORDER BY n.nspname, c.relname\n        ']

2

E       sqlalchemy.exc.InternalError: (psycopg2.InternalError) cache lookup failed for relation 870981
E        [SQL: '\n            SELECT\n              n.nspname as "schema",\n              c.relname as "table_name",\n              d.column as "name",\n              encoding as "encode",\n              type, distkey, sortkey, "notnull", adsrc, attnum,\n              pg_catalog.format_type(att.atttypid, att.atttypmod),\n              pg_catalog.pg_get_expr(ad.adbin, ad.adrelid) AS DEFAULT,\n              n.oid as "schema_oid",\n              c.oid as "table_oid"\n            FROM pg_catalog.pg_class c\n            LEFT JOIN pg_catalog.pg_namespace n\n              ON n.oid = c.relnamespace\n            JOIN pg_catalog.pg_table_def d\n              ON (d.schemaname, d.tablename) = (n.nspname, c.relname)\n            JOIN pg_catalog.pg_attribute att\n              ON (att.attrelid, att.attname) = (c.oid, d.column)\n            LEFT JOIN pg_catalog.pg_attrdef ad\n              ON (att.attrelid, att.attnum) = (ad.adrelid, ad.adnum)\n            WHERE n.nspname !~ \'^pg_\'\n            ORDER BY n.nspname, c.relname, att.attnum\n            ']

3

E       sqlalchemy.exc.InternalError: (psycopg2.InternalError) cache lookup failed for relation 731237
E        [SQL: "\n            SELECT c.oid\n            FROM pg_catalog.pg_class c\n            LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace\n            WHERE (pg_catalog.pg_table_is_visible(c.oid))\n            AND c.relname = %(table_name)s AND c.relkind in ('r', 'v', 'm', 'f')\n        "] [parameters: {'table_name': 'ref_cm_icd9'}]

We think that these errors happen when one of the other schemas in our database is being dropped while one of these queries is being run. Since these queries grab information about the entire database, they are not scoped to schemas at all. We have also found this page in the redshift docs: http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html which seems to be saying that this might be expected behavior.

Error number 1 is the one that we've seen the most. We think this is because of issue #101 - because the constraint query was being run many times, there was more of a chance that it would be run at the wrong time. However, we just recently saw error number 2, which is the query that this dialect uses for getting column info. Error number 3 is actually a query from the regular postgres dialect, so this issue isn't specific to this dialect. We think that these sorts of errors happen specifically when functions are involved - some data is fetched, a function is run on some fetched id, and if the entity associated with that id disappears between those two steps, an error occurs.

I'm creating this issue more to ask if you guys have encountered something like this before and whether you've found any resolution. However, it may be useful for our purposes to restrict the _get_all_* queries so that they only query information about the requested schema. Since checking the schema name would not involve any functions, we think it would resolve our issue. I don't know whether that would be a good change to make in upstream as well.

Support external tables

With Spectrum, Amazon has added support for querying external data to Redshift. There are a few new features that make this work. It would be nice to support these for migrations/ddl and querying:

DDL:

Querying:

Reflection bug on SQLAlchemy 1.2

Hi there,

It looks like SQLAlchemy has changed a few method signatures in version >=1.2. The following statement works on SQLAlchemy <1.2 but not on SQLAlchemy >= 1.2:

table = sa.Table(table_name, meta, autoload=True, postgresql_ignore_search_path=True)

I get the following traceback with SQLAlchemy version 1.2.5:

  File "/home/colin/work/spectrify/spectrify/utils/schema.py", line 54, in get_table_schema
    table = sa.Table(table_name, meta, **table_kwargs)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 456, in __new__
    metadata._remove_table(name, schema)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 451, in __new__
    table._init(name, metadata, *args, **kw)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 533, in _init
    include_columns, _extend_on=_extend_on)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 559, in _autoload
    _extend_on=_extend_on
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2056, in run_callable
    return conn.run_callable(callable_, *args, **kwargs)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1545, in run_callable
    return callable_(self, *args, **kwargs)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 389, in reflecttable
    table, include_columns, exclude_columns, **opts)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 618, in reflecttable
    table_name, schema, **table.dialect_kwargs):
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 369, in get_columns
    **kw)
  File "<string>", line 2, in get_columns
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 54, in cache
    ret = fn(self, con, *args, **kw)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy_redshift/dialect.py", line 404, in get_columns
    enums=[], schema=col.schema, encode=col.encode)
  File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy_redshift/dialect.py", line 604, in _get_column_info
    **kw
TypeError: _get_column_info() takes exactly 9 arguments (8 given)

I tracked the issue down to the following commit: zzzeek/sqlalchemy@fadb8d6#diff-d33159d80d3deef1d5bdcd057dcc3d6bR2443

Another reflection regression

Sorry, I should've tested this earlier as soon as you finished working on #64 , but I kept using 0.1.2 and didn't check if there were any more issues.

The issue now is with tables that have foreign key constraints. I've been able to reproduce the problem with the following 2 tables:

create table ref.foo(id integer identity(1, 1) not null unique);

create table ref.bar(
foo_id integer not null,
foreign key(foo_id) references ref.foo(id))

And, when running .reflect(only=['bar'], schema='ref')

I get *** AttributeError: 'NoneType' object has no attribute 'group'

This is the traceback:

  File "/home/dario/Projects/l2/Hydra/database/hydra_database/__init__.py", line 81, in cached_metadatas
    redshiftmeta.reflect(only=tablenames, schema=schema)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 3647, in reflect
    Table(name, self, **reflect_opts)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 416, in __new__
    metadata._remove_table(name, schema)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 411, in __new__
    table._init(name, metadata, *args, **kw)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 484, in _init
    self._autoload(metadata, autoload_with, include_columns)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 496, in _autoload
    self, include_columns, exclude_columns
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1477, in run_callable
    return callable_(self, *args, **kwargs)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 364, in reflecttable
    return insp.reflecttable(table, include_columns, exclude_columns)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 578, in reflecttable
    exclude_columns, reflection_options)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 666, in _reflect_fk
    table_name, schema, **table.dialect_kwargs)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 447, in get_foreign_keys
    **kw)
  File "<string>", line 2, in get_foreign_keys
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 54, in cache
    ret = fn(self, con, *args, **kw)
  File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy_redshift/dialect.py", line 396, in get_foreign_keys
    referred_column = m.group('referred_column')
AttributeError: 'NoneType' object has no attribute 'group'

Advertise for maintainers

Had a discussion with @graingert that he's no longer using Redshift, so likely won't be able to be active in owning and maintaining this project.

I'm still using Redshift and can be involved, but probably can't give the project the attention it deserves without help. I propose we advertise for a new maintainer.

A proposed path forward:

  • Update README with a call to action at the top indicating that we're looking for a maintainer
  • Leave comments on all open issues indicating that we're looking for a new maintainer
  • @graingert grants rights on pypi to let @jklukas upload new releases

If we get someone interested in being maintainer, then we'd proceed with making sure they also have rights on pypi, and we can update metadata in setup.py as appropriate.

How does the above sound, @graingert ?

Bigcrunch improvements

@graingert - What would you think about hosting the bigcrunch repo on GitHub under the sqlalchemy-redshift org? It would make it easier to browse the code, and it would let us make PRs.

Another thought on bigcrunch - I'd be for upping the timeout. It looks like it's supposed to shut down 1 hour after the last created test session, and I've gotten burned by that a few times, meaning waiting for another whole round of shutdown and spinup to be able to run more tests. Maybe let it keep running for 4 hours?

Fix DELETE statement for Redshift Dialect

Redshift's delete statement varies slightly from Postgresql's. See here for documentation.

Basic delete statements have the same syntax.

For instance, the following is valid SQL in both dialects:

DELETE FROM customer_table WHERE customer_table.id > 1000

However, while the following is a valid statement in Postgresql:

DELETE FROM customer_table 
WHERE customer_table.id = order_table.customer_id 
AND order_table.id < 100

It needs to be written for Redshift as:

DELETE FROM customer_table
USING order_table
WHERE customer_table.id = order_table.customer_id
AND order_table.id < 100

SQLAlchemy should be able to build the resulting query with the following Python snippet:

from sqlalchemy import delete, Table, Column, Integer, MetaData
from redshift_sqlalchemy import RedshiftDialect
meta = MetaData()
customer = Table('customer_table', meta, Column('id', Integer, primary_key=True))
order = Table('order_table', meta, Column('id', Integer, primary_key=True), Column('customer_id', Integer))
del_stmt = delete(order).where(order.c.customer_id==customer.c.id).where(order.c.id<100)
print(del_stmt.compile(dialect=RedshiftDialect()))

Dialect cannot work with huge data tables (allocation of memory on server/client)

  1. For Amazon Redshift, SQLAlchemy uses the psycopg2 library (the same one used for Postgres) by default, and psycopg2 defaults to client-side cursors. When a SQL query executes, the whole result set is allocated in memory, so a big query overflows the memory limit.
    After this I tried a raw driver connection with psycopg2 and a server-side cursor, as recommended by the psycopg documentation (http://initd.org/psycopg/docs/usage.html#server-side-cursors):
    http://docs.aws.amazon.com/redshift/latest/dg/declare.html
begin;
DECLARE curname cursor FOR
    SELECT * FROM schema.table;
fetch forward 10 from curname;
fetch next from curname;
close curname;
commit;

or

command_text = """
    SELECT * FROM schema.table;
"""
conn = psycopg2.connect(host="HOST", dbname="dbname", password="****", user="user", port="5439")
cursor2 = conn.cursor('curname')
cursor2.itersize = 100
cursor2.execute(command_text)
for item in cursor2:
    print(item)

This method executes, but it still allocates the whole result set (now on the server side) and runs for about 10 minutes. In some cases it still overflows memory, just on the server side this time.

  2. When I tried to execute it using SQL Workbench, the query started executing and returned results in batches of 1000, printing them as it went.
    I assume the workbench applies some kind of limit/offset on its side.
    I also tested it with Java code using the Redshift JDBC driver. That code works as expected and returns results row by row.
    So I assume the problem is in the Python driver psycopg2, which allocates the whole result set in memory before returning results.
package connection;

import java.sql.*;
import java.util.Properties;
public class Redshift {
    //Redshift driver: "jdbc:redshift://host:5439/dev";
    static final String dbURL = "jdbc:redshift://host:5439/dbname";
    static final String MasterUsername = "user";
    static final String MasterUserPassword = "****";
    public static void main(String[] args) {
        Connection conn = null;
        Statement stmt = null;
        try{
           Class.forName("com.amazon.redshift.jdbc41.Driver");
           //Open a connection and define properties.
           System.out.println("Connecting to database...");
           Properties props = new Properties();
           //Uncomment the following line if using a keystore.
           //props.setProperty("ssl", "true");
           props.setProperty("user", MasterUsername);
           props.setProperty("password", MasterUserPassword);
           conn = DriverManager.getConnection(dbURL, props);
           //Try a simple query.
           System.out.println("Listing system tables...");
           stmt = conn.createStatement();
           String sql;
           sql = "SELECT * FROM schema.table;";
           ResultSet rs = stmt.executeQuery(sql);
           //Get the data from the result set.
           while(rs.next()){
              //Retrieve two columns.
              String catalog = rs.getString("list_entry_id");
              String name = rs.getString("source_code");
              //Display values.
              System.out.print("Catalog: " + catalog);
              System.out.println(", Name: " + name);
           }
           rs.close();
           stmt.close();
           conn.close();
        }catch(Exception ex){
           //For convenience, handle all errors here.
           ex.printStackTrace();
        }finally{
           //Finally block to close resources.
           try{
              if(stmt!=null)
                 stmt.close();
           }catch(Exception ex){
           }// nothing we can do
           try{
              if(conn!=null)
                 conn.close();
           }catch(Exception ex){
              ex.printStackTrace();
           }
        }
        System.out.println("Finished connectivity test.");
     }
}
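
For comparison at the SQLAlchemy level, psycopg2 can be asked for a server-side cursor through an execution option, which streams rows instead of loading the whole result set. A sketch, with a placeholder connection string and the same schema.table query as above:

import sqlalchemy as sa

engine = sa.create_engine('redshift+psycopg2://user:password@host:5439/dbname')

with engine.connect() as conn:
    result = conn.execution_options(stream_results=True).execute(
        sa.text('SELECT * FROM schema.table')
    )
    for row in result:
        print(row)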

Redshift spinup/shutdown script

We're running tests against Redshift now and have AWS access keys available to Travis, but the cluster is currently running full-time.

@graingert - you mentioned that you had a good idea of how to hook a script in to spin up the cluster for tests and then spin it back down. Are there any blockers at this point to make that happen?

BZIP2 format not supported with COPY

We use BZIP2 compression on our S3 files, and it doesn't seem possible to specify this with the CopyCommand.
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-bzip2

>>> import sqlalchemy as sa
>>> from redshift_sqlalchemy.dialect import CopyCommand
>>> table = sa.table("foo")
>>> CopyCommand(table, "s3://bucket/data/", "XXXYYYZZZXXXYYYZZZ11", "XXXYYYZZZXXXYYYZZZ11XXXYYYZZZXXXYYYZZZ11", format="BZIP2")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/virtualenv/local/lib/python2.7/site-packages/sqlalchemy_redshift/commands.py", line 362, in __init__
    self.formats)
ValueError: "format" parameter must be one of ['CSV', 'JSON', 'AVRO', None]
>>> CopyCommand(table, "s3://bucket/data/", "XXXYYYZZZXXXYYYZZZ11", "XXXYYYZZZXXXYYYZZZ11XXXYYYZZZXXXYYYZZZ11", compression="BZIP2")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/virtualenv/local/lib/python2.7/site-packages/sqlalchemy_redshift/commands.py", line 368, in __init__
    self.compression_types
ValueError: "compression" parameter must be one of ['GZIP', 'LZOP']
