
pytsql's Introduction

pytsql


pytsql allows you to run MSSQL scripts, typically run from GUIs, from Python.

This, in turn, eases concurrent, scheduled, and chained execution of scripts in a workflow. Put differently, it enables automated execution of Microsoft SQL Server scripts.

Moreover, pytsql supports parametrization of scripts, akin to templating.

Please have a look at our documentation if you're curious about more details.

Installation

To install, execute:

pip install pytsql

or, in a conda environment:

mamba install pytsql

Be aware that, depending on your platform, you will need specific drivers (Windows, Linux, macOS).

Usage

pytsql is tailored to run with MSSQL. Other SQL dialects are not supported.

The main function of interest is execute. It requires a SQLAlchemy Engine object as well as the path to the SQL script of interest. A typical use case might look as follows:

import pytsql
import sqlalchemy as sa

db_connection_string = "connection_to_my_database"
engine = sa.create_engine(db_connection_string)

pytsql.execute("my_sql_script.sql", engine)

Please also have a look at our documentation sections on usage and development.

Credits

The grammar is based on antlr4/grammars-v4.

pytsql's People

Contributors

dependabot[bot], enchomishinevqc, github-actions[bot], ivergara, jonashaag, kklein, quant-ranger[bot], rokasurbonasqc, simeonstoykovqc, svengiegerich


pytsql's Issues

Add support for returning result tables

As far as I can see, there is no option for returning the resulting table of a query when executing a file or SQL string. This feature, if at all possible, would be super nice, as it would make ongoing work with dataframes in Python much easier. Basically, something similar to what pandas supports with df = pd.read_sql(string, con).
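Until such a feature exists, one possible workaround is to have the script materialize its result into a table, and then read that table back with plain SQLAlchemy (or pd.read_sql) after pytsql has run. A minimal sketch; `fetch_table` is a hypothetical helper, not part of pytsql's API:

```python
import sqlalchemy as sa

def fetch_table(engine, table: str):
    """Fetch all rows of a table that a previously executed script materialized.

    Hypothetical helper, not part of pytsql. `table` must be trusted input,
    since it is interpolated directly into the query text.
    """
    with engine.connect() as conn:
        return conn.execute(sa.text(f"SELECT * FROM {table}")).all()
```

The same idea works with pandas by passing the SELECT and the engine to pd.read_sql.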

Test suite: one test disables C++ implementation for all subsequent ones

All tests that run after

def test__py_parse__warns():
    tsql.USE_CPP_IMPLEMENTATION = False
    with pytest.warns(match=r"Can not find C\+\+ version of the parser"):
        _split("")

will not use the C++ implementation, and print a misleading warning from here

pytsql/src/pytsql/tsql.py

Lines 150 to 155 in 7699639

if not USE_CPP_IMPLEMENTATION:
    warnings.warn(
        "Can not find C++ version of the parser, Python version will be used instead."
        " Something is likely wrong with your installation of the package,"
        " and is preventing the use of the faster C++ parser."
    )

like in the current CI

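The usual fix for this class of state leakage is to restore the module-global flag after each test. A minimal sketch of the idea, using a stand-in namespace instead of the real pytsql.tsql module (in the actual suite this would be a pytest fixture, e.g. via monkeypatch):

```python
import types

# Stand-in for the pytsql.tsql module; the real suite would import it.
tsql = types.SimpleNamespace(USE_CPP_IMPLEMENTATION=True)

def run_isolated(test_fn):
    """Run a test function, restoring the global flag afterwards."""
    original = tsql.USE_CPP_IMPLEMENTATION
    try:
        test_fn()
    finally:
        tsql.USE_CPP_IMPLEMENTATION = original

def misbehaving_test():
    # Simulates test__py_parse__warns flipping the flag without cleanup.
    tsql.USE_CPP_IMPLEMENTATION = False

run_isolated(misbehaving_test)
# The flag is restored for subsequent tests.
```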

Hit test-pypi memory limit

There is a workflow that tries to upload the package to TestPyPI, and it started failing with:

WARNING  Error during upload. Retry with the --verbose option for more details. 
ERROR    HTTPError: 400 Bad Request from https://test.pypi.org/legacy/          
         Project size too large. Limit for project 'pytsql' total size is 10 GB.
         See https://test.pypi.org/help/#project-size-limit  


This should not be critical.

We could consider:

  1. Deleting what's already there to free up space, and dealing with this again when it recurs
  2. Disabling the workflow that uploads to TestPyPI

`sa.text()` needs to escape colons (in text)

Relates to PR #66

Issue: SQLAlchemy's documentation states:

For SQL statements where a colon is required verbatim, as within an inline string, use a backslash to escape:
t = text(r"SELECT * FROM users WHERE name='\:username'")

However, pytsql does not perform this escaping. Hence, the example above fails.

(Same for INSERT INTO dbo.tbl VALUES (':foo'))
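A naive sketch of what such escaping could look like. Note this is an assumption about a possible fix, and it escapes every colon, which would also break intended bind parameters; a real implementation must distinguish string literals from parameter markers:

```python
def escape_colons(sql: str) -> str:
    # Naive: escape every colon so sqlalchemy.text() treats it as a literal.
    # A real fix must leave intended :param placeholders untouched.
    return sql.replace(":", r"\:")

escape_colons("INSERT INTO dbo.tbl VALUES (':foo')")
# yields: INSERT INTO dbo.tbl VALUES ('\:foo')
```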

Missing support for user-defined table types

The following statement to create a user-defined table type is not executable via the latest pytsql (1.1.4):

CREATE TYPE x AS TABLE (
    id INT
)

This statement has been valid in MSSQL since 2008; see the documentation.


  File "D:[...]\pytsql\tsql.py", line 252, in execute
    executes(_code(path, encoding), engine, parameters, isolate_top_level_statements)
  File "[...]\lib\site-packages\pytsql\tsql.py", line 226, in executes
    for batch in _split(parametrized_code, isolate_top_level_statements):
  File "[...]\lib\site-packages\pytsql\tsql.py", line 163, in _split
    tree = parse(InputStream(data=code), "tsql_file", error_listener)
  File "[...]\lib\site-packages\pytsql\grammar\sa_tsql.py", line 82, in parse
    return _cpp_parse(stream, entry_rule_name, sa_err_listener)
  File "[...]\lib\site-packages\pytsql\grammar\sa_tsql.py", line 105, in _cpp_parse
    return sa_tsql_cpp_parser.do_parse(tsqlParser, stream, entry_rule_name, sa_err_listener)
  File "[...]\lib\site-packages\pytsql\tsql.py", line 146, in syntaxError
    raise ValueError(f"Error parsing SQL script: {error_message}")
ValueError: Error parsing SQL script: Line 1:49 mismatched input 'AS' expecting FROM. Problem near token 'AS'.
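Until the grammar supports CREATE TYPE, one possible workaround is to send the offending statement straight to the server through SQLAlchemy, bypassing pytsql's parser. A sketch, with `execute_raw` being a hypothetical helper:

```python
import sqlalchemy as sa

def execute_raw(engine, statement: str) -> None:
    """Send a single statement directly to the server, bypassing pytsql's parser."""
    with engine.begin() as conn:
        conn.execute(sa.text(statement))
```

The rest of the script can still go through pytsql.execute as usual.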

Support for `@@ROWCOUNT`

I've received requests for pytsql to support the @@ROWCOUNT variable, so that the number of affected rows can be printed. An example usage would be:

<select/update statement>
PRINT (@@ROWCOUNT)

The reason this doesn't work at the moment seems to be that if we don't send both statements in a single "batch" towards the SQL server, the variable gets reset. The current implementation of pytsql separates all top-level statements and sends them separately.

My suggestion would be to introduce an option along the lines of "isolate-top-level-statements", true by default, which when enabled retains the current behavior of the tool. When disabled, batching would be done in whole groups between GO commands, so that users who want such "context-sensitive" commands can use the latter mode. The only downside of the latter mode should be that PRINT statement values would only be shown when a full batch executes, but that can easily be tuned by the user with an appropriate number of GO statements. In my opinion such a change gives more control to the user, and the default value of the option keeps us backwards compatible.
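The proposed non-isolated mode could batch on GO separators roughly like this (a sketch, assuming GO appears on its own line, as the T-SQL batch separator must):

```python
def split_on_go(script: str) -> list:
    """Split a T-SQL script into batches at GO separator lines."""
    batches, current = [], []
    for line in script.splitlines():
        if line.strip().upper() == "GO":
            # Close the current batch; empty batches are dropped.
            if current:
                batches.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        batches.append("\n".join(current))
    return batches
```

With this scheme, an UPDATE and its `PRINT (@@ROWCOUNT)` stay in the same batch as long as no GO separates them.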

Declaration in Stored Procedure is used to prepend further statements => Error

Declarations in stored procedures can be defined using stored procedure arguments. pytsql currently still copies such declarations to the prefix of subsequent batches when isolate_top_level_statements=True, which causes errors.
TSQL:

CREATE PROCEDURE CREATEALLDATES
    (
        @StartDate AS DATE, @EndDate AS DATE
    ) AS
    DECLARE @Current AS DATE = DATEADD(DD, 0, @StartDate); DROP TABLE IF EXISTS ##alldates CREATE TABLE ##alldates (
        dt DATE PRIMARY KEY
    ) WHILE @Current <= @EndDate BEGIN
    INSERT INTO ##alldates
    VALUES (@Current);
    SET @Current = DATEADD(DD, 1, @Current) -- add 1 to current day
END
GO
IF OBJECT_ID ( N'dbo.get_db_sampling_factor' , N'FN' ) IS NOT NULL DROP FUNCTION get_db_sampling_factor ;

Error:

sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('42000', '[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Must declare the scalar variable "@StartDate". (137) (SQLExecDirectW)')
[SQL: DECLARE @Current AS DATE = DATEADD ( DD , 0 , @StartDate ) ; IF OBJECT_ID ( N'dbo.get_db_sampling_factor' , N'FN' ) IS NOT NULL DROP FUNCTION get_db_sampling_factor ;]

Latest `speedy-antlr-tool` is not compatible with our ANTLR version

We are currently using ANTLR version 4.9.2 (both for generating and running the parsers, of course), whereas speedy-antlr-tool started requiring ANTLR >= 4.10.

Note that if you are not careful, installing the latest speedy-antlr-tool will silently bring in the latest 4.11 ANTLR runtime, and this can additionally confuse things (e.g. you will get errors when you try to run parsers that were generated by version 4.9 in the new 4.11 runtime).

Spurious error during debugging: 'Namespace' object has no attribute 'backend'

This error appeared while I was debugging pytsql; it then vanished, and now it has resurfaced:

Traceback (most recent call last):
  File "/home/martin/progs/pycharm-2022.2.1/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
AttributeError: 'Namespace' object has no attribute 'backend'

It could be a problem with my PyCharm installation, but this is the first repo where it happens. I checked both Python 3.11 and 3.10, and tested PyCharm 2022.2.1 and 2022.2.4.

Deal with freetds driver in CI

To handle the FreeTDS driver, a separate CI job is needed that installs the relevant driver; simply adding it to the existing drivers list ends up with:

sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'libtdsodbc.so' : file not found (0) (SQLDriverConnect)")

Extend support for conditionals

For my project, I'm currently trying to use pytsql for use cases beyond replacing DB or table names. In particular, I want to make use of conditionals.
However, I run into some issues there:

  • Executing certain functions within conditionals throws unexpected errors
    • E.g., if you want to parameterize whether a certain table or a view is created. For views, I use the following approach:
IF 1 = 1
BEGIN
     SET NOEXEC ON
END
GO
CREATE VIEW v AS ...
SET NOEXEC OFF

... however, pytsql indirectly throws an unexpected error via SQLAlchemy:
sqlalchemy.exc.ResourceClosedError: This result object does not return rows. It has been closed automatically. (If it's helpful I can also post the full error, but that requires some anonymization first.)

  • Declaring variables within the conditional currently seems not possible.
    • While there is an obvious workaround (declaring every variable at the top of the file), pytsql's behavior here is unexpected
    • I believe this is already addressed in issue #19

pytsql silently changes the isolation level of my engine

This is due to a bug in sqlalchemy's Connection.execution_options() or something deeper like pyodbc. Here is a "minimal" example (also in a gist):

import random
import string
import pytest

import pytsql

import sqlalchemy
from sqlalchemy.pool import NullPool


TEMPDB_CONN_STR = "mssql+pyodbc://sa:QuantCo123@localhost:1433/tempdb?driver=libtdsodbc.so"


def show_bug(
    tempdb_connection_string: str,
    break_with_root_cause: bool,
    fix_by_disabling_pooling: bool,
    db: str = "tempdb",
):
    eng = sqlalchemy.create_engine(
        tempdb_connection_string,
        connect_args={"autocommit": True},
        echo=False,
        **({"poolclass": NullPool} if fix_by_disabling_pooling else {}),
    )

    source_table = random_name()
    table_1 = random_name()
    table_2 = random_name()

    # print("Source table:", source_table)
    # print("Table 1:", table_1)
    # print("Table 2:", table_2)

    with eng.connect() as conn:
        conn.execute(f"CREATE TABLE {db}.dbo.{source_table} (ID INT)")
        conn.execute(f"INSERT INTO {db}.dbo.{source_table} VALUES (123)")

    with eng.connect() as conn:
        conn.execute(f"SELECT * INTO {db}.dbo.{table_1} FROM {db}.dbo.{source_table}")
    with eng.connect() as conn:
        print("Found in table:", conn.execute(f"SELECT * FROM {db}.dbo.{table_1}").all())

    if not break_with_root_cause:
        pytsql.executes("SELECT 12", eng)
    else:
        eng.connect().execution_options(isolation_level="AUTOCOMMIT")

    with eng.connect() as conn:
        # Won't be committed if no fixes are applied.
        conn.execute(f"SELECT * INTO {db}.dbo.{table_2} FROM {db}.dbo.{source_table}")
    with eng.connect() as conn:
        # Fails with "invalid object name" if no fixes are applied because the above didn't get committed.
        print("Found in table:", conn.execute(f"SELECT * FROM {db}.dbo.{table_2}").all())


def random_name() -> str:
    return "".join(random.choices(string.ascii_letters, k=10))


print("Should work")
for break_with_root_cause in [False, True]:
    show_bug(
        tempdb_connection_string=TEMPDB_CONN_STR,
        break_with_root_cause=break_with_root_cause,
        fix_by_disabling_pooling=True,  # Disabling pooling fixes it because connections are not reused.
    )

print("Should NOT work")
for break_with_root_cause in [False, True]:
    with pytest.raises(sqlalchemy.exc.ProgrammingError, match="Invalid object name"):
        show_bug(
            tempdb_connection_string=TEMPDB_CONN_STR,
            break_with_root_cause=break_with_root_cause,
            fix_by_disabling_pooling=False,
        )
    print("Told ya")

`SET` statements for local variables are unreliable

With the current design of the library, SET statements have a rather weird behavior. In short, before every top-level statement all local variables are redeclared with their initial value. This is a behavior hidden from the user and is potentially dangerous.

Examples

To illustrate the current behavior, here are some examples:

  • Top-level SET

    DECLARE @A INT = 10
    SET @A = 5
    PRINT @A
    

    Will print 10. The SET is actually executed but has no effect since the variable is lost after batch execution and then redeclared with its initial value before the PRINT.

  • SET in control flow

    DECLARE @A INT = 10
    IF 1 = 1
      BEGIN
        SET @A = 5
        PRINT @A
      END
    

    Will print 5 since the whole IF is executed as a single statement.

  • And a mixed example that shows the inconsistent behavior

    DECLARE @A INT = 10
    IF 1 = 1
      BEGIN
        SET @A = 5
        PRINT @A
      END
    PRINT @A
    

    Will produce 5 followed by a 10, combining the behavior of the previous two examples.

Solution

It's unfortunately not easy to fix this in general, but it would be nice to at least add a warning log whenever SET statements are parsed in a script.

Note that simply banning SET statements at the top level is not sufficient, because of cases like the third example; hence it is not trivial to statically determine which scripts would work as expected.
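A sketch of such a warning, using a simple regular-expression scan (an assumption on my part; a robust version would hook into the parser rather than use a regex):

```python
import re
import warnings

# Heuristic: any SET on a local variable (@name) at the start of a line.
SET_LOCAL_VAR = re.compile(r"(?im)^\s*SET\s+@\w+")

def warn_on_local_set(script: str) -> None:
    """Warn if a script assigns to local variables via SET.

    Their effect may be silently lost when top-level statements
    are executed in isolation.
    """
    if SET_LOCAL_VAR.search(script):
        warnings.warn(
            "Script contains SET statements on local variables; their effect "
            "may be lost when top-level statements are executed in isolation."
        )
```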

Minimal Python version

The ANTLR4 runtime package states that it supports Python 3.7 [0].

However, inspecting the source code of the package itself, there is no version-specific restriction, so it seems likely that any version above 3.7 is safe to use as well.

Potentially make running tests easier for Linux

I needed to make a few changes to conftest.py to avoid strange timeouts.

And it might be nice to have a developer README.md in the repo that is distinct from the README.md that gets packaged for PyPI.

I consider this a lower priority issue, but wanted to put my observations somewhere.
