d6t / d6tstack
Quickly ingest messy CSV and XLS files. Export to clean pandas, SQL, parquet
License: MIT License
Hello everyone,
I am having problems with the function d6tstack.utils.pd_to_mssql() when importing a DataFrame into a Microsoft SQL Server database. The name of the SQL table is fxxProduktKatalogDE. I read that the table name passed to the function should be written lowercase, and so I did, but I get an error saying the table does not exist. The table is there in the database. I also tried writing the table name exactly as it appears in the database, but the same error occurs.
Here are the code and the error.
Can you help me solve this issue?
Thank you in advance!
The function pd_to_mssql uses BULK INSERT, which requires the target file to exist on the same machine as the SQL Server instance. If you're running the Python code on a remote machine, like your own computer, pd_to_mssql will return the error "Cannot bulk load because the file "mysql.csv" could not be opened". I'm hopeful you can find a workaround, but I imagine you'll have to change your approach.
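One speculative workaround, sketched here as an assumption: the traceback quoted later on this page shows that pd_to_mssql accepts a tmpfile argument and runs BULK INSERT {table} FROM '{tmpfile}', so pointing tmpfile at a network share that the SQL Server service account can read might let the server find the file. The share path, df, and uri_mssql below are hypothetical placeholders.

```python
import d6tstack.utils

# Hypothetical UNC path on a share readable by the SQL Server service account.
tmp = r'\\fileserver\share\staging.csv'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'mytable', if_exists='replace', tmpfile=tmp)
```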
When I tried inserting I got an error like so:
Traceback (most recent call last):
File "results_sql_fasttext.py", line 284, in <module>
d6tstack.utils.pd_to_psql(aux_results, URI, 'aux_results', if_exists='append')
File "/usr/local/lib/python3.6/dist-packages/d6tcollect/__init__.py", line 58, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/d6tstack/utils.py", line 107, in pd_to_psql
cursor.copy_from(fbuf, table_name, sep=sep, null='')
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: "ANSVAR"
CONTEXT: COPY aux_results, line 1, column aux_id: "ANSVAR"
aux_id is the id created by the database - nothing should be inserted into it since it is auto-generated. "ANSVAR" is a string that should go into the fourth column, not the first. So it seems there are multiple issues at play.
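A hedged workaround sketch: COPY fills table columns positionally unless told otherwise, so issuing the COPY directly through psycopg2 with an explicit column list lets the database assign aux_id itself. The DSN and column names below are hypothetical placeholders.

```python
import io
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@host/db")  # hypothetical DSN
cur = conn.cursor()
buf = io.StringIO()
aux_results.to_csv(buf, index=False, header=False)
buf.seek(0)
# Name every column except the auto-generated aux_id (hypothetical names).
cur.copy_from(buf, 'aux_results', sep=',', null='',
              columns=('col_a', 'col_b', 'col_c', 'col_d'))
conn.commit()
cur.close()
conn.close()
```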
Hi,
I was trying to use the module 'd6tstack.utils' for mssql but am getting the error below:
module 'd6tstack.utils' has no attribute 'pd_to_mssql'
Any chance of adding this soon?
Regards,
Jai
At least disable by default.
I have a 200-million-row dataframe that I've sliced in-memory into chunks of 200,000 rows to insert gradually.
After running d6tstack.utils.pd_to_psql for 30 minutes or so (about the halfway mark), I get the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "...", port ... failed: FATAL: sorry, too many clients already
I added the following lines to d6tstack/utils.py and the error stopped reproducing:
sql_cnxn.close()
sql_engine.dispose()
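In context, the patch would sit at the tail of pd_to_psql; a minimal sketch, assuming the function ends roughly the way the tracebacks quoted elsewhere on this page suggest:

```python
cursor.copy_from(fbuf, table_name, sep=sep, null='')
sql_cnxn.commit()
cursor.close()
sql_cnxn.close()      # added: release the DBAPI connection
sql_engine.dispose()  # added: drop the SQLAlchemy pool so repeated calls
                      # don't accumulate open connections on the server
```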
Hello everybody,
great package, but I get a strange error message when I try the above-mentioned function with a SQLAlchemy engine connected via ODBC. See the code below:
```python
engine = create_engine(constring2)
engine.url
# mssql+pymssql://XXXXXX:***@azurexxxx
import d6tstack.utils as du
du.pd_to_mssql(I_df, engine, "t_abda_pacapo01_stage", if_exists="replace")
```
This results in the error message below:
argument of type 'Engine' is not iterable
Any ideas how I can solve this?
thanks and br
christoph
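A hedged guess at a fix: the "not iterable" error suggests pd_to_mssql performs a substring check on its uri argument, i.e. it expects a connection-string URI rather than an Engine. Passing the URL as a string may work; note that recent SQLAlchemy versions mask the password when stringifying a URL, so building the string by hand may be safer.

```python
import d6tstack.utils as du

# Assumption: pd_to_mssql builds its own engine internally from a URI string.
du.pd_to_mssql(I_df, str(engine.url), "t_abda_pacapo01_stage", if_exists="replace")
```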
I tried to install the package but didn't succeed, and I can't use it.
Would it be possible to set up the repo for installation via Anaconda?
I see that you changed the separator from tab. Is there something else?
Originally posted by @danielpoon in #4 (comment)
example code:
```python
uri_mssql = f'mssql+pymssql://{u}:{p}@{server}/{db}'
d6tstack.utils.pd_to_mssql(df=df, uri=uri_mssql, table_name='mytable', schema_name='myschema', if_exists='replace')
```
The error says that I don't have access to the schema 'dbo' when making the connection, which I assume is the default schema the connection tries to use when schema_name is set to None.
Line 69 in 155e9b9
The method pd_to_psql has a default value of None for the schema_name parameter, but the last update, v0.1.7, added a lower-casing call on that parameter. So execution breaks on a logical conflict.
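A minimal sketch of the guard that would resolve the conflict, assuming v0.1.7 calls .lower() on schema_name unconditionally:

```python
if schema_name is not None:
    schema_name = schema_name.lower()
```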
The installation instructions are written for pip; sadly, on my system plain pip resolves Python 2 dependencies only. I think changing all instances of pip to pip3 would fix this issue.
I have searched but had no luck; I may be missing the right terms. I have a pre-existing table with a primary key:
```python
class District(Base):
    __tablename__ = 'districts'
    id = Column(Integer, primary_key=True, autoincrement=True)
    district_county = Column(String(30))
    district_type = Column(String(100))
    district_name = Column(String(100))
    file_date = Column(Date)
```
I process a bunch of CSV files with pandas and then pass the result to a function:
```python
def _write_df(self, df, engine, table_name) -> None:
    # df.to_sql(table_name, engine, if_exists='append', index=False, chunksize=100000)
    # add null id column
    # df.insert(0, 'id', '')
    d6tstack.utils.pd_to_psql(df, str(engine.url), table_name, if_exists='append', sep='\t')
    return None
```
But I get this error,
File "/usr/local/lib/python3.8/site-packages/d6tstack/utils.py", line 107, in pd_to_psql
cursor.copy_from(fbuf, table_name, sep=sep, null='')
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: "BAKER"
CONTEXT: COPY districts, line 1, column id: "BAKER"
I tried adding an 'id' column with df.insert(0, 'id', ''), but then I get this error:
psycopg2.errors.NotNullViolation: null value in column "id" of relation "districts" violates not-null constraint
What am I doing wrong? With pandas.to_sql I don't need an 'id' column; I just pass index=False in the call.
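One hedged workaround, since pd_to_psql COPYs into all table columns positionally: bypass it for tables with auto-generated keys and issue the COPY yourself with an explicit column list, so Postgres assigns id. A sketch, assuming a plain postgresql:// DSN and the column names from the model above:

```python
import io
import psycopg2

def write_df(df, dsn, table_name):
    conn = psycopg2.connect(dsn)  # e.g. 'postgresql://user:pass@host/db'
    cur = conn.cursor()
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # Name only the data columns; the serial id is then filled in by Postgres.
    cur.copy_expert(
        "COPY districts (district_county, district_type, district_name, file_date) "
        "FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()
    cur.close()
    conn.close()
```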
Hi,
Whenever I dump data using d6tstack, it appends a \n (newline escape character) to the last column of each row. Any suggestions on how I can tackle such situations?
Any response would be highly appreciated.
Thanks
Allow setting the encoding when writing the dataframe to CSV, before loading the infile. Currently it raises an error when the CSV contains Unicode characters.
Hi,
there is a spelling error.
File "C:\Python37\lib\site-packages\d6tstack\utils.py", line 65, in pd_readsql_table_from_sqlengine
return pd_readsql_query_from_sqlengine(uri, "SELECT * FROM {};".fromat(table_name), schema_name=schema_name, connect_args=connect_args)
AttributeError: 'str' object has no attribute 'fromat'
Using a server name with a \ in it:
```python
# server name is someserver123\data
# database name = Comissions
# no user name and password because Trusted_Connection = yes
import pandas as pd
import d6tstack.utils

my_file = "myfile.csv"  # absolute file path
csv_file = pd.read_csv(my_file, iterator=True, chunksize=20000, low_memory=False, dtype=str)
for i, chunk in enumerate(csv_file):
    d6tstack.utils.pd_to_mssql(chunk, 'mssql+pymssql://someserver123\data/Comissions', 'dbo.mydb', if_exists="append")
```
will yield this error:
UserWarning: .pd_to_mssql() is experimental, if any problems please raise an issue on https://github.com/d6t/d6tstack/issues or make a pull request
warnings.warn('.pd_to_mssql() is experimental, if any problems please raise an issue on https://github.com/d6t/d6tstack/issues or make a pull request')
Traceback (most recent call last):
File "src\pymssql.pyx", line 636, in pymssql.connect
File "src_mssql.pyx", line 1957, in _mssql.connect
File "src_mssql.pyx", line 677, in _mssql.MSSQLConnection.init
_mssql.MSSQLDriverException: Connection to the database failed for an unknown reason.
This error is familiar from pyodbc; to fix it there I use urllib and add a parameter. I don't know how to pull this off with d6tstack.
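For reference, the pyodbc-style workaround alluded to above looks roughly like this; the driver name is an assumption, and whether the resulting URI can be handed to d6tstack's pd_to_mssql is untested:

```python
import urllib.parse
import sqlalchemy

# Quote the raw ODBC connection string so the backslash in the server name
# never reaches SQLAlchemy's host parser.
params = urllib.parse.quote_plus(
    r"DRIVER={ODBC Driver 17 for SQL Server};"
    r"SERVER=someserver123\data;"
    r"DATABASE=Comissions;"
    r"Trusted_Connection=yes;"
)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=" + params)
```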
Hi,
I am wondering whether it is possible to write the file in chunks, because when I try to handle a large file my server runs out of memory, though your library works outstandingly for small files. I am using the following method to write a file to the database:
d6tstack.utils.pd_to_mysql(meta_dataframe, 'mysql+mysqlconnector://abc:[email protected]/abcdb', meta_filename)
How can I handle large files? Any suggestions would be great.
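A hedged sketch of chunked loading, mirroring the read_csv chunking shown in another issue above; whether pd_to_mysql accepts if_exists like the other pd_to_* helpers is an assumption, and the credentials and names are placeholders:

```python
import pandas as pd
import d6tstack.utils

uri = 'mysql+mysqlconnector://abc:password@host/abcdb'  # hypothetical credentials
reader = pd.read_csv('meta.csv', chunksize=100_000)     # stream instead of loading all at once
for i, chunk in enumerate(reader):
    d6tstack.utils.pd_to_mysql(chunk, uri, 'meta_table',
                               if_exists='replace' if i == 0 else 'append')
```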
Hi,
First of all, thank you very much for such a wonderful library.
I am facing a problem while dumping data into a MySQL database.
NA and blank values end up as either NULL or blank, seemingly at random, in the database.
For example, if a value in the CSV is 'null' or 'NA', then after dumping the CSV into the database that value becomes 'NA' or blank.
Can you please address this problem?
In the file /usr/local/lib/python3.7/site-packages/d6tstack/utils.py, the function
def pd_to_psql(df, uri, tablename, schema_name=None, if_exists='fail'):
takes schema_name but never applies it when connecting, so it throws an error when you need to use a schema. I fixed it via:
sql_engine = sqlalchemy.create_engine(uri, connect_args={'options': '-csearch_path={}'.format(schema_name)})
Please make the change in your code. Thanks.
Regards,
Cody Kirrabilli
Hi, I'm trying to import a CSV to an MSSQL server and I get the following error:
creating mysql.csv ok
loading mysql.csv ok
---------------------------------------------------------------------------
MSSQLDatabaseException Traceback (most recent call last)
src/pymssql.pyx in pymssql.Cursor.execute()
src/_mssql.pyx in _mssql.MSSQLConnection.execute_query()
src/_mssql.pyx in _mssql.MSSQLConnection.execute_query()
src/_mssql.pyx in _mssql.MSSQLConnection.format_and_run_query()
src/_mssql.pyx in _mssql.check_cancel_and_raise()
src/_mssql.pyx in _mssql.maybe_raise_MSSQLDatabaseException()
MSSQLDatabaseException: (4860, b'Cannot bulk load. The file "mysql.csv" does not exist.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')
During handling of the above exception, another exception occurred:
OperationalError Traceback (most recent call last)
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1245 self.dialect.do_execute(
-> 1246 cursor, statement, parameters, context
1247 )
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
580 def do_execute(self, cursor, statement, parameters, context=None):
--> 581 cursor.execute(statement, parameters)
582
src/pymssql.pyx in pymssql.Cursor.execute()
OperationalError: (4860, b'Cannot bulk load. The file "mysql.csv" does not exist.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')
The above exception was the direct cause of the following exception:
OperationalError Traceback (most recent call last)
<ipython-input-11-267ba832f344> in <module>
5 #uri_mssql = f'mysql+mysqlconnector://{user}:{passW}@{host}:1433/{db}'
6
----> 7 d6tstack.utils.pd_to_mssql(df, uri_mssql, 'prueba', if_exists='replace') # experimental
~/codigos/thomson/myenv/lib/python3.6/site-packages/d6tcollect/__init__.py in wrapper(*args, **kwargs)
78 payload['exceptionMsg'] = str(e)
79 _submit(payload)
---> 80 raise e
81
82 return wrapper
~/codigos/thomson/myenv/lib/python3.6/site-packages/d6tcollect/__init__.py in wrapper(*args, **kwargs)
72 _submit(payload)
73 try:
---> 74 return func(*args, **kwargs)
75 except Exception as e:
76 payload['event'] = 'exception'
~/codigos/thomson/myenv/lib/python3.6/site-packages/d6tstack/utils.py in pd_to_mssql(df, uri, table_name, schema_name, if_exists, tmpfile)
188 table_name = '{}.{}'.format(schema_name,table_name)
189 sql_load = "BULK INSERT {} FROM '{}';".format(table_name, tmpfile)
--> 190 sql_engine.execute(sql_load)
191
192 os.remove(tmpfile)
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, statement, *multiparams, **params)
2180
2181 connection = self._contextual_connect(close_with_result=True)
-> 2182 return connection.execute(statement, *multiparams, **params)
2183
2184 def scalar(self, statement, *multiparams, **params):
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, object_, *multiparams, **params)
974 """
975 if isinstance(object_, util.string_types[0]):
--> 976 return self._execute_text(object_, multiparams, params)
977 try:
978 meth = object_._execute_on_connection
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_text(self, statement, multiparams, params)
1147 parameters,
1148 statement,
-> 1149 parameters,
1150 )
1151 if self._has_events or self.engine._has_events:
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1248 except BaseException as e:
1249 self._handle_dbapi_exception(
-> 1250 e, statement, parameters, cursor, context
1251 )
1252
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1474 util.raise_from_cause(newraise, exc_info)
1475 elif should_wrap:
-> 1476 util.raise_from_cause(sqlalchemy_exception, exc_info)
1477 else:
1478 util.reraise(*exc_info)
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
396 exc_type, exc_value, exc_tb = exc_info
397 cause = exc_value if exc_value is not exception else None
--> 398 reraise(type(exception), exception, tb=exc_tb, cause=cause)
399
400
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
150 value.__cause__ = cause
151 if value.__traceback__ is not tb:
--> 152 raise value.with_traceback(tb)
153 raise value
154
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1244 if not evt_handled:
1245 self.dialect.do_execute(
-> 1246 cursor, statement, parameters, context
1247 )
1248 except BaseException as e:
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
579
580 def do_execute(self, cursor, statement, parameters, context=None):
--> 581 cursor.execute(statement, parameters)
582
583 def do_execute_no_params(self, cursor, statement, context=None):
src/pymssql.pyx in pymssql.Cursor.execute()
OperationalError: (pymssql.OperationalError) (4860, b'Cannot bulk load. The file "mysql.csv" does not exist.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')
[SQL: BULK INSERT prueba FROM 'mysql.csv';]
(Background on this error at: http://sqlalche.me/e/e3q8)
Any possibility of being able to use d6tstack with 'pg8000'?
I have CSV files and I'm trying to reconcile the differing columns between them.
The function d6tstack.combine_csv.CombinerCSV works fine in other cases, but now I have problems with the encoding.
I need to pass the parameter encoding="latin-1" to pandas read_csv, but it doesn't work.
In the documentation I found that it is possible to pass read_csv_params={"encoding": "latin-1"}, but that doesn't work either; the encoding isn't applied.
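A minimal repro sketch of what was tried, with hypothetical file names:

```python
import d6tstack.combine_csv

c = d6tstack.combine_csv.CombinerCSV(
    ['data1.csv', 'data2.csv'],                 # hypothetical latin-1 encoded files
    read_csv_params={'encoding': 'latin-1'})
df = c.to_pandas()  # reportedly fails: the encoding parameter is not applied
```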
Reading the same file directly with pandas and the encoding parameter works fine.
Is there another possibility to resolve this?
Thanks in advance.
Hi,
I'm using:
Python : 3.7
openpyxl : 2.6.2 (Last Stable)
d6tstack : 0.1.8
Traceback (most recent call last):
File "/home/kedf/PycharmProjects/XLStoSQL/xlsxtosql.py", line 1, in <module>
import d6tstack.convert_xls
File "/usr/local/lib/python3.7/site-packages/d6tstack/convert_xls.py", line 12, in <module>
from openpyxl.utils import coordinate_from_string
ImportError: cannot import name 'coordinate_from_string' from 'openpyxl.utils' (/usr/local/lib/python3.7/site-packages/openpyxl/utils/__init__.py)
In the docs of openpyxl 2.6.2, the function seems to be named openpyxl.utils.cell.coordinate_from_string:
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.utils.cell.html#openpyxl.utils.cell.coordinate_from_string
So maybe changing line 12 of d6tstack/convert_xls.py from
from openpyxl.utils import coordinate_from_string
to
from openpyxl.utils.cell import coordinate_from_string
will fix it?
Thanks
The library works well with data columns that don't contain commas, but if a value contains a comma, e.g. a name like "Smith, John", and you then try pd_to_psql, you get the error below.
Any thoughts?
DataError Traceback (most recent call last)
in
----> 8 d6tstack.utils.pd_to_psql(df, fast_engine, 'xxx', 'xxx', if_exists='append')
9
10
/usr/local/lib/python3.7/site-packages/d6tstack/utils.py in pd_to_psql(df, uri, tablename, schema_name, if_exists)
122 df.to_csv(fbuf, index=False, header=False)
123 fbuf.seek(0)
--> 124 cursor.copy_from(fbuf, tablename, sep=',', null='')
125 sql_cnxn.commit()
126 cursor.close()
DataError: extra data after last expected column
CONTEXT: COPY xxxx, line 1:
Providing an example to replicate this issue.
Consider the following pandas DataFrame called my_data:

| id | coordinates  |
|----|--------------|
| 01 | [50.1, 68.2] |
| 02 | [52.2, 67.9] |
and consider the following PostgreSQL table definition:

```sql
CREATE TABLE coordinates (
  id int2,
  coordinates int8range
);
```
Now, even if you cast this dataframe column to the type object
my_data['coordinates'] = my_data['coordinates'].astype(object)
the following insert operation throws an error:
d6tstack.utils.pd_to_psql(my_data, db, table, schema_name=schema, if_exists='replace')
The error log is:
BadCopyFileFormat: extra data after last expected column
Presumably the comma inside the serialized list "[50.1, 68.2]" is read by COPY as a column separator, as in the issue above. The same insert operation works with the dataframe.to_sql functionality provided by pandas.
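A hedged workaround for both reports above: COPY with a plain comma separator doesn't understand CSV quoting, so commas embedded in values spill into extra columns. Later tracebacks on this page show pd_to_psql accepting a sep parameter; switching it to a tab avoids the clash, assuming no values contain tabs:

```python
import d6tstack.utils

# Tab-separate the staging data so embedded commas survive the COPY.
d6tstack.utils.pd_to_psql(df, uri, 'mytable', if_exists='append', sep='\t')
```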
pd_readsql_table_from_sqlengine errors due to doing "SELECT * FROM {};".fromat; presumably this should be format rather than fromat.
I've been using d6tstack for months and have never had any problems, but recently the error message below appeared. What can I do to solve this without having to change the variables or permissions of my database?
I think it's something connected to mysqlconnector. Is there another driver I can use instead, for example pymysql?
[2022-01-07 15:40:35,524] {bash_operator.py:126} INFO - [2022-01-07 15:40:31] Insert - SaldosB2C IS
[2022-01-07 15:40:35,524] {bash_operator.py:126} INFO - creating tb_saldos_b2c_iss.csv ok
[2022-01-07 15:40:35,525] {bash_operator.py:126} INFO - loading tb_saldos_b2c_iss.csv ok
[2022-01-07 15:40:35,535] {bash_operator.py:126} INFO - Traceback (most recent call last):
[2022-01-07 15:40:35,536] {bash_operator.py:126} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/mysql/connector/connection_cext.py", line 519, in cmd_query
[2022-01-07 15:40:35,537] {bash_operator.py:126} INFO - query_attrs=self._query_attrs)
[2022-01-07 15:40:35,537] {bash_operator.py:126} INFO - _mysql_connector.MySQLInterfaceError: LOAD DATA LOCAL INFILE file request rejected due to restrictions on access.
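The server is rejecting LOAD DATA LOCAL INFILE, which the MySQL load path relies on. A hedged avenue is to enable the client side of local-infile via connect_args; both driver flags below do exist, but whether d6tstack forwards connect_args to its internal engine is an assumption, and if the server itself has local_infile disabled a server-side change is unavoidable.

```python
import sqlalchemy

# mysql-connector-python: allow_local_infile=True enables the client half
# of LOAD DATA LOCAL INFILE.
engine = sqlalchemy.create_engine(
    'mysql+mysqlconnector://user:pass@host/db',
    connect_args={'allow_local_infile': True})

# Equivalent with pymysql, as asked about above:
engine2 = sqlalchemy.create_engine(
    'mysql+pymysql://user:pass@host/db',
    connect_args={'local_infile': True})
```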
Python 3.5.5 :: Anaconda, Inc. on Windows 7
d6tstack shows up in "pip list".
However, I get an error when importing, even from a Python console.
I do see .pyc files in 'Lib/site-packages/d6tstack/__pycache__/'.