d6t / d6tstack
Quickly ingest messy CSV and XLS files. Export to clean pandas, SQL, parquet
License: MIT License
Hello everyone,
I am having problems with the function d6tstack.utils.pd_to_mssql() when importing a DataFrame into a Microsoft SQL Server database. The name of the SQL table is fxxProduktKatalogDE. I read that the table name passed to the function should be written lowercase, and so I did, but I get an error saying the table does not exist. The table is there in the database. I also tried writing the table name exactly as it appears in the database, but the same error occurs.
Here are the code and the error.
Can you help me solve this issue?
Thank you in advance!
The function pd_to_mssql uses BULK INSERT, which requires the target file to exist on the same machine as the SQL Server instance. If you're running the Python code on a remote machine, like your own computer, pd_to_mssql will return the error "Cannot bulk load because the file "mysql.csv" could not be opened". I'm hopeful you can find a workaround, but I imagine you'll have to change your approach.
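One speculative workaround, sketched here as an assumption: the traceback quoted later on this page shows that pd_to_mssql accepts a tmpfile argument and runs BULK INSERT {table} FROM '{tmpfile}', so pointing tmpfile at a network share that the SQL Server service account can read might let the server find the file. The share path, df, and uri_mssql below are hypothetical placeholders.

```python
import d6tstack.utils

# Hypothetical UNC path on a share readable by the SQL Server service account.
tmp = r'\\fileserver\share\staging.csv'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'mytable', if_exists='replace', tmpfile=tmp)
```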
When I tried inserting I got an error like so:
Traceback (most recent call last):
File "results_sql_fasttext.py", line 284, in <module>
d6tstack.utils.pd_to_psql(aux_results, URI, 'aux_results', if_exists='append')
File "/usr/local/lib/python3.6/dist-packages/d6tcollect/__init__.py", line 58, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/d6tstack/utils.py", line 107, in pd_to_psql
cursor.copy_from(fbuf, table_name, sep=sep, null='')
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: "ANSVAR"
CONTEXT: COPY aux_results, line 1, column aux_id: "ANSVAR"
aux_id is the id created by the database - nothing should be inserted into it since it is auto-generated. "ANSVAR" is a string that should go into the fourth column, not the first. So it seems there are multiple issues at play.
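A hedged workaround sketch: COPY fills table columns positionally unless told otherwise, so issuing the COPY directly through psycopg2 with an explicit column list lets the database assign aux_id itself. The DSN and column names below are hypothetical placeholders.

```python
import io
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@host/db")  # hypothetical DSN
cur = conn.cursor()
buf = io.StringIO()
aux_results.to_csv(buf, index=False, header=False)
buf.seek(0)
# Name every column except the auto-generated aux_id (hypothetical names).
cur.copy_from(buf, 'aux_results', sep=',', null='',
              columns=('col_a', 'col_b', 'col_c', 'col_d'))
conn.commit()
cur.close()
conn.close()
```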
Hi,
I was trying to use the module 'd6tstack.utils' for mssql but am getting the error below:
module 'd6tstack.utils' has no attribute 'pd_to_mssql'
Any chance of adding this soon?
Regards,
Jai
At least disable by default.
I have a 200-million-row dataframe that I've sliced in-memory into chunks of 200,000 rows to insert gradually.
After running d6tstack.utils.pd_to_psql for 30 minutes or so (about the halfway mark), I get the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "...", port ... failed: FATAL: sorry, too many clients already
I added the following lines to d6tstack/utils.py and the error stopped reproducing:
sql_cnxn.close()
sql_engine.dispose()
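In context, the patch would sit at the tail of pd_to_psql; a minimal sketch, assuming the function ends roughly the way the tracebacks quoted elsewhere on this page suggest:

```python
cursor.copy_from(fbuf, table_name, sep=sep, null='')
sql_cnxn.commit()
cursor.close()
sql_cnxn.close()      # added: release the DBAPI connection
sql_engine.dispose()  # added: drop the SQLAlchemy pool so repeated calls
                      # don't accumulate open connections on the server
```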
Hello everybody,
great package, but I get a strange error message when I try the above-mentioned function with a SQLAlchemy engine connected via ODBC. See the code below:
```python
engine = create_engine(constring2)
engine.url
# mssql+pymssql://XXXXXX:***@azurexxxx
import d6tstack.utils as du
du.pd_to_mssql(I_df, engine, "t_abda_pacapo01_stage", if_exists="replace")
```
This results in the error message below:
argument of type 'Engine' is not iterable
Any ideas how I can solve this?
thanks and br
christoph
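A hedged guess at a fix: the "not iterable" error suggests pd_to_mssql performs a substring check on its uri argument, i.e. it expects a connection-string URI rather than an Engine. Passing the URL as a string may work; note that recent SQLAlchemy versions mask the password when stringifying a URL, so building the string by hand may be safer.

```python
import d6tstack.utils as du

# Assumption: pd_to_mssql builds its own engine internally from a URI string.
du.pd_to_mssql(I_df, str(engine.url), "t_abda_pacapo01_stage", if_exists="replace")
```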
I tried to install the package but didn't succeed, and I can't use it.
Would it be possible to set up the repo for installation via Anaconda?
I see that you changed the separator from tab. Is there something else?
Originally posted by @danielpoon in #4 (comment)
example code:
```python
uri_mssql = f'mssql+pymssql://{u}:{p}@{server}/{db}'
d6tstack.utils.pd_to_mssql(df=df, uri=uri_mssql, table_name='mytable', schema_name='myschema', if_exists='replace')
```
The error says that I don't have access to the schema 'dbo' when making the connection, which I assume is the default schema the connection tries to use when schema_name is set to None.
Line 69 in 155e9b9
The method pd_to_psql has a default value of None for the schema_name parameter, but the last update, v0.1.7, added a lower-casing call on that parameter. So execution breaks on a logical conflict.
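A minimal sketch of the guard that would resolve the conflict, assuming v0.1.7 calls .lower() on schema_name unconditionally:

```python
if schema_name is not None:
    schema_name = schema_name.lower()
```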
The installation instructions are written for pip; sadly, on my system plain pip resolves Python 2 dependencies only. I think changing all instances of pip to pip3 would fix this issue.
I have searched but had no luck; I may be missing the right terms. I have a pre-existing table with a primary key:
```python
class District(Base):
    __tablename__ = 'districts'
    id = Column(Integer, primary_key=True, autoincrement=True)
    district_county = Column(String(30))
    district_type = Column(String(100))
    district_name = Column(String(100))
    file_date = Column(Date)
```
I process a bunch of CSV files with pandas and then pass the result to a function:
```python
def _write_df(self, df, engine, table_name) -> None:
    # df.to_sql(table_name, engine, if_exists='append', index=False, chunksize=100000)
    # add null id column
    # df.insert(0, 'id', '')
    d6tstack.utils.pd_to_psql(df, str(engine.url), table_name, if_exists='append', sep='\t')
    return None
```
But I get this error,
File "/usr/local/lib/python3.8/site-packages/d6tstack/utils.py", line 107, in pd_to_psql
cursor.copy_from(fbuf, table_name, sep=sep, null='')
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: "BAKER"
CONTEXT: COPY districts, line 1, column id: "BAKER"
I tried adding an 'id' column with df.insert(0, 'id', ''), but then I get this error:
psycopg2.errors.NotNullViolation: null value in column "id" of relation "districts" violates not-null constraint
What am I doing wrong? With pandas.to_sql I don't need an 'id' column; I just pass index=False in the call.
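One hedged workaround, since pd_to_psql COPYs into all table columns positionally: bypass it for tables with auto-generated keys and issue the COPY yourself with an explicit column list, so Postgres assigns id. A sketch, assuming a plain postgresql:// DSN and the column names from the model above:

```python
import io
import psycopg2

def write_df(df, dsn, table_name):
    conn = psycopg2.connect(dsn)  # e.g. 'postgresql://user:pass@host/db'
    cur = conn.cursor()
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # Name only the data columns; the serial id is then filled in by Postgres.
    cur.copy_expert(
        "COPY districts (district_county, district_type, district_name, file_date) "
        "FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()
    cur.close()
    conn.close()
```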
Hi,
Whenever I dump data using d6tstack, it appends a \n (newline escape character) to the last column of each row. Any suggestions on how I can tackle such situations?
Any response would be highly appreciated.
Thanks
Allow setting the encoding when writing the dataframe to CSV, before loading the infile. Currently it raises an error when the CSV contains Unicode characters.
Hi,
there is a spelling error.
File "C:\Python37\lib\site-packages\d6tstack\utils.py", line 65, in pd_readsql_table_from_sqlengine
return pd_readsql_query_from_sqlengine(uri, "SELECT * FROM {};".fromat(table_name), schema_name=schema_name, connect_args=connect_args)
AttributeError: 'str' object has no attribute 'fromat'
Using a server name with a \ in it:
```python
# server name is someserver123\data
# database name = Comissions
# no user name and password because Trusted_Connection = yes
import pandas as pd
import d6tstack.utils

my_file = "myfile.csv"  # absolute file path
csv_file = pd.read_csv(my_file, iterator=True, chunksize=20000, low_memory=False, dtype=str)
for i, chunk in enumerate(csv_file):
    d6tstack.utils.pd_to_mssql(chunk, 'mssql+pymssql://someserver123\data/Comissions', 'dbo.mydb', if_exists="append")
```
will yield this error:
UserWarning: .pd_to_mssql() is experimental, if any problems please raise an issue on https://github.com/d6t/d6tstack/issues or make a pull request
warnings.warn('.pd_to_mssql() is experimental, if any problems please raise an issue on https://github.com/d6t/d6tstack/issues or make a pull request')
Traceback (most recent call last):
File "src\pymssql.pyx", line 636, in pymssql.connect
File "src_mssql.pyx", line 1957, in _mssql.connect
File "src_mssql.pyx", line 677, in _mssql.MSSQLConnection.init
_mssql.MSSQLDriverException: Connection to the database failed for an unknown reason.
This error is familiar from pyodbc; to fix it there I use urllib and add a parameter. I don't know how to pull this off with d6tstack.
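For reference, the pyodbc-style workaround alluded to above looks roughly like this; the driver name is an assumption, and whether the resulting URI can be handed to d6tstack's pd_to_mssql is untested:

```python
import urllib.parse
import sqlalchemy

# Quote the raw ODBC connection string so the backslash in the server name
# never reaches SQLAlchemy's host parser.
params = urllib.parse.quote_plus(
    r"DRIVER={ODBC Driver 17 for SQL Server};"
    r"SERVER=someserver123\data;"
    r"DATABASE=Comissions;"
    r"Trusted_Connection=yes;"
)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=" + params)
```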
Hi,
I am wondering whether it is possible to write the file in chunks, because when I try to handle a large file my server runs out of memory, though your library works outstandingly for small files. I am using the following method to write a file to the database:
d6tstack.utils.pd_to_mysql(meta_dataframe, 'mysql+mysqlconnector://abc:[email protected]/abcdb', meta_filename)
How can I handle large files? Any suggestions would be great.
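A hedged sketch of chunked loading, mirroring the read_csv chunking shown in another issue above; whether pd_to_mysql accepts if_exists like the other pd_to_* helpers is an assumption, and the credentials and names are placeholders:

```python
import pandas as pd
import d6tstack.utils

uri = 'mysql+mysqlconnector://abc:password@host/abcdb'  # hypothetical credentials
reader = pd.read_csv('meta.csv', chunksize=100_000)     # stream instead of loading all at once
for i, chunk in enumerate(reader):
    d6tstack.utils.pd_to_mysql(chunk, uri, 'meta_table',
                               if_exists='replace' if i == 0 else 'append')
```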
Hi,
First of all, thank you very much for such a wonderful library.
I am facing a problem while dumping data into a MySQL database.
NA and blank values end up as either NULL or blank, seemingly at random, in the database.
For example, if a value in the CSV is 'null' or 'NA', then after dumping the CSV into the database that value becomes 'NA' or blank.
Can you please address this problem?
In the file /usr/local/lib/python3.7/site-packages/d6tstack/utils.py, the function
def pd_to_psql(df, uri, tablename, schema_name=None, if_exists='fail'):
takes schema_name but never applies it when connecting, so it throws an error when you need to use a schema. I fixed it via:
sql_engine = sqlalchemy.create_engine(uri, connect_args={'options': '-csearch_path={}'.format(schema_name)})
Please make the change in your code. Thanks.
Regards,
Cody Kirrabilli
Hi, I'm trying to import a CSV to an MSSQL server and I get the following error:
creating mysql.csv ok
loading mysql.csv ok
---------------------------------------------------------------------------
MSSQLDatabaseException Traceback (most recent call last)
src/pymssql.pyx in pymssql.Cursor.execute()
src/_mssql.pyx in _mssql.MSSQLConnection.execute_query()
src/_mssql.pyx in _mssql.MSSQLConnection.execute_query()
src/_mssql.pyx in _mssql.MSSQLConnection.format_and_run_query()
src/_mssql.pyx in _mssql.check_cancel_and_raise()
src/_mssql.pyx in _mssql.maybe_raise_MSSQLDatabaseException()
MSSQLDatabaseException: (4860, b'Cannot bulk load. The file "mysql.csv" does not exist.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')
During handling of the above exception, another exception occurred:
OperationalError Traceback (most recent call last)
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1245 self.dialect.do_execute(
-> 1246 cursor, statement, parameters, context
1247 )
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
580 def do_execute(self, cursor, statement, parameters, context=None):
--> 581 cursor.execute(statement, parameters)
582
src/pymssql.pyx in pymssql.Cursor.execute()
OperationalError: (4860, b'Cannot bulk load. The file "mysql.csv" does not exist.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')
The above exception was the direct cause of the following exception:
OperationalError Traceback (most recent call last)
<ipython-input-11-267ba832f344> in <module>
5 #uri_mssql = f'mysql+mysqlconnector://{user}:{passW}@{host}:1433/{db}'
6
----> 7 d6tstack.utils.pd_to_mssql(df, uri_mssql, 'prueba', if_exists='replace') # experimental
~/codigos/thomson/myenv/lib/python3.6/site-packages/d6tcollect/__init__.py in wrapper(*args, **kwargs)
78 payload['exceptionMsg'] = str(e)
79 _submit(payload)
---> 80 raise e
81
82 return wrapper
~/codigos/thomson/myenv/lib/python3.6/site-packages/d6tcollect/__init__.py in wrapper(*args, **kwargs)
72 _submit(payload)
73 try:
---> 74 return func(*args, **kwargs)
75 except Exception as e:
76 payload['event'] = 'exception'
~/codigos/thomson/myenv/lib/python3.6/site-packages/d6tstack/utils.py in pd_to_mssql(df, uri, table_name, schema_name, if_exists, tmpfile)
188 table_name = '{}.{}'.format(schema_name,table_name)
189 sql_load = "BULK INSERT {} FROM '{}';".format(table_name, tmpfile)
--> 190 sql_engine.execute(sql_load)
191
192 os.remove(tmpfile)
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, statement, *multiparams, **params)
2180
2181 connection = self._contextual_connect(close_with_result=True)
-> 2182 return connection.execute(statement, *multiparams, **params)
2183
2184 def scalar(self, statement, *multiparams, **params):
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, object_, *multiparams, **params)
974 """
975 if isinstance(object_, util.string_types[0]):
--> 976 return self._execute_text(object_, multiparams, params)
977 try:
978 meth = object_._execute_on_connection
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_text(self, statement, multiparams, params)
1147 parameters,
1148 statement,
-> 1149 parameters,
1150 )
1151 if self._has_events or self.engine._has_events:
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1248 except BaseException as e:
1249 self._handle_dbapi_exception(
-> 1250 e, statement, parameters, cursor, context
1251 )
1252
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1474 util.raise_from_cause(newraise, exc_info)
1475 elif should_wrap:
-> 1476 util.raise_from_cause(sqlalchemy_exception, exc_info)
1477 else:
1478 util.reraise(*exc_info)
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
396 exc_type, exc_value, exc_tb = exc_info
397 cause = exc_value if exc_value is not exception else None
--> 398 reraise(type(exception), exception, tb=exc_tb, cause=cause)
399
400
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
150 value.__cause__ = cause
151 if value.__traceback__ is not tb:
--> 152 raise value.with_traceback(tb)
153 raise value
154
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1244 if not evt_handled:
1245 self.dialect.do_execute(
-> 1246 cursor, statement, parameters, context
1247 )
1248 except BaseException as e:
~/codigos/thomson/myenv/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
579
580 def do_execute(self, cursor, statement, parameters, context=None):
--> 581 cursor.execute(statement, parameters)
582
583 def do_execute_no_params(self, cursor, statement, context=None):
src/pymssql.pyx in pymssql.Cursor.execute()
OperationalError: (pymssql.OperationalError) (4860, b'Cannot bulk load. The file "mysql.csv" does not exist.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')
[SQL: BULK INSERT prueba FROM 'mysql.csv';]
(Background on this error at: http://sqlalche.me/e/e3q8)
Any possibility of being able to use d6tstack with 'pg8000'?
I have CSV files and I'm trying to reconcile the differing columns between them.
The function d6tstack.combine_csv.CombinerCSV works fine in other cases, but now I have problems with the encoding.
I need to pass the parameter encoding="latin-1" to pandas read_csv, but it doesn't work.
In the documentation I found that it is possible to pass read_csv_params={"encoding": "latin-1"}, but that doesn't work either; the encoding isn't applied.
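A minimal repro sketch of what was tried, with hypothetical file names:

```python
import d6tstack.combine_csv

c = d6tstack.combine_csv.CombinerCSV(
    ['data1.csv', 'data2.csv'],                 # hypothetical latin-1 encoded files
    read_csv_params={'encoding': 'latin-1'})
df = c.to_pandas()  # reportedly fails: the encoding parameter is not applied
```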
Reading the same file directly with pandas and the encoding parameter works fine.
Is there another possibility to resolve this?
Thanks in advance.
Hi,
I'm using:
Python : 3.7
openpyxl : 2.6.2 (Last Stable)
d6tstack : 0.1.8
Traceback (most recent call last):
File "/home/kedf/PycharmProjects/XLStoSQL/xlsxtosql.py", line 1, in <module>
import d6tstack.convert_xls
File "/usr/local/lib/python3.7/site-packages/d6tstack/convert_xls.py", line 12, in <module>
from openpyxl.utils import coordinate_from_string
ImportError: cannot import name 'coordinate_from_string' from 'openpyxl.utils' (/usr/local/lib/python3.7/site-packages/openpyxl/utils/__init__.py)
In the docs of openpyxl 2.6.2, the function seems to be named openpyxl.utils.cell.coordinate_from_string:
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.utils.cell.html#openpyxl.utils.cell.coordinate_from_string
So maybe changing line 12 of d6tstack/convert_xls.py from
from openpyxl.utils import coordinate_from_string
to
from openpyxl.utils.cell import coordinate_from_string
will fix it?
Thanks
The library works well with data columns that don't contain commas, but if a value contains a comma, e.g. a name like "Smith, John", and you then try pd_to_psql, you get the error below.
Any thoughts?
DataError Traceback (most recent call last)
in
----> 8 d6tstack.utils.pd_to_psql(df, fast_engine, 'xxx', 'xxx', if_exists='append')
9
10
/usr/local/lib/python3.7/site-packages/d6tstack/utils.py in pd_to_psql(df, uri, tablename, schema_name, if_exists)
122 df.to_csv(fbuf, index=False, header=False)
123 fbuf.seek(0)
--> 124 cursor.copy_from(fbuf, tablename, sep=',', null='')
125 sql_cnxn.commit()
126 cursor.close()
DataError: extra data after last expected column
CONTEXT: COPY xxxx, line 1:
Providing an example to replicate this issue.
Consider the following pandas DataFrame called my_data:

| id | coordinates  |
|----|--------------|
| 01 | [50.1, 68.2] |
| 02 | [52.2, 67.9] |
and consider the following PostgreSQL table definition:

```sql
CREATE TABLE coordinates (
  id int2,
  coordinates int8range
);
```
Now, even if you cast this dataframe column to the type object
my_data['coordinates'] = my_data['coordinates'].astype(object)
the following insert operation throws an error:
d6tstack.utils.pd_to_psql(my_data, db, table, schema_name=schema, if_exists='replace')
The error log is:
BadCopyFileFormat: extra data after last expected column
Presumably the comma inside the serialized list "[50.1, 68.2]" is read by COPY as a column separator, as in the issue above. The same insert operation works with the dataframe.to_sql functionality provided by pandas.
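A hedged workaround for both reports above: COPY with a plain comma separator doesn't understand CSV quoting, so commas embedded in values spill into extra columns. Later tracebacks on this page show pd_to_psql accepting a sep parameter; switching it to a tab avoids the clash, assuming no values contain tabs:

```python
import d6tstack.utils

# Tab-separate the staging data so embedded commas survive the COPY.
d6tstack.utils.pd_to_psql(df, uri, 'mytable', if_exists='append', sep='\t')
```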
pd_readsql_table_from_sqlengine errors due to doing "SELECT * FROM {};".fromat; presumably this should be format rather than fromat.
I've been using d6tstack for months and have never had any problems, but recently the error message below appeared. What can I do to solve this without having to change the variables or permissions of my database?
I think it's something connected to mysqlconnector. Is there another driver I can use instead, for example pymysql?
[2022-01-07 15:40:35,524] {bash_operator.py:126} INFO - [2022-01-07 15:40:31] Insert - SaldosB2C IS
[2022-01-07 15:40:35,524] {bash_operator.py:126} INFO - creating tb_saldos_b2c_iss.csv ok
[2022-01-07 15:40:35,525] {bash_operator.py:126} INFO - loading tb_saldos_b2c_iss.csv ok
[2022-01-07 15:40:35,535] {bash_operator.py:126} INFO - Traceback (most recent call last):
[2022-01-07 15:40:35,536] {bash_operator.py:126} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/mysql/connector/connection_cext.py", line 519, in cmd_query
[2022-01-07 15:40:35,537] {bash_operator.py:126} INFO - query_attrs=self._query_attrs)
[2022-01-07 15:40:35,537] {bash_operator.py:126} INFO - _mysql_connector.MySQLInterfaceError: LOAD DATA LOCAL INFILE file request rejected due to restrictions on access.
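The server is rejecting LOAD DATA LOCAL INFILE, which the MySQL load path relies on. A hedged avenue is to enable the client side of local-infile via connect_args; both driver flags below do exist, but whether d6tstack forwards connect_args to its internal engine is an assumption, and if the server itself has local_infile disabled a server-side change is unavoidable.

```python
import sqlalchemy

# mysql-connector-python: allow_local_infile=True enables the client half
# of LOAD DATA LOCAL INFILE.
engine = sqlalchemy.create_engine(
    'mysql+mysqlconnector://user:pass@host/db',
    connect_args={'allow_local_infile': True})

# Equivalent with pymysql, as asked about above:
engine2 = sqlalchemy.create_engine(
    'mysql+pymysql://user:pass@host/db',
    connect_args={'local_infile': True})
```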
Python 3.5.5 :: Anaconda, Inc. on Windows 7
d6tstack shows up in "pip list".
However, I get an error when importing, even from a Python console.
I do see .pyc files in 'Lib/site-packages/d6tstack/__pycache__/'.