Comments (8)
yield from can be replaced by

    data = DBF(...)
    for el in data:
        yield el
from dbfread.
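For what it's worth, the two forms are interchangeable. A minimal sketch showing the equivalence (fake_dbf is a hypothetical stand-in for a dbfread.DBF iterable, since no real DBF file is assumed here):

```python
# Sketch: 'yield from' vs. the explicit loop that also works on Python 2.
# fake_dbf is a made-up stand-in for iterating a dbfread.DBF object.
def fake_dbf():
    return iter([{'id': 1}, {'id': 2}])

def with_yield_from():
    yield from fake_dbf()

def with_loop():
    data = fake_dbf()
    for el in data:
        yield el

# Both generators produce identical records.
assert list(with_yield_from()) == list(with_loop())
```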
dbfread author here.
I guess the main reason dbf2csv is faster is that it's written in C++. A lot of the operations in dbfread are fiddling with little bits of binary data which would be a lot faster in a language closer to the hardware.
That said I'm sure dbfread could be faster and I'd be interested in any suggestions you might have.
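To illustrate the kind of work involved, here is a sketch of the small binary unpacks a DBF parser performs (the 32-byte header bytes below are fabricated for the example; this is not dbfread's actual code). In C++ the same fields can be read with a single pointer cast, which is far cheaper than repeated struct calls in Python:

```python
import struct

# Fabricated dBase III header: version 0x03, last-update date 2023-05-01,
# 1000 records, 97-byte header, 80-byte records, zero-padded to 32 bytes.
header = bytes([0x03, 123, 5, 1]) + struct.pack('<IHH', 1000, 97, 80) + bytes(20)

# Parsing means many little unpacks like these, one per field:
version, yy, mm, dd = struct.unpack('<4B', header[:4])
numrecords, headerlen, recordlen = struct.unpack('<IHH', header[4:12])

print(numrecords, headerlen, recordlen)  # 1000 97 80
```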
@kokes load=True calls the same code as load=False, so I wouldn't expect it to be any faster. (It just calls the same generator and wraps the result in a list.) It should even be a little slower, since it does a second pass to load deleted records:
    def load(self):
        if not self.loaded:
            self._records = list(self._iter_records(b' '))
            self._deleted = list(self._iter_records(b'*'))
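In other words, eager loading just drains the same generator into lists. A toy sketch of that shape (iter_records below is a made-up stand-in, not dbfread's actual internals):

```python
# Made-up stand-in for dbfread's record generator: yields records whose
# deletion marker matches the one requested.
def iter_records(marker):
    source = [(b' ', {'id': 1}), (b'*', {'id': 2}), (b' ', {'id': 3})]
    for m, rec in source:
        if m == marker:
            yield rec

# What load() does: drain the generator twice, once per marker.
records = list(iter_records(b' '))   # stored as _records
deleted = list(iter_records(b'*'))   # second pass, stored as _deleted
```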
you're welcome.
Can you benchmark the imports alone? I presume much of your benchmark is spent importing pandas, which is quite a heavy dependency (importing it takes 0.7s, warm, on my machine).
Plus there's a bit of overhead from the non-lazy DataFrame creation, but it's not terrible at 25% using my dummy dataset (1.1 MB, 20 columns, strings and ints).
Edit: I have now tried it with a larger file (40 MB) to see if the lazy/eager approach makes a larger difference, and indeed it does: 45% extra runtime under pandas.
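One cheap way to measure import cost in isolation is a sketch like the following ('json' stands in for pandas so the snippet runs without heavy dependencies; substitute 'pandas' to reproduce the figure above; note this only re-runs the top-level import, so cached submodules make it an underestimate):

```python
import importlib
import sys
import time

def import_cost(name):
    # Drop any cached copy so the top-level import is actually re-executed.
    sys.modules.pop(name, None)
    t0 = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - t0

print(f"import json: {import_cost('json'):.4f}s")
```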
@kokes I ran the following, which only imports Pandas and dbfread once and doesn't re-execute Python during each iteration. This wasn't run on Python 2.7.12 due to yield from being used. It was run on Python 3.5 with dbfread 2.0.7 and Pandas 0.24.2. It brought it down to 11m30s. It's a solid improvement.
    $ vi run_3.py

    from dbfread import DBF
    import pandas as pd

    def get_dbf():
        yield from DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')

    for _ in range(0, 1000):
        pd.DataFrame(get_dbf()).to_csv('FILE.csv', index=False, encoding='utf-8', mode='w')

    $ time python3 run_3.py # 11m30.328s
Any more ideas of how we could get this closer to 74 seconds?
Any more ideas of how we could get this closer to 74 seconds?
It comes down to what the limits are. I tried running the three obvious pieces of code:
- Pandas materialization and csv export (34.4 seconds)
- Streaming into a CSV file using the standard csv package (25.4 seconds)
- Streaming through the rows and doing nothing (21.6 seconds)
(I also tried the eager mode in dbfread itself, using load=True, but got pretty much identical results.)
This goes to show that the major overhead is in this library itself; unless that changes, you won't get below the ~21-second baseline. So if you really want to go faster, I wouldn't focus on the CSV writer, I'd look at the parsing itself.
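The two streaming variants can be sketched like this (a dummy generator stands in for dbfread's row iteration, and the pandas variant is omitted to keep the snippet dependency-free; timings are illustrative, not the ones quoted above):

```python
import csv
import io
import time

def rows(n=50_000):
    # Stand-in for iterating a dbfread.DBF object (one dict per record).
    for i in range(n):
        yield {'id': i, 'name': f'row{i}'}

# 1) Stream through the rows and do nothing: the parsing-only floor.
t0 = time.perf_counter()
for _ in rows():
    pass
parse_only = time.perf_counter() - t0

# 2) Stream into CSV with the stdlib csv module, never materializing.
t0 = time.perf_counter()
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['id', 'name'])
writer.writeheader()
writer.writerows(rows())
stream_csv = time.perf_counter() - t0

print(f'iterate only: {parse_only:.3f}s, csv streaming: {stream_csv:.3f}s')
```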
@kokes I ran the following which only imports Pandas and dbfread once and doesn't re-execute Python during each iteration. [...] Any more ideas of how we could get this closer to 74 seconds?
In the loop you use to save the CSV, you are instantiating a new DataFrame object at every iteration; this definitely adds up in wasted time. You don't need pandas here. When you iterate over a DBF object, you get an ordered dict, and you can write those straight to the CSV file using csv.DictWriter as shown here. On an SSD I am getting ~8k lines written per second, and that's while writing to a gzipped CSV, which adds the overhead of compression.
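A sketch of that approach, streaming dicts straight into a gzipped CSV (the two hard-coded records and the temp-file path stand in for iterating a real DBF object):

```python
import csv
import gzip
import os
import tempfile

def records():
    # Stand-in for iterating dbfread.DBF(...), which yields one dict per row.
    yield {'id': 1, 'name': 'a'}
    yield {'id': 2, 'name': 'b'}

path = os.path.join(tempfile.gettempdir(), 'out.csv.gz')

# No DataFrame in between: each dict goes straight to the compressed file.
with gzip.open(path, 'wt', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'name'])
    writer.writeheader()
    writer.writerows(records())
```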
In the loop you use to save the CSV, you are instantiating a new DataFrame object at every iteration; this definitely adds up in wasted time. [...]
We've already discussed this above: #38 (comment)
And since we know how fast the parsing itself is (by not writing anything, in the same post), we know the perf ceiling of this library - and as the author notes, the difference against the other library is primarily due to the language used.
I don't think there's much to add here.