Comments (9)
It looks like pandas.read_csv() also has a converters option that behaves the same as the one for numpy.loadtxt(). I am trying to figure out whether using it forces a less efficient method under the hood.
UPDATE
Looking at the low-level source for pandas.read_csv(), I found that it has three different parsers for floats. The source for the parsers tells me that the default parser pandas.read_csv() uses is actually pretty similar to what fastnumbers uses for floats, so you really don't need to insert fastnumbers as middleware: the pandas devs are already being smart about their parsing and use a very fast algorithm (though at the expense of some precision). The same is true for ints.
from fastnumbers.
Since floating point numbers are often used in large datasets and that is where speed is of the utmost importance, would it make sense to integrate this as a drop-in replacement for use in numpy and as a result, also in Pandas?
While I like the sound of this, I am not sure it is possible in practice. numpy and pandas store floats and ints in memory as raw C datatypes, while a standard Python float or int is actually a PyObject under the hood. fastnumbers works by converting its input to PyObjects that contain either float or int data. At this time, if fastnumbers were somehow inserted into the numpy or pandas parser, it would actually slow parsing down, because the parser would have to "unpack" each PyObject before storing its value.
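The difference between boxed Python objects and raw C values can be seen with the standard library alone. The sketch below uses the stdlib array module as a stand-in for a numeric buffer; that is an assumption made purely for illustration and says nothing about how numpy or pandas are implemented internally.

```python
import sys
from array import array

values = [1.5, 2.5, 3.5]

# A list holds references to boxed float objects (PyObjects).
boxed = list(values)

# array("d") stores raw C doubles contiguously, similar in spirit
# to how a numeric array library keeps its data (illustrative stand-in).
raw = array("d", values)

# Size in bytes of one raw element in the buffer.
print(raw.itemsize)

# Size in bytes of one boxed float, which includes object overhead
# in addition to the value itself.
print(sys.getsizeof(boxed[0]))

# Storing a Python float in the raw buffer requires unpacking the
# PyObject into a C double first.
raw.append(float("4.5"))
```

A converter that returns Python floats therefore adds a boxing step and an unboxing step for every value that ends up in a raw buffer.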
Hypothetically, let's say this were not true and numpy or pandas could benefit from the fastnumbers algorithm. I still don't believe it could be used as a drop-in. I imagine that the parsers in those libraries are written in highly optimized C code, so one would probably have to recompile the libraries to change the parsing scheme.
Does it also speed up float comparisons?
No, once fastnumbers has converted the input to a float, its work is done. The output is just a standard Python float, so it cannot really modify comparisons (which, I imagine, are already as fast as possible).
What do you think?
I like the idea (I have even toyed with the idea of trying to get this algorithm added to the Python core), but I think it would have to be taken up with the pandas or numpy devs to get it to work, rather than trying to hack fastnumbers to insert itself as middleware into those modules.
In order to answer that, it will help me to know what your understanding is of what fastnumbers does. I ask because, based on the question about speeding up floating point comparisons and the comment about formatting floats, I am afraid there may be some confusion about the intent, functionality, and scope of this library.
This module is primarily for fast conversion to numbers with error handling. That is the main reason I wrote it; see the Examples in the README for where the real benefit comes from. I wanted that functionality to help improve the performance of natsort. Additionally, check out the timing comparisons, and you will see that the real performance benefits come from the error handling part of the module.
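For context, "conversion with error handling" refers to a pattern that looks roughly like the following when written in pure Python. This is only a sketch of the slow baseline using built-ins; the helper name to_float_or_input is hypothetical and is not part of the fastnumbers API.

```python
def to_float_or_input(x):
    """Return float(x) if x can be converted, otherwise return x unchanged."""
    try:
        return float(x)
    except (ValueError, TypeError):
        # Conversion failed: hand the original input back to the caller.
        return x


tokens = ["3.14", "42", "abc", "1e3"]
converted = [to_float_or_input(t) for t in tokens]
print(converted)
```

Raising and catching an exception for every non-numeric token is the expensive part of this pattern, which is why the error handling case is where a faster implementation matters most.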
The drop-in float() and int() functions were added after fastnumbers had already existed for some time. I figured that since the code already existed I should expose it to users, and those who find they need a speed-up without the error handling can use it.
Having said this, it looks like numpy.loadtxt() has a converters option, which lets you as the user provide the function you want to use to convert a given column. So, one can certainly provide fastnumbers.float or fastnumbers.int to numpy.loadtxt() to improve parsing performance. I will say that I am shocked that numpy.loadtxt() first loads the data into memory as Python objects and only converts them to C datatypes after loading. That is NOT efficient, but they probably did it that way so users could have the flexibility of using the converters argument.
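A minimal sketch of passing a converter to numpy.loadtxt() might look like this. The sample data is made up, and the fallback to the built-in float is an assumption added only so the snippet still runs when fastnumbers is not installed.

```python
import io

import numpy as np

try:
    # fastnumbers.float is the drop-in replacement for the built-in float.
    from fastnumbers import float as to_float
except ImportError:
    # Fallback (assumption for this sketch): use the built-in float.
    to_float = float

# Two whitespace-separated columns of made-up sample data.
data = io.StringIO("1.5 2\n3.25 4\n")

# converters maps a column index to the callable used for that column.
arr = np.loadtxt(data, converters={0: to_float, 1: to_float})
print(arr)
```

Whether this is actually faster than the default parsing should be measured on real data, since every converted value still passes through a Python object before it is stored.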
Do you think the converters option or the default read_csv behavior addresses your concern, or do you still believe some action should be taken?