Comments (9)

SethMMorton commented on May 28, 2024

It looks like pandas.read_csv() also has a converters option that behaves the same as the one for numpy.loadtxt(). I am trying to figure out whether using it makes pandas fall back to a less efficient method under the hood.


UPDATE

Looking at the low-level source for pandas.read_csv(), I found that it has three different parsers for floats. Judging from that source, the default parser is quite similar to what fastnumbers uses for floats. So you really don't need to insert fastnumbers as middleware: the pandas devs are already being smart about their parsing and using a very fast algorithm (though at the expense of some precision). The same is true for ints.

from fastnumbers.

SethMMorton commented on May 28, 2024

Since floating point numbers are often used in large datasets and that is where speed is of the utmost importance, would it make sense to integrate this as a drop-in replacement for use in numpy and as a result, also in Pandas?

While I like the sound of this, I am not really sure it is possible in practice. numpy and pandas store floats and ints in memory as raw C datatypes, while a standard Python float or int is actually a PyObject under the hood. fastnumbers works by converting its input to PyObjects that contain either float or int data. At this time, if fastnumbers were somehow inserted into the numpy or pandas parser, it would actually slow the parser down, because each PyObject would have to be "unpacked" before its value could be stored.
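
To make the PyObject-versus-raw-C distinction concrete, here is a small stdlib-only sketch. It uses the standard `array` module as a stand-in for the kind of raw C storage numpy and pandas use; that substitution is an assumption made for illustration and is not taken from either library's source.

```python
import sys
from array import array

# A list holds references to full Python float objects (PyObjects).
boxed = [1.5, 2.5, 3.5]

# array("d") stores the same values as contiguous raw C doubles.
raw = array("d", boxed)

# Size of one float object versus one raw C double, in bytes.
per_object_bytes = sys.getsizeof(boxed[0])
per_raw_bytes = raw.itemsize

# Storing a Python float into the array unpacks it to a C double;
# reading an element back builds a new Python float object.
round_tripped = raw.tolist()
```

On CPython the float object is larger than the 8-byte double it wraps, which is the overhead a parser would pay if it produced PyObjects first and unpacked them afterwards.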

Hypothetically, let's say this were not true and numpy or pandas could benefit from the fastnumbers algorithm - I don't believe it could be used as a drop-in. I imagine that the parsers from those libraries are written in highly-optimized C code, so one would probably have to recompile the libraries to change the parsing scheme.

Does it also speed up float comparisons?

No, once fastnumbers has converted the input to a float its work is done. The output is just a standard Python float, so it cannot really modify comparisons (which, I imagine, are already as fast as possible).

What do you think?

I like the idea (I have even toyed with the idea of trying to get this algorithm added to the Python core), but I think that it would have to be taken up with the pandas or numpy devs to get it to work, rather than trying to hack fastnumbers to insert itself as middleware into those modules.

rswgnu commented on May 28, 2024

SethMMorton commented on May 28, 2024

To answer that, it would help me to know your understanding of what fastnumbers does. I ask because, based on the question about speeding up floating point comparisons and the comment about formatting floats, I am afraid there may be some confusion about the intent, functionality, and scope of this library.

rswgnu commented on May 28, 2024

SethMMorton commented on May 28, 2024

This module is primarily for fast conversion to numbers with error handling. That is the main reason I wrote it; see the Examples in the README for where the real benefit comes from. I wanted that functionality to help improve the performance of natsort. Additionally, check out the timing comparisons, and you will see that the biggest performance gains come from the error-handling part of the module.
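
As a rough illustration of what "conversion with error handling" means, the sketch below shows the pure-Python pattern that this kind of function replaces. The helper name `float_or_input` and its `default` parameter are hypothetical, chosen for this example; they are not the fastnumbers API, and the library's own functions are implemented in C rather than with try/except.

```python
_MISSING = object()  # sentinel so that None can still be passed as a default


def float_or_input(value, default=_MISSING):
    """Return float(value); on failure return `default`, or the input unchanged.

    Hypothetical pure-Python helper, for illustration only.
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        return value if default is _MISSING else default


tokens = ["3.14", "abc", "1e3", ""]

# Invalid inputs are passed through unchanged instead of raising.
converted = [float_or_input(t) for t in tokens]

# Alternatively, substitute a fixed value for anything that fails to parse.
with_default = [float_or_input(t, default=0.0) for t in tokens]
```

The cost of the exception path in a pattern like this is what the timing comparisons mentioned above are about.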

The drop-in float() and int() functions were added after fastnumbers had already existed for some time. I figured that since the code already existed I should expose it to users, so that those who need a speedup without the error handling can use it.

SethMMorton commented on May 28, 2024

Having said this, it looks like numpy.loadtxt() has a converters option, which lets you as the user provide the function you want to use to convert a given column. So, one can certainly provide fastnumbers.float or fastnumbers.int to numpy.loadtxt() to improve parsing performance. I will say that I am shocked that numpy.loadtxt() first loads the data into memory as Python objects, then converts them to C datatypes once everything is loaded. That is NOT efficient, but they probably did it that way so users could have the flexibility of the converters argument.
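
A minimal sketch of the converters approach described above, assuming numpy is installed. It falls back to the builtin float when fastnumbers is not available, and the two-column sample data is made up for illustration. Whether this is actually faster than the default parser depends on the numpy version and the data, so it should be measured rather than assumed.

```python
import io

import numpy as np

try:
    # The drop-in float() replacement discussed in this thread.
    from fastnumbers import float as to_float
except ImportError:
    # Fallback so the sketch still runs without fastnumbers installed.
    to_float = float

# Made-up sample data: two comma-separated numeric columns.
text = io.StringIO("1.5,2\n3.25,4\n")

# converters maps a column index to the function used to parse that column.
data = np.loadtxt(text, delimiter=",", converters={0: to_float, 1: to_float})
```

The result is an ordinary float64 array, so nothing downstream needs to know which conversion function was used.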

SethMMorton commented on May 28, 2024

Do you think the converters option or the default read_csv behavior addresses your concern, or do you still believe some action should be taken?

rswgnu commented on May 28, 2024
