
Comments (3)

deads commented on July 23, 2024

First, there needs to be enough I/O bandwidth for a multi-core, multi-threaded approach to pay off. This can be achieved in a number of ways, one of which is to combine several SSDs in a RAID configuration. With one thread, the workload is CPU-bound. As you add threads, more of the available I/O bandwidth is consumed, up to the point where either all the cores are saturated or most of the I/O bandwidth is used. Mac laptops are pretty limited in both parallelism and SSD throughput.

Second, Pandas does not need a reduce step, so with a small number of cores the cost of paratext's reduce step cannot easily be made up by increases in throughput.

Third, there is a lot of text data in that file. Since we internally use paratext for ML datasets, it treats string fields as categorical by default. paratext maintains a hash table of unique strings to integers for each categorical column. This can slow things down if you have a large column of unique strings. You can override this setting by using the text_names parameter.

paratext.load_csv_to_dict("j.csv", text_names=["col0","col1","col2",...])

Alternatively, you can set the maximum number of levels in a categorical column with the max_levels keyword argument:

paratext.load_csv_to_dict("j.csv", max_levels=0)
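
To illustrate why a column of mostly unique strings is expensive to treat as categorical, here is a minimal sketch of a string-to-integer table in plain Python. It is not paratext's actual implementation; the function name `encode_categorical` and the sample values are made up for illustration.

```python
def encode_categorical(values):
    """Map each distinct string to a small integer code, in first-seen order."""
    levels = {}  # hash table: string -> integer code
    codes = []
    for value in values:
        code = levels.get(value)
        if code is None:
            # First time this string is seen: the table grows by one entry.
            code = len(levels)
            levels[value] = code
        codes.append(code)
    return codes, levels


# Low-cardinality column (e.g. a county name): the table stays tiny.
county_codes, county_levels = encode_categorical(
    ["ESSEX", "GREATER LONDON", "ESSEX", "ESSEX"]
)

# High-cardinality column (e.g. a unique ID per row): one table entry per row.
id_codes, id_levels = encode_categorical(["id-a", "id-b", "id-c", "id-d"])
```

In the second case the table ends up as large as the column itself, so every row pays for a hash insert without saving any space, which is roughly the situation a column of unique IDs creates.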

Fourth, as the benchmarks show, string creation is slow, which affects throughput relative to the I/O bandwidth for text data.

Fifth, I have not used NFS for a very long time. It all depends on how the NFS is tuned, the network hardware and configuration, and the workload. NFS is usually tuned for cumulative throughput across all users, but read throughput from a single workstation may be rather limited.


deads commented on July 23, 2024

All of our public benchmarks were run on Linux machines with multiple SSDs and at least 16 cores. Most of our internal use of paratext is in a server environment. As such, we have not extensively optimized paratext performance for Mac OS X. However, there could be other factors that explain the differences.

Can you give more details about the file you are trying to load and the specs of your machine?


cottrell commented on July 23, 2024

Quite likely the number of cores then. The file is a sample from the UK Land Registry dataset. I wasn't careful with the file cache, but I always did the pandas run first, which I would think should make the second run faster. If the benchmarks are only supposed to look good from a cold state, that could be it.
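
One way to see how much the file cache matters is to time the same load several times in a row. The helper below is only a sketch: `time_runs` is a hypothetical name, and it takes whatever loader you pass in (for example `pandas.read_csv` or `paratext.load_csv_to_dict`) rather than assuming anything about either library.

```python
import time


def time_runs(load, path, repeats=3):
    """Call load(path) `repeats` times; return wall-clock seconds for each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        load(path)
        timings.append(time.perf_counter() - start)
    return timings
```

The first timing may include real disk reads, while later ones are usually served from the OS file cache, so comparing two loaders fairly means comparing them from the same state (both cold or both warm).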

It would be good to know exactly under what situations one should look into this library. The 10x perf is quite attractive. I'm typically on 4-12 core Linux machines on a cluster at work, but ... NFS.

$ wc j.csv
 1000000 5323110 175040402 j.csv
$ du -sh j.csv
167M	j.csv
$ head j.csv
"{61D50B1A-FBBB-43B9-BFB3-10794185519D}","41950","1995-10-20 00:00","CO4 3FS","S","N","F","14","","TURNSTONE END","COLCHESTER","COLCHESTER","COLCHESTER","ESSEX","A","A"
"{7A9B0334-22C7-4F3B-BC5C-095E19C38C75}","96500","1995-09-22 00:00","RM4 1PX","S","N","F","FAIRWAY","","NORTH ROAD","HAVERING-ATTE-BOWER","ROMFORD","HAVERING","GREATER LONDON","A","A"
"{E585DFCF-8323-4C8D-A015-095E1E1272D0}","27500","1995-08-08 00:00","PO6 3RR","T","N","F","18","","HARLESTON ROAD","PORTSMOUTH","PORTSMOUTH","PORTSMOUTH","PORTSMOUTH","A","A"
"{4E698F7E-AB41-4722-800B-095E373063A2}","53000","1995-12-11 00:00","TS16 9EB","S","N","F","7","","WENTWORTH WAY","EAGLESCLIFFE","STOCKTON-ON-TEES","STOCKTON-ON-TEES","STOCKTON-ON-TEES","A","A"
"{2701C6AF-88A1-44BD-B108-0CE7CFAE171F}","75000","1995-03-30 00:00","IG7 6ET","D","N","F","34","","LAMBOURNE ROAD","CHIGWELL","CHIGWELL","EPPING FOREST","ESSEX","A","A"
"{A2B0C762-9DE8-4FC2-8350-0CE7FD7F0C46}","98000","1995-08-25 00:00","NW8 6ER","F","Y","L","61","FLAT 1","QUEENS GROVE","LONDON","LONDON","CITY OF WESTMINSTER","GREATER LONDON","A","A"
"{98666A50-C668-4369-8679-0CE80094AB81}","20000","1995-11-10 00:00","BB1 1SP","T","N","L","56","","NOTTINGHAM STREET","BLACKBURN","BLACKBURN","BLACKBURN","LANCASHIRE","A","A"
"{54F5CCDB-BB36-4305-9AB9-0CE805F2AAE9}","73500","1995-09-08 00:00","SE3 7TW","F","N","L","WYCOMBE COURT","15","ST JOHNS PARK","LONDON","LONDON","GREENWICH","GREATER LONDON","A","A"
"{62D37651-D34F-4AE8-87C9-141AFA53830F}","58400","1995-02-24 00:00","SG2 7DF","T","N","F","53","","THE PASTURES","STEVENAGE","STEVENAGE","STEVENAGE","HERTFORDSHIRE","A","A"
"{E4E27C38-F733-4383-B098-141AFBF0AA4D}","79000","1995-11-09 00:00","RH17 5BL","T","N","F","HARRADINES COTTAGES","2","LONDON LANE","CUCKFIELD","HAYWARDS HEATH","MID SUSSEX","WEST SUSSEX","A","A"

$ system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro8,1
      Processor Name: Intel Core i5
      Processor Speed: 2.3 GHz
      Number of Processors: 1
      Total Number of Cores: 2
      L2 Cache (per Core): 256 KB
      L3 Cache: 3 MB
      Memory: 16 GB

