usda-fdc-data's Issues

--keep_files is necessary

Hi! Cool project. I would love to reproduce the results you are getting.

The following libraries should be added to your dependencies, if you want to be really exact:

  • requests
  • fastparquet or pyarrow (I don't know the difference...)
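
On the fastparquet/pyarrow question: pandas can use either one as its Parquet engine, so only one of them needs to be installed. A minimal sketch for checking what is missing before running main.py follows. The package list is taken from this issue and the traceback, not from the project's own requirements, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the names in `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed requirements, inferred from this issue (not from the repo itself).
    required = ["pandas", "requests"]
    parquet_engines = ["pyarrow", "fastparquet"]

    missing = missing_modules(required)
    if missing:
        print("Missing packages:", ", ".join(missing))

    # pandas needs at least one Parquet engine, not both.
    if len(missing_modules(parquet_engines)) == len(parquet_engines):
        print("Install at least one of:", ", ".join(parquet_engines))
```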

Looks like a file has gone astray somewhere:

chaz@Charlies-Air usda-fdc-data % python3 main.py

Initializing processing of USDA FDC data. Output directory set to:
> fdc_data

Directory created:
> fdc_data

Downloading file paths to:
> fdc_data/FoodData_Central_raw/FoodData_Central_foundation_food_csv_2024-04-18.zip

Downloading file paths to:
> fdc_data/FoodData_Central_raw/FoodData_Central_csv_2024-04-18.zip


Initializing processing for:
> FoodData_Central_foundation_food_csv_2024-04-18

Traceback (most recent call last):
  File "/Users/chaz/usda-fdc-data/main.py", line 59, in <module>
    process_foundation(foundation_urls, OUTPUT_DIR, RAW_DIR, keep_files)
  File "/Users/chaz/usda-fdc-data/preprocessing/process_foundation.py", line 69, in process_foundation
    food_attribute = pd.read_csv(os.path.join(foundation_dir, 'food_attribute.csv'),
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
             ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'fdc_data/FoodData_Central_raw/FoodData_Central_foundation_food_csv_2024-04-18/food_attribute.csv'
chaz@Charlies-Air usda-fdc-data % 

I think what happened is that it created a really gigantic .zip file, but the file didn't stay on disk.

All these files get made:

acquisition_samples.csv
agricultural_samples.csv
Download API Field Descriptions.xlsx
food_attribute_type.csv
food_attribute.csv
food_calorie_conversion_factor.csv
food_category.csv
food_component.csv
food_nutrient_conversion_factor.csv
food_nutrient.csv
food_portion.csv
food_protein_conversion_factor.csv
food_update_log_entry.csv
food.csv
foundation_food.csv
input_food.csv
lab_method_code.csv
lab_method_nutrient.csv
lab_method.csv
market_acquisition.csv
measure_unit.csv
nutrient.csv
sample_food.csv
sub_sample_food.csv
sub_sample_result.csv

The .zip then gets into the hundreds of megabytes, which is great.

At some point, it vanishes from my Finder.

Then I see this:

food.csv
food_nutrient.csv
food_portion.csv
measure_unit.csv
nutrient.csv

When I try with --keep_files, it works.
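
A guard like the following, run before the first `pd.read_csv` call, might turn the traceback above into a clearer message. It is only a sketch: `missing_files` is a hypothetical helper, and the list of expected files is an assumption based on the file named in the traceback, not on what process_foundation.py actually reads.

```python
from pathlib import Path


def missing_files(directory, expected):
    """Return the names in `expected` that are not regular files inside `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    # Path copied from the FileNotFoundError above.
    foundation_dir = (
        "fdc_data/FoodData_Central_raw/"
        "FoodData_Central_foundation_food_csv_2024-04-18"
    )
    # Assumed list; only food_attribute.csv is confirmed by the traceback.
    expected = ["food_attribute.csv", "food.csv", "food_nutrient.csv"]

    gone = missing_files(foundation_dir, expected)
    if gone:
        print("Missing from", foundation_dir + ":", ", ".join(gone))
        print("If the raw files were cleaned up early, try --keep_files.")
```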

Finally, I now have a 1.6 gigabyte .parquet file. What do I do with it? Embarrassingly, my best thought was to load it into a Jupyter notebook using pd.read_parquet('usda_food_nutrition_data.parquet', engine='fastparquet'). My output is:

Corrupted thrift data at  4 :  2 0
Corrupted thrift data at  5 :  9 0
Corrupted thrift data at  8 :  22 0
Corrupted thrift data at  15 :  2 0
Corrupted thrift data at  29 :  25 15
Corrupted thrift data at  30 :  31 14
Corrupted thrift data at  33 :  6 14
Corrupted thrift data at  38 :  2 0
Corrupted thrift data at  51 :  24 14
Corrupted thrift data at  55 :  2 0
Corrupted thrift data at  65 :  2 0
Corrupted thrift data at  70 :  19 14
Corrupted thrift data at  78 :  2 0
Corrupted thrift data at  89 :  2 0
Corrupted thrift data at  92 :  12 14
Corrupted thrift data at  99 :  2 0

This kills the Python process and the notebook.

I don't think it is actually corrupt. I can look at it in Sublime and it looks great.
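
One cheap sanity check that needs no Parquet library: a Parquet file starts and ends with the 4-byte magic number `PAR1`. The sketch below only tests those two markers, so passing it shows the file was not cut off at either end; it does not prove the metadata in between is readable. `looks_like_parquet` is a hypothetical helper name.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file begins and ends with the Parquet magic bytes."""
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    print(looks_like_parquet("usda_food_nutrition_data.parquet"))
```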

I guess I might want a subset of the data. Haha.
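
For pulling a subset, one option is to read only the first batch of rows (and optionally only some columns) instead of the whole 1.6 GB file. The sketch below uses pyarrow rather than the fastparquet engine tried above, which is a swapped-in choice and not something the project prescribes. `preview_parquet` is a hypothetical helper, and whether this avoids the thrift errors on this particular file is untested.

```python
def preview_parquet(path, n_rows=1000, columns=None):
    """Return up to `n_rows` rows from the start of a Parquet file as a DataFrame.

    Only the first record batch is read, so memory use stays small even
    when the file itself is large.
    """
    # Imported here so the module still loads when pyarrow is not installed.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    first_batch = next(
        parquet_file.iter_batches(batch_size=n_rows, columns=columns), None
    )
    if first_batch is None:
        raise ValueError(f"{path} contains no rows")
    return first_batch.to_pandas()


if __name__ == "__main__":
    df = preview_parquet("usda_food_nutrition_data.parquet", n_rows=5)
    print(df.dtypes)
    print(df)
```

Passing `columns=[...]` with a few column names narrows the read further, once the column names are known from the `dtypes` output.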

I'm not sure everything is included. Is this specific to the foundation data? As an example, I have been looking for data on black beans, which I ate today.

The ID 173735, from the legacy data, corresponding to cooked, unsalted, black beans, is contained in the 1.6 gig output file, but the ID 747444, Beans, Dry, Black (0% moisture), from the foundation data, is not contained in the output file. I do have the 747444 data in the downloaded files, however.
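
A check like this can be repeated against any of the downloaded food.csv files using only the standard library. The sketch assumes the file has an `fdc_id` column; `find_foods` and the path in the usage example are hypothetical.

```python
import csv


def find_foods(food_csv_path, fdc_ids):
    """Return {fdc_id: row} for each requested ID that appears in the CSV."""
    wanted = {str(i) for i in fdc_ids}
    found = {}
    with open(food_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumes an `fdc_id` column, as in the downloaded food.csv.
            if row["fdc_id"] in wanted:
                found[row["fdc_id"]] = row
    return found


if __name__ == "__main__":
    hits = find_foods("food.csv", [173735, 747444])  # hypothetical path
    for fdc_id in ("173735", "747444"):
        print(fdc_id, "found" if fdc_id in hits else "missing")
```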

Upon further inspection, it looks like there are about 196 entries of foundation data (no more), and about 14,000 sr_legacy data entries (no more). Then there are about 1,730,000 branded food entries. But you say 650,701 entries in the README.
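
Counts like these can be reproduced from a downloaded food.csv with a short tally. This is a sketch under the assumption that the file has a `data_type` column; `count_by_data_type` and the path in the usage example are hypothetical.

```python
import csv
from collections import Counter


def count_by_data_type(food_csv_path):
    """Count rows in a food.csv file, grouped by the `data_type` column."""
    with open(food_csv_path, newline="", encoding="utf-8") as f:
        # Assumes a `data_type` column, as in the downloaded food.csv.
        return Counter(row["data_type"] for row in csv.DictReader(f))


if __name__ == "__main__":
    counts = count_by_data_type("food.csv")  # hypothetical path
    for data_type, n in counts.most_common():
        print(f"{data_type}: {n}")
```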

Anyway, I know you put a TON of effort into this exceedingly special-purpose program, and I think it's awesome! Just letting you know what might be useful for an interested user. For the record, as far as I can tell, this is the only publicly hosted program that does this task, at least among those I found in a day of searching. There are other people doing similar stuff, but the comprehensive approach is cool. So, I applaud your efforts.
