Discussion from mkayeterry's usda-fdc-data repository

--keep_files is necessary

Hi! Cool project. I would love to see the results that I think you see.

The following libraries should be added to your dependencies, if you want to be really exact:

requests
fastparquet or pyarrow (I don't know the difference...)
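
On the fastparquet vs. pyarrow question: both are Parquet engines that pandas can use, and pandas' default `engine="auto"` tries pyarrow first and falls back to fastparquet. Here is a minimal sketch of a dependency check, assuming the project only needs `requests` plus one of the two engines. The helper name `missing_dependencies` is my own, not something from the repo.

```python
from importlib.util import find_spec


def missing_dependencies(required=("requests",),
                         parquet_engines=("pyarrow", "fastparquet")):
    """Return the names of packages that cannot be imported.

    `required` packages must all be present; for `parquet_engines`
    it is enough that at least one of them is installed.
    The default package lists are assumptions based on this issue.
    """
    missing = [name for name in required if find_spec(name) is None]
    if not any(find_spec(name) is not None for name in parquet_engines):
        missing.append(" or ".join(parquet_engines))
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Please install:", ", ".join(problems))
    else:
        print("All dependencies found.")
```

Running something like this at the top of main.py would turn a mid-run ImportError into a clear message up front.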

Looks like a file has gone astray somewhere:

chaz@Charlies-Air usda-fdc-data % python3 main.py

Initializing processing of USDA FDC data. Output directory set to:
> fdc_data

Directory created:
> fdc_data

Downloading file paths to:
> fdc_data/FoodData_Central_raw/FoodData_Central_foundation_food_csv_2024-04-18.zip

Downloading file paths to:
> fdc_data/FoodData_Central_raw/FoodData_Central_csv_2024-04-18.zip


Initializing processing for:
> FoodData_Central_foundation_food_csv_2024-04-18

Traceback (most recent call last):
  File "/Users/chaz/usda-fdc-data/main.py", line 59, in <module>
    process_foundation(foundation_urls, OUTPUT_DIR, RAW_DIR, keep_files)
  File "/Users/chaz/usda-fdc-data/preprocessing/process_foundation.py", line 69, in process_foundation
    food_attribute = pd.read_csv(os.path.join(foundation_dir, 'food_attribute.csv'),
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
             ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'fdc_data/FoodData_Central_raw/FoodData_Central_foundation_food_csv_2024-04-18/food_attribute.csv'
chaz@Charlies-Air usda-fdc-data %

I think what happened is that it created a really large .zip file, but the file wasn't kept on disk.

All these files get made:

acquisition_samples.csv
agricultural_samples.csv
Download API Field Descriptions.xlsx
food_attribute_type.csv
food_attribute.csv
food_calorie_conversion_factor.csv
food_category.csv
food_component.csv
food_nutrient_conversion_factor.csv
food_nutrient.csv
food_portion.csv
food_protein_conversion_factor.csv
food_update_log_entry.csv
food.csv
foundation_food.csv
input_food.csv
lab_method_code.csv
lab_method_nutrient.csv
lab_method.csv
market_acquisition.csv
measure_unit.csv
nutrient.csv
sample_food.csv
sub_sample_food.csv
sub_sample_result.csv

The .zip then gets into the hundreds of megabytes, which is great.

At some point, it vanishes from my Finder.

Then I see this:

food.csv
food_nutrient.csv
food_portion.csv
measure_unit.csv
nutrient.csv

When I try with --keep_files, it works.
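
A pre-flight check before the `pd.read_csv` calls would make this kind of failure easier to diagnose. This is only a sketch: `find_missing_files` is a hypothetical helper, and the list of required files is my guess based on the traceback and the file names above, not the repo's actual list.

```python
from pathlib import Path


def find_missing_files(directory, required_names):
    """Return the entries of `required_names` that are not files in `directory`."""
    directory = Path(directory)
    return [name for name in required_names
            if not (directory / name).is_file()]


if __name__ == "__main__":
    # Directory taken from the traceback; the file list is an assumption.
    foundation_dir = ("fdc_data/FoodData_Central_raw/"
                      "FoodData_Central_foundation_food_csv_2024-04-18")
    required = ["food.csv", "food_attribute.csv", "food_nutrient.csv",
                "food_portion.csv", "measure_unit.csv", "nutrient.csv"]
    missing = find_missing_files(foundation_dir, required)
    if missing:
        print("Missing before processing:", ", ".join(missing))
```

In my run this would have reported food_attribute.csv as missing instead of failing deep inside pandas.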

Finally, I now have a 1.6 gigabyte .parquet file. What do I do with it? Embarrassingly, my best thought was to load it into a Jupyter notebook using pd.read_parquet('usda_food_nutrition_data.parquet', engine='fastparquet'). My output is:

Corrupted thrift data at  4 :  2 0
Corrupted thrift data at  5 :  9 0
Corrupted thrift data at  8 :  22 0
Corrupted thrift data at  15 :  2 0
Corrupted thrift data at  29 :  25 15
Corrupted thrift data at  30 :  31 14
Corrupted thrift data at  33 :  6 14
Corrupted thrift data at  38 :  2 0
Corrupted thrift data at  51 :  24 14
Corrupted thrift data at  55 :  2 0
Corrupted thrift data at  65 :  2 0
Corrupted thrift data at  70 :  19 14
Corrupted thrift data at  78 :  2 0
Corrupted thrift data at  89 :  2 0
Corrupted thrift data at  92 :  12 14
Corrupted thrift data at  99 :  2 0

This kills the Python process and the notebook.

I don't think it is actually corrupt. I can look at it in Sublime and it looks great.

I guess I might want a subset of the data. Haha.
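
For anyone else who hits this: the "Corrupted thrift data" lines appear to come from the fastparquet engine, so one thing to try is reading only a few columns and rows with pyarrow instead of loading all 1.6 GB. I can't promise this avoids the crash. It is a sketch that assumes pyarrow is installed and that the output file has columns named `fdc_id` and `description`, which I have not confirmed. `peek_parquet` is a made-up name.

```python
def peek_parquet(path, columns=None, n_rows=5):
    """Return (total_rows, column_names, first_rows) without loading the whole file.

    pyarrow is imported here so the helper can be defined even where
    pyarrow is not installed; calling it does require pyarrow.
    """
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    total_rows = parquet_file.metadata.num_rows
    column_names = parquet_file.schema_arrow.names

    # iter_batches streams the file; only the first small batch is read.
    first_batch = next(
        parquet_file.iter_batches(batch_size=n_rows, columns=columns), None
    )
    first_rows = first_batch.to_pylist() if first_batch is not None else []
    return total_rows, column_names, first_rows


if __name__ == "__main__":
    total, names, rows = peek_parquet(
        "usda_food_nutrition_data.parquet",
        columns=["fdc_id", "description"],  # assumed column names
    )
    print(total, "rows; columns:", names)
    for row in rows:
        print(row)
```

If the column names are wrong, call it once with `columns=None` and look at the printed schema first.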

I'm not sure everything is included. Is this specific to the foundation data? As an example, I have been looking for data on black beans, which I ate today.

The ID 173735, from the legacy data, corresponding to cooked, unsalted, black beans, is contained in the 1.6 gig output file, but the ID 747444, Beans, Dry, Black (0% moisture), from the foundation data, is not contained in the output file. I do have the 747444 data in the downloaded files, however.
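
This is roughly how I checked the downloaded files, written as a small standard-library sketch. It assumes the CSV has an `fdc_id` header column, as the downloaded food.csv does in my copy. `find_fdc_ids` is just an illustrative helper name.

```python
import csv


def find_fdc_ids(csv_path, wanted_ids, id_column="fdc_id"):
    """Scan a CSV once and return {id: row} for every wanted ID that is present."""
    wanted = {str(value) for value in wanted_ids}
    found = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row.get(id_column) in wanted:
                found[row[id_column]] = row
    return found


if __name__ == "__main__":
    # Path is an example; point it at whichever food.csv you want to check.
    hits = find_fdc_ids("food.csv", [173735, 747444])
    for fdc_id in ("173735", "747444"):
        print(fdc_id, "found" if fdc_id in hits else "NOT found")
```

The same function works on any CSV export of the output, as long as it keeps an ID column.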

Upon further inspection, it looks like there are about 196 foundation entries (no more) and about 14,000 sr_legacy entries (no more). Then there are about 1,730,000 branded food entries. But the README says 650,701 entries.
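
To make counts like these reproducible, here is a sketch that tallies rows per data type in a CSV. It assumes a `data_type` column, as in the downloaded food.csv. The helper name `count_by_data_type` is mine.

```python
import csv
from collections import Counter


def count_by_data_type(csv_path, type_column="data_type"):
    """Return a Counter mapping each value of `type_column` to its row count."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[type_column] for row in csv.DictReader(handle))


if __name__ == "__main__":
    # Path is an example; use the food.csv from the full download.
    for data_type, count in count_by_data_type("food.csv").most_common():
        print(f"{data_type}: {count:,}")
```

Comparing these per-type totals against the number in the README should show which data types account for the difference.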

Anyway, I know you put a TON of effort into this exceedingly special-purpose program, and I think it's awesome! Just letting you know what might be useful for an interested user. For the record, as far as I can tell, this is the only publicly hosted program that does this task, at least, that I have found in a day of searching. There are other people doing similar stuff, but the comprehensive approach is cool. So, I applaud your efforts.
