mkayeterry / usda-fdc-data
Repository for cleaning, processing, and wrangling USDA FoodData Central datasets.
License: MIT License
Hi! Cool project. I would love to see the results that I think you see.
The following libraries should be added to your dependencies, if you want to be really exact:
Looks like a file has gone astray somewhere:
chaz@Charlies-Air usda-fdc-data % python3 main.py
Initializing processing of USDA FDC data. Output directory set to:
> fdc_data
Directory created:
> fdc_data
Downloading file paths to:
> fdc_data/FoodData_Central_raw/FoodData_Central_foundation_food_csv_2024-04-18.zip
Downloading file paths to:
> fdc_data/FoodData_Central_raw/FoodData_Central_csv_2024-04-18.zip
Initializing processing for:
> FoodData_Central_foundation_food_csv_2024-04-18
Traceback (most recent call last):
File "/Users/chaz/usda-fdc-data/main.py", line 59, in <module>
process_foundation(foundation_urls, OUTPUT_DIR, RAW_DIR, keep_files)
File "/Users/chaz/usda-fdc-data/preprocessing/process_foundation.py", line 69, in process_foundation
food_attribute = pd.read_csv(os.path.join(foundation_dir, 'food_attribute.csv'),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
self.handles = get_handle(
^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/pandas/io/common.py", line 873, in get_handle
handle = open(
^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'fdc_data/FoodData_Central_raw/FoodData_Central_foundation_food_csv_2024-04-18/food_attribute.csv'
chaz@Charlies-Air usda-fdc-data %
I think what happened is it made a really gigantic .zip file, but the file didn't save.
All these files get made:
acquisition_samples.csv
agricultural_samples.csv
Download API Field Descriptions.xlsx
food_attribute_type.csv
food_attribute.csv
food_calorie_conversion_factor.csv
food_category.csv
food_component.csv
food_nutrient_conversion_factor.csv
food_nutrient.csv
food_portion.csv
food_protein_conversion_factor.csv
food_update_log_entry.csv
food.csv
foundation_food.csv
input_food.csv
lab_method_code.csv
lab_method_nutrient.csv
lab_method.csv
market_acquisition.csv
measure_unit.csv
nutrient.csv
sample_food.csv
sub_sample_food.csv
sub_sample_result.csv
The .zip then gets into the hundreds of megabytes, which is great.
At some point, it vanishes from my Finder.
Then I see this:
food.csv
food_nutrient.csv
food_portion.csv
measure_unit.csv
nutrient.csv
When I try with --keep_files, it works.
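In case it helps with debugging, here is a minimal sketch of a pre-flight check that could run right before the processing step. It only tests whether the files the processing code reads are still on disk. The function name `missing_files` and the `REQUIRED_FILES` list are hypothetical, not part of the repository; the only file name taken from the traceback above is `food_attribute.csv`. My guess, and it is only a guess, is that a cleanup step removes that file before `process_foundation` reads it, which would fit with `--keep_files` making the error go away.

```python
import os

# Hypothetical list: only food_attribute.csv is confirmed by the traceback.
# Extend it with whatever else the processing step actually reads.
REQUIRED_FILES = ["food_attribute.csv"]


def missing_files(raw_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from raw_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(raw_dir, name))
    ]
```

Calling `missing_files('fdc_data/FoodData_Central_raw/FoodData_Central_foundation_food_csv_2024-04-18')` just before the `pd.read_csv` call would return a non-empty list if the file had already been removed at that point.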
Finally, I now have a 1.6 gigabyte .parquet file. What do I do with it? Embarrassingly, my best thought was to load it into a Jupyter notebook using pd.read_parquet('usda_food_nutrition_data.parquet', engine='fastparquet'). My output is:
Corrupted thrift data at 4 : 2 0
Corrupted thrift data at 5 : 9 0
Corrupted thrift data at 8 : 22 0
Corrupted thrift data at 15 : 2 0
Corrupted thrift data at 29 : 25 15
Corrupted thrift data at 30 : 31 14
Corrupted thrift data at 33 : 6 14
Corrupted thrift data at 38 : 2 0
Corrupted thrift data at 51 : 24 14
Corrupted thrift data at 55 : 2 0
Corrupted thrift data at 65 : 2 0
Corrupted thrift data at 70 : 19 14
Corrupted thrift data at 78 : 2 0
Corrupted thrift data at 89 : 2 0
Corrupted thrift data at 92 : 12 14
Corrupted thrift data at 99 : 2 0
This kills the Python process and the notebook.
I don't think it is actually corrupt. I can look at it in Sublime and it looks great.
I guess I might want a subset of the data. Haha.
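One quick check that might narrow this down: a Parquet file is a binary format that starts and ends with the 4-byte marker `PAR1`. If the file reads cleanly in a text editor, it may be worth confirming that it really is Parquet rather than, say, CSV saved with a .parquet extension. The sketch below uses only the standard library; the function name `looks_like_parquet` is made up for illustration, and passing the check does not prove the file is valid, only that the markers are in place.

```python
def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)          # move to the end to learn the file size
        if f.tell() < 8:      # too small to hold both markers
            return False
        f.seek(-4, 2)         # last four bytes of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

If `looks_like_parquet('usda_food_nutrition_data.parquet')` returns True, another thing to try is `engine='pyarrow'` in `pd.read_parquet`, together with its `columns` argument to load only a few columns at a time.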
I'm not sure everything is included. Is this specific to the foundation data? As an example, I have been looking for data on black beans, which I ate today.
The ID 173735, from the legacy data, corresponding to cooked, unsalted, black beans, is contained in the 1.6 gig output file, but the ID 747444, Beans, Dry, Black (0% moisture), from the foundation data, is not contained in the output file. I do have the 747444 data in the downloaded files, however.
Upon further inspection, it looks like there are about 196 entries of foundation data (no more), and about 14,000 sr_legacy data entries (no more). Then there are about 1,730,000 branded food entries. But you say 650,701 entries in the README.
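For anyone who wants to reproduce counts like these without loading everything into memory, here is a rough sketch that streams a CSV and tallies rows per data type, while also checking whether particular IDs are present. It assumes a CSV with `fdc_id` and `data_type` columns, which is my understanding of the downloaded food.csv; the output file may use different column names. The function name `summarize_food_csv` is hypothetical.

```python
import csv
from collections import Counter


def summarize_food_csv(path, ids_to_check=()):
    """Count rows per data_type and report which of ids_to_check appear."""
    counts = Counter()
    wanted = {str(i) for i in ids_to_check}
    found = set()
    with open(path, newline="", encoding="utf-8") as f:
        # Assumed column names: "fdc_id" and "data_type".
        for row in csv.DictReader(f):
            counts[row["data_type"]] += 1
            if row["fdc_id"] in wanted:
                found.add(row["fdc_id"])
    return counts, found
```

For example, `summarize_food_csv('food.csv', ids_to_check=[173735, 747444])` returns the per-type counts plus the subset of those two IDs that were found, as strings.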
Anyway, I know you put a TON of effort into this exceedingly special-purpose program, and I think it's awesome! Just letting you know what might be useful for an interested user. For the record, as far as I can tell, this is the only publicly hosted program that does this task, at least, that I have found in a day of searching. There are other people doing similar stuff, but the comprehensive approach is cool. So, I applaud your efforts.