- Data discovery
- Kaggle dataset
- Recipe data
- Data model
- fact
- dim
- UOM
- EDA
- Classifying ingredients
- Clustering
- Numeric features
- Unstructured features
- Deep learning from unstructured text
- Clustering
- Conversions
- Adding this step for now
- Recipe comparison
- Data pipeline
- preprocessing (cleanse and prepare data)
- pre deployment
- post deployment
- predictions
- consumption
- preprocessing (cleanse and prepare data)
- Refactor code?
```bash
pip3 install virtualenv
virtualenv venv
source venv/bin/activate
```
virtual environment setup guide
- What foods can be substituted for meats and still have the same amount of protein?
- How many groups of foods are there based on ingredients?
- Can we classify foods simply based on their ingredients? Does it make intuitive sense?
- What foods have the highest sugar, protein, fat, or calories?
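Two of these questions reduce to simple lookups over the nutrient columns. The sketch below uses only the standard library and a handful of made-up rows; the column names `Protein_g` and `Sugar_g`, the values, and the 15 g tolerance are assumptions for illustration, not confirmed details of the dataset.

```python
import heapq

# Illustrative rows only; real rows would come from the dataset CSV.
# "Protein_g" and "Sugar_g" are assumed column names, and the values are made up.
foods = [
    {"Descrip": "Chicken breast", "Protein_g": 31.0, "Sugar_g": 0.0},
    {"Descrip": "Lentils, cooked", "Protein_g": 9.0, "Sugar_g": 1.8},
    {"Descrip": "Tofu, firm", "Protein_g": 17.3, "Sugar_g": 0.7},
    {"Descrip": "Honey", "Protein_g": 0.3, "Sugar_g": 82.1},
]


def top_by_nutrient(rows, column, n=3):
    """Return the descriptions of the n rows with the largest value in column."""
    best = heapq.nlargest(n, rows, key=lambda row: row[column])
    return [row["Descrip"] for row in best]


def similar_protein(rows, target, tolerance=15.0):
    """Return other foods whose protein is within `tolerance` grams of the target."""
    ref = next(row for row in rows if row["Descrip"] == target)
    return [
        row["Descrip"]
        for row in rows
        if row is not ref and abs(row["Protein_g"] - ref["Protein_g"]) <= tolerance
    ]


print(top_by_nutrient(foods, "Protein_g", n=2))
print(similar_protein(foods, "Chicken breast"))
```

The same two helpers would apply to fat or calories by passing a different column name, provided that column exists in the data.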
Each observation in the dataset has a unique identifier: a key, "NBD_No", and a text label for the ingredient, "Descrip". My hypothesis is that similar foods have similar nutritional values, so we may be able to see distinct groupings of foods based on their nutritional values.
I examine three different clustering algorithms for this dataset.
Result: Clustering observations by nutritional value has some utility, because some clusters contain very similar foods. However, the approach is not flawless: clusters also contain very different foods, which makes it difficult to say what kind of food a cluster represents. Additionally, the silhouette score for each clustering algorithm shows that the clusters still overlap and are not distinct from each other.
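To make the silhouette score concrete, here is a minimal standard-library sketch of how it is computed. It is not the code used for the results above; in practice a library routine such as scikit-learn's `silhouette_score` would be the usual choice. The toy points and labels are made up. Scores near 1 indicate well-separated clusters, scores near 0 indicate overlap, and negative scores suggest points assigned to the wrong cluster.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
mixed = silhouette_score(points, [0, 1, 0, 1])      # labels cut across the groups
print(round(separated, 3), round(mixed, 3))
```

With labels that match the two groups the score is high, while labels that cut across the groups give a negative score, which is the kind of signal described above for overlapping clusters.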
Next Steps:
- Keep a baseline clustering method available and iterate to improve
- Examine other ways to assign foods to a group or give them a label using their nutritional value.
- Deep learning on a subset of manually labeled ingredients
- Decide on an a priori labeling method for food ingredients
- Deep learning on a subset of manually labeled ingredients
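One possible a priori labeling method, sketched here only as an illustration, is a small set of keyword rules applied to the "Descrip" text. The groups and keywords below are hypothetical placeholders, not a decided scheme. Labels produced this way could seed the manually labeled subset mentioned above.

```python
import re

# Hypothetical keyword rules; the groups and keywords are placeholders.
RULES = {
    "dairy": ("milk", "cheese", "yogurt"),
    "meat": ("beef", "pork", "chicken"),
    "legume": ("bean", "lentil", "pea"),
}


def label_ingredient(description, rules=RULES, default="unlabeled"):
    """Return the first group with a keyword matching a whole word of the description."""
    tokens = set()
    for word in re.findall(r"[a-z]+", description.lower()):
        tokens.add(word)
        if word.endswith("s"):
            tokens.add(word[:-1])  # naive singular form, e.g. "beans" -> "bean"
    for group, keywords in rules.items():
        if any(keyword in tokens for keyword in keywords):
            return group
    return default


print(label_ingredient("Cheese, cheddar"))
print(label_ingredient("Beans, kidney, raw"))
print(label_ingredient("Peanut butter"))
```

Matching on whole words rather than substrings keeps "peanut" from being labeled by the "pea" keyword. Anything the rules miss falls through to `unlabeled` and would need manual review.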
- data
https://www.kaggle.com/datasets/thedevastator/now-with-more-nutrients