hhursev / recipe-scrapers
Python package for scraping recipes data
License: MIT License
Request to support budgetbytes.com, sample of a recipe:
https://www.budgetbytes.com/creamy-coconut-curry-lentils-with-spinach/
Add support for https://www.yummly.com
Sample recipe - https://www.yummly.com/recipe/Carrot-Milk-shake-1099424
I thought I'd share what I made with this: https://archive.org/details/recipes-en-201706
A full version of allrecipes, epicurious, cookstr, and bbc.co.uk, parsed into nice JSON with photos.
Sorry to abuse 'issues'; as far as I know there's no option to send a private message on GitHub.
The parser crashes on every attribute.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Excellent recipes for Thermomix/Bimby
https://cookidoo.it/
https://cookidoo.it/recipes/recipe/it-IT/r490818
Doesn't appear to parse the time correctly, and returns empty instructions.
The parser also crashes when asked for the title, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Hi guys,
I want to add Hello Fresh to the list of recipe websites here. I've read through the contributing sections and will get started soon, with my limited programming background.
A couple of relevant questions:
Whom can I PM in case I get stuck? Not with Python errors, but with the overall workflow.
I've noticed that recipes scraped from BBC Good Food don't properly pull out the instructions. For instance, scraping https://www.bbcgoodfood.com/recipes/herby-spring-chicken-pot-pie
gives them as
"Heat oven to 200C/180C fan/gas 6. Heat the oil in a large, shallow casserole dish on a medium heat. Add the spring onions and fry for 3 mins, then stir through the frozen spinach and cook for 2 mins or until it\u2019s starting to wilt. Remove the skin from the chicken and discard. Shred the chicken off the bone and into the pan, and discard the bones. Stir through the stock and mustard. Bring to a simmer and cook, uncovered, for 5-10 mins. Stir in the peas, cr\u00e8me fra\u00eeche and herbs, then remove from the heat. Scrunch the filo pastry sheets over the mixture, brush with a little oil and bake for 15-20 mins or until golden brown. MethodHeat oven to 200C/180C fan/gas 6. Heat the oil in a large, shallow casserole dish on a medium heat. Add the spring onions and fry for 3 mins, then stir through the frozen spinach and cook for 2 mins or until it\u2019s starting to wilt. Remove the skin from the chicken and discard. Shred the chicken off the bone and into the pan, and discard the bones. Stir through the stock and mustard. Bring to a simmer and cook, uncovered, for 5-10 mins.Stir in the peas, cr\u00e8me fra\u00eeche and herbs, then remove from the heat. Scrunch the filo pastry sheets over the mixture, brush with a little oil and bake for 15-20 mins or until golden brown. Recipe from Good Food magazine, April 2019"
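The text above is the same method list twice, the second copy prefixed with "Method". One way to avoid the doubled text would be to read the instructions from the page's schema.org JSON-LD block instead of the HTML method list; a minimal sketch, assuming the page embeds a Recipe object (markup details unverified):

```python
import json

from bs4 import BeautifulSoup


def instructions_from_jsonld(html):
    """Pull recipeInstructions out of the page's schema.org JSON-LD block,
    side-stepping the duplicated HTML method list."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
        except (TypeError, ValueError):
            continue
        if isinstance(data, dict) and data.get("@type") == "Recipe":
            steps = data.get("recipeInstructions", "")
            if isinstance(steps, list):
                # steps may be plain strings or HowToStep objects
                return "\n".join(
                    step.get("text", "") if isinstance(step, dict) else step
                    for step in steps
                )
            return steps
    return ""
```

Whether BBC Good Food actually ships usable JSON-LD would need checking against the live page.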
While I'm here: I've noticed that most of the scrapers pull out just the recipe name, ingredients, and instructions. Would contributions of scrapers that extract more information (e.g. serving size) be accepted?
When I scrape a recipe from BBC Good Food, if an ingredient contains a tooltip, the tooltip text is included in the ingredient text.
Example
from recipe_scrapers import scrape_me
import json

scraper = scrape_me('https://www.bbcgoodfood.com/recipes/artichoke-watercress-linguine')
print(json.dumps(
    {
        'title': scraper.title(),
        'totalTime': scraper.total_time(),
        'ingredients': scraper.ingredients(),
        'instructions': scraper.instructions()
    }
))
The result is
{
"title":"Artichoke & watercress linguine",
"totalTime":0,
"ingredients":[
"100g watercress Watercress wort-er-cressWith deep green leaves, and crisp, paler stems, watercress is related to mustard and is one of\u2026", //- here
"280g jar artichokes in olive oil",
"60g ricotta Ricotta ree-cot-aRicotta is an Italian curd cheese. Made from whey, it is traditionally a by-product of making\u2026", //- here
"220g dried linguine"
],
"instructions":"Blitz together the watercress, \u00be of the artichokes, the ricotta and 3 tbsp olive oil from the jar, then season to taste.\nBring a large pan of salted water to the boil and cook the linguine following pack instructions until al dente. Toss the pasta with the watercress pesto along with the remaining artichokes and a ladleful of pasta water. Finish with an extra drizzle of olive oil and black pepper."
}
Notice "100g watercress Watercress…"
and "60g ricotta Ricotta ree-cot-aRicotta…"
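A possible fix, sketched with BeautifulSoup: remove the tooltip node from each ingredient element before reading its text. The `.tooltip` class name is an assumption here; the live markup would need checking.

```python
from bs4 import BeautifulSoup


def ingredient_text(li_html):
    """Drop the glossary tooltip node (class name assumed) before
    reading the ingredient's text."""
    li = BeautifulSoup(li_html, "html.parser")
    for tip in li.select(".tooltip"):
        tip.decompose()
    return li.get_text(" ", strip=True)
```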
Could you help me with this site? It's the Brazilian version of Allrecipes:
http://allrecipes.com.br/
Recipe address:
http://allrecipes.com.br/receita/11219/bolo-de-iogurte-f-cil-de-liquidificador.aspx
Some methods for Giallo Zafferano are not working anymore.
Example URL: https://ricette.giallozafferano.it/Pesto-alla-Genovese.html
>>> scraper.total_time()
0
>>> scraper.yields()
get_serving_numbers error 'NoneType' object has no attribute 'get_text'
''
>>> scraper.ingredients()
[]
None of the fields are returned correctly anymore, since the site switched from class-based markup to itemprop attributes.
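Selecting on those attributes instead of the old class names is straightforward in BeautifulSoup; a minimal sketch (the exact itemprop values on the redesigned pages are assumptions):

```python
from bs4 import BeautifulSoup


def find_itemprop(html, prop):
    """Return the text of every element carrying the given itemprop
    attribute, e.g. prop='recipeIngredient'."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select('[itemprop="%s"]' % prop)]
```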
Asking for the title causes a crash.
Doesn't appear to parse the time correctly, and ingredients are returned as an empty list.
The parser also crashes when asked for the instructions, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
The scraper is getting 404s, whereas the links work fine in a real browser.
A simple scrape_me('https://www.finedininglovers.com/recipes/brunch/rocket-gorgonzola-souffle/')
is sufficient to trigger the error.
I haven't investigated what the issue might be, but I imagine the site could just be returning 404s based on the user agent.
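If the user-agent hypothesis holds, sending a browser-like header should be enough to check it; a quick sketch with requests (the header string is arbitrary, and this is a test of the hypothesis, not a confirmed fix):

```python
import requests

# A browser-like UA, instead of the default "python-requests/x.y" one
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}


def fetch(url):
    """Fetch a page with a browser-like User-Agent to see whether
    the 404s go away."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.text
```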
I would like to add support for FoodNetwork.com. I will submit a pull request shortly.
Silently fails (returns 0 / an empty string / an empty array) on time, instructions and ingredients.
The parser also crashes when asked for the title, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Budget Bytes is missing yields. The test HTML page probably needs updating alongside the fix.
Doesn't appear to parse the time correctly.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
This would be very helpful, as all the recipes are under a share-alike license.
An Example Recipe:
https://en.wikibooks.org/wiki/Cookbook:Pad_Thai
Missing total_time, yields, ingredients, instructions. Cf.:
https://www.101cookbooks.com/archives/blood-orange-gin-sparkler-recipe.html
Hey, I was trying to write a scraper for recipes created with WordPress Recipe Maker, but am unsure about what should go into instructions.
example: soft-baked-gingerbread-cookies
I have two questions about it:
Generally it might be helpful to specify these things somewhere, don't you think?
This is an enhancement proposal.
I would like to add Giallo Zafferano to the list of supported scrapers. It's by far the largest collection of recipes in the Italian language, as far as I know.
I am about to submit a pull request for this and will describe the implementation details there.
Could you help me with this site?
https://www.tudogostoso.com.br/
Recipe address:
https://www.tudogostoso.com.br/receita/114-brigadeiro.html
Unrelated to the https issue, the time is also not being parsed correctly.
Returns an empty list for ingredients.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Not sure how widespread this error is (everything on the domain, or just this one recipe), but I have an automated test suite in a Rails app that runs this script, and it just started failing.
from recipe_scrapers import scrape_me
import sys
import json

try:
    # give the url as a string; it can be a url from any site listed below
    scraper = scrape_me(str(sys.argv[1]))
    # data = {
    #     "title": scraper.title(),
    #     "total_time": scraper.total_time(),
    #     "ingredients": scraper.ingredients(),
    #     "instructions": scraper.instructions()
    # }
    print(scraper.ingredients())
    # with open('tmp/recipe_data.json', 'w') as outfile:
    #     json.dump(data, outfile)
except Exception:
    print("Error while scraping recipe", sys.exc_info())
Seeing
('Error while scraping recipe', (<type 'exceptions.UnicodeDecodeError'>, UnicodeDecodeError('ascii', '\xa0', 0, 1, 'ordinal not in range(128)'), <traceback object at 0x11145b6c8>))
.title() and .total_time() work, but by commenting lines out I traced the issue back to .ingredients(). I'm not a Python dev, so I didn't dig any deeper.
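The '\xa0' in the traceback is a non-breaking space in the ingredient text; under Python 2, mixing such unicode strings with byte strings triggers the implicit ascii decode seen here. Normalizing those characters before handing the data over is one workaround (a sketch; moving the script to Python 3 avoids the problem entirely):

```python
def normalize_spaces(text):
    """Replace non-breaking spaces (u'\xa0') with plain spaces and strip,
    so downstream byte-string handling doesn't hit the ascii codec."""
    return text.replace(u"\xa0", u" ").strip()
```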
I have a recipe app (for personal use) that uses your scrapers and I just built a dashboard to see what sites I use the most. Here's a list of the top ones that I use. So this isn't actually an issue, just a passive request to add any missing ones to the list if you have the time. Thanks again for such a great tool!
I tried scraper = scrape_me('https://www.tudogostoso.com.br/receita/7778-bobo-de-camarao.html')
in Python 3, and none of the values are populated afterwards.
How would I go about getting all the recipes from the site?
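There's no built-in way to enumerate a whole site; the usual approach is to walk the site's XML sitemap and feed each URL to scrape_me. A sketch (the sitemap path and its layout are site-specific assumptions):

```python
from xml.etree import ElementTree

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Extract every <loc> entry from a sitemap document, e.g. one fetched
    from https://example.com/sitemap.xml (hypothetical path)."""
    root = ElementTree.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]
```

Each returned URL could then be passed to scrape_me in a loop, ideally with some rate limiting to be polite to the site.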
Missing total_time, yields, ingredients, instructions. Cf.: https://www.mybakingaddiction.com/chocolate-coconut-zucchini-bread/
Scrape / scraper / scraping.
"Scrape" is pretty consistently spelled "scrap" throughout the project, which is a different word :)
Missing: yields, instructions. Cf.: https://www.simplyrecipes.com/recipes/one_pot_chicken_and_rice_soup/
Is there a reason why recipe images are not scraped?
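For what it's worth, most recipe pages expose the hero image via an og:image meta tag, so an image() method could be as small as this sketch (the method name and approach are hypothetical, not part of the library):

```python
from bs4 import BeautifulSoup


def image_url(html):
    """Read the Open Graph image URL from a page, if present."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", property="og:image")
    return tag["content"] if tag and tag.has_attr("content") else None
```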
A simple scrape_me('https://www.mybakingaddiction.com/tiramisu-trifles-recipe/')
is enough to trigger a certificate error. I glanced at the error, and the certificate received seemed to be for test.mybakingaddiction.com, which is odd, since a browser gets the correct certificate when opening the same URL.
from recipe_scrapers import scrape_me

scraper = scrape_me('https://closetcooking.com/bacon-guacamole-grilled-cheese-sandwich/')
scraper.title()
scraper.ingredients()
Missing instructions. Cf.: https://www.realsimple.com/food-recipes/browse-all-recipes/vanilla-cheesecake
I am using this project as the foundation for a web crawler to collect recipe data for a Natural Language Processing project I am working on. Would you accept a pull request that adds the ability to return the links that reference the calling domain?
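Such a method could filter the page's anchors down to those staying on the calling domain; a sketch of the idea (the helper name and behaviour follow the proposal, not any existing API):

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def same_domain_links(html, base_url):
    """Collect absolute links that stay on the calling domain --
    the kind of helper the proposal describes (name is hypothetical)."""
    soup = BeautifulSoup(html, "html.parser")
    host = urlparse(base_url).netloc
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        if urlparse(url).netloc == host:
            links.add(url)
    return sorted(links)
```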
Doesn't appear to parse the time correctly.
The parser also crashes when asked for the instructions, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Doesn't appear to parse the time correctly.
The parser also crashes when asked for the ingredients, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Missing title, yields, ingredients. Cf.: https://inspiralized.com/brussels-sprouts-and-apple-salad-with-parmesan/
Example recipe: https://inspiralized.com/spiralized-pasta-carbonara-two-ways/
I'm going to submit a few of these. I haven't yet looked at which sites most of my recipes come from, but this is one I have at least several from.
Side note: thanks for the great work! If I get the time, I'll write and PR some more scrapers. I mainly work in Ruby, but I'm sure I can figure it out from the examples.
Missing yields ("serving amounts"). Cf.: https://www.hellofresh.com/recipes/oven-baked-portobellos-and-chive-mashed-potatoes-5bd1efb3ae08b5111b5f2ea2
from recipe_scrapers import scrape_me

scraper = scrape_me('http://foodnetwork.com/recipes/accordion-potatoes-2807602')
scraper.total_time()
scraper.ingredients()
scraper.instructions()
You need to create a free account with them, but all recipes are available even if you don't subscribe.
URL: https://www.hellofresh.com/recipes
Looks like it's currently not picking up any elements for this site anymore.
I ran it on a test at https://www.finedininglovers.com/recipes/brunch/rocket-gorgonzola-souffle/
and didn't even get the title.
Doesn't appear to parse the time correctly.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
I've seen times like "Less than 30 minutes", "Under 1 hour", etc. Right now these get parsed as 0. Even if that's the intended behaviour, could you add some documentation to get_minutes explaining it?
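If parsing these phrases as upper bounds were wanted, a small extension could pull out the number and unit; a sketch (not the library's current behaviour, which returns 0 for such strings):

```python
import re


def get_minutes_upper_bound(text):
    """Read 'Less than 30 minutes' or 'Under 1 hour' as an upper-bound
    minute count; returns 0 when no number/unit pair is found."""
    match = re.search(r"(\d+)\s*(hours?|hrs?|minutes?|mins?)", text, re.IGNORECASE)
    if not match:
        return 0
    value = int(match.group(1))
    # units starting with 'h' are hours; everything else is minutes
    return value * 60 if match.group(2).lower().startswith("h") else value
```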