hhursev / recipe-scrapers
Python package for scraping recipes data
License: MIT License
Request to support budgetbytes.com, sample of a recipe:
https://www.budgetbytes.com/creamy-coconut-curry-lentils-with-spinach/
Add support for https://www.yummly.com
Sample recipe - https://www.yummly.com/recipe/Carrot-Milk-shake-1099424
I thought I'd share what I made with this: https://archive.org/details/recipes-en-201706
A full version of allrecipes, epicurious, cookstr, and bbc.co.uk, parsed into nice JSON with photos.
Sorry to abuse 'issues'; as far as I know there's no option to send a private message on GitHub.
The parser crashes on every attribute.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Excellent recipes for Thermomix/Bimby
https://cookidoo.it/
https://cookidoo.it/recipes/recipe/it-IT/r490818
Doesn't appear to parse the time correctly, and returns empty instructions.
The parser also crashes when asked for the title, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Hi guys,
I want to add Hello Fresh to the list of recipe websites here. I've read through the contributing sections and will get started soon, with my limited programming background.
A couple of relevant questions:
Whom can I PM in case I get stuck? Not with Python errors, but with the overall workflow.
I've noticed that recipes scraped from BBC Good Food don't properly pull out the instructions. For instance, scraping https://www.bbcgoodfood.com/recipes/herby-spring-chicken-pot-pie
gives them as
"Heat oven to 200C/180C fan/gas 6. Heat the oil in a large, shallow casserole dish on a medium heat. Add the spring onions and fry for 3 mins, then stir through the frozen spinach and cook for 2 mins or until it\u2019s starting to wilt. Remove the skin from the chicken and discard. Shred the chicken off the bone and into the pan, and discard the bones. Stir through the stock and mustard. Bring to a simmer and cook, uncovered, for 5-10 mins. Stir in the peas, cr\u00e8me fra\u00eeche and herbs, then remove from the heat. Scrunch the filo pastry sheets over the mixture, brush with a little oil and bake for 15-20 mins or until golden brown. MethodHeat oven to 200C/180C fan/gas 6. Heat the oil in a large, shallow casserole dish on a medium heat. Add the spring onions and fry for 3 mins, then stir through the frozen spinach and cook for 2 mins or until it\u2019s starting to wilt. Remove the skin from the chicken and discard. Shred the chicken off the bone and into the pan, and discard the bones. Stir through the stock and mustard. Bring to a simmer and cook, uncovered, for 5-10 mins.Stir in the peas, cr\u00e8me fra\u00eeche and herbs, then remove from the heat. Scrunch the filo pastry sheets over the mixture, brush with a little oil and bake for 15-20 mins or until golden brown. Recipe from Good Food magazine, April 2019"
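The text above is the same method list twice, the second copy prefixed with "Method". One way to avoid the doubled text would be to read the instructions from the page's schema.org JSON-LD block instead of the HTML method list; a minimal sketch, assuming the page embeds a Recipe object (markup details unverified):

```python
import json

from bs4 import BeautifulSoup


def instructions_from_jsonld(html):
    """Pull recipeInstructions out of the page's schema.org JSON-LD block,
    side-stepping the duplicated HTML method list."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
        except (TypeError, ValueError):
            continue
        if isinstance(data, dict) and data.get("@type") == "Recipe":
            steps = data.get("recipeInstructions", "")
            if isinstance(steps, list):
                # steps may be plain strings or HowToStep objects
                return "\n".join(
                    step.get("text", "") if isinstance(step, dict) else step
                    for step in steps
                )
            return steps
    return ""
```

Whether BBC Good Food actually ships usable JSON-LD would need checking against the live page.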
While I'm here: I've noticed that most of the scrapers pull out just the recipe name, ingredients, and instructions. Would contributions of scrapers that extract more information (e.g. serving size) be accepted?
When I scrape a recipe from BBC Good Food, if an ingredient contains a tooltip, the tooltip text is included in the ingredient text.
Example
from recipe_scrapers import scrape_me
import json

scraper = scrape_me('https://www.bbcgoodfood.com/recipes/artichoke-watercress-linguine')
print(json.dumps(
    {
        'title': scraper.title(),
        'totalTime': scraper.total_time(),
        'ingredients': scraper.ingredients(),
        'instructions': scraper.instructions()
    }
))
The result is
{
"title":"Artichoke & watercress linguine",
"totalTime":0,
"ingredients":[
"100g watercress Watercress wort-er-cressWith deep green leaves, and crisp, paler stems, watercress is related to mustard and is one of\u2026", //- here
"280g jar artichokes in olive oil",
"60g ricotta Ricotta ree-cot-aRicotta is an Italian curd cheese. Made from whey, it is traditionally a by-product of making\u2026", //- here
"220g dried linguine"
],
"instructions":"Blitz together the watercress, \u00be of the artichokes, the ricotta and 3 tbsp olive oil from the jar, then season to taste.\nBring a large pan of salted water to the boil and cook the linguine following pack instructions until al dente. Toss the pasta with the watercress pesto along with the remaining artichokes and a ladleful of pasta water. Finish with an extra drizzle of olive oil and black pepper."
}
Notice "100g watercress Watercress…"
and "60g ricotta Ricotta ree-cot-aRicotta…"
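A possible fix, sketched with BeautifulSoup: remove the tooltip node from each ingredient element before reading its text. The `.tooltip` class name is an assumption here; the live markup would need checking.

```python
from bs4 import BeautifulSoup


def ingredient_text(li_html):
    """Drop the glossary tooltip node (class name assumed) before
    reading the ingredient's text."""
    li = BeautifulSoup(li_html, "html.parser")
    for tip in li.select(".tooltip"):
        tip.decompose()
    return li.get_text(" ", strip=True)
```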
Could you help me with this site? It's the Brazilian version of Allrecipes:
http://allrecipes.com.br/
Recipe address:
http://allrecipes.com.br/receita/11219/bolo-de-iogurte-f-cil-de-liquidificador.aspx
Some methods for Giallo Zafferano are not working anymore.
Example URL: https://ricette.giallozafferano.it/Pesto-alla-Genovese.html
>>> scraper.total_time()
0
>>> scraper.yields()
get_serving_numbers error 'NoneType' object has no attribute 'get_text'
''
>>> scraper.ingredients()
[]
None of the fields are returned correctly anymore, since the site switched from class-based markup to itemprop attributes.
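Selecting on those attributes instead of the old class names is straightforward in BeautifulSoup; a minimal sketch (the exact itemprop values on the redesigned pages are assumptions):

```python
from bs4 import BeautifulSoup


def find_itemprop(html, prop):
    """Return the text of every element carrying the given itemprop
    attribute, e.g. prop='recipeIngredient'."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select('[itemprop="%s"]' % prop)]
```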
Asking for the title causes a crash.
Doesn't appear to parse the time correctly, and ingredients are returned as an empty list.
The parser also crashes when asked for the instructions, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
The scraper is getting 404s, whereas the links work fine in a real browser.
A simple scrape_me('https://www.finedininglovers.com/recipes/brunch/rocket-gorgonzola-souffle/')
is sufficient to trigger the error.
I haven't investigated what the issue might be, but I imagine the site could just be returning 404s based on the user agent.
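If the user-agent hypothesis holds, sending a browser-like header should be enough to check it; a quick sketch with requests (the header string is arbitrary, and this is a test of the hypothesis, not a confirmed fix):

```python
import requests

# A browser-like UA, instead of the default "python-requests/x.y" one
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}


def fetch(url):
    """Fetch a page with a browser-like User-Agent to see whether
    the 404s go away."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.text
```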
I would like to add support for FoodNetwork.com. I will submit a pull request shortly.
Silently fails (returns 0 / an empty string / an empty array) on time, instructions and ingredients.
The parser also crashes when asked for the title, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Budget Bytes is missing yields. The test HTML page probably needs updating alongside the fix.
Doesn't appear to parse the time correctly.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
This would be very helpful, as all the recipes are under a share-alike license.
An Example Recipe:
https://en.wikibooks.org/wiki/Cookbook:Pad_Thai
Missing total_time, yields, ingredients, instructions. Cf.:
https://www.101cookbooks.com/archives/blood-orange-gin-sparkler-recipe.html
Hey, I was trying to write a scraper for recipes created with WordPress Recipe Maker, but am unsure about what should go into instructions.
example: soft-baked-gingerbread-cookies
I have two questions about it:
Generally it might be helpful to specify these things somewhere, don't you think?
This is an enhancement proposal.
I would like to add Giallo Zafferano to the list of supported scrapers. It's by far the largest collection of recipes in the Italian language, as far as I know.
I am about to submit a pull request for this and will describe the implementation details there.
Could you help me with this site?
https://www.tudogostoso.com.br/
Recipe address:
https://www.tudogostoso.com.br/receita/114-brigadeiro.html
Unrelated to the https issue, the time is also not being parsed correctly.
Returns an empty list for ingredients.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Not sure how widespread this error is (everything on the domain, or just this one recipe), but I have an automated test suite in a Rails app that runs this script, and it just started failing.
from recipe_scrapers import scrape_me
import sys
import json

try:
    # give the url as a string; it can be a url from any site listed below
    scraper = scrape_me(str(sys.argv[1]))
    # data = {
    #     "title": scraper.title(),
    #     "total_time": scraper.total_time(),
    #     "ingredients": scraper.ingredients(),
    #     "instructions": scraper.instructions()
    # }
    print(scraper.ingredients())
    # with open('tmp/recipe_data.json', 'w') as outfile:
    #     json.dump(data, outfile)
except Exception:
    print("Error while scraping recipe", sys.exc_info())
Seeing
('Error while scraping recipe', (<type 'exceptions.UnicodeDecodeError'>, UnicodeDecodeError('ascii', '\xa0', 0, 1, 'ordinal not in range(128)'), <traceback object at 0x11145b6c8>))
.title() and .total_time() work, but by commenting lines out I traced the issue back to .ingredients(). I'm not a Python dev, so I didn't dig any deeper.
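The '\xa0' in the traceback is a non-breaking space in the ingredient text; under Python 2, mixing such unicode strings with byte strings triggers the implicit ascii decode seen here. Normalizing those characters before handing the data over is one workaround (a sketch; moving the script to Python 3 avoids the problem entirely):

```python
def normalize_spaces(text):
    """Replace non-breaking spaces (u'\xa0') with plain spaces and strip,
    so downstream byte-string handling doesn't hit the ascii codec."""
    return text.replace(u"\xa0", u" ").strip()
```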
I have a recipe app (for personal use) that uses your scrapers and I just built a dashboard to see what sites I use the most. Here's a list of the top ones that I use. So this isn't actually an issue, just a passive request to add any missing ones to the list if you have the time. Thanks again for such a great tool!
I tried scraper = scrape_me('https://www.tudogostoso.com.br/receita/7778-bobo-de-camarao.html')
in Python 3, and none of the values are populated afterwards.
How would I go about getting all the recipes from the site?
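There's no built-in way to enumerate a whole site; the usual approach is to walk the site's XML sitemap and feed each URL to scrape_me. A sketch (the sitemap path and its layout are site-specific assumptions):

```python
from xml.etree import ElementTree

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Extract every <loc> entry from a sitemap document, e.g. one fetched
    from https://example.com/sitemap.xml (hypothetical path)."""
    root = ElementTree.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]
```

Each returned URL could then be passed to scrape_me in a loop, ideally with some rate limiting to be polite to the site.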
Missing total_time, yields, ingredients, instructions. Cf.: https://www.mybakingaddiction.com/chocolate-coconut-zucchini-bread/
Scrape / scraper / scraping.
"Scrape" is pretty consistently spelled "scrap" throughout the project, which is a different word :)
Missing: yields, instructions. Cf.: https://www.simplyrecipes.com/recipes/one_pot_chicken_and_rice_soup/
Is there a reason why recipe images are not scraped?
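For what it's worth, most recipe pages expose the hero image via an og:image meta tag, so an image() method could be as small as this sketch (the method name and approach are hypothetical, not part of the library):

```python
from bs4 import BeautifulSoup


def image_url(html):
    """Read the Open Graph image URL from a page, if present."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", property="og:image")
    return tag["content"] if tag and tag.has_attr("content") else None
```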
A simple scrape_me('https://www.mybakingaddiction.com/tiramisu-trifles-recipe/')
is enough to trigger a certificate error. I glanced at the error, and the certificate received seemed to be for test.mybakingaddiction.com, which is odd, since a browser gets the correct certificate when opening the same URL.
from recipe_scrapers import scrape_me

scraper = scrape_me('https://closetcooking.com/bacon-guacamole-grilled-cheese-sandwich/')
scraper.title()
scraper.ingredients()
Missing instructions. Cf.: https://www.realsimple.com/food-recipes/browse-all-recipes/vanilla-cheesecake
I am using this project as the foundation for a web crawler to collect recipe data for a Natural Language Processing project I am working on. Would you accept a pull request that adds the ability to return the links that reference the calling domain?
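Such a method could filter the page's anchors down to those staying on the calling domain; a sketch of the idea (the helper name and behaviour follow the proposal, not any existing API):

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def same_domain_links(html, base_url):
    """Collect absolute links that stay on the calling domain --
    the kind of helper the proposal describes (name is hypothetical)."""
    soup = BeautifulSoup(html, "html.parser")
    host = urlparse(base_url).netloc
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        if urlparse(url).netloc == host:
            links.add(url)
    return sorted(links)
```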
Doesn't appear to parse the time correctly.
The parser also crashes when asked for the instructions, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Doesn't appear to parse the time correctly.
The parser also crashes when asked for the ingredients, due to missing elements.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
Missing title, yields, ingredients. Cf.: https://inspiralized.com/brussels-sprouts-and-apple-salad-with-parmesan/
Example recipe: https://inspiralized.com/spiralized-pasta-carbonara-two-ways/
I'm going to submit a few of these. I haven't yet looked at which sites most of my recipes come from, but this is one I have at least several from.
Side note: thanks for the great work! If I get the time, I'll write and PR some more scrapers. I mainly work in Ruby, but I'm sure I can figure it out from the examples.
Missing yields ("serving amounts"). Cf.: https://www.hellofresh.com/recipes/oven-baked-portobellos-and-chive-mashed-potatoes-5bd1efb3ae08b5111b5f2ea2
from recipe_scrapers import scrape_me

scraper = scrape_me('http://foodnetwork.com/recipes/accordion-potatoes-2807602')
scraper.total_time()
scraper.ingredients()
scraper.instructions()
You need to create a free account with them, but all recipes are available even if you don't subscribe.
URL: https://www.hellofresh.com/recipes
Looks like it's currently not picking up any elements for this site anymore.
I ran it on a test at https://www.finedininglovers.com/recipes/brunch/rocket-gorgonzola-souffle/
and didn't even get the title.
Doesn't appear to parse the time correctly.
Cf. https://gist.github.com/marcolussetti/49c46c893cea474677a0d0e789817364, where I have some quick-and-dirty checks of some random live recipes.
I've seen times like "Less than 30 minutes", "Under 1 hour", etc. Right now these get parsed as 0. Even if that's the intended behaviour, could you add some documentation to get_minutes explaining it?
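If parsing these phrases as upper bounds were wanted, a small extension could pull out the number and unit; a sketch (not the library's current behaviour, which returns 0 for such strings):

```python
import re


def get_minutes_upper_bound(text):
    """Read 'Less than 30 minutes' or 'Under 1 hour' as an upper-bound
    minute count; returns 0 when no number/unit pair is found."""
    match = re.search(r"(\d+)\s*(hours?|hrs?|minutes?|mins?)", text, re.IGNORECASE)
    if not match:
        return 0
    value = int(match.group(1))
    # units starting with 'h' are hours; everything else is minutes
    return value * 60 if match.group(2).lower().startswith("h") else value
```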