
recipe-scrapers's Issues

Share a project

I thought I'd share what I made with this: https://archive.org/details/recipes-en-201706
A full version of allrecipes, epicurious, cookstr, and bbc.co.uk, parsed into nice JSON with photos.

Sorry to abuse 'issues'; as far as I know there's no way to send a private message on GitHub.

New Website: Hello Fresh

Hi guys,

I want to add Hello Fresh to the list of recipe websites here. I've read through the contributing section and will get started soon, despite my limited programming background.

A couple of relevant questions:

  1. Is the scraping done using the Recipe Schema or some other way?
  2. The Hello Fresh website doesn't currently support that schema, so do you know of ways to work around that and collect all the metadata on a recipe?

Who can I PM in case I get stuck? Not with a Python error, but with the overall workflow.
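On question 1: scrapers in this space typically look first for schema.org Recipe data embedded as JSON-LD and fall back to site-specific HTML parsing when it's absent. A minimal stdlib-only sketch of the JSON-LD approach (the function name is hypothetical, not part of this library's API):

```python
import json
import re

def extract_recipe_jsonld(html):
    """Return the first schema.org Recipe object found in the page's
    JSON-LD blocks, or None. Hypothetical helper, stdlib only."""
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for match in re.finditer(pattern, html, re.DOTALL):
        try:
            data = json.loads(match.group(1))
        except ValueError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            # @type may also be a list, e.g. ["Recipe", "NewsArticle"]
            item_type = item.get("@type") if isinstance(item, dict) else None
            if item_type == "Recipe" or (
                isinstance(item_type, list) and "Recipe" in item_type
            ):
                return item
    return None
```

If a site exposes no schema at all, the fallback is CSS-selector scraping of the rendered HTML, which is what the site-specific scraper classes in this project do.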

BBC GoodFood instructions repeated twice

I've noticed that recipes scraped from BBC GoodFood don't properly pull out the instructions. For instance, scraping https://www.bbcgoodfood.com/recipes/herby-spring-chicken-pot-pie gives them as

"Heat oven to 200C/180C fan/gas 6. Heat the oil in a large, shallow casserole dish on a medium heat. Add the spring onions and fry for 3 mins, then stir through the frozen spinach and cook for 2 mins or until it\u2019s starting to wilt. Remove the skin from the chicken and discard. Shred the chicken off the bone and into the pan, and discard the bones. Stir through the stock and mustard. Bring to a simmer and cook, uncovered, for 5-10 mins. Stir in the peas, cr\u00e8me fra\u00eeche and herbs, then remove from the heat. Scrunch the filo pastry sheets over the mixture, brush with a little oil and bake for 15-20 mins or until golden brown. MethodHeat oven to 200C/180C fan/gas 6. Heat the oil in a large, shallow casserole dish on a medium heat. Add the spring onions and fry for 3 mins, then stir through the frozen spinach and cook for 2 mins or until it\u2019s starting to wilt. Remove the skin from the chicken and discard. Shred the chicken off the bone and into the pan, and discard the bones. Stir through the stock and mustard. Bring to a simmer and cook, uncovered, for 5-10 mins.Stir in the peas, cr\u00e8me fra\u00eeche and herbs, then remove from the heat. Scrunch the filo pastry sheets over the mixture, brush with a little oil and bake for 15-20 mins or until golden brown. Recipe from Good Food magazine, April 2019"

While I'm here, I've noticed that most of the scrapers pull out just the recipe name, ingredients, and instructions. Would contributions of scrapers that extract more information (e.g. serving size) be accepted?
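The output above contains the instructions twice, glued together by a stray "Method" heading, which suggests the scraper's selector matches two copies of the instructions block in the page. The real fix is a tighter selector, but a defensive dedupe pass is possible in the meantime; this is a heuristic sketch, not the library's behavior:

```python
import re

def dedupe_instructions(text):
    """Heuristic clean-up: if the instructions block is the same text
    twice (joined by a stray "Method" heading), keep one copy.
    A sketch only; trailing extras like "Recipe from ..." will defeat
    the exact-halves comparison."""
    cleaned = re.sub(r'Method', ' ', text)   # drop the stray heading
    normalized = ' '.join(cleaned.split())   # collapse whitespace
    half = len(normalized) // 2
    first, second = normalized[:half].strip(), normalized[half:].strip()
    return first if first == second else normalized
```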

BBC Good Food - Ingredients also scrape tooltip.

When I scrape a recipe from BBC Good Food, if an ingredient has a tooltip, the tooltip text is included in the ingredient text.

Example

import json
from recipe_scrapers import scrap_me  # sic: the function is named "scrap_me" in this version

scraper = scrap_me('https://www.bbcgoodfood.com/recipes/artichoke-watercress-linguine')

print(json.dumps(
    {
        'title': scraper.title(),
        'totalTime': scraper.total_time(),
        'ingredients': scraper.ingredients(),
        'instructions': scraper.instructions()
    }
))

The result is

{
   "title":"Artichoke & watercress linguine",
   "totalTime":0,
   "ingredients":[
      "100g watercress Watercress wort-er-cressWith deep green leaves, and crisp, paler stems, watercress is related to mustard and is one of\u2026", //- here
      "280g jar artichokes in olive oil",
      "60g ricotta Ricotta ree-cot-aRicotta is an Italian curd cheese. Made from whey, it is traditionally a by-product of making\u2026", //- here
      "220g dried linguine"
   ],
   "instructions":"Blitz together the watercress, \u00be of the artichokes, the ricotta and 3 tbsp olive oil from the jar, then season to taste.\nBring a large pan of salted water to the boil and cook the linguine following pack instructions until al dente. Toss the pasta with the watercress pesto along with the remaining artichokes and a ladleful of pasta water. Finish with an extra drizzle of olive oil and black pepper."
}

Notice 100g watercress Watercress... and 60g ricotta Ricotta ree-cot-aRicotta...
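The tooltip text lives in a child element of the ingredient node, so one fix is to drop those child elements before reading the text. A stdlib-only sketch (the `tooltip` class name is an assumption; check the actual BBC Good Food markup for the real hook):

```python
from html.parser import HTMLParser

class _TooltipStripper(HTMLParser):
    """Collect text while skipping any element whose class mentions
    'tooltip'. The class name is an assumption, not the site's real one."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a tooltip element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or 'tooltip' in dict(attrs).get('class', ''):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def ingredient_text(html):
    parser = _TooltipStripper()
    parser.feed(html)
    return ' '.join(''.join(parser.parts).split())
```

With a parser like BeautifulSoup the same idea is usually spelled as removing the tooltip nodes and then reading the remaining text.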

Investigate why Fine Dining Lovers is returning 404s

The scraper is getting 404s, whereas the links work fine in a real browser.

A simple scrape_me('https://www.finedininglovers.com/recipes/brunch/rocket-gorgonzola-souffle/') is sufficient to trigger the error.

I have done no investigation into what the issue might be, but I imagine the site could just be returning 404s based on the user agent.
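A quick way to test the user-agent theory is to refetch with a browser-like header and see whether the 404 goes away. A stdlib sketch (the header string is an arbitrary browser-like example; the URL is the one from the report):

```python
import urllib.request

# Browser-like User-Agent to rule out agent-based blocking (an assumption
# about the cause, not a confirmed diagnosis).
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch(url):
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request) as response:  # needs network access
        return response.read().decode("utf-8", errors="replace")
```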

Enhancement

I would like to add support for FoodNetwork.com. I will submit a pull request shortly.

Fix Budget Bytes

Budget Bytes is missing yields. The test HTML page probably needs updating alongside the fix.

Question on how to parse / what goes into instructions

Hey, I was trying to write a scraper for recipes created with the WordPress recipe maker, but I am unsure about what should go into instructions.
example: soft-baked-gingerbread-cookies

I have two questions about it:

  1. Should instructions that come as a numbered list keep those numbers, or just be joined with newlines? I feel like dropping the numbers removes some of the information from the original.
  2. Should supplementary information, such as notes, also go into instructions? In some cases this seems to make more sense than in others, e.g. when the notes contain possible substitutions vs. when they just contain some sort of story about the recipe.

Generally, it might be helpful to document these conventions somewhere, don't you think?
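One possible convention for question 1 (a suggestion, not project policy): strip the source numbering and join steps with newlines, since callers can always renumber with `enumerate()` if they want ordinals back.

```python
import re

def join_steps(steps):
    """Strip leading ordinals like "1." or "2)" from each step and join
    with newlines. A suggested convention, not the library's documented
    behavior."""
    return '\n'.join(
        re.sub(r'^\s*\d+[.)]\s*', '', step).strip() for step in steps
    )
```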

Add support for Giallo Zafferano

This is an enhancement proposal.

I would like to add Giallo Zafferano to the list of supported scrapers. As far as I know, it's by far the largest collection of recipes in the Italian language.

I am about to submit a pull request for this and will describe the implementation details there.

Error scraping ingredients on allrecipes.com source

Not sure how widespread this error is (everything on the domain or just this one recipe), but I have an automated test suite in a Rails app that runs this script, and it just started failing.

from recipe_scrapers import scrape_me
import sys
import json

try:
    # give the url as a string; it can be a url from any supported site
    scraper = scrape_me(str(sys.argv[1]))

    # data = {
    #     "title": scraper.title(),
    #     "total_time": scraper.total_time(),
    #     "ingredients": scraper.ingredients(),
    #     "instructions": scraper.instructions()
    # }

    print(scraper.ingredients())

    # with open('tmp/recipe_data.json', 'w') as outfile:
    #     json.dump(data, outfile)
except Exception:
    print("Error while scraping recipe", sys.exc_info())

Seeing

('Error while scraping recipe', (<type 'exceptions.UnicodeDecodeError'>, UnicodeDecodeError('ascii', '\xa0', 0, 1, 'ordinal not in range(128)'), <traceback object at 0x11145b6c8>))

for url https://www.allrecipes.com/recipe/45736/chicken-tikka-masala/?internalSource=hub%20recipe&referringId=2264&referringContentType=recipe%20hub&clickId=cardslot%2013

.title() and .total_time() work, but by commenting lines out I traced the issue to .ingredients(). I'm not a Python dev, so I didn't dig any deeper.
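The traceback points at `'\xa0'`, which is U+00A0, a non-breaking space that allrecipes.com uses inside quantities. Under Python 2, printing a list that mixes such Unicode with byte strings raises exactly this `UnicodeDecodeError`. One workaround (shown in Python 3 syntax, as a hypothetical clean-up pass, not something the library does) is to normalise the ingredients before output:

```python
def clean_ingredients(items):
    # Replace non-breaking spaces with regular spaces and trim; this
    # sidesteps the Python 2 ascii-codec crash when printing.
    return [item.replace(u'\xa0', u' ').strip() for item in items]
```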

More scrapers request

I have a recipe app (for personal use) that uses your scrapers and I just built a dashboard to see what sites I use the most. Here's a list of the top ones that I use. So this isn't actually an issue, just a passive request to add any missing ones to the list if you have the time. Thanks again for such a great tool!

(screenshot: dashboard of most-used recipe sites)

TudoGostoso returning empty values

I tried scraper = scrape_me('https://www.tudogostoso.com.br/receita/7778-bobo-de-camarao.html') in Python 3, and none of the values are populated afterwards.

All the recipes

How would I go about getting all the recipes from a site?

(Spelling) Scrap -> Scrape

Scrape / scraper / scraping.
"Scrape" is pretty consistently spelled "scrap" throughout the project, which is a different word :)

BBC food scraper no longer working

  1. Title returns as an empty string, time is zero, ingredients is an empty list, and instructions is an empty string; however, the links dictionary is still returned.
  2. The scraper does not support the current page's link: https://www.bbc.com/food/recipes/_bacon_chop_with_hispi_99182
    recipe_scrapers.WebsiteNotImplementedError: Website (bbc.com) is not supported
    Note: The scraper can identify the site when the link is rewritten as https://www.bbc.co.uk/food/recipes/_bacon_chop_with_hispi_99182

Investigate SSL certificate issue on My Baking Addiction

A simple scrape_me('https://www.mybakingaddiction.com/tiramisu-trifles-recipe/') is enough to trigger a certificate error. I quickly glanced at the error, and it seemed like the certificate received was for test.mybakingaddiction.com, which is odd, since the browser receives the correct certificate when opening the same URL.

Links Function

I am using this project as the foundation for a web crawler to collect recipe data for a Natural Language Processing project I am working on. Would you accept a pull request that adds the ability to return the links on the page that reference the calling domain?
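A stdlib-only sketch of what such a links() feature could look like: collect anchor hrefs that are relative or point back at the calling domain. All names here are illustrative, not the library's API:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class SameDomainLinks(HTMLParser):
    """Collect <a href> values that point back at the given domain
    (relative links count too). Illustrative sketch for a crawler."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if not href:
            return
        host = urlparse(href).netloc
        # '' means a relative link; the '.' prefix avoids matching
        # lookalike hosts such as notexample.com.
        if host in ('', self.domain) or host.endswith('.' + self.domain):
            self.links.append(href)
```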

Inspiralized.com support

Example recipe: https://inspiralized.com/spiralized-pasta-carbonara-two-ways/

I'm going to submit a few of these and haven't yet looked at which sites I have the majority of my recipes from but this is one that I have at least several from.

Side note, thanks for the great work! If I get the time I'll write and PR some more scrapers. I'm mainly in ruby but I'm sure I can figure it out from examples.

Fix Fine Dining Lovers

Looks like it's no longer picking up any elements for this site.

I ran a test against https://www.finedininglovers.com/recipes/brunch/rocket-gorgonzola-souffle/ and didn't even get the title.

"Less than" times

I've seen some times like "Less than 30 minutes", "Under 1 hour", etc. Right now these get parsed as 0. Even if that's the intended behavior, could you add some documentation to get_minutes explaining it?
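One option, sketched below, is to treat such bounded phrases as their upper bound instead of 0. This is a proposal for how a get_minutes-style helper could behave, not the library's current implementation:

```python
import re

def get_minutes(text):
    """Parse "Less than 30 minutes" -> 30, "Under 1 hour" -> 60,
    "1 hour 15 minutes" -> 75. A forgiving-parser sketch: bounded
    phrases resolve to the bound rather than 0."""
    text = text.lower()
    total = 0
    hours = re.search(r'(\d+)\s*hour', text)
    if hours:
        total += int(hours.group(1)) * 60
    minutes = re.search(r'(\d+)\s*min', text)
    if minutes:
        total += int(minutes.group(1))
    return total
```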
