opengraph's Issues

import opengraph

from opengraph import OpenGraph
This fails with "cannot import name 'OpenGraph'" in opengraph/__init__.py.

Not working in Python 3

It works when I run with Python 2, but when I run with Python 3 I get the following error.

Traceback (most recent call last):
  File "og.py", line 9, in <module>
    import opengraph
  File "/usr/local/lib/python3.5/dist-packages/opengraph/__init__.py", line 1, in <module>
    from opengraph import OpenGraph
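
The failing line in the traceback is an implicit relative import, which Python 3 no longer supports: inside a package's __init__.py, "from opengraph import ..." refers to the package itself, not to a module of the same name inside it. The sketch below reproduces the behaviour with a throwaway package. The names demo_pkg, Thing and try_import are hypothetical, and it assumes the library keeps its class in an inner module named like the package, which the traceback suggests but does not prove.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(init_source):
    """Build a throwaway package `demo_pkg` and report whether it imports."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"
        pkg.mkdir()
        # Inner module with the same name as the package (assumed layout).
        (pkg / "demo_pkg.py").write_text("class Thing:\n    pass\n")
        (pkg / "__init__.py").write_text(init_source)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_pkg")
            return "ok"
        except ImportError as exc:
            return "ImportError: {}".format(exc)
        finally:
            sys.path.remove(tmp)
            # Forget the package so the next call starts clean.
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
                del sys.modules[name]


# Python 2 style: implicit relative import, fails under Python 3.
py2_style = try_import("from demo_pkg import Thing\n")
# Explicit relative import: works under Python 3 (and Python 2.5+).
py3_style = try_import("from .demo_pkg import Thing\n")
print(py2_style)
print(py3_style)
```

If that assumption holds, changing the first line of opengraph/__init__.py to "from .opengraph import OpenGraph" would be the corresponding fix.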

How to set custom User Agent?

Udemy.com blocks the default User-Agent that opengraph sends, so I get:

urllib2.HTTPError: HTTP Error 403: Unauthorized

How do I set a custom User-Agent for the OpenGraph module?

As a workaround, I have created a custom getter using the requests module:

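import requests
from opengraph import OpenGraph

try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2


def custom_get_img_from_link(link):
    """Fetch `link` with a custom User-Agent and return its og:image URL, or None."""
    # headers = {"User-Agent": get_random_UA()}
    headers = {"User-Agent": "My bot"}
    r = requests.get(link, headers=headers)

    parsed_uri = urlparse(link)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    OpenGraph.parser = parser  # `parser` is a custom parse function defined elsewhere (not shown)
    OpenGraph.scrape = True  # workaround for some subtle bug in opengraph

    page = OpenGraph(html=r.content)

    if page.is_valid():
        image_url = page.get('image', None)

        # Resolve relative image paths against the site root; skip if no image was found.
        if image_url and not image_url.startswith('http'):
            image_url = urljoin(domain, image_url)

        return image_url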
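
If adding requests as a dependency is not desirable, the same idea can be sketched with the standard library. This is only a sketch under assumptions: build_request and fetch_html are hypothetical helper names, not part of opengraph, and the snippet targets Python 3. The downloaded bytes could then be passed to OpenGraph(html=...), as in the workaround above.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="My bot"):
    """Return a Request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="My bot", timeout=10):
    """Download the page body using the custom User-Agent (performs network I/O)."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch_html does.
request = build_request("https://www.udemy.com/")
print(request.get_header("User-agent"))
```

Note that urllib stores header names capitalized, so the header is looked up as "User-agent".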

Duplicate data

I ran these commands in IPython:

from opengraph import OpenGraph
og=OpenGraph("http://www.livemint.com/Industry/VlSovF4AGkwhupYQ2Ps3YN/Bschool-placements-Modest-rise-in-average-salary-offered.html")
og

og=OpenGraph("http://facebook.com")
og

output1:
{'locale:alternate': 'ja_JP', 'site_name': 'http://www.livemint.com/', 'description': 'Initial analysis of the ongoing placement season at IIMs and other top B-schools indicate that most salaries offered show a single digit growth over last year', 'title': 'B-school placements see modest salary growth, fewer offers from start-ups', 'url': 'http://www.livemint.com/Industry/VlSovF4AGkwhupYQ2Ps3YN/Bschool-placements-Modest-rise-in-average-salary-offered.html', 'image': 'http://www.livemint.com/rf/Image-621x414/LiveMint/Period2/2017/02/08/Photos/Processed/[email protected]', 'locale': 'hi_IN', 'type': 'article'}

output2:
{'locale:alternate': 'ja_JP', 'site_name': 'Facebook', 'description': 'Initial analysis of the ongoing placement season at IIMs and other top B-schools indicate that most salaries offered show a single digit growth over last year', 'title': 'B-school placements see modest salary growth, fewer offers from start-ups', 'url': 'https://www.facebook.com/', 'image': 'https://www.facebook.com/images/fb_icon_325x325.png', 'locale': 'hi_IN', 'type': 'article'}

The description and title are the same in both outputs, even though the two calls fetched different pages.
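
The report does not identify the cause. One common Python pitfall that can produce this kind of symptom is state kept in a mutable class attribute, which every instance then shares. The sketch below only illustrates that pitfall; SharedState and PerInstanceState are hypothetical classes, and nothing here is taken from the opengraph source.

```python
class SharedState:
    data = {}  # class attribute: one dict shared by every instance

    def __init__(self, **values):
        self.data.update(values)  # mutates the shared dict


class PerInstanceState:
    def __init__(self, **values):
        self.data = dict(values)  # fresh dict for each instance


first = SharedState(title="Article", description="About salaries")
second = SharedState(site_name="Facebook")
leaked = second.data  # still contains the first object's title and description

clean = PerInstanceState(site_name="Facebook").data
print(leaked)
print(clean)
```

If something similar is happening in the library, values from the first page would survive wherever the second page does not overwrite them, which would match the outputs above.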

Make it possible to specify the parser for BeautifulSoup4

If you have lxml installed, BeautifulSoup4 will set lxml as the default parser, so it would be better to be able to specify the parser depending on the situation.

doc = BeautifulSoup(html)

Because no parser is passed in the call above, BeautifulSoup falls back to whichever parser it considers the default.

Depending on the environment, this can lead to problems such as the one reported in
#37

As a solution, I suggest adding a parser argument to the following signature so that callers can choose the parser:

def __init__(self, url=None, html=None, scrape=False, **kwargs):

Licensing

Is it possible to add licensing information to this project? I'd like to modify it to suit my needs in a commercial product.

Some OG tags not found

Some OG tags are not found if they are in the body rather than the head. Currently the code only checks doc.html.head. I will provide a fix that searches the whole HTML document (maybe this should be configurable?).

Metadata not in head but in the body

Hi,

I am having an issue getting the metadata using opengraph_py3, urllib and bs4.

The parser method only checks <head>, but <meta> tags are sometimes in the body. Any ideas how I can fix this? Is it due to the User-Agent?

  • urllib3 1.23
  • opengraph-py3 0.71
  • beautifulsoup4 4.6.0

import re
import opengraph_py3 as opengraph
import urllib.request
from bs4 import BeautifulSoup

raw = urllib.request.FancyURLopener().open("https://youtu.be/DQwU_kU4pUg")
html = raw.read()
soap = BeautifulSoup(html, 'html.parser')

# This is the same code as in `parser`
soap.html.head.findAll(property=re.compile(r'^og'))
# []

soap.html.body.findAll(property=re.compile(r'^og'))
# [<meta content="YouTube" property="og:site_na....]
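
One possible way around this, sketched under assumptions, is to scan every <meta> tag in the document instead of only those under <head>. The sketch uses html.parser from the standard library rather than BeautifulSoup so that it runs without extra packages. OGMetaCollector and collect_og_tags are hypothetical names, not part of opengraph or opengraph_py3, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class OGMetaCollector(HTMLParser):
    """Collect og:* <meta> tags wherever they appear in the document."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and attr_map.get("content") is not None:
            # Strip the "og:" prefix and keep the first value seen per property.
            self.tags.setdefault(prop[3:], attr_map["content"])


def collect_og_tags(html):
    """Return a dict of Open Graph properties found anywhere in `html`."""
    collector = OGMetaCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Made-up page whose og tags sit in <body>, as in the report above.
sample = """
<html>
  <head><title>Demo</title></head>
  <body>
    <meta property="og:site_name" content="YouTube">
    <meta property="og:title" content="Example video">
  </body>
</html>
"""
result = collect_og_tags(sample)
print(result)
```

With BeautifulSoup, the equivalent change to the snippet above would be to search the whole document, for example soap.findAll(property=re.compile(r'^og')), rather than only soap.html.head.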

Warning from BeautifulSoup

C:\Python27\lib\site-packages\bs4\__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html5lib")

  markup_type=markup_type))
