microdata's Introduction

microdata

microdata.py is a small utility library for extracting HTML5 Microdata from HTML. It depends on html5lib to do the heavy lifting of building the DOM. For more about HTML5 Microdata, check out Mark Pilgrim's chapter on it in Dive Into HTML5.

Command Line

When you install microdata via pip it will also install a command line utility:

$ microdata https://www.youtube.com/watch?v=dQw4w9WgXcQ
{
  "items": [
    {
      "type": [
        "http://schema.org/VideoObject"
      ],
      "properties": {
        "url": [
          "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
        ],
        "name": [
          "Rick Astley - Never Gonna Give You Up (Official Music Video)"
        ],
        "description": [
          "The official video for \u201cNever Gonna Give You Up\u201d by Rick Astley \u201cNever Gonna Give You Up\u201d was a global smash on its release in July 1987, topping the charts ..."
        ],
        "paid": [
          "False"
        ],
        "channelId": [
          "UCuAXFkgsw1L7xaCfnd5JJOw"
        ],
        "videoId": [
          "dQw4w9WgXcQ"
        ],
        "duration": [
          "PT3M33S"
        ],
        "unlisted": [
          "False"
        ],
        "author": [
          {
            "type": [
              "http://schema.org/Person"
            ],
            "properties": {
              "url": [
                "http://www.youtube.com/channel/UCuAXFkgsw1L7xaCfnd5JJOw"
              ],
              "name": [
                ""
              ]
            }
          }
        ],
        "thumbnailUrl": [
          "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg"
        ],
        "thumbnail": [
          {
            "type": [
              "http://schema.org/ImageObject"
            ],
            "properties": {
              "url": [
                "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg"
              ],
              "width": [
                "1280"
              ],
              "height": [
                "720"
              ]
            }
          }
        ],
        "embedUrl": [
          "https://www.youtube.com/embed/dQw4w9WgXcQ"
        ],
        "playerType": [
          "HTML5 Flash"
        ],
        "width": [
          "1280"
        ],
        "height": [
          "720"
        ],
        "isFamilyFriendly": [
          "true"
        ],
        "regionsAllowed": [
          "AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,AZ,BA,BB,BD,BE,BF,BG,BH,BI,BJ,BL,BM,BN,BO,BQ,BR,BS,BT,BV,BW,BY,BZ,CA,CC,CD,CF,CG,CH,CI,CK,CL,CM,CN,CO,CR,CU,CV,CW,CX,CY,CZ,DE,DJ,DK,DM,DO,DZ,EC,EE,EG,EH,ER,ES,ET,FI,FJ,FK,FM,FO,FR,GA,GB,GD,GE,GF,GG,GH,GI,GL,GM,GN,GP,GQ,GR,GS,GT,GU,GW,GY,HK,HM,HN,HR,HT,HU,ID,IE,IL,IM,IN,IO,IQ,IR,IS,IT,JE,JM,JO,JP,KE,KG,KH,KI,KM,KN,KP,KR,KW,KY,KZ,LA,LB,LC,LI,LK,LR,LS,LT,LU,LV,LY,MA,MC,MD,ME,MF,MG,MH,MK,ML,MM,MN,MO,MP,MQ,MR,MS,MT,MU,MV,MW,MX,MY,MZ,NA,NC,NE,NF,NG,NI,NL,NO,NP,NR,NU,NZ,OM,PA,PE,PF,PG,PH,PK,PL,PM,PN,PR,PS,PT,PW,PY,QA,RE,RO,RS,RU,RW,SA,SB,SC,SD,SE,SG,SH,SI,SJ,SK,SL,SM,SN,SO,SR,SS,ST,SV,SX,SY,SZ,TC,TD,TF,TG,TH,TJ,TK,TL,TM,TN,TO,TR,TT,TV,TW,TZ,UA,UG,UM,US,UY,UZ,VA,VC,VE,VG,VI,VN,VU,WF,WS,YE,YT,ZA,ZM,ZW"
        ],
        "interactionCount": [
          "1141688870"
        ],
        "datePublished": [
          "2009-10-24"
        ],
        "uploadDate": [
          "2009-10-24"
        ],
        "genre": [
          "Music"
        ]
      }
    }
  ]
}
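
The JSON printed by the command line tool can be consumed from any script. Here is a minimal sketch, assuming the output shape shown above (an "items" list whose entries carry "type" and "properties", with every property value being a list). The first_value helper is hypothetical and not part of microdata:

```python
import json

# A trimmed copy of the command line output shown above.
raw = """
{
  "items": [
    {
      "type": ["http://schema.org/VideoObject"],
      "properties": {
        "duration": ["PT3M33S"],
        "author": [
          {
            "type": ["http://schema.org/Person"],
            "properties": {
              "url": ["http://www.youtube.com/channel/UCuAXFkgsw1L7xaCfnd5JJOw"]
            }
          }
        ]
      }
    }
  ]
}
"""


def first_value(item, name, default=None):
    """Hypothetical helper: return the first value of a property, or default."""
    values = item.get("properties", {}).get(name, [])
    return values[0] if values else default


video = json.loads(raw)["items"][0]
author = first_value(video, "author")  # nested items are plain dicts

print(first_value(video, "duration"))
print(first_value(author, "url"))
```

Because every property is a list, a helper like this keeps callers from indexing into an empty list when a page omits a property.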

Library

Here's the basic usage from Python using https://raw.github.com/edsu/microdata/master/test-data/example.html as an example:

>>> import microdata
>>> import urllib.request
>>> url = "https://raw.github.com/edsu/microdata/master/test-data/example.html"
>>> items = microdata.get_items(urllib.request.urlopen(url))
>>> item = items[0]
>>> item.itemtype
[http://schema.org/Person]
>>> item.name
'Jane Doe'
>>> item.colleagues
'http://www.xyz.edu/students/alicejones.html'
>>> item.get_all('colleagues')
['http://www.xyz.edu/students/alicejones.html', 'http://www.xyz.edu/students/bobsmith.html']
>>> print(item.json())
{
  "type": [
    "http://schema.org/Person"
  ],
  "id": "http://www.xyz.edu/~jane",
  "properties": {
    "colleagues": [
      "http://www.xyz.edu/students/alicejones.html",
      "http://www.xyz.edu/students/bobsmith.html"
    ],
    "name": [
      "Jane Doe"
    ],
    "url": [
      "http://www.janedoe.com"
    ],
    "jobTitle": [
      "Professor"
    ],
    "image": [
      "janedoe.jpg"
    ],
    "telephone": [
      "(425) 123-4567"
    ],
    "address": [
      {
        "type": [
          "http://schema.org/PostalAddress"
        ],
        "properties": {
          "addressLocality": [
            "Seattle"
          ],
          "addressRegion": [
            "WA"
          ],
          "streetAddress": [
            "\n          20341 Whitworth Institute\n          405 N. Whitworth\n        "
          ],
          "postalCode": [
            "98052"
          ]
        }
      }
    ],
    "email": [
      "mailto:[email protected]"
    ]
  }
}

License

  • CC0

microdata's People

Contributors

acdha, cameronmarlow, edsu, jeroenl, jesuslosada, joke2k, kenlsm, narphorium, ricardokirkner, robhammond, staeff, theofilis, timkaye11, tweekmonster

microdata's Issues

handle dates

Might be worth parsing dates and times into Python objects. But it might suck too :-)
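
For a feel of what this could involve, here is a rough sketch using only the standard library. It assumes property values such as "2009-10-24" and "PT3M33S" as seen in the command line output; parse_date and parse_duration are hypothetical names, and the duration pattern only covers days, hours, minutes and whole seconds, not the full ISO 8601 grammar:

```python
import re
from datetime import date, timedelta

# Subset of ISO 8601 durations: PnDTnHnMnS with integer parts only.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_date(value):
    """Return a datetime.date for an ISO date string, or None if it is invalid."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def parse_duration(value):
    """Return a timedelta for a simple ISO 8601 duration, or None if unsupported."""
    match = _DURATION.match(value)
    if match is None:
        return None
    parts = {key: int(num) for key, num in match.groupdict().items() if num is not None}
    return timedelta(**parts)


print(parse_date("2009-10-24"))
print(parse_duration("PT3M33S"))
```

Returning None instead of raising leaves the caller free to fall back to the original string, which may matter given how messy dates on the web can be.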

Loading microdata from certain sites

Hi,

Is there any way to get microdata.py to run on e-commerce sites such as Zalando.com or Otto.de? Both provide the data, but the microdata script just returns an empty string.

Thanks!

Robert

invalid microdata

So the web will undoubtedly be littered with messed up microdata. This package could use some testing of edge cases like:

  • itemprop but no itemscope
  • itemprop without an appropriate value
  • OK item with nested bad item
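
As a starting point for the first case, here is a small standard-library sketch that flags itemprop attributes with no enclosing itemscope. It is not how microdata.py works; it assumes well-nested markup (every non-void element is closed) and ignores itemref. OrphanItempropFinder is a hypothetical name:

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class OrphanItempropFinder(HTMLParser):
    """Collect itemprop values that have no ancestor carrying itemscope."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one bool per open element: does it have itemscope?
        self.orphans = []  # itemprop values found outside any item

    def handle_starttag(self, tag, attrs):
        names = dict(attrs)
        if "itemprop" in names and not any(self.stack):
            self.orphans.append(names["itemprop"])
        if tag not in VOID:
            self.stack.append("itemscope" in names)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


finder = OrphanItempropFinder()
finder.feed(
    '<div><span itemprop="name">Foo</span></div>'
    '<div itemscope><span itemprop="url">bar</span></div>'
)
print(finder.orphans)
```

A check like this could feed test fixtures: any document where orphans is non-empty is a candidate edge case for the parser.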

Latest version of html5lib throws TypeError with example code

Latest version of html5lib (v0.999999999 (nine 9s)) throws TypeError when using example code:

Traceback (most recent call last):                                                           
  File "C:\Users\landonjx\Documents\scrape\scrape.py", line 10, in <module>                  
    items = microdata.get_items(urllib.urlopen('http://www.telegraph.co.uk/'))               
  File "C:\Python27\lib\site-packages\microdata.py", line 21, in get_items                   
    tree = parser.parse(location, encoding=encoding)                                         
  File "C:\Python27\lib\site-packages\html5lib\html5parser.py", line 235, in parse           
    self._parse(stream, False, None, *args, **kwargs)                                        
  File "C:\Python27\lib\site-packages\html5lib\html5parser.py", line 85, in _parse           
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)                 
  File "C:\Python27\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__          
    self.stream = HTMLInputStream(stream, **kwargs)                                          
  File "C:\Python27\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)                                           
TypeError: __init__() got an unexpected keyword argument 'encoding'                          

v0.9999999 (seven 9s) works fine.
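
One possible workaround is to try the newer keyword first and fall back to the old one. This is a sketch only: it assumes that newer html5lib releases take the encoding as transport_encoding rather than encoding, and parse_with_encoding is a hypothetical helper, not the fix that landed in microdata.py:

```python
def parse_with_encoding(parser, stream, encoding=None):
    """Call parser.parse() with whichever encoding keyword the parser accepts.

    Assumption: newer html5lib versions expect ``transport_encoding`` while
    older ones expect ``encoding``. A TypeError raised for another reason
    inside parse() would also trigger the fallback, so treat this as a sketch.
    """
    if encoding is None:
        # No hint to pass along, so avoid the keyword entirely.
        return parser.parse(stream)
    try:
        return parser.parse(stream, transport_encoding=encoding)
    except TypeError:
        return parser.parse(stream, encoding=encoding)
```

With html5lib installed, parser would be an html5lib.HTMLParser instance and stream the object returned by urlopen.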

itemtype can be multivalued

Since the value of the itemtype attribute can be multivalued the resulting JSON should show the values as an array. I believe this may be a change since this library was first written.

From the spec concerning the JSON serialization:

If the item has any item types, add an entry to result called "type" whose value is an array listing the item types of item, in the order they were specified on the itemtype attribute.

Here's where the itemtype attribute is specified:
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#attr-itemtype
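
A minimal sketch of the serialization described above, assuming the itemtype attribute is a whitespace-separated list of tokens. The function names are hypothetical and the second URL is made up for illustration:

```python
def item_types(attr_value):
    """Split an itemtype attribute into a list, keeping order, dropping repeats."""
    types = []
    for token in (attr_value or "").split():
        if token not in types:
            types.append(token)
    return types


def serialize(attr_value, properties):
    """Build the JSON-ready dict; "type" is only present when types exist."""
    result = {}
    types = item_types(attr_value)
    if types:
        result["type"] = types
    result["properties"] = properties
    return result


# "http://example.org/Employee" is an illustrative second type, not from the spec.
print(serialize("http://schema.org/Person http://example.org/Employee",
                {"name": ["Jane Doe"]}))
```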

href value of a tags always used as property values in preference to their text content

I'm not quite sure how to solve this correctly for all instances, but consider the following schema: http://schema.org/TVEpisode and the WHATWG spec: "When a string value is a URL, it is expressed using the a element and its href attribute, the img element and its src attribute, or other elements that link to or embed external resources." (http://www.whatwg.org/specs/web-apps/current-work/#the-basic-syntax)

The following HTML fragment will then have its microdata parsed as name='#', instead of name='Foo', even though the schema says the type is Text and not URL.

<div itemscope itemtype="http://schema.org/TVEpisode">
    <a itemprop="name" href="#">Foo</a>
</div>
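
To make the behaviour concrete, here is a sketch of the per-element value rule as I read it in the WHATWG spec, written without any DOM library. The prefer_text flag is a hypothetical escape hatch for cases like the one above; it is not something the spec or microdata.py defines, and URL values are not resolved to absolute form here:

```python
# Elements whose property value comes from a URL-bearing attribute.
URL_ATTRIBUTE = {
    "a": "href", "area": "href", "link": "href",
    "audio": "src", "embed": "src", "iframe": "src", "img": "src",
    "source": "src", "track": "src", "video": "src",
    "object": "data",
}


def property_value(tag, attrs, text, prefer_text=False):
    """Return the property value for an element given its tag, attributes and text.

    prefer_text is a hypothetical override: when True, URL-bearing elements
    yield their text content instead of the attribute.
    """
    if tag == "meta":
        return attrs.get("content", "")
    if tag in URL_ATTRIBUTE and not prefer_text:
        return attrs.get(URL_ATTRIBUTE[tag], "")
    if tag in ("data", "meter"):
        return attrs.get("value", "")
    if tag == "time":
        return attrs.get("datetime", text)
    return text


print(property_value("a", {"href": "#"}, "Foo"))                    # spec rule
print(property_value("a", {"href": "#"}, "Foo", prefer_text=True))  # override
```

Deciding when to set such an override would need knowledge of the vocabulary (for example, that name is Text in schema.org), which is outside what the parsing rule itself can know.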

multiple elements

It doesn't work with multiple elements, as with SiteNavigationElement: it detects only one element and takes only the first name and url properties.

handle itemref attributes

The itemref attribute needs to be made part of the parsing algorithm. The use of itemref in the wild might be minimal, but microdata.py will currently give incorrect results when parsing a document that uses itemref.

You can use this example document extracted from the spec:
https://github.com/jronallo/microdata/blob/master/test/data/example_itemref.html
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#the-basic-syntax

The results should look like this:

{
  "items": [
    {
      "properties": {
        "name": [
          "Amanda"
        ],
        "band": [
          {
            "properties": {
              "name": [
                "Jazz Band"
              ],
              "size": [
                "12"
              ]
            }
          }
        ]
      }
    }
  ]
}
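
The crawl could be sketched roughly as follows. This uses xml.dom.minidom on an XML-ified copy of the spec example (hence itemscope=""), not html5lib, and the function names are hypothetical. It has no protection against itemref cycles and always uses text content as the value:

```python
import json
from xml.dom import minidom

DOC = """
<body>
  <div itemscope="" id="amanda" itemref="a b"></div>
  <p id="a">Name: <span itemprop="name">Amanda</span></p>
  <div id="b" itemprop="band" itemscope="" itemref="c"></div>
  <div id="c">
    <p>Band: <span itemprop="name">Jazz Band</span></p>
    <p>Size: <span itemprop="size">12</span> players</p>
  </div>
</body>
"""


def children(node):
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def text_of(node):
    return "".join(
        c.data if c.nodeType == c.TEXT_NODE else text_of(c)
        for c in node.childNodes
        if c.nodeType in (c.TEXT_NODE, c.ELEMENT_NODE)
    )


def crawl(element, ids, props):
    """Add the properties found at or below element to props."""
    if element.hasAttribute("itemprop"):
        nested = element.hasAttribute("itemscope")
        value = build_item(element, ids) if nested else text_of(element)
        for name in element.getAttribute("itemprop").split():
            props.setdefault(name, []).append(value)
        if nested:
            return  # the nested item owns everything below it
    elif element.hasAttribute("itemscope"):
        return  # an unrelated item; its properties are not ours
    for child in children(element):
        crawl(child, ids, props)


def build_item(element, ids):
    """Collect properties from the element's subtree, then from its itemref targets."""
    props = {}
    for child in children(element):
        crawl(child, ids, props)
    for ref in element.getAttribute("itemref").split():
        if ref in ids:
            crawl(ids[ref], ids, props)
    return {"properties": props}


dom = minidom.parseString(DOC)
elements = dom.getElementsByTagName("*")
ids = {el.getAttribute("id"): el for el in elements if el.hasAttribute("id")}
result = {"items": [build_item(el, ids) for el in elements
                    if el.hasAttribute("itemscope") and not el.hasAttribute("itemprop")]}
print(json.dumps(result, indent=2))
```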

Case sensitive parsing

When parsing recipe microdata, if the markup uses itemprop="recipeinstructions" instead of itemprop="recipeInstructions", the parser fails to recognize it.
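
One way to tolerate this on the consuming side is a case-insensitive lookup over the already-extracted properties. This is only a sketch; it assumes the "properties" dict shape from the JSON output, and get_property_ci is a hypothetical helper that leaves the parser's own matching untouched:

```python
def get_property_ci(properties, name):
    """Return all values whose property name matches name, ignoring case."""
    wanted = name.lower()
    values = []
    for key, vals in properties.items():
        if key.lower() == wanted:
            values.extend(vals)
    return values


props = {"recipeinstructions": ["Mix and bake."], "name": ["Cake"]}
print(get_property_ci(props, "recipeInstructions"))
```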

Treat nested itemscope items as separate objects

When an itemscope'd element contains another itemscope'd element, the properties of the nested item should not be added to the original item.

Example:

 > microdata.py https://peerj.com/articles/182.html

"name": [
      "Ontogeny in the tube-crested dinosaur Parasaurolophus (Hadrosauridae) and heterochrony in hadrosaurids", // name of the main item
      "John Hutchinson" // name of a nested item
    ], 

make relative URLs absolute

The URI class should probably be smart enough to absolutize URLs as property values once they are removed from the context of the HTML they came from.
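
The standard library already does the heavy lifting here. A minimal sketch, assuming the document's URL is known at parse time; absolutize is a hypothetical helper and the base URL is made up for illustration:

```python
from urllib.parse import urljoin


def absolutize(base_url, value):
    """Resolve a possibly relative URL against the page it was found on."""
    return urljoin(base_url, value)


# Hypothetical page URL; "janedoe.jpg" is the relative image value from the example.
base = "http://www.xyz.edu/people/jane.html"
print(absolutize(base, "janedoe.jpg"))
print(absolutize(base, "http://www.janedoe.com"))  # absolute URLs pass through
```

A fuller version would also need to honour a base element in the document, which this sketch ignores.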

docstrings

Public classes and methods, such as get_items() and the Item class, need docstrings.
