hypallage's People

Contributors

veselosky

hypallage's Issues

URL properties should be typed objects, not strings

EDIT: Reduced the scope to cover only identifying URLs as objects. URL resolution and proper comparisons can be handled in a separate issue as needed.

The microdata spec draws a distinction between plain string values and URL values.

Make the extractor aware of URLs as a separate type.
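A minimal sketch of what "URL as a separate type" could look like, assuming a hypothetical `URLValue` wrapper and a hypothetical `property_value` helper (neither exists in the Extractor today). The table lists the elements whose property value the microdata spec takes from a URL attribute. Other special cases such as `meta` and `time` are left out, and no URL resolution is attempted, in line with the reduced scope.

```python
from dataclasses import dataclass

# Elements whose microdata property value is a URL, mapped to the
# attribute that carries it.
URL_ATTRIBUTES = {
    "a": "href", "area": "href", "link": "href",
    "audio": "src", "embed": "src", "iframe": "src", "img": "src",
    "source": "src", "track": "src", "video": "src",
    "object": "data",
}


@dataclass(frozen=True)
class URLValue:
    """Hypothetical wrapper marking a property value as a URL, not plain text."""
    href: str


def property_value(tag_name, attrs, text):
    """Return a URLValue for URL-bearing elements, otherwise the plain string.

    `attrs` is a dict of the element's attributes; `text` is its text content.
    """
    url_attr = URL_ATTRIBUTES.get(tag_name)
    if url_attr is not None:
        return URLValue(attrs.get(url_attr, ""))
    return text
```

Callers can then tell the two kinds of value apart with `isinstance(value, URLValue)` instead of guessing from the string contents.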

Basic testing of Extractor

Create some basic tests for the Extractor to prove it does something useful. Full use-case testing will require additional test data, and so will be captured as a separate issue.

Proper error handling for circular references

The microdata spec requires specific behavior when encoded microdata results in a circular reference. Currently the Extractor does not follow this instruction; it merely terminates the circular reference and continues.

Make the Extractor behave according to spec.
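For reference, a rough sketch of the loop handling as I read the spec's JSON conversion steps: each item is converted with a copy of a "memory" of its ancestors, and a nested item already in that memory is replaced by the string "ERROR". The dict-based item structure and the `item_to_object` name are assumptions for illustration, not the Extractor's actual data model.

```python
def item_to_object(item, memory=()):
    """Convert an item to a plain object, flagging circular references.

    `item` is assumed to be a dict of the form
    {"properties": {name: [value, ...]}}, where a value is either a
    string or another item dict.
    """
    # Extend a copy of memory with this item, so siblings are unaffected.
    memory = memory + (id(item),)
    properties = {}
    for name, values in item["properties"].items():
        converted = []
        for value in values:
            if isinstance(value, dict):  # nested item
                if id(value) in memory:
                    value = "ERROR"  # circular reference back to an ancestor
                else:
                    value = item_to_object(value, memory)
            converted.append(value)
        properties[name] = converted
    return {"properties": properties}


# Two items that refer to each other.
a = {"properties": {"name": ["A"]}}
b = {"properties": {"friend": [a]}}
a["properties"]["friend"] = [b]

result = item_to_object(a)
```

Here the inner reference from `b` back to `a` comes out as "ERROR" instead of being silently dropped.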

Make it work with HTMLParser

BeautifulSoup supports several different parsers. The code seems to work with lxml and with html5lib, but when it falls back to the standard library's HTMLParser, it fails to find any itemscopes. That parser sets itemscope=None rather than itemscope="" as the others do.

If possible, figure out how to make it work with the fallback parser. Until then, it will only work with one of the 3rd party parsers. :(
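The difference is easy to reproduce with the standard library alone: a valueless attribute such as `itemscope` arrives with the value `None`, while `itemscope=""` arrives as an empty string. The sketch below (the `ItemscopeFinder` class is only an illustration) tests for the presence of the attribute rather than its value, which should behave the same in both cases. With bs4, `tag.has_attr("itemscope")` is a presence check of the same kind and may be enough to fix the lookup.

```python
from html.parser import HTMLParser


class ItemscopeFinder(HTMLParser):
    """Collect the names of tags that carry an itemscope attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A valueless attribute arrives as (name, None), so test for
        # presence of the key, not for a particular value.
        if "itemscope" in attrs:
            self.found.append(tag)


finder = ItemscopeFinder()
finder.feed('<div itemscope><span itemprop="name">x</span></div><p itemscope="">y</p>')
```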

Extractor constructor should accept pre-parsed bs4 structure

Currently the Extractor takes a URL or filesystem path, which it loads and parses. This could be a pain if you have already parsed the document in question for some other reason.

Extractor's constructor should accept a BeautifulSoup object as well as a URL. The Extractor instance should expose a property allowing access to the bs4 object.

Implement nodelist document-order sort for docs with itemrefs

The microdata spec requires that properties be sorted in document order. A depth-first traversal normally suffices for this, but when an item includes itemrefs, the collected nodes can end up out of order.

Implement sorting of nodes in _extract_properties
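One possible approach, sketched here with the standard library's ElementTree so it runs without third-party parsers: number every element once in document order, then sort the collected nodes by that number. `sort_document_order` is a hypothetical helper. With bs4 the same idea should carry over by enumerating `soup.descendants` instead of `root.iter()`.

```python
import xml.etree.ElementTree as ET


def sort_document_order(root, nodes):
    """Return `nodes` sorted by their position in a depth-first walk of `root`."""
    position = {id(el): i for i, el in enumerate(root.iter())}
    return sorted(nodes, key=lambda el: position[id(el)])


doc = ET.fromstring(
    "<body>"
    '<p id="a"><span itemprop="name">first</span></p>'
    '<div itemscope="" itemref="a"><span itemprop="title">second</span></div>'
    "</body>"
)

# Crawling the item finds "title" first, then "name" via the itemref,
# even though "name" appears earlier in the document.
title = doc.find(".//span[@itemprop='title']")
name = doc.find(".//span[@itemprop='name']")
ordered = sort_document_order(doc, [title, name])
```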

Extractor json serializer

The microdata spec details how a JSON data structure should be constructed. The Extractor should know how to serialize its extracted data as JSON.
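A minimal sketch of the target shape, assuming a hypothetical `Item` class (the Extractor's real internal representation may differ). Per the spec, the result is an object with an "items" array. Each item carries "type" when it has item types, "id" when it has a global identifier, and a "properties" object mapping each name to an array of values. Circular-reference handling is left out here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical in-memory representation of an extracted item."""
    types: list = field(default_factory=list)
    itemid: str = None
    properties: dict = field(default_factory=dict)


def item_to_dict(item):
    """Build the JSON-ready object for one item, recursing into nested items."""
    obj = {}
    if item.types:
        obj["type"] = list(item.types)
    if item.itemid:
        obj["id"] = item.itemid
    obj["properties"] = {
        name: [item_to_dict(v) if isinstance(v, Item) else v for v in values]
        for name, values in item.properties.items()
    }
    return obj


def to_json(top_level_items):
    """Serialize a list of top-level items in the spec's JSON layout."""
    return json.dumps({"items": [item_to_dict(i) for i in top_level_items]}, indent=2)


person = Item(types=["http://schema.org/Person"], properties={"name": ["Ada"]})
output = to_json([person])
```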
