structominer's People

Contributors

aghz, olafura

structominer's Issues

XML parsing

Document could accept an xml argument, on which it would then call etree.XML.
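A minimal sketch of the idea, assuming a hypothetical stripped-down Document constructor (the real class is not shown in this issue). etree.XML exists in both lxml and the standard library's ElementTree, so the sketch falls back to the latter when lxml is not installed.

```python
try:
    from lxml import etree  # the library the issue refers to
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


class Document(object):
    """Hypothetical stand-in for structominer's Document (assumption)."""

    def __init__(self, xml=None):
        self.tree = None
        if xml is not None:
            # etree.XML parses a string into its root element.
            self.tree = etree.XML(xml)


doc = Document(xml="<feed><entry>first</entry><entry>second</entry></feed>")
entries = [entry.text for entry in doc.tree.findall("entry")]
```

The existing construction path would stay untouched; xml would simply be one more way to supply the tree.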

Add DictField to create a mapping field from a list of items

Functions like ListField except the underlying value store is an OrderedDict. The key can be given either as an ElementsField subclass or as a string.

If the key is a field, its value must be serializable so it can be used as the value store key.

If the key is a string, it will be split by '/' and the segments used in successive element accesses on the item (which must support element access via the sequence or mapping ABCs). E.g. key='info/0/name' will attempt to store the item under the key item['info'][0]['name'].
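The string-key variant could be sketched roughly as follows. The helper names resolve_key and index_items are hypothetical, and plain dicts and lists stand in for parsed items; this only illustrates the path lookup, not the actual DictField.

```python
from collections import OrderedDict
from collections.abc import Mapping, Sequence


def resolve_key(item, key_path):
    """Walk item along a '/'-separated path and return the value found."""
    current = item
    for segment in key_path.split("/"):
        if isinstance(current, Mapping):
            current = current[segment]
        elif isinstance(current, Sequence) and not isinstance(current, (str, bytes)):
            # Sequences are indexed by position, so the segment must be numeric.
            current = current[int(segment)]
        else:
            raise TypeError("cannot index %r with %r" % (current, segment))
    return current


def index_items(items, key_path):
    """Build an ordered mapping from each item's resolved key to the item."""
    store = OrderedDict()
    for item in items:
        store[resolve_key(item, key_path)] = item
    return store


items = [
    {"info": [{"name": "alpha"}], "score": 1},
    {"info": [{"name": "beta"}], "score": 2},
]
by_name = index_items(items, "info/0/name")
```

Here by_name["beta"] is the second item, and iteration order follows the original list.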

CSS selectors

ElementsField selects elements based on its xpath argument. It could accept an alternate css argument that it would then pass through lxml.cssselect to obtain an xpath.
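One way the argument handling might look, as a sketch only. The function resolve_xpath and its translate parameter are hypothetical; the default translation relies on lxml.cssselect.CSSSelector and its path attribute, which needs both lxml and the cssselect package installed.

```python
def resolve_xpath(xpath=None, css=None, translate=None):
    """Return the XPath to use, translating a CSS selector if one was given."""
    if (xpath is None) == (css is None):
        raise ValueError("pass exactly one of xpath or css")
    if xpath is not None:
        return xpath
    if translate is None:
        # Imported lazily so the dependency is only needed when css is used.
        from lxml.cssselect import CSSSelector

        def translate(selector):
            # CSSSelector.path holds the XPath equivalent of the selector.
            return CSSSelector(selector).path

    return translate(css)


# A stub translator keeps this usage independent of lxml.
stub_xpath = resolve_xpath(css="li", translate=lambda s: "descendant-or-self::" + s)
```

ElementsField could call such a helper in its constructor and keep working purely in terms of xpath afterwards.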

Refactor field parsing to allow value preprocessing

Currently the parser decorator only allows after-advice on parse. This is problematic if you want to extract, e.g., the number of comments from the text "42 comments" as an IntField: you have to declare it as a TextField, then extract and convert the number in the parser, thereby losing semantic expressivity.
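A rough sketch of what before-advice could enable. The IntField below is a hypothetical toy, not the library's class, and the preprocessor() and postprocessor() decorator names are assumptions made for illustration.

```python
import re


class IntField(object):
    """Toy field with pre- and post-processing hooks (hypothetical)."""

    def __init__(self):
        self._preprocessors = []
        self._postprocessors = []

    def preprocessor(self):
        def decorator(func):
            self._preprocessors.append(func)
            return func
        return decorator

    def postprocessor(self):
        def decorator(func):
            self._postprocessors.append(func)
            return func
        return decorator

    def parse(self, raw, **kwargs):
        value = raw
        for func in self._preprocessors:  # before-advice: runs on raw text
            value = func(value, **kwargs)
        value = int(value)  # the field's own conversion
        for func in self._postprocessors:  # after-advice: runs on the int
            value = func(value, **kwargs)
        return value


comments = IntField()


@comments.preprocessor()
def _digits_only(value, **kwargs):
    # Keep only the first run of digits, e.g. "42 comments" -> "42".
    return re.search(r"\d+", value).group(0)


count = comments.parse("42 comments")
```

The field stays an IntField in the declaration, and the text cleanup lives in the preprocessor.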

Find a way to access structure subfields in a decorator expression

Currently subfield definitions in the structure of StructuredField can only be accessed via field.structure[key], which can't be used in a decorator expression. This makes declaring processors on subfields very awkward. Instead of the natural

@field.foo.preprocessor()
def _foo(value, **kwargs):
    return value

we have to use the elaborate way of applying decorators

def _foo(value, **kwargs):
    return value
field.structure['foo'].preprocessor()(_foo)

This is important because StructuredField and especially StructuredListField / StructuredDictField seem to be pretty common patterns on the web, especially when extracting datasets.

More examples

Add more examples, including a reference one on custom markup that illustrates all the features.

Add exception handler decorator

Add a decorator for declaring custom exception handlers on field parsing. Ideally, they should be able to re-raise the exception without messing up the traceback.
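A sketch of how such a decorator might behave. The Field class and the error_handler name are hypothetical. The relevant Python behaviour is that a bare raise inside an except block, or inside a function called from one, re-raises the active exception with its original traceback.

```python
class Field(object):
    """Toy field wrapping a conversion function (hypothetical)."""

    def __init__(self, convert):
        self._convert = convert
        self._handlers = []

    def error_handler(self, *exception_types):
        def decorator(func):
            self._handlers.append((exception_types, func))
            return func
        return decorator

    def parse(self, raw):
        try:
            return self._convert(raw)
        except Exception as exc:
            for exception_types, handler in self._handlers:
                if isinstance(exc, exception_types):
                    # The handler's return value becomes the field value.
                    return handler(exc, raw)
            raise  # no handler matched: propagate unchanged


lenient = Field(int)


@lenient.error_handler(ValueError)
def _default_to_none(exc, raw):
    return None


strict = Field(int)
seen = []


@strict.error_handler(ValueError)
def _record_and_reraise(exc, raw):
    seen.append(raw)
    raise  # re-raises the exception currently being handled
```

With this, lenient.parse("n/a") yields None, while strict.parse("n/a") records the input and still raises the original ValueError.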

Separate parsing mechanism from ElementsField into a parent class Field

Turns out ElementsField fills two roles: to provide the generic mechanism for parsing fields, and to provide the initial xpath extraction that all subclasses refine further.

Separate the parsing mechanism into a parent class Field (described in #10), leaving ElementsField to only define the initial _parse.
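The split could look roughly like this. Both classes are simplified assumptions, and the sketch uses the standard library's ElementTree with findall (which supports only a subset of XPath) so that it runs without lxml.

```python
import xml.etree.ElementTree as etree


class Field(object):
    """Generic parsing mechanism; knows nothing about XPath (hypothetical)."""

    def __init__(self):
        self.value = None

    def parse(self, source):
        # Shared machinery would live here: processors, error handling, etc.
        self.value = self._parse(source)
        return self.value

    def _parse(self, source):
        raise NotImplementedError


class ElementsField(Field):
    """Only defines the initial extraction step."""

    def __init__(self, xpath):
        super().__init__()
        self.xpath = xpath

    def _parse(self, source):
        return source.findall(self.xpath)


root = etree.XML("<ul><li>a</li><li>b</li></ul>")
items = ElementsField(".//li").parse(root)
texts = [element.text for element in items]
```

Subclasses of ElementsField would then refine the element list, while anything generic goes into Field.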

Composition over inheritance

Currently fields are structured in an inheritance hierarchy. This is making the responsibility chain in parsing somewhat strange and makes inserting pointcuts therein rather awkward (e.g. for pre/post-processors). We can already see weird artifacts like __masquerades__.

Replace the inheritance with composition, each field constructor accepting a from object or class (with a sensible default for usability) that it uses to perform processing that isn't its responsibility. Basically each field will then specify what kinds of things it needs, and what kinds of things it outputs after parsing (e.g. list of elements, list of strings, float).
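To make the proposal concrete, here is a small sketch with hypothetical class names. Since from is a reserved word in Python, the sketch calls the constructor argument source. Each stage declares what it consumes by what it asks of its source, and plain Python data stands in for a parsed document.

```python
class Elements(object):
    """Produces a list of nodes from a document."""

    def __init__(self, select):
        self.select = select  # callable: document -> iterable of nodes

    def parse(self, document):
        return list(self.select(document))


class Texts(object):
    """Needs a list of nodes; outputs a list of stripped strings."""

    def __init__(self, source):
        self.source = source

    def parse(self, document):
        return [str(node).strip() for node in self.source.parse(document)]


class FirstFloat(object):
    """Needs a list of strings; outputs a single float."""

    def __init__(self, source):
        self.source = source

    def parse(self, document):
        return float(self.source.parse(document)[0])


price = FirstFloat(Texts(Elements(lambda doc: doc["prices"])))
value = price.parse({"prices": [" 9.5 ", "3"]})
```

Because each stage only talks to its source through parse, a pre- or post-processor can be inserted as just another stage in the chain.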

Remove the need for **kwargs in processors

Perhaps this could be done by inspecting the decorated function to see which arguments it declares. Make sure to still support **kwargs (in which case all data is sent), because explicit is still better than implicit.
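A sketch of the inspection step using inspect.signature from the standard library. The helper call_processor and the context names field and document are hypothetical.

```python
import inspect


def call_processor(func, value, **context):
    """Call func with value plus only the context arguments it declares."""
    params = inspect.signature(func).parameters
    takes_var_keyword = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if takes_var_keyword:
        # The function declared **kwargs, so it receives everything.
        return func(value, **context)
    wanted = {name: context[name] for name in params if name in context}
    return func(value, **wanted)


def short(value, field):
    return (value, field)


def full(value, **kwargs):
    return sorted(kwargs)


short_result = call_processor(short, 1, field="f", document="d")
full_result = call_processor(full, 1, field="f", document="d")
```

In this sketch short receives only field, while full still gets both field and document.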

Fix recursive element access on Document, ListField, DictField and StructuredField

Element access is currently a bit wonky, because it directly returns the value of a field. So if you start with a ListField and you access field[0], you'll get the value of the field in position 0. If it's a scalar field this is fine and you get the value you expect, but if it's a collection field this will return a standard Python collection containing Fields. If you request subsequent elements, the access goes through the plain Python collection, thereby returning the Field, not its value: if field is a ListField of ListFields of IntFields, field[0][1] will return an IntField instead of an int. If the second level is a collection of collections as well, the behaviour returns to normal: field[0][1][0] is an int. So it's only collection fields in even-numbered levels of the hierarchy that are problematic (child, child of grandchild, etc.).

The problem is compounded by the fact that element access usually starts with the document, so if it has e.g. a ListField values of IntFields, document['values'] will return a Python list of IntFields instead of a list of ints.
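One possible fix, sketched with hypothetical toy classes: make the value of a collection field unwrap its children recursively, and have element access return the child's value. This is an illustration of the idea, not necessarily the approach the library should take.

```python
class IntField(object):
    """Toy scalar field (hypothetical)."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value


class ListField(object):
    """Toy collection field whose value contains no Field objects."""

    def __init__(self, fields):
        self._fields = list(fields)

    @property
    def value(self):
        # Recursion happens here: a child ListField unwraps its own children.
        return [child.value for child in self._fields]

    def __getitem__(self, index):
        return self._fields[index].value


nested = ListField([
    ListField([IntField(1), IntField(2)]),
    ListField([IntField(3)]),
])
first_row = nested[0]
second_cell = nested[0][1]
```

Here nested[0] is a plain list of ints, so nested[0][1] is an int at every depth. The trade-off is that the Field objects themselves are no longer reachable through element access and would need a separate accessor.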
