aghz / structominer


Data scraping for a more civilized age

License: MIT License

Python 100.00%

structominer's People

Contributors

Stargazers

Watchers

Forkers

olafura pombredanne paulsmash

structominer's Issues

XML parsing

Document could accept an xml argument and parse it by calling etree.XML on it.

Add DictField to create a mapping field from a list of items

Functions like ListField except the underlying value store is an OrderedDict. The key can be given either as an ElementsField subclass or as a string.

If the key is a field, its value must be serializable so it can be used as the value store key.

If the key is a string, it will be split by '/' and the segments used in successive element accesses on the item (which must support element access via sequence or mapping ABCs). For example, key='info/0/name' will attempt to store each item under the key item['info'][0]['name'].
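A rough sketch of how the string form of the key could be resolved, assuming Python 3 and plain Python containers. The names resolve_key and build_mapping are made up for illustration and are not part of structominer.

```python
from collections import OrderedDict
from collections.abc import Mapping, Sequence


def resolve_key(item, key):
    """Follow a '/'-separated path through nested mappings and sequences."""
    current = item
    for segment in key.split('/'):
        if isinstance(current, Mapping):
            current = current[segment]
        elif isinstance(current, Sequence) and not isinstance(current, str):
            # Sequences are indexed by position, so the segment must be an integer.
            current = current[int(segment)]
        else:
            raise TypeError("cannot index %s with segment %r"
                            % (type(current).__name__, segment))
    return current


def build_mapping(items, key):
    """Store each item under the value found at the key path (hypothetical helper)."""
    store = OrderedDict()
    for item in items:
        store[resolve_key(item, key)] = item
    return store


items = [
    {'info': [{'name': 'alice'}]},
    {'info': [{'name': 'bob'}]},
]
mapping = build_mapping(items, 'info/0/name')
print(list(mapping))
```

Items that resolve to the same key overwrite each other in this sketch; a real DictField would have to decide whether that is an error.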

CSS selectors

ElementsField selects elements based on its xpath argument. It could accept an alternate css argument that it would then pass through lxml.cssselect to obtain an xpath.

Fix string checking idioms

Use isinstance(var, basestring), dumbass.
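For reference, a small sketch of the idiom. basestring only exists on Python 2, so the fallback to str is an assumption about wanting the same check to run on Python 3; is_string is a made-up helper name.

```python
try:
    string_types = basestring  # Python 2: covers both str and unicode
except NameError:
    string_types = str  # Python 3: basestring no longer exists


def is_string(var):
    """Return True for text values, False for everything else (including lists of strings)."""
    return isinstance(var, string_types)


print(is_string('//div[@class="post"]'))
print(is_string(['a', 'b']))
```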

Refactor field parsing to allow value preprocessing

Currently the parser decorator only allows after advice on parse. This is problematic if you want to extract e.g. the number of comments from the text "42 comments" as an IntField; currently you have to declare it as a TextField then extract and convert the number in the parser, thereby losing semantic expressivity.
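One way this could look, sketched with a stand-in class rather than the real IntField. The preprocessor() decorator name follows the usage proposed elsewhere in these issues; everything else here is assumed for illustration.

```python
import re


class IntField(object):
    """Hypothetical stand-in for the library's IntField, not the real class."""

    def __init__(self):
        self._preprocessors = []

    def preprocessor(self):
        """Register a function that runs on the raw value before conversion."""
        def register(func):
            self._preprocessors.append(func)
            return func
        return register

    def parse(self, raw):
        value = raw
        for func in self._preprocessors:
            value = func(value)
        return int(value)


comments = IntField()


@comments.preprocessor()
def _leading_number(value, **kwargs):
    # Pull the first run of digits out of text such as "42 comments".
    match = re.search(r'\d+', value)
    if match is None:
        raise ValueError("no number found in %r" % value)
    return match.group(0)


print(comments.parse("42 comments"))
```

The field stays declared as an IntField, and the text cleanup lives in the preprocessor instead of in a parser attached to a TextField.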

Find a way to access structure subfields in a decorator expression

Currently subfield definitions in the structure of StructuredField can only be accessed via field.structure[key], which can't be used in a decorator expression. This makes declaring processors on subfields very awkward. Instead of the natural

@field.foo.preprocessor()
def _foo(value, **kwargs):
    return value

we have to use the elaborate way of applying decorators

def _foo(value, **kwargs):
    return value
field.structure['foo'].preprocessor()(_foo)

This is important because StructuredField and especially StructuredListField / StructuredDictField seem to be pretty common patterns on the web, especially when extracting datasets.

More examples

Add more examples, including a reference one on custom markup that illustrates all the features.

Document could theoretically subclass StructuredField

This way we could have different classes defining documents, and then use them as fields in other documents.

StringsField option to recursively extract descendant strings

StringsField flag for using descendant-or-self::*/text() instead of just text().

Thanks to @olafura for the suggestion.

Add filter decorator to DictField

Add exception handler decorator

Add a decorator for declaring custom exception handlers on field parsing. Ideally, they should be able to re-raise the exception without messing up the traceback.

Stop clobbering tracebacks when raising ParsingErrors in fields

Seriously, it got annoying even for me.
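On Python 3, exception chaining keeps the original traceback attached to the new exception. The sketch below assumes Python 3 and uses a stand-in ParsingError and a made-up parse_int function. On Python 2 the closest equivalent would be the three-expression form of raise with the traceback from sys.exc_info().

```python
class ParsingError(Exception):
    """Hypothetical stand-in for the library's ParsingError."""


def parse_int(raw):
    try:
        return int(raw)
    except ValueError as exc:
        # "from exc" stores the original exception as __cause__, so its
        # traceback is reported together with the ParsingError.
        raise ParsingError("could not parse %r as an integer" % (raw,)) from exc


try:
    parse_int("forty-two")
except ParsingError as err:
    print(type(err.__cause__).__name__)
```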

Write unit tests

Separate parsing mechanism from ElementsField into a parent class Field

Turns out ElementsField fills two roles: to provide the generic mechanism for parsing fields, and to provide the initial xpath extraction that all subclasses refine further.

Separate the parsing mechanism into a parent class Field (described in #10), leaving ElementsField to only define the initial _parse.

Composition over inheritance

Currently fields are structured in an inheritance hierarchy. This is making the responsibility chain in parsing somewhat strange and makes inserting pointcuts therein rather awkward (e.g. for pre/post-processors). We can already see weird artifacts like __masquerades__.

Replace the inheritance with composition, each field constructor accepting a from object or class (with a sensible default for usability) that it uses to perform processing that isn't its responsibility. Basically each field will then specify what kinds of things it needs, and what kinds of things it outputs after parsing (e.g. list of elements, list of strings, float).

Remove the need for **kwargs in processors

Perhaps by inspecting the decorated function to see which arguments it declares. Make sure to still support **kwargs (in which case all data is sent), because explicit is still better than implicit.
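A sketch of the inspection approach using inspect.signature, which requires Python 3. The helper name call_processor and the context keys field and document are assumptions for illustration.

```python
import inspect


def call_processor(func, value, **context):
    """Call func with value plus only the context arguments it declares."""
    params = list(inspect.signature(func).parameters.values())
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        # The function declares **kwargs, so send everything.
        return func(value, **context)
    # Skip the first parameter: it receives the value positionally.
    accepted = {
        p.name for p in params[1:]
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    wanted = {k: v for k, v in context.items() if k in accepted}
    return func(value, **wanted)


def only_value(value):
    return value.strip()


def with_field(value, field):
    return (value.strip(), field)


def everything(value, **kwargs):
    return sorted(kwargs)


context = {'field': 'title', 'document': 'doc'}
print(call_processor(only_value, ' x ', **context))
print(call_processor(with_field, ' x ', **context))
print(call_processor(everything, ' x ', **context))
```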

Use decorator.decorator for @parser and @map

Look into using Michele Simionato's decorator package (documentation, SO thread). This might not work because we need access to self in there, but it's worth investigating in order to properly preserve function signatures.

Proper Python packaging

Add a proper setup.py so that structominer can be installed in the usual Python way.

Fix recursive element access on Document, ListField, DictField and StructuredField

Element access is currently a bit wonky, because it returns the value of a field directly. So if you start with a ListField and access field[0], you get the value of the field in position 0. If it's a scalar field this is fine and you get the value you expect, but if it's a collection field this returns a standard Python collection containing Fields. Subsequent element access then goes through that plain Python collection, returning the Field rather than its value: if field is a ListField of ListFields of IntFields, field[0][1] returns an IntField instead of an int. If the second level is a collection of collections as well, the behaviour returns to normal: field[0][1][0] is an int. So only collection fields at even-numbered levels of the hierarchy are problematic (child, child of grandchild, etc.).

The problem is compounded by the fact that element access usually starts with the document, so if it has e.g. a ListField named values containing IntFields, document['values'] will return a Python list of IntFields instead of a list of ints.
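One possible fix is to unwrap values recursively at every access, so that callers never see a Field at any depth. The classes and the unwrap helper below are simplified stand-ins written for this sketch, not the library's code.

```python
class Field(object):
    """Simplified stand-in: holds a raw value that may contain other Fields."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return unwrap(self._value)


class IntField(Field):
    pass


class ListField(Field):
    def __getitem__(self, index):
        # Unwrap the child completely instead of returning its raw value.
        return unwrap(self._value[index])


def unwrap(obj):
    """Replace every Field found in obj, at any depth, with its plain value."""
    if isinstance(obj, Field):
        return unwrap(obj._value)
    if isinstance(obj, dict):
        return type(obj)((k, unwrap(v)) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return type(obj)(unwrap(v) for v in obj)
    return obj


field = ListField([ListField([IntField(1), IntField(2)])])
print(field[0][1])
print(field.value)
```

A Document's element access could apply the same unwrap step, so that a ListField of IntFields comes back as a list of ints.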