aghz / structominer
Data scraping for a more civilized age
License: MIT License
Document could be passed an xml argument that it then calls etree.XML on.
Functions like ListField except the underlying value store is an OrderedDict. The key can be given either as an ElementsField subclass or as a string.
If the key is a field, its value must be serializable so it can be used as the value store key.
If the key is a string, it will be split by '/' and the segments used in successive element accesses on the item (which must support element access via sequence or mapping ABCs). E.g. key='info/0/name' will attempt to store the item under the value found at item['info'][0]['name'].
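A minimal sketch of how such a string key could be resolved, assuming items are plain nested dicts and lists. The helper name resolve_key is made up for illustration and is not part of structominer:

```python
from collections.abc import Sequence


def resolve_key(item, key):
    """Follow a '/'-separated path through nested mappings and sequences."""
    value = item
    for segment in key.split('/'):
        # Sequences (other than strings) are indexed by integer position,
        # everything else is treated as a mapping and indexed by the segment.
        if isinstance(value, Sequence) and not isinstance(value, str):
            value = value[int(segment)]
        else:
            value = value[segment]
    return value


item = {'info': [{'name': 'alice'}]}
store_key = resolve_key(item, 'info/0/name')
```

Here store_key would be the value used as the OrderedDict key for this item. One limitation of this sketch: a mapping with integer keys can't be addressed, since segments are only converted to int for sequences.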
ElementsField selects elements based on its xpath argument. It could accept an alternate css argument that it would then pass through lxml.cssselect to obtain an xpath.
Use isinstance(var, basestring), dumbass.
Currently the parser decorator only allows after advice on parse. This is problematic if you want to extract e.g. the number of comments from the text "42 comments" as an IntField; currently you have to declare it as a TextField then extract and convert the number in the parser, thereby losing semantic expressivity.
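A rough sketch of what before advice could look like for the "42 comments" case. The IntField below is a simplified stand-in written for this example, not structominer's class; only the preprocessor() decorator name is taken from the issues here:

```python
import re


class IntField:
    """Hypothetical stand-in for an integer field."""

    def __init__(self):
        self._preprocessors = []

    def preprocessor(self):
        # Decorator factory: registers a function to run before conversion.
        def register(func):
            self._preprocessors.append(func)
            return func
        return register

    def parse(self, raw):
        value = raw
        for func in self._preprocessors:
            value = func(value)
        return int(value)


comments = IntField()


@comments.preprocessor()
def _leading_number(value, **kwargs):
    # Keep only the leading number of a string such as "42 comments".
    return re.match(r'\s*(\d+)', value).group(1)
```

With this in place the field stays declared as an IntField and comments.parse("42 comments") yields an int, so the conversion no longer has to live in the parser.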
Currently subfield definitions in the structure of StructuredField can only be accessed via field.structure[key], which can't be used in a decorator expression. This makes declaring processors on subfields very awkward. Instead of the natural
@field.foo.preprocessor()
def _foo(value, **kwargs):
    return value
we have to use the elaborate way of applying decorators
def _foo(value, **kwargs):
    return value
field.structure['foo'].preprocessor()(_foo)
This is important because StructuredField and especially StructuredListField / StructuredDictField seem to be pretty common patterns on the web, especially when extracting datasets.
Add more examples, including a reference one on custom markup that illustrates all the features.
This way we could have different classes defining documents, and then use them as fields in other documents.
StringsField flag for using descendant-or-self::*/text() instead of just text().
Thanks to @olafura for the suggestion.
Add a decorator for declaring custom exception handlers on field parsing. Ideally, they should be able to re-raise the exception without messing up the traceback.
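A sketch of how this could work, assuming Python 3. The Field class, the error_handler decorator name and the handler signature (exc, raw) are all invented for the example:

```python
class Field:
    """Hypothetical field wrapping a parser callable."""

    def __init__(self, parser):
        self._parser = parser
        self._handlers = []  # (exception type, handler) pairs, in order

    def error_handler(self, exc_type):
        # Decorator factory: registers a handler for one exception type.
        def register(func):
            self._handlers.append((exc_type, func))
            return func
        return register

    def parse(self, raw):
        try:
            return self._parser(raw)
        except Exception as exc:
            for exc_type, handler in self._handlers:
                if isinstance(exc, exc_type):
                    # The handler either returns a fallback value or
                    # re-raises with a bare `raise`.
                    return handler(exc, raw)
            raise  # no handler matched, propagate unchanged


count = Field(int)


@count.error_handler(ValueError)
def _default_to_zero(exc, raw):
    return 0


strict = Field(int)


@strict.error_handler(ValueError)
def _give_up(exc, raw):
    raise  # re-raises the exception currently being handled
```

Because the handler runs inside the except block, a bare raise in it re-raises the active exception and the frames from the original failure stay in the traceback.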
Seriously, it got annoying even for me.
Turns out ElementsField fills two roles: to provide the generic mechanism for parsing fields, and to provide the initial xpath extraction that all subclasses refine further.
Separate the parsing mechanism into a parent class Field (described in #10), leaving ElementsField to only define the initial _parse.
Currently fields are structured in an inheritance hierarchy. This makes the responsibility chain in parsing somewhat strange and makes inserting pointcuts therein rather awkward (e.g. for pre/post-processors). We can already see weird artifacts like __masquerades__.
Replace the inheritance with composition, each field constructor accepting a from object or class (with a sensible default for usability) that it uses to perform processing that isn't its responsibility. Basically each field will then specify what kinds of things it needs, and what kinds of things it outputs after parsing (e.g. list of elements, list of strings, float).
Perhaps by inspecting the decorated function to see which arguments it declares. Make sure to still support **kwargs (in which case all data is sent), because explicit is still better than implicit.
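The inspection could be done with inspect.signature from the standard library (Python 3). A sketch, where call_with_declared and the argument names value, field and document are placeholders for whatever data the processors actually receive:

```python
import inspect


def call_with_declared(func, **available):
    """Call func with only the keyword arguments it declares.

    If func declares **kwargs, all available data is sent.
    """
    params = inspect.signature(func).parameters.values()
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return func(**available)
    accepted = {
        p.name for p in params
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    return func(**{k: v for k, v in available.items() if k in accepted})


def only_value(value):
    return value


def everything(**kwargs):
    return sorted(kwargs)


first = call_with_declared(only_value, value=1, field='f', document='d')
second = call_with_declared(everything, value=1, field='f')
```

only_value gets just value, while everything receives all the data because it declares **kwargs.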
Look into using Michele Simionato's decorator package (documentation, SO thread). This might not work because we need access to self in there, but it's worth investigating in order to properly preserve function signatures.
Add a proper setup.py so that structominer can be installed in the usual Python way.
Element access is currently a bit wonky, because it directly returns the value of a field. So if you start with a ListField and you access field[0] you'll get the value of the field in position 0. If it's a scalar field this is fine and you get the value you expect, but if it's a collection field this will return a standard Python collection containing Fields. Any subsequent element access then goes through that plain Python collection, thereby returning the Field, not its value: if field is a ListField of ListFields of IntFields, field[0][1] will return an IntField instead of an int. If the second level is a collection of collections as well, the behaviour returns to normal: field[0][1][0] is an int; so it's only collection fields in even-numbered levels of hierarchy that are problematic (child, child of grandchild, etc.)
The problem is additionally compounded by the fact that element access usually starts with the document, so if it has e.g. a ListField values of IntFields, document['values'] will return a Python list of IntFields instead of a list of ints.
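One way out would be to unwrap values recursively before handing them back. A sketch under the assumption that a field keeps its parsed value in a value attribute; the Field class and the unwrap helper are stand-ins made up for this example:

```python
class Field:
    """Stand-in field: just holds its parsed value."""

    def __init__(self, value):
        self.value = value


def unwrap(obj):
    """Replace every Field, at any depth, with its plain Python value."""
    if isinstance(obj, Field):
        return unwrap(obj.value)
    if isinstance(obj, list):
        return [unwrap(element) for element in obj]
    if isinstance(obj, dict):
        return {key: unwrap(element) for key, element in obj.items()}
    return obj


# A list field of list fields of int fields.
nested = Field([Field([Field(1), Field(2)]), Field([Field(3)])])
plain = unwrap(nested)
```

After unwrapping, plain[0][1] is an int regardless of the nesting level. Note the sketch returns plain dicts, so an OrderedDict value store would need its own branch if the type matters.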
Properly document the codebase, generate Sphinx docs and upload to RTD.
Similar to @filter, store a list of functions to apply to each element.
Very much like DateField.