
delb-py's Introduction

delb


delb is a library that provides an ergonomic model for XML encoded text documents (e.g. TEI-XML) for the Python programming language. It fills a gap in humanities-related software development, reaching toward the excellent (scientific) communities in the Python ecosystem.

For a more elaborate discussion, see the Design chapter of the documentation.

Features

  • Loads documents from various source types. This is customizable and extensible.
  • XML DOM types are represented by distinct classes.
  • A completely type-annotated API.
  • Consistent design regarding names and callables' signatures.
  • Shadows comments and processing instructions by default.
  • Querying with XPath and CSS expressions.

Development status

You're invited to submit tests that reflect desired use cases or are of a merely theoretical nature. Of course, any kind of proposal for or implementation of improvements is welcome as well.

Related Projects & Testimonials

snakesist is an eXist-db client that uses delb to expose database resources.

Kurt Raschke noted in 2010:

In a DOM-based implementation, it would be relatively easy […]
But lxml doesn't use text nodes; instead it uses [text] and [tail]
properties to hold text content.

delb-py's People

Contributors

dependabot[bot], funkyfuture, jkatzwinkel


delb-py's Issues

Add a wrapper for query results?

the current way to fetch the one expected result from a css or xpath query is imo not very readable:
result = first(node.css_select(expression)).full_text.strip()

this might be better:
result = node.css_select(expression).first.full_text.strip()
it should read better when expressions grow longer, more method calls are chained, or calls are nested within functions.

on the other hand, the first and last functions are usable w/ any iterable and are also handy with other data.

the wrapping structure would probably be list-like, possibly lazily evaluating its contents in the future. besides first and last properties, there could be other helpers, like adding additional filters.

could smell like overengineering though.

putting this to the 0.2 milestone w/ the option to postpone or discard.
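a rough sketch of such a list-like wrapper could look like the following — all names here are hypothetical illustrations, not delb's actual API:

```python
from typing import Callable


class QueryResults(list):
    """A hypothetical list-like wrapper around query results."""

    @property
    def first(self):
        # returning None on empty results is an assumption here;
        # actual semantics might raise instead
        return self[0] if self else None

    @property
    def last(self):
        return self[-1] if self else None

    def filtered_by(self, predicate: Callable) -> "QueryResults":
        # chaining point for additional filters
        return QueryResults(x for x in self if predicate(x))
```

with that, `node.css_select(expression).first` reads left to right without wrapping the whole call in `first(...)`.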

Regarding the handling of IRIs as XML namespaces

before i dump an attempt to validate declared namespaces as IRIs (RFC 3987) in #68, i want to leave some notes about the problem.

  • lxml doesn't seem to validate namespace values; after all, the test suite is full of invalid ones
  • rdflib's URI validation is shallow; it doesn't really validate
  • there's also uritools, which focuses on parsing rather than validation and has no type annotations or support for RFC 3987 yet
  • delb's namespace validation logic (as part of a parser implementation) would also have to consider xml:base and thus require a function to resolve relative IRIs
  • in the Rust realm there are iref, iri-string, oxiri & sophia_iri

whatever the design decision will be, i'd rather see it late on the roadmap.

Raise more informative exceptions when document loading fails

the document loading logic is failure-tolerant by design. however, users are left clueless about the causes when a document can't be loaded.

hence, the loaders should return an explanation of why they do not consider a given source to be loadable, or the exception that was raised when they tried to load it. in case of a failed loading, the users shall be informed accordingly.
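a minimal sketch of the idea, with entirely hypothetical loader signatures (a loader returns either a result or a reason string):

```python
def path_loader(source):
    # hypothetical loader: returns (result, None) on success or
    # (None, reason) when it doesn't consider the source loadable
    if not isinstance(source, str):
        return None, "source is not a string path"
    return {"source": source}, None


def load_document(source, loaders):
    # collects each loader's explanation so that a total failure
    # can be reported with informative causes
    reasons = []
    for loader in loaders:
        result, reason = loader(source)
        if result is not None:
            return result
        reasons.append(f"{loader.__name__}: {reason}")
    raise ValueError("no loader could handle the source:\n" + "\n".join(reasons))
```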

Adding the Document.source_url property

as of now the contributed document loaders ftp_http_loader and https_loader add a source_url object to the Document.config namespace. it is reasonable to assume that further document loaders would add such an attribute as well.

it is arguable whether this attribute should be considered configuration data rather than a first-level property of a document. i propose to make that shift.

the current mechanic for loaders to store that information in the config namespace can stay in place and the document bootstrapping would then move it. the file_loader should also store a source_url with the file:// scheme instead of the source_path.

it should be documented that a document instance doesn't necessarily reflect what is available at the source_url property's value, as the source may have changed.

Plugins: drop PluginManager.register_document_extension in favor of DocumentMixinHooks.__init_subclass__

the method should have been renamed anyway. as for the class DocumentMixinHooks, it

  • can achieve the same with __init_subclass__, and users don't have to care
  • should be an abstract base class, though that may collide with other uses of ABCs or metaclasses
  • could verify that derived classes don't implement __init__, if an ABC is feasible
  • would then better be named DocumentMixinBase

i think compatibility wrappers for the previous API wouldn't be worth the effort.
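a sketch of how __init_subclass__ could replace explicit registration; the class and attribute names below are made up for illustration:

```python
class DocumentMixinBase:
    # registry of all mixin classes, populated automatically
    _mixins: list = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # subclasses register themselves at definition time,
        # so users never call a plugin manager explicitly
        DocumentMixinBase._mixins.append(cls)


class StorageMixin(DocumentMixinBase):
    """An example extension that a database loader might contribute."""

    def store_back(self):
        return "stored"
```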

Regarding the names of methods that refer to other nodes

i came across circumstances in the API design that have an air of inconsistency:

  1. there's the NodeBase.ancestors method, and its complementary way to traverse a tree is NodeBase.child_nodes with the recurse argument set to True. this argument is unique among the methods to navigate from a node; all others only take filters as arguments.
  2. the XPath axes implementations directly use a set of concordant navigational methods that often have descriptive names made of a verb, an adjective and a mathematical term (e.g. NodeBase.iterate_next_nodes), while the axes names all refer to the metaphor of kinship in Homo sapiens cultures (e.g. following-siblings).
  3. the names of NodeBase's methods that refer to other nodes use either descriptions (e.g. next_node) or use the kinship metaphor (e.g. parent).
  4. looking closer at it (see below), the names' structures are quite diverse.

i'm certain i'd like to deprecate NodeBase.child_nodes' recurse argument in favor of a NodeBase.descendants iterator.


the current names can be described like this:

name                                structure*

parent & unfiltered shortcuts
first_child                         rk
last_child                          rk
last_descendant                     rk
parent                              k

fetching a single relative
next_node                           rm
next_node_in_stream                 rmc
previous_node                       rm
previous_node_in_stream             rmc

iterating
ancestors                           K
child_nodes                         kM
iterate_next_nodes                  vrM
iterate_next_nodes_in_stream        vrMc
iterate_previous_nodes              vrM
iterate_previous_nodes_in_stream    vrMc

adding nodes
add_next                            vr
add_previous                        vr
insert_child                        vk
prepend_child                       vk

* legend: r - relational adjective, k/K - kinship substantive (s./pl.), m/M - mathematical term (s./pl.), c - contextual aspect, v - verb

this is the distribution of composed structures, that's ten forms made of five types:

k .
K .
kM .
rk ...
rm ..
rmc ..
vk ..
vr ..
vrM ..
vrMc ..

with naming principles in general, it's a matter of taste. personally i find the descriptive ones clearer and the kinship metaphor both inane and miserable. certainly many will find it customary. an obvious problem is that i haven't come up with something descriptive for ancestors, child_nodes and parent.

a point in favor of the XPath concept, besides consistency, imo is that it defines forward and reverse axes as behaviour. that would set a frame to clearly answer the question raised in #30, so that a given input sequence is inserted in axis direction.


in order not to tempt users into guessing, an obvious way to do it can be provided by method names that follow a stringent structure.


a first step to streamlining can be to omit the node term where it can be replaced by or reduced to kinship terms. also, methods that can take multiple nodes as input should use a pluralized form. but afaik there's no plural of following in poor english:

old name                            structure    old or new name

parent & unfiltered shortcuts
first_child                         rk           first_child
last_child                          rk           last_child
last_descendant                     rk           last_descendant
parent                              k            parent

fetching a single relative
next_node                           rm → rk      following_sibling
next_node_in_stream                 rmc → r      following
previous_node                       rm → rk      preceding_sibling
previous_node_in_stream             rmc → r      preceding

iterating
ancestors                           K            ancestors
child_nodes                         kM → K       children
(new)                               K            descendants
iterate_next_nodes                  vrM → vrK    iterate_following_siblings
iterate_next_nodes_in_stream        vrMc → vr    iterate_following
iterate_previous_nodes              vrM → vrK    iterate_preceding_siblings
iterate_previous_nodes_in_stream    vrMc → vr    iterate_preceding

adding nodes
add_next                            vr → vrK     add_following_siblings
add_previous                        vr → vrK     add_preceding_siblings
insert_child                        vk → vK      insert_children
prepend_child                       vk → vK      prepend_children

this is the distribution of seven forms made of three word types:

k .
K ...
rk .....
r ..
vK ..
vr ..
vrK ....

given said lack of the nouns followings (in the intended meaning) and precedings, the verb iterate is necessary, and hence consistency in this corner is achieved by adding it where it's missing:

old name                            structure    old or new name

parent & unfiltered shortcuts
first_child                         rk           first_child
last_child                          rk           last_child
last_descendant                     rk           last_descendant
parent                              k            parent

fetching a single relative
next_node                           rm → rk      following_sibling
next_node_in_stream                 rmc → r      following
previous_node                       rm → rk      preceding_sibling
previous_node_in_stream             rmc → r      preceding

iterating
ancestors                           K → vK       iterate_ancestors
child_nodes                         kM → vK      iterate_children
(new)                               vK           iterate_descendants
iterate_next_nodes                  vrM → vrK    iterate_following_siblings
iterate_next_nodes_in_stream        vrMc → vr    iterate_following
iterate_previous_nodes              vrM → vrK    iterate_preceding_siblings
iterate_previous_nodes_in_stream    vrMc → vr    iterate_preceding

adding nodes
add_next                            vr → vrK     add_following_siblings
add_previous                        vr → vrK     add_preceding_siblings
insert_child                        vk → vK      insert_children
prepend_child                       vk → vK      prepend_children

resulting in this distribution of six forms:

k .
rk .....
r ..
vK .....
vr ..
vrK ....

one last measure for streamlining could be to prefix methods that fetch a single node with the verb fetch (e.g. fetch_following_sibling), leading to a distribution with also six forms like so:

k .
rk ...
vK .....
vr ....
vrk ..
vrK ....

a further question could be whether the self-or-* axes from XPath should also be adapted as iterator methods on NodeBase. i'm quite sure that i've already had situations where i could have used them, but it's also simple to work around. though implementation and maintenance would be almost no-cost.


the risk of introducing new bugs with the possible changes is very low, so it'd be okay to include them in the 0.4 release imo. but the triviality of the issue also doesn't make a decision urgent.

Make use of pyproject-fmt

once it stops re-ordering seemingly randomly, possibly with the latest release from today, pyproject-fmt should be used like black for applying a canonical formatting and verifying it (maybe there's also a flake8 plugin for that?).

due to its extrapolation of trove classifiers about supported language versions, it fits well in conjunction with hynek's GH Action for building and validating wheels.

Find ideas on addressing namespaced attributes

i found this in the wild:

tag('cit', {f'{{{XML_NS}}}lang': lang})

i don't like it. there must be a better way.

i'm wary of eventually introducing a data type for qualified names. maybe a combination of an attributes adapter and a string subclass yields slick usage.
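such a string subclass could be sketched like this — a hypothetical QualifiedName that renders itself in Clark notation while keeping its parts addressable:

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"


class QualifiedName(str):
    """Hypothetical: a str subclass rendering in Clark notation."""

    namespace: str
    local_name: str

    def __new__(cls, namespace, local_name):
        # the instance *is* the Clark-notation string, so it can be
        # used anywhere a plain attribute name is expected
        instance = super().__new__(cls, f"{{{namespace}}}{local_name}")
        instance.namespace = namespace
        instance.local_name = local_name
        return instance


# instead of: tag('cit', {f'{{{XML_NS}}}lang': lang})
attributes = {QualifiedName(XML_NS, "lang"): "de"}
```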

Evaluate just as task runner

just is an alternative to make that is intended for what we use it as: a task runner.

i want to evaluate it, here's the plan:

  • include a Justfile beside the Makefile
  • decide to drop or switch to it before the next release
  • possibly switch GitHub workflows to use just

Proposal to rename the 'master' branch to 'main'

please note that i am posting the following text / issue description to various projects that i'm (considering myself to be) significantly involved with. in fact it is about a general issue that is not specific to this project. but too often we just focus on the nitty-gritty details of design and implementations whilst operating within, supplying for, and are depending upon much broader and complex technological, social, economic and ecological relationships.

the torture that resulted in the death of George Floyd in May of this year intensified antiracist movements and debates about a colonial heritage that hasn't been overcome (or even compensated for) yet. it also initiated discussions about terminology used in technological contexts, its etymology, and its link to the aforementioned ideologies and practices of discrimination. though circumstances aren't homogeneous across the societies where technological terminology is used, one must acknowledge that the context in which this terminology evolves is American English, which reflects and manifests specific inequalities based on 'race' in the United States of America. thus, the connotations inherent to that language cannot be ignored elsewhere.

i'd have every understanding for anyone who would hesitate to contribute to this project because of language used that reproduces bullshit discrimination. i therefore propose to rename this project's git branch, from master to main. i'd have some imo more interesting, better-fitting alternatives to propose, but main is pragmatic because of its adoption in the Linux kernel VCS (and probable further adoptions that will reflect this), as well as the stable use of auto-completion in a shell.

please refer to this proposal for an RFC to establish an inclusive language within the "tech community", this discussion on the git-related etymology of the master term and this meanwhile accepted patch and related debate that prompted the change in the Linux kernel VCS. as the web is the web, you'll easily find more resources on the topic, possibly in your preferred language.

due to a lack of time on my side, i foresee this change taking place over this year's autumn in repositories where i'm authorized to do so. please consider that as a timeframe for feedback. i'm open to critical arguments on why we should withhold from that change, but trolls will be blocked right away where i have the privilege to do so. for GitHub hosted repositories there's this relevant piece of information.

descendant-or-self axis in TagNode.xpath method

The xpath method in TagNode behaves counter-intuitively when handling relative (descendant-or-self) xpath expressions.

For my convenience's sake, let me show this using the OAI-PMH schema conforming XML response from https://oai.sbb.berlin/?verb=ListRecords&metadataPrefix=mets&set=illustrierte.liedflugschriften. Imagine one would want to extract bibliographic data from each record in the list of records the response provides:

doc = Document('https://oai.sbb.berlin/?verb=ListRecords&metadataPrefix=mets&set=illustrierte.liedflugschriften')
records = doc.xpath('//ListRecords/record')
record = records[0]

Intuitively, one could choose to use a relative xpath expression against each one of those record tagnodes:

creator = record.xpath(
    '//metadata/oai_dc:dc/dc:creator',
    {
        'dc': 'http://purl.org/dc/elements/1.1/',
        'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/'
    }
)

However, the results are not what one would expect (a list with one single TagNode representing <dc:creator .../>); rather, they contain every single <dc:creator/> node from every single <record> node!
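this matches the XPath specification: an expression starting with // abbreviates descendant-or-self anchored at the document root, regardless of the context node. prefixing the expression with a dot anchors it at the context node instead. the stdlib's ElementTree illustrates the relative form (delb's behaviour would presumably mirror lxml's here):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<ListRecords>"
    "<record><metadata><creator>A</creator></metadata></record>"
    "<record><metadata><creator>B</creator></metadata></record>"
    "</ListRecords>"
)
first_record = root.find("record")
# the leading '.' anchors the query at first_record, not the document root
creators = first_record.findall(".//creator")
```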

Test loaders in a separate target

the loaders ought to be tested in a separate target / recipe, and other invocations of pytest shouldn't run them by default as they are time- and energy-consuming.

maybe after #40 is decided, so that there's less chance to miss a spot.

Preferring document loaders

here i raised the concern that document loaders may employ indistinguishable notations for sources. @03b8 pointed out that one would hence need to control a preference of loaders.

here are some thoughts:

a) it is currently possible for application developers to manipulate the delb.plugins.plugin_manager.plugins.loaders object which is a list. that should be mentioned in the docs. this will be included in the 0.2 release.

b) the plugin manager could be enhanced with methods to de-/activate and reorder registered plugins. since both are possible with a), i'd implement that when someone can show that explicit methods would be preferable.

c) also, an additional argument preferred_loaders could be added to Document in order to specify the behaviour not in the application wide scope, but just for one concrete call on Document. as with b), i'd rather go for this when someone needs it.
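option a) amounts to plain list manipulation; purely illustratively (the strings below stand in for the actual loader callables):

```python
# the registered loaders form a plain list, so an application can
# reorder it to express a preference
loaders = ["https_loader", "ftp_http_loader", "file_loader"]


def prefer_loader(loaders, loader):
    # move the given loader to the front so it is tried first
    loaders.remove(loader)
    loaders.insert(0, loader)


prefer_loader(loaders, "file_loader")
```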

Frequently asked questions

this thread is supposed to collect questions for a new FAQ section in the Design chapter. their final formulation and answers shall be discussed in a PR.


Isn't XML an obsolete format for text encoding, invented by boomers and cynically held up by their Generation X apologists? Why don't you put your efforts into developing new approaches such as storing text in a graph database?


Why is your XPath support so poor?


What are your long-term goals with this project?


What's the status of a Rust implementation?

Differentiate exceptions

we should review whether a taxonomy of exceptions makes sense, as the InvalidOperation exception may become overused as the code grows.
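such a taxonomy could start from a common base class; the names below (other than InvalidOperation) are hypothetical:

```python
class DelbError(Exception):
    """Hypothetical common base class for all delb exceptions."""


class InvalidOperation(DelbError):
    """An operation is not permitted on the addressed node."""


class FailedDocumentLoading(DelbError):
    """No loader was able to handle the given source."""
```

a shared base class lets applications catch all library errors with one except clause while the taxonomy below it can grow.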

Facilitate plugin-system and extendability of the Document class

while delb allows an application to extend the handling of any input to get a document out of it, it doesn't allow extending the Document class with interfaces it may need. E.g. a document loader may load from a database, but the resulting instance wouldn't have methods to store it back there.

it'd therefore be helpful to provide two entrypoints that

a) can extend the document loaders
b) can return mixin-classes for the Document class that extend it globally

in order to bind relevant objects or set other properties to new document instances, the document loaders would also return optional keyword arguments for initializations.

@03b8, this leads me to think of snakesist.NodeResource as a (sub)document that could hence be encapsulated in a delb.Document instance. but i'm sick and too unfocused to elaborate more atm.

Proposal for a method that creates a branch of nodes if it doesn't exist yet

particularly for metadata, but also witnessed for translations, there's a rather bloated pattern to ensure that a full branch of nodes exists:

root = Document("<root/>").root
if not root.xpath("./foo"):
    root.append_child(tag("foo"))
if not root.xpath("./foo/bar"):
    root.xpath("./foo").first.append_child(tag("bar"))
bar = root.xpath("./foo/bar").first

the deeper this goes, the more tedious it gets.

the idea is to enable something like this:

root = Document("<root/>").root
bar = root.get_or_create_by_xpath("./foo/bar")

the intrinsic assumption would be that only one node with a given name exists per location step.

shall attribute tests with distinct values also be considered? that'd be neat, e.g. ./titleStmt/title[@type="main"] and ./titleStmt/title[@type="sub"].

the good thing is that there's already an internally used data model for XPath expressions that can be extended.

a variant for CSS selector based expressions would just use their translations, like the query methods do.

for a first release of this feature, it will be marked as experimental.
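the core mechanics can be sketched with the stdlib; this hypothetical helper supports only plain location steps (no predicates or namespaces), matching the issue's singularity assumption:

```python
import xml.etree.ElementTree as ET


def get_or_create_by_path(node, path):
    # walk the path step by step, creating each missing child;
    # assumes at most one node with a given name per location step
    for step in path.split("/"):
        if step in (".", ""):
            continue
        child = node.find(step)
        if child is None:
            # the step's node doesn't exist yet, so create it
            child = ET.SubElement(node, step)
        node = child
    return node
```

the deeper the branch, the more this wins over the if-not-exists pattern above; a second call with the same path is a no-op that returns the existing node.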

Add an include_namespaces attribute to the TagNode.xpath method

Let's assume a node at some level below the root defines its own namespace, without its parents knowing anything about that namespace. Any attempt at addressing said node or any of its descendants in that namespace from the root node would fail, because for the root node it is an unknown namespace. Nor can a node's nsmap be modified.

Therefore, the TagNode.xpath method should have an optional parameter via which you can declare additional namespaces to be used for evaluating the xpath expression.
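for comparison, the stdlib's ElementTree already accepts such a prefix-to-namespace mapping; the document and namespace below are made-up examples:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root><x:item xmlns:x="http://example.com/ns">hi</x:item></root>'
)
# an explicit mapping, analogous to the proposed parameter; the prefix
# used here doesn't need to occur in the document itself
namespaces = {"n": "http://example.com/ns"}
items = doc.findall("n:item", namespaces)
```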

Should the Document constructor use a different parser for collapse_whitespace?

lxml.etree.tostring with pretty_print=True has this caveat:

If lxml cannot distinguish between whitespace and data, it will not alter your data. Whitespace is therefore only added between nodes that do not contain data. This is always the case for trees constructed element-by-element, so no problems should be expected here. For parsed trees, a good way to assure that no conflicting whitespace is left in the tree is the remove_blank_text option [...]

Now, instantiating a delb.Document with the collapse_whitespace flag somewhat feels like it should do away with whitespace in a way that makes the parsed XML suitable for custom formatting, e.g. calling:

  lxml.etree.tostring(document.root._etree_obj, pretty_print=True)

...or something like this. However, in order to be able to pretty print delb content, it is still necessary to use a custom parser on instantiation, e.g.

  document = Document(source, parser=etree.XMLParser(remove_blank_text=True))

...in which case the collapse_whitespace flag of the Document constructor isn't even relevant.

I feel like wanting to pretty-print delb objects is a somewhat justified use case (I needed it today to simplify a test), and I think this behaviour is somewhat obscured right now and should at least be documented. But maybe this could even be handled in a more user-friendly way. Is there a point in using delb.Document with collapse_whitespace without an lxml parser that also removes whitespace, or could the use of such a parser perhaps be implied by collapse_whitespace in general?

Should TagNode have a tostring method with an optional pretty_print flag as well?
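as a point of reference (not delb's API): since Python 3.9 the stdlib can indent a parsed tree in place, which sidesteps the remove_blank_text question for simple cases:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><a><b/></a></root>")
ET.indent(root)  # requires Python >= 3.9; mutates the tree's whitespace
pretty = ET.tostring(root, encoding="unicode")
```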

QueryResults identity

A TEI document has the following <body>:

<body>
<ab>
<s corresp="src:tlaIBUBd4DTggLNoE2MvPgWWka2UdY">
<w corresp="src:tlaIBUBdzQ3wWIW60TVhNy3cRxYmgg"><unclear></unclear></w>
<w corresp="src:tlaIBUBd7n0fy1OPU1DjVU66j2B4Qc"><unclear></unclear></w>
<w corresp="src:tlaIBUBdzMdqTkhlEFpidr4rYPFyro"><unclear></unclear></w>
<w corresp="src:tlaIBUBd8yCk7rXFEayk6Xvs3N1jXE"><unclear></unclear></w>
<w corresp="src:tlaIBUBdyjtg3DJX0rwjyrdHZc26is"><unclear></unclear></w>
<w corresp="src:tlaIBUBd7UZxeumekGAks0Y5ht3nvs"><unclear></unclear></w>
<w corresp="src:tlaIBUBd4NQUh0FikJ0stCGrcxq9wk"><unclear></unclear></w>
<w corresp="src:tlaIBUBd7VHAfkj20bDsv3QzaQ4eoo"><unclear></unclear></w>
<w corresp="src:tlaIBUBd5L0JpJVZUOWoGFGhqXAfqc"><unclear></unclear></w>
<w corresp="src:tlaIBUBd9zrhmbxrkyqpG7t84kiw2s"><unclear></unclear></w>
<w corresp="src:tlaIBUBd18UraAwdkCUgzqNsZplqIw"><unclear></unclear></w>
<w corresp="src:tlaIBUBd8EjhpjvSERfoCg6pz5qYxc"><unclear></unclear></w>
<w corresp="src:tlaIBUBdQbsUzWoU0ZAg6KyIT74EPU"><unclear></unclear></w>
<w corresp="src:tlaIBUBd4C0OAENyE3Ti62GjqGFmto"><unclear></unclear></w>
<w corresp="src:tlaIBUBd8XAOdLv8k7WiQ1f7ZFe3IQ"><unclear></unclear></w>
<w corresp="src:tlaIBUBd37vHGdUYkiYqNF8ExKiO6M"><unclear></unclear></w>
</s>
</ab>
</body>

I can query all <w> nodes using css_select:

  from delb import Document
  d = Document("https://github.com/simondschweitzer/aed-tei/blob/master/files/2235T5FM5VFNLFTZN7P3MXW46U_hiero.xml")
  words = d.css_select('s w')

However, if I run method filtered_by against this QueryResults object, I encounter results that are somewhat counter-intuitive at least for my naive understanding:

  >>> words == words.filtered_by(lambda _: True)
  False

or, closer to a real-world use case:

  >>> words == words.filtered_by(lambda w: w.css_select('unclear'))
  False

even though

  >>> len(words) == len(words.filtered_by(lambda w: w.css_select('unclear')))
  True

Wouldn't it be reasonable, or at least understandable, to expect that running filtered_by on a QueryResults with a predicate that is always true would yield a result that equals that QueryResults?
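the observed behaviour suggests the wrapper compares by identity. value-based equality could be sketched like this (hypothetical, not the current implementation):

```python
class QueryResults:
    """Sketch of a results wrapper with value-based equality."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def filtered_by(self, predicate):
        return QueryResults(n for n in self._nodes if predicate(n))

    def __len__(self):
        return len(self._nodes)

    def __eq__(self, other):
        # compare the contained nodes, not the wrapper instances
        if isinstance(other, QueryResults):
            return self._nodes == other._nodes
        return NotImplemented
```

note that defining __eq__ this way makes instances unhashable by default, which would be part of the design trade-off.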

Name and logo wanted

i'm not particularly fond of the library's name. there's no need to include 'lxml' because the api is intended to be sufficient in itself, and the high-performance backend may be exchanged at some point. regarding the 'domesque' part, i'm not sure whether it requires certain cultural knowledge to understand the nuance.

a project's logo should communicate the what-it-is as well. actually, a straightforward idea would be a tree around which a snake is winding. there could be (two different) fruits of knowledge on that tree. admittedly this has a strong reference to the culture of abrahamic religions.

you're welcome to share your ideas and feedback on those. if you like drawing, why not give it a try?

Add accessors to leading and tailing nodes

XML markup can contain a subset of node types (comments, PIs, maybe CDATA) before and after the root node (which is always a TagNode). at the moment these might only be accessible by fetching siblings from the root node, i'm not sure though.

i propose to add two properties to the Document class: head_nodes and tail_nodes that should mostly behave like lists (with an additional prepend method).

i'm not sure whether or not it should be possible to 'hop' from the last / first items in these to the root node (with next_node and previous_node) and vice versa.

Is the insertion order of NodeBase.add_previous the intuitive behavior?

currently, when multiple nodes shall be prepended to one, the call

A.add_previous(B, C, D, E)

but also:

siblings = (B, C, D, E)
A.add_previous(*siblings)

results in node siblings ordered like:

E D C B A

whereas this is an alternative:

B C D E A

as far as i remember, i chose the current order because it behaves symmetrically to add_next. but i'm not sure whether that is what would be expected.

if there even is a behaviour that can be asserted as intuitive, generally or in some scope. with regard to consistency with add_next, both variants can make sense.

the question applies similarly to prepend_child.
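modelling siblings as a plain list makes the two orderings easy to compare; these helpers are illustrative, not delb code:

```python
def add_previous_current(siblings, anchor, *nodes):
    # current behaviour, symmetric to add_next: each node is inserted
    # directly before the previously inserted one, so the first
    # argument ends up adjacent to the anchor
    position = siblings.index(anchor)
    for node in nodes:
        siblings.insert(position, node)


def add_previous_alternative(siblings, anchor, *nodes):
    # alternative: the nodes keep their argument order, so the last
    # argument ends up adjacent to the anchor
    position = siblings.index(anchor)
    siblings[position:position] = nodes
```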
