punkish / bomfim
extract tags from xml files
License: Creative Commons Zero v1.0 Universal
The huge variance comes from the different handling of subSubSection in Plazi's conversion and in Pensoft's TaxPub XML application.
Only 12 types are used by Plazi; the rest come from Pensoft. Guido might be able to answer.
This is the list of those 12 subSubSection type values:
tag with attr="type" | frequency
-- | --
<subSubSection type="nomenclature"> | 287774
<subSubSection type="materials_examined"> | 138953
<subSubSection type="description"> | 134941
<subSubSection type="distribution"> | 127740
<subSubSection type="reference_group"> | 121168
<subSubSection type="discussion"> | 70094
<subSubSection type="diagnosis"> | 67525
<subSubSection type="etymology"> | 53630
<subSubSection type="biology_ecology"> | 13268
<subSubSection type="key"> | 8415
<subSubSection type="conservation"> | 412
<subSubSection type="vernacular_name"> | 213
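A frequency table like the one above could be produced with a tally along the following lines. This is only a minimal sketch, not the repository's actual extraction code: it assumes the XML files carry no namespaces, and the function name `count_subsubsection_types` and the two miniature documents are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_subsubsection_types(xml_documents):
    """Tally the `type` attribute of every <subSubSection> element."""
    counts = Counter()
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        # iter() walks the whole tree, so nesting depth does not matter
        for element in root.iter("subSubSection"):
            counts[element.get("type", "(no type)")] += 1
    return counts


# Hypothetical miniature documents, invented for illustration only
SAMPLE_DOCS = [
    '<document><treatment><subSubSection type="nomenclature"/>'
    '<subSubSection type="description"/></treatment></document>',
    '<document><treatment><subSubSection type="nomenclature"/>'
    '<subSubSection type="figures on flickr"/></treatment></document>',
]

if __name__ == "__main__":
    tally = count_subsubsection_types(SAMPLE_DOCS)
    for type_name, frequency in tally.most_common():
        print(f'<subSubSection type="{type_name}"> | {frequency}')
```

Counting the raw attribute string, without any normalisation, is what makes spelling variants show up as separate rows.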
None of the subSubSection types in the following table comes from Plazi's PDF-based XML; they are all from Pensoft.
This needs discussion with Pensoft to see how they can be made usable.
tag with attr="type" | figureCitations | frequency
-- | -- | --
<subSubSection type="additional photographic records"> | TRUE | 1
<subSubSection type="additional photos at popovkin"> | TRUE | 2
<subSubSection type="adult fig"> | TRUE | 1
<subSubSection type="caption"> | TRUE | 3
<subSubSection type="figs 1-6"> | TRUE | 2
<subSubSection type="figures in flick"> | TRUE | 1
<subSubSection type="figures in flickr"> | TRUE | 6
<subSubSection type="figures on flickr"> | TRUE | 6
<subSubSection type="figures"> | TRUE | 4
<subSubSection type="illustr"> | TRUE | 2
<subSubSection type="illustration"> | TRUE | 2
<subSubSection type="illustrations"> | TRUE | 5
<subSubSection type="images in nature examined"> | TRUE | 1
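The rule of thumb used in this thread (12 types come from Plazi, everything else from Pensoft) can be written down as a simple lookup. This is a hedged sketch: the names `PLAZI_TYPES` and `likely_source` are hypothetical, and labelling every other type as Pensoft is the thread's working assumption, not a verified fact.

```python
# The 12 type values listed in the first table of this thread
PLAZI_TYPES = frozenset({
    "nomenclature", "materials_examined", "description", "distribution",
    "reference_group", "discussion", "diagnosis", "etymology",
    "biology_ecology", "key", "conservation", "vernacular_name",
})


def likely_source(type_name):
    """Guess the producer of a subSubSection type (assumption: non-Plazi means Pensoft)."""
    return "Plazi" if type_name in PLAZI_TYPES else "Pensoft"


if __name__ == "__main__":
    for name in ("nomenclature", "figures on flickr", "illustrations"):
        print(name, "->", likely_source(name))
```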
Only this one is from Plazi's PDF-based XML:
<subSubSection type="nomenclature"> | TRUE | 287774
All the rest are from Pensoft.
tag with attr="type" | authors & ranks, plus types starting with "no" | frequency
-- | -- | --
<subSubSection type="authors of description"> | TRUE | 1
<subSubSection type="authors of the description"> | TRUE | 1
<subSubSection type="authors’ contributions"> | TRUE | 1
<subSubSection type="nomenclatorial note"> | TRUE | 1
<subSubSection type="nomenclatorial remark"> | TRUE | 1
<subSubSection type="nomenclatural and taxonomic emendations"> | TRUE | 2
<subSubSection type="nomenclatural and taxonomic remarks"> | TRUE | 3
<subSubSection type="nomenclatural and taxonomical notes"> | TRUE | 2
<subSubSection type="nomenclatural comment"> | TRUE | 2
<subSubSection type="nomenclatural comments"> | TRUE | 5
<subSubSection type="nomenclatural note"> | TRUE | 16
<subSubSection type="nomenclatural remarks"> | TRUE | 11
<subSubSection type="nomenclature citation"> | TRUE | 1
<subSubSection type="nomenclature notes"> | TRUE | 1
<subSubSection type="nomenclature of the type species"> | TRUE | 1
<subSubSection type="nomenclature remarks"> | TRUE | 2
<subSubSection type="nomenclature-citation"> | TRUE | 5
<subSubSection type="non-type materials examined"> | TRUE | 1
<subSubSection type="non-type specimens (not collected) photographed in situ"> | TRUE | 1
<subSubSection type="non-type specimens examined"> | TRUE | 1
Only this one is from Plazi's PDF conversion:
<subSubSection type="reference_group"> | TRUE | 121168
The rest all seem to be from Pensoft.
tag with attr="type" | references, plus types starting with "ref" | frequency
-- | -- | --
<subSubSection type="additional literature"> | TRUE | 3
<subSubSection type="citation of original description by amsel"> | TRUE | 1
<subSubSection type="citations"> | TRUE | 3
<subSubSection type="list of references"> | TRUE | 2
<subSubSection type="literature record"> | TRUE | 14
<subSubSection type="literature records"> | TRUE | 256
<subSubSection type="literature"> | TRUE | 4
<subSubSection type="ref-group"> | TRUE | 1
<subSubSection type="refGroup"> | TRUE | 7
<subSubSection type="ref_group"> | TRUE | 3
<subSubSection type="reference phylogeny"> | TRUE | 6
<subSubSection type="reference sequence"> | TRUE | 1
<subSubSection type="reference sequences"> | TRUE | 12
<subSubSection type="reference"> | TRUE | 696
<subSubSection type="referenceGroup"> | TRUE | 10
<subSubSection type="reference_Group"> | TRUE | 2
<subSubSection type="references"> | TRUE | 1272
with the caveat that
<subSubSection type="referenceGroup"> | TRUE | 10
<subSubSection type="reference_Group"> | TRUE | 2
could be misspellings. This must be checked case by case, though; each one probably corresponds to a single article.
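To shortlist candidates for that case-by-case check, type values that differ only in case or separators can be grouped together. The sketch below is an assumption-laden illustration: `spelling_key` and `candidate_variants` are hypothetical helpers, the frequencies are copied from the tables in this thread, and a shared key only flags a possible misspelling without confirming it.

```python
import re
from collections import defaultdict


def spelling_key(type_name):
    """Collapse case and separators so 'reference_Group' and 'referenceGroup' match."""
    return re.sub(r"[\s_\-]+", "", type_name).lower()


def candidate_variants(frequencies):
    """Return groups of type values that share a spelling key (more than one spelling)."""
    groups = defaultdict(dict)
    for type_name, frequency in frequencies.items():
        groups[spelling_key(type_name)][type_name] = frequency
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Frequencies taken from the tables in this thread
FREQUENCIES = {
    "reference_group": 121168,
    "referenceGroup": 10,
    "reference_Group": 2,
    "refGroup": 7,
    "ref_group": 3,
    "ref-group": 1,
    "references": 1272,
}

if __name__ == "__main__":
    for key, variants in candidate_variants(FREQUENCIES).items():
        print(key, variants)
```

Each flagged group still needs a manual look at the source articles before any values are merged.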