
lexicographi-sine-finibus's Issues

MVP of [`1603:1:??`] /Documentation about 1603 types of files/

It makes sense to have a dictionary dedicated to documenting what the extensions of the working files (and especially of the generated files) mean in practice.

We're already doing this for the fully numeric names of concepts. While we would still need to explain later the suffixes of these names (which are not part of the file extension), the last thing needing some minimal documentation would be the files themselves.

Generated Cōdex should have license images (CC0-1.0, or CC-PDDC; aka public domain variants) when the license is known

Related:


Each Cōdex can already express which license the main content has, not just per annex. So now, at least for the public domain cases (the variant for recent creative work vs. the variant for work already in the public domain because the creative part happened in the past), each can also have its own image.

Each Cōdex should also explain related files

Related:


Currently each /Cōdex/@lat-Latn both documents all direct information from the main /Dictiōnāria/@lat-Latn and attaches explanations of the lingual and interlingual information which de facto will appear in the /Cōdex/@lat-Latn. This can be improved, but it already works.

However, we also need a similar approach for the archive files themselves. Such documentation needs to be targeted: it should only appear for files which actually are pre-compiled, and already with exact file names.

Reasoning

Each /Cōdex/@lat-Latn has as its target audience humans likely to review or fix errors in the source of the information. However, we know that users who intend to create derived works (or who may be tempted to do a lot of copying and pasting) may not know that we explicitly have machine-readable files.

We're also having issues with Unicode rendering of the graphemes. So more than one option of Cōdex needs to be documented, with extra information.

The PDF (despite being less accessible) has the advantage of allowing embedded fonts, so we're sure end users can at a bare minimum see the content. However, this depends on our realistic skill to implement it correctly. The issue on this point is https://github.com/EticaAI/multilingual-lexicography/issues/13. The Asciidoctor version is already the one we use to prepare the PDF (and could be used to export to many other formats), so we document that version too.

Limitation

The embedded descriptions of files should not be overlong, as this would be better done by dedicated books which explain the files in depth. Another motivation is that the explanation of the files should not be hard to translate; by keeping it short and focused, we somewhat help with this.

Taxonomic strategy to encode individual humans by P-Codes


From an ontological (Basic Formal Ontology) point of view, in #43 the main point could be described as "metadata about an aggregated collection of humans", with a reference key 1:1 compatible with P-Codes (which at the moment uses 1603:16:{unm49}:{cod_ab_level}:{unpcode}). Under this logic, a prefix different from 1603:16: would be used, but as much as possible, it should be possible to make the mappings from the dedicated #43 prefix alone.

Both for #43 and this new topic, as soon as the numerical taxonomy is drafted, the new information becomes metadata attached to the main keys. However, since eventually such nodes could become very overloaded with information which would not be merely linguistic, and some types of "metadata about aggregated collection of humans" could be thematic and overly specialized, we might need to create a default infix (like the number 1) to represent the most generic term (for example for #43, assuming NN as base, the population statistics could become 1603:NN:1:{unm49}:{cod_ab_level}:{unpcode}), and when new relevant thematic groups appear (but which would still be "metadata about aggregated collection of humans"), the way to distribute the data could have different packages.

Under ideal circumstances, since #43 would be an aggregated version of this issue, whatever the infix "1" would be, they could be the same. So, let's say "healthcare workers" would be relevant enough to receive an infix like "123": under #43, the metadata would be about the collection (by region), and under this topic, it would be at the individual level.
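As a quick illustration of how such keys could look, a minimal sketch follows (the base "NN", the infixes, and the P-Code numeric part are placeholders, not decided values; only the UN M49 code 508 for Mozambique is a real code):

```python
# Minimal sketch, with placeholder values, of keys following the pattern
# 1603:<base>:<infix>:{unm49}:{cod_ab_level}:{unpcode}. "NN" is the still
# undecided base namespace; the P-Code numeric part below is invented.
def numerordinatio_key(base: str, infix: int, unm49: int,
                       cod_ab_level: int, unpcode: str) -> str:
    return f"1603:{base}:{infix}:{unm49}:{cod_ab_level}:{unpcode}"

# Generic infix 1 (as in the #43 population statistics example) vs a thematic infix
print(numerordinatio_key("NN", 1, 508, 2, "1234567"))    # 1603:NN:1:508:2:1234567
print(numerordinatio_key("NN", 123, 508, 2, "1234567"))  # 1603:NN:123:508:2:1234567
```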

Relevance of this namespace

By having a dedicated prefix designed to hold data about individual humans, both sensitive and public data, non-anonymized and anonymized, would ideally appear in a predictable way. The numeric prefix alone could allow decisions to be made (such as enforcing more checks). Most of the time, the type of data here should not be public. But there are cases where it is relevant to have a suggested prefix for data which can be public (even if eventually, over time, laws could change and datasets could be requested to be deleted):

  • (generic reason) having this reduces the need for users to create other ad-hoc numeric prefixes and to understand deeply how to organize the ontology
  • Some data about individuals can be public, even if personal information (which could be used as a key) should not be
    • Examples:
      • politician responsible for an administrative region, by public Wikidata Q ID
      • data about vaccinated people (Brazil does have 169 million vaccinated individuals, and the anonymized data is public; the dataset is over 100GB, but such types of data can exist)
  • (indirect need) other datasets with end-user data might need to group what is individual information (such as a hospital patient, or the victim of a crime), and even if the key is some sort of cryptographically secure hash, "pointing" to this namespace would help tools understand that it is about a person.
    • Sometimes such data would be exchanged outside a country (for example, to get help with some sort of epidemic), so this predictability could allow those who agree on what can be exchanged, and at which level of detail, to not need access to the data. Also, other regions could know that the biggest difference between data from outside the region and their own region could be information which can also be public (like information about the region), while even other related information could have some additional level of anonymity.
      • The biggest argument here is that the mere default namespace does allow automation and checks for data that not even the decision makers of what is allowed should have access to. Also, the way the data is stored on the end-user side could trigger some other special actions (like the target region giving aid monitoring how the data is used, or deleting it after some time)

However, realistically speaking, this topic would be by far the strictest: it represents data which should have higher protection when exchanged outside the place allowed to use its own data, as well as public documentation on how to use one's own data with other public resources. Considering #43, maybe some infix to differentiate, for example, who is a doctor or police officer from who is a patient or crime victim could already simplify such checking a lot: both sides (such as doctors from different world regions) would be annoyed if the rules threw warnings all the time because a human (the doctor) has metadata about where they work (the hospital) and contact information which would be personally identifiable information (such as email), while the patients are anonymized.

TODO: some way to encode "organizations"

Not sure how to encode this right now, but we're likely to need another 2 numeric base namespaces, both for individual organizations and collectives of organizations, such as hospitals. The logic could become similar to what we have here:

  • One infix (such as "1") to mean every type of individual organization (or collective without division by themes)
  • Other infixes for at least the most popular organizations worth encoding (not just hospitals)
    • Some more specialized types (for example, types of hospitals) could become metadata

Synchronization of generated preview versions of dictiōnaria and cōdex files with some CDN


We're using GitHub Pages for free hosting of the dictionaries; however, the binary formats (such as PDF and EPUB) would take too much space, even for a git-tracked repository which is not really intended to keep full history. Even the .adoc formats are already getting too big to preview. It is still worth having them, but we're already at a point where we should have some CDN for the internal files while we're still waiting to be published on humanitarian channels.

GitHub limitations on preview

Non-binary (but large) files already do not render

One of our text formats, the Cōdex [1603:63:101]: //Dictiōnāria basibus dē rēbus vītālibus necessāriīs//, already does not render on the GitHub web interface.


[Screenshot: 2022-03-19 02-41-27]

Example of how it renders:

For the sake of comparison, this is the preview of Cōdex [1603:45:31]: Dictiōnāria de calamitātibus:

[Screenshot: 2022-03-19 02-43-44]

Explicit user agents for read-only automated requests to Wikidata/Wikimedia/SPARQL; API Etiquette

Relevant: https://m.mediawiki.org/wiki/API:Etiquette


As we're starting to make more read-only requests to the Wikidata SPARQL backend, while at this moment they're serial (so in theory we're fine) and are even done while we build the scripts, sooner or later this may not be the case.

The main point of this issue is to implement a User-Agent with hints on how to contact us to report errors. Since this is read-only, the errors would be restricted to too many requests in a short time (mostly caused by errors on the requesting side), so it is better to preventively name our automated agents.
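A minimal sketch of what such a named user agent could look like (the tool name and contact address below are placeholders, not the real ones):

```python
# Minimal sketch: read-only SPARQL request with a descriptive User-Agent.
# The tool name and contact e-mail are placeholders to be replaced.
import requests

USER_AGENT = (
    "lexicographi-sine-finibus/0.1 "
    "(https://github.com/EticaAI/lexicographi-sine-finibus; contact@example.org) "
    "python-requests"
)

def wikidata_sparql(query: str) -> dict:
    """Run a read-only query against the Wikidata SPARQL endpoint."""
    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": USER_AGENT},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```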

MVP of [`1603:84:1`] /Dictiōnāria dentālium/


Topic about doing a minimal viable product of the "Zahnschema", "Zahnbezeichnungen", "dental notation", (...) concept. The provisional contemporary Latin is /dentāle vocābulāriō/.

MVP of [`1603:2600:1`] /tabulam numerae/


This issue is about a Minimal Viable Product of _[1603:2600:1] /tabulam numerālī/_.

Different from ISO 15924 (https://en.wikipedia.org/wiki/ISO_15924), there actually seems to be no coding system to represent numeral systems. They do in fact appear in Unicode CLDR (and in theory could be recompiled from Unicode), but for now we can go minimalistic.

There is also the issue of how we would label such numeral systems.

MVP of read access to Wikidata

This issue is about a Minimal Viable Product of read-only access to Wikidata. One of the main advantages is that its content is already in the public domain, so this would allow generating external datasets for some vocabularies even while the original copyright holders still need a long formal process to allow any type of re-publishable license.


Trivia: Wikidata actually allows extraction of label translations from Wikipedia's related terms, and it's explicitly public domain. This makes any potential care to have very consistent mappings between our codes and Wikidata Q codes very relevant.

Automate SPARQL query generation to Wikidata for items with a given P (property)

One item from #39, P1585 (https://www.wikidata.org/wiki/Property:P1585, //Dicionários de bases de dados espaciais do Brasil//@por-Latn), is actually very well documented on Wikidata, so we would not need to fetch Wikidata Q items one by one.

It's a rare case of something so perfect, but the idea here would be to create an additional option on ./999999999/0/1603_3_12.py to create the SPARQL query for us.

This will obviously need pagination. If with ~300 Wikidata Q items we already time out with over 250 languages on 1603_1_51 (for now using 5 batches), then with something like 5,700 items... well, this will be fun.
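A minimal sketch of the kind of query generation meant here (not the real 1603_3_12.py option; pagination via LIMIT/OFFSET is just one possible strategy):

```python
# Minimal sketch of generating paginated SPARQL for every item holding a given
# property (e.g. P1585). This is not the real 1603_3_12.py implementation.
def sparql_items_with_property(pid: str, limit: int = 500, offset: int = 0) -> str:
    """Return a SPARQL query listing items with property `pid`, one page at a time."""
    return f"""
    SELECT ?item ?value WHERE {{
      ?item wdt:{pid} ?value .
    }}
    ORDER BY ?value
    LIMIT {limit}
    OFFSET {offset}
    """

# Example: fetch in batches of 500 until a page comes back empty
# query = sparql_items_with_property("P1585", limit=500, offset=page * 500)
```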

Strategy to encode strict controlled vocabularies (terms in natural language which match stricter translations)

Quick links on the mentioned use case


Context

Currently, after the hard work of reconciling concepts with existing Wikidata Q items, we're already able to get terms in over 100 languages. The way the Wikipedia ecosystem works (heavy self-moderation) means the baseline result is in general already better than the alternatives. In fact, it is more likely that humans using Wikipedia as a reference and then making corrections would deliver better results than guessing the term translations.

At this moment, we're already able to get these terms, compile them, and re-share them. They do not receive any special labeling.

Example: specialized use cases of controlled vocabularies

Both the Basle Nomina Anatomica (BNA1895) and the Terminologia Anatomica (TA98) are great references for controlled natural languages (yes, I'm aware Latin as a dead language makes this easier) which are known to have later been translated into several languages by expert associations (often at country level).

TA98, despite being the active international reference on human anatomical terminology, has far fewer translations than BNA1895. Also, the adoption of TA98 is not perfect. Even in countries which do have translations (such as Brazil), some researchers, such as this one (link link link), complain that the adoption of the stricter Portuguese version of TA98 is moderate.

In general, experts publishing research quite often still use archaic terms (such as ones BNA1895 would have) even when translations exist.

However, the situation for languages with no official translation of TA98 at all is likely to be somewhat worse.

Important fact for the reader: most (but not all) existing terms from BNA1895 were kept in TA98; terms which are not mere additions tend to be better specializations of old body parts or terminology "simplifications" (which, not rarely, in my personal honest opinion, were made because English speakers preferred to adopt old Greek roots instead of keeping Latin roots; I know this is a personal rant). Anyway, old books with stricter anatomical terminology exist everywhere and could be reused for a global compilation.

Non-Latin examples

I'm not aware of other nomenclature translations, but if they exist, they are likely to be heavily copyrighted terminology, likely the ones from ISO.

They are not relevant for our use cases, as they're not scientific nomenclature.

The focus of this topic

The idea of this topic is both to have at least one real namespace of dictionaries as a practical example AND to make the tooling and general documentation ready for how to encode specialized nomenclature.

Using MVP of [1603:25:1] /partes corporis humani/ #11 as an example, we can encode both Latin (and Portuguese) based on a stricter reference.

This approach does not exclude the usage of terms from Wikidata Q (and, in fact, users could then change Wikidata to adhere to the specialized vocabulary). But since we're using a much smaller subset of terms than the full BNA1895/TA98, it actually is feasible to do it. We can also somewhat get an idea of how Wikidata/Wikipedia already diverges from the stricter nomenclature.

One problem of creating tags for each regional organization (instead of a generic one)

The way TA98 was released (Latin and English) did not take into account any attempt to centralize international terminology. I'm aware copyright plays a role in this, but, like it or not, except for the Latin + English terms of TA98 released in 2011, pretty much every local language relies on books.

In other words: Wikipedia (Wikidata), without any extra effort, is already the closest thing to an international link for such terminology. We may go a step further to make the differentiation (but even this may later be used to correct non-strict terms on Wikidata).

Anyway, even in cases where it is possible to create a special attribute for each organization which could validate terminology variants in each natural language known to have produced, in the past, either TA98 or (much more commonly) BNA1895, it makes sense to have a common attribute, used in addition to the natural language codes, to express that such terms are the ones actually endorsed somewhere.

Nomenclature consistency is more important than copyright (and words alone cannot be copyrighted)

Especially for the nomenclature of anatomy (and our use cases are even fair use of a smaller subset), it is unlikely anyone anywhere will oppose open initiatives to ensure consistency. Discussions such as this one https://www.wikidata.org/wiki/Wikidata:Property_proposal/TA98_Latin_term may create fear about something that does not make sense.

The alternative to this would mean reusing archaic terms (which is exactly what terminologists don't want). And given that we're doing a massive compilation of terminology, it is better (when viable) not to deviate from the latest endorsed terms. It's not just "not wrong", but the right thing to do.

MVP of tooling able to upload/synchronize files with Wikimedia Commons, `1603_3_4.py` (not a bot yet)


It may be relevant (even if not running as a bot, but by human request) to allow synchronizing files, such as tabular data, with Wikimedia Commons.

Related to #26. Also, weeks ago we started to discuss on IRC some place to upload tabular-like formats (or something not as structured as Wikidata).

Wikimedia Commons does have a tabular format (which uses JSON). It is not exactly what we need, but it has somewhat of an advantage: it allows data uploaded there to be used as a source for everything else on Wikimedia wikis.
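For reference, a Commons "Data:*.tab" page stores JSON roughly like the sketch below (keys written from memory, so the exact structure should be checked against the Commons tabular data documentation; the rows are placeholders):

```python
# Rough, from-memory sketch of the JSON stored in a Commons "Data:*.tab" page;
# verify the exact keys against the Commons tabular data documentation.
commons_tab_page = {
    "license": "CC0-1.0",
    "description": {"en": "Example table mirrored from a 1603 dictionary"},
    "schema": {
        "fields": [
            {"name": "codicem", "type": "string", "title": {"en": "Numerordinatio code"}},
            {"name": "wikidata_q", "type": "string", "title": {"en": "Wikidata Q item"}},
        ]
    },
    "data": [
        ["1603_45_31_0", "Q1234567"],  # placeholder row
    ],
}
```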

Anyway, one sandbox use case would be being able to show which Q items we're working on without needing to redirect people outside Wikipedia domains.

MVP of [`1603:25:1`] /partes corporis humani/


A minimal viable product of a Dictionary of general human body parts.

Important: this topic does not aim to translate the full TA98 or BNA1895. Actually, the goal is mostly the general external body parts, not internal organs.

On Terminologia Anatomica 98 (TA98)

TA98 is the closest reference we could use to try to make Latin terms (and, if any, translations) compatible.

TA98 itself is not public domain, but words alone cannot be claimed under copyright.

Another strong disadvantage of TA98 is that only the English terms are published as readily as the Latin terms, so we have a weird global situation where the main point of "Latin as a neutral language" is lost.

On Basle Nomina Anatomica (BNA)

The BNA is by far the most translated terminology we could use as a reference. This is also relevant because, despite BNA being "outdated", there is content in TA98 which has still never been translated.

Eventually (assuming in the medium or long term there is review from experts, not just linguistic review), this work could at least also ensure we have the part of TA98 related to the general human body with strict translations (even for languages never translated in BNA).

On the numeric coding system

I think our internal numeric taxonomy is unlikely to resemble the TA98 numeric codes. TA98 and BNA1895 have far more concepts than this specific dictionary needs, so the TA98 numeric code only makes sense as metadata, not as a key to find other terms.


Changelog:

  • 2021-01-27: added initial text
  • 2021-01-27: changed name to /partes corporis humani/; Added link to working draft online spreadsheet

New data frontend strategy [map]: lightweight data layers for every entry point with location component (focus on non-binary static files)


Fact: data exchange often has some location component. It might not be easy to prepare related data, and it may not be the main focus of what the user wants, but it is possible to key data not only by the numerical taxonomy Numerordĭnātĭo we use, but also by location.

The idea here would be, in addition to the tabular formats which can both work as plain CSVs (and also, via frictionless, be loaded into databases, as per #37), something which could be loaded into tools that typically would use a map. I think that some tools that work with graphs (maybe plugins for Protégé) would already do something, but that would be for #41.

Challenges

(Likely a major issue at standards level) interlinking data related to location without replicating geometry in every file

I might be wrong, but from all I'm seeing (maybe because most GIS tools are strongly focused on desktop use and strong numerical precision), they tend to allow attaching data to administrative locations, BUT... the way most of them do it is by duplicating the geometries when the related data comes from several sources!

Interoperability to change location geometry references and data for the same topic easily

Assuming we solve the issue of allowing (at client level, likely those desktop programs or, with some documentation, web interfaces) geometries to optionally not be duplicated every time a static file contains related data, we come to the second point: allowing end users to change both.

There are reasons (changes in the precision of geometries, or maybe because new data might have small variations) why users might not use the exact same geometry, yet it is still relevant to interlink the data.

Potential first approach

Unless we resort to XML files (or use GeoPackage, but that is a binary format, not really what we want for things that need to allow the user to change parts), one close alternative would be... GeoJSON.

GeoJSON, GeoJSON-LD, ...

GeoJSON is by far the best supported non-XML, non-tabular format. The main complaint (and the reason TopoJSON was created) is that it tends to have a higher file size and take more memory from the user than binary formats.

However, while we still need to test if this can work in practice, I think we can at least start generating GeoJSON and mark the properties associated with each feature in some way that could be understood as stricter RDF (not mere text). We have already been doing this, so it would bring it to a JSON-like format.

Maybe create "dummy" points and mark the real geometries with extensions to GeoJSON

GeoJSON itself does not allow reusing geometries from other places (but we might use something based on JSON Schema or RDF to signal this), but we could at least, for clients that would not be able to understand this, create dummy points, like the centroid of an administrative area.

One advantage of this approach is that a GeoJSON with only single points would add very little extra weight, since most of the file size would actually be what the user really wants as metadata. This extra weight would also be such that we would likely not see a relevant benefit from using TopoJSON (which I think mostly uses arcs to simplify things, but no metadata is changed).
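A minimal sketch of what such a feature could look like (the property names marking the real geometry are hypothetical, purely to illustrate the idea, not an adopted convention):

```python
# Hypothetical sketch: a GeoJSON Feature whose geometry is only a dummy centroid
# point, with non-standard properties hinting at where the real geometry lives.
# The "@id" and "geometry_source" names are illustrative only.
dummy_feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},  # placeholder centroid
    "properties": {
        "@id": "urn:numerordinatio:1603:45:16:0",  # placeholder key
        "geometry_source": "https://example.org/cod-ab/adm1.geojson#0",
        "population": 123456,  # the metadata users actually want
    },
}
```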

Strategies on how the "dummy" points could be replaced:

  • Either command-line instructions (which could be generated as part of the user documentation) or online tools could concatenate the data with the real geometries (most likely used for geometries which would be wasteful to repeat in every dataset, such as administrative regions).
    • This approach also has the benefit of users learning how to merge several related datasets from different subjects into a "final" file.
  • At client side, tools become aware of the exchanged data
    • For web interfaces this would mean implementing this (with JavaScript or something)
      • One obvious advantage (even for tools that would allow importing several GeoJSON layers) is much lower memory usage
    • Desktop tools learn how to interlink the files
      • Not something short term (but it could make sense in the long run) if we already have a considerable amount of data

Not the focus here

At a quick look, it seems that there are very complex and detailed all-in-one servers, like the open source GeoNode https://geonode.org/ or MapServer https://www.mapserver.org/, which would allow deploying pretty much everything. The analogy would be a CKAN, but strongly focused on maps. They do use documented protocols, but unless we find ways to make very simple automated generation of static files to emulate them, trying to create a production-level server just as a frontend for the data is out of scope.

We can, however, automate or document how to ingest data. But at this point the start here is just to make things work at client side while the server simply serves static files in predictable ways. This is why we cannot rely too much on a public data warehouse for everyone.

[1603:??] Geographia (create base numerospace)


We need to create at least one numerospace dedicated to the general topic of what in Latin is called Geographia (which actually has several subgroups, like Cartographia or (without a Latin term) Geographic Information Systems). As soon as this is created and somewhat documented, we could close this topic.

What we have now

Some dictionaries (which are actually not handcrafted, but heavily automated, although we have not formally published them yet) from 1603:45 Normās interimperia, like the issue MVP of [1603.45.16] /"Ontologia"."United Nations"."P"/@eng-Latn #2, would fit this new category. Also, the split is a bit arbitrary, since most concepts are already meant to be used for exchange at the international level.

There are also other dictionaries, like the [1603:45:19] Dictiōnāria dē locī generibus, which are not as automated.

Only one numeric namespace?

Compared to other subjects, geographic information may be the one where we are likely to compile usable dictionaries faster and reach production usage sooner. So in a few years it could be over 1-99 dictionaries. But anyway:

  • We do not have limitations on how many digits a dictionary could have (it could be 1-999, maybe even 1-9999)
  • Dictionaries can be subdivided. We already do this at spreadsheet level, and there is nothing blocking the base group of dictionaries from, instead of [1603:9:99]example, being split later into 3 like [1603:9:99:19]example, [1603:9:99:29]example and [1603:9:99:39]example

Allow specifying preferred fallback languages for Cōdex generation when Latin is not available

  • Relevant to
    • [1603:1:2020] //Guia rápido de lexicografia para colaboradores//@por-Latn #32

The 1603_1_99 (https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=304729260) already stores some translations for what before was hardcoded in the Python script. The idea of this issue is to allow specifying on the command line the preferred order of fallback languages, so we could already use alternatives.
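A minimal sketch of the fallback logic (the command-line option name is hypothetical; only the selection function is illustrated):

```python
# Hypothetical sketch of fallback selection; the real CLI option name (here
# imagined as --fallback-languages lat-Latn,por-Latn,eng-Latn) is not decided.
from typing import Dict, List, Optional

def pick_label(labels: Dict[str, str], fallback: List[str]) -> Optional[str]:
    """Return the first available label following the preferred language order."""
    for lang in fallback:
        if labels.get(lang):
            return labels[lang]
    return None

print(pick_label({"por-Latn": "água", "eng-Latn": "water"},
                 ["lat-Latn", "por-Latn", "eng-Latn"]))  # -> água
```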

What is not in this issue

Not included in this issue are the small Latin terms (example: Tabula contentorum, Praefātiō, Methodī ex cōdice, Rēs dē factō in dictiōnāriīs, ...). They would eventually be translatable (but the idea would be to focus on other writing systems). Someone who understands Portuguese/Spanish can handle the Latin.

MVP of image cover for Cōdex version of each generated group of dictionaries

Quick links


Since we're already generating the PDF format (and EPUB via asciidoctor), one common complaint is the lack of a cover image.

We could maybe create an SVG cover (with the individualized name of each work, since we have a text option) and already add some summary of what is there. For example: number of concepts, maximum number of languages de facto used in that group of dictionaries, and also the number of interlingual codes.

SVG is likely to be the easiest format to work with, even if later it becomes necessary to generate some JPEG/TIFF/PNG for some book format. Or we could use the Calibre command line (https://manual.calibre-ebook.com/generated/en/cli-index.html) to generate the other ebook formats from the EPUB already generated by Asciidoctor.

[`1603:1:??`] /Dictiōnāria dē generibus grammaticīs/


Topic about creating and doing basic integration of a group of dictionaries about grammatical gender.

Eventually, if we have some way to build sentences, it is necessary to know the types of grammatical gender (and actually other features of languages, but those would be other tables). That's why this table is somewhat internal.

Cōdex (PDF files) should work flawlessly with RTL scripts (example use cases: Arabic, Hebrew, Pashto, Urdu, and Sindhi)

Quick links on the topic


This issue is similar (but not equal) to #13 (the one about rendering the characters). The core point here is that the underlying technologies (HTML and PDF) do already support right-to-left; the core issue is being able to generate this flawlessly.

For the Cōdex, we're already using Latin for key terms while listing the terms in all other languages. This approach also means we keep the terms that would need to be translated to a minimum. However, it is still worth at least eventually being able to generate the Cōdex in other prestige dialects for other writing systems.

Cases such as Hant/Hans are likely not to be as problematic (unless we go top to bottom, which is actually feasible). But for the right-to-left writing systems it would be better to publish the entire Cōdex that way: not just the language, but already using the preferred numeric representations instead of 0123456789.

The issue with right-to-left here is mostly the lack of people who know both the technology and the languages. And this becomes critical because the entire document rendering changes. For example, the Asciidoctor developers seem very interested in making this work, but it is hard for them to get both sample documents and people to test.


Changes

  • Added reference to #13

New exported format: frictionlessdata Tabular Data Package + Data Package Catalogs


As the title says, let's do a minimal viable product.
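A minimal sketch of what a Tabular Data Package descriptor for one exported dictionary could look like (resource name, path and field names are placeholders, not the real release layout):

```python
# Rough sketch of a datapackage.json for one exported dictionary; resource
# name, path and field names are placeholders, not the real release layout.
import json

datapackage = {
    "name": "1603-45-31",
    "title": "Dictiōnāria de calamitātibus",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "1603-45-31-no1",
            "path": "1603_45_31.no1.tm.hxl.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "codicem", "type": "string"},      # placeholder columns
                    {"name": "rem_lat_latn", "type": "string"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as handle:
    json.dump(datapackage, handle, ensure_ascii=False, indent=2)
```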

Generic tooling to explain files of published dictionaries (file validation; human explanation)

The way we're documenting how Numerordĭnātĭo dictionaries are released would somewhat already allow not just a human to consult it as a guide, but eventually even very strict checking. We don't need to go to that extreme, but it is viable to start with an explainer of what each field means.

HXLTM (Subset of HXL for multilingual Terminologies)

This format is not as strict as Numerordĭnātĭo. For example, the HXLTM documentation explicitly allows mixing in more tags from HXL dictionaries, so I think we may leave it as it is.

We could maybe just not enforce the need to have a global ID and tolerate languages not documented.

Note: potentially, one way to nearly validate HXLTM would be to export it to Numerordĭnātĭo.

Numerordĭnātĭo

Numerordĭnātĭo (when stored as HXL) is a stricter subset where everything is either concept metadata or linguistic data, and the global IDs work as a numerical taxonomic organization.

Under this approach, pure syntax analysis is quick to implement, but we're aiming to always have at least the most important tables that explain the other tables.

The content we intend to formally release is already likely to be public domain (sometimes this may require stripping metadata; however, as translations can expand file sizes a lot, there are reasons beyond licenses not to do full validations). This means we could actually validate, much more easily, things that otherwise wouldn't be possible using external tools for the "semantic web" et al.

(Internal) Use cases

The way we're bootstrapping dictionaries requires manual work. The syntax of HXL, HXLTM and Numerordĭnātĭo-on-HXL can be well defined, but now we're talking about creative work intended for general public reuse.

One obvious use case is misspellings of the attributes we use when creating new columns. This can also happen when copying and pasting from outdated references.

The second (yet relevant) use case is changed attributes (or knowing which old tables are not updated yet and still use old attributes). In general this is more likely to happen with more recent content, but unless the attributes themselves eventually become numeric, the best we can do to make changes less likely is to document very well why the previous decision was made. Anyway, another alternative could be to have deprecated attributes and document them very well.
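A minimal sketch of such a check, as illustrated below (the allowed attribute set is a placeholder, not the real controlled list):

```python
# Illustrative sketch: flag column headers whose attributes are not on a known
# list. The allowed set below is a placeholder, not the real controlled list.
ALLOWED_ATTRIBUTES = {"conceptum", "codicem", "rem", "i_lat", "is_latn"}

def unknown_attributes(header: str) -> set:
    """Return attributes of an HXL-style header (e.g. '#item+rem+i_lat') not on the allowed list."""
    _tag, *attributes = header.lstrip("#").split("+")
    return set(attributes) - ALLOWED_ATTRIBUTES

print(unknown_attributes("#item+rem+i_lat+is_latn"))   # set()
print(unknown_attributes("#item+rem+i_lat+is_latin"))  # {'is_latin'}: likely a misspelling
```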

External use cases

Not an initial goal, but an explainer of this type could actually generate something like a README.md. One challenge is that such readmes could be in more than one human language.

MVP of `[1603.45.16]` /"Ontologia"."United Nations"."P"/@eng-Latn

Quick links


This issue is about a minimal viable product of encoding all the publicly available P-Codes in numerordinatio. The scripts may need some cron job or manual upgrades over time, but this issue is mostly about at least having a first version.

Replacing ISO 3166-1 alpha-2 with UN M49

P-Codes are prefixed with 2-letter codes, which have the advantage of dealing with leading zeros. So, for P-Codes, leading letters make sense, and they also allow using pure P-Codes as programming variables. However, the way numerordinatio works, we can go fully numeric.
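A minimal sketch of what the prefix replacement means in practice (only a few sample codes shown; the full mapping would come from the UN M49 table, and the numeric part of the P-Code is a placeholder):

```python
# Tiny illustration of swapping the 2-letter P-Code prefix for UN M49 numbers.
# Only a few sample codes are listed; the real mapping comes from the M49 table.
ISO2_TO_M49 = {"BR": 76, "MZ": 508, "UA": 804}

def pcode_prefix_to_m49(pcode: str) -> str:
    """Replace the leading ISO 3166-1 alpha-2 prefix of a P-Code with its UN M49 number."""
    return f"{ISO2_TO_M49[pcode[:2]]}:{pcode[2:]}"

print(pcode_prefix_to_m49("BR1234567"))  # '76:1234567' (placeholder numeric part)
```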

[1603.45.16] vs [1603.45.49]

In theory, [1603.45.16] could be a more specific version of [1603.45.49] (https://unstats.un.org/unsd/methodology/m49/) instead of having its own base namespace. This may change later.

Another point is that, depending on how the numerordinatio is done, the codes could have aliases.


Changes

  • [1603.45.15] renamed to [1603.45.16] (in the US-ASCII alphabet, with K counted, P is the 16th letter, not the 15th).

Strategies to detect, then fix on external sources or report, issues with likely wrong codes themselves (not just translations)

On the temporary namespace 999999 we're already downloading/pre-processing similar datasets. However, even for data reconciliation between sources, in the best case we have missing data, but in the worst case it is likely that even the codes (at least the ones from non-primary sources) may actually be wrong. This is starting to become clear as we turn this into a monorepo.

Please note that I'm not saying "Wikipedia is wrong" (actually Wikidata, which allows public domain reuse, such as translations, so we don't even need web scraping); but this can happen in non-primary sources, such as thesauri or data providers using codes from others, so either Wikidata can be a reflection of this, or it can be someone else sharing data about a concept (such as a location).

This type of experience is also called "data roundtripping", https://diff.wikimedia.org/2019/12/13/data-roundtripping-a-new-frontier-for-glam-wiki-collaborations/. And this is quite relevant for under-financed organizations who already exchange codes.

Potential actions outside here (medium to long term)

If Wikidara "is wrong" we can do it directly. There are even APIs to allow command line operations, but this type of thing I believe would still require human input. First because sometimes the amount isn't worth time to automate, but also because sometimes there are more than one concept on Wikidata (such as cases related to administrative regions with disputed territory; these actually reflect even translation labels).

However, do other ontologies / dictionaries / thesauri have inconsistencies? In simple cases it's missing data. This could be as simple as sending emails and pointing to a link to a spreadsheet where they could update it. However, if it is not only lack of data, but potentially errors in codes (such as sharing data with ISO 3166-1 alpha-2), then the humans could compare and check with others.

Optimizations here (short to medium term)

We're already starting, at every stage, to check that file types are not malformed. This is quite relevant especially for input data from the outside world, but it can also help with tools. It's not perfect, and will not catch more specific human errors, but it helps with quality control.

Then, since the ideal case would be to run jobs from time to time, it is quite realistic that at some point new versions (as happens with tabular data) may have new bugs with such malformed data; in that case the ideal is to keep using the old cached data until a manual human fix. One strategy we're using is to always prepare the new datasets in temporary files and then, after checking at least that the format is valid (later it could be more than this), we replace the final result. This division also helps to know which files actually need to be updated (and to use faster temporary directories), while the file update times also let us be aware of which dependencies require rebuilding everything else.


Additional notes

  • "Primary sources"
    • We must be aware that for very primary sources (think organization saying code for an "English" word like a name for a country) the lack of explaining exactly what this is makes primary sources non-falsifiable by design.
      • Such vagueness (in particular if such organizations do not even actually exchange data related to that topic, such as ISO endorsing other coding) makes it quite convenient for ISO to do a lazy job.
      • If we want to conciliate translations such as the ones from Wikidata, this means potentially helping more the real primary sources which actually explain better instead of wasting time with ISO (which, by the way, prohibit translations). The real primary sources already are more likely to welcome this.
  • Focus on points likely to be human error
    • Sometimes problems happen because of actually underlying political issues (such as territory disputes). However, we were actually more concerned with human error such as when labels in one language are very different or if a standard reuse another (such as ISO 3166 part 1 alpha 2) inconsistently. So at least make them aware can allow change to consider corrections unless they're cannot break internal systems

Mappings between UN P-Codes and Wikidata Q IDs


While the mappings at least at admin 0 ("country level") are straightforward (since we can map the ISO 3166-1 codes used on P-Code prefixes to UN M49), things already get tricky at admin boundary level 1. We know the UN P-Code patterns of at least some regions (such as the P-Codes from Brazil), for which we're even lucky to have a mapping ready to use, like https://www.wikidata.org/wiki/Property:P1585. But we are not sure about the rest.

Why such mappings become relevant

Even if we only manage to make mappings in the best case at admin 1, and only specific administrative regions get very detailed, this alone would already allow getting more data from Wikidata, which is by far the best place where different persons and organizations converge. I personally think (at least as soon as it gets decent) it is worth publishing such mappings as a dedicated public domain dataset, so ITOS or OCHA can at least use it, even if only for internal comparisons. However, this "soon" can take time, and it is more likely that for population statistics such as #43, the data from such mappings would be less accurate than what OCHA has, especially for countries with active crises.

However, in any case, the mappings start allowing us to know many more mappings (including OpenStreetMap and UN/LOCODE). But by no means do I think this will be something ready anytime soon (assuming it is something that could be ready at all, since regions can change over time).

Potential approaches

Tooling specialized to integrate intermediate controlled vocabularies

Note: by "intermediate controlled vocabularies" we're talking about anything that could be used to triangulate what could later be assumed to be an exact match with P-Codes

This topic alone will require creating several scripts and strategies (even if the early ones would become less necessary in the medium term) to start learning how to make the other relations. The ones we should pay more attention to are the ones relevant to run from time to time to discover new changes.

1. (Not sure, needs testing) maybe compare by matching geometries

At the moment we have not attempted to run tools which could do any type of matching by geometries, but while this definitely would need human intervention, maybe it could work.

To reach this point, not only would we need to create the scripts, but likely let them run (maybe weekly or monthly) to check the official COD-ABs against whatever Wikidata uses.

2. Trying to reverse engineer the numeric part of P-Codes (and hoping Wikidata P properties for them already exist)

Since the documentation on how to design P-Codes has, for more than a decade, recommended trying to reuse existing country codes, it is likely that more regions have equivalences such as the IBGE code P1585. The only thing we're sure of is that all P-Codes without the admin 0 prefix are fully numeric (with few exceptions), so this already excludes a lot of potential existing codes.

However, the new problem would be whether other countries have mappings as Wikidata P properties (and whether such mappings are as updated as P1585). Otherwise, even if we could know, country by country, how the P-Codes were designed without trial and error (and that they are a 1:1 match to a P-Code, which, again, we can't take for granted without human intervention), we cannot use it.

In any case, whatever the strategy to map P-Codes to Wikidata Q, we would need to document it very well to allow revision.
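A minimal sketch of the kind of matching implied by approach 2 (treat the prefix stripping as an assumption to verify country by country; all values are placeholders):

```python
# Illustrative sketch of approach 2: strip the 2-letter prefix of a P-Code and
# look up the numeric remainder among the values of an existing Wikidata
# property (here P1585). Whether the remainder really equals the local code
# must be verified country by country.
from typing import Dict, Optional

def match_pcode(pcode: str, p1585_to_qid: Dict[str, str]) -> Optional[str]:
    """Return the Wikidata Q ID whose P1585 value equals the numeric part of the P-Code."""
    numeric_part = pcode[2:]  # drop the ISO 3166-1 alpha-2 prefix
    return p1585_to_qid.get(numeric_part)

# p1585_to_qid would come from a SPARQL dump of (item, P1585) pairs
print(match_pcode("BR1234567", {"1234567": "Q000000"}))  # placeholder values
```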

3. Other inferences

There are several other codes on Wikidata, from OpenStreetMap (https://www.wikidata.org/wiki/Property:P402), UN/LOCODE (https://www.wikidata.org/wiki/Property:P1937), HASC (https://www.wikidata.org/wiki/Property:P8119), to a popular one, GeoNames (https://www.wikidata.org/wiki/Property:P1566; for this one I'm not sure why it sometimes has more than one code for the same place). They might somewhat allow some way to triangulate with P-Codes, but I'm not sure at the moment.

[1603:1:2020] //Guia rápido de lexicografia para colaboradores//@por-Latn


Even though each dictionary should not be managed through GitHub issues, since this topic may imply some small changes to the software and/or using templates to take a dictionary and generate documentation/a guide only with what was stored in the dictionary, let's mention it here.


Edited:

  • 2022-04-24: added link

MVP of `[1603:??:1603]` /HXL/; focus on pre-compiled replacement maps


The HXL standard, except for documenting its own vocabulary (which in fact is relevant here), could have a neutral namespace for putting data there. But the focus of this MVP is [1603:??:1603], in particular a standard way of storing replacement data.

I think most use cases for replacement maps are common misspellings or translating nomenclature from one language to another. Which, by the way, would be very relevant if we go further extracting translations from MVP of read access to Wikidata #3. However, a more immediate use could simply be also generating conversion tables from one coding standard, such as ISO 3166 alpha-2 (https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes), to UN M49 (https://unstats.un.org/unsd/methodology/m49/), the one we use here.
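A minimal sketch of a compiled replacement map being applied (sample entries only; how such maps would actually be stored inside [1603:??:1603] is still to be decided):

```python
# Illustrative sketch of applying a compiled replacement map to one column.
# Sample entries only; the storage format inside [1603:??:1603] is not decided.
ISO3166_ALPHA2_TO_M49 = {"BR": "076", "MZ": "508", "UA": "804"}

def apply_replacement(column: list, replacement_map: dict) -> list:
    """Replace every value found in the map, leaving unknown values untouched."""
    return [replacement_map.get(value, value) for value in column]

print(apply_replacement(["BR", "UA", "XX"], ISO3166_ALPHA2_TO_M49))
# ['076', '804', 'XX']
```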

New exported format: JSON-LD metadata to explain the CSVs, using W3C Tabular Data (Basic implementation only)


We already export CSVs which can be explained by more than one convention. The more popular one is frictionlessdata, mentioned in #35, but the W3C one, while it started based on frictionlessdata, diverged to be compatible with RDF, in a way similar to what happened with JSON-LD (which also allows being mapped directly to anything supported by RDF).

Basic implementation only

The W3C Tabular Data spec is obviously more complex to implement than frictionlessdata, but in case things get complicated, this will focus on a basic implementation only. For example, if we share in a way similar to what CLLD does (https://github.com/dictionaria/daakaka/blob/master/cldf/cldf-metadata.json), the "package" would require each group of dictionaries to also export the subset of other tables it uses (in a way that someone could import the full thing into a database).

Note that it may already be possible to convert frictionlessdata to W3C Tabular Data, so we may implement at least one way to do it.
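A minimal sketch of basic W3C Tabular Data (CSVW) metadata for one of our CSVs (the column names are placeholders):

```python
# Minimal sketch of CSVW (W3C Tabular Data) JSON-LD metadata for a single CSV;
# the column names below are placeholders.
csvw_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "1603_45_31.no1.tm.hxl.csv",
    "tableSchema": {
        "columns": [
            {"name": "codicem", "titles": "codicem", "datatype": "string"},
            {"name": "rem_lat_latn", "titles": "rem_lat_latn", "datatype": "string"},
        ]
    },
}
```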

About a catalog of all datasets: use a different approach (out of scope for this issue)

This would be better moved to a different issue, but while frictionlessdata has some way to organize more than one dataset (a catalog, https://specs.frictionlessdata.io/patterns/#describing-data-package-catalogs-using-the-data-package-format, which I'm not sure is a new feature), there is no direct equivalent with the same strategy as this issue.

The closest would more likely emulate what CKAN does. At https://github.com/ckan/ckanext-dcat there are some drafts about this. However, CKAN implies there is some way to at least paginate the global list (which is obviously not viable using static generation). In the best case, I think either pushing data to a CKAN (or fetching some static version of all datasets) could be one way to synchronize dictionaries.

[Early tests] Pocket versions of Cōdex, even for the PDF versions

Currently, we're still using the defaults from Asciidoctor to generate the PDF versions. We already made several changes to make it flow much better in the ebook versions, but this had the side effect (by removing the use of tables) of making the PDF versions take more pages. Maybe nearly doubling the number of pages.

Now, we have another problem: the ideal would be to lay out the PDF versions in more columns, but this is not easy to accomplish with Asciidoctor. In fact, the main developer argues this feature is not requested much at all. However, even if we do offer ebook versions, people may at first use the PDF versions on mobile. And by creating two (maybe three) columns, we would make that experience bad.

Approach to be tested

Let's just make even the PDF version A5 instead of A4 size. We both fix the usability issue on mobile (for people using PDF instead of ebook), and people using a computer can use the sidebar to access the Table of Contents.

About actually printed versions on paper

One advantage of going A5 is that, without too much explanation, someone willing to print on paper may have the idea to print... two per page. This would make an 800-page A5 group of dictionaries fit on 400 pages of paper.

Actually, in our tests, it is somewhat viable to even fit 4 pages of this new version on a typical A4, and the final font size would not be far different from an average dictionary. That would bring the 800 pages down to 200 pages (or 100 if printed on both sides).

Like I said, most people already use mobile phones. But even those who need paper could still decide how much paper they want for a version.

Other optimizations

We may need to make some further optimizations related to font size and spacing. For example, we use spacing in description lists to give an idea of the level of the information. But even in the ebook versions, some readers remove such spaces entirely. This may be one reason we would have to resort, even in the PDF versions (to not need more than one .adoc), to adding some visual character to indicate the nesting level of the information. Maybe this will be achieved by adding a numeric prefix or alphabetic numbering (to differentiate from the coding numbers).

Naturally, versions not using Latin could resort to other strategies on paper. If we use "I, II, III, IV, V, ...", this should not be replicated in any other dictionary which does not use the Latin alphabet.

[praeparātiō ex automatīs] MVP of idea of automation to pre-process external data to LSF internal file format


This issue is about a minimal viable product of one or more "crawlers" or "scripts" or "converters" that transform external dictionaries (aka the ones we would label origo_per_automata, origin through automation, vs origo_per_amanuenses, origin through amanuenses, the way we mostly optimize for now) into the working format.

Focuses

  • The data we're interested in is already referential data, which is a smaller subset of what is shared
    • It is more important to have less data, but actively updated from the primary source and of very high quality, than to do data hoarding and ignore the important ones.
  • We're really interested in referential data we can document how to use
    • This also means we may intentionally name the data fields in ways that make them easier to document, even if this means automatically generating user documentation
    • The entire idea must allow ways to receive collaborators' help to translate documentation (it does not need to be in the short term, but it should at least be planned from the very start)
  • Referential data can be public; but most information managers will deal with sensitive data
    • The best potential end users, aka the information managers, are likely to ingest all the data as soon as a new emergency happens.
    • Even if information managers have good data proficiency, or know some programming language, they're likely to be overloaded; so we need to make it as easy as possible to mitigate human error (on the reference tables)
  • We're interested in reference data useful for disaster preparedness
    • This makes even more important the idea of optimizing for faster releases, user documentation, care about making it less likely that users would leak sensitive data, and making the data schema interoperable at the international level

External examples of types of reference data

International Federation of Red Cross and Red Crescent Societies | IFRC Data initiatives

Common Operational datasets (overview)

2007 reference (somewhat outdated)

From https://interagencystandingcommittee.org/system/files/legacy_files/Country%20Level%20OCHA%20and%20HIC%20Minimum%20Common%20Operational%20Datasets%20v1.1.pdf

Table One: Minimum Common Operational Datasets

| Category | Data layer | Recommended scale of source material |
| --- | --- | --- |
| Political/Administrative boundaries | Country boundaries; Admin level 1; Admin level 2; Admin level 3; Admin level 4 | 1:250K |
| Populated places (with attributes including: latitude/longitude, alternative names, population figures, classification) | Settlements | 1:100K–1:250K |
| Transportation network | Roads; Railways | 1:250K |
| Transportation infrastructure | Airports/Helipads; Seaports | 1:250K |
| Hydrology | Rivers; Lakes | 1:250K |
| City maps | Scanned city maps | 1:10K |

Table Two: Optional Datasets

| Category | Data layer | Recommended scale of source material |
| --- | --- | --- |
| Marine | Coastlines | 1:250K |
| Terrain | Elevation | 1:250K |
| National map series | Scanned topo sheets | 1:50K–1:250K |
| Satellite imagery | Landsat, ASTER, Ikonos, Quickbird imagery | Various |
| Natural hazards | Various | Various |
| Thematic | Various | Various |

[`1603:1:51`] /Dictiōnāria Linguārum ad MMXXII ex Numerordĭnātĭo/@lat-Latn


While we could pack several external existing language codes for data exchange, we will definitely use some languages much more heavily. Also, some data source providers actually use non-standard codes, so sooner or later we would need to do this.

[`1603:32`] //Librāria de translitteratio//

This topic is a centralized point for all dictionaries related to transliterations.

This could be 1603:1, because it somewhat works as bootstrapping. However, it is an area so important (and which can also get very detailed in the long run) that it is better to have a dedicated namespace.

About the decision of number 32: it is still not 100% final.

Draft of 1603/32/README.md

# [`1603:32`] //librāria de translitteratio//
> **NOTE: this numeric namespace may change. But for now it will be 32**

> Translitteratio[1] est conversio inter systemata scribendi, cum si potis est una littera in systemate valet idem ac littera in alio. Si systemata inter se non congruunt, sonitus litterarum ipsi notandi sunt.
- https://la.wikipedia.org/wiki/Translitteratio
- https://la.wikipedia.org/wiki/Systema_scripturae

- librāria, n, pl, nōminātīvus, https://en.wiktionary.org/wiki/librarium#Latin
- librāriīs, n, pl, dativus
- dē (+ ablātīvus), n, pl, ---, https://en.wiktionary.org/wiki/de#Latin
- dictiōnāriīs, n, pl, ablātīvus, https://en.wiktionary.org/wiki/dictionarium#Latin
- translitteratio, ?, ?, https://la.wikipedia.org/wiki/Translitteratio
- librāria de translitteratio
- librāria dē translitteratio

- /Translitteratio librāriīs/
[32] Translitteratio; T=20, L=12; 20 + 12 = 32
T=20, L=12

## Notes

### mul
- https://en.wikipedia.org/wiki/ALA-LC_romanization


#### rus
- https://en.wikipedia.org/wiki/ISO_9
- https://en.wikipedia.org/wiki/Romanization_of_Russian


### zho
- https://en.wikipedia.org/wiki/Pinyin_table

MVP of Glottocodes


This issue is about the Minimal Viable Product of packing Glottocodes. We have already drafted ISO 639-3 (and its mappings to 639-2 and 639-1). However, Glottocodes (in addition to being very, very well documented) even have a friendly license, so that could make it easier to have official distributions or work together.

Some issues we may face:

  • deciding which number-only code to use for Glottocodes (since every key we use is already numeric).
  • Glottocodes already have an explicit taxonomy as an alternative to the linearized version. So we may actually rework our own Numerordinatio (https://numerordinatio.etica.ai/) tools just to make even the friendly content as good as theirs.

MVP of dictionaries' direct dependency on other dictionaries in the Cōdex format


From this tweet https://twitter.com/fititnt/status/1503494979922153474: while we're already aware that a Cōdex about human rights violations could be complex, it turns out that even the most basic Cōdex (the generic ones) cannot be done without dependencies.

The [1603:63:101] //Dictiōnāria basibus dē rēbus vītālibus necessāriīs// (which at this moment goes over 950 pages in A5 format) is somewhat borderline viable to create as an MVP as a single Cōdex, but human rights violations are quite challenging.

Let's use this as an example, from very top to bottom.

Example use cases

Crimes Against Humanity

At a quick look, we can obviously compile terms such as "genocide" and even do basic categorization of types of genocide.

This can work pretty well for using dictionaries for the translation of terms.

However, it doesn't help to explain what can be considered genocide. Yes, it is possible to write down some standard definition, but then we would need to start explaining the terms of this definition.

Also, beyond having translations such as types of genocide, our draft already has as concepts the international treaties which explain the general ideas in depth. So some dictionaries could already reference external documents.

Also note that there's a difference between the terms for "genocide" when we translate, and what has gone to a formal trial in international courts. So at this level, we already need to start breaking up dictionaries.

Human Rights Violations (at individual level)

The idea of crimes against humanity needs at least dictionaries about crimes against humans (when not because they're part of a collective of humans, and when not done by people who have the responsibility to protect them).

In these dictionaries, again, we can start having concepts such as crimes like "rape", and then start breaking them down into types for which we can find translations and which are immediately useful for internal data exchange as a strict single concept.

But then, again, a problem: while breaking rape into more focused concepts does allow translations, we miss the opportunity, necessary for data collection, of individual facts that can be attested before the conclusion of whether it is or is not "rape".

Note that even the act in which another person is killed can have several terms; it can go from self-defense to planned assassination. Even types of suicide can range from simple suicide to murder-suicide, or political suicide. Or suicide bombing. Even if we distribute translations, and they are accurate, the decision behind them depends on several factors.

Also, we have cases where it is not possible to collect evidence directly. A common case is not being able to tell if a person was killed, because we need a concept of forced disappearance and likely several others to allow some conclusion about whether someone could be assumed dead. Similar findings could be drawn about rape: when it is not viable to collect evidence, we would need concepts which could be used to explain why the evidence was not collected (such as the person being detained for a long time before direct evidence could be collected).

Again, I'm not saying it is impossible to use the terms. It is possible. But to maximize the dictionaries' usability, it makes sense to have more dependencies. Another good reason for this is that implementers (the ones with access to very sensitive data) start to be able to make cross-comparisons between jurisdictions and depend less on ad hoc translations. The concepts which allow more details should have them. And this is easier with more observable events than with conclusions about such events.

Generic dictionaries about torture methods

While torture methods could go to a deeper level of detail (waterboarding, for example, could be done in several ways), here we can start to have dictionaries which are more directly usable.

Generic dictionaries about ways of death

This one, again, may by itself be broken into smaller parts. However, it is more easily observable, except when there is an intention to hide evidence.

But it is one necessary as part of data collection with more details.

Burial method (including cremation and types of unmarked graves)

This one is very relevant for data collection on more serious human rights issues. And even mass graves need differentiation, for example whether it was for reasons of an epidemic; so with dedicated codes this can start to be more detailed.

Relationship between concepts inside the same group of Dictiōnāria and in a Cōdex

Some strategy is necessary to reference concepts inside the same local group of dictionaries.

Use case

Currently, the [1603:45:31] Dictiōnāria de calamitātibus (https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1610303107) expands to a Cōdex with 66 pages:

1603_45_31.mul-Latn.codex.pdf
1603_45_31.no1.tm.hxl.csv
1603_45_31.no11.tm.hxl.csv

However, disaster types such as the ones used by ReliefWeb https://reliefweb.int/taxonomy-descriptions#disastertype and endorsed as OCHA Vocabulary (https://vocabulary.unocha.org/) use the same code for different terms. Let's take Tropical Cyclone (GLIDE hazard code: TC) as an example:

"Hurricane", "cyclone" and "typhoon" (GLIDE hazard code: TC) are different terms for the same weather phenomenon which is accompanied by torrential rain and maximum sustained wind speeds (near centre) exceeding 119 kilometers per hour: In the western North Atlantic, central and eastern North Pacific, Caribbean Sea and Gulf of Mexico, such a weather phenomenon is called "hurricanes"; In the western North Pacific, it is called "typhoons"; In the Bay of Bengal and Arabian Sea, it is called "cyclones"; In western South Pacific and southeast India Ocean, it is called “severe tropical cyclones”; In the southwest India Ocean, it is called “tropical cyclones.” (WMO)

Obviously there are other examples, but we need some reasonable way to explain these types of relationships. Also, something such as [1603:45:31] Dictiōnāria de calamitātibus would eventually need to allow referencing other types of controlled vocabularies.
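
The sketch below is a minimal, hypothetical illustration (in Python with rdflib, not this project's actual exporter) of how a concept from [1603:45:31] could reference an external controlled vocabulary entry such as the GLIDE hazard code "TC" via skos:closeMatch. The IRI patterns are assumptions for illustration only.

# Minimal sketch; the concept IRI and the GLIDE IRI below are illustrative
# assumptions, not an existing convention of this project.
from rdflib import Graph, URIRef
from rdflib.namespace import SKOS

graph = Graph()
graph.bind("skos", SKOS)

concept = URIRef("urn:example:1603:45:31:9999")    # hypothetical concept IRI
glide_tc = URIRef("urn:example:glide-hazard:TC")   # hypothetical external code IRI

# "Close match" rather than "exact match", since GLIDE TC groups several terms.
graph.add((concept, SKOS.closeMatch, glide_tc))

print(graph.serialize(format="turtle"))   # assumes rdflib 6.x (serialize returns str)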

New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature)


Current known context at the moment

  • The #36, if followed strictly, would allow creating a package importable into some database. But if we do it that way, it would require duplicating more CSVs for each focused base of dictionaries
  • The #35, from frictionless, has an experimental feature (I just did a quick test, and it somewhat works) which allows writing a populated SQLite database from a datapackage.json
  • The entire 1603 is already designed to be friendly enough to allow users to have everything as a local copy
    • Unlike the generic datasets most data portals ingest, the dictionaries we produce are very structured
      • Since we use 1603 as a global prefix, if the dictionaries are already in a database, users could use other global prefixes to ingest actual data and then use SQL to manipulate/transform real-world data (an alternative to working with CSVs directly)
  • The way we already structured the dictionaries, some from [1603:1] are already required to generate each Cōdex. They already somewhat have an implicit schema, but the CLIs can work with plain text (the CSVs)

Idea of this issue

TODO: experimental CLI feature to bootstrap a database from selected dictionaries (...somewhat equivalent to bootstrapping a data warehouse)

It does not make sense to pre-generate binary databases for end users; it would somewhat be a waste of space. Also, users could be more interested in some dictionaries than others, so even a near-single global database would be too big, would potentially be in an inconsistent state from time to time, and would obviously make the compilation times absurdly long.

However, sooner or later people (or at least we, for our internal use) may want to ingest everything of interest into some relational database. In fact, this would be a side effect of better data formats for describing the datasets, such as frictionless or W3C Tabular Data.

However, we can save a lot of time (and a lot of pain, like running commands to re-ingest dictionaries one by one) by simply providing a command (even if it uses the experimental features of frictionless data) already optimized to create the full database with pre-selected groups of dictionaries. This would also be more aligned with the philosophy of automating what would otherwise take more documentation, AND it could help get a better overview of the datasets without going through them one by one.
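
As a rough illustration of what such a bootstrap could do (a minimal sketch using only the Python standard library, not this project's CLI and not frictionless itself), the snippet below loads one dictionary CSV into a SQLite table so it can be queried with SQL. The database name and table name are assumptions.

# Minimal sketch: ingest one exported CSV into SQLite. Assumes the first row
# holds the (unique) column headers, e.g. HXL hashtags.
import csv
import sqlite3

SOURCE_CSV = "1603_45_31.no1.tm.hxl.csv"   # assumption: a local copy exists
TABLE = "dict_1603_45_31"                  # hypothetical table name

with open(SOURCE_CSV, newline="", encoding="utf-8") as handle:
    rows = list(csv.reader(handle))

header, data = rows[0], rows[1:]

connection = sqlite3.connect("1603-dictionaries.sqlite")
columns = ", ".join('"{0}" TEXT'.format(name) for name in header)
placeholders = ", ".join("?" for _ in header)
connection.execute('CREATE TABLE IF NOT EXISTS "{0}" ({1})'.format(TABLE, columns))
connection.executemany('INSERT INTO "{0}" VALUES ({1})'.format(TABLE, placeholders), data)
connection.commit()
connection.close()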

Other comments

The common use case here assumes the data related to dictionaries can be re-bootstrapped and, when finished, no more writes would occur (at least not on the reference tables). So SQLite would be a perfect fit (even for production use and huge databases, as long as no concurrent writes are necessary). However, PostgreSQL (or whatever the user would want to convert the SQLite into) would be another alternative.

Open room for conventions to store Common Operational Datasets (at least COD-ABs)

While the dictionaries we're producing have their index handcrafted (even if the terminology translations are compiled with software), the perfect first candidates to optimize for users to ingest in a predictable way would be the CODs.

Note: in case we fetch data from other sources (such as @digital-guard), the actual use case here would be focused on live data, not archived data.

Before going to CODs, we need to optimize the dictionaries that explain them

To have a sane way to ingest data, we would first need the dictionaries from [1603:??] Geographia (create base numerospace) #31 already done.

Our dictionaries can reuse other dictionaries (so things get better over time), and at least for concepts related to places, the number used to access the dictionary can actually encode the country.

[`1603:1:7`] //Dictiōnāria basibus de rēs interlinguīs//

This topic is very similar to [1603:1:51] /Dictiōnāria Linguārum ad MMXXII ex Numerordĭnātĭo/@lat-Latn #9; however, it is focused on interlingual codes (which are likely to be considered concept-level codes).

The closest we have to this is the [1603:3:12:6] (https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1281616178), which uses Wikidata P properties. In fact, we may reuse that table.

The main reason

At this moment, it is already possible to generate a textual version of each numeric namespace of dictionaries. The [1603:1:51] is already acknowledged automatically: only the languages the human will see in the respective dictionaries are explained.

However, we don't have this for concept-level codes. So we could already move the explanation to dedicated dictionaries.

An interesting challenge (or why centralize what could cause recursive issues)

The way we built this work, it is technically possible for new dictionaries to define even new terms.

For example, the dental dictionaries (MVP of [1603:84:1] /Dictiōnāria dentālium/ #8) do this. Depending on how we build new dictionaries (not user data), this could cause cyclical dependencies (which is not ideal without a very good reason).

With this in mind, whatever the use of this main namespace for interlingual "HXL attributes" / HXLTM "BCP47 -x- codes" becomes, we need to be careful.

On automation vs expansion

Similar to [1603:1:51], this table is likely to be manual work. Ideally, we should keep translations or other explanations to a minimum.

If necessary, we could build a manual table to document other languages (such as English, Portuguese, etc.).

MVP of better management of automated Dictiōnāria + Cōdex updates (necessary for future schedule/cron)

Related

  • Synchronization of generated preview versions of dictiōnaria and cōdex files with some CDN #28

Since we already have several dictionaries (some more complete than others), it's getting complicated to call them manually. The fetching of the Wikidata Q labels in particular is prone to timeouts (it varies by hour of the day), so the tooling needs to deal with remote calls failing, retrying with some delay, or eventually giving up and trying again hours later.

For the sake of this Minimal Viable Product, the idea is at least to start using the 1603:1:1 as the starting point to know all available dictionaries, then invoke them one by one instead of adding them directly to shell scripts.
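
A minimal sketch of this "read the index, invoke one by one, retry on failure" idea follows. The index file name, the command being invoked and the retry parameters are illustrative assumptions, not the project's actual CLI.

# Minimal sketch; assumes the dictionary code is in the first column of the
# index CSV and that some (hypothetical) compile command exists per dictionary.
import csv
import subprocess
import time

INDEX_CSV = "1603_1_1.no1.tm.hxl.csv"   # assumption: local copy of [1603:1:1]
MAX_RETRIES = 3
RETRY_DELAY_SECONDS = 300               # wait 5 minutes between attempts

with open(INDEX_CSV, newline="", encoding="utf-8") as handle:
    codes = [row[0] for row in csv.reader(handle) if row][1:]   # skip header row

for code in codes:
    for attempt in range(1, MAX_RETRIES + 1):
        # Hypothetical command; Wikidata label fetching may time out remotely.
        result = subprocess.run(["python", "compile_codex.py", code])
        if result.returncode == 0:
            break
        print("[{0}] attempt {1} failed; waiting before retry".format(code, attempt))
        time.sleep(RETRY_DELAY_SECONDS)
    else:
        print("[{0}] gave up after {1} attempts; try again later".format(code, MAX_RETRIES))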

New exported format [offline data exploration]: Orange Data Mining project file (generic .ows)


From the entire EticaAI/HXL-Data-Science-file-formats project, Orange Data Mining was one of the easiest point-and-click tools (even among the ones with graphical interfaces) for doing statistical analysis and machine learning on data. It still takes more RAM than Weka (or lower-level work with Python), but that is still reasonable considering most audiences would use Excel for this.

The idea here is to make proofs of concept of generating Orange project files with some basic functionality to analyse data.

Known limitation

Even if we manage to make some templated Orange project files (the way we do now with datapackage.json and csv-metadata.json), Orange can try to infer the categories of the data, but it will fail to decide which one is the "target" for which the user wants to discover relations.

Even for datasets with data that would be worth analysing with Orange, we may either set some dummy target or leave it to the user to select. In this sense, it is not strictly a limitation, but we can't automate this part.

MVP of RDF/Turtle canonicalization/file formatting for generated dictionaries


As we're moving to prepare more data to be shared, RDF triples may be easier to compare than the same data in 250+-column CSVs (where one change updates the entire row), but different tools vary in the way whitespace, line breaks, etc. are handled, so we need to think about this to reduce noise in diffs.

The https://json-ld.github.io/rdf-dataset-canonicalization/spec/ and several of the works and papers it mentions discuss this in depth. Some of them even go as far as making digital signatures to assert that one RDF triple, or a group of them, really comes from a given source, but this is not in scope now, not only because of lack of tooling, but because we really need to fix the file diffs first.

The MVP

The idea here (before we start generating very, very large RDF files which will naturally evolve over time) is to create some tool, or documentation, establishing conventions for the Turtle outputs in such a way that every generated file uses them.

Eventually this could be improved, but for now, if we do not do this, the repositories which receive updates will increase in size because of something which could have a simple solution sooner.
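
One low-tech convention that could serve as a starting point (a minimal sketch with rdflib, not the full RDF Dataset Canonicalization algorithm) is to serialize each graph as N-Triples and sort the lines, so regenerating the same data always yields the same bytes and diffs stay small. File names are illustrative; blank nodes would still need extra care.

# Minimal sketch: stable, diff-friendly ordering via sorted N-Triples.
from rdflib import Graph

graph = Graph()
graph.parse("1603_45_31.ttl", format="turtle")   # assumption: a generated file

ntriples = graph.serialize(format="nt")          # assumes rdflib 6.x (returns str)
canonical = "\n".join(sorted(line for line in ntriples.splitlines() if line.strip()))

with open("1603_45_31.sorted.nt", "w", encoding="utf-8") as handle:
    handle.write(canonical + "\n")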

Cōdex (PDF files) should display characters of all languages it contains

Quick links on the topic


Context: we managed to create the first versions of the "Cōdex" (which at the moment is what we call a book-like version of the dictionaries, which not only is one alternative way to see the results, but can also explain how the compilation was done without needing to explain the entire Numerordĭnātĭo). We are already able to compile from community translations (such as public domain Wikidata) for up to 200 languages.

The core issue: our PDFs can't display several of those 200 languages. The technology does allow embedding fonts (so characters can render regardless of what is installed on each computer), but we have not implemented it correctly yet. This eventually needs to be addressed.


At the end, there is a screenshot. But as a reference, on Ubuntu 20.04 LTS, the strategy used to be able to render nearly all scripts was this:

sudo apt install fonts-noto fonts-noto-color-emoji
#  Download: 234 MB
#  Disk after: 663 MB

Very likely we're still missing fonts, but note that the average implementer reading the data files would already need some help on how to render them. This also gives an idea that fonts covering all world languages are likely to require a huge amount of disk space.

The second part of the screenshot shows our first attempt to generate PDFs without following any additional guide (such guides do exist for ASCIIDoctor PDF generation) to deal with fonts.


(Screenshot: Captura-de-tela-de-2022-02-03-21-50-09)

Planned procedural generic strategy to generate reversible numeric codes for concepts for which global standards have no writing-system-neutral coding alternative

Numerical codes are not only computationally efficient and easier to use when defining large amounts of codes (such as for internal divisions or organizations of a country, and also used by modern standards such as Terminologia Anatomica), but also much better suited to multilingual lexicography.

By "neutral codes" people sometimes think as if different regions hate other alphabets, but actually there are serious usability issues. For example using US-ASCII alpha (which, by the way, is not full Latin alphabet) no matter how hard an average person (not only) native speaker of any Arabic dialect, simply they can't pronounce all letters because several sounds are uncommon. Such a fact actually does happen inside languages which do use the Latin alphabet, to a point of usage coping mechanisms such as using the ICAO spelling alphabet to pronounce each letter. But in comparison, the sound of numbers quite often in most languages is very quite different.

Use cases for why it makes sense that even the coordination of lexicography should not have a single point of coordination

TICO-19

One interesting fact was discovered empirically during the lexicography of the [working-draft] public domain datasets from the Translation Initiative for COVID-19 in the HXLTM format (Multilingual Terminology in Humanitarian Language Exchange) (current link: https://github.com/EticaAI/tico-19-hxltm). I will focus on the wordlists (the TICO-19 "terminology" without concept descriptions).

The final result has more errors in non-Latin scripts. This doesn't mean errors did not occur in por-Latn, spa-Latn, ita-Latn, etc. (fun fact: several translations are better than the eng-Latn used as the initial reference), and the most common issue there was "literal translation". However, despite the "terminology/wordlists" having professional translators, the issues in non-Latin writing systems were not caught by quality control, and making it easier to distribute a last-step review would likely improve this.

TICO-19 use case:

  • For a language where the transliteration of "coronavirus" would be "koronavirus" (but in another alphabet), the translation of "CV" (since the translator did not understand that CV was an abbreviation) was rendered letter by letter as "CV" in that writing system instead of "KV" (so users would not see the connection with "koronavirus")
    • Under ideal circumstances, the "ideal" way to prepare translations would be to link terms by concept, so not only languages in non-Latin scripts, but everyone, could know the better variant. However, as the TICO-19 wordlists were done under the urgency of that moment, such attempts at organization could only come later.

I could say more on this topic, but an equivalent quality control would not require that lexicographers (people who compile the results of others) actually know each language; they would just need to know at least one language and know the writing system. This is likely what quality control on TICO-19 already was for languages in Latin script (likely some of them knew more than English, yet the work rested mostly on translators and reviewers).

License issues

Slow response for humanitarian usage

This topic alone would take a full discussion, but even for humanitarian usage, licences are problematic. Emergency translation initiatives require a much faster authorization turnaround than the lawyers of the average copyright-holding organization are able to provide.

Not practical to mention everyone's collaboration on an aggregated result

See also

Also, there are several issues when compiling together the work of different organizations. EVEN if it were possible to know everyone who helped, and they do donate their work for free, how do we handle this? I will leave one example from https://upload.wikimedia.org/wikipedia/commons/1/18/Arguments_on_CC0-licensing_for_data.pdf

(Screenshot: Captura-de-tela-de-2022-01-06-22-03-08)

How numeric codes can both help with international review from different regions and cope with licensing

While there are other use cases, some way to procedurally generate such numbers can help at least with review (or even splitting work across different regions) and with licensing.

In the worst-case scenario, the terms in the initial reference language can be removed immediately as soon as DMCA requests are made. This also copes with the fact that, by default, minimally creative work done by volunteers (who, by the way, as lexicographers using Numerordĭnātĭo, would already have more context to explain the concepts) could not be claimed by any initial implementation.

In practice, this could allow translation initiatives focused on the humanitarian area to start quickly, and still welcome handing the end work over to be validated/reused by the organizations, here aiming at general public benefit. However, if the lawyers of such organizations try to troll, it is up to external lexicography coordinators to remove references to the initial standards for whatever is already not fair use. A typical consequence would mean removing the "copyrighted" source terms (often English and sometimes French) and releasing well-curated versions of everything else in usable, friendly file formats.

Note that in practice it is unlikely such lawyers would go this far against translations for humanitarian use; it is more likely this would be done by inexperienced lawyers or "near-automated" responses.

New data warehouse strategy (graph): RDF/SPARQL graph database populated with dictionaries data (experimental feature)


The way we have organized the dictionaries' entry points has been very regular for some time already, and after #38, there's no reason not to start doing practical tests.

To Do's of this minimal viable product

Export a format easy to parse

While #38 could be used to import into some graph database, it's not as optimized for speed. So it's better that we export at least one format which is easier and more compact to parse than the alternatives intended to be edited by hand.

Do actual tests on one or more graph databases

While the SQLite from #37 is quite useful for quick debugging, we would need at least one or two tests actually importing into some graph database.

We also need to somewhat take into account ways to potentially allow validation/integrity tests of the entire library as soon as it is in a graph database. It would be easier to do it that way.
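
As an illustration of such a validation/integrity check (a minimal sketch using rdflib as a stand-in for a real graph database; the file name, vocabulary and query are assumptions), one could load a generated Turtle file and run a simple SPARQL query listing subjects without any skos:prefLabel.

# Minimal sketch of an integrity-style query over one generated file.
from rdflib import Graph

graph = Graph()
graph.parse("1603_45_31.ttl", format="turtle")   # assumption: a generated file

query = """
SELECT DISTINCT ?subject
WHERE {
  ?subject ?predicate ?object .
  FILTER NOT EXISTS {
    ?subject <http://www.w3.org/2004/02/skos/core#prefLabel> ?label .
  }
}
"""

for row in graph.query(query):
    print("Missing skos:prefLabel:", row.subject)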

[1603:1:6] //Dictionaries of type of term types//@eng-Latn

Quick examples of external references


This table is a linguistic specialization of [1603:1:51] #9 (which explains natural languages plus writing systems).

The [1603:1:51] is insufficient to explain the stricter type of what a term is within a language. This is necessary for interoperability with terminology bases where such differentiation is relevant.

Example of challenges

Defining this numeric namespace of dictionaries is not as hard as dealing with real-world usage. This means we can't simply design something without taking into account how hard it would be to implement.

Why the real world is complicated

When trying to scale up terminology translations, quite often translators will use fail-safe strategies which diverge from what a person would expect.

A quite common situation is someone asking "translate this abbreviation for me" for an organization whose name is in the Latin alphabet while the target language simply... does not use an alphabet. In these more obvious cases it may actually be easier to avoid errors, despite someone asking for an impossible translation into writing systems the individual does not know.

Then the problem becomes languages where the translation would be possible (or the source term is not an abbreviation) but the fail-safe strategy of translators is to generate very verbose translations (like an entire sentence to explain a term). These kinds of nuances mean most first versions of translations are likely to need review in the future. And we also cannot blame the initial translators, because such less strict translations are quite often good initial alternatives.

Organization strategy to deal with numeric namespaces of dictionaries which are handled at administrative level

Context

Currently, we compile dictionaries which are mostly intended for international use. The only country/territory-level ones are from one of the early versions, #2, to which we did not apply any additional data export beyond basic HXL (and, anyway, they are based on an outdated version, mostly intended for testing the way to organize over 100 countries).

Potential approaches (need more testing)

One potential approach could be that, for every sublevel after the entrypoint [1603], we reserve a number (likely based on UN m49) always dedicated to dictionaries related to the administration of the upper level.

However, if we take this approach, we need to think about what to do if a region within an upper administrative region decides to publish dictionaries. Even if we do not publish such dictionaries (at least not at the global level), we may want to at least reserve a second namespace intended to require two intermediate codes (the first one being the UN m49 entrypoint, and the second one being the UN P-code, but without the country prefix). So every country would require at least two numbers always reserved at the second 1603 level.

Use case

Notes:

  • it is always possible for users to use another base prefix than [1603], for example if they want to do a hard fork or need to republish a local version and do a manual review.
  • in particular, dictionaries which are related to administrative boundaries would have a very strong link between the ones released by OCHA and the local ones.
    • Potential TODO: since the regional ones are likely to have much more information linked, we could make inferences from the global one, at least related to place coding.

Brazil

Let's use Brazil as a use case. One potential approach (which

Under this logic

  • [1603:25] Medicina (Global scope by default)
    • [1603:25:49:76] (prefix for UN m49 concept "Brazil", but the area is still expected to be /Medicina/@lat-Latn)
  • [1603:44] Forēnsis scientiae (Global scope by default)
    • [1603:25:16:76:31] (prefix for UN m49 concept "Brazil", UN P-code sublevel "Minas Gerais", but the area is still expected to be /Forēnsis scientiae/@lat-Latn)

Edit

  • Brazil UN m49 -> 076

New exported format: Simple Knowledge Organization System (SKOS) (Basic implementation or better); RDF on Turtle


Simple Knowledge Organization System (SKOS) is a W3C recommendation designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is part of the Semantic Web family of standards built upon RDF and RDFS, and its main objective is to enable easy publication and use of such vocabularies as linked data. -- via https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System

Based on the suggestion in https://dadosabertos.social/t/metadados-legislativos-e-semantica/390/3?u=rocha by @augusto-herrmann, let's add SKOS as an additional exportable format.

Notes

Default codes used for languages, while valid, may intentionally use non-normalized BCP47

Our language codes are valid BCP47, but not in the most normalized format (which would require an IANA registry lookup). This is mostly because we:

  • use ISO 639-3, completely ignoring ISO 639-2; this practice is also used by linguists, including with Glottocodes
    • Example: por instead of pt in por-Latn
    • This approach is less clear for European languages (unless they are minority languages, for which it is very relevant), but it makes it easier to intentionally be friendly to languages which never got into ISO 639-1 and ISO 639-2
  • use ISO 15924 script codes even when they are likely unnecessary;
    • Example: Arab in arb-Arab, Latn in lat-Latn, ...

SKOS does not state best practices for how to encode languages (it just recommends BCP47). So by deciding on this default, we can actually also release the language tables later.
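
As a rough illustration of how such output could look (a minimal sketch with rdflib, not this project's actual SKOS exporter), the snippet below builds one SKOS concept whose prefLabel literals carry the non-normalized but valid BCP47 tags discussed above. The concept IRI and the label values are assumptions for illustration.

# Minimal sketch of SKOS labels with eng-Latn / por-Latn language tags.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDF, SKOS

graph = Graph()
graph.bind("skos", SKOS)

concept = URIRef("urn:example:1603:45:31:9999")   # hypothetical concept IRI

graph.add((concept, RDF.type, SKOS.Concept))
graph.add((concept, SKOS.prefLabel, Literal("tropical cyclone", lang="eng-Latn")))
graph.add((concept, SKOS.prefLabel, Literal("ciclone tropical", lang="por-Latn")))

print(graph.serialize(format="turtle"))   # assumes rdflib 6.x (serialize returns str)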



MVP of taxonomical+packaging strategy to handle collections of the same concept (but incompatible or overly complex to exchange together)

TL;DR of relevant context

  • Taxonomical defaults greatly simplify reuse of tooling, even for unknown data
    • Even in the case of specialized areas with complex rules, this approach could allow others to "shape/redistribute" a recommended collection pointing to that default for their user base, instead of each user learning to configure a new tool
      • This approach also means (at least for public data) that as soon as a new world region has a more specialized use case, the stricter ontology we use to describe it makes the logic reusable for other places
  • Both tabular/traditional databases and graph databases (because RDF "default IRI subjects" are far more useful when data is merged with everything else) need consistency
    • e.g. the user doesn't need to like the provided data (or might have their own data), but decent defaults simplify things A LOT
  • Some providers of thematic data naturally have their own conventions (and they may already distribute such data for several regions)
    • This topic alone means the average end user would love to "swap" data providers without needing to learn too many of the details
      • These details are still necessary for whoever republishes; but they're simplified by end users being able to test things fast
  • Breaking up "same" collections (from different providers, or with heterogeneous forms of publishing) whose end results land on the same final entry points also allows comparability for republishers
    • It also reduces package size. This will matter a lot.
      • and if we do it somewhat smartly, even in the RDF format, instead of repeating EVERY time the source, the time it was revalidated, etc., in addition to the data, we could simply distribute some "update query" to expand that dataset with such metadata
      • The analogy here would be that the user would otherwise need to read documentation about minor details, but instead of using natural language, we add such rules as another file which is machine readable
        • However, this would require some advanced ways to "find" related compatible data, which cannot rely only on the "default" path

1. The challenge

As the title says, while we have a very rudimentary skeleton on @MDCIII (by now only Places), different from #43 and #44 (which mean creating other conventions of default entry points), the challenge here is that we will have data in the same collection which is "incompatible" with others in the same group, AND this incompatibility is not really political (where agreeing with a point of view allows deciding the source of truth).

By incompatible, I mean not truly incompatible. RDF would allow pretty much anything, but most users will not want too many levels of detail, because even if we could document them in natural language, this would not allow them to be automatically validated. And in SQL tables, even if we could automate the merging of data, the user would need to do advanced data cleaning, for example to detect duplicates.

2. The plan

We need to improve the way we describe URNs which, if imported, would point to the same entry-point targets (aka RDF nodes or SQL database names). And this needs to be done smartly enough that we can keep old content updated with as little effort as possible AND without breaking existing things by adding new ones. Obviously we can make global schema changes (it's optimized for that), but we still need to plan for things where even the latest usable version would not make sense in tabular format.

Note that at this point, we're already going to allow implementers to initialize massive amounts of very structured data in ways where they can start doing queries and making inferences. But the title points to something we can't fully solve; we can, however, automate more than half of the work users would have.

In practice, this likely means planning how the naming of the folders of the packages (like the GitHub repositories) would work. We might need some suffix or some way to index variants and allow users to find them.

2.1 Why it is better to do this sooner

If we don't break things up this way, they will soon get too complicated to explain to end users. And we're heavily optimizing to make things understandable.

I know we could in fact eventually automate even the creation of documentation with example queries, but even so, since this type of challenge is not about a mere political decision but about heterogeneous data, it would make the documentation unviable to automate. Incompatible packages of the same collection could each have individual documentation about how to query them, but the mere addition of one heterogeneous dataset could make all the other documentation too hard to tell apart.

3. Example use cases

3.1 Places

It turns out it was naive to think that even the same official reference will have only one recommendation. Using Brazil alone (which in theory does not even have significant territorial disputes), things like calculating the centroid are already impracticable. For example, there's an island over 1000 km out in the ocean, and if we use a map that includes it, the Brazil centroid changes.

Obviously it's a good idea to take into account disputes or a "world view based on the reference of a country", but the average case already needs to be flexible. Also, very often users would prefer to take only one of the versions that co-exist as official if they don't want too many levels of detail.

3.1.1. ...statistics by places

Pretty much any statistics would already need some way to be broken into different collections. But if we take this approach here, it means we would likely also break them up by package instead of pushing everything into the same groups.

3.1.2. Points of interest (like hospitals, schools, etc)

It turns out different sources will not only have different metadata and be community-based or based directly on government data, but... it will be unfeasible to enforce that they point to exactly the same "node".

In practice, this means that the best guaranteed consistency will always be the collection entrypoint (e.g. database name or RDF prefix), but how a user could be sure two collections are talking about the same subject... that's complicated. It's very complicated.

3.1.3. ... OpenStreetMap POIs outdated / data hoarding / lack of engagement with locals

Despite the humanitarian non-profits being considered "a partner" (which often means European or North American), the actual heavy work tends to be done directly on OpenStreetMap and by people not related to them at all. There are even calls to action to decolonize aid inside the mapping community, and I understand why. Asian contributors in particular are very upset with how things are done.

By breaking things into packages, we would also allow bridges such as OpenStreetMap and Wikidata to be re-validated, even if the country level is likely to have far more detail. And I say this because the current "humanitarian-like" use of OpenStreetMap seems to be more focused on discourse than on updatability or relevance for actual use by the local community.

While streets, rivers and other geographic features tend to be stable (maybe with name changes) and are taken seriously by other uses of OSM, the POIs (points of interest) used by humanitarians are not. Even if volunteers could keep them up to date, it is less clear to me how new people can deal with data added several years ago, or whether the approach used would scale in the long term. Places like Brazil tend to synchronize data with official sources or get engagement from local developers if they had to do something similar, but I wouldn't be surprised if the international-level discourse about engaging locals made OSM data even less updated than any locally led initiative. And I don't think this is a problem of the OSM community, but of how internationals love the idea of taking credit without caring for the medium to long term.
