eticaai / hxl-data-science-file-formats Goto Github PK

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)

Home Page: https://hdp.etica.ai/

License: The Unlicense

Python 81.95% Shell 9.41% JavaScript 3.82% HTML 1.63% Common Lisp 0.01% Ada 0.01% Haskell 0.02% Clojure 0.11% Racket 3.03% Scheme 0.02%

data-mining orange-data-mining weka xliff xliff2 tbx tmx utx

hxl-data-science-file-formats's People

Contributors

Stargazers

Watchers

hxl-data-science-file-formats's Issues

[meta issue] HXL and data directly from and to SQL databases

TL;DR:
- This issue is just to have a simple place to put information related to use of SQL storages. It may or not have more than proof of concepts
- To add value (than merely over complicate an database dump to csv and an database import; that could be done by external tools and documentation) **any command line tool should at least read directly from one or more databases and allow create valid SQL file that could be used to import data again.
  - If this becomes too hard, at least we could document scripts to convert CSV to SQL
- Interoperability (think plan taxonomies and how to save equivalent of HXL Hashtags as database column name) are the main objective, even if this mean just prepare documentation and cut performance features that could be used by external tools, not HXL using SQL database directly
  - The direct implication of this (this is my guess, not tested) is that most hxl parser tools, not just the libhxl-python (either directly if implemented or if using as library on tools like here) still likely to not optimize the commands as conversion to SQL select equivalents and still have to work with temporary files (that could still be acceptable fast)
    - In other words: hxl importers/exporters (in theory, not tested) could not break hard if you do not have memory (like some other data mining tools would fill your memory until your computer crash) but large datasets may be slow
  - Please note that even if we compare HXL tools with some programs that can load data from databases, most of them are also optimized for files on disk, or even have to load the entire dataset on memory
    - And some enterprise ones (if already are not expensive) seems to cost even an extra to allow work directly from database than their proprietary file format.
    - But even if HXL tools or HXL-proxy could not be like super optimized for Gigabyte size database processing (like 50ms response time with very optimized index and selects) it could still be useful for who would use HXL to merge several datasets on a single place

This issue is an draft. Some extra information may be edited/added later.

[meta] pre-build packages, automated testing, unitary tests, integration tests; continuous integration (CI), etc

From the tools on EticaAI/HXL-Data-Science-file-formats, the drafted (not yet even as proof as concept) library temporally called hxlm.core (see hxlm #11) is already increasing essential complexity. Even if this library become used mostly used by a few people on @EticaAI/@HXL-CPLP, I believe that the bare minimum would be to add tests so new features don't break past implementations or, if they have to break, at least we know when and what.

This dedicated issue is mostly to have public references if others need to set up similar features. Also continuous integration on it's own is different from code.

Context

The current hxlm.core is written in Python. While the concept was born from an single all-in-one file, HXLMeta (see hxlquickmeta` (cli tool) + HXLMeta (Usable Class) #9), we're drafting a concept (that may be too hard to be feasible) of Declarative Programming (see comment #11 (comment) ) and use YAML syntax to, at least:

reference groups of data (often HDataset + HFile)
reference of how the data can be used/manipulated (and this part is very import to be not only in local language, but designed to fit legal documents); while the internal name on v0.7.3 still hcompliance, in English would be like "acceptable-use-policy"

In a context of the original idea of HXL-Data-Science-file-formats the idea nave minimum viable products to enforce what is "right" and "what is wrong" and have tooling systems that deal with the technical parts could allow exchange sensitive data with fast paced while still respecting laws. Also does exist a need that even who (either semi-automated with by human or totally automated HRouting) could not see themselves sensitive data but still be able to parse metadata.

Please note that even if the point 2 (hcompliance) do have MVPs and plan from start to allow even automated auditing, the idea is make easier work of who already share sensitive data and need to make decisions quickly in the name of someone else while (if necessary) have logs of what was done.

Yes, the idea of automated tests in such context is not overkill compared to the full thing.

Also, all the things here are dedicated to public domain. Including the use of testing.

Evaluating continuous integration tools

At this moment I'm not sure if I should use the GitHub actions (that seems the new thing) or something more traditional like Travis-CI (the open source version).

I know that Travis has very good open source CPU time limits allowed. Not sure about GitHub. I know I could do something like setup an Jenkins, but as I'm also writing the python code (and also that I have no money to let yet more an server to be running for years; and this may need to know past issues) I think Jenkins is not an option now.

Preparation to move `hxltmcli`, `hxltmdexml`, `ontologia/cor.hxltm.yml` and documentation at https://hdp.etica.ai/hxltm to exclusive repository

Label HXLTM:
- https://github.com/EticaAI/HXL-Data-Science-file-formats/issues?q=label%3AHXLTM
Test case for HXLM-Action: datasets from Translation Initiative for COVID-19 "TICO-19" (fititnt/hxltm-action#5)
Some existing related repositóries
- https://github.com/EticaAI/tico-19-hxltm
- https://github.com/fititnt/hxltm-action
Cited test case
- https://tico-19.github.io/
- https://github.com/tico-19/tico-19.github.io

The EticaAI/HXL-Data-Science-file-formats already is some sort of monorepo (see https://en.wikipedia.org/wiki/Monorepo) but if the recent simplifications to make require less dependencies already did not made it better to divide, by trying to apply to more real test cases, like the Translation Initiative for COVID-19 (and assuming any other initiative would have much less people with information technology background, so TICO-19 actually is some of the best case scenarios) I myself believe that the HXLTM, even if some improvements to make more friendly to deal with bilingual files, should at least be much more documented.

Note that in general, bilingual is the supposed to be one of the easier cases (HXLTM focus on multilingual by default). But the way people submitted translations to TICO-19 (as translation pairs) make this type of optimization need.

Beyond just "software documentation"

One of the early challenges on the TICO-19 conversion actually is not even file conversion. Obviously, since is SO MANY LANGUAGUES, the merge back, like described here fititnt/hxltm-action#5 (comment), start to get very repetitive.

Maybe even document how users could drop files on some folder (maybe even with drivers to fetch from Google Drive or other averange user preference, so they would not need know git or something).

The language codes problem

The way different providers use to explain what the terms of a language are is not consistent. And this break hard any automation. Assuming average big providers would follow IETF BCP 47 language tag as per specification is too optimistic, so if they read how to use the hxltmcli /hxltmdexml and the ontologia, is reasonable to assume we will have to give a crash course on other standards.

About minimum standards on how to collect terminology

I will not talk a lot of this on this issue, but even more critical than the decision of language codes be something that really means what someone could submit to some more global initiative, one of the main challenges still how the translations are collected. So, if we create a dedicated place that explains how to use the data convention, and (even without create dedicated "best practices") give intentional nudges on how to cope with anti-patterns on terminology translations, this could give a hint that the quality of translations is heavily based on how well documented is the bootstrapping material.

Potential example approach

Maybe we even intentionally create some specialized tagging subtag for "the case of source translation is not good enough as source term" it be be used as source term when exporting formats intended for receive translations back, like XLIFF. This fix two points:

The first one, is anyone can hotfix translations before generate a new XLIFF, without publicly say that the source term was bad, yet without hurt translations
- This also could be used in case of source language term have copyright.
The second one is tolerate translations from terms that become some sort of standard and cannot be changed because would break software.

Please note that we already have ways to add more description to terms, but if the users don't use that, we could still allow this tricky on documentation.

`hxlquickimport`

Spreadsheet data

See EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport (https://docs.google.com/spreadsheets/d/1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY/edit#gid=1097528220) for updated content. This is an snapshot.

Category	Nome	URL	URL source
#item+category	#item +name	#item +url	#item +source +url
test-dataset	mx.gob.dados_dataset_informacion-referente-a-casos-covid-19-en-mexico_2020-06-01.csv	https://drive.google.com/file/d/1nQAu6lHvdh2AV7q6aewGBQIxFz7VrCF9/view?usp=sharing	https://github.com/CMedelR/dataCovid19
test-dataset	br.einstein_dataset_covid-pacientes-hospital-albert-einstein-anonimizado_2020-03-28_before-HXLate	https://docs.google.com/spreadsheets/d/1GQVrCQGEetx7RmKaZJ8eD5dgsr5i1zNy_UJpX3_AgrE/edit?usp=sharing	https://www.kaggle.com/einsteindata4u/covid19
research-paper	data-mining-for-the-study-of-the-epidemic-sars-cov-2-covid-19-algorithm-for-the-identification-of-patients-sars-cov-2-covid-19-in-mexico.pdf	https://drive.google.com/file/d/1WaW2b7bGiSZjvc4OdA0kjrBtRTkKV11N/view?usp=sharing	https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3619549

`hxl2tab`: tab format, focused for compatibility with Orange Data Mining

Orange Data Miner specification about .tab
- http://orange.biolab.si/docs/latest/reference/rst/Orange.data.formats.html
- https://orange-data-mining-library.readthedocs.io/en/latest/reference/data.io.html
File:
- bin/hxl2tab
- https://github.com/EticaAI/HXL-Data-Science-file-formats/blob/main/bin/hxl2tab

TODO: add more information.

`hxl2arff`: Attribute-Relation File Format (ARFF), focused for compatibility with WEKA, "The workbench for machine learning"

Attribute-Relation File Format (ARFF) specification
- https://waikato.github.io/weka-wiki/formats_and_processing/arff/
File:
- bin/hxl2arff
- https://github.com/EticaAI/HXL-Data-Science-file-formats/blob/main/bin/hxl2arff
Vocabulary/Taxonomy
- ID: EticaAI-Data_HXL-Data-Science-file-formats
- Public URL: https://docs.google.com/spreadsheets/d/1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY/edit#gid=1297379331

TODO: add more information

[meta] HDP Declarative Programming (working draft)

Trivia:

HDP naming:

HDP = 'HDP Declarative Programming' is the default name.

When in doubt (or you or your tools can't detect intent of use in immediate context) this is a good way to call it. See Wikipedia for Declarative programming and Recursive acronym

HDP = 'Humanitarian Declarative Programming' could be one way to call when the intent of the moment is strictly humanitarian.

The definition of humanitarian is out of scope.

The triggering motivation

Context: the HXL-Data-Science-file-formats was aimed to use HXL as file format to direct input on softwares for data mining/machine learning/statistical analysis and, since HXL is an solution for fast sharing of humanitarian data, one problem becomes how to also make easier authorization to access data and/or minimal level of anonymization also as fast as possible (ideally in real time).

The initial motivation for HDP was to be able to abstract "acceptable use policies" (AUP) both understood by humans (think from judges able to enforce/authorize usage to even local community representatives that could write rules knowing that machines could enforce then) on how data could be processed and by machines.
1. The average scenario usage in this context is already with a huge level of lack of trust between different groups of humans.
  1. While by no means (in special because I'm from @EticaAI, and we mostly advocate for avoiding misuse of A/IS) implementing systems would not make mistakes, from the start we're already planning ways to allow auditing without actually needing to access sensitive data.
  2. Auditing rules that could be reviewed even by people outside the origin country or without knowledge of the language would make things easier,
    1. even if this means an human, how know the native language but no programming, could create an quick file to say what one term means
    2. ... and even if it means already planning ahead one way that such translations tables could be digitally signed
Is possible that the average usage of HDP (if actually go beyond proof of concepts or internal usage) may actually be ways to both reference datasets (eve if they are not ready to use, but could be triggered to built) and instructions to process them
1. To be practical, only a syntax for abstract "acceptable use policies" (AUP) without some implementation would not be usable. So this actually is a requirement.
2. At this moment (2021-03-16) is not clear what should be very optimized for end user HDP and what could be on just a few languages (like in English and Portuguese), but the idea of already try to allow use of different natural languages to express references to datasets works as a benchmark.

Some drafted goals/restrictions (as 2021-03-16):

Both the documentation on how to write the concept of HDP and the proof of concepts to implement are public domain dedication. BSD-0 can be used as alternative.
1. No licenses or pre-authorizations to use are necessary.
Be in the user creator language. This means that the underlining tool should allow exchange HDP files (that in practice means how to find datasets or manipulate them) for example in Portuguese or Russian and others could still (with help of HDP) convert the key terms of don't understand such languages
1. The v0.7.5 already drafted this. But > v0.8.0 should improve the proof of concepts. At the moment the core_vocab already has the 6 UN official languages and, because this project was born via the @HXL-CPLP, the Portuguese language.
  2. Note that special care is done with HDP keywords and instructions that would likely to be used by people who, de facto, need to homologate how data can be used. Since often the data columns may be in the native language one or two humans with both technical skills and a way to understand the native language may need to create a filter and label such filter with tasks that accomplish what that users want and then digitally sign this filter.
2. The inner steps of commands delegated to underlining tools (the wonderful HXL python library is a great example!) is not aimed for average end users so, for the propose of this goal and (as new tools to abstract could increase over time) to make easier localization we strictly don't grant them translations.
  1. There exists a possibility that the code editors already show usage tips in English for the underlining tools, that at least for some languages (like Portuguese) we from HXL-CPLP may translate the help messages. Something similar could be done in other languages with volunteers.
The syntax of HDP must be planned in such a way that make it intentionally hard to average user save information that would make the file itself a secret (like passwords or direct URLs to private resources)
1. Users around human rights or typical collaboration in the middle of urgent disasters may share their HDP files with average end user cloud file sharing. Since HDP files themselves may be shared across several small groups (but users only know that the dataset exists, while not requesting it), if in the worst case scenario only the data that one group is affected, the potential damage is mitigated by default.
2. In general the idea is to allow some level of indirection for things that need to be kept private while still maintaining usability.
Be offline by default and, when applicable, be air-gapped network friendly
1. In some cases people may want to potentially use HDP to manage files on one local network because they need to work offline (or the files are too big) while using the same HDP files to share with others, but others could still use an online version.
2. Since HDP collection of files could potentially allow really big projects, soon or later in particular who need to have access to data from several other groups could fear that such level of abstraction could lead to being targeted. While this is actually not a problem unique to HDP potential usage and most documentation is likely to be focused to help who consumes sensitive data on the last mile, at least allow applicability of air-gapped network seems reasonable.
3. Please note that each case is a case. By "being offline by default" doesn't mean that all resources must be downloaded (in fact, this would be opposite of interest of who would like to share data with others, even if they're trusted), but the fact that command line tools or projects that make reference for load resources outside of network need to have at least some explicitly authorization.
"Batteries included", aka try already offer tools that do syntax checks of HDP files.
1. If you use a code editor that supports JSON Schema, the v0.7.5 already has an early version that warns misuse. At the moment it still requires writing with the internal terms used (Latin). But if eventually the schema becomes generated using the internal core_vocab, this means that other languages would have such support too.
The average HDP file should be optimized if it needs to be printed on paper as it is and have ways to express complex but common items of acceptable use policy (as 2021-03-16 not sure if this is tbe best approach) as some sort of constant. (This means the ideal max characters per line and typical indentation level should be carefully planned ahead). (This type of hint was based on suggestions we hear)
1. Yes, we're in 2021, but as friendly as possible to allow the HDP files (in special the ones that are about authorization) being able to be saved as PDF or even on paper, the better. Even in places that allow attach digital archives, while the authorization can be public, the attachments may require extra authorization (like being a lawyer or at least be in person requesting the files).
2. Ideally the end result could be concise enough to discourage large amounts of texts on the files themselves (even if it means we developer like "custom constants" that are part of the HDP specification itself, like an tag that means 'authorized with full non-anonymised access to strict use to red cross/MSF' or 'destruct any copy after not more necessary').
  1. Such types of constants can both help to make rules concise (so worst case scenario if people have to write again letter by letter something that is just not an customization of well know HDP example/template files, its possible) but also would allow with automatic translation
Do exist other ideas, but as much of possible, both by the syntax of HDP files (that may be easier just have translation for the core key terms) and, if necessary, creation of constants to abstract concepts, ideally should allow that the exact file (either digitally signed or with literally PDF of an judge authorization, so the "authorization" could be an link to such file) be able to be understood even outside the original country.
1. Both for how some custom filters may need to be created, or if either the language used was a totally new one, or the original source wrote a term wrong, the idea here is allow an human, who accepts and digitally signs an extra HDP file, can take full responsibility for mistakes.
2. Again, the idea of average HDP files not requiring ways to point to resources or have reference to passwords also is perfect when is made by paper and the decisions (and who create the underlying rules if is something more specific) could be audited.
  1. Note that some types of auditing could be a human reading the new rule or, since the filters start to have common patterns, the filters someone else creates can be tested against example datasets. While not as ideal as human review, as long as some example datasets for that language already exist (think for example one that simulates Spreadsheets malformed but with personal information) could be used against what was proposed to help that rule of the initial user. (this type of extra validations don't need to be public)

`hxl2pandas`: Pandas DataFrame

hxl +public
meta +status	working-draft
meta +discussion+public
meta +id	EticaAI-Data_HXL-Data-Science-file-formats_Pandas
meta +hxlproxy +url	https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D723336363
meta +specification +url	https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes
meta +seealso +url	https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
meta +description +i_eng	Important point: both the `hxl2pandas` and theEticaAI-Data_HXL-Data-Science-file-formats_Pandas reference tableare mostly as reference of how pandas (more specifically DataFrame)could be used as an intermediate format to export HXL to other formatsalready supported by Pandas.While the reference table may still be useful for those who are doingmanual conversionor to help understand how different tools used fordata mining / machine learningwould use HXL attributes, the `hxl2pandas`may not be implemented at all. Alsosome of the intermediate formats maybe converted using other libraries.

At this moment I'm not 100% sure if using pandas just because it allows to export to several formats may be a good approach.

Fist, there is a problem with overhead (but this alone is not main reason). But if the underlining libraries could eventually allow store some additional metadata (like to be able to reconstruction the source hashtags, could would be very nice to have.

The overhead start to become a problem if is 100% granted that the DataFrame loads everything on memory (even if is just numerical representation of strings) before save the formats. While this still more efficient than like load entire Excel file or CSVs, I think that if someone would be using this to convert from an huge CSV, it would be acceptable to be slower, like first save to an local file on /tmp, and then convert the HXLated CSV using the header as additional instructions for whatever would be the new format and use the most efficient loader as possible.

Anyway, this if have to focus, the strategies that generate file formats that do have friendly interfaces (like Orange and Weka; both may not require any command line commands at all to use) seems more an win-win over formats that the end user could simply consume CSVs directly. But these advanced cases can still serve as reference on how to choose the attributes and not just consider two applications (Orange and Weka).

`hxl-yml-spec-to-hxl-json-spec`: HXL Data processing specs exporter

Quick links:

"JSON processing specs for HXL data, David Megginson, 2021-03-11"
- https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/edit#slide=id.p
Test online
- https://proxy.hxlstandard.org/api/from-spec.html

Let's do an proof of concept of the thing!

`hxlquickmeta` (cli tool) + HXLMeta (Usable Class)

One feature of the HXLTabConverter common class #8 (since we're already reading all documentation to see how to make inferences without forcing users to use type hints in all places) actually requires knowing the supposed data types of already HXLated datasets. So, let's break in an separate class [and as much as possible already try to use data structures that could be converted from JSON or something] to create something that actually could make these inferences

The more specific HXL Core hashtags

One advantage of using the hashtag that already is the very own defined on the specification is that the specification for several cases enforce the types. This happens on special for indicators. So, actually, is possible to (at least if is not doing something like brute forcing with the hxlquickimport) be somewhat sure about what to expect from the data columns.

Which accuracy to aim?

Note: "accuracy" in this case means, when the user does not explicitly already enforce on the source HXLated dataset the "data types" or "data flags", suggest something that could be corrected.

In my personal, honest opinion, >90% of the cases is good enough, including making inferences beyond the official documentation (but at this point may need to do some checking on at least a good amount of rows to deduce. But should exist one way that allows users to explicitly enforce (even if it means a more verbose attribute).

Maybe a different approach to tolerate even less accuracy on first try (think like >75%", maybe less) is if is possible to easily import back the exported format (think the .tab from Orange Data Mining, but could be Weka and others) we assume that the data types and data flags (is meta? This can be ignored? Etc) could already be imported back with more data type hints that if exported again would not change.

In other words: for very long spreadsheets, somewhat already optimized to be corrected on an external program. (I think this is much more likely to happen for data flags than data types, in fact we may need to create some way to allow more than one target variable).

How to warn the suggestions outside what already is strictly defined on the HXL Standard

Also, since already do exist the concept of Debug logs, I think when we try to make inferences on the tags that are less than 90% (or maybe we discover that an analysis of 100 (or up to 10.000) the user literally done poor tagging and this is 98% likely to fail on external data mining tools, we still warn the user (This type of feature would be need if trying to brute force with the hxlquickimport, so at least some quick checks could already exist).

[meta issue] hxlm

This issue will be used to reference commits from this repository and others.

TODO: add more context.

Update 1 (2021-03-01):

Ok. I liked the idea of YAML-like projects!!! But may be easier to do the full thing than explain upfront. (I'm obviously biased because of Ansible, but ok; anyway I know is possible to even implement testinfra; but would be easier to create an "Ansible for datasets+ (automated) compliance" than reuse Ansible)

Also YAML, different from JSON, is much more Human friendly (for example: it allows comments!) so this can somewhat help.

Being practical, at this moment I think mostly will be wrapper to libraries and APIs already existing (aka syntetic sugar, not really new features). But as soon as the building blocks are ready, the YAML projects themselves become powerful!

`urnresolver`: Uniform Resource Names - URN Resolver

Quick links

Wikipedia
- EN: https://en.wikipedia.org/wiki/Uniform_Resource_Name
- PT: https://pt.wikipedia.org/wiki/URN
RFC
- Uniform Resource Names (URNs): https://tools.ietf.org/html/rfc8141
- Functional Requirements for Uniform Resource Names: https://tools.ietf.org/html/rfc1737
IANA Namespaces: https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml

"A Uniform Resource Name (URN) is a Uniform Resource Identifier (URI) that uses the urn scheme. URNs are globally unique persistent identifiers assigned within defined namespaces so they will be available for a long period of time, even after the resource which they identify ceases to exist or becomes unavailable.[1] URNs cannot be used to directly locate an item and need not be resolvable, as they are simply templates that another parser may use to find an item." -- Wikipedia

As part of reference the datasets (temporary internal name: hdataset) from different groups (temporary internal name: hsilo) makes sense to have some way to padronize naming. And URNs, even if is complicated to implement in practice, at least could serve as hint for humans simply avoid using whatever is their creative idea at the moment. (This actually is more important if we're implementing localized translations as part of the [meta issue] hxlm #11 with equal equivalent between translations).

[meta] HDP files strategies of integrity and authenticity (hash, digital signatures, ...)

[meta] HDP Declarative Programming (working draft) #16

First things first: one primary goal of HDP files themselves is both to allow exchange of how to reference datasets and how data is allowed to be manipulated and, as consequence, this means auditability. Also, HDP files (at least the ones used for end users) are meant to be usable if printed on paper (think a judge attaching HDP instructions that on worst case someone would have to digit again). HDP files should be human readable

Note that the data themselves can (and by default is!) considered sensitive. But the ideal (and, this means what is being optimized) is that even if people exchange HDP files could do it without fear if the files leak or need to be audited. This means that even if we could make it easier to embed passwords or direct access to private resources on file we're likely to make it intentionally hard, so the average user is likely to simply don't know how to use it.

File Based Encryption of (typically) HDP file is not an goal, but integrity (and some cases, authenticity) is required

1. So what's the point of integrity checks?

One core feature of HDP is having in common vocabulary to allow translation of the HDP files between different human natural languages and do in such a way that whatever was the original natural language written, the file could ideally still keep like the original way.

In other words: if the HDP file is being translated on-the-fly if an user does not understand Modern Standard Arab, we could have multiple teams exchange (maybe even working with the same filesystem!) even if most people don't speak same language

But then one point of improvement happen:

how to check if an on-the-fly translation was not changed?
What if tools make some easy to catch mistakes and now the original file is not reversible on-the-fly?
What if tools that make the hash received upgrade, new hashing do not match? (Note that for this case, since HDP have much, MUCH more moving parts than an static file, users could upgrade old files or at least use external tool, like file based, to test integrity)

Note: the HDP files themselves (as soon as eventually not just Latin Language being the reference, but all other core languages being equally valid) may intentionally need changes. So some way to check can help humans to avoid out-of-sync states

1.1 Some non-cryptographic hashing

Actually to make it feasible to translate from and to other languages we need some integrity check. This is why we need to get it working as soon as possible.

It's not rocket science. Even an MD5-like would do it. This is meant to be used for non-intentional errors.

We may actually use some weak (and explicitly say that is weak) hashing integrity check so the users don't have a false sense of security.

1.1 Authenticated signatures

Authenticated signatures, maybe both with a secret (think password-like string) or public key authentication still worth having. Note that it is always still possible to just do this with entire source files (without using any HDP internal hashing to selectively ignore parts that don't matter) but at some point we may also release some way to allow authenticated/integrity checks also considering internals.

But the main point here is that if the default is not user friendly enough, or it could actually make users experience miserable (like keep track of several secrets just to know the authenticity, and then encourage bad usages) we may enforce everyone.

Also, we're aware one actually the average user base (instead of maybe use Git, like private repositories on GitHub/GitLab/Gitee) is likely way to share would be Google Drive/Dropbox/Etc and (even without considering "State Sponsored attacks'', but actually just someone stealing access from an collaborator to that cloud storage;). So actually may be desirable to use such features if the files themselves are saved outside an secure network.

2. Reflective quote "What's your threat model?" (Extra: memes added)

There are so many potential threat models that, at least in my personal option, we could either go for users' simplicity (while still operational) or go full military-grade authenticity, like use of GPG FIPS compliant smartcards ready to use on air gapped networks.

On image: meme about threat models

Note that I'm very aware that (in special for potential users who create HDP files or process HDP files from others) the ideal perfect usage (think like an information manager working as an data hub for MANY other working groups) is the extreme of air gapped network, but our point here is that HDP files themselves shouldn't require the same level of sensitivity of data themselves. We may not be able to implement the most user-friendly implementations, but whoever processes the data or prepares HDP files to be exchanged, should care that the consumers must have some friendly way to check authenticity.

On image: meme about how we should not use ways to check authenticity (that is different from encryption) that average end user could use it wrong.

3. Opinionated idea about not use security by obscurity or "strong algorithms" used wrong.

This is directed to people who would think that AES 256 is 2 times stronger than AES 128. This is from 2009, but for who undestand English, can give an idea of who just using strong algoritms can make things go wrong https://www.youtube.com/watch?v=ySQl0NhW1J0.

I also really like the idea of we try to focus on acceptable secure that is more likely to not be used wrong. Note that an good part of HDP itself, by allowing multiple natural languages, meet the criteria 2 on '2. Speak the user’s language!':

Source: https://www.usenix.org/sites/default/files/conference/protected-files/hotsec15_slides_green.pdf

In other words, in general maybe the HDP itself as one way to exchange what is meant to be is likely to not implement features that are unsafe for average user, and when is not avoidable implement ones that can go wrong, we still keep simplicity by default while allowing who have advanced threat models fit an HDP on your current workflow.

`hxlquickimporttab`

TODO: add context

HXLTabConverter common class

See:

Spreadsheets
- https://docs.google.com/spreadsheets/d/1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY/edit#gid=245471857
Specifications
- Orange Data Mining Tab
  - https://orange-data-mining-library.readthedocs.io/en/latest/reference/data.io.html
- HXL
  - https://hxlstandard.org/standard/1-1final/dictionary/
  - https://data.humdata.org/dataset/hxl-core-schemas
- Other
  - https://en.wikipedia.org/wiki/Statistical_data_type (for naming usage)
  - http://www.saedsayad.com/ (for general idea of how terms are used)
  - https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/ (Weka may be another exporter)
Affected libraries:
- hxl2tab #2
- hxlquickimporttab #7

(As expected) both hxl2tab and hxlquickimporttab are starting to share common code. Also, while the one-liner Orange Data Mining format is actually very similar to HXL itself (sometimes it just add an extra 2 chararacters before the base hashtag, but uses tab instead of comma) if the user is importing back one format already previously saved on the orange data mining, it uses the non-compact format. So some very basic functionality may need to not only export HXL to .tab , but import back.

The initial idea of HXLTabConverter is a move the already existing functionality of export/import to a single class (even if, for sake of simplicity for one-file executable scripts, for now just duplicate the code).

Compared to the base libhxl-python (https://github.com/HXLStandard/libhxl-python), one downside of the HXLTabConverter is that actually may have to implement some of the schema on the code itself instead of use an external schema in special because for sake of simplicity, it would have to make inferences on the type of some hashtags+attributes without explicity attributes understood by the HXLTabConverter

Minimal documentation about how to use the command line tools

TL;DR: This issue is about how to install the bin/ scripts.

How to install now

The hxl2example (simply output CSV from direct from libhxl) and the hxl2tab at this moment are one-file-put-on-your-path scripts.
As long as the dependencies are already installed they are likely to work.
- Python 3 is an requirement
- The python library "libhxl" (https://github.com/HXLStandard/libhxl-python; https://pypi.org/project/libhxl/) is an requeriment for every tool from this repository
- Some additional tools may require other libraries. At bare minimum this will be documented per script

Installation via pip

At this moment, there is no way to install these scripts via pip.

Maybe this will not be implemented at all, in special if the dependencies between different exporters get too complicated to keep it from breaking a system.

How to create new exporters

Even if this set of tools get complicated over time, at least the bin/hxl2example may be kept as minimum viable product to create some exporter that actually could be shared with others.

Also, without refactoring, other scripts like hxl2tab may still this way for long time. This may be useful, even if for testing, because some exporters may have to implement opinionated HXL attributes, and on worst case scenario if some the conversion table is not external (like remote spreadsheet or local HXLated csv) this can work for ad hoc changes.

`hxl2*`: common issues and/or opinionated decisions on for all HXL D.S. exporters

hxl +public
meta +status	working-draft
meta +id	EticaAI-Data_HXL-Data-Science-file-formats_hxl2
meta +discussion+public	#5
meta +hxlproxy +url	https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D1797514580
meta +description	This dataset contains general ideas of common issues or recommendations thatare expected to affect all HXL2 exports (code: hxl2) or some specific exporter(like hxl2tab for Orange Data Mining or hxl2arff for Weka).This does not mean that all issues here are already solved on eachimplementation, but at least is a place for reference.

This issue is just a common place to comment or reference on commits points that are very likely to affect all (or at least most) data science file exported based on this project.

~~TODO: update with more information~~ done.

[meta] Internationalization and localization (`i18n`, `l10n`) and internal working vocabulary

Quick links:

This issue may be used to make references to the internal working vocabulary and how to deal with Internationalization and localization in special for the [meta issue] hxlm #11.

A lot of work already was done, but in addition to be used internally, for tools like the https://json-schema.org/ that can be used to generate helpers for who uses code editors like VSCode when editing YAML by hand, for allow multiple languages (even for the key, not just he content) eventually we may need to generate the JSON schemas (there is no native way to make JSON Schemas multilanguage).

TODO: add more context

[meta] `HXLm.lisp` and/or related strategies for portable 'Turing complete' HDP custom functions

About this topic

This topic is a draft. It will be referenced on specific commits and other discussions.

But to say upfront that as much as possible, the idea here is keep as much as possible documents that could be used by decision makers to authorize usage and/or for people who document that datasets do exist (even if they do not say on the document how to find them) and, for what is not feasible already have via the underlining python implementation, allow customization.

Note that these customization, while not explicitly sandboxes (but could be) do not need to be allowed to have direct disk or network access. This approach is not just more safe, also open room for they be more reusable and (this is very important!) simplify documentation on how to use, even by individuals who would not speak the same language.

eticaai / hxl-data-science-file-formats Goto Github PK

hxl-data-science-file-formats's People

Contributors

Stargazers

Watchers

hxl-data-science-file-formats's Issues

Context

Evaluating continuous integration tools

Beyond just "software documentation"

The language codes problem

About minimum standards on how to collect terminology

Potential example approach

Meta

Spreadsheet data

The triggering motivation

Some drafted goals/restrictions (as 2021-03-16):

The more specific HXL Core hashtags

Which accuracy to aim?

How to warn the suggestions outside what already is strictly defined on the HXL Standard

1. So what's the point of integrity checks?

1.1 Some non-cryptographic hashing

1.1 Authenticated signatures

2. Reflective quote "What's your threat model?" (Extra: memes added)

3. Opinionated idea about not use security by obscurity or "strong algorithms" used wrong.

How to install now

Installation via pip

How to create new exporters

About this topic

Recommend Projects

Recommend Topics

Recommend Org