
htcds's Introduction

CURRENT VERSION: 2.0

HTCDS Introduction

The Human Trafficking Case Data Standard (HTCDS) is a global format and common approach to collecting and recording case data related to human trafficking. The standard will enable organizations around the world to collect and potentially share information related to human trafficking cases in a consistent way. The HTCDS is intended to be a reference for organizations handling cases related to human trafficking, technology service providers and independent software vendors (ISVs).

Over the last few years, case management technology has become increasingly affordable and accessible to small and large organizations supporting victims on the front line. Systems range from spreadsheet databases through to more sophisticated relational database systems supporting workflows and advanced security models. Some case management systems are operated as business services and others as independent systems. Without open data standards, there is a risk that a growing ecosystem of systems and services will diverge in terms of definitions and data designs. This could preclude future efforts to enable process-centric integration (such as case referral solutions) and to analyze data in aggregate.

Purpose

The primary motivations behind HTCDS are:

  1. Provide common definitions and language describing important aspects of trafficking case data. This will enable more precise comparisons across datasets and geographic regions, as well as helping professionals and leaders describe situations using common language and terminology.
  2. Support interoperability and data exchange between systems, services and organizations. This includes process-centric integration as well as aggregate data analysis such as that provided by the Counter-Trafficking Data Collaborative (CTDC).
  3. Unlock innovation by encouraging technology organizations to develop new systems and services based upon the standard.
  4. Reduce the costs associated with developing case management systems by providing tools to accelerate the development of systems based upon the standard.

Open Data Standards

The HTCDS standard intends to support the major principles behind open data standards. These are described fully in the OpenStand resource (https://open-stand.org/about-us/principles) and referenced in ODI’s open standards guidance. For the terms of use of the site, please see here.

Human trafficking and case management varies widely across geographies, organizations and contexts. Although the primary sponsor of this standard is IOM, the future success of the standard will require collaboration and contributions from a range of organizations, including technology companies, NGOs, and academia. As with other open data standards, the HTCDS is a voluntary standard whose success will depend upon a community developing and implementing the standard so that it remains relevant and useful.

A principle of the HTCDS is that the standard remains as agnostic as possible to the technical solution selected by organizations for implementation. This is to ensure organizations have the broadest range of technology options available, but also to ensure the standard does not preclude new technological advancements developed in the future. More information is provided in the “implementation” section of this standard.

HTCDS Toolkit

The HTCDS Toolkit is a growing collection of tools on popular platforms which aim to accelerate the implementation of the HTCDS standard. The toolkit implements the current version of HTCDS.

You can also learn more about how to manage victim case data in a complementary e-learning course. The course is free, self-paced, and developed in line with the HTCDS.

How Do I Start?

This repository contains several reference documents as well as field standards.

  • Read this document!
  • Read the high level Guidance.
  • Download the HTCDS Field Reference spreadsheet which describes the main fields. The columns in the spreadsheet are described in the HTCDS Field Column Reference.
  • Read the toolkit guidance to see if any of these tools can help you implement HTCDS more quickly.
  • Data Sensitivity: for information on how to create synthetic datasets that have the same statistical properties as the original sensitive datasets, please see here.
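To make the synthetic-data idea above concrete, here is a minimal, illustrative sketch of one naive approach: resampling each column independently from its observed values. The field names (`ageBroad`, `gender`) and values are invented for this sketch, not taken from the HTCDS field reference, and real synthetic-data tools go much further (they model correlations between columns, which this sketch deliberately does not).

```python
import random

random.seed(0)

# Illustrative toy dataset; field names are invented for this sketch,
# not taken from the HTCDS field reference.
original = [
    {"ageBroad": "18-20", "gender": "Female"},
    {"ageBroad": "21-23", "gender": "Male"},
    {"ageBroad": "18-20", "gender": "Female"},
]

def synthesize(rows, n):
    """Draw each column independently from its observed values.

    This preserves per-column value frequencies only; it does NOT
    preserve correlations between columns, which real synthetic-data
    tools explicitly model.
    """
    columns = list(rows[0].keys())
    return [
        {col: random.choice([row[col] for row in rows]) for col in columns}
        for _ in range(n)
    ]

synthetic = synthesize(original, 100)
```

Because every synthetic value is drawn from the original column, no record in the output is guaranteed to correspond to a real person, which is the property the guidance is after.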

The HTCDS is not intended to fully describe all of the elements necessary in a case management system. However, it does offer useful ways to describe many aspects of case data relevant to human trafficking. Whether you are building a spreadsheet or a more sophisticated relational database, these standards should be a useful reference.

Analytics

This initiative was funded by The Global Fund to End Modern Slavery (GFEMS) under a cooperative agreement with the United States Department of State. The opinions, findings and conclusions stated herein are those of the authors and do not necessarily reflect those of GFEMS or the United States Department of State.


htcds's People

Contributors

bgobena, clairegalez-davis, gerardtosserams, lorrainewongmw, philhb, verenasattler


htcds's Issues

methodsOfControl property should be singular, i.e. "methodOfControl"

Referring to the schema: https://hapi.etica.ai/eng-Latn/data-schema/UN/HTCDS/HTCDS.json

      "vic_MethodsOfControl": {
                  "description": "Methods of Control",
                  "type": "string", 
                   ... 
      }

Problem:
From the property name, there could be multiple methods of control (plural). However, the schema defines the expected type as "string". This causes confusion: if multiple methods of control are required, should the developer:

  1. disregard the schema and use an array of strings? [method1, method2]
  2. follow the schema and concatenate multiple methods together, e.g. method1; method2

This is undesirable for data interoperability and sharing.

Proposal:
Change the field name to the singular "vic_MethodOfControl"
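The ambiguity described in this issue can be sketched as code. The record values below are invented for illustration; only the field name comes from the schema quoted above, and the normalizing helper is a hypothetical workaround, not part of HTCDS.

```python
# Hypothetical sketch of the two interpretations developers currently face.
# Values are invented; only the field name comes from the schema above.

record_concat = {"vic_MethodsOfControl": "Threats; Debt bondage"}     # option 2
record_array = {"vic_MethodsOfControl": ["Threats", "Debt bondage"]}  # option 1

def methods_of_control(record):
    """Normalize either representation into a list of methods."""
    value = record["vic_MethodsOfControl"]
    if isinstance(value, list):
        return value
    # Fall back to splitting a semicolon-concatenated string.
    return [part.strip() for part in value.split(";")]

print(methods_of_control(record_concat))  # ['Threats', 'Debt bondage']
print(methods_of_control(record_array))   # ['Threats', 'Debt bondage']
```

Having to write a normalizer like this is exactly the interoperability cost the issue complains about: two systems that each "follow the schema" can still disagree on the wire format.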

Excel Picklists

For Excel users, it should be clear how to clear the picklist or choose more than one option from the list.

Forced Criminality missing from the Exploitation typology

As many of you will know, an emerging issue, particularly in the Southeast Asian region, is people being trafficked into criminal-owned compounds and forced to conduct romance and other online financial scams.

Blue Dragon has been involved with rescuing and supporting some victims, and we are currently unable to identify them effectively in our case management database as the HTCDS doesn't include this form of exploitation.

UNODC have recently developed indicators for the identification and understanding of this form of exploitation - see attached.

I think that the HTCDS needs to include this form of exploitation in the standards. I am not sure of the process for adding something, but I would appreciate consideration of this. It will be particularly necessary for organisations working in SEA, but also important more broadly, so that we can shed light on this particular issue and understand the size of the problem and the demographics of the victims.
UNODC Key Indicators of TIP for Forced Criminality FINAL.pdf

Originally posted by @cmwyndham in #27

Perpetrator

Users should at least have the option to record information about the perpetrator when they can be discerned beyond the location and method of recruitment and control. In choosing instead to collect information about the victim's various vulnerabilities, we run the risk of creating a data story that implies that people were solely responsible for their own trafficking. Ultimately, this favors policy solutions that aim to control or constrain the freedoms of potential victims; meanwhile, we can't answer fundamental questions that are important to policymaking such as "Does the median trafficker in Vietnam have 3 or 350 victims?" So, I would favor at least collecting information such as whether the trafficker is an individual or a business, their nationality, gender, approximate age (if applicable), and other details, as Liberty Shared does.

HTCDS 1.0 just published with still-confusing licensing (conflicts with "open standard") plus serious data exchange flaws for (at least) common person names from Latin America and Asia

I'm glad that the HTCDS 1.0 was just published this Sunday.

Speaking as a lexicographer, I have no problem with the creators of HTCDS or with the United Nations International Organization for Migration employees trying to mediate on what to do; the real target of this complaint is the lawyers responsible for copyright advice, who have produced a confusing license. Note: we are still trying to work out which license we are under, and this is not clear.

Note also that the lawyers, who have not replied to any request for clarification after months, seem likely to be copying the failed model of the ISO organization, even though HTCDS obviously requires much more help because systems worldwide are incompatible. The problem with using ISO as a role model is that ISO actively issues DMCA takedowns against any serious translation initiative, even for the COVID-19 response (their "freely available in read-only format" documents are English/French only). That makes the model unfit for humanitarian use, where a wrong translation kills people and an average English-speaking developer has no reference resource to avoid creating tooling that will fail when exchanging personal data.

1. Serious data exchange flaws for (at least) common person names from Latin America and Asia

It seems that the requirement to be "compatible with off-the-shelf existing systems" (in practice, software from Salesforce, a US-based company focused on marketing) actually introduces serious flaws into data exchange. I will repeat what has already been said in #7 (comment).

A trafficked person with a name common in Latin America, if shared using the current standard proposed to UN IOM, will end up with an incomplete name. For names originally not written in Latin script, not only is the name order likely to be swapped, but no organization can share the original script, and more than one transliteration strategy exists. This makes data exchange about people from Asia much more likely to go wrong, because the IOM reference gets it wrong.

Let me repeat: records of trafficked Latin Americans and Asians, when exchanged using the current HTCDS 1.0 terminology, are known to be especially likely to get lost. This reason alone is sufficient cause for concern, even if HTCDS persists as it is in English.

1.2. Why I am already raising this issue in English

I'm doing this in public not to shame the current work on HTCDS (this is a very important project, terminology like this is a generic need, and by creating a Portuguese version we also expose ourselves to criticism), but because, if the lawyers keep this conflicting license, the work needed to "translate" it (in particular the generic Salesforce fields) actually requires a complete rework.

The closest existing work related to person data is https://github.com/SEMICeu/Core-Person-Vocabulary, and most of the corner cases the SEMICeu/Core-Person-Vocabulary covers are very common in humanitarian data. Even if HTCDS 1.0 renamed its fields (which would already make it a 2.0), it would at least need to consider people who have more than one official name. (Yes, one person can in fact have more than one name; then add the people whose birth names are in non-Latin scripts and whose names are transliterated, where more than one way to transliterate names exists.)

Even if we could comply with such a confusing license, the people from HXL-CPLP who could help us would need to attempt implementations outside the HTCDS repository, in particular for handling names written in non-Latin scripts.
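One way to picture the name problem described above is a record that keeps the original script alongside its transliterations. This is a hedged sketch only, loosely inspired by the issues raised here; the field names are invented and come from neither HTCDS nor Core-Person-Vocabulary.

```python
from dataclasses import dataclass, field

@dataclass
class PersonName:
    """Hypothetical name record preserving script and transliterations."""
    full_name: str        # the name exactly as written in its original script
    script: str           # e.g. an ISO 15924 code such as "Hani" or "Latn"
    transliterations: dict = field(default_factory=dict)  # scheme -> rendering

# A name with two competing transliteration schemes: dropping either the
# original script or the scheme label makes round-tripping impossible.
name = PersonName(
    full_name="張偉",
    script="Hani",
    transliterations={"pinyin": "Zhang Wei", "wade-giles": "Chang Wei"},
)
```

The point of the sketch is that a single flat "first name / last name" pair, as in the Salesforce-derived fields criticized above, cannot carry any of this information.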

2. Our approach to this license conflict

We at HXL-CPLP will release the concepts (extra descriptions, translations, examples, etc.) and the templates used to build glossaries and data schemas under public domain. In fact, this has been the case since HTCDS v0.2.

Cease-and-desist letters (or requests for help from other implementers receiving DMCA requests) are welcome at [email protected].

The HTCDS standard can keep whatever license it has. But we will not wait while people bury their heads in the sand, and we will not stop just because a thing known to get our names wrong was easier to build by requiring the standard to be compatible with marketing software. I will explain why we will keep our terminology in the public domain while still making it reusable for humanitarian use.

2.1. Why we, the initial team from HXL-CPLP, refuse to assign copyright over all concepts to any organization

The way lexicography is done is close to listing words in a dictionary. We have even developed software to convert our work not only to translator formats like XLIFF, but also to TBX. This means our average spreadsheet acts as a frontend, like Europe's IATE (https://iate.europa.eu/).
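To illustrate the kind of spreadsheet-to-TBX conversion mentioned above, here is a heavily simplified sketch using only the Python standard library. The element names follow TBX only loosely (a real TBX file needs a full header and dialect declarations), and the function name and sample terms are invented for this example.

```python
import xml.etree.ElementTree as ET

def concept_to_tbx(concept_id, translations):
    """Serialize one concept's translations as a TBX-like termEntry fragment.

    Simplified sketch: a conformant TBX document wraps entries in a full
    martif/tbx envelope with header metadata, omitted here.
    """
    entry = ET.Element("termEntry", id=concept_id)
    for lang, term in translations.items():
        lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
        tig = ET.SubElement(lang_set, "tig")   # term information group
        ET.SubElement(tig, "term").text = term
    return ET.tostring(entry, encoding="unicode")

fragment = concept_to_tbx("c1", {"en": "given name", "pt": "nome próprio"})
```

Keeping one concept per entry, rather than one word per row, is what makes the same source data exportable both as a glossary and as a data-schema template.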

So, while perhaps the name and the description of HTCDS could be copyrightable, for concepts like how to break down a person's name or birth date, any implicit claim under the current license that tries to deny reuse for other purposes is clearly unreasonable. It is like trying to copyright a word in English. That is absurd.

2.1.1 Some quick context (for technical people, not the lawyers, to get an idea of the building blocks)

What for HTCDS is a standard (a composition of words and definitions) is, in our case, work broken into concepts (as in concept-based translation, rather than term-based translation). Each concept should carry added explanations to help translators disambiguate terms; the final result can then be exported either as a glossary (like a PDF) or as templated files from which terms can be extracted to generate something like a data schema, or even scripts to convert data from one format to another. From the more "end-user collaborator" perspective, what started with HTCDS 0.2 was this:

Every script is public domain and optimized to go from new terms to actionable scripts ready to be published at the speed needed for emergency response. This means, for example, that just as new scripts/data schemas/documentation can be templated for new versions of something related to HTCDS, previous terminologies can be reused in new implementations related to humanitarian response.

There was one problem we discovered empirically while doing technical translations around HTCDS 0.2 (it is actually the reason it is harder to scale up translations; but since there is no one to do this work, it is unlikely to ever happen beyond English):

  1. This type of "translation" is actually a kind of multilingual controlled vocabulary, which is orders of magnitude harder to bootstrap if the initial concepts are not well planned (and, by the way, most fields based on Salesforce are beyond repair).
  2. Even when well planned, some terms are so new that they must be introduced into the target languages as "provisional terms" for years, while explanations are actively published on sites like Wiktionary and publication in traditional dictionaries is encouraged. And the "target language" can even be English (as with deciding how to break down a person's name).

Our point here is: serious terminological translation would also need (as we did with software) to allow glossary export for key terms, and it is inherently reusable. HTCDS is actually just one of the items in our current Request for Feedback (https://docs.google.com/spreadsheets/d/1ysjiu_noghR1i9u0FWa3jSpk8kBq9Ezp95mqPLD9obE/edit#gid=846730778). We are, for example, also aiming at a lexicography of common terms used in COVID-19 data exchange (focused on public data), an area with other complaints as well (see https://www.sciencediplomacy.org/article/2020/we-can-do-better-lessons-learned-data-sharing-in-covid-19-pandemic-can-inform-future).

2.2 Copyright over generic concepts works against the promises that must be made to volunteer translators/terminologists

The average person willing to help is not only acting in good faith, but is also likely either to be a victim of human trafficking or to know how bad the real situation is. Especially after the crisis in Afghanistan, with interpreters left behind, I am hearing such a high level of distrust that the bare minimum we can do for them is make sure their work remains usable even if the current copyright holder loses interest (or is forced to issue DMCA takedowns against it).

Also, as I said earlier (about known problems with nomenclature that fails for common Latin American names), translators are scared to the point that we have to rewrite the thing because it is beyond repair. Again: I am not complaining about the people currently editing the HTCDS, because I know this is much more complicated. The point is that this effort will need much more outside help, and the licensing is not clear enough about whether, within a few years, this will become an "orphan work".

While we were already concerned with translation from HTCDS 0.2 onward, we would like to point out that even if IOM follows the path of that fantastic reference, the UN M49 standard, and provides accredited translations into the UN working languages, Portuguese would not be one of them. That is one reason I am trying to make some of the workflow easier here.

This means that even if the current copyright holder of HTCDS does its best, we realistically will not have accredited translations into languages like Hindi or Portuguese. UN IOM announcing a standard while potentially denying translations (even volunteer-based ones in the public interest) into, for example, Portuguese, as ISO organizations do, is a threat to implementation in countries like Angola, Brazil and Portugal. Remember: we are already stressed by ISO's way of "protecting" its standards, and it is realistic to say out loud that the current copyright holders do not have the infrastructure to maintain a Portuguese version.
I am not complaining; I am just saying that we would need a higher threshold here. But since this is in everyone's interest, there is no need to make it harder. Remember: the average people willing to help with these subjects really do care.

2.3 Copyright over generic concepts in multilingual controlled vocabularies works against other UN agencies, the Red Cross, Amnesty, etc.

Except for concepts too specific to human trafficking, many concepts in one project (such as forms to exchange data) are usable across organizations. Salesforce fields are not a good reference. Well-reviewed multilingual controlled vocabularies, developed with implementers, for generic terms like a person's name are a serious need for humanitarian organizations. The IOM lawyers clearly do not know how badly other humanitarian organizations need this, in particular for private data. Or, again, they are inspired by the failed ISO approach.

Organizations like Amnesty have an interest in the existence of standards for sharing police case data, as a way to enable strategies such as identifying police torture.

Implementers (such as those delivering aid) resort to biometrics because something as simple as a person's name is not standardized, which puts a lot of pressure on the few developers handling private data storage. Software written in English records wrong data even when it is entered by information managers reading people's official identification cards, leaving no alternative but to collect biometrics, and biometric data is known to come under pressure to be shared, even with governments that may later target the person.

I could cite many more examples where the lack of authoritative terminology is a problem beyond human trafficking data exchange. But, again, the point from a lexicographer's view is to maximize the use and reuse of vocabulary; and even though the best approach would be to donate translations to the organizations that can actually endorse them, I am not even sure that work donated to HTCDS would be allowed to be reused by other humanitarian organizations.

End comments

Since, even after HTCDS 1.0, there is still no license (and no clarification on what to do, beyond no response), we will keep the drafts in the public domain. That keeps them reusable in the middle of such confusion.

The people behind this data are not numbers. If the English-speaking community is so accustomed to this that it will not at least make things easier for other languages, then, seriously, just do the paperwork. Age or experience does not make this type of thinking a good role model for people from other regions.

Screening Info

I suggest creating fields that indicate how victims were screened in, or that give a general category for the referring organization (healthcare, law enforcement, educator, etc.). This could help organizations see how victims are being identified, or where more effort is needed in screening.

vic_ Vulnerabilities

It would be good to clarify the concept of this field. Are we talking about things that people may be vulnerable to, or about risk factors that may make them more vulnerable to trafficking (or something else)? For the latter, we have some new screening forms that may serve to augment this field.

Concerns about authoritative versions in Arabic (macrolanguage), Chinese (macrolanguage), French, German, Spanish and Russian derived from the English version of HTCDS 1.0


TL;DR:

  1. The end of this document has some strong suggestions. In short:
    1. Have some neutral person/group already inside the UN to oversee interactions between the HTCDS team and translators (this does not need to be open).
    2. The file with the translation results, plus additional metadata, should be exported. It should carry information that allows immediate reuse of HTCDS terminology in glossaries, along with extra information explaining how it was created. (This end result needs to be open.)
  2. IF HTCDS had both terms and definitions exported into the UN working languages, that would be a very, very big thing.
    1. Even if the result is provisional, it is still a big thing. Most software-related content tends to be produced in English or, at best, French. The software experts at the UN do not talk with UN translators.
    2. Another big thing is an open license (but I will not repeat that here), because most projects started at the UN that are not a core feature of the sponsor tend never to be updated again and become "orphan works" within 5, at most 10, years.
  3. The proposed suggestions (an internal person/group to oversee the process, plus the final exported file) are both somewhat based on:
    1. The fact that the current copyright holder, UN IOM, because it must have the immunity of an IGO, can be complicated both for external collaborators (who can be ignored) and for internal employees/contractors (who could be punished). The proposed suggestions, in our opinion, currently seem to be one way to keep an acceptable open standard while still allowing UN IOM the final decision, protecting its collaborators' independence, especially where there is no strong technical motivation (like linguistic viability).
    2. Naming conventions about human trafficking are much less likely to be disputed by Member States (unlike names of territories, as in UN M49, endorsed by UN Member States), while still being a linguistic issue (likely a mix of expert opinion and language regulators). The extra metadata could be used to receive feedback from these external organizations (including, when necessary, introducing new terms into languages and ensuring consistency).
    3. The need for multilingual controlled vocabularies, usable both for technology (like data forms and spreadsheets) and for glossaries, is a constant, urgent need, including for the humanitarian sector and other UN agencies. HTCDS is just one example. But we know these organizations often work with a wider variety of languages, which means the final shared file must already allow substantial external help, assuming the original publishers will be overloaded.
    4. The de facto recommendations start at section 4.1. If a terminologist is working to merge the translations, only 4.1 and 4.2.1.3 (about how to label languages) are actually worth reading from here.

1. The big picture

We at HXL-CPLP, aware that the first translations of HTCDS 1.0 are planned for release by October (which would be a bit too short a time), are concerned because some of the issues involved are known to be hard. This post is not a criticism of the current copyright holder, UN IOM, but of the potentially existing workflows for how UN translation is documented to be done, and of potentially too-strict delivery deadlines producing less-than-ideal final results.

Why we care. One reason for us at HXL-CPLP to want IGOs like UN agencies to be able to publish standards with authoritative translations (even if only into the UN working languages, which exclude, for example, Portuguese and Hindi) is that the alternatives are worse: they do not tolerate translations at all. For example, ISO actively issues DMCA takedowns against any serious translation initiative, and not even the pandemic brought an exemption; for the COVID-19 response, their "freely available in read-only format" documents were English/French only. I could cite other bad examples, but even vocabularies/taxonomies not created inside UN agencies, which could be started elsewhere, are quite complicated standards when it comes to allowing Portuguese versions.

Trivia: the world was prepared to exchange data on how to create vaccines (case study: GISAID, https://www.gisaid.org/), but had no conventions helping information managers understand how to deploy them efficiently. A lot of vaccines are wasted, including in rich countries, since they are not easy to manage. Emergency translation optimized for use by machines could have a bigger impact on implementers, since it also allows reuse of software, or at least speeds up implementing ideas that work in other world regions.

Why it is important to optimize for speed (when necessary). Maybe HTCDS is not critical on a scale of hours or days, but in humanitarian areas the endorsement process for taxonomies/vocabularies needs to be optimized to be fast when necessary. One example is minimal conventions for fields used to share public data related to COVID; there are many others. The implication is to streamline the creation (or at least the updating with new terms) of vocabularies endorsed by IGOs, without having to wait for lawyers.

2. References on existing translation processes with accredited, equally authoritative translations

Both the UN and the European Union are known to publish translations with equal authoritative status. What follows are just quick comments, mostly summarizing challenges that even the best translators encounter (so the problems are not unique to community translations).

This can help build empathy, so that future translators will not refuse this type of work. I am quite sure attempts to translate standards have been made in the past, but HTCDS would be the first attempt inside the UN to translate a standard meant to be used in software (not in law or prose documents).

2.1 1980, Evaluation of the Translation Process in the United Nations System (50 pages)

"Because they feel that they are viewed (when they are noticed) as non-creative appendages performing a costly but mechanical report processing function, they can come to view their work as a high-pressure but rather thankless and tedious task."

If HTCDS is translated through the existing UN translation workflow, the core part that matters is actually very complex (setting aside the additional guidance, like the README.md, Governance and Contributions.md, Guidance.md, etc.). The reason I quote this document is that the better we break down what can be translated (and prepare the document well), the less likely the first translators are to find it complicated.

2.2. 2008, Translation at the United Nations as Specialized Translation (16 pages)

2.2.1 The UN already has a very strict workflow for translating documents (HTCDS is more specialized, so break the concepts down well)

This document is closer to public knowledge: the internal translation process inside the UN. It documents the following steps:

  • (1) Documentation programming and monitoring
  • (2) Documents control
  • (3) Editorial control
  • (4) Reference and Terminology
  • (5) Translation
  • (6) Text processing and typographic style
  • (7) Official Records
  • (8) Copy preparation and proof-reading
  • (9) Publishing

This process is optimized for the type of document typically translated at the UN (mostly prose, not concepts or terminology plus descriptions). In other words, for the important part of HTCDS, not even the UN has a suitable translation workflow. I am not saying that something like HTCDS would need all these steps, but note that empathy with translators is a necessity.

One coping strategy is not to treat HTCDS as ordinary prose. If the relevant part of HTCDS (the fields and the definition of what each concept means) can be broken down, this enables both automation pipelines and faster retranslation as new concepts are needed; but very likely early attempts will require a lot of copy-pasting of translators' results. The point of citing the translation step as one part of a larger workflow is to help mitigate translator burnout: if even UN translators complain about something, it is worth taking note for future work.

2.2.2 Word equivalence between languages is a myth

One quote from this document:

“Difficulties due to the multi-racial and multilingual characteristics of UN work are regularly encountered by translators. The occasions when one is unable to find equivalents for a word or concept in another language are frequent. For instance, the English words ‘liability’ and ‘responsibility’ have to be translated by the single French word ‘responsabilité’.(...)”

This document also admits that the idea of full equivalence between terms is a myth. In fact, even widely spoken languages lack some terms, or the existing ones are vague. So, in the context of HTCDS (or other potential works), the best we can do is be prepared for the fact that one or more languages may need provisional terms (often a multi-word phrase standing in for a single word) and, where necessary, take the long-term view: by working with language regulators and providing machine-readable glossaries (usable by software, search engines, and so on) to explain what a new term means, it is possible to introduce a new, more specific term into a language.

This short explanation of "preferred" vs "provisional" vs "proposed" terms can be handled as just another row in a table. The request at the end of this topic, to add definitions for the HTCDS fields, may also seem strange, but if the copyright holder allows it for those working with language regulators, it enables this type of long-term planning.
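The "just another row in a table" idea can be sketched as plain data. This is a hedged illustration only; the concept IDs, terms, and definition below are invented, and the status labels simply mirror the "preferred" / "provisional" / "proposed" distinction discussed above.

```python
# Hedged sketch: term status recorded as plain data alongside each
# translation. Concept IDs, terms and the definition are invented.
glossary = {
    "concept_birth_date": {
        "definition": "Date on which a person was born.",
        "terms": {
            "en": {"term": "date of birth", "status": "preferred"},
            "pt": {"term": "data de nascimento", "status": "preferred"},
            "xx": {"term": "(new coinage)", "status": "provisional"},
        },
    },
}

def terms_with_status(concept_id, status):
    """Return language -> term for entries matching the given status."""
    entries = glossary[concept_id]["terms"]
    return {lang: t["term"] for lang, t in entries.items()
            if t["status"] == status}
```

A downstream tool can then, for example, export only "preferred" terms into a published glossary while flagging "provisional" ones for review by language regulators.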

2.3. 2017, Interaction of law and language in the EU: Challenges of translating in multilingual environment (17 pages)

Quotes:

"English language used in the EU context (...) It is a novel version of the language, often called "EU English" that is different from the English spoken in the UK or Ireland"

"EU legal texts in English very often contain imprecise terms, which is not something one would associate with traditional UK legal language"

The reason for citing this article is that, even if the HTCDS standard was (I suppose) written by native speakers, some fields based on the Salesforce software are too vague. They might work in, for example, a marketing tool, where more formal names like "given name" could seem too formal. Also, considering the steps cited in the UN translation workflow above, these fields would need a terminological review.

3. Practical examples

Obviously the HTCDS itself addresses a common need already present in the field of human trafficking. Here I will cite parts of it that are relevant both for HTCDS and for other humanitarian / human rights usage.

3.1 Generic multilingual vocabulary about person data

FACT: There is a lack of places where a software developer could get authoritative translations good enough to be used to collect a person's data, with some minimal assurance that information managers (note: not even end users, but the people who manage data) could enter the same data consistently the majority of the time.

Trust me. I really looked everywhere.

What often happens is that someone drafts a piece of software, and only at later stages are translations made, again and again. This is very prone to errors, in particular because the translations are often based on English terms that are vague even in English. In defense of the software developers who use these vague terms: they are often vague because the requirements are vague. Either software developers (who may rely on existing references) or translators can introduce errors.

In an ideal scenario, since consistency of data-collection forms is essential (for example, to reduce dependence on biometrics, or, if the intention is not to collect a full name, to ensure no translation ends up requiring one), one approach that makes sense is curated translations, here in the form of a multilingual controlled vocabulary: a developer (whether an English, French, or Spanish speaker, etc.) who selects one term is assured that it has reliable translations across a wide range of languages. Note also that humanitarian operations very often take place in regions whose languages have few global speakers or very specific dialects, which makes getting this right a need for everyone. Such a vocabulary even simplifies assessments of the desired level of detail based on privacy requirements, and potentially actions based on data processing.
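A minimal sketch of how such a curated controlled vocabulary could work (all concept IDs, labels, and translations below are illustrative placeholders, not from any official HTCDS glossary): the developer selects a stable concept ID, and the vocabulary supplies the vetted label for each language.

```python
# Sketch of a curated multilingual controlled vocabulary for person-data
# fields. Concept IDs and translations are illustrative placeholders.
VOCABULARY = {
    "person.given_name": {
        "eng-Latn": "Given name",
        "fra-Latn": "Prénom",
        "spa-Latn": "Nombre de pila",
    },
    "person.family_name": {
        "eng-Latn": "Family name",
        "fra-Latn": "Nom de famille",
        "spa-Latn": "Apellido",
    },
}

def label_for(concept_id: str, language: str) -> str:
    """Return the curated label for a concept, falling back to English."""
    terms = VOCABULARY[concept_id]
    return terms.get(language, terms["eng-Latn"])
```

The point of the sketch is that translation happens once, at the vocabulary level, instead of again and again per application.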

3.1.1 To UN Translators trying to create versions derived from the person's vocabulary of HTCDS

The text that would be written here was published here #22 (comment)

4. Recommendations from HXL-CPLP to be considered for the first more-than-translations of HTCDS into the UN working languages

Note 1: we at HXL-CPLP are concerned only with translations of the core concepts (the field names, the definition of each field name without including the field format, and potentially the standardized user-readable labels of fields) that an end user adding/editing data could see. This is the part we are most interested in helping with, since it is what later allows translations and the generation of glossaries, etc. We have no suggestions at all about translations of any additional content (like the README.md, Governance and Contributions.md, Guidance.md, etc.).

4.1 Suggestions on management part (no need to be exposed)

TL;DR: The HTCDS does not fit the traditional model used in UN document processing, but someone could act as quality control or internal ombudsman for the UN translators on this project.

  1. Have one person/group who can oversee/intermediate communication, from the first contact, between the HTCDS team compiling the translations and the UN translators (including any outsourced ones).
    1. The person/group in this role is well placed to enable other more-than-translations inside the UN beyond HTCDS. Even as an observer, it is worth the effort to have it.
    2. It is desirable that such a person/group has no potential conflict of interest with HTCDS (which does not mean they must be outside the UN, or that full bureaucracy is needed).
      1. In case HTCDS tries to rush the schedule, or pressures translators into something, they have the power to intermediate while protecting the more-than-translators.
      2. When two or more of these more-than-translators are in conflict, this person/group can resolve it.
    3. This person/group does not necessarily need to be the one who "compiles" the result.
  2. In addition to the more-than-translations, even if the source document is in English, it is relevant that the translation process includes, as an alternative, a reviewed version in English.
    1. Reasoning: the United Nations workflow already considers review of source documents, so it is reasonable that the person/group agreed to oversee/intermediate also has the power to decide, with the translators (or a dedicated terminological review), changes to terms already published in HTCDS 1.0.
    2. If the contents of this extra column become equal to the reviewed HTCDS, the publishers of HTCDS do not need to expose it to the public. One valid reason for HTCDS not to merge it is that proposed changes to the core terms (like the ones used to label the fields) could make new versions of HTCDS backward incompatible.

4.2 Suggestions on the exported file that can be reused (this is what is exposed)

Note: if the person/group who will oversee/intermediate communication already has experience with terminology:

  1. the entire section 4.2 about exported files can be ignored, but the part about how to label the languages is still relevant (we plan to scale up translations while trying to make different language regulators agree on terms, so alpha-2 language codes complicate it). Use whatever format with more metadata that works for your internal needs; if necessary we at HXL-CPLP will create software to export to ours. Actually, even if you export the entire thing in DOCX (as long as it has more metadata) we would be willing to do the copy-pasting.

  2. If you want a reference for similar work, the OCHA Taxonomy As A Service publishes the Countries & Territories Taxonomy MVP at https://docs.google.com/spreadsheets/d/1NjSI2LaS3SqbgYc0HdD8oIb7lofGtiHgoKKATCpwVdY/edit#gid=1088874596. But the file from HTCDS needs to be closer to the Europe IATE model (https://iate.europa.eu/fields-explained), because the terms would be created/edited both by you and later by volunteers in other languages, and we would need a lot of metadata with so many files exchanged.

  3. One file (as of 2021-10-01, not yet updated to pair term+concept), based on the HTCDS 0.2 we use at HXL-CPLP, is here: https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422. This file does not yet place term+concept together (as we suggest here), but we even have software to convert these HXL files to TBX, XLIFF, etc., at https://hdp.etica.ai/hxltm/. Whatever the final format created by HTCDS with translations (we recommend whichever is easiest for you to manage, even if manual), we will use either software or human copy-pasting to put it on spreadsheets like these.
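As a rough illustration of the kind of conversion mentioned above, here is a sketch that exports concept/term rows into a minimal TBX-like XML skeleton. The element names follow the common TBX termEntry/langSet/tig pattern, but the exact dialect, namespaces, and required metadata are assumptions that should be validated against the TBX specification (ISO 30042) before real use.

```python
import xml.etree.ElementTree as ET

# Sketch: export concept/term rows into a minimal TBX-like skeleton.
# Element names follow the TBX termEntry/langSet/tig pattern; the exact
# dialect and required metadata should be checked against the TBX spec.
def to_tbx(concepts: dict) -> str:
    body = ET.Element("body")
    for concept_id, languages in concepts.items():
        entry = ET.SubElement(body, "termEntry", {"id": concept_id})
        for lang, terms in languages.items():
            lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
            for term in terms:
                tig = ET.SubElement(lang_set, "tig")
                ET.SubElement(tig, "term").text = term
    return ET.tostring(body, encoding="unicode")

# Illustrative concept with one English and one French term.
sample = {"c1": {"en": ["Given name"], "fr": ["Prénom"]}}
tbx_fragment = to_tbx(sample)
```

A real exporter would wrap this body in the full TBX envelope and carry the reliability and status metadata discussed below.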

4.2.1. Opinionated suggestions on what to put in the file shared with HTCDS versions in other languages

4.2.1.1. Crash course on what Concept, Language, and Term are

[Image: TBX termEntry structure, from the Terminator documentation]

Source of image: https://terminator.readthedocs.io/en/latest/_images/TBX_termEntry_structure.png

  1. A "Concept ID" is necessary to group concepts inside the multilingual glossary.
  2. The bare minimum is the "term" and the language it relates to.
        1. Some languages (e.g. not yet translated) can be empty.
        2. Each language can have several terms; additional metadata explains what each one is, for example to differentiate them from each other.
        3. If using a tabular format, when there are several terms (like preferred, admitted, etc.) it is better to place the best one first. This allows software that does not understand the table to simply take the first head term for each concept. This is the UTX approach.
        4. The ideal "Term" can be extracted from this table to generate the spreadsheet headings for each language.
  3. The concept definition in natural language is the next most relevant piece of information.
        1. Each language can have only ONE definition per concept (but can have several terms).
        2. Definitions of the same concept in different natural languages do not need to be strictly literal translations, but they cannot differ to the point of representing different concepts.
        3. Each language that already has a well-crafted definition is immediately available to be exported as a glossary (e.g. to generate documentation, an e-book, etc.).
        4. If, for immediate use, a language has a provisional term (like a long sentence) because better terminology is lacking, the addition of another term marked "proposed" should be done on the same concept (not by creating another concept). The export that generates the glossary to help spread new terms can be done with software.
        5. The best way to create new translations is for the human to rely on the definitions instead of only on a term (this solves ambiguity).
  4. Each term can have some way to express "how good it is"; one approach:
        1. TBX+IATE is a great reference, with a numeric reliabilityCode field and an administrativeStatus (Preferred, Admitted, Deprecated, Obsolete, Proposed).
  5. Whatever the format (spreadsheet, raw XML editing), it makes sense to create custom fields for translator feedback.
        1. If it is something specific about one suggested term, it is term-level information.
        2. If it is about several terms in one language, or comments about problems with the viability of the concept (including problems related to converting data from one format to another), then the field is language-level.
        3. If the information is concept-level AND language-neutral (for example, code equivalences in other glossaries, links to external references, etc.), then the custom field is concept-level.
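The "best term first" convention for tabular formats can be sketched like this: each concept holds an ordered list of terms per language, and the first (head) term becomes the spreadsheet heading for that language. All concept IDs and terms below are illustrative.

```python
# Sketch of the "head term first" convention: per concept and language,
# terms are ordered best-first, so simple tools can take terms[0] as the
# column heading. All concept IDs and terms are illustrative.
glossary = {
    "c_person_name": {
        "eng-Latn": ["Given name", "First name"],      # preferred, admitted
        "por-Latn": ["Nome próprio", "Primeiro nome"],
    },
}

def headings(glossary: dict, language: str) -> list:
    """Spreadsheet headings for one language: the head term per concept."""
    return [terms[language][0] for terms in glossary.values()
            if terms.get(language)]
```

A tool that does not understand the extra metadata still produces usable headings just by reading the first term.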

4.2.1.2. Note on representation of reliability and administrative statuses

  1. One industry standard for sharing terminology, TBX, and the biggest reference for public collaborative terminology, IATE (see a deeper explanation at https://iate.europa.eu/fields-explained), have a numerical way to express how reliable a term is for a definition. We strongly suggest that both the versions endorsed by HTCDS and the community ones try to follow it faithfully and document it well, so that future collaborators can be more effective.
        1. reliabilityCode is term-level. It means how faithful the available endorsement theoretically makes the term. The real quality could be less, or more, but the way to attribute the code follows an objective procedure.
        2. IATE uses a scale from 1 to 10, where 6/10 (minimum reliability, 2 of 4 stars) is the most common and already tends to be acceptable. The next is 9/10 (reliable, 3 of 4 stars).
        3. 6/10 is the default value a native speaker can attribute to their own suggestions. This means that both the creators of HTCDS themselves and UN translators, without giving sufficient context on why each term best represents the idea, would get a 6/10.
        4. "Sufficient context" means that each term, already in the published file, has extra metadata proving the term is representative. Often this means a link to external sources already referenced on the subject AND with authority on the language.
  2. For concepts related at least to human trafficking, assuming that the International Organization for Migration can be a world-level reference on the subject (AND for any natural language, as a UN agency), IOM has the power to go straight to 9/10.
        1. For good or for bad, this actually allows IOM to be the primary source without need for justification. It is that simple: even a web page with a simple FAQ or glossary makes it a citable source, and unless a term carries a reference of lower reliability, the default would now be assumed 9/10.
        2. This also applies to translations. Europe's IATE often has English/French at 9/10 (because of monolingual glossaries) while user-contributed translations, like for Spanish, go at most to 6/10.
        3. However, concepts like a person's name are too generic: unless there is a relevant external reference with a term that matches the concept, it does not make sense to annotate higher reliability.
  3. Differences in how natural languages are de facto promoted make a huge difference in the viability of endorsing terminology without the need for expert organizations like IOM.
       1. One extreme example, least likely to ever reach agreement using this strategy, is actually... English. Other languages often have multi-country organizations able even to influence governmental policies to control language evolution, or are restricted to small regions (so agreement requires fewer organizations).
       2. One potential implication is that, even with HTCDS published in the 6 UN working languages, external translations with more detailed explanations of how new terms were formed could eventually allow endorsement even for generic terms.
            1. Note the relevance of "Proposed" (in addition to the term to be used immediately); this could even mean changes in educational material for learners of the language.
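The attribution rule described above can be sketched as a small function. The scale and thresholds (6/10 as the native-speaker default, 9/10 when backed by an authoritative source, lower ceiling for overly generic concepts) come from the points above; the rule itself is an illustrative simplification, not IATE's actual procedure.

```python
# Sketch of attributing an IATE-style reliabilityCode to a term.
# 6/10 = minimum reliability (native speaker's own suggestion);
# 9/10 = reliable (backed by an authoritative source such as IOM).
# This is a simplification of the procedure described in the text.
def reliability_code(has_authoritative_source: bool,
                     is_generic_concept: bool = False) -> int:
    if has_authoritative_source and not is_generic_concept:
        return 9   # "Reliable", 3 of 4 stars
    return 6       # "Minimum reliability", 2 of 4 stars
```

For example, a term sourced from an IOM glossary would get 9, while a generic concept such as a person's name stays at 6 even with a source attached.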

4.2.1.3. A note on the codes used to represent the languages

  1. Together with terms, the way to express the language is important not only for interoperability, but also to reduce problems when collecting terminology on which several language regulators actually agree.
        1. This is why we suggest, instead of "IETF BCP 47 language" style codes, using ISO 639-3 + ISO 15924, like ar: "ara-Arab", en: "eng-Latn", fr: "fra-Latn", ru: "rus-Cyrl", zh: "zho-Hans", es: "spa-Latn".
        2. The total replacement of ISO 639-1 alpha-2 with ISO 639-3 both helps with languages that never got an alpha-2 code (like the code of the biggest minority of Europe) and, when using dialects, removes the need for country codes.
        3. The use of ISO 15924 (writing system), besides reducing bias about what the default would be, also helps non-native speakers tell alphabets apart just by looking at the label.
  2. We strongly suggest that if translators are already able to provide transliterations, at least for terms (no need to also transliterate definitions), they should not be blocked by limitations of the distributed file.
        1. Example: Hanyu Pinyin ("zho-Latn-pinyin"?) is quite a popular example, especially for learners.
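The suggested relabeling can be sketched as a lookup table. The six mappings below are exactly the ones listed above for the UN working languages; a real tool would of course use the full ISO 639-3 and ISO 15924 code tables rather than a hand-written dictionary.

```python
# Sketch: replace two-letter codes with ISO 639-3 + ISO 15924 labels.
# The six mappings are the ones suggested for the UN working languages;
# a real tool would use the full ISO code tables.
ALPHA2_TO_LONG = {
    "ar": "ara-Arab", "en": "eng-Latn", "fr": "fra-Latn",
    "ru": "rus-Cyrl", "zh": "zho-Hans", "es": "spa-Latn",
}

def long_language_tag(alpha2: str) -> str:
    """Return the ISO 639-3 + ISO 15924 label for a two-letter code."""
    return ALPHA2_TO_LONG[alpha2.lower()]
```

Labels like "rus-Cyrl" vs "rus-Latn" make the writing system explicit, which is the point made in item 1.3 above.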

That was it!

ReadMe

The readme file refers to the platform as the "Collaborative Trafficking Data Collaborative"; the correct name is the Counter-Trafficking Data Collaborative (CTDC).

About the Human Trafficking Case Data Standard (HTCDS) with HXL (The Humanitarian Exchange Language) hashtags

Hi.

I'm Emerson Rocha, here as one of the members of @HXL-CPLP, a community user group for HXL with a special focus on the CPLP countries. (I'm also a member of some local groups of Amnesty International.)

While looking for APIs and schemas for the project https://github.com/HXL-CPLP/Auxilium-Humanitarium-API I found this standard, so my interest here is to consider potential promotion, both for HXL in general and at least in our community!

Existing HXL hashtags (and the existing need for conventions in areas like the HTCDS)

In fact, well-documented HXL hashtags already exist and are used in production on the HDX site, https://data.humdata.org/ (so the HXL standard is already used in the humanitarian area). But while the HXL standard is flexible, it lacks documentation related to more sensitive data (or at least disaggregated data, like per individual). A TL;DR is that most HXLated datasets are (not surprisingly) data that is already public.

Also, in general, discussion of sensitive data and its tools is a taboo. Even in English. And people are dying because of it.

So, in this very first hello from me: while this standard does not (as expected) document broader aspects of sensitive data, it is definitely worth the effort to also offer an HXLated version of the toolkit! The result could then be used by CPLP.

I'm interested in helping with this (and, if needed, asking for a review from other users of the HXL international community)!

New HXL hashtags

In very simplified terms, an HXLated addition to the current kit would be a spreadsheet with base columns, using some base #hashtag to represent a human plus +attributes (which could be similar to what exists using English names).

Then, at least one spreadsheet that explains what each one is.

Then, examples with fake data, so tools can be used.

Then, tools, example dashboards, etc. At this point a lot of HXL tools already exist, some, like https://HXLDash.com, to help with data visualization.
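The spreadsheet layout described above can be sketched as a CSV with a human-readable header row, an HXL hashtag row underneath it, and then fake data. The #x_person hashtag and its +attributes are hypothetical extension tags, not part of the HXL core schema; the data row is invented.

```python
import csv
import io

# Sketch of an HXLated spreadsheet: human-readable headers, then an HXL
# hashtag row, then fake data. #x_person and its +attributes are
# hypothetical extension hashtags, not HXL core schema.
headers = ["Given name", "Family name", "Nationality"]
hxl_row = ["#x_person+given_name", "#x_person+family_name",
           "#x_person+nationality"]
fake_rows = [["Maria", "Silva", "BRA"]]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerow(hxl_row)
writer.writerows(fake_rows)
hxlated_csv = buffer.getvalue()
```

HXL-aware tools read the hashtag row to identify columns, while humans keep the familiar labels in the first row.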

My contact email is Rocha(at)ieee.org. If necessary we could talk more by email, Slack, or other channels!


PS 1: the current project, HTCDS, does not have a typical well-known software/database-like license (like those at https://spdx.org/licenses/), which means it would require a lawyer to understand whether it can be used or not. Also, the "Terms of use" (which seems to be based on some website, not a license for a standard) may be perceived as conflicting with the also-mentioned https://open-stand.org/about-us/principles/. These points are very pertinent in the context of getting help with HXL, because HXL is about terms to express meaning in tabular data. If the clause "all Users must not: use HTCDS or the information therein contained for any purpose different to the purpose of HTCDS as defined in Section 1" were really enforced, then, because HXL lacks conventions to express data at the individual level (i.e. a human, not a group of humans), a reference in HXL to the HTCDS could deny other UN agencies, the Red Cross, Amnesty, etc. the use of HXL and of tools aware of these data, because they would use them for things not related to human trafficking. If at least the column names and short English descriptions were released into the public domain (i.e. no attempt to enforce rights over '#x_person+first+name'), this would make things simpler.

PS 2: we at HXL-CPLP, for example, would likely release a version with linked data to express the concepts of the spreadsheet, for example linking what in English is "Gender" to Wikidata Q48277, so this could assist automated processing. For tools to understand how to process (or how to export to) the HTCDS, we would kindly ask for some other type of coding that is patent-free and does not use "Gender", "Nationality", "Title", etc., because the "13. Termination, Denying Access" clause could at any time break software or tools. Like I said, the people who consume data do not have lawyers; if we have to attach some license, they would not use the HTCDS. But if UN Migration already uses some convention, from our side, even if we would need to create conversion tools just to deal with license issues, we will do that.
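The linked-data idea can be sketched as a simple concept-to-Wikidata mapping. Only the Gender → Q48277 link comes from the discussion above; any other field's item would need to be looked up on Wikidata before being added.

```python
# Sketch of linked-data annotation for column concepts: map each field
# to a Wikidata item so automated tools can process it language-neutrally.
# Only Gender -> Q48277 comes from the discussion; other mappings would
# need to be looked up on Wikidata.
WIKIDATA_CONCEPTS = {
    "Gender": "Q48277",
}

def wikidata_iri(field_label: str) -> str:
    """Return the Wikidata entity IRI for a mapped field label."""
    return "http://www.wikidata.org/entity/" + WIKIDATA_CONCEPTS[field_label]
```

With such a mapping, a renamed or translated column label no longer breaks automated processing, since tools can key on the Wikidata item instead of the English word.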
