klingon-assistant-data

Klingon language data files for boQwI' and associated apps.

The notes fields are for typical users of the lexicon. An attempt should be made to keep information there "in-universe". The hidden_notes field is for (typically) "out-of-universe" information such as puns (what Marc Okrand calls "coincidences"), or background stories about how a word or phrase was invented (such as having to backfit a movie edit). For some entries (e.g., {Hov leng:n} or the names of actors or actresses), keeping the notes "in-universe" might not be possible, so this is not a strict requirement.

The entry_name field should exactly match how the definition appears in the original source if possible. This is important as the database is used by software which may compare its entries to other lexicons. In particular, KWOTD (Klingon Word Of The Day) functionality in boQwI' partially depends on matching the entry_name to the word or phrase received from the Hol 'ampaS server. A mismatch may result in failure to retrieve the KWOTD. However, full sentences should have final punctuation for consistency. (If the English translation has final punctuation, the Klingon sentence should use the same punctuation mark, but otherwise it should end in a period, or an exclamation mark if that is more appropriate.)

If a definition appears multiple times in the same source, the broadest definition should be used. For example, {tu':v} appears as "discover, find, observe, notice" in TKD in the K-E side, but also as just "find, observe" in the body text, as well as separately under each of those four words in the E-K side. The K-E definition should be used in this case. Contradictions (e.g., differences between K-E and E-K definitions) and errors should be noted in hidden_notes.

If an entry is defined differently in different sources, the definitions should be reconciled, and the reconciliation noted under hidden_notes or notes as appropriate. Sometimes, it may be appropriate to split a word into multiple entries. For example, {meS:v} has separate entries for "tie a knot" and "encrypt", even though the latter meaning is obviously derived from the former. There is some discretion in whether an entry should be split up or not.

Translations of the definition field can take liberties as necessary to convey the meaning. For example, it may be the case that disambiguating text in brackets in the original English definition is not necessary in another language, or conversely, that disambiguating text needs to be added. Words which are in brackets or quotes may need to be added as search_tags (in the corresponding language as appropriate) if they are likely to be searched. (A quirk of the database system means that words in the definition fields which are enclosed in brackets or quotes are not tokenised as search terms automatically.)
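As a rough illustration of the quirk described above, the following Python sketch collects the words in a definition that sit inside brackets or quotes, so that a translator can consider adding them to search_tags by hand. The function name candidate_search_tags is hypothetical, the choice to treat both round and square brackets as "brackets" is an assumption, and the sample definitions are invented; this is not the tokeniser used by the app.

```python
import re

# Matches text inside (...), [...] or "..." (assumed meaning of "brackets or quotes").
ENCLOSED = re.compile(r'\(([^)]*)\)|\[([^\]]*)\]|"([^"]*)"')


def candidate_search_tags(definition: str) -> list[str]:
    """Return the words found inside brackets or double quotes, in order, without repeats."""
    words: list[str] = []
    for groups in ENCLOSED.findall(definition):
        enclosed_text = "".join(groups)  # only one of the three groups is non-empty
        for word in re.findall(r"\w+", enclosed_text):
            word = word.lower()
            if word not in words:
                words.append(word)
    return words


# Hypothetical definitions, for illustration only.
print(candidate_search_tags("tie (a knot)"))
print(candidate_search_tags('be "hot" [spicy]'))
```

The output is only a list of candidates; whether a word is actually likely to be searched remains a judgement call for the translator.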

The notes fields in languages other than English should be direct translations if possible, but may differ if it is necessary to include information specific to a language. For example, the German entry for {ngech:n:2} notes a common misunderstanding specific to the German language. Every link and source referenced in the English notes should be referenced in the translations (to the degree that it is possible). If the notes field in a non-English language is empty, the English notes will be displayed by the app. If this is undesirable (because the English notes are inapplicable and thus don't need to be translated), the translated notes field can be set to "-" to suppress the display of any notes.
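The display rule for translated notes can be summarised in a few lines of Python. This is a minimal sketch of the behaviour described above, not the app's actual code; the function name displayed_notes is hypothetical.

```python
def displayed_notes(english_notes: str, translated_notes: str) -> str:
    """Pick the notes text to show for a non-English language."""
    if translated_notes == "-":
        # "-" explicitly suppresses the display of any notes.
        return ""
    if translated_notes == "":
        # An empty translation falls back to the English notes.
        return english_notes
    return translated_notes


print(displayed_notes("English note.", ""))
print(displayed_notes("English note.", "-"))
print(displayed_notes("English note.", "Deutsche Anmerkung."))
```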

When adding a new entry, the blank.xml template should be used. There is a script call_google_translate.py which may be used to automatically translate the definition and notes fields. An attempt will be made to use Google Translate to translate any non-English definition or notes field which contains only the content "TRANSLATE". (The non-English definition fields are already filled in with "TRANSLATE" in the template.) After calling the translation script, it may be necessary to do some postprocessing. Instructions are found in the comments to the script file.
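To make the "TRANSLATE" placeholder convention concrete, here is a small Python sketch that lists the fields of one entry still waiting for translation. It assumes each field is a column element with a name attribute; the entry wrapper element, the field names ending in _de, and the sample values are hypothetical, and call_google_translate.py may work quite differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; only the <column name="..."> shape is assumed from the database.
SAMPLE_ENTRY = """
<entry>
  <column name="entry_name">example</column>
  <column name="definition">example definition</column>
  <column name="definition_de">TRANSLATE</column>
  <column name="notes_de"></column>
</entry>
"""


def fields_awaiting_translation(entry_xml: str) -> list[str]:
    """Return the names of columns whose only content is the TRANSLATE placeholder."""
    root = ET.fromstring(entry_xml)
    return [
        column.get("name")
        for column in root.iter("column")
        if (column.text or "").strip() == "TRANSLATE"
    ]


print(fields_awaiting_translation(SAMPLE_ENTRY))
```

Empty fields are deliberately not reported, since only fields containing the placeholder are candidates for automatic translation.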

The database source files are divided into letters, with additional sections for suffixes, extra entries, and example entries. For the purposes of this database, "canon" is defined as having come from (or approved by) Marc Okrand. Canon words and phrases which appear in pedagogical sources (books, audiotapes or CDs, software, and qep'a' or qepHom) belong in the main section (i.e., any file other than extra or examples). The extra section is for miscellaneous entries such as words of uncertain provenance or known not to have come from or been approved by Marc Okrand (e.g., they were invented by the author of a Star Trek novel), transliterations of Terran fauna, flora, or place names not accepted as native Klingon words (such as strawberry or New York), and things which are low-priority when searching or don't belong elsewhere. It is also for canon sentences which appear in the TV shows or movies, or on DVD cases or advertising materials (because, from an "in-universe" point of view, these would not normally be found in a dictionary or phrasebook, unless they happen to be proverbs or such). The examples section is for entries created for pedagogical purposes (such as Beginner's Conversation sentences) or to make search easier (because a search term corresponds to a verb with suffixes or a complex noun). It also contains canon examples, if they are parenthetical (created by Okrand merely for the purpose of explaining an entry which is in the main section). Entries from the extra and examples sections are excluded when using the "random" function within the Android app.

It is a convention to link only once to another entry within each entry. Subsequent references to another entry should be tagged with nolink. If there is already a link to another entry in notes, then the target entry should not typically appear again in see_also.
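A check for the link-once convention might look like the sketch below. It assumes references are written in braces with colon-separated parts, as in {tu':v} or {ngech:n:2}, and that nolink appears as one of the parts after the entry name, possibly in a comma-separated list; that placement of nolink is an assumption, and the function name unflagged_repeats is hypothetical.

```python
import re

# A reference is assumed to be the text between one pair of braces.
REFERENCE = re.compile(r"\{([^{}]+)\}")


def unflagged_repeats(text: str) -> list[str]:
    """Return references that repeat an earlier one without being tagged nolink."""
    seen = set()
    repeats = []
    for reference in REFERENCE.findall(text):
        name, *rest = reference.split(":")
        flags = [flag for part in rest for flag in part.split(",")]
        if "nolink" in flags:
            continue  # tagged repeats are allowed
        key = (name, tuple(flags))
        if key in seen:
            repeats.append(reference)
        else:
            seen.add(key)
    return repeats


print(unflagged_repeats("See {tu':v} and again {tu':v}."))
print(unflagged_repeats("See {tu':v} and again {tu':v:nolink}."))
```

Running such a check over the notes and see_also text of a single entry together would also catch a target that is linked in notes and then repeated in see_also.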

Commits containing manual translations should change only one language (though occasionally it may make sense to translate one or a few entries into multiple languages, such as after a large vocabulary reveal at an event such as the KLI qep'a' or Saarbrücken qepHom'a'). Commits created using the commit_submissions.py script are exempt from this rule, but must be manually reviewed. Pull requests of large translation commits should typically be merged using the "Squash and merge" option.

There is a script review_changes.sh which takes in a language code and an optional commit (which defaults to upstream/main if omitted). This should be used by translators to check translations before a pull request is made.

After changes to the database, it is important to run the write_db.sh script (in the Android repo) to ensure that the database still compiles. Running this script also updates the EXTRA file (which marks where the "extra" section of the database begins). Optionally, one may also run the check_audio_files.pl script (in the scripts directory of the main repo) to see if any syllables have been added which are not available in the TTS.

Conventions for translators

German

  • All adjectivally used verbs should be translated as "[quality] sein", not just the quality as an adjective.

  • Any suggestions and recommendations ("for x, use y") should be written in a neutral form ("for x, y is used"). The autotranslated sentences use the very formal "Sie" which looks too formal for this app. To avoid discussions about using the informal "du", such phrases can be rearranged into general statements like "dieses Wort wird verwendet" ("this word is used").

Chinese (Hong Kong)

  • The Cantonese transliteration of "Klingon" used in Hong Kong is "克林崗", and not "克林貢" which is used in Taiwanese Mandarin (and which is returned by Google Translate for "Traditional Chinese").

Finnish

  • Translations need not be direct translations of TKD entries, but later clarifications and additions must be considered when choosing the Finnish word. The Finnish translation may have fewer or more words than the English one; words that are not necessary to understand the translation may be omitted. Parenthesized notes may similarly be removed if they are not necessary.

  • Adjectives are translated as "olla [adjektiivi]".

  • Fictive things (not including proper names) are translated as "eräs [X:ää] muistuttava [Y]" or "eräs [Y]", where Y is e.g. "eläin" (this includes all Klingon animals that are only glossed as their Earth equivalent in the English dictionary). For example, leSpal is translated as "eräs kielisoitin" and HurDagh is translated as "kielisoitin" (because it is a general term).

  • If the object of a verb is inflected in Finnish in a case other than accusative (jokin) or partitive (jotakin), the Finnish definition must include the word "jokin" inflected appropriately (for example, parHa' "pitää jostakin").

  • Remember to use the correct transitivity (e.g., pyöriä jIr vs. pyörittää jIrmoH).

Special logic in the Android parser

The parser in the Android app makes certain assumptions, and may need to be updated if entries are added to the database matching certain criteria.

Because "h" can stand for H in "xifan hol" mode, the sequence "ngh" is ambiguous (either n + gh or ng + H). As well, because "g" can stand for gh, the sequence "ng" may potentially be intended to mean n + gh (instead of ng). The parser has a hardcoded list of entries whose names contain ngh or ngH, which needs to be updated if any such entries are added to the database. (Otherwise, e.g., manghom may be expanded as man + ghom instead of mang + Hom.)
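The "ngh" ambiguity can be illustrated with a short Python sketch that lists both readings of every "ngh" cluster in a lowercase query. It only covers this one cluster (not the separate "g for gh" case), leaves all other letters untouched, and is not the Android parser's implementation; the function name ngh_readings is hypothetical.

```python
from itertools import product


def ngh_readings(query: str) -> list[str]:
    """List every reading of the 'ngh' clusters in a lowercase query.

    Each cluster is read either as n + gh or as ng + H.
    """
    pieces = query.split("ngh")
    choices = [("n + gh", "ng + H")] * (len(pieces) - 1)
    readings = []
    for combination in product(*choices):
        reading = pieces[0]
        for cluster, piece in zip(combination, pieces[1:]):
            reading += cluster + piece
        readings.append(reading)
    return readings


print(ngh_readings("manghom"))  # ['man + ghom', 'mang + Hom']
```

A query with no "ngh" cluster comes back unchanged as a single reading. A hardcoded list of known entries, as described above, is what allows the parser to prefer one of these readings over the other.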

For reasons of correctness and efficiency, it is not normally attempted to parse queries of 4 letters or fewer as complex words. However, because a few 2-letter verbs exist (which can take 2-letter prefixes), there is special handling for a hardcoded list of such verbs. This list needs to be updated if any 2-letter verbs are added to the database.

klingon-assistant-data's People

Contributors

barricade86, cuthbert, cyberman-tm, dadap, de7vid, dlyongemallo, fergusq, lievenlitaer, mycatisavulcan, redjimi, sigrlin, thiagocmatos, zrajm

klingon-assistant-data's Issues

remove the id field

Change the build scripts so that the _id field is dynamically generated at build time, so that changes to the _id aren't checked in.

This is the biggest source of headaches for merging.

Tag inherently plural words and their singular forms referring to `being`s.

Currently, the being tag is used to display a message about the plural suffix to use (i.e., -pu'). Since inherently plural words and their singular forms don't use plural suffixes, they haven't been so tagged. However, it was pointed out that this also affects which possessive suffixes they take, and so properly these words should be tagged as being.

For example: qempa' and no', mang and negh.

Sentence type "bc" - not explained

Example:
SoH 'Iv?
sen:bc

"bc" is never explained, should be in the header XML, I think.

Apparently it means "Beginner's Conversation"?

Should boQwI' know ungrammatical but canonical stuff?

For example, the suffixes -luH and -la'.
Apparently both are canonical, but definitely ungrammatical. Yet people use them. Rarely, fortunately, but I've recently encountered -luH and wasn't able to decipher it until someone pointed me toward those ungrammatical suffixes.

I was surprised that boQwI' doesn't know them - oversight or intentional?

IMO they should be listed - with strong warnings against their use.

[edit] Those suffixes come from KGT, it seems:
http://klingonska.org/dict/?q=-luH

Font severely distorted on iOS 14

There isn't much to say other than what's clearly visible in this screenshot:

[screenshot]

The same sort of distortion/terrible kerning is visible when typing search terms.

Add unique ID to each entry

Would it be possible to get a <column name="unique_id">…</column> field for each entry? Exactly what the value looks like is not important, as long as it's guaranteed to be unique in the dictionary, and never reused (should a word ever be removed from the dictionary).

This would make it easier for me to keep track of changes in your database over time, thereby allowing me to work through all the new stuff piecemeal and add the corresponding info to Archive of Okrandian Canon, and the Klingonska Akademien Dictionary.

generate a list of all valid words

Somebody made a request to generate a list of all valid words from the database. Basically, they want a list of all words (which would exclude things like sentences) with all possible affixes. The purpose of the list is validation (matching a user-entered input against the list).

I've explained to them that the existing code already does validation but apparently their application requires the explicit list. So if anybody feels like writing a script that outputs this, I can put you in touch with them.

German translation for "qeb" makes no sense. Don't have a better one, though...

I'm writing this as an "issue" because I don't have a good suggestion for a replacement.

However, the current word "Schusdek instrument" makes no sense - neither Google nor Wikipedia even seem to KNOW the word.
Even if it would be a specialized word, I think either of them should find something.

However, I don't know how to translate "windbag instrument" unless you want to describe it. As far as I can tell, that isn't a correct specification in English either (which may make sense, as the correct specifications are rather abstract. Bagpipes are classified as "woodwind"...)

"Luftbeutel-Instrument" would be a more literal translation, perhaps no more awkward than "windbag instrument"?
(The actual literal translation would be "Windbeutel", but that is mainly the German name for profiteroles - I don't think it's a word actually used for instruments.)

Misspelled German word

The word puj (weakness) is shown as Schäche, but it's Schwäche in German. This doesn't let you find the word by searching.
