manx-corpus-search's People

Contributors

david-allison, dependabot[bot], lauterb

Forkers

lauterb

manx-corpus-search's Issues

Redo Logo

It's a little too tall on mobile (dominating compared to the objective of the site).

Make it wider (or redo it completely).

Investigate SIGKILL

Jul 07 14:05:59 ubuntu-s-1vcpu-1gb-lon1-01 systemd[1]: manx-corpus.service: Main process exited, code=killed, status=9/KILL
Jul 07 14:05:59 ubuntu-s-1vcpu-1gb-lon1-01 systemd[1]: manx-corpus.service: Failed with result 'signal'.
Jul 07 14:06:09 ubuntu-s-1vcpu-1gb-lon1-01 systemd[1]: manx-corpus.service: Scheduled restart job, restart counter is at 1.
Jul 07 14:06:09 ubuntu-s-1vcpu-1gb-lon1-01 systemd[1]: Stopped Manx Corpus Search.
Jul 07 14:06:09 ubuntu-s-1vcpu-1gb-lon1-01 systemd[1]: Started Manx Corpus Search.

Move dictionary to server-side

I noticed that it can't find a definition for goll-mygeayrt, although it can for goll and mygeayrt.

Dictionary definitions were tested as a 'nice to have' - it seems they're useful enough to stay (I've had a few positive comments on them).

Currently, we don't provide the server with enough context to provide a translation. We work on the 'word' level, but the client side does not have enough context to either:

  • break a word into morphemes
  • expand a word to a known phrase/idiom

The request should be: (context, selection) rather than (word). From here, on the server we can process the selection and more accurately determine the correct phrase/word to return.

The test case above could also be handled at the dictionary level, but then we would lose out on phrase selection and on defining recursive/contextual structures.
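A minimal sketch of what a (context, selection) lookup could look like on the server. This is written in Python for brevity rather than the project's C#, and every name here (`LookupRequest`, `lookup`, the dictionary shape) is an assumption for illustration, not existing code. It tries a known phrase covering the selection first, then the selection itself, then its hyphen- or space-separated parts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupRequest:
    """Hypothetical request shape: the surrounding text plus selection offsets."""
    context: str  # the line or sentence the user is reading
    start: int    # offset of the selection within context
    end: int      # end offset (exclusive)

    @property
    def selection(self) -> str:
        return self.context[self.start:self.end]


def lookup(request, dictionary):
    """Return a list of (headword, definition) pairs for the selection.

    `dictionary` is assumed to be a plain dict of lower-cased headwords.
    """
    lowered = request.context.lower()

    # 1. Prefer the longest known multi-word phrase that covers the selection.
    for phrase in sorted(dictionary, key=len, reverse=True):
        if " " not in phrase:
            continue
        pos = lowered.find(phrase)
        while pos != -1:
            if pos <= request.start and request.end <= pos + len(phrase):
                return [(phrase, dictionary[phrase])]
            pos = lowered.find(phrase, pos + 1)

    # 2. Exact match on the selection itself.
    selection = request.selection.lower()
    if selection in dictionary:
        return [(selection, dictionary[selection])]

    # 3. Fall back to the parts of a hyphenated or multi-word selection.
    parts = [p for p in re.split(r"[-\s]+", selection) if p]
    return [(p, dictionary[p]) for p in parts if p in dictionary]


# Placeholder definitions only; real entries would come from the dictionary data.
entries = {"goll": "<definition of goll>", "mygeayrt": "<definition of mygeayrt>"}
print(lookup(LookupRequest("goll-mygeayrt", 0, 13), entries))
```

With this shape, the goll-mygeayrt case falls through to step 3 and returns both component entries, while a phrase entry added to the dictionary later would win at step 1.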

Make the search language English/Manx selectable in the URL

I have added ManxCorpus to the Manx-English dictionaries accessible via Multidict

https://multidict.net/multidict/?sl=gv&tl=en&dict=MxCorp&word=snaue 

Very easy, since the URL

https://manxcorpus.com/?q=snaue

will search for the Manx word “snaue”.

However, I have bother doing English searches, since I can’t find any URL which will search ManxCorpus for the English word “crawl”, for example. What I would hope for is something like

https://manxcorpus.com/?q=crawl&lang=en
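A rough sketch of how the requested `lang` parameter could be read alongside `q`. It is in Python for illustration only; the function name, the `gv` default for Manx, and the fallback behaviour are assumptions, not the site's actual handling.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: "gv" (Manx) stays the default so existing ?q= links keep working.
SUPPORTED_LANGUAGES = {"gv", "en"}
DEFAULT_LANGUAGE = "gv"


def parse_search_url(url):
    """Return (search term, language code) from a search URL."""
    query = parse_qs(urlsplit(url).query)
    term = query.get("q", [""])[0]
    lang = query.get("lang", [DEFAULT_LANGUAGE])[0].lower()
    if lang not in SUPPORTED_LANGUAGES:
        lang = DEFAULT_LANGUAGE  # unknown values fall back to a Manx search
    return term, lang


print(parse_search_url("https://manxcorpus.com/?q=crawl&lang=en"))
print(parse_search_url("https://manxcorpus.com/?q=snaue"))
```

Keeping Manx as the default means the existing Multidict link would not need to change.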

Make source of document more prominent

It would be useful to have some clear marking when documents are the original text (as in most of the eighteenth-century religious texts, which are translations), and when they are new translations from the Manx (Ned Beg, for example).

Handle Newspaper sources

https://www.imuseum.im/Olive/APA/IsleofMan/get.res?id=page.Scripts&kind=script&uq=20210325071104&for=%7E%2Fdefault.aspx&mode=group

version:'2.7.102.0',appVersion:'5.2.6',uq:'20210325071104',viewPointVersion:'4.28.18017.37845',lastModificationDateString:'2021-03-25-19-11-04'

COM:"Calf of Man Bird Observatory Report",
CEC:"Camp Echo",
CHU:"Camp Humor",
KCZ:"Camp Zeitung",
CTG:"Castletown Gazette",
DSC:"Das Schleierlicht",
MGN:"German Gymnastics Association",
GFL:"Green Final",
HNS:"Holiday News",
IDT:"Isle of Man Daily Times",
IME:"Isle of Man Examiner",
IMT:"Isle of Man Times",
WAC:"Isle of Man Weekly Advertising Circular",
IWG:"Isle of Man Weekly Gazette",
JMM:"Journal of The Manx Museum",
LAE:"Lager Echo",
DLA:"Lager Laterne",
LAZ:"Lager Zeitung",
LAU:"Lager-Ulk",
MNA:"Manks Advertiser",
MNM:"Manks Mercury",
TMC:"Manx Cat",
TFP:"Manx Free Press",
MNB:"Manx Liberal",
MMNT:"Manx Museum and National Trust Report",
MNP:"Manx Patriot",
MRS:"Manx Rising Sun",
MNS:"Manx Star",
TMS:"Manx Sun",
TMN:"Manxman",
MDP:"Mona Daily Programme",
MNH:"Mona's Herald",
PCG:"Peel City Guardian",
PSL:"Peel Sentinel",
QUT:"Quousque Tandem",
RCE:"Ramsey Chronicle",
RYC:"Ramsey Courier",
RWN:"Ramsey Weekly News",
TRS:"Rising Sun",
TTS:"TT Special",
UNU:"Unter Uns",
WER:"Werden"

A newspaper image seems to be in the format:

https://www.imuseum.im/Olive/APA/IsleofMan/get/image.ashx?kind=block&href=MNH%2F1833%2F10%2F25&id=Ar0010000&ext=.png

Other blocks differ in the trailing parameters, e.g. &id=Ar0080001&ext=.png

Where MNH%2F1833%2F10%2F25 = MNH/1833/10/25
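As a sketch, the observed URL can be rebuilt from a newspaper code, a date and a block id. This only mirrors the one example above; the function name is made up, and whether the endpoint accepts other combinations is untested.

```python
from datetime import date
from urllib.parse import urlencode

IMAGE_ENDPOINT = "https://www.imuseum.im/Olive/APA/IsleofMan/get/image.ashx"


def newspaper_image_url(code, published, block_id):
    """Build an image URL in the format observed above (hypothetical helper)."""
    href = f"{code}/{published:%Y/%m/%d}"  # e.g. MNH/1833/10/25
    params = {"kind": "block", "href": href, "id": block_id, "ext": ".png"}
    # urlencode percent-encodes the slashes in href as %2F.
    return f"{IMAGE_ENDPOINT}?{urlencode(params)}"


print(newspaper_image_url("MNH", date(1833, 10, 25), "Ar0010000"))
```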

==Problems (currently)==

Unknown if these are solvable

Notes: Hide & make clickable

JH would prefer for notes to not be visible while reading the corpus.

I agree. It would be sensible to hide notes by default and toggle them if [1] is clicked.

If a linkage between notes and the text can't be found, then the notes should be shown as-is.
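One possible way to decide whether notes can be linked, sketched in Python. It assumes each note starts with a marker such as `[1]` that also appears in the text; that format is a guess, and `link_notes` is a hypothetical helper. If any note cannot be matched, the caller shows all notes as-is.

```python
import re

MARKER = re.compile(r"\[(\d+)\]")


def link_notes(text, notes):
    """Return (linked, notes_by_marker).

    linked is False when any note lacks a leading [n] marker or its marker
    is missing from the text; the notes should then be shown unchanged.
    """
    markers_in_text = set(MARKER.findall(text))
    notes_by_marker = {}
    for note in notes:
        stripped = note.strip()
        match = MARKER.match(stripped)
        if match is None or match.group(1) not in markers_in_text:
            return False, {}
        notes_by_marker[match.group(1)] = stripped[match.end():].strip()
    return True, notes_by_marker


print(link_notes("Some text[1] continues.", ["[1] A note about the text."]))
```

When `linked` is true, the client can hide the notes and reveal `notes_by_marker["1"]` when `[1]` is clicked.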

[chore] Remove SQLite

SQLite takes up RAM, is more finicky and verbose than using C#, and is no longer needed now that we've moved to Lucene.

[enhancement] Handle newspaper archive linkage/provenance

We have a lot of corpus information from the Newspaper archives on https://www.imuseum.im/newspapers/.

Ideal:

  • Map from a text-based reference in manifest.json.txt to a data structure. Add an arbitrary validation check on manx-search-data
  • Map from this data structure to a URL on the corpus search site
  • Redirect from this URL to the appropriate page on the iMuseum site
  • Add a link on the public-facing site
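The first two steps could be sketched as below. The reference format (`CODE/YYYY/MM/DD`), the `/newspaper/...` route and all names are assumptions for illustration; the real format in manifest.json.txt may differ. The code set is a small subset of the codes listed in the newspaper sources issue.

```python
import re
from dataclasses import dataclass
from datetime import date

# Subset of the iMuseum newspaper codes; extend as needed.
KNOWN_CODES = {"IME", "IMT", "MNH", "RYC"}

# Assumed reference format: CODE/YYYY/MM/DD, e.g. MNH/1833/10/25
REFERENCE = re.compile(r"^([A-Z]{3,4})/(\d{4})/(\d{2})/(\d{2})$")


@dataclass(frozen=True)
class NewspaperReference:
    code: str
    published: date

    def corpus_path(self):
        """Hypothetical path on the corpus search site that would redirect."""
        return f"/newspaper/{self.code}/{self.published:%Y/%m/%d}"


def parse_reference(text):
    """Parse and validate a text reference; raises ValueError if invalid."""
    match = REFERENCE.match(text.strip())
    if match is None:
        raise ValueError(f"not a newspaper reference: {text!r}")
    code, year, month, day = match.groups()
    if code not in KNOWN_CODES:
        raise ValueError(f"unknown newspaper code: {code}")
    # date() itself rejects impossible dates such as month 13.
    return NewspaperReference(code, date(int(year), int(month), int(day)))


print(parse_reference("MNH/1833/10/25").corpus_path())
```

A validation check on manx-search-data could call `parse_reference` on every reference and fail on the first `ValueError`.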

Improve SEO

Now that we have a domain, we should focus more on SEO. There is lots of low-hanging fruit here:

  • Add details to the HEAD of the main corpus search page
  • Provide HTML versions of works
  • Backlinks (for later, long-term domains requested from govt)

[enhancement] Improve highlighting

Highlighting likely won't be perfect, but we can make it a lot better:

  • Add highlighting to the main page
  • Obtain the positions of the tokens that match in Lucene
    • See if we can match these to the 'real text' which is displayed to the user - likely possible in 98% of cases - the word at index 5 in Lucene will likely be word 5 in the real text.
    • Question: server-side or client-side? Both are equally powerful, and it seems I'm avoiding client-side because I'm avoiding JavaScript
  • Translations - likely going to be harder, main aim is to fix punctuation
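The index-matching idea can be sketched as follows, in Python for illustration. It assumes the displayed text splits into words the same way the Lucene analyzer does, which is exactly the "98% of cases" caveat; the word regex and the `highlight` helper are assumptions, and HTML escaping is left out.

```python
import re

# Assumed word pattern; it will not match every analyzer's tokenisation.
WORD = re.compile(r"\w+(?:['’-]\w+)*")


def highlight(text, token_indexes, open_tag="<mark>", close_tag="</mark>"):
    """Wrap the words at the given 0-based token indexes in highlight tags."""
    wanted = set(token_indexes)
    pieces = []
    last = 0
    for index, match in enumerate(WORD.finditer(text)):
        if index in wanted:
            pieces.append(text[last:match.start()])  # untouched text before the word
            pieces.append(open_tag + match.group() + close_tag)
            last = match.end()
    pieces.append(text[last:])  # remainder, including trailing punctuation
    return "".join(pieces)


print(highlight("Hello, big world!", [1]))
```

Because punctuation sits outside the matched words, it is copied through unchanged, so the same approach would work on either the server or the client.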

CI Publish: Hangs on downloading npm dependencies

https://github.com/david-allison-1/manx-corpus-search/runs/2550569740?check_suite_focus=true

2021-05-10T23:29:10.8274884Z Determining projects to restore...
2021-05-10T23:29:11.7157857Z Restored /var/www/manx-corpus/CorpusSearch/CorpusSearch.csproj (in 472 ms).
2021-05-10T23:29:12.5325316Z Restored /var/www/manx-corpus/CorpusSearch.Test/CorpusSearch.Test.csproj (in 800 ms).
2021-05-10T23:29:24.2855022Z CorpusSearch -> /var/www/manx-corpus/CorpusSearch/bin/Debug/net5.0/CorpusSearch.dll
2021-05-10T23:29:24.2866213Z CorpusSearch -> /var/www/manx-corpus/CorpusSearch/bin/Debug/net5.0/CorpusSearch.Views.dll
2021-05-10T23:29:24.4409503Z v15.7.0
2021-05-10T23:29:24.4426333Z Restoring dependencies using 'npm'. This may take several minutes...
2021-05-10T23:53:16.3627646Z ##[error]The operation was canceled.

Unsure what's happening here. It works when I SSH into the machine.

  • Need to re-run to see if this is flaky, or broken
  • Might want to consider using a local npm repo mirror on the box

[enhancement] Make Dictionaries more clear

I like that the search shows the Cregeen entry and the dictionary entry first. Presumably “dictionary” here means Phil Kelly’s dictionary, while “Dictionary” at the top right of the main page = Cregeen Aa-orderit. It would be helpful to make this clear.

From Max
