glossematics's People

Contributors

simongray

Forkers

kuhumcst

glossematics's Issues

Handle edge case in TEI files

See: http://localhost:8080/files/tei/acc-1992_0005_032_Uldall_0780-tei-final.xml

Until now I have assumed that every TEI file had a <body> tag containing a single <div> child, and that inside this <div> there was a series of <pb> tags separating the document into pages.

Unfortunately, in the (presumably hand-transcribed) document above, the structure is

<div type="letter">
  <pb>
  ...
  <div>...</div>
</div>

<div type="letter">
  <pb>
  ...
</div>

<div type="note">
  ...
</div>

which is much more problematic and currently NOT handled.

I'm unsure how many other files follow this format, but presumably many do. I'm also unsure whether other files follow a different idiosyncratic structure.


In order to support at least this structure, I will have to rewrite the carousel-pbs transformer so that it matches the <body> tag instead, retrieves the contents of all non-note <div> tags, and uses this concatenated content to construct pages.
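
A minimal Python sketch of that idea, not the actual carousel-pbs transformer (which is not shown in this issue). It assumes that <pb> elements appear as direct children of the top-level <div> tags, and the sample document and the build_pages name are hypothetical. The {*} wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure described above.
SAMPLE = """<body>
  <div type="letter"><pb/><p>Page one.</p><div><p>Nested.</p></div></div>
  <div type="letter"><pb/><p>Page two.</p></div>
  <div type="note"><p>A note.</p></div>
</body>"""


def build_pages(body):
    """Split the children of all non-note divs into pages at each <pb>."""
    pages = []
    for div in body.findall("{*}div"):
        if div.get("type") == "note":
            continue  # notes are kept out of the page sequence
        for child in div:
            # Strip any namespace so the check works with or without xmlns.
            if child.tag.rsplit("}", 1)[-1] == "pb":
                pages.append([])  # a page break starts a new page
            else:
                if not pages:
                    pages.append([])  # content before the first <pb>
                pages[-1].append(child)
    return pages


pages = build_pages(ET.fromstring(SAMPLE))
for number, page in enumerate(pages, start=1):
    print(number, [" ".join("".join(el.itertext()).split()) for el in page])
```

Page breaks nested deeper than one level are not handled here; a real transformer would probably need to walk the tree recursively.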

Encyclopedia

The :ref attributes in the TEI documents should all resolve to an appropriate encyclopedia article. In some cases, members will have written some data down, in other cases I will have harvested basic metadata about the :ref directly from the TEI documents.

Firefox nav rendering bug

When caching is disabled, Firefox displays the nav menu items just fine, but when not, their width increases, creating the appearance of large margins between the items.

Meeting prep

This is an issue to track the desired changes that I will attempt to implement before the meeting on 2021/09/14 Tuesday. The changes are added via the feature/meeting-prep branch.

Must-haves

  • Make the current code work again. Refactor where necessary to incorporate changes from the required libraries, mostly cuphic and stucco.
  • Implement frontend routing. This is what will ensure smooth page navigation. Should probably use reitit.
  • Implement backend files endpoint. Serving the current batch of TEI and image files.
  • Create registry of available files. Solely for testing display and retrieval of TEI and facsimile.
  • Implement synchronized tei-facsimile view. Should be able to reuse most of the code from Stucco.

Nice-to-haves

  • Implement database. Use Asami. At first should just contain mock triples. The backend needs an endpoint that will construct an Asami query from a list of filters.
  • Implement backend entity endpoint. Transit-based, probably using transito. Support entity fetching. At first should just serve mock data.
  • Add a frontend page for searching/filtering. Uses the backend endpoint just created.
  • Add metadata to database from the partial dataset provided by Dorte. Requires writing a lot of Cuphic patterns or retrieval functions.
  • Display modes for search results. An alternative display using the timeline widget.
  • Language switching support. Basically, add Danish in the UI. Right now only English needs to be implemented.

Index lists for search

Based on feedback from Dorte, there should be some index lists to initiate searches sorted by names, e.g. a list of senders, a list of receivers, and so on. Clicking a name should go directly to a search containing that parameter.

Expands on #34.

Search interface improvements

  • network request loading indicators for metadata and search result fetches
  • move sorting/date filter closer to results page
  • default to send date sort?
  • nicer-looking search results

Mobile/responsive design

Make the page function on a smartphone. Essentially requires fixing all of the many little layout issues that appear when the screen size is small.

Generate triples from TEI XML documents

Regardless of which database is eventually going to be used (see: #4), it is 99% likely that I will be using a triplestore of some kind. There is functionality available in Cuphic (see: kuhumcst/cuphic#1) to facilitate this, although it may have to be tweaked further.

I now have access to the university's "N drive" where it should be possible to find sample data. The task is now to recursively go through each document in a list of documents and return valid metadata triples. These triples should be derived from the actual metadata in the TEI header, from metadata in the contents, and from possible implied metadata that can be derived from the content, e.g. the presence of certain words or some other feature.

API-level and frontend authz

The current authz functionality works on a route level, but this SPA (like most modern web apps) has a thick client which renders content locally and a server API which is accessed via a single endpoint. This necessitates the use of authz in these domains too.

Supposing the API is a single endpoint, the endpoint itself may be either open to all or hidden behind an authentication restriction. Within the logic of the API, it should be possible (by passing along the relevant authz info) to have several (permit-if...) calls which use the call context, take some condition, and create branching. The branching might even be missing, in which case the code could simply throw an exception which would be caught by the route's interceptor chain.

Throwing an exception might be desirable if the frontend is also fully integrated with the permission system. A similar (maybe the same, e.g. CLJC) function could be used to create branching content based on permissions in the frontend code. This could and should probably be simplified by setting a dynamic variable containing the authz info which the frontend code can then refer to.

No automated HTTPS certificate renewal

The SSL certificate from Let's Encrypt (for handling HTTPS on port 443) needs to be renewed often (3 month max). The renewal process is manual and fairly tedious currently:

  • shut down the docker container
  • run certbot renew
  • incorporate the newly generated certificate into a keystore to be used by Pedestal (Jetty)

Let's Encrypt uses the certbot command line tool to handle renewals, but since we use a non-standard web server (Jetty) it is necessary to either reverse proxy it using e.g. nginx or set up code for renewing certificates and creating keystores within Jetty to set up automatic renewal.

Having to shut down the web service since renewal happens on port 80 is especially annoying, but that can probably be mitigated by simply only serving on port 443.

While the nginx solution sounds the easiest to deal with, I am unsure how it affects the whole SAML flow and the way I have set it up currently inside the Pedestal configuration. Perhaps it isn't a big deal, but some code will definitely have to be rewritten on the Clojure side in either case.

Delete documents where *-final.xml versions exist

It seems like Dorte's system is to label TEI files with -final whenever they have an actual transcription. This means there are lots of duplicates in the dataset that need to be removed. I think doing so programmatically during the bootstrap process is the best solution. It should simply be an Asami query that lists all of the entities ending in "-final.xml", and then the entities that need to be deleted can be derived from that.
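
The derivation step could look roughly like the following Python sketch. It is a stand-in for the Asami query, which is not shown here, and it assumes that a superseded file has the same name as its final version minus the "-final" part. The function name and the sample filenames are hypothetical.

```python
def superseded_by_final(filenames, suffix="-final.xml"):
    """Return the non-final filenames that have a matching *-final.xml version."""
    names = set(filenames)
    superseded = set()
    for name in names:
        if name.endswith(suffix):
            # Assumed naming scheme: "x-final.xml" supersedes "x.xml".
            draft = name[: -len(suffix)] + ".xml"
            if draft in names:
                superseded.add(draft)
    return sorted(superseded)


# Hypothetical filenames, for illustration only.
files = ["doc_0001.xml", "doc_0001-final.xml", "doc_0002.xml", "doc_0003-final.xml"]
print(superseded_by_final(files))
```

Files using other suffix variants, such as the "-tei-final.xml" form that appears elsewhere in this tracker, would need their own rule.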

Dynamic timeline HotZones

With the timeline experiment being largely successful and merged into the master branch, one issue remains:

  • how to make the timeline work for a variable amount of events with an unknown time scale? e.g. from search results.

Description of problem

Basically, the problem arises when a whole bunch of events all occur within minutes or seconds of each other, while many of the other events occur days or weeks before or after this concentration of events. How do you represent that on a timeline?

You can do it linearly on a scale using something like seconds as the main unit of time, but that will result in many events being separated by an incredible distance on the timeline since the time series spans weeks.

Another option is to use weeks as the unit of time, but that results in events getting piled in those hot spots where there is a new event every minute or second.

So the solution of the Simile Widgets library is to predefine hot spots and stretch the time band at those places. The issue is that the Simile Timeline is meant to be used for static, known data, so the zones are manually created and tailored to the data at hand. Manually created hotspot zones are not compatible with dynamically generated timelines.
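
One possible way to derive hot zones from the data itself is to group events separated by small gaps. This is only a sketch of that idea in Python, not something the Simile library provides. The function name, the one-hour gap threshold, and the minimum cluster size are all assumptions that would need tuning against real search results.

```python
from datetime import datetime, timedelta


def find_hot_zones(times, max_gap=timedelta(hours=1), min_events=3):
    """Group events whose neighbours are at most max_gap apart.

    Returns (start, end) pairs for every group of at least min_events events.
    """
    zones = []
    cluster = []
    for t in sorted(times):
        if cluster and t - cluster[-1] > max_gap:
            # The gap is too large: close the current cluster.
            if len(cluster) >= min_events:
                zones.append((cluster[0], cluster[-1]))
            cluster = []
        cluster.append(t)
    if len(cluster) >= min_events:
        zones.append((cluster[0], cluster[-1]))
    return zones


# Hypothetical events: a burst on 1 January, then sparse events weeks later.
events = [
    datetime(2021, 1, 1, 10, 0),
    datetime(2021, 1, 1, 10, 1),
    datetime(2021, 1, 1, 10, 2),
    datetime(2021, 1, 15, 9, 0),
    datetime(2021, 2, 1, 12, 0),
]
for start, end in find_hot_zones(events):
    print(start.isoformat(), "to", end.isoformat())
```

Each returned pair could then be passed to the timeline as a zone to stretch, with the amount of magnification chosen from the number of events in the zone.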

Privacy statement

This is needed to document e.g. legitimate interest with regard to the use of names in the search metadata and as an overall explanation of the processed data.

Also include in Pedestal SP consent window somehow...? Maybe just as a link to the privacy statement.

Add unpublished documents to bibliography

Certain key documents have never been published, yet are nearly fully formed. Henrik suggests that they are merged into the bibliography and placed after the other documents for each listed year, clearly marked.

Restriction override for Pedestal SP

During development it should be possible to disable or replace auth restrictions for all endpoints. Disabling auth restrictions entirely is very convenient when testing e.g. API requests.

Since all of the Pedestal SP configuration happens in the conf map this is also where this should be defined. I think a key called :restriction-override could be a way to accomplish this.

Now the tricky part is how to implement this in a way such that it works across the entire application for all backend endpoints as well as on the frontend. The only way I imagine this working is by adding a key to the assertions map (which is all you have to go on in the frontend or elsewhere using inline checks) also called :restriction-override which contains the replacement restriction (can just be :all).

Now this needs to work in a variety of different scenarios:

  • inline checks, i.e. if-permit and only-permit.
  • In the guard-ic
  • When checking routes in advance using permit-request?

By setting a key in the assertions map it is possible to reach all of these places, but each macro/function will need to contain logic to deal with it.

The place to insert the information is in a new interceptor that might be called override-ic and which should be inserted immediately after the session-ic (which adds the assertions to the request map). This should be included automatically in the auth-chain function based on the key in the conf map.
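
The flow can be illustrated with a small Python sketch. It is not Pedestal SP code: requests and assertions are modelled as plain dicts, and the restriction format (a dict of required attribute values, or "all") is a hypothetical stand-in for whatever the library actually uses. Only the restriction-override key comes from the proposal above.

```python
def override_interceptor(request, conf):
    """Copy the override from the conf map into the request's assertions map."""
    override = conf.get("restriction-override")
    if override is None:
        return request  # no override configured: leave the request untouched
    assertions = {**request.get("assertions", {}), "restriction-override": override}
    return {**request, "assertions": assertions}


def permitted(assertions, restriction):
    """Check a restriction, letting an override in the assertions map replace it."""
    restriction = assertions.get("restriction-override", restriction)
    if restriction == "all":
        return True
    # Hypothetical restriction format: every listed attribute must match.
    return all(assertions.get(key) == value for key, value in restriction.items())


request = {"assertions": {"role": "student"}}
admin_only = {"role": "admin"}

print(permitted(request["assertions"], admin_only))
overridden = override_interceptor(request, {"restriction-override": "all"})
print(permitted(overridden["assertions"], admin_only))
```

Because the override travels inside the assertions map, the same check works for inline checks, for the guard, and for checking routes in advance, as long as each of them consults the key first.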

Delegate GZipHandler and DoSFilter to nginx

The solution to #8 will most likely be to run nginx and certbot inside Docker containers alongside the web service as part of a docker-compose configuration. Part of the challenge here is to remove web server functionality from the Jetty configuration and put it inside nginx instead, specifically: GZipHandler and DoSFilter.

Dev time reactivity broken

I speculate that it has something to do with reitit, more specifically the fact that the page rendering function is kept as state, which means the old rendering function is kept around whenever changes are made. The fix should be rerunning the reitit navigation code as part of the shadow-cljs reload hook.

Session time extension

Currently, the default is to have a max session time of 8 hours. It would make sense to increase the default time to some higher value—e.g. 24 hours—and then keep extending it every time there is a request made from an authenticated account. In this way, accounts that do regular, daily work will never have to log in, except perhaps after the weekend.
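
The sliding expiry can be sketched in a few lines of Python. This is an illustration of the idea only: the session is modelled as a plain dict, and the touch_session name and the expires_at key are hypothetical rather than taken from the existing session handling.

```python
from datetime import datetime, timedelta

SESSION_LIFETIME = timedelta(hours=24)  # the proposed new default


def touch_session(session, now):
    """Return the session with a renewed expiry, or None if it has already expired."""
    if now >= session["expires_at"]:
        return None  # expired: the user has to log in again
    return {**session, "expires_at": now + SESSION_LIFETIME}


login = datetime(2021, 9, 13, 9, 0)
session = {"user": "example", "expires_at": login + SESSION_LIFETIME}

# A request the next morning, just inside the window, extends the session.
session = touch_session(session, datetime(2021, 9, 14, 8, 30))
print(session["expires_at"])
```

With a 24-hour lifetime a user who works every weekday keeps the session alive from Monday to Friday, while the gap over the weekend lets it expire.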

Inline notes

Basically, just support replacing inlined note elements with a small clickable element that expands to the full note.

One potential issue is distinguishing between the inlined notes and the bottom notes div.

Database choice & API design

These two seem interrelated. The choice of database affects how well the API matches.

Some requirements for the database:

  • Should be able to filter TEI documents based on metadata.
  • Should also support some degree of freetext search of TEI documents.
  • Should link TEI documents with comments and support persistent adding of comments (so no in-memory db... for this part at least).
  • Should perhaps also link TEI documents to facsimiles (the images), although that could also be accomplished by simply putting the images in the public resources folder of the web server and fetching them on an ad hoc basis based on the content of the TEI XML files.
  • Should be reasonably performant.

Some additional considerations for the database:

  • K.I.S.S.
  • Should fit the data model (documents, comments)
  • Should preferably be able to adapt to changing requirements, e.g. changes to the TEI subset we support.
  • Is it better to represent documents as URIs to files on disk or put them fully in the database? I'm leaning towards the former, but maybe there's some advantage to storing XML within a database.

Some requirements for the API:

  • Should be quite minimal and relatively standards-compliant.
  • Preferably fits the data model of the database well.

Bibliography

Certain documents appear in a bibliography list. This list should appear in a similar way to the search indices, but the content is slightly different.

Fix page break detection

Certain documents have buggy page segmentation with pb elements that aren't picked up correctly by the current pattern, e.g. acc-1992_0005_032_Uldall_1000-final.xml.

Frontpage

Should contain

  • A project description
  • A login button (scrap the login page entirely)?
  • Photo collage of relevant people harvested from the AAU website
  • A list of important search entry points, e.g. important topics and correspondences

Click search criteria to change field

Clicking on a search criterion should trigger a dropdown allowing a field change.

This will make it easy to start a search via clicking on a ref or via the index lists, while also allowing it to be completely fine-tuned afterwards, e.g. maybe the field should actually be author as opposed to anything.

Considerations:

  • Selectable fields should match the ones allowed for the given entity-type.
  • Field+Entity combinations that are already present as a criterion should not be possible to make, i.e. these are also disabled in the dropdown options.
    • Furthermore, the above UX feature should also be used in the primary field select widget next to the text input field.
  • This probably requires putting a transparent select widget on top of the search criteria -OR- requires replacing either of the text parts of the criteria into a select widget.
    • One complication is the fact that we currently do not display any field identifier if the chosen field is anything and the most obvious place to click is the field identifier.

Mark TEI documents that have/don't have a text body in the database

Most of the TEI documents do not have any text body. Dorte has chosen to represent this as

<text>
        <body>
            <div type="document">
                <p xml:id="p1"><!-- tekst -->                </p>
            </div>
        </body>
    </text>

Basically, if the content of the text body is a <div> with an empty <p> tag then we can assume that it is metadata only!
This should be handled as a part of the scraping process when bootstrapping the database.
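
A minimal Python sketch of such a check, assuming Python 3.8 or later for the {*} namespace wildcard. It uses a slightly broader test than the one described above: a document counts as metadata only when its <body> contains no text at all, which also covers the empty <p> case. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_metadata_only(tei_xml):
    """Return True when the document's <body> contains no actual text."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//{*}body")
    if body is None:
        return True  # no body at all, so there is nothing to display
    # The default parser discards comments, so only real text is joined here.
    return "".join(body.itertext()).strip() == ""


EMPTY = """<text>
  <body>
    <div type="document">
      <p xml:id="p1"><!-- tekst -->   </p>
    </div>
  </body>
</text>"""

FILLED = EMPTY.replace("<!-- tekst -->", "Dear Sir,")

print(is_metadata_only(EMPTY))
print(is_metadata_only(FILLED))
```

The resulting boolean could be stored on the document entity during bootstrapping so that search results can distinguish the two kinds of documents.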

Metadata extraction from TEI documents

This is contingent on ongoing database architecture decisions TBD in #4, i.e. the amount of metadata extraction that is actually required depends on the amount that can be automatically extracted through the way the XML itself is represented in the database of choice.

Anyway, supposing there is no way to query all of the metadata of the TEI XML through its database representation, it becomes necessary to batch process the TEI documents beforehand and extract the metadata (and probably the text content too) in order to represent this inside the database.

The program itself should probably just be a script that loads TEI files recursively from a source folder and writes out maps of metadata into a single file (newline-delimited EDN? https://github.com/lambdaisland/edn-lines). This file can then be imported into whatever database solution is in use.
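
The shape of such a script might be as follows. This Python sketch writes JSON Lines as a stand-in for newline-delimited EDN and extracts only the first title from the TEI header. Which fields to extract, the function names, and the sample document are all assumptions.

```python
import json
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_metadata(path):
    """Build one metadata record (a dict) for a single TEI file."""
    root = ET.parse(path).getroot()
    header = root.find(".//{*}teiHeader")
    title = header.find(".//{*}title") if header is not None else None
    text = title.text.strip() if title is not None and title.text else None
    return {"file": path.name, "title": text}


def write_metadata(source_dir, out_file):
    """Recursively process every XML file and write one record per line."""
    paths = sorted(Path(source_dir).rglob("*.xml"))
    with open(out_file, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(json.dumps(extract_metadata(path)) + "\n")
    return len(paths)


# Hypothetical sample document in a nested folder, for demonstration only.
SAMPLE = ("<TEI><teiHeader><fileDesc><titleStmt><title>Letter</title>"
          "</titleStmt></fileDesc></teiHeader><text><body/></text></TEI>")

with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "letters"
    nested.mkdir()
    (nested / "sample.xml").write_text(SAMPLE, encoding="utf-8")
    out_path = Path(tmp) / "metadata.jsonl"
    count = write_metadata(tmp, out_path)
    lines = out_path.read_text(encoding="utf-8").splitlines()
    print(count, lines)
```

Writing one self-contained record per line keeps the output easy to stream into whichever database ends up being chosen.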

User menu

User assertions info, customisation, log out action, and other things.

Timeline design

Currently going through different options for creating a timeline.

Sofie's timeline (VEGA-based)

https://github.com/kuhumcst/sofie-vega-timeline

She got pretty far, but the design is not favoured by Viggo (and probably Frans too). It also has a few remaining layout issues that I am unlikely to solve (e.g. what to do about clipped or overlaid text). On the other hand, it is easy to embed inside reagent.

Simile widgets timeline

https://github.com/simile-widgets/timeline (or maybe https://github.com/simile-widgets/ancient-simile-widgets)

This is the timeline widget used by the Georg Brandes project. The design is exactly what Frans and Viggo want but, aside from certain issues I have with the design itself, it has a few problems:

  • It is unmaintained and people are having problems actually building it from source.
  • The official documentation is gone.
  • It definitely isn't accessibility-compliant.
  • It is based on static compilation using a Java program into an HTML file. This is fine for a single static page (i.e. Viggo's data), but it won't work for a dynamic view of search results.

Asami not properly bootstrapped in prod

I guess I will need to add logging in various places to see what's happening.

Maybe also a good idea to clean up the overall bootstrap process (not just db) and update the README.md.

Collection is not an integer

Currently, the collection value in the TEI documents is resolved as an integer, since it often does appear as an integer string. However, in some cases it is decidedly not an integer.

  1. they shouldn't be converted to integer
  2. the value shouldn't be sortable

Search frontend

  • Should comprise an expandable area at the top of "badges" (registered entities) as well as a datalist text input below + select widget for specific relations to match.
    • ... but a lot prettier than now
  • The text input should validate input such that only correct input can be submitted.
  • The datalist content should be sourced via a network call on page load.
    • Both full name and "surname, first name" versions should be listed.
  • Should initialize ordered-set from entity in query-params for the first page load.
  • Dates should be automatically detected and context-dependent options must appear when a date is being entered, i.e. select on date, after date, before date.
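
The date detection in the last point could work along these lines. This Python sketch assumes that dates are typed in ISO format (YYYY-MM-DD), which the issue does not specify, and the function name and the relation labels are hypothetical.

```python
from datetime import date


def date_options(text):
    """Return the date-specific search options if the input parses as a date."""
    try:
        parsed = date.fromisoformat(text.strip())
    except ValueError:
        return []  # not a date: fall back to the regular entity options
    return [(relation, parsed) for relation in ("on", "after", "before")]


print(date_options("1938-05-17"))
print(date_options("Uldall"))
```

An empty result means the input should be treated as an ordinary name or entity, so the regular select widget stays in place.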

Styling that is closer to the overall look of the documents

Now that there is a larger selection of facsimiles, the styling of the TEI rendering can be appropriately adjusted, e.g. paragraphs can use an indented style rather than the traditional HTML separated-by-newline style. Another option is changing the default font to a typewriter-esque one.

Rate limiting

Several options exist:

Currently I favour using Jetty filters. They are probably the most efficient solution and configuration seems quite basic too:

Comment functionality

It must be possible to attach comments to documents, preferably linked to specific paragraphs.

  • how should it work?
  • how should it look?
  • what tech to use, Asami or sqlite?
