kuhumcst / glossematics

The life of Louis Hjelmslev.

Home Page: https://glossematics.dk

Clojure 52.55% CSS 10.79% Dockerfile 0.48% Shell 0.03% JavaScript 35.78% HTML 0.38%

tei tei-xml xml facsimile reagent

glossematics's Issues

Handle edge case in TEI files

See: http://localhost:8080/files/tei/acc-1992_0005_032_Uldall_0780-tei-final.xml

Until now I have assumed that every TEI file had a tag containing a child, and that inside this child were a series of pb tags separating the document into pages.

Unfortunately, in the (presumably hand-transcribed) document above, the structure is

<div type="letter">
  <pb>
  ...
  <div>...</div>
</div>

<div type="letter">
  <pb>
  ...
</div>

<div type="note">
  ...
</div>

which is much more problematic and currently NOT handled.

I'm unsure how many other files follow this format, but presumably many do. I'm also unsure whether other files follow a different idiosyncratic structure.

In order to support at least this structure, I will have to rewrite the carousel-pbs transformer such that it matches the tag instead, retrieves the contents of all non-note div tags, and uses this concatenated content to construct pages.
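As a rough illustration of the approach described above, here is a minimal Python sketch (not the project's Clojure carousel-pbs transformer). It skips every div marked as a note, concatenates the children of the remaining divs, and starts a new page at each pb. The sample XML, the function names split_pages and page_text, the un-namespaced tags, and the assumption that pb elements are direct children of each div are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure shown above (no TEI namespace).
SAMPLE = """
<body>
  <div type="letter">
    <pb n="1"/>
    <p>First page.</p>
    <div><p>Nested text.</p></div>
  </div>
  <div type="letter">
    <pb n="2"/>
    <p>Second page.</p>
  </div>
  <div type="note">
    <p>A note.</p>
  </div>
</body>
"""


def split_pages(root):
    """Group the children of all non-note <div> elements into pages at each <pb>."""
    pages = []
    for div in root.findall("div"):
        if div.get("type") == "note":
            continue  # notes are not part of the paged content
        for child in div:
            if child.tag == "pb":
                pages.append([])  # a page break starts a new page
            elif pages:
                pages[-1].append(child)
            else:
                pages.append([child])  # content before the first <pb>
    return pages


def page_text(page):
    """Flatten one page to plain text, one chunk per top-level element."""
    return " ".join("".join(el.itertext()).strip() for el in page)


pages = split_pages(ET.fromstring(SAMPLE))
texts = [page_text(p) for p in pages]
```

Real TEI files use the TEI namespace and may nest pb elements more deeply, so an actual implementation would likely need a recursive walk rather than this flat loop.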

The :ref attributes in the TEI documents should all resolve to an appropriate encyclopedia article. In some cases, members will have written some data down, in other cases I will have harvested basic metadata about the :ref directly from the TEI documents.

Firefox nav rendering bug

When caching is disabled, Firefox displays the nav menu items just fine, but when caching is enabled, their width increases, creating the appearance of large margins between the items.

Meeting prep

This is an issue to track the desired changes that I will attempt to implement before the meeting on 2021/09/14 Tuesday. The changes are added via the feature/meeting-prep branch.

Must-haves

Make the current code work again. Refactor where necessary to incorporate changes from the required libraries, mostly cuphic and stucco.
Implement frontend routing. This is what will ensure smooth page navigation. Should probably use reitit.
Implement backend files endpoint. Serving the current batch of TEI and image files.
Create registry of available files. Solely for testing display and retrieval of TEI and facsimile.
Implement synchronized tei-facsimile view. Should be able to reuse most of the code from Stucco.

Nice-to-haves

Implement database. Use Asami. At first should just contain mock triples. The backend needs an endpoint that will construct an Asami query from a list of filters.
Implement backend entity endpoint. Transit-based, probably using transito. Support entity fetching. At first should just serve mock data.
Add a frontend page for searching/filtering. Uses the backend endpoint just created.
Add metadata to database from the partial dataset provided by Dorte. Requires writing a lot of Cuphic patterns or retrieval functions.
Display modes for search results. An alternative display using the timeline widget.
Language switching support. Basically, add Danish in the UI. Right now only English needs to be implemented.

Docker Swarm instead of Docker Compose?

Based on this reply I got on HN: https://news.ycombinator.com/item?id=27366319

It seems like using Docker Swarm has benefits even in a single node setup like this one and may provide some valuable metrics for observability.

Changing search params doesn't jump to top

e.g. clicking on a name in a search result initiates a new search, so the app should jump to the top of the page.

Index lists for search

Based on feedback from Dorte, there should be some index lists to initiate searches sorted by names, e.g. a list of senders, a list of receivers, and so on. Clicking a name should go directly to a search containing that parameter.

Expands on #34.

Search interface improvements

network request loading indicators for metadata and search result fetches
move sorting/date filter closer to results page
~~default to send date sort?~~
nicer-looking search results

Mobile/responsive design

Make the page function on a smartphone. Essentially requires fixing all of the many little layout issues that appear when the screen size is small.

Generate triples from TEI XML documents

Regardless of which database is eventually going to be used (see: #4) it is 99% likely that I will be using a triplestore of some kind. There is functionality available in Cuphic (see: kuhumcst/cuphic#1) to facilitate this, although it may have to be tweaked further.

I now have access to the university's "N drive" where it should be possible to find sample data. The task is now to recursively go through each document in a list of documents and return valid metadata triples. These triples should be derived from the actual metadata in the TEI header, from metadata in the contents, and from implied metadata that can be derived from the content, e.g. the presence of certain words or some other feature.
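To make the task concrete, here is a small Python sketch (an illustration only, not the Cuphic-based approach mentioned above). It walks a directory recursively, reads the title from the TEI header as explicit metadata, and derives one piece of implied metadata from the presence of a keyword in the text. The predicate names document/title and document/mentions, the KEYWORDS tuple, the function names, and the sample document are all hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
KEYWORDS = ("glossematics",)  # hypothetical words of interest


def triples_from_tei(path):
    """Return (subject, predicate, value) triples for one TEI file."""
    root = ET.parse(str(path)).getroot()
    subject = path.name
    triples = []
    # Explicit metadata from the TEI header.
    header = root.find("tei:teiHeader", TEI_NS)
    title = header.find(".//tei:title", TEI_NS) if header is not None else None
    if title is not None and title.text:
        triples.append((subject, "document/title", title.text.strip()))
    # Implied metadata derived from the text content.
    text = root.find("tei:text", TEI_NS)
    content = "".join(text.itertext()).lower() if text is not None else ""
    for word in KEYWORDS:
        if word in content:
            triples.append((subject, "document/mentions", word))
    return triples


def triples_from_dir(source_dir):
    """Recursively visit every .xml file below source_dir."""
    for path in sorted(Path(source_dir).rglob("*.xml")):
        yield from triples_from_tei(path)


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Letter to Uldall</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><p>On Glossematics and related matters.</p></div></body></text>
</TEI>"""

# Demo: write the sample into a nested folder and collect its triples.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "letters"
    nested.mkdir()
    (nested / "sample-tei.xml").write_text(SAMPLE, encoding="utf-8")
    result = list(triples_from_dir(tmp))
```

Using the file name as the subject is just a convenient placeholder; a real triplestore would probably want a stable entity identifier instead.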

API-level and frontend authz

The current authz functionality works on a route level, but this SPA, like most modern web apps, has a thick client which renders content locally and a server API which is accessed via a single endpoint. This necessitates use of authz in these domains too.

Supposing the API is a single endpoint, the endpoint itself may be either open to all or hidden behind an authentication restriction. Within the logic of the API, it should be possible (by passing along the relevant authz info) to have several (permit-if...) calls which use the call context, take some condition, and create branching. The branching might even be missing, in which case the code could simply throw an exception which would be caught by the route's interceptor chain.

Throwing an exception might be desirable if the frontend is also fully integrated with the permission system. A similar (maybe the same, e.g. CLJC) function could be used to create branching content based on permissions in the frontend code. This could and should probably be simplified by setting a dynamic variable containing the authz info which the frontend code can then refer to.
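The two styles of check discussed above (branching versus throwing) could look roughly like the following Python sketch. It is not the Pedestal SP API: the restriction forms ("all", "authenticated", or a dict of accepted attribute values), the shape of the assertions map, and the Python function names are assumptions made for illustration.

```python
class PermissionDenied(Exception):
    """Raised when the caller's assertions do not satisfy a restriction."""


def permitted(assertions, restriction):
    """Check a restriction against the authz info passed along with the call.

    Hypothetical restriction forms: "all", "authenticated", or a dict mapping
    attribute names to the set of accepted values.
    """
    if restriction == "all":
        return True
    if not assertions:
        return False  # no assertions means not logged in
    if restriction == "authenticated":
        return True
    attrs = assertions.get("attrs", {})
    return all(attrs.get(k) in accepted for k, accepted in restriction.items())


def if_permit(assertions, restriction, then, otherwise):
    """Branching variant: choose content based on permissions."""
    return then() if permitted(assertions, restriction) else otherwise()


def only_permit(assertions, restriction, then):
    """Non-branching variant: throw and let the surrounding handler catch it."""
    if not permitted(assertions, restriction):
        raise PermissionDenied(restriction)
    return then()


staff = {"attrs": {"role": "staff"}}
branch = if_permit(staff, {"role": {"staff"}},
                   lambda: "edit view", lambda: "read-only view")
anonymous = if_permit(None, "authenticated",
                      lambda: "edit view", lambda: "read-only view")
```

Because the same function only depends on the assertions map, it could in principle be shared between backend and frontend, which is the point made about CLJC above.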

No automated HTTPS certificate renewal

The SSL certificate from Let's Encrypt (for handling HTTPS on port 443) needs to be renewed often (every 3 months at most). The renewal process is currently manual and fairly tedious:

shut down the docker container
run certbot renew
incorporate the newly generated certificate into a keystore to be used by Pedestal (Jetty)

Let's Encrypt uses the certbot command line tool to handle renewals, but since we use a non-standard web server (Jetty) it is necessary to either reverse proxy it using e.g. nginx or set up code for renewing certificates and creating keystores within Jetty to set up automatic renewal.

Having to shut down the web service since renewal happens on port 80 is especially annoying, but that can probably be mitigated by simply only serving on port 443.

While the nginx solution sounds the easiest to deal with, I am unsure how it affects the whole SAML flow and the way I have set it up currently inside the Pedestal configuration. Perhaps it isn't a big deal, but some code will definitely have to be rewritten on the Clojure side in either case.

Delete documents where *-final.xml versions exist

It seems like Dorte's system is to label TEI files with -final whenever they have an actual transcription. This means lots of duplicates in the dataset that need to be removed. I think doing so programmatically during the bootstrap process is the best solution. It should simply be an Asami query that lists all of the entities ending in "-final.xml" and then the entities that need to be deleted can be derived from that.
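The derivation step can be sketched in a few lines of Python (an illustration, not the Asami query itself). It assumes that a file named X-final.xml supersedes a file named X.xml; that naming rule, the function name superseded, and the made-up file names are assumptions that would need to be checked against the actual dataset.

```python
FINAL_SUFFIX = "-final.xml"


def superseded(filenames):
    """Return the files that have a -final.xml counterpart and can be dropped."""
    names = set(filenames)
    finals = [n for n in names if n.endswith(FINAL_SUFFIX)]
    # Assumption: stripping "-final" from the name gives the duplicate's name.
    candidates = {n[: -len(FINAL_SUFFIX)] + ".xml" for n in finals}
    return sorted(candidates & names)


files = [
    "doc_0001-tei.xml",
    "doc_0001-tei-final.xml",
    "doc_0002-tei.xml",
    "doc_0003-tei-final.xml",
]
to_delete = superseded(files)
```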

Dynamic timeline HotZones

With the timeline experiment being largely successful and merged into the master branch, one issue remains:

how to make the timeline work for a variable amount of events with an unknown time scale? e.g. from search results.

Description of problem

Basically, the problem arises when a whole bunch of events all occur within minutes or seconds of each other, while many of the other events occur days or weeks before or after this concentration of events. How do you represent that on a timeline?

You can do it linearly on a scale using something like seconds as the main unit of time, but that will result in many events being separated by an incredible distance on the timeline since the time series spans weeks.

Another option is to use weeks as the unit of time, but that results in events getting piled in those hot spots where there is a new event every minute or second.

So the solution of the Simile Widgets library is to predefine hot spots and stretch the time band at those places. The issue is that the Simile Timeline is meant to be used for static, known data, so the zones are manually created and tailored to the data at hand. Manually created hotspot zones are not compatible with dynamically generated timelines.
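One conceivable way to derive the zones from the data instead of defining them by hand is a simple gap-based grouping, sketched here in Python. This is a heuristic of my own for illustration, not something the Simile Widgets library provides; the function name hot_zones, the max_gap and min_events thresholds, and the sample dates are all hypothetical.

```python
from datetime import datetime, timedelta


def hot_zones(timestamps, max_gap=timedelta(hours=1), min_events=3):
    """Return (start, end) pairs for runs of events that lie close together.

    Consecutive events belong to the same run while the gap between them is
    at most max_gap; a run becomes a hot zone if it has at least min_events.
    """
    zones = []
    run = []
    for ts in sorted(timestamps):
        if run and ts - run[-1] > max_gap:
            if len(run) >= min_events:
                zones.append((run[0], run[-1]))
            run = []
        run.append(ts)
    if len(run) >= min_events:
        zones.append((run[0], run[-1]))
    return zones


events = [
    datetime(1936, 3, 1, 9, 0),
    datetime(1936, 3, 20, 14, 0),
    datetime(1936, 3, 20, 14, 5),
    datetime(1936, 3, 20, 14, 20),
    datetime(1936, 4, 15, 10, 0),
]
zones = hot_zones(events)
```

The resulting intervals could then be fed to the timeline as the places where the band is stretched, although suitable thresholds would probably have to depend on the overall span of the search results.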

Language

Perhaps using tonsky's lib.

Privacy statement

This is needed to document e.g. legitimate interest with regard to the use of names in the search metadata and as an overall explanation of the processed data.

Also include in Pedestal SP consent window somehow...? Maybe just as a link to the privacy statement.

Reader crashes when changing document while logged out

Due to 403 redirect. Should probably reload the page on 403.

Add unpublished documents to bibliography

Certain key documents have never been published, yet are nearly fully formed. Henrik suggests that they are merged into the bibliography and placed after the other documents for each listed year, clearly marked.

Restriction override for Pedestal SP

During development it should be possible to disable or replace auth restrictions for all endpoints. Disabling auth restrictions entirely is very convenient when testing e.g. API requests.

Since all of the Pedestal SP configuration happens in the conf map this is also where this should be defined. I think a key called :restriction-override could be a way to accomplish this.

Now the tricky part is how to implement this in a way such that it works across the entire application for all backend endpoints as well as on the frontend. The only way I imagine this working is by adding a key to the assertions map (which is all you have to go on in the frontend or elsewhere using inline checks) also called :restriction-override which contains the replacement restriction (can just be :all).

Now this needs to work in a variety of different scenarios:

inline checks, i.e. if-permit and only-permit.
In the guard-ic
When checking routes in advance using permit-request?

By setting a key in the assertions map it is possible to reach all of these places, but each macro/function will need to contain logic to deal with it.

The place to insert the information is in a new interceptor that might be called override-ic and which should be inserted immediately after the session-ic (which adds the assertions to the request map). This should be included automatically in the auth-chain function based on the key in the conf map.
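The flow described above can be sketched as follows in Python (an illustration of the idea, not Pedestal interceptor code). A step built from the conf map copies the override into the assertions, and every check consults that key before applying its own restriction. The request and assertions shapes, the "user" key, and the function names are assumptions.

```python
def override_ic(conf):
    """Build a step that copies restriction-override from conf into the assertions."""
    def step(request):
        override = conf.get("restriction-override")
        if override is None:
            return request  # nothing configured, leave the request untouched
        assertions = dict(request.get("assertions") or {})
        assertions["restriction-override"] = override
        return {**request, "assertions": assertions}
    return step


def effective_restriction(assertions, restriction):
    """Every check (inline, guard, advance route check) would call this first."""
    return (assertions or {}).get("restriction-override", restriction)


def permitted(assertions, restriction):
    """Simplified check supporting only "all" and "authenticated"."""
    restriction = effective_restriction(assertions, restriction)
    if restriction == "all":
        return True
    # Hypothetical: a logged-in session carries a "user" entry.
    return restriction == "authenticated" and bool((assertions or {}).get("user"))


dev_step = override_ic({"restriction-override": "all"})
dev_request = dev_step({"assertions": None})
allowed_in_dev = permitted(dev_request["assertions"], "authenticated")

prod_step = override_ic({})
prod_request = prod_step({"assertions": None})
allowed_in_prod = permitted(prod_request["assertions"], "authenticated")
```

Keeping the lookup in one helper means the individual macros and functions only need a single extra call rather than their own override logic.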

Mark TEI documents where the facsimile is a multi-page PDF

e.g. the stuff in 8_Udgivelser.

Unsure if the documents themselves reveal this fact or if I need to place those files in a separate root directory.

Delegate GZipHandler and DoSFilter to nginx

The solution to #8 will most likely be to run nginx and certbot inside Docker containers alongside the web service as part of a docker-compose configuration. Part of the challenge here is to remove web server functionality from the Jetty configuration and put it inside nginx instead, specifically: GZipHandler and DoSFilter.

Full text search

Might not be feasible given how shitty the OCR quality is. A portion of the documents have been manually transcribed, but these already provide excellent metadata for search.

Candidate libs:

https://github.com/coderafting/memsearch
https://github.com/juji-io/symspell-clj (could also be used for misspelled names)

Dev time reactivity broken

I speculate that it has something to do with reitit, more specifically the fact that the page rendering function is kept as state, which means the old rendering function is kept around whenever changes are made. The fix should be rerunning the reitit navigation code as part of the shadow-cljs reload hook.

Session time extension

Currently, the default is to have a max session time of 8 hours. It would make sense to increase the default time to some higher value—e.g. 24 hours—and then keep extending it every time there is a request made from an authenticated account. In this way, accounts that do regular, daily work will never have to log in, except perhaps after the weekend.
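The sliding expiry could work roughly like this Python sketch. The session representation, the SESSION_LIFETIME constant, the touch function, and the example dates are hypothetical; the 24 hour value is just the example figure mentioned above.

```python
from datetime import datetime, timedelta

SESSION_LIFETIME = timedelta(hours=24)


def touch(session, now):
    """Return the session with a renewed expiry, or None if it has expired."""
    if session is None or now >= session["expires"]:
        return None
    return {**session, "expires": now + SESSION_LIFETIME}


# Log in Friday morning; each authenticated request pushes the expiry forward.
friday_9 = datetime(2021, 9, 10, 9, 0)
session = {"user": "someone", "expires": friday_9 + SESSION_LIFETIME}
session = touch(session, datetime(2021, 9, 10, 17, 0))   # still valid, extended
extended_until = session["expires"]
after_weekend = touch(session, datetime(2021, 9, 13, 9, 0))  # Monday morning
```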

Inline notes

Basically, just support replacing inlined note elements with a small clickable element that expands to the full note.

One potential issue is distinguishing between the inlined notes and the bottom notes div.

Interception after authentication

Used to "force" the user to agree/disagree to various things, such as

GDPR approval
Cookie lifetime

Database choice & API design

These two seem interrelated. The choice of database affects how well the API matches.

Some requirements for the database:

Should be able to filter TEI documents based on metadata.
Should also support some degree of freetext search of TEI documents.
Should link TEI documents with comments and support persistent adding of comments (so no in-memory db... for this part at least).
Should perhaps also link TEI documents to facsimiles (the images) although that could also be accomplished by simply putting the images in the public resources folder of the web server and fetching them on an ad hoc basis based on the content of the TEI XML files.
Should be reasonably performant.

Some additional considerations for the database:

K.I.S.S.
Should fit the data model (documents, comments)
Should preferably be able to adapt to changing requirements, e.g. changes to the TEI subset we support.
Is it better to represent documents as URIs to files on disk or put them fully in the database? I'm leaning towards the former, but maybe there's some advantage to storing XML within a database.

Some requirements for the API:

Should be quite minimal and relatively standards-compliant.
Preferably fits the data model of the database well.

Bibliography

Certain documents appear in a bibliography list. This list should appear in a similar way to the search indices, but the content is slightly different.

Fix page break detection

Certain documents have buggy page segmentation with pb elements that aren't picked up correctly by the current pattern, e.g. acc-1992_0005_032_Uldall_1000-final.xml.

Frontpage

Should contain

A project description
A login button (scrap the login page entirely)?
Photo collage of relevant people harvested from the AAU website
A list of important search entry points, e.g. important topics and correspondences

Security checklist

e.g. https://observatory.mozilla.org/analyze/glossematics.org

Add citation row to metadata table

Ask Lorenzo for more.

Click search criteria to change field

Clicking on a search criterion should trigger a dropdown allowing a field change.

This will make it easy to start a search via clicking on a ref or via the index lists, while also allowing it to be completely fine-tuned afterwards, e.g. maybe the field should actually be author as opposed to anything.

Considerations:

Selectable fields should match the ones allowed for the given entity-type.
Field+Entity combinations that are already present as a criterion should not be possible to make, i.e. these are also disabled in the dropdown options.
- Furthermore, the above UX feature should also be used in the primary field select widget next to the text input field.
This probably requires putting a transparent select widget on top of the search criteria -OR- replacing either of the text parts of the criteria with a select widget.
- One complication is the fact that we currently do not display any field identifier if the chosen field is anything and the most obvious place to click is the field identifier.

Mark TEI documents that have/don't have a text body in the database

Most of the TEI documents do not have any text body. Dorte has chosen to represent this as

<text>
    <body>
        <div type="document">
            <p xml:id="p1"><!-- tekst -->                </p>
        </div>
    </body>
</text>

Basically, if the content of the text body is a div with an empty p tag, then we can assume that it is metadata only!
This should be handled as a part of the scraping process when bootstrapping the database.
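A check along these lines could be as small as the following Python sketch (an illustration, not the project's bootstrap code). It treats a document as metadata-only when the body contains no text besides whitespace; comments such as the tekst placeholder are dropped by the default parser and so do not count. The function name is_metadata_only and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def is_metadata_only(tei_xml):
    """True if the TEI body is missing or contains nothing but whitespace."""
    root = ET.fromstring(tei_xml)
    body = root.find("tei:text/tei:body", TEI_NS)
    if body is None:
        return True
    return "".join(body.itertext()).strip() == ""


EMPTY = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="document">
        <p xml:id="p1"><!-- tekst -->                </p>
      </div>
    </body>
  </text>
</TEI>"""

TRANSCRIBED = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="document"><p xml:id="p1">Dear Uldall</p></div></body></text>
</TEI>"""

empty_flag = is_metadata_only(EMPTY)
transcribed_flag = is_metadata_only(TRANSCRIBED)
```

Checking for any text at all is slightly broader than matching the exact div/p shape, which may or may not be desirable if other empty structures exist in the dataset.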

Metadata extraction from TEI documents

This is contingent on ongoing database architecture decisions TBD in #4, i.e. the amount of metadata extraction that is actually required depends on the amount that can be automatically extracted through the way the XML itself is represented in the database of choice.

Anyway, supposing there is no way to query all of the metadata of the TEI XML through its database representation, it becomes necessary to batch process the TEI documents beforehand and extract the metadata - and probably the text content too - in order to represent this inside the database.

The program itself should probably just be a script that loads TEI files recursively from a source folder and writes out maps of metadata into a single file (newline-delimited EDN? https://github.com/lambdaisland/edn-lines). This file can then be imported into whatever database solution is in use.

Limit page dots in carousel + allow goto page input

e.g. currently quite broken in http://localhost:8080/app/reader/acc-2013_0058_008_HWF_0040-tei.xml

Will need to be fixed

Determine SAML attributes needed, institutions allowed, and domain ownership

For attributes, see: https://wayf.dk/da/attributter

Bulk image extraction/conversion

Extract images from PDFs
~~Convert TIFFs to a better format~~
- Probably imagemagick

User menu

User assertions info, customisation, log out action, and other things.

Timeline design

Currently going through different options for creating a timeline.

Sofie's timeline (VEGA-based)

https://github.com/kuhumcst/sofie-vega-timeline

She got pretty far, but the design is not favoured by Viggo (and probably Frans too). It also has a few remaining layout issues that I am unlikely to solve (e.g. what to do about clipped or overlaid text). On the other hand, it is easy to embed inside reagent.

Simile widgets timeline

https://github.com/simile-widgets/timeline (or maybe https://github.com/simile-widgets/ancient-simile-widgets)

This is the timeline widget used by the Georg Brandes project. The design is exactly what Frans and Viggo want but, aside from certain issues I have with the design itself, it has a few problems:

It is unmaintained and people are having problems actually building it from source.
The official documentation is gone.
It definitely isn't accessibility-compliant.
It is based on static compilation using a Java program into an HTML file. This is fine for a single static page (i.e. Viggo's data), but it won't work for a dynamic view of search results.

Asami not properly bootstrapped in prod

I guess I will need to add logging in various places to see what's happening.

Maybe also a good idea to clean up the overall bootstrap process (not just db) and update the README.md.

Collection is not an integer

Currently, the collection values in the TEI documents are resolved as integers, since they often do appear as integer strings. However, in some cases they are decidedly not, so

they shouldn't be converted to integers
the values shouldn't be sortable

Dev/production environment config split

Presumably through something like https://github.com/juxt/aero.

No need for Component/Integrant/Mount/other at the moment.

Build CLJS release in Dockerfile

I should include shadow-cljs and compile a release as part of the Dockerfile.

Search frontend

Should comprise an expandable area at the top of "badges" (registered entities) as well as a datalist text input below + select widget for specific relations to match.
- ... but a lot prettier than now
The text input should validate input such that only correct input can be submitted.
The datalist content should be sourced via a network call on page load.
- Both full name and "surname, first name" versions should be listed.
Should initialize ordered-set from entity in query-params for the first page load.
Dates should be automatically detected and context-dependent options must appear when a date is being entered, i.e. select on date, after date, before date.
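The date detection in the last point might be sketched like this in Python (the real frontend is ClojureScript, so this only illustrates the logic). It assumes dates are typed in ISO format (YYYY-MM-DD); that format, the function name date_options, and the option labels are assumptions.

```python
import re
from datetime import date

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
RELATIONS = ("on date", "after date", "before date")


def date_options(text):
    """Return the date-specific options if the input is a valid date, else []."""
    match = DATE_RE.fullmatch(text.strip())
    if not match:
        return []
    try:
        parsed = date(*map(int, match.groups()))
    except ValueError:
        return []  # looks like a date but is not a real one, e.g. February 30
    return [(relation, parsed.isoformat()) for relation in RELATIONS]


options = date_options("1936-03-20")
```

Any other input returns an empty list, in which case the widget would fall back to the regular entity fields.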

Styling that is closer to the overall look of the documents

Now that there is a larger selection of facsimiles, the styling of the TEI rendering can be appropriately adjusted, e.g. paragraphs can use an indented style rather than the traditional HTML separated-by-newline style. Another option is changing the default font to a typewriter-esque one.

Rate limiting

Several options exist:

Use some proxy like nginx which can be configured to rate limit
Use ring middleware for it: https://github.com/myfreeweb/ring-ratelimit
Use Jetty's filters.

Currently I favour using Jetty filters. They are probably the most efficient solution and configuration seems quite basic too:

Relevant filters:
- https://www.eclipse.org/jetty/documentation/current/dos-filter.html
- https://www.eclipse.org/jetty/documentation/current/qos-filter.html
How to configure programmatically (in Java):
- https://www.programcreek.com/java-api-examples/?class=org.eclipse.jetty.servlet.ServletContextHandler&method=addFilter
How to access ServletContextHandler from Pedestal according to the documentation:
- Define the value of the :context-configurator key to be a function taking one arg (= the current org.eclipse.jetty.servlet.ServletContextHandler). Methods can then be called on that object from within the function.

Set up caching

https://www.nginx.com/blog/nginx-caching-guide/

The root, .html, images, CSS, and other "static" content can be reliably cached.

Comment functionality

It must be possible to attach comments to documents, preferably linked to specific paragraphs.

how should it work?
how should it look?
what tech to use, Asami or sqlite?