icatproject / icat.lucene

Provides access to a lucene index

License: Other

Java 95.60% Python 1.91% CSS 0.34% HTML 2.15%

icat.lucene's People

Forkers

vktb

icat.lucene's Issues

DatafileParameters not indexed

The ISIS facility has extracted metadata from many of the files it has logged in ICAT. It would be useful if this metadata were searchable. It will normally include the file run number, run title, start date, end date and any other details stored in the Nexus or RAW file. These are all stored in DatafileParameters.

2 billion document limit

The number of documents in a single index is limited to the maximum positive integer, (2^31)-1 = 2,147,483,647, which is exceeded by the number of DLS Datafiles. To circumvent this, allow multiple "sub-indices" to be aliased to a single entity, and route incoming documents between them. At search time, combine results using a MultiReader.
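The routing step could be sketched as follows, in plain Java with no Lucene dependency. The class name SubIndexRouter, the "alias_N" naming scheme and the choice to route by ICAT id range are all assumptions made for illustration; the issue does not specify a routing rule. Range routing has the advantage that an update or delete can find the right sub-index from the id alone, but it only balances well if ids are reasonably dense.

```java
/** Hypothetical router: picks a sub-index for a document from its ICAT id. */
final class SubIndexRouter {
    // Assumed capacity per sub-index; must stay below the per-index document limit.
    private final long docsPerShard;

    SubIndexRouter(long docsPerShard) {
        if (docsPerShard <= 0) {
            throw new IllegalArgumentException("docsPerShard must be positive");
        }
        this.docsPerShard = docsPerShard;
    }

    /** Sub-index number for an entity id; ids are assumed to be non-negative. */
    int shardFor(long icatId) {
        if (icatId < 0) {
            throw new IllegalArgumentException("icatId must be non-negative");
        }
        return Math.toIntExact(icatId / docsPerShard);
    }

    /** Name of the sub-index behind an alias, for example "datafile_2". */
    String shardName(String alias, long icatId) {
        return alias + "_" + shardFor(icatId);
    }
}
```

At search time, one reader per sub-index name would then be combined behind the single alias.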

Enable field sorting

Currently we do not allow for the Lucene search results to be sorted explicitly, and only rely on the default sorting by score. In principle, we could sort on any indexed field, but dates are likely the most useful case here.

The Lucene component should be capable of accepting a specified field to sort on as part of a search request. Being able to sort by fields is relevant to ral-facilities/datagateway#1152
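As a rough illustration of the requested behaviour, the sketch below sorts a list of hits by a date field when one is requested and falls back to score otherwise. It is plain Java rather than Lucene code; the Hit record, the "startDate" field name and the tie-break on score are assumptions, not the component's actual request format.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortedSearch {
    /** Hypothetical search hit: relevance score plus one sortable date (epoch millis). */
    record Hit(long id, float score, long startDate) {}

    /** Returns a sorted copy; a null or unknown sort field falls back to score, best first. */
    static List<Hit> sort(List<Hit> hits, String sortField, boolean descending) {
        Comparator<Hit> byScore = Comparator.<Hit>comparingDouble(Hit::score).reversed();
        Comparator<Hit> order;
        if ("startDate".equals(sortField)) {
            Comparator<Hit> byDate = Comparator.comparingLong(Hit::startDate);
            // Ties on the date are broken by score so the order stays deterministic.
            order = (descending ? byDate.reversed() : byDate).thenComparing(byScore);
        } else {
            order = byScore;
        }
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(order);
        return sorted;
    }
}
```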

InvestigationUsers not indexed

The search component should be able to find all the investigations a user is involved in using just their name. To do this it will have to start indexing InvestigationUsers.

Open data permissions special case

Taken from notes of ICAT F2F, in response to a question asked by @RKrahl about how "MyData" search adds query criteria about the user to speed up authorisation of results:

Can we add an equivalent "Open" search option that adds a criterion like the "MyData" search? With the criterion being configurable?

In principle yes, however this requires anything the "Open" query relies on being indexed - e.g. releaseDate, doi...

Don't know if this is currently the case - it would be best to index everything that would be useful so we don't have to re-index everything later when we realise we don't have everything we need

Config location? Could add it in DG but might be better for DG to just pass a boolean and icat.server to handle the specifics

To implement this, changes would also be needed in icat.server and DataGateway

Unable to match similar strings

All files that are produced by the ISIS facility take the format 'WISH1234.nxs' where 'WISH' is the instrument and '1234' is the run number. The run number can be prefixed with up to 3 zeros depending on the instrument. Many users and instrument scientists will know the run numbers associated with their experiments and may wish to find them in ICAT.

The Lucene search component can't locate the run number as a substring, instead requiring the user to enter the full file name. It appears that Lucene is limited to finding exact matches, other than when a separating character happens to fall in the right place. It would be good to remove this limitation to allow users to search by run number.
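One possible way to make run numbers searchable is to emit extra tokens at letter/digit boundaries and to index a copy of each number without its leading zeros. The sketch below shows that idea in plain Java; it is not the existing IcatAnalyzer, and the class name and the choice to keep both the padded and the stripped number are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RunNumberTokens {
    // A token is either a run of ASCII letters or a run of ASCII digits.
    private static final Pattern PART = Pattern.compile("[A-Za-z]+|[0-9]+");

    /** Splits a file name into lower-cased letter and digit runs, adding zero-stripped numbers. */
    static List<String> tokenize(String fileName) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = PART.matcher(fileName);
        while (matcher.find()) {
            String part = matcher.group().toLowerCase(Locale.ROOT);
            tokens.add(part);
            if (Character.isDigit(part.charAt(0))) {
                // Remove leading zeros but always keep at least one digit.
                String stripped = part.replaceFirst("^0+(?=\\d)", "");
                if (!stripped.equals(part)) {
                    tokens.add(stripped);
                }
            }
        }
        return tokens;
    }
}
```

With these tokens, a query for "1234" would match both 'WISH1234.nxs' and a zero-padded name such as 'WISH00001234.nxs'.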

Add code coverage to the repo

Improve Lucene Search Functionality

There are a number of changes that could be made to the free text search to enable better functionality. These should be made to one branch rather than directly to master/main, as they are experimental and may need further discussion before being released.

New features:

#16
#11
#19

Core functionality:

Lucene doesn't treat underscores as separators

Diamond wants users to be able to find files via run number. However if the file path contained the following fragment:

'54192_liver_area3'

'54192' would not be searchable for this file
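A likely cause, assuming the analyzer relies on a tokenizer that follows Unicode word-break rules, is that the underscore is classed as a word character and so joins its neighbours instead of separating them. Java's regex word class behaves the same way, which makes it a convenient stand-in for demonstrating the effect. The class and method names below are illustrative only.

```java
import java.util.Arrays;
import java.util.List;

final class UnderscoreSplit {
    /** Splits on non-word characters; '_' counts as a word character, so it never separates. */
    static List<String> splitOnNonWord(String text) {
        return Arrays.asList(text.split("\\W+"));
    }

    /** Splits on anything that is not an ASCII letter or digit, so '_' does separate. */
    static List<String> splitOnNonAlphanumeric(String text) {
        return Arrays.asList(text.split("[^A-Za-z0-9]+"));
    }
}
```

For the fragment above, the first method returns the whole string as a single token, while the second yields '54192', 'liver' and 'area3' as separate tokens. Input that starts with a separator would produce an empty first token, which a real tokenizer would need to drop.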

Enable Facets

Lucene offers three types of facets:

Taxonomy based facets (defined, hierarchical categories)
Ranges for numeric values
Sorted doc values

Of these, the third seems the simplest to implement first (for text fields), as we do not have a good understanding of the taxonomy in use and numerics will require unit conversion which is non-trivial.

Relevant documentation
Examples

Fields to facet

This may be something to discuss further with users/facilities, as it would depend where the important information is stored in the schema. Especially with regards to implementing taxonomy, which may be more relevant for ICAT 5 with the introduction of "technique".
For now, Sample and Parameter can be indexed with (relatively) short text/string fields, so sparse faceting here may be the easiest place to start.
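Conceptually, faceting on a short string field amounts to counting how many matching documents carry each distinct value. The sketch below shows that counting step in plain Java over documents represented as maps; it does not use Lucene's facet module, and the field name "type.name" in the usage note is only an example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetCounts {
    /** Counts documents per value of one string field; documents lacking the field are skipped. */
    static Map<String, Long> count(List<Map<String, String>> docs, String field) {
        Map<String, Long> counts = new TreeMap<>(); // sorted by value for stable output
        for (Map<String, String> doc : docs) {
            String value = doc.get(field);
            if (value != null) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

Calling count(docs, "type.name") on three Parameter documents, two named "temperature" and one named "pressure", would give pressure=1 and temperature=2.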

Tests do not run content of Lucene.java

Description:
The current unit testing does not import anything from Lucene.java when running tests. Instead, the logic performed there is mocked (adding/searching documents), and tests run on this. IcatAnalyzer is tested.

Acceptance criteria:

Change tests to actually test the code in Lucene.java
Use a code coverage tool to ensure these tests are sufficient?

Encode related documents' ICAT ids

Currently, we do not encode the ICAT ids of "related" documents, such as Parameters or Samples. This makes updating or removing these from the index impossible.

Additionally, we don't do this for the relations of those relations. For example, when we index an InvestigationUser we make note of the Investigation id (which the InvestigationUser has access to), but not the InvestigationUser id or the User id. Most of the actual content here lives on the User, so in principle we need to update the Lucene index when either the InvestigationUser or the User changes by using their respective ids.

While addressing this, it will be an opportunity to move further Lucene logic out of icat.server (e.g. the method of specifying Lucene class types in the JSON built in LuceneApi) and move towards a (hopefully) more intuitive method of encoding the Documents.
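To make the requirement concrete, the sketch below models an indexed InvestigationUser that carries its own id together with the ids of both related entities, and shows how a change to a User could then be mapped to the documents needing an update. The record and its field names are hypothetical and are not the encoding used by icat.server or icat.lucene.

```java
import java.util.List;
import java.util.stream.Collectors;

final class RelatedIds {
    /** Hypothetical indexed form of an InvestigationUser: its own id plus both related ids. */
    record InvestigationUserDoc(long id, long investigationId, long userId, String userName) {}

    /** Ids of the indexed documents that must be rewritten when the given User changes. */
    static List<Long> docsToUpdateForUser(List<InvestigationUserDoc> index, long userId) {
        return index.stream()
                .filter(doc -> doc.userId() == userId)
                .map(InvestigationUserDoc::id)
                .collect(Collectors.toList());
    }
}
```

Without the stored userId there is nothing to filter on, which is the situation the issue describes.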

Datafiles are not searchable by date

It looks like the code to query Lucene for Datafiles by date was copy-pasted from the dataset or investigation endpoints.

The code for Datafiles, Datasets and Investigations all search on startDate and endDate. This is correct for Datasets and Investigations which both have these parameters. However, Datafiles have a single date parameter.

I propose that we keep the API endpoint the same but change the query to add a TermRangeQuery on the date property between the bounds. We will need to be careful to match the behaviour of the Dataset and Investigation queries regarding whether the lower and upper bounds are inclusive or not, e.g.

lower <= date < upper vs lower <= date <= upper.
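A term range compares terms in lexicographic order, so this approach only works if dates are indexed as fixed-width, sortable strings. The sketch below mimics the inclusive/exclusive flags with plain string comparison; the yyyyMMddHHmm format used in the usage note is an assumption about how dates are encoded, and the class is illustrative rather than a wrapper around TermRangeQuery.

```java
final class DateRange {
    /**
     * True if date lies between the bounds under lexicographic comparison.
     * All three strings must share one fixed-width, sortable format.
     */
    static boolean inRange(String date, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int vsLower = date.compareTo(lower);
        int vsUpper = date.compareTo(upper);
        boolean aboveLower = includeLower ? vsLower >= 0 : vsLower > 0;
        boolean belowUpper = includeUpper ? vsUpper <= 0 : vsUpper < 0;
        return aboveLower && belowUpper;
    }
}
```

A Datafile dated exactly at the upper bound, say "202012312359", is matched with includeUpper set and missed without it, so the flags need to agree with whatever the Dataset and Investigation queries already do.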

Flattening high value parameters onto parent entity

Taken from notes of ICAT F2F, in response to a question asked by @antolinos:

Can the entities indexed be controlled? If only interested in Datasets, and specific DatasetParameters (~6 valuable ones, the rest are not interesting)?

entitiesToIndex is a config option in ICAT server. Only these will be indexed by Lucene/the search engine backend. This config option will still be present in the "new" version of free text search.

Currently, all Parameters are stored in their own index (one for Investigation, Dataset, Datafile and Sample). When searching/faceting, under the hood we "join" the main entity index to the Parameter index.

Joining has a negative performance impact, but is the only way to retain nested lists of objects (i.e. the only way to keep the type.name, type.units associated with the same numericValue).

In your use case, where there are certain valuable Parameters, it would be better to (as you have already done) "flatten" these parameters into fields on the Dataset document, as you do not need to worry about needing to be able to update these Parameters or an explosion in the number of Parameter fields.

This is not currently possible in either icat.lucene or the OS/ES backend support. In principle, however, it should be possible by writing additional logic (and would be more performant), provided you don't mind the following drawbacks:

Parameters need to be reduced to key:value pairs, so units would need to be embedded into either the key or the value, rangeTop/rangeBottom would need to be mapped to a single value etc.

You would not be able to easily modify the "ParameterType" information - e.g. to change the ParameterType.name would mean changing the mapping of the entire index, or adding an alias for the field which would need very specific logic compared to the rest of the functionality

To implement this, changes would also be needed in icat.server and DataGateway
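The flattening described above could look roughly like the following sketch, which reduces each Parameter to a single key:value pair. The "name_units" key format and the use of the range midpoint are assumptions chosen for illustration; other mappings would be equally valid, and none of this exists in icat.lucene today.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParameterFlattener {
    /** Hypothetical numeric parameter; rangeBottom and rangeTop are NaN when absent. */
    record Parameter(String name, String units, double numericValue,
                     double rangeBottom, double rangeTop) {}

    /** Flattens parameters into fields for the parent document, one field per parameter. */
    static Map<String, Double> flatten(List<Parameter> parameters) {
        Map<String, Double> fields = new LinkedHashMap<>();
        for (Parameter p : parameters) {
            boolean hasRange = !Double.isNaN(p.rangeBottom()) && !Double.isNaN(p.rangeTop());
            // Assumption: a range collapses to its midpoint, otherwise numericValue is used.
            double value = hasRange ? (p.rangeBottom() + p.rangeTop()) / 2 : p.numericValue();
            // Units are embedded in the key because the flat field has nowhere else to keep them.
            fields.put(p.name() + "_" + p.units(), value);
        }
        return fields;
    }
}
```

The second drawback is visible here: renaming a ParameterType changes the key, so every document holding the old field name would need to be rewritten or aliased.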

Migrate CI from Travis to GitHub Actions

Description:
The CI needs to be migrated from Travis to GitHub Actions.

Acceptance criteria:

CI runs

Enable synonyms for scientific terminology

Lucene supports synonym files, which can be used for both alternate spellings (e.g. "ionisation" vs "ionization") and scientific terms which are equivalent (such as chemical symbols to element names).

This can be applied on indexing, on search, or both. It seems that injecting synonyms at search time is the expected approach (https://lucene.apache.org/core/8_5_1/core/org/apache/lucene/analysis/package-summary.html?is-external=true).

The Solr format for synonym files is also supported by Elasticsearch, so should be compatible if we move to that.

The other thing to consider is where in the current analyzer we put the synonym injection. To recognise chemical symbols, keeping upper case letters might be useful? However this might be problematic for going the other way round, e.g. hydrogen, Hydrogen and HYDROGEN are all equally valid.
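A minimal sketch of search-time expansion, assuming terms are lower-cased before the synonym lookup, is given below. It is plain Java rather than a Lucene filter, and the class name and synonym map are illustrative. It also shows the trade-off raised above: lower-casing makes hydrogen, Hydrogen and HYDROGEN equivalent, but it would also make a chemical symbol such as "He" indistinguishable from the word "he".

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class SynonymExpander {
    // Keys are expected to be lower-case already.
    private final Map<String, List<String>> synonyms;

    SynonymExpander(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    /** Returns the lower-cased query term followed by any synonyms registered for it. */
    Set<String> expand(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        Set<String> terms = new LinkedHashSet<>();
        terms.add(key);
        terms.addAll(synonyms.getOrDefault(key, List.of()));
        return terms;
    }
}
```

Each returned term would then be searched as an alternative, leaving the index itself untouched.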
