Comments (13)
I assume this is the analyzer you are using. The backoffice uses a culture-invariant analyzer - see https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/DependencyInjection/ConfigureIndexOptions.cs#L36
The external one uses a standard analyzer.
Under the hood, the CultureInvariantWhitespaceAnalyzer is this: https://github.com/Shazwazza/Examine/blob/release/3.0/src/Examine.Lucene/Analyzers/CultureInvariantWhitespaceAnalyzer.cs
which is a whitespace analyzer + LowerCaseFilter + ASCIIFoldingFilter (the last one folds non-ASCII characters to their plain ASCII equivalents).
You could try this for the external index: https://github.com/Shazwazza/Examine/blob/release/3.0/src/Examine.Lucene/Analyzers/CultureInvariantStandardAnalyzer.cs
which is the same as above, but with standard analyzer instead of whitespace.
from examine.
That's for InternalIndex in backoffice global search (or when searching in InternalIndex from the Examine dashboard, I guess).
ExternalIndex is using the standard analyzer:
https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/DependencyInjection/ConfigureIndexOptions.cs#L41
I tried setting Analyzer in configuration, but it didn't seem to make a difference - and I think it would only be necessary if I wanted to change the default in Umbraco :)
I will check with CultureInvariantStandardAnalyzer.
Actually it was InternalIndex I was searching in for this specific task, as it included searching unpublished nodes and category nodes in the backoffice.
That search doesn't return results from the Examine dashboard in InternalIndex.
I'm pretty sure this is because of the analyzer. You can test by searching with the ASCII-folded characters instead (for example, ae in place of æ).
The backoffice uses the culture-invariant analyzer to try to provide a reasonable all-rounder experience for anyone working in the back office. If you have a very specific language structure across your entire site and all of your editors work in the same language, then you can change the default analyzer to Standard, or whatever suits your team.
Yeah, I tried this:
    private const LuceneVersion _luceneVersion = LuceneVersion.LUCENE_48;

    case Umbraco.Cms.Core.Constants.UmbracoIndexes.InternalIndexName:
        // Default analyzer, swapped out for the standard analyzer below
        // options.Analyzer = new CultureInvariantWhitespaceAnalyzer();
        options.Analyzer = new StandardAnalyzer(_luceneVersion);
        break;
but it didn't seem to return results for terms with Danish characters.
I found something like this if we want to customize/extend a specific analyzer. Not sure if it has been documented.
https://stackoverflow.com/a/14811453
Will investigate further :)
That link just shows what we already have for the CultureInvariantStandardAnalyzer https://github.com/Shazwazza/Examine/blob/release/3.0/src/Examine.Lucene/Analyzers/CultureInvariantStandardAnalyzer.cs
Yes :)
Actually I have this instead:
    case Umbraco.Cms.Core.Constants.UmbracoIndexes.InternalIndexName:
        options.Analyzer = new StandardAnalyzer(LuceneInfo.CurrentVersion);
        break;
I would have assumed the search returned the same results as searching ExternalIndex:
https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/DependencyInjection/ConfigureIndexOptions.cs#L41
but I recall that NativeQuery() sometimes returns different results than e.g. the default search using StandardAnalyzer on ExternalIndex.
https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/BackOfficeExamineSearcher.cs#L152
I tried replacing the analyzer with CultureInvariantStandardAnalyzer instead, but it also seems to return zero results for the terms bæredygtig or bæredygtighed:
    case Umbraco.Cms.Core.Constants.UmbracoIndexes.InternalIndexName:
        options.Analyzer = new CultureInvariantStandardAnalyzer();
        break;
I couldn't make it work by replacing the analyzer, so for now I have this workaround instead, which replaces the Danish letters æ, ø and å before passing the term to the query:
    if (!string.IsNullOrEmpty(term))
    {
        // Map each Danish letter to an ASCII equivalent
        var replacement = new Dictionary<string, string>
        {
            { "æ", "ae" },
            { "ø", "o" },
            { "å", "a" }
        };
        // Lowercase first so that Æ, Ø and Å are covered as well
        term = term.ToLowerInvariant().ReplaceMany(replacement);
    }
Then it finds results like bæredygtighed, grøn and affaldshåndtering.
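For anyone who wants to try the same idea outside Umbraco, here is a minimal, dependency-free sketch of the replacement step. It assumes only the three mappings from the workaround above; DanishTermFolder and Fold are hypothetical names (not part of Examine or Umbraco), and the plain loop stands in for Umbraco's ReplaceMany extension.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: not part of Examine or Umbraco.
public static class DanishTermFolder
{
    // Same three mappings as the workaround; keyed by char for a single pass.
    private static readonly Dictionary<char, string> Replacements =
        new Dictionary<char, string>
        {
            ['æ'] = "ae",
            ['ø'] = "o",
            ['å'] = "a",
        };

    // Lowercases the term and replaces æ, ø and å with ASCII equivalents.
    // Null or empty input is returned unchanged.
    public static string Fold(string term)
    {
        if (string.IsNullOrEmpty(term))
        {
            return term;
        }

        var folded = new StringBuilder(term.Length + 4);
        foreach (var c in term.ToLowerInvariant())
        {
            if (Replacements.TryGetValue(c, out var ascii))
            {
                folded.Append(ascii);
            }
            else
            {
                folded.Append(c);
            }
        }

        return folded.ToString();
    }
}
```

With this sketch, DanishTermFolder.Fold("Grøn") gives gron, so the folded term can be passed to the query in place of the original. It assumes precomposed characters; input that uses combining marks would need normalizing first.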
I noticed there's a ScandinavianFoldingFilter and a ScandinavianNormalizationFilter.
The difference is:
ScandinavianFoldingFilter: blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
ScandinavianNormalizationFilter: blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj
I wonder if it makes sense to be able to use a different filter than ASCIIFoldingFilter in CultureInvariantStandardAnalyzer?
I tried making a copy of that and used ScandinavianFoldingFilter instead, but the raw query includes the Danish letter, and although I can see the DefaultAnalyzer change based on configuration, it didn't seem to have any effect on the results except when I use the replacements in #383 (comment).
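A copy of the analyzer with the folding filter swapped might look roughly like the sketch below. It assumes Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package; ScandinavianFoldingStandardAnalyzer is a hypothetical name, and the filter chain is a simplified guess, not a copy of Examine's CultureInvariantStandardAnalyzer.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Hypothetical analyzer: not part of Examine or Umbraco.
public sealed class ScandinavianFoldingStandardAnalyzer : Analyzer
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Split the text into word tokens.
        var tokenizer = new StandardTokenizer(MatchVersion, reader);

        // Lowercase first, then fold Scandinavian letters
        // (used here in place of ASCIIFoldingFilter).
        TokenStream stream = new LowerCaseFilter(MatchVersion, tokenizer);
        stream = new ScandinavianFoldingFilter(stream);

        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

It would be assigned the same way as in the earlier snippets, i.e. options.Analyzer = new ScandinavianFoldingStandardAnalyzer(); for the InternalIndexName case. As the comment above reports, swapping the analyzer alone did not change the results. Documents already in the index also keep the tokens produced by whichever analyzer was active when they were indexed, so a rebuild would be needed before comparing.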
It looks like a similar issue to the one @jemayn had in #263 (comment), replacing æøå:
https://github.com/skybrud/Skybrud.Umbraco.Search/blob/v3/main/src/Skybrud.Umbraco.Search/SearchHelper.cs#L43-L56
@Shazwazza btw, with the current logic, without any configuration of InternalIndex or a custom analyzer set, this finds results when searching for the exact word bæredygtighed:
    if (string.IsNullOrEmpty(searchTerm))
    {
        return filter;
    }
    searchTerm = searchTerm.Replace("-", string.Empty);
    var words = Tokenize(QueryParserBase.Escape(searchTerm)).ToArray();
    filter.And().GroupedOr(searchFields, words?.ToArray());
but it seems to be related to wildcard search, as you mentioned here:
umbraco/Umbraco-CMS#11176 (comment)
@bjarnef Appreciate all the feedback and research here but ultimately this comes down to how analyzers are configured for the various indexes in Umbraco.
My advice to get to the bottom of this is to run simple tests, i.e. clone the Examine repo and create a test case using the FluentApiTests. This is quite easy and will let you iterate more quickly when validating results and expectations. As I don't see this being an Examine bug, I will close this issue, but feel free to comment on it. I'm more than happy to make tweaks to Examine where it makes sense, but in this case I don't think this is Examine-specific; it is mostly based on how Umbraco configures the indexes.