Comments (5)
So it's a bit more complex, as we have some similar use cases that cannot be worked around with pseudo-random results. For example "Saint-Pierre", where we need the best results (i.e. the cities named "Saint-Pierre"). So we can't just do the manual scan as tried in d2fc0e6.
from addok.
This commit slightly improves the "saint pierre" case.
But it cannot actually be the solution.
Let's take the example of searching "Champagne-sur-Seine".
> FREQUENCY champagne
52955
--------------------------------------------------------------------------------
> FREQUENCY sur
321112
--------------------------------------------------------------------------------
> FREQUENCY seine
93802
So we'll take "champagne" (the least frequent token) and loop over it. The problem here is that "Champagne" is only a third of the string "Champagne-sur-Seine", so that document scores worse than all the "Champagne" hamlets, for example. Even if we take the first 500 ids of the "champagne" set, at some point that will not be enough to reach the "Champagne with a long name" city we are actually searching for.
As an illustration, this command shows the documents with the best score for the token "champagne"; unsurprisingly, Champagne-sur-Seine is not among them:
> BESTSCORE champagne
Champagné 4.00878 72054
Champagne 4.0075 28069
Champagne 4.0075 17083
Champagne 4.0075 07051
Champagne 95660 Champagne-sur-Oise 4.005 95134B017H
Champagne 82120 Lavit 4.005 82097B040L
Champagne 79000 Niort 4.005 79191B061K
Champagne 74150 Rumilly 4.005 74225B015D
Champagne 74270 Frangy 4.005 74131B028B
Champagne 73290 La Motte-Servolex 4.005 73179B031Z
Champagne 69830 Saint-Georges-de-Reneins 4.005 69206B020H
Champagne 69110 Sainte-Foy-lès-Lyon 4.005 69202B003G
Champagne 69250 Neuville-sur-Saône 4.005 69143B009V
Champagne 69130 Écully 4.005 69081B006L
Champagne 53400 Craon 4.005 53084B069U
Champagne 50490 Saint-Sauveur-Lendelin 4.005 50550B011P
Champagne 49125 Tiercé 4.005 49347B091M
Champagne 49260 Montreuil-Bellay 4.005 49215B054W
Champagne 49150 Baugé-en-Anjou 4.005 49018C753Z
Champagne 44310 Saint-Colomban 4.005 44155B188U
Champagne 43350 Saint-Paulien 4.005 43216B036S
And here we can see that Champagne-sur-Seine is ranked 965th for the token "champagne":
> INDEX 77079
chanpagn 1.3370166666666665 965
sur 1.3370166666666665 4433
sein 1.3370166666666665 92
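To make the numbers above concrete, here is a minimal sketch of the problem. It is not addok's actual code: `rarest_token` and `found_in_truncated_scan` are hypothetical helpers, and the only real data are the frequencies and the rank (965) copied from the debug output above.

```python
# Frequencies copied from the FREQUENCY output above.
FREQUENCIES = {"champagne": 52955, "sur": 321112, "seine": 93802}

# Rank of Champagne-sur-Seine (77079) on the "champagne" token,
# copied from the INDEX output above.
RANK_ON_CHAMPAGNE = 965


def rarest_token(frequencies):
    """Return the token with the lowest frequency (hypothetical helper)."""
    return min(frequencies, key=frequencies.get)


def found_in_truncated_scan(rank, limit):
    """Return True if a document at 1-based position `rank` falls within
    the first `limit` ids of the scanned set (hypothetical helper)."""
    return rank <= limit


token = rarest_token(FREQUENCIES)
reachable = found_in_truncated_scan(RANK_ON_CHAMPAGNE, limit=500)
print(token, reachable)
```

With a limit of 500 the wanted document is never reached, and raising the limit only moves the problem to the next long-named city.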
Last episode: I've introduced INTERSECT_LIMIT in 005867d (with default value 100000).
The idea is that as soon as we have a token with a frequency below INTERSECT_LIMIT, even if we only have common terms, we can as a last resort issue a regular intersect. Otherwise, we just go for the manual scan workaround, which only retrieves 10 random results. But this should only happen with those very common words that don't carry any meaning.
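The decision described above could be sketched roughly as follows. This is an illustration under assumptions, not the code from 005867d: `choose_strategy` and the returned labels are hypothetical names, and only the INTERSECT_LIMIT default and the token frequencies come from this thread.

```python
INTERSECT_LIMIT = 100000  # default value mentioned above


def choose_strategy(frequencies, limit=INTERSECT_LIMIT):
    """Pick a retrieval strategy from token frequencies (hypothetical helper).

    Returns "intersect" when at least one token has a frequency strictly
    below `limit`, otherwise "manual_scan" (the 10-random-results fallback).
    """
    if any(freq < limit for freq in frequencies.values()):
        return "intersect"
    return "manual_scan"


# "champagne" (52955) and "seine" (93802) are both below the limit.
champagne_case = choose_strategy({"champagne": 52955, "sur": 321112, "seine": 93802})

# "de" and "la" are both far above the limit.
common_only_case = choose_strategy({"de": 2012949, "la": 1598672})

print(champagne_case, common_only_case)
```

Under this sketch, "Champagne-sur-Seine" gets a regular intersect, while a query made only of tokens such as "de" and "la" falls back to the manual scan.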
To be continued…
For the record, here is the list of the tokens where frequency > 100000:
101740 kalvado
101918 42
102033 indr
102828 cher
104717 gran
105368 57
107571 45
108934 park
109210 63
109530 klo
109861 savoi
110837 morbian
112206 41
113942 kalai
117221 pa
117369 mozel
121350 40
121363 71
124333 garon
125970 saon
128355 39
129851 nor
132108 maritim
134835 37
135322 frans
136072 maien
136198 48
138423 azur
138749 provens
143679 finister
144564 38
146516 36
147874 chemin
147978 franch
148099 auvergn
148287 atlantik
148968 32
151018 34
151534 komt
152032 sart
153485 56
153489 boi
156574 en
163326 27
165089 30
165891 main
168171 31
170237 28
171312 33
174384 26
176299 72
180591 44
181365 53
189657 lorain
193733 24
195139 manch
204018 au
204700 akitain
204959 23
207728 vilain
209638 49
214175 langedok
214790 rousilon
216872 25
219270 l
219497 chanp
223359 ving
238866 dix neuf
239852 pr
246178 charent
246379 poitou
252146 50
262643 21
267196 29
267552 midi
279910 armor
280361 sentr
280886 bourgogn
300647 35
302921 dix ui
308160 kinz
319545 dix sep
320186 pirene
321112 sur
327508 treiz
330635 seiz
347397 il
373383 bas
381063 ron
381134 aut
387079 onz
409167 normandi
410303 katorz
414531 douz
418944 neuf
421243 dix
461077 ui
468344 sep
470310 22
485876 alp
510154 kot
517236 six
527645 sink
542558 du
567567 katr
580283 troi
587972 un
593129 ru
627199 deu
634499 d
635154 pai
729552 sain
741924 bretagn
783164 et
913687 loir
1577949 le
1598672 la
2012949 de
(118 in total)