Comments (5)

yohanboniface commented on July 21, 2024

So this is a bit more complex, as we have similar use cases that cannot be worked around with pseudo-random results. For example, with "Saint-Pierre" we need the best results (i.e. the cities named "Saint-Pierre"). So we can't just do the manual scan as tried in d2fc0e6.

from addok.

yohanboniface commented on July 21, 2024

This commit slightly improves the "saint pierre" case.


yohanboniface commented on July 21, 2024

But that cannot actually be the solution.
Let's take the example of searching "Champagne-sur-Seine".

> FREQUENCY champagne
52955
--------------------------------------------------------------------------------
> FREQUENCY sur
321112
--------------------------------------------------------------------------------
> FREQUENCY seine
93802

So we'll take champagne and loop over it. The problem here is that "Champagne" is only one third of the "Champagne-sur-Seine" string, so that document scores worse than, for example, all the "Champagne" hamlets. Even if we take the first 500 ids of the "champagne" set, at some point that will not be enough to reach the city with the longer name that we are actually searching for.
As an illustration (this command shows the docs with the best score for their relation to champagne; obviously, Champagne-sur-Seine is not there):

> BESTSCORE champagne
Champagné 4.00878 72054
Champagne 4.0075 28069
Champagne 4.0075 17083
Champagne 4.0075 07051
Champagne 95660 Champagne-sur-Oise 4.005 95134B017H
Champagne 82120 Lavit 4.005 82097B040L
Champagne 79000 Niort 4.005 79191B061K
Champagne 74150 Rumilly 4.005 74225B015D
Champagne 74270 Frangy 4.005 74131B028B
Champagne 73290 La Motte-Servolex 4.005 73179B031Z
Champagne 69830 Saint-Georges-de-Reneins 4.005 69206B020H
Champagne 69110 Sainte-Foy-lès-Lyon 4.005 69202B003G
Champagne 69250 Neuville-sur-Saône 4.005 69143B009V
Champagne 69130 Écully 4.005 69081B006L
Champagne 53400 Craon 4.005 53084B069U
Champagne 50490 Saint-Sauveur-Lendelin 4.005 50550B011P
Champagne 49125 Tiercé 4.005 49347B091M
Champagne 49260 Montreuil-Bellay 4.005 49215B054W
Champagne 49150 Baugé-en-Anjou 4.005 49018C753Z
Champagne 44310 Saint-Colomban 4.005 44155B188U
Champagne 43350 Saint-Paulien 4.005 43216B036S

And here we can see that Champagne-sur-Seine is ranked 965th on the token champagne:

> INDEX 77079
chanpagn 1.3370166666666665 965
sur 1.3370166666666665 4433
sein 1.3370166666666665 92
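The ranking problem can be reproduced with a toy model. The sketch below is not addok's code: `token_share` and `top_ids` are hypothetical helpers, the score is assumed to be simply the fraction of the name covered by the token, the 600 hamlets are made up, and the raw word "champagne" is used instead of the indexed token. It only shows why scanning the first N ids of a token's set can miss a document with a longer name.

```python
def token_share(token, name_tokens):
    """Toy score (assumption, not addok's formula): share of the name covered by the token."""
    return name_tokens.count(token) / len(name_tokens)


def top_ids(token, docs, limit):
    """Return the ids of the `limit` best-scored docs whose name contains `token`."""
    scored = [
        (token_share(token, tokens), doc_id)
        for doc_id, tokens in docs.items()
        if token in tokens
    ]
    # Best score first; ties are broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


# 600 made-up hamlets named just "Champagne", plus the city we are looking for.
docs = {f"hamlet-{i:03d}": ["champagne"] for i in range(600)}
docs["77079"] = ["champagne", "sur", "seine"]

first_500 = top_ids("champagne", docs, limit=500)
print("77079" in first_500)  # False: the city is outside the scanned window

all_ids = top_ids("champagne", docs, limit=len(docs))
print(all_ids.index("77079") + 1)  # 601: ranked after every one-word name
```

Whatever the window size, enough short names can always push the longer one out of it, which is why a fixed-size scan on the rarest token is not sufficient here.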


yohanboniface commented on July 21, 2024

Last episode: I've introduced INTERSECT_LIMIT in 005867d (with a default value of 100000).
The idea is that as soon as we have a token with a frequency below INTERSECT_LIMIT, even if we only have common terms, we can as a last resort issue a regular intersect. Otherwise, we fall back to the manual scan workaround, which only retrieves 10 random results. But this should only happen with those very common words that don't carry any meaning.
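As a rough illustration of that decision, here is a minimal sketch. It is not the code from 005867d: `choose_strategy` and the strategy labels are hypothetical names, and only the INTERSECT_LIMIT default and the frequencies quoted in this thread are taken from the discussion.

```python
INTERSECT_LIMIT = 100000  # default value mentioned in the comment above


def choose_strategy(frequencies, limit=INTERSECT_LIMIT):
    """Pick a retrieval strategy for a query made only of common tokens.

    `frequencies` maps each query token to its frequency in the index.
    Hypothetical helper: returns "intersect" when at least one token is
    rarer than `limit`, otherwise "manual_scan".
    """
    if any(freq < limit for freq in frequencies.values()):
        return "intersect"
    return "manual_scan"


# Frequencies reported by the FREQUENCY command earlier in this thread.
champagne_sur_seine = {"champagne": 52955, "sur": 321112, "seine": 93802}
# Two tokens from the "frequency > 100000" list in this thread.
de_la = {"de": 2012949, "la": 1598672}

print(choose_strategy(champagne_sur_seine))  # intersect
print(choose_strategy(de_la))  # manual_scan
```

With these numbers, "Champagne-sur-Seine" goes through a regular intersect because "champagne" and "seine" are both under the limit, while a query made only of tokens such as "de" and "la" falls back to the manual scan.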

To be continued…


yohanboniface commented on July 21, 2024

For the record, here is the list of tokens whose frequency is > 100000:

101740 kalvado
101918 42
102033 indr
102828 cher
104717 gran
105368 57
107571 45
108934 park
109210 63
109530 klo
109861 savoi
110837 morbian
112206 41
113942 kalai
117221 pa
117369 mozel
121350 40
121363 71
124333 garon
125970 saon
128355 39
129851 nor
132108 maritim
134835 37
135322 frans
136072 maien
136198 48
138423 azur
138749 provens
143679 finister
144564 38
146516 36
147874 chemin
147978 franch
148099 auvergn
148287 atlantik
148968 32
151018 34
151534 komt
152032 sart
153485 56
153489 boi
156574 en
163326 27
165089 30
165891 main
168171 31
170237 28
171312 33
174384 26
176299 72
180591 44
181365 53
189657 lorain
193733 24
195139 manch
204018 au
204700 akitain
204959 23
207728 vilain
209638 49
214175 langedok
214790 rousilon
216872 25
219270 l
219497 chanp
223359 ving
238866 dix neuf
239852 pr
246178 charent
246379 poitou
252146 50
262643 21
267196 29
267552 midi
279910 armor
280361 sentr
280886 bourgogn
300647 35
302921 dix ui
308160 kinz
319545 dix sep
320186 pirene
321112 sur
327508 treiz
330635 seiz
347397 il
373383 bas
381063 ron
381134 aut
387079 onz
409167 normandi
410303 katorz
414531 douz
418944 neuf
421243 dix
461077 ui
468344 sep
470310 22
485876 alp
510154 kot
517236 six
527645 sink
542558 du
567567 katr
580283 troi
587972 un
593129 ru
627199 deu
634499 d
635154 pai
729552 sain
741924 bretagn
783164 et
913687 loir
1577949 le
1598672 la
2012949 de

(118 in total)
