import string
non_word_boundaries = set(string.digits + string.ascii_letters + '_')
print(non_word_boundaries)
>> {'k', '6', 's', 'M', 'i', 'S', 'm', 'E', 'r', 'W', 'v', 'l',
'R', 'f', 'e', 'X', '7', '3', 'q', 'w', '0', 'x', 'V', 'C', 'n',
'I', '4', 'D', 'z', 'G', 'L', '2', 'T', 'U', '_', 'B', 't', 'Q',
'd', '9', 'h', 'o', 'c', 'u', 'P', 'K', 'Y', 'p', 'A', 'J', 'O',
'N', 'H', 'j', 'a', 'Z', '5', '1', 'b', 'y', 'F', '8', 'g'}
The problem arises when you look up Cyrillic keywords in a Cyrillic text. Because of this limitation, flashgeotext does not reliably extract the longest match: every character that is not present in non_word_boundaries is treated as a word boundary, so the traversal through the trie stops early.
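The limitation can be checked with the standard library alone. The sketch below rebuilds the default set shown above and tests a Cyrillic keyword against it. The names `keyword`, `missing`, `cyrillic` and `extended` are illustrative and not part of flashgeotext, and extending the set with the basic Russian alphabet is only one possible workaround, shown here as an assumption. How the extended set is handed to the library is not covered.

```python
import string

# Same default set as above: ASCII digits, ASCII letters and the underscore.
non_word_boundaries = set(string.digits + string.ascii_letters + "_")

keyword = "Нижневартовск"

# Characters of the keyword that the default set does not contain,
# i.e. characters that would be treated as word boundaries.
missing = [ch for ch in keyword if ch not in non_word_boundaries]
print(len(missing) == len(keyword))  # True: no Cyrillic letter is in the set

# Hypothetical workaround: add the basic Russian alphabet
# (U+0410..U+044F, plus the letters Ё and ё) to the set.
cyrillic = {chr(cp) for cp in range(0x0410, 0x0450)} | {"Ё", "ё"}
extended = non_word_boundaries | cyrillic

# With the extended set, no character of the keyword counts as a boundary.
print(all(ch in extended for ch in keyword))  # True
```

With the default set, every letter of the keyword counts as a boundary, which is why the match is cut short. With the extended set, none of them does.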
{"Нижневартовск": ["Нижневартовск"]