Expected Behavior
Adding new questions to an item should not adversely affect answers that were previously matched correctly.
Actual Behavior
Adding a new question can actually weaken the score of an existing good match, sometimes causing another item to score higher. See the example below.
Steps to Reproduce the Problem
Import the two items in the attached file:
test.txt
Switch to the 'test' tab and test the question "tell me about snorkeling" - observe that the expected item 'test.001' has the higher score (though by a slim margin)
Now edit 'test.001' and add a second question: "what should I know about snorkeling"
Rerun the test with the same question. Now the other item has the higher score.
It is counterintuitive, and undesirable, that adding the second question would change the answer.
Analysis
The QnABot uses Elasticsearch's full-text search capability to create 'relevance scores' for each QnA item. Relevance scores are computed by weighting a number of different factors in an effort to find the best match - see What is relevance
There are three factors in the scoring: a) term frequency, b) inverse document frequency, and c) field-length norm. I believe it is this third factor that is biting us here.
This is because adding the second question made the whole 'question' field longer, which reduced the relevance score of item 1's match (due to the 'field-length norm' behavior mentioned above). The score was reduced to the point where it was slightly lower than that of the other item.
NOTE: This situation really only arises when a question produces very similar scores on multiple items, i.e. when similarities between the questions prevent a strong unique match.
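To make the interaction concrete, here is a rough sketch of Lucene's classic (TF/IDF) practical scoring, which Elasticsearch used by default at the time; the idf value and term counts below are illustrative assumptions, not values measured from the test data (newer Elasticsearch versions default to BM25, which applies an analogous length normalization):

```python
import math

def field_norm(num_terms):
    # field-length norm: 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_terms)

def classic_score(tf_raw, idf, num_terms):
    # score ~ sqrt(term frequency) * idf^2 * field-length norm
    # (query norm and boosts omitted for simplicity)
    return math.sqrt(tf_raw) * idf**2 * field_norm(num_terms)

# One question (~4 terms) vs. the same match after a second question
# lengthens the field (~10 terms): the norm shrinks, and so does the score.
one_question = classic_score(tf_raw=1, idf=1.5, num_terms=4)
two_questions = classic_score(tf_raw=1, idf=1.5, num_terms=10)
assert two_questions < one_question
```

The matched terms are identical in both cases; only the field got longer, which is exactly the "adding a question weakens the match" effect described above.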
Options
A short-term workaround is to duplicate the question in order to increase the 'term frequency' part of the scoring equation. I.e. adding a 3rd question to 'test.001' duplicating the initial question "tell me about snorkeling" once again results in this item having the highest score. Although the 3rd question lengthened the field further, the fact that it was a strong match to the question had the net effect of strengthening the overall score.
However, while this technique might be useful for avoiding this specific problem, I do worry that it could introduce new problems by weakening the scores of other question variants. It could become a game of 'whack-a-mole'!
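Under the same classic-scoring sketch (term counts again illustrative, not measured), the workaround holds together because duplicating the matching question doubles its raw term frequency, and the sqrt(tf) growth more than offsets the longer field:

```python
import math

def classic_score(tf_raw, num_field_terms, idf=1.5):
    # Lucene classic similarity sketch: sqrt(tf) * idf^2 / sqrt(field length)
    return math.sqrt(tf_raw) * idf**2 / math.sqrt(num_field_terms)

# Illustrative term counts: two questions (~10 terms) vs. three questions
# with the matching one duplicated (~14 terms, matched terms occur twice).
without_dup = classic_score(tf_raw=1, num_field_terms=10)
with_dup = classic_score(tf_raw=2, num_field_terms=14)
assert with_dup > without_dup  # higher tf more than offsets the longer field
```

This also shows why the trick is fragile: any question variant whose terms were *not* duplicated gets only the length penalty, with no compensating tf boost.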
Better if we can find a fix in the code:
- (preferred) find a way to construct the doctype mapping or the query to negate the problematic 'field-length norm' factor when matching on the question lists. The number and total length of the questions should ideally not affect the score of a match. See disable field-length norm in mapping
- UPDATE: disabling 'field-length norm' did resolve this issue but created a new one: with it disabled, the question "tell me about snorkeling" returned identical relevance scores for the two test items "tell me about snorkeling" and "tell me about snorkel prices", since after stemming (snorkeling=snorkel) the matches were identical. So back to the drawing board. Next up: see if mapping the question array as a nested datatype will help.
- alternatively, enhance the Elasticsearch document structure to model each question independently, either by duplicating answers where there are multiple questions, or by using a parent/child or nested mapping to nest questions as separate documents under the parent answer.
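The two mapping-level options above can be sketched as request bodies; field names ("questions", "q", "answer") and the query are hypothetical, not QnABot's actual schema, and the bodies are shown in typeless (ES 7+) style:

```python
# Option 1: disable field-length norms on the questions field.
# Tried above -- it removed the length penalty, but equal stemmed
# matches then tie.
norms_disabled_mapping = {
    "mappings": {
        "properties": {
            "questions": {"type": "text", "norms": False},
            "answer": {"type": "text"},
        }
    }
}

# Option 2: map each question as its own nested document, so each
# question is analyzed and scored independently of its siblings.
nested_mapping = {
    "mappings": {
        "properties": {
            "questions": {
                "type": "nested",
                "properties": {"q": {"type": "text"}},
            },
            "answer": {"type": "text"},
        }
    }
}

# With score_mode "max", only the best-matching single question
# contributes to the parent's score, so adding more questions to an
# item cannot dilute an existing good match.
nested_query = {
    "query": {
        "nested": {
            "path": "questions",
            "score_mode": "max",
            "query": {"match": {"questions.q": "tell me about snorkeling"}},
        }
    }
}
```

The nested approach trades some indexing cost (each question becomes a hidden sub-document) for scoring that matches the expected behavior stated at the top of this issue.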