We need to decide as a community how to handle tokenization, stemming, etc.


Decide on (Lucene) Analyzer about twitter-tools HOT 15 CLOSED

lintool commented on May 30, 2024

Decide on (Lucene) Analyzer

from twitter-tools.

Comments (15)

JamesMcMinn commented on May 30, 2024

Per discussion on Skype, I'll look into this over the next week or so.


stewhdcs commented on May 30, 2024

Copied from @telsayed in #26, the analyzer should keep intact URLs, and not strip preceding @'s and #'s:

•hashtags (i.e., that are mentioned in the tweet), [so that we can retrieve all tweets that have a specific hashtag].
•mentions (usernames mentioned in the tweet)
•URLs [so that we can get all tweets that point to a specific URL]
•comments (that are written beside a retweet)


lintool commented on May 30, 2024

The name of the current analyzer is LowerCaseHashtagMentionPreservingTokenizer, so that is the intent... :)


JamesMcMinn commented on May 30, 2024

Current State of Tokenizer

Below are some example tweets, and how the current Tokenizer handles them (a | marks a token boundary)

AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
AT|T|getting|secret|immunity|from|wiretapping|laws|for|government|surveillance|http|||vrge|co|ZP3Fx5

want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
want|to|see|the|@verge|aston|martin|GT4|racer|tear|up|long|beach||http|||theracersgroup|kinja|com|watch|an|aston|martin|vantage|gt4|tear|around|long|beac|479726219||

Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe  to ensure blind accessibility contributor gets to @DrupalCon #Opensource
Incredibly|good|news||#Drupal|users|rally|http|||bit|ly|Z8ZoFe||to|ensure|blind|accessibility|contributor|gets|to|@DrupalCon|#Opensource

We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
We|re|entering|the|quiet|hours|at|#amznhack||#Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz

The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf  supported by @linkedtv @project_mmixer @socialsensor_ip
The|2013|Social|Event|Detection|Task||SED||at|#mediaeval2013||http|||bit|ly|16nITsf||supported|by|@linkedtv|@project|mmixer|@socialsensor|ip

http://www.google.co.uk/#sclient=psy-ab&q=boston&oq=boston&gs_l=hp.3..0i3l3j0.1740.1740.1.1935.1.1.0.0.0.0.73.73.1.1.0...0.0...1c.1.11.psy-ab.Y6qmtfBiK3M&pbx=1&bav=on.2,or.r_qf.&bvm=bv.45580626,d.d2k&fp=d5785e6541662a88&biw=1087&bih=989
http|||www|google|co|uk|#sclient|psy|ab|q|boston|oq|boston|gs|l|hp|3||0i3l3j0|1740|1740|1|1935|1|1|0|0|0|0|73|73|1|1|0|||0|0|||1c|1|11|psy|ab|Y6qmtfBiK3M|pbx|1|bav|on|2|or|r|qf||bvm|bv|45580626|d|d2k|fp|d5785e6541662a88|biw|1087|bih|989

this is an@example.com
this|is|an@example|com

:@blah#hashtag
|@blah#hashtag

Observations

It handles most mentions and hashtags correctly; however, there are a number of cases where it does not work (e.g. the underscore character _ should be valid)
There are a number of cases where tokenization should have occurred because the hashtag or mention is invalid (i.e. when the "mention" is preceded by an invalid character, as in an email address)
URLs simply do not work
Words which contain apostrophes are split - I'm not sure this is desirable

I believe tokenization needs to occur in multiple steps:

Split at white space characters only
Use regular expression to match entities
Perform further tokenization depending on entity/non-entity

Regular expressions for valid entities can be found here. I plan on using these to decide which type of tokenization to perform after whitespace tokenization.
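
The three steps above could be sketched roughly as follows. This is Python for brevity, not the Lucene tokenizer in the repository. The `URL` and `ENTITY` patterns and the helper names are hypothetical simplifications, far looser than the twitter-text expressions, and no stemming is applied.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the twitter-text rules).
URL = re.compile(r"https?://\S+")
# A mention or hashtag only counts if it is NOT preceded by a word character,
# so the "@example" in an email address is rejected. \w includes underscore.
ENTITY = re.compile(r"(?<![\w@#])[@#]\w+")
# Everything that is not a word character or apostrophe separates plain tokens.
NON_TOKEN = re.compile(r"[^\w']+")


def split_plain(text):
    """Step 3 for non-entity text: split on punctuation and lowercase."""
    return [t.lower() for t in NON_TOKEN.split(text) if t]


def tokenize(tweet):
    tokens = []
    for chunk in tweet.split():              # step 1: whitespace only
        if URL.fullmatch(chunk):             # step 2: is the chunk a URL?
            tokens.append(chunk)             # step 3: keep URLs intact, case included
            continue
        pos = 0
        for m in ENTITY.finditer(chunk):     # step 2: mentions and hashtags
            tokens.extend(split_plain(chunk[pos:m.start()]))
            tokens.append(m.group().lower()) # step 3: keep the entity whole
            pos = m.end()
        tokens.extend(split_plain(chunk[pos:]))
    return tokens


print(tokenize("Task (SED) at #mediaeval2013, http://bit.ly/16nITsf by @project_mmixer"))
print(tokenize("this is an@example.com"))
```

Splitting on whitespace first means a URL is never broken up by the punctuation rules, and the lookbehind is what stops an email address from producing a bogus mention.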

(Post to follow with examples of the new tokenizer)


lintool commented on May 30, 2024

James, can you incorporate this as appropriate? https://github.com/twitter/twitter-text-java


JamesMcMinn commented on May 30, 2024

Jimmy, thanks for pointing out that there was a Java version, I hadn't seen it before. I've incorporated the code, and it seems to work fine.

Delimiters

At the moment I'm concentrating on the tokenization. What characters, other than whitespace, do we want to use as delimiters? At the moment my list is:
_ - ? ! , ; : . ( ) [ ] @ # / \

There's also the issue of URLs. bit.ly URLs are case sensitive, which means that if we make them lowercase they will no longer work. At the moment I'm lowercasing everything except URLs; however, since the original case is preserved in the full text, I'm not sure it's an issue. Discussion is welcome.
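
For illustration, a small Python sketch of the delimiter list and the case rule described above. The names `split_on_delimiters` and `normalize_case` are hypothetical, and the prefix check is only a stand-in for proper URL detection. It assumes entities and URLs have already been separated out, since the list includes @, # and /.

```python
import re

# The delimiter list from this comment: _ - ? ! , ; : . ( ) [ ] @ # / \
DELIMITERS = "_-?!,;:.()[]@#/\\"
# re.escape makes every character safe to place inside a character class.
DELIMITER_RUN = re.compile("[" + re.escape(DELIMITERS) + "]+")


def split_on_delimiters(chunk):
    """Split a non-entity chunk on runs of the listed delimiters."""
    return [t for t in DELIMITER_RUN.split(chunk) if t]


def normalize_case(token):
    """Lowercase everything except URLs, whose paths can be case sensitive."""
    if token.startswith(("http://", "https://")):
        return token
    return token.lower()


print(split_on_delimiters("this-is-lots!of!words.seperated"))
print(split_on_delimiters("without~whitespace"))  # ~ is not in the list
print([normalize_case(t) for t in ["Rally", "http://bit.ly/Z8ZoFe", "#Drupal"]])
```

Characters missing from the list, such as ~, = and +, do not split anything in this sketch, which is one reason to settle on a fuller set.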

Stemming

I'm using Porter stemming, as it's well known and built into Lucene. At the moment it only stems non-entities, and leaves mentions/hashtags alone. Any comments on this are welcome.

Stop Word Removal?

Is this something we also want to do?

Current Output

AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
att|get|secret|immun|from|wiretap|law|for|govern|surveil|http://vrge.co/ZP3Fx5|

want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
want|to|see|the|@verge|aston|martin|gt4|racer|tear|up|long|beach||http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219||

Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe  to ensure blind accessibility contributor gets to @DrupalCon #Opensource
incredibli|good|new||#drupal|user|ralli|http://bit.ly/Z8ZoFe|to|ensur|blind|access|contributor|get|to|@drupalcon|#opensource|

We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
were|enter|the|quiet|hour|at|#amznhack||#rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz|

The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf  supported by @linkedtv @project_mmixer @socialsensor_ip
the|2013|social|event|detect|task||sed||at|#mediaeval2013||http://bit.ly/16nITsf|support|by|@linkedtv|@project_mmixer|@socialsensor_ip|

this is an@example.com
thi|i|an|exampl|

this is @a_valid_mention and this_is_multiple_words
thi|i|@a_valid_mention|and|thi|i|multipl|word|

12354
12354|

this-is-lots!of!words.seperated.without~whitespace-cause=no+one-likes(whitespace)(do]they? U.S.A. U.K. What the hell?hello...hello.a.b..c.
thi|i|lot|of|word|seper|withoutwhitespac|causenoon|like|whitespac||do|thei||usa|uk|what|the|hell|hello|hello|ab|c|

this @should.not.work
thi|@should|not|work|

PLEASE BE LOWER CASE WHEN YOU COME OUT THE OTHER SIDE - ALSO A @VALID_VALID-INVALID
pleas|be|lower|case|when|you|come|out|the|other|side|||also|a|@valid_valid|invalid|

＠reply @with #crazy ~＃at
＠reply|@with|#crazy||＃at|

:@valid testing(valid)#hashtags. RT:@meniton (the last @mention is #valid and so is this:@valid), however this is@invalid
|@valid|test|valid|#hashtags||rt|@meniton||the|last|@mention|i|#valid|and|so|i|thi|@valid|||howev|thi|i|invalid|


lintool commented on May 30, 2024

I'm ambivalent about stemming, will go with whatever the community decides.
I'll evaluate the stopwords issues from the systems perspective: if it doesn't slow down query latency/throughput and doesn't blow up the index size (too much), I'm fine with keeping stopwords.

I noticed that emails aren't handled properly. Not a big deal, maybe as a //TODO in the code.


yubink commented on May 30, 2024

+1 on stemming, and perhaps we should keep stopwords. People will then have the option of removing stopwords from their queries or not.


yubink commented on May 30, 2024

Saw the specification and thanks for all the hard work!

Just curious, why that particular set of characters? Why not { } or other POSIX punctuation characters?


JamesMcMinn commented on May 30, 2024

I deliberately left the set of characters incomplete to try to generate some feedback. I'm not really sure what the best set is, and I don't think the POSIX set is perfect (for example, we probably don't want to use '); however, I do think it's a good place to start.

The POSIX set contains:

 ] [ ! " # $ % & ' ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~ -

However this leaves out characters such as ¬, · and …, and we probably want to remove ' from the list for contractions.

I believe OS X replaces 3 periods with an ellipsis by default now, so certainly we should include that in the list.

The set that I would suggest is:

 ] [ ! " # $ % & ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~ - … ¬ ·

However, I was rather hoping that someone else would make some suggestions. Perhaps we should split on non-alphanumeric characters (except ') instead?
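
As a quick check of the suggested set, here is a hedged Python sketch. Python's `string.punctuation` happens to be the same 32 ASCII punctuation characters as the POSIX list above, so the proposal is that set minus ' plus the three extra characters. `split_tokens` is a hypothetical helper, and it assumes mentions, hashtags and URLs have already been pulled out, since @, # and / are in the set.

```python
import string

# POSIX punctuation, minus the apostrophe (kept for contractions),
# plus the ellipsis, not sign and middle dot suggested above.
DELIMITERS = (set(string.punctuation) - {"'"}) | set("…¬·")

# Map every delimiter to a space so str.split() can do the rest.
_TO_SPACE = str.maketrans({c: " " for c in DELIMITERS})


def split_tokens(text):
    """Split on whitespace and on any character in DELIMITERS."""
    return text.translate(_TO_SPACE).split()


print(len(DELIMITERS))
print(split_tokens("we're here…are you?"))
```

Because the apostrophe is not a delimiter, contractions such as "we're" survive as single tokens.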


yubink commented on May 30, 2024

Haha, nice trick, that.

You've obviously given more thought to the tokenizer split set than I have!
The set you propose looks good to me.

However, I don't think splitting on everything but alphanum is a good idea
because of potential non-ascii/latin characters. Actually, what is going to
happen to non-ascii/latin characters?


lintool commented on May 30, 2024

James, would you mind pushing a branch that fixes this, and also tweaking the wiki?

Thanks!


amjedbj commented on May 30, 2024

I agree with @yubink; I think that alphanum-based tokenization is a good idea.
Some languages use special punctuation marks, for instance « » (fr) · (el) ؟ ؛ ، (ar, ur,fa) （）【】 (cn,jp).
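
To make the non-ASCII question concrete, here is a small Python sketch contrasting an ASCII-only notion of "alphanumeric" with a Unicode-aware one. Both helper names are hypothetical, and this says nothing about how the Lucene tokenizer in the repository behaves.

```python
import re

# In Python 3, \w on str patterns is Unicode-aware: it matches letters and
# digits from any script, but not punctuation such as « » or ؟.
UNICODE_WORD = re.compile(r"[\w']+")
# An ASCII-only definition of "alphanumeric", for comparison.
ASCII_WORD = re.compile(r"[A-Za-z0-9']+")


def unicode_tokens(text):
    return UNICODE_WORD.findall(text)


def ascii_tokens(text):
    return ASCII_WORD.findall(text)


print(unicode_tokens("«überwachung»"))  # guillemets dropped, word kept whole
print(ascii_tokens("«überwachung»"))    # the ü is treated as a delimiter
```

With a Unicode-aware definition, language-specific punctuation is split away without having to enumerate it, while an ASCII-only definition shreds any word containing a non-ASCII letter.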

I have 2 questions. Are the two expressions "#trec2013" and "trec2013" considered as two different terms in the index? Are tweets containing only "#trec2013" selected for query "trec2013 guideline"?


JamesMcMinn commented on May 30, 2024

(For now) the tokenizer considers "#trec2013" and "trec2013" as different terms. I'm personally of the opinion that they should be treated separately, however discussion is very welcome.
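
A toy Python illustration of what treating them as different terms implies for the query above. The three "tweets", the `build_index` and `search_any` helpers, and the OR semantics of the query are all assumptions for the sake of the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search_any(index, query_terms):
    """Return ids of documents containing at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return hits


# Hypothetical, already-tokenized tweets.
docs = {
    1: ["#trec2013", "guideline"],
    2: ["#trec2013"],
    3: ["trec2013", "track"],
}
index = build_index(docs)

print(search_any(index, ["trec2013", "guideline"]))  # tweet 2 is not matched
print(search_any(index, ["#trec2013"]))
```

Under these assumptions, a tweet containing only "#trec2013" is not retrieved for "trec2013 guideline" unless the query is expanded or the hashtag is also indexed without its #.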

I'm not sure that it's worth spending much time tuning the tokenizer to work with other languages since (to the best of my knowledge) only English tweets are considered relevant. However, if there's enough demand then I've got nothing against adding them.

@lintool I'll try and push an updated version some time tomorrow with the new delimiters.


lintool commented on May 30, 2024

This task has been completed and results have been merged into the trec2013-api branch.
