Comments (15)

JamesMcMinn commented on May 30, 2024

Per discussion on Skype, I'll look into this over the next week or so.

from twitter-tools.

stewhdcs commented on May 30, 2024

Copied from @telsayed in #26, the analyzer should keep URLs intact, and not strip preceding @'s and #'s:

• hashtags (mentioned in the tweet) [so that we can retrieve all tweets that have a specific hashtag]
• mentions (usernames mentioned in the tweet)
• URLs [so that we can retrieve all tweets that point to a specific URL]
• comments (written beside a retweet)

lintool commented on May 30, 2024

The name of the current analyzer is LowerCaseHashtagMentionPreservingTokenizer, so that is the intent... :)

JamesMcMinn commented on May 30, 2024

Current State of the Tokenizer

Below are some example tweets and how the current tokenizer handles them (a | marks a token boundary):

AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
AT|T|getting|secret|immunity|from|wiretapping|laws|for|government|surveillance|http|||vrge|co|ZP3Fx5

want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
want|to|see|the|@verge|aston|martin|GT4|racer|tear|up|long|beach||http|||theracersgroup|kinja|com|watch|an|aston|martin|vantage|gt4|tear|around|long|beac|479726219||

Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe  to ensure blind accessibility contributor gets to @DrupalCon #Opensource
Incredibly|good|news||#Drupal|users|rally|http|||bit|ly|Z8ZoFe||to|ensure|blind|accessibility|contributor|gets|to|@DrupalCon|#Opensource

We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
We|re|entering|the|quiet|hours|at|#amznhack||#Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz

The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf  supported by @linkedtv @project_mmixer @socialsensor_ip
The|2013|Social|Event|Detection|Task||SED||at|#mediaeval2013||http|||bit|ly|16nITsf||supported|by|@linkedtv|@project|mmixer|@socialsensor|ip

http://www.google.co.uk/#sclient=psy-ab&q=boston&oq=boston&gs_l=hp.3..0i3l3j0.1740.1740.1.1935.1.1.0.0.0.0.73.73.1.1.0...0.0...1c.1.11.psy-ab.Y6qmtfBiK3M&pbx=1&bav=on.2,or.r_qf.&bvm=bv.45580626,d.d2k&fp=d5785e6541662a88&biw=1087&bih=989
http|||www|google|co|uk|#sclient|psy|ab|q|boston|oq|boston|gs|l|hp|3||0i3l3j0|1740|1740|1|1935|1|1|0|0|0|0|73|73|1|1|0|||0|0|||1c|1|11|psy|ab|Y6qmtfBiK3M|pbx|1|bav|on|2|or|r|qf||bvm|bv|45580626|d|d2k|fp|d5785e6541662a88|biw|1087|bih|989

this is [email protected]
this|is|an@example|com

:@blah#hashtag
|@blah#hashtag

Observations

  • It handles most mentions and hashtags correctly; however, there are a number of cases where it does not work (e.g. the underscore character _ should be valid)
  • There are a number of cases where tokenization should have occurred because the hashtag or mention is invalid (i.e. when the "mention" is preceded by an invalid character, such as in an email address)
  • URLs simply do not work
  • Words which contain apostrophes are split; I'm not sure this is desirable

I believe tokenization needs to occur in multiple steps:

  1. Split at white space characters only
  2. Use regular expression to match entities
  3. Perform further tokenization depending on entity/non-entity

Regular expressions for valid entities can be found here. I plan on using these to decide which type of tokenization to perform after whitespace tokenization.
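
Purely as an illustration of the three steps, here is a Python sketch. The entity regexes are simplified stand-ins invented for this example; the real twitter-text patterns are considerably more involved:

```python
import re

# Hypothetical, simplified stand-ins for the real twitter-text
# entity patterns (illustration only).
MENTION_RE = re.compile(r"^@\w+$")
HASHTAG_RE = re.compile(r"^#\w+$")
URL_RE = re.compile(r"^https?://\S+$")
DELIMITERS = re.compile(r"[_\-?!,;:.()\[\]@#/\\]+")

def tokenize(tweet):
    tokens = []
    for chunk in tweet.split():                  # 1. split on whitespace only
        if MENTION_RE.match(chunk) or HASHTAG_RE.match(chunk):
            tokens.append(chunk.lower())         # 2. entity: keep intact, lowercase
        elif URL_RE.match(chunk):
            tokens.append(chunk)                 # 2. URL: keep intact, preserve case
        else:                                    # 3. non-entity: split on delimiters
            tokens.extend(t.lower() for t in DELIMITERS.split(chunk) if t)
    return tokens
```

Anything that fails the entity patterns falls through to ordinary delimiter-based tokenization, which is what gives invalid mentions (e.g. email addresses) the normal-word treatment.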

(Post to follow with examples of the new tokenizer)

lintool commented on May 30, 2024

James, can you incorporate this as appropriate? https://github.com/twitter/twitter-text-java

JamesMcMinn commented on May 30, 2024

Jimmy, thanks for pointing out that there was a Java version, I hadn't seen it before. I've incorporated the code, and it seems to work fine.

Delimiters

At the moment I'm concentrating on the tokenization. What characters, other than whitespace, do we want to use as delimiters? At the moment my list is:
_ - ? ! , ; : . ( ) [ ] @ # / \

There's also the issue of URLs. bit.ly URLs are case sensitive, which means that if we lowercase them they will no longer work. At the moment I'm lowercasing everything except URLs; however, since the original case is preserved in the full text, I'm not sure it's an issue. Discussion is welcome.

Stemming

I'm using Porter stemming, as it's well known and built into Lucene. At the moment it only stems non-entities, and leaves mentions/hashtags alone. Any comments on this are welcome.
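
The entity-skipping logic can be sketched as below. Note that `toy_stem` is only a placeholder standing in for a real Porter stemmer (in the actual code that would be Lucene's stemming filter); the point here is just which tokens get stemmed:

```python
import re

# Tokens that must never be stemmed: URLs, @mentions and #hashtags.
ENTITY_RE = re.compile(r"^(https?://|[@#])")

def toy_stem(token):
    # Placeholder for a real Porter stemmer; this toy only strips a final "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def stem_non_entities(tokens):
    # Stem plain words only; leave mentions, hashtags and URLs untouched.
    return [t if ENTITY_RE.match(t) else toy_stem(t) for t in tokens]
```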

Stop Word Removal?

Is this something we also want to do?

Current Output

AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
att|get|secret|immun|from|wiretap|law|for|govern|surveil|http://vrge.co/ZP3Fx5|

want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
want|to|see|the|@verge|aston|martin|gt4|racer|tear|up|long|beach||http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219||

Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe  to ensure blind accessibility contributor gets to @DrupalCon #Opensource
incredibli|good|new||#drupal|user|ralli|http://bit.ly/Z8ZoFe|to|ensur|blind|access|contributor|get|to|@drupalcon|#opensource|

We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
were|enter|the|quiet|hour|at|#amznhack||#rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz|

The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf  supported by @linkedtv @project_mmixer @socialsensor_ip
the|2013|social|event|detect|task||sed||at|#mediaeval2013||http://bit.ly/16nITsf|support|by|@linkedtv|@project_mmixer|@socialsensor_ip|

this is [email protected]
thi|i|an|exampl|

this is @a_valid_mention and this_is_multiple_words
thi|i|@a_valid_mention|and|thi|i|multipl|word|

12354
12354|

this-is-lots!of!words.seperated.without~whitespace-cause=no+one-likes(whitespace)(do]they? U.S.A. U.K. What the hell?hello...hello.a.b..c.
thi|i|lot|of|word|seper|withoutwhitespac|causenoon|like|whitespac||do|thei||usa|uk|what|the|hell|hello|hello|ab|c|

this @should.not.work
thi|@should|not|work|

PLEASE BE LOWER CASE WHEN YOU COME OUT THE OTHER SIDE - ALSO A @VALID_VALID-INVALID
pleas|be|lower|case|when|you|come|out|the|other|side|||also|a|@valid_valid|invalid|

@reply @with #crazy ~#at
@reply|@with|#crazy||#at|

:@valid testing(valid)#hashtags. RT:@meniton (the last @mention is #valid and so is this:@valid), however this is@invalid
|@valid|test|valid|#hashtags||rt|@meniton||the|last|@mention|i|#valid|and|so|i|thi|@valid|||howev|thi|i|invalid|

lintool commented on May 30, 2024

I'm ambivalent about stemming, and will go with whatever the community decides.
I'll evaluate the stopword issue from the systems perspective: if it doesn't slow down query latency/throughput and doesn't blow up the index size (too much), I'm fine with keeping stopwords.

I noticed that emails aren't handled properly. Not a big deal, maybe as a //TODO in the code.

yubink commented on May 30, 2024

+1 on stemming, and perhaps we should keep stopwords. People will then have the option of removing stopwords from their queries or not.

yubink commented on May 30, 2024

Saw the specification and thanks for all the hard work!

Just curious, why that particular set of characters? Why not { } or other POSIX punctuation characters?

JamesMcMinn commented on May 30, 2024

I left the set of characters lacking specifically to try and generate some feedback. I'm not really sure what the best set is, and I don't think the POSIX set is perfect (for example, we probably don't want to use '); however, I do think it's a good place to start.

The POSIX set contains:

 ] [ ! " # $ % & ' ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~ -

However, this leaves out characters such as ¬, · and …, and we probably want to remove ' from the list to preserve contractions.

I believe OS X now replaces three periods with an ellipsis by default, so we should certainly include the ellipsis in the list.

The set that I would suggest is:

 ] [ ! " # $ % & ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~ - … ¬ · 

However, I was rather hoping that someone else would make some suggestions. Perhaps we should split on non-alphanumeric characters (except ') instead?
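
For comparison, the "split on non-alphanumeric characters (except ')" alternative is a one-line regex in most languages. A sketch, with the caveat that a Unicode-aware \w also counts "_" as a word character, so underscores would not be split:

```python
import re

# Split on any run of characters that is neither a word character nor an
# apostrophe. In Python 3, \w is Unicode-aware, so accented and non-Latin
# letters survive; note \w also includes "_", so underscores are NOT split.
SPLIT_RE = re.compile(r"[^\w']+")

def split_tokens(text):
    return [t for t in SPLIT_RE.split(text) if t]
```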

yubink commented on May 30, 2024

Haha, nice trick, that.

You've obviously given more thought to the tokenizer split set than I have!
The set you propose looks good to me.

However, I don't think splitting on everything but alphanumerics is a good idea because of potential non-ASCII/Latin characters. Actually, what is going to happen to non-ASCII/Latin characters?

lintool commented on May 30, 2024

James, would you mind pushing a branch that fixes this, and also tweaking the wiki?

Thanks!

amjedbj commented on May 30, 2024

I agree with @yubink; I think that alphanumeric-based tokenization is a good idea.
Some languages use special punctuation marks, for instance « » (fr), · (el), ؟ ؛ ، (ar, ur, fa), and ( ) 【 】 (cn, jp).

I have two questions. Are the two expressions "#trec2013" and "trec2013" considered two different terms in the index? Are tweets containing only "#trec2013" selected for the query "trec2013 guideline"?

JamesMcMinn commented on May 30, 2024

(For now) the tokenizer considers "#trec2013" and "trec2013" as different terms. I'm personally of the opinion that they should be treated separately, however discussion is very welcome.
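
If matching in both directions were ever wanted, one possibility (not what the current analyzer does) would be to emit the bare term alongside each hashtag at indexing time, so a query for "trec2013" also matches tweets containing only "#trec2013". A sketch:

```python
def expand_hashtags(tokens):
    # Emit the bare word after each hashtag so queries for the plain term
    # also match tweets that only contain the hashtag form.
    out = []
    for t in tokens:
        out.append(t)
        if t.startswith("#") and len(t) > 1:
            out.append(t[1:])
    return out
```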

I'm not sure it's worth spending much time tuning the tokenizer to work with other languages since (to the best of my knowledge) only English tweets are considered relevant. However, if there's enough demand, I've got nothing against adding support for them.

@lintool I'll try and push an updated version some time tomorrow with the new delimiters.

lintool commented on May 30, 2024

This task has been completed and results have been merged into the trec2013-api branch.
