Comments (15)
Per discussion on Skype, I'll look into this over the next week or so.
from twitter-tools.
Copied from @telsayed in #26: the analyzer should keep URLs intact, and should not strip preceding @'s and #'s:
•hashtags (i.e., that are mentioned in the tweet), [so that we can retrieve all tweets that have a specific hashtag].
•mentions (usernames mentioned in the tweet)
•URLs [so that we can get all tweets that point to a specific URL]
•comments (that are written beside a retweet)
from twitter-tools.
The name of the current analyzer is LowerCaseHashtagMentionPreservingTokenizer, so that is the intent... :)
from twitter-tools.
Current State of Tokenizer
Below are some example tweets and how the current tokenizer handles them (a | marks a token boundary)
AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
AT|T|getting|secret|immunity|from|wiretapping|laws|for|government|surveillance|http|||vrge|co|ZP3Fx5
want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
want|to|see|the|@verge|aston|martin|GT4|racer|tear|up|long|beach||http|||theracersgroup|kinja|com|watch|an|aston|martin|vantage|gt4|tear|around|long|beac|479726219||
Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe to ensure blind accessibility contributor gets to @DrupalCon #Opensource
Incredibly|good|news||#Drupal|users|rally|http|||bit|ly|Z8ZoFe||to|ensure|blind|accessibility|contributor|gets|to|@DrupalCon|#Opensource
We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
We|re|entering|the|quiet|hours|at|#amznhack||#Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf supported by @linkedtv @project_mmixer @socialsensor_ip
The|2013|Social|Event|Detection|Task||SED||at|#mediaeval2013||http|||bit|ly|16nITsf||supported|by|@linkedtv|@project|mmixer|@socialsensor|ip
http://www.google.co.uk/#sclient=psy-ab&q=boston&oq=boston&gs_l=hp.3..0i3l3j0.1740.1740.1.1935.1.1.0.0.0.0.73.73.1.1.0...0.0...1c.1.11.psy-ab.Y6qmtfBiK3M&pbx=1&bav=on.2,or.r_qf.&bvm=bv.45580626,d.d2k&fp=d5785e6541662a88&biw=1087&bih=989
http|||www|google|co|uk|#sclient|psy|ab|q|boston|oq|boston|gs|l|hp|3||0i3l3j0|1740|1740|1|1935|1|1|0|0|0|0|73|73|1|1|0|||0|0|||1c|1|11|psy|ab|Y6qmtfBiK3M|pbx|1|bav|on|2|or|r|qf||bvm|bv|45580626|d|d2k|fp|d5785e6541662a88|biw|1087|bih|989
this is an@example.com
this|is|an@example|com
:@blah#hashtag
|@blah#hashtag
Observations
- It handles most mentions and hashtags correctly; however, there are a number of cases where it does not work (e.g. the underscore character _ should be valid)
- There are a number of cases where a split should have occurred because the hashtag or mention is invalid (e.g. when the "mention" is preceded by an invalid character, as in an email address)
- URLs simply do not work
- Words which contain apostrophes are split - I'm not sure this is desirable
I believe tokenization needs to occur in multiple steps:
- Split at white space characters only
- Use regular expression to match entities
- Perform further tokenization depending on entity/non-entity
Regular expressions for valid entities can be found here. I plan on using these to decide which type of tokenization to perform after whitespace tokenization.
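The three steps proposed above can be sketched roughly as follows. This is a minimal Python illustration rather than the project's Java code, and the patterns `URL_RE`, `ENTITY_RE` and `PLAIN_SPLIT_RE` are simplified, hypothetical stand-ins for the much more involved twitter-text expressions.

```python
import re

# Hypothetical, simplified entity patterns (assumptions for illustration only).
URL_RE = re.compile(r"https?://\S+")
# A mention or hashtag only counts if it is not preceded by a word character,
# so the "@example" inside an email address is not treated as a mention.
ENTITY_RE = re.compile(r"(?<!\w)[@#]\w+")
# Non-entity text is split on anything that is not a letter or digit.
PLAIN_SPLIT_RE = re.compile(r"[\W_]+")


def split_plain(text):
    """Step 3 for non-entity text: split on non-alphanumeric characters."""
    return [t for t in PLAIN_SPLIT_RE.split(text) if t]


def tokenize(text):
    tokens = []
    for chunk in text.split():                 # step 1: whitespace only
        if URL_RE.fullmatch(chunk):            # step 2: URL, kept whole
            tokens.append(chunk)
            continue
        pos = 0
        for m in ENTITY_RE.finditer(chunk):    # step 2: mentions and hashtags
            tokens.extend(split_plain(chunk[pos:m.start()]))
            tokens.append(m.group())           # step 3: entity kept intact
            pos = m.end()
        tokens.extend(split_plain(chunk[pos:]))  # step 3: leftover plain text
    return tokens


print(tokenize("(SED) at #mediaeval2013, http://bit.ly/16nITsf by @project_mmixer"))
```

With these assumed patterns, `an@example.com` yields no mention and falls through to plain splitting, while `@project_mmixer` survives with its underscore.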
(Post to follow with examples of the new tokenizer)
from twitter-tools.
James, can you incorporate this as appropriate? https://github.com/twitter/twitter-text-java
from twitter-tools.
Jimmy, thanks for pointing out that there was a Java version, I hadn't seen it before. I've incorporated the code, and it seems to work fine.
Delimiters
At the moment I'm concentrating on the tokenization. What characters, other than whitespace, do we want to use as delimiters? At the moment my list is:
_ - ? ! , ; : . ( ) [ ] @ # / \
There's also the issue of URLs. bit.ly URLs are case sensitive, which means that if we make them lowercase they will no longer work. At the moment I'm lowercasing everything except URLs; however, since the original case is preserved in the full text, I'm not sure it's an issue. Discussion is welcome.
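As a rough illustration of the "lowercase everything except URLs" behaviour described above, here is a small Python sketch. The helper name `normalize_case` and the prefix check used to recognise URLs are assumptions made for the example, not the actual analyzer code.

```python
def normalize_case(tokens):
    """Lowercase every token except URLs, whose paths may be case sensitive."""
    return [
        t if t.startswith(("http://", "https://")) else t.lower()
        for t in tokens
    ]


tokens = ["Incredibly", "#Drupal", "http://bit.ly/Z8ZoFe", "@DrupalCon"]
print(normalize_case(tokens))
```

Lowercasing the short link would turn `Z8ZoFe` into `z8zofe`, which is a different bit.ly key, so the URL token is passed through untouched.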
Stemming
I'm using Porter stemming, as it's well known and built into Lucene. At the moment it only stems non-entities, and leaves mentions/hashtags alone. Any comments on this are welcome.
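The "stem only non-entities" rule might look something like the Python sketch below. Note that `toy_stem` is a deliberately crude stand-in and not the Porter algorithm (so its output differs from the real stemmer); `is_entity` and `stem_tokens` are likewise hypothetical names. The point is only that entities bypass whatever stemmer is plugged in.

```python
def toy_stem(word):
    """Stand-in for a real stemmer: strips a couple of common suffixes.

    This is NOT the Porter algorithm, just a placeholder for the example.
    """
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def is_entity(token):
    """Assumed entity check: mentions, hashtags and URLs."""
    return token.startswith(("@", "#", "http://", "https://"))


def stem_tokens(tokens, stem=toy_stem):
    """Apply the stemmer to plain words only; entities pass through unchanged."""
    return [t if is_entity(t) else stem(t) for t in tokens]


print(stem_tokens(["laws", "#hashtags", "@users"]))
```

Passing a real Porter implementation as the `stem` argument would leave the entity handling unchanged.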
Stop Word Removal?
Is this something we also want to do?
Current Output
AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
att|get|secret|immun|from|wiretap|law|for|govern|surveil|http://vrge.co/ZP3Fx5|
want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
want|to|see|the|@verge|aston|martin|gt4|racer|tear|up|long|beach||http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219||
Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe to ensure blind accessibility contributor gets to @DrupalCon #Opensource
incredibli|good|new||#drupal|user|ralli|http://bit.ly/Z8ZoFe|to|ensur|blind|access|contributor|get|to|@drupalcon|#opensource|
We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
were|enter|the|quiet|hour|at|#amznhack||#rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz|
The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf supported by @linkedtv @project_mmixer @socialsensor_ip
the|2013|social|event|detect|task||sed||at|#mediaeval2013||http://bit.ly/16nITsf|support|by|@linkedtv|@project_mmixer|@socialsensor_ip|
this is an@example.com
thi|i|an|exampl|
this is @a_valid_mention and this_is_multiple_words
thi|i|@a_valid_mention|and|thi|i|multipl|word|
12354
12354|
this-is-lots!of!words.seperated.without~whitespace-cause=no+one-likes(whitespace)(do]they? U.S.A. U.K. What the hell?hello...hello.a.b..c.
thi|i|lot|of|word|seper|withoutwhitespac|causenoon|like|whitespac||do|thei||usa|uk|what|the|hell|hello|hello|ab|c|
this @should.not.work
thi|@should|not|work|
PLEASE BE LOWER CASE WHEN YOU COME OUT THE OTHER SIDE - ALSO A @VALID_VALID-INVALID
pleas|be|lower|case|when|you|come|out|the|other|side|||also|a|@valid_valid|invalid|
@reply @with #crazy ~#at
@reply|@with|#crazy||#at|
:@valid testing(valid)#hashtags. RT:@meniton (the last @mention is #valid and so is this:@valid), however this is@invalid
|@valid|test|valid|#hashtags||rt|@meniton||the|last|@mention|i|#valid|and|so|i|thi|@valid|||howev|thi|i|invalid|
from twitter-tools.
I'm ambivalent about stemming, will go with whatever the community decides.
I'll evaluate the stopwords issues from the systems perspective: if it doesn't slow down query latency/throughput and doesn't blow up the index size (too much), I'm fine with keeping stopwords.
I noticed that emails aren't handled properly. Not a big deal, maybe as a //TODO in the code.
from twitter-tools.
+1 on stemming, and perhaps we should keep stopwords. People will then have the option of removing stopwords from their queries or not.
from twitter-tools.
Saw the specification and thanks for all the hard work!
Just curious, why that particular set of characters? Why not { } or other POSIX punctuation characters?
from twitter-tools.
I deliberately left the set of characters incomplete to try to generate some feedback. I'm not really sure what the best set is, and I don't think the POSIX set is perfect (for example, we probably don't want to use '); however, I do think it's a good place to start.
The POSIX set contains:
] [ ! " # $ % & ' ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~ -
However this leaves out characters such as ¬, · and …, and we probably want to remove ' from the list for contractions.
I believe OS X replaces 3 periods with an ellipsis by default now, so certainly we should include that in the list.
The set that I would suggest is:
] [ ! " # $ % & ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~ - … ¬ ·
However, I was rather hoping that someone else would make some suggestions. Perhaps we should split on non-alphanumeric characters (except ') instead?
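For reference, the suggested set can be built programmatically. The Python sketch below assumes that non-entity text is split on whitespace plus the proposed delimiters; `DELIMITERS`, `SPLIT_RE` and `split_on_delimiters` are names invented for the example. Python's `string.punctuation` contains the same 32 ASCII characters as the POSIX punctuation class.

```python
import re
import string

# POSIX/ASCII punctuation, minus the apostrophe (to keep contractions whole),
# plus the three extra characters suggested above.
DELIMITERS = (set(string.punctuation) - {"'"}) | {"…", "¬", "·"}

# Character class matching any delimiter or whitespace; re.escape protects
# characters such as ], ^, - and \ that are special inside a class.
SPLIT_RE = re.compile("[" + re.escape("".join(sorted(DELIMITERS))) + r"\s]+")


def split_on_delimiters(text):
    """Split plain (non-entity) text on the proposed delimiter set."""
    return [t for t in SPLIT_RE.split(text) if t]


print(split_on_delimiters("We're entering the quiet hours… (do]they?"))
```

Because @ and # are in the set, this would only be applied after mentions, hashtags and URLs have been pulled out.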
from twitter-tools.
Haha, nice trick, that.
You've obviously given more thought to the tokenizer split set than I have!
The set you propose looks good to me.
However, I don't think splitting on everything but alphanum is a good idea because of potential non-ascii/latin characters. Actually, what is going to happen to non-ascii/latin characters?
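To make the concern about non-ASCII characters concrete, here is a small Python comparison of an ASCII-only "split on non-alphanumerics" rule with a Unicode-aware one. Both patterns are assumptions for illustration; how the Java tokenizer behaves depends on which character classes it actually uses.

```python
import re

# ASCII-only notion of "alphanumeric": accented letters count as delimiters.
ASCII_ALNUM_SPLIT = re.compile(r"[^A-Za-z0-9]+")
# Unicode-aware: for str patterns, \w covers letters and digits in any script.
UNICODE_ALNUM_SPLIT = re.compile(r"[\W_]+")


def split_with(pattern, text):
    """Split text with the given pattern, dropping empty pieces."""
    return [t for t in pattern.split(text) if t]


text = "quiet überwachung «bonjour»"
print(split_with(ASCII_ALNUM_SPLIT, text))
print(split_with(UNICODE_ALNUM_SPLIT, text))
```

The ASCII-only rule cuts `überwachung` at the `ü`, while the Unicode-aware rule keeps the word whole and still treats `«` and `»` as delimiters.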
from twitter-tools.
James, would you mind pushing a branch that fixes this, and also tweaking the wiki?
Thanks!
from twitter-tools.
I agree with @yubink; I think that alphanum-based tokenization is a good idea.
Some languages use special punctuation marks, for instance « » (fr) · (el) ؟ ؛ ، (ar, ur,fa) ( ) 【 】 (cn,jp).
I have 2 questions. Are the two expressions "#trec2013" and "trec2013" considered as two different terms in the index? Are tweets containing only "#trec2013" selected for the query "trec2013 guideline"?
from twitter-tools.
(For now) the tokenizer considers "#trec2013" and "trec2013" as different terms. I'm personally of the opinion that they should be treated separately, however discussion is very welcome.
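A toy inverted index shows what treating the two as different terms means for retrieval. This is a Python sketch with made-up document IDs and a hypothetical `build_index` helper, not the Lucene index itself.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return index


# Two made-up tweets, already tokenized with the hashtag preserved.
docs = {
    1: ["#trec2013", "guideline"],
    2: ["trec2013", "guideline"],
}
index = build_index(docs)

print(index.get("trec2013", set()))
print(index.get("#trec2013", set()))
```

Under this scheme the term `trec2013` does not match the tweet that only contains `#trec2013`; that tweet could still be returned for "trec2013 guideline" through the `guideline` term if the query terms are combined disjunctively.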
I'm not sure that it's worth spending much time tuning the tokenizer to work with other languages since (to the best of my knowledge) only English tweets are considered relevant. However, if there's enough demand then I've got nothing against adding them.
@lintool I'll try and push an updated version some time tomorrow with the new delimiters.
from twitter-tools.
This task has been completed and results have been merged into the trec2013-api branch.
from twitter-tools.