twitter-archive / twitter-text-js

A JavaScript implementation of Twitter's text processing library

twitter-text-js's Introduction

Deprecation Notice!

This repository has been merged into the twitter-text mono-repo to simplify development, testing, creating issues, and pull requests. This repo is now inactive; please continue all activity in the mono-repo and move existing issues there.

twitter-text-js's Issues

Discard mentions, hashtags and urls

Hello.

Sorry to bother you, but I need to know whether, given a text, I can get the same text back without its mentions, hashtags, and URLs.

Thanks a lot in advance.
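One possible approach, sketched below, is to take the entity positions and cut those spans out of the text. The helper name `stripEntities` is hypothetical, and the sketch assumes each entity carries an `indices: [start, end]` pair, as the library's `extract...WithIndices` functions (for example `extractUrlsWithIndices`) appear to return. The entity list here is written by hand so the example runs without the library.

```javascript
// Hypothetical helper: remove every span listed in `entities` from `text`.
// Each entity is assumed to look like { indices: [start, end] }.
function stripEntities(text, entities) {
  // Sort by start index so spans can be skipped from left to right.
  var sorted = entities.slice().sort(function (a, b) {
    return a.indices[0] - b.indices[0];
  });
  var result = '';
  var cursor = 0;
  sorted.forEach(function (entity) {
    var start = entity.indices[0];
    var end = entity.indices[1];
    if (start < cursor) { return; } // ignore spans that overlap an earlier one
    result += text.slice(cursor, start);
    cursor = end;
  });
  result += text.slice(cursor);
  // Collapse the whitespace left behind by the removed entities.
  return result.replace(/\s+/g, ' ').trim();
}

// Hand-written entities standing in for extractor output.
var text = 'hi @bob check #js at http://t.co/x now';
var entities = [
  { indices: [3, 7] },   // @bob
  { indices: [14, 17] }, // #js
  { indices: [21, 34] }  // http://t.co/x
];
console.log(stripEntities(text, entities)); // "hi check at now"
```

Whether to collapse the leftover whitespace is a design choice; drop the final `replace` call if the original spacing matters.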

Urls with an IP address are not extracted

E.g.

twttr.txt.extractUrlsWithIndices('http://127.0.0.1/')
[]

autoLink adds inappropriate attributes

twttr.txt.autoLink('hello http://twitter.com', {
  usernameClass: 'custom-user'
})

outputs:

hello <a href="http://twitter.com" usernameClass="custom-user" >http://twitter.com</a>
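A possible fix is to filter the options object before turning it into HTML attributes. The sketch below is not the library's code: `NON_ATTRIBUTE_OPTIONS` and `htmlAttrsFromOptions` are hypothetical names, and only `usernameClass` comes from this report; the other listed keys are illustrative assumptions.

```javascript
// Hypothetical list of option names that configure linking behavior and
// should never be emitted as HTML attributes. Only 'usernameClass' is taken
// from the report; the rest are illustrative.
var NON_ATTRIBUTE_OPTIONS = ['usernameClass', 'hashtagClass', 'urlClass', 'listClass'];

// Return a copy of `options` containing only keys that are safe to render
// as attributes on the generated <a> tag.
function htmlAttrsFromOptions(options) {
  var attrs = {};
  Object.keys(options).forEach(function (key) {
    if (NON_ATTRIBUTE_OPTIONS.indexOf(key) === -1) {
      attrs[key] = options[key];
    }
  });
  return attrs;
}

var attrs = htmlAttrsFromOptions({ usernameClass: 'custom-user', rel: 'nofollow' });
console.log(attrs); // { rel: 'nofollow' }
```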

URLs that contain @symbols are not correctly detected

For example, the following link is detected, but the returned indices do not cover the entire URL; they stop at the first @ symbol.

http://week.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/theWeekContent.do?BV_ID=@@@&contentId=12939975&programId=1073754900

Add classes to the tags surrounding entities

We should provide semantic classes for styling purposes: e.g., twitter-hashsymbol, twitter-atsymbol, twitter-emoji, twitter-text, etc.

We should drop / deprecate customizable tags because the correct tag is span.

Equivalent rb issue: twitter-archive/twitter-text-rb#110

Not allowed to pass options parameter to isInvalidTweet

I use

getTweetLength('tweety', {/*custom url shortener options*/});

but when I then call isInvalidTweet, it uses the default options for getTweetLength, which results in a different length and therefore an error.

Am I missing something?

Some invalid tweets (dm commands) are reported as being valid

These are reported as valid but should be invalid: 'd', 'd ', 'm', 'm ', 'dm', 'dm '

There are probably other commands that I'm not aware of that should also be treated as invalid.

twttr.txt.isValidTweetText('d')

true

Tweet length counter seems to be conflicting with Twitter's "Character Counting" guideline about Unicode normalization

Twitter's official Character Counting guideline says "Tweet length is measured by the number of codepoints in the NFC normalized version of the text", so "café" (U+0063 U+0061 U+0066 U+0065 U+0301) should be normalized to "café" (U+0063 U+0061 U+0066 U+00E9) and counted as 4 characters. A simple test shows that this is not the case:

var twitter = require('twitter-text');
console.log(twitter.getTweetLength('cafe\u0301'));

This prints "5", and a 30-fold repetition of "café" (which should count as 120 characters) is rejected as an invalid tweet.

By comparison, the Ruby implementation behaves differently. This test:

require "twitter-text"
include Twitter::Validation
p tweet_length("cafe" + 0x0301.chr("UTF-8"))

prints "4", and the 30-fold repetition passes validation.
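For reference, the counting rule quoted from the guideline can be sketched in plain JavaScript. This is only an illustration of NFC normalization plus code point counting, not the library's implementation; `nfcLength` is a hypothetical name, and the sketch ignores URL shortening entirely. `String.prototype.normalize` is an ES2015 feature, so older runtimes would need a polyfill.

```javascript
// Hypothetical helper: count code points in the NFC-normalized form of `text`.
function nfcLength(text) {
  var normalized = text.normalize('NFC');
  // Array.from iterates by code point, so an astral symbol counts once
  // even though it occupies two UTF-16 code units.
  return Array.from(normalized).length;
}

console.log('cafe\u0301'.length);     // 5 (UTF-16 code units, unnormalized)
console.log(nfcLength('cafe\u0301')); // 4
```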

short.me links not working

Hi,
I'm not sure that this is the right place to note this.
speedof.me did not get detected as a valid link on this tweet.

https://twitter.com/JeremiadLee/status/314246197902704641
https://twitter.com/JeremiadLee/status/314251493089890304

Add option extractUrlsWithoutProtocol

Why can't I set extractUrlsWithoutProtocol as an option when I call twttr.txt.autoLink()?
It's hardcoded as {extractUrlsWithoutProtocol: false}, so I have to change the library code to enable it.
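A minimal sketch of what honoring the caller's setting could look like, assuming the value is resolved once before extraction. `resolveExtractOptions` is a hypothetical helper, not part of the library, and it keeps the current hardcoded value as the default.

```javascript
// Hypothetical helper: build the extraction options from the options passed
// to autoLink, falling back to the current hardcoded default (false).
function resolveExtractOptions(options) {
  return {
    extractUrlsWithoutProtocol:
      Boolean(options && options.extractUrlsWithoutProtocol)
  };
}

console.log(resolveExtractOptions({}));
// { extractUrlsWithoutProtocol: false }
console.log(resolveExtractOptions({ extractUrlsWithoutProtocol: true }));
// { extractUrlsWithoutProtocol: true }
```

Defaulting to false keeps existing callers unaffected, which matters because linking bare domains changes the output for text that previously contained no links.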

Release new package to handle tweet lengths correctly

It seems it's worthwhile releasing a new package. The most recently released package doesn't calculate tweet lengths correctly.

Setup Travis CI

Twitter has updated the length of t.co urls

Twitter is going to be extending the maximum length of t.co wrapped links from 20 to 22 characters for non-https URLs, and 21 to 23 characters for https URLs.

source: https://dev.twitter.com/blog/upcoming-tco-changes
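To make the effect on length counting concrete, here is a small sketch that replaces each URL's own length with the wrapped length from the announcement. The function and constant names are hypothetical, the URL list is passed in by hand rather than extracted, and plain `text.length` is used, so the sketch ignores Unicode normalization.

```javascript
// Wrapped lengths from the announcement above.
var SHORT_URL_LENGTH = 22;       // non-https URLs
var SHORT_URL_LENGTH_HTTPS = 23; // https URLs

// Hypothetical helper: `urls` is the list of URL strings that occur in `text`.
function lengthWithWrappedUrls(text, urls) {
  var length = text.length;
  urls.forEach(function (url) {
    var wrapped = /^https:\/\//i.test(url)
      ? SHORT_URL_LENGTH_HTTPS
      : SHORT_URL_LENGTH;
    // Swap the URL's own length for the length of its t.co replacement.
    length += wrapped - url.length;
  });
  return length;
}

var url = 'http://example.com/a/very/long/path';
console.log(lengthWithWrappedUrls('see ' + url, [url])); // 26 (4 + 22)
```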

Consider generating the regular expressions as part of the build process instead of at runtime

E.g. this:

  var nonLatinHashtagChars = [];
  // Cyrillic
  addCharsToCharClass(nonLatinHashtagChars, 0x0400, 0x04ff); // Cyrillic
  addCharsToCharClass(nonLatinHashtagChars, 0x0500, 0x0527); // Cyrillic Supplement
  addCharsToCharClass(nonLatinHashtagChars, 0x2de0, 0x2dff); // Cyrillic Extended A
  addCharsToCharClass(nonLatinHashtagChars, 0xa640, 0xa69f); // Cyrillic Extended B
  // Hebrew
  addCharsToCharClass(nonLatinHashtagChars, 0x0591, 0x05bf); // Hebrew
  addCharsToCharClass(nonLatinHashtagChars, 0x05c1, 0x05c2);
  addCharsToCharClass(nonLatinHashtagChars, 0x05c4, 0x05c5);
  addCharsToCharClass(nonLatinHashtagChars, 0x05c7, 0x05c7);
  addCharsToCharClass(nonLatinHashtagChars, 0x05d0, 0x05ea);
  addCharsToCharClass(nonLatinHashtagChars, 0x05f0, 0x05f4);
  addCharsToCharClass(nonLatinHashtagChars, 0xfb12, 0xfb28); // Hebrew Presentation Forms
  addCharsToCharClass(nonLatinHashtagChars, 0xfb2a, 0xfb36);
  addCharsToCharClass(nonLatinHashtagChars, 0xfb38, 0xfb3c);
  addCharsToCharClass(nonLatinHashtagChars, 0xfb3e, 0xfb3e);
  addCharsToCharClass(nonLatinHashtagChars, 0xfb40, 0xfb41);
  addCharsToCharClass(nonLatinHashtagChars, 0xfb43, 0xfb44);
  addCharsToCharClass(nonLatinHashtagChars, 0xfb46, 0xfb4f);
  // Arabic
  addCharsToCharClass(nonLatinHashtagChars, 0x0610, 0x061a); // Arabic
  addCharsToCharClass(nonLatinHashtagChars, 0x0620, 0x065f);
  addCharsToCharClass(nonLatinHashtagChars, 0x066e, 0x06d3);
  addCharsToCharClass(nonLatinHashtagChars, 0x06d5, 0x06dc);
  addCharsToCharClass(nonLatinHashtagChars, 0x06de, 0x06e8);
  addCharsToCharClass(nonLatinHashtagChars, 0x06ea, 0x06ef);
  addCharsToCharClass(nonLatinHashtagChars, 0x06fa, 0x06fc);
  addCharsToCharClass(nonLatinHashtagChars, 0x06ff, 0x06ff);
  addCharsToCharClass(nonLatinHashtagChars, 0x0750, 0x077f); // Arabic Supplement
  addCharsToCharClass(nonLatinHashtagChars, 0x08a0, 0x08a0); // Arabic Extended A
  addCharsToCharClass(nonLatinHashtagChars, 0x08a2, 0x08ac);
  addCharsToCharClass(nonLatinHashtagChars, 0x08e4, 0x08fe);
  addCharsToCharClass(nonLatinHashtagChars, 0xfb50, 0xfbb1); // Arabic Pres. Forms A
  addCharsToCharClass(nonLatinHashtagChars, 0xfbd3, 0xfd3d);
  addCharsToCharClass(nonLatinHashtagChars, 0xfd50, 0xfd8f);
  addCharsToCharClass(nonLatinHashtagChars, 0xfd92, 0xfdc7);
  addCharsToCharClass(nonLatinHashtagChars, 0xfdf0, 0xfdfb);
  addCharsToCharClass(nonLatinHashtagChars, 0xfe70, 0xfe74); // Arabic Pres. Forms B
  addCharsToCharClass(nonLatinHashtagChars, 0xfe76, 0xfefc);
  addCharsToCharClass(nonLatinHashtagChars, 0x200c, 0x200c); // Zero-Width Non-Joiner
  // Thai
  addCharsToCharClass(nonLatinHashtagChars, 0x0e01, 0x0e3a);
  addCharsToCharClass(nonLatinHashtagChars, 0x0e40, 0x0e4e);
  // Hangul (Korean)
  addCharsToCharClass(nonLatinHashtagChars, 0x1100, 0x11ff); // Hangul Jamo
  addCharsToCharClass(nonLatinHashtagChars, 0x3130, 0x3185); // Hangul Compatibility Jamo
  addCharsToCharClass(nonLatinHashtagChars, 0xA960, 0xA97F); // Hangul Jamo Extended-A
  addCharsToCharClass(nonLatinHashtagChars, 0xAC00, 0xD7AF); // Hangul Syllables
  addCharsToCharClass(nonLatinHashtagChars, 0xD7B0, 0xD7FF); // Hangul Jamo Extended-B
  addCharsToCharClass(nonLatinHashtagChars, 0xFFA1, 0xFFDC); // half-width Hangul
  // Japanese and Chinese
  addCharsToCharClass(nonLatinHashtagChars, 0x30A1, 0x30FA); // Katakana (full-width)
  addCharsToCharClass(nonLatinHashtagChars, 0x30FC, 0x30FE); // Katakana Chouon and iteration marks (full-width)
  addCharsToCharClass(nonLatinHashtagChars, 0xFF66, 0xFF9F); // Katakana (half-width)
  addCharsToCharClass(nonLatinHashtagChars, 0xFF70, 0xFF70); // Katakana Chouon (half-width)
  addCharsToCharClass(nonLatinHashtagChars, 0xFF10, 0xFF19); // \
  addCharsToCharClass(nonLatinHashtagChars, 0xFF21, 0xFF3A); //  - Latin (full-width)
  addCharsToCharClass(nonLatinHashtagChars, 0xFF41, 0xFF5A); // /
  addCharsToCharClass(nonLatinHashtagChars, 0x3041, 0x3096); // Hiragana
  addCharsToCharClass(nonLatinHashtagChars, 0x3099, 0x309E); // Hiragana voicing and iteration mark
  addCharsToCharClass(nonLatinHashtagChars, 0x3400, 0x4DBF); // Kanji (CJK Extension A)
  addCharsToCharClass(nonLatinHashtagChars, 0x4E00, 0x9FFF); // Kanji (Unified)
  // -- Disabled as it breaks the Regex.
  //addCharsToCharClass(nonLatinHashtagChars, 0x20000, 0x2A6DF); // Kanji (CJK Extension B)
  addCharsToCharClass(nonLatinHashtagChars, 0x2A700, 0x2B73F); // Kanji (CJK Extension C)
  addCharsToCharClass(nonLatinHashtagChars, 0x2B740, 0x2B81F); // Kanji (CJK Extension D)
  addCharsToCharClass(nonLatinHashtagChars, 0x2F800, 0x2FA1F); // Kanji (CJK supplement)
  addCharsToCharClass(nonLatinHashtagChars, 0x3003, 0x3003); // Kanji iteration mark
  addCharsToCharClass(nonLatinHashtagChars, 0x3005, 0x3005); // Kanji iteration mark
  addCharsToCharClass(nonLatinHashtagChars, 0x303B, 0x303B); // Han iteration mark

  twttr.txt.regexen.nonLatinHashtagChars = regexSupplant(nonLatinHashtagChars.join(""));

With Regenerate, this could become:

var nonLatinHashtagChars = regenerate()
  // Cyrillic
  .addRange(0x0400, 0x04FF) // Cyrillic
  .addRange(0x0500, 0x0527) // Cyrillic Supplement
  .addRange(0x2DE0, 0x2DFF) // Cyrillic Extended A
  .addRange(0xA640, 0xA69F) // Cyrillic Extended B
  // Hebrew
  .addRange(0x0591, 0x05BF) // Hebrew
  .addRange(0x05C1, 0x05C2)
  .addRange(0x05C4, 0x05C5)
  .add(0x05c7)
  .addRange(0x05D0, 0x05EA)
  .addRange(0x05F0, 0x05F4)
  .addRange(0xFB12, 0xFB28) // Hebrew Presentation Forms
  .addRange(0xFB2A, 0xFB36)
  .addRange(0xFB38, 0xFB3C)
  .addRange(0xFB3E, 0xFB3E)
  .addRange(0xFB40, 0xFB41)
  .addRange(0xFB43, 0xFB44)
  .addRange(0xFB46, 0xFB4F)
  // Arabic
  .addRange(0x0610, 0x061A) // Arabic
  .addRange(0x0620, 0x065F)
  .addRange(0x066E, 0x06D3)
  .addRange(0x06D5, 0x06DC)
  .addRange(0x06DE, 0x06E8)
  .addRange(0x06EA, 0x06EF)
  .addRange(0x06FA, 0x06FC)
  .addRange(0x06FF, 0x06FF)
  .addRange(0x0750, 0x077F) // Arabic Supplement
  .addRange(0x08A0, 0x08A0) // Arabic Extended A
  .addRange(0x08A2, 0x08AC)
  .addRange(0x08E4, 0x08FE)
  .addRange(0xFB50, 0xFBB1) // Arabic Pres. Forms A
  .addRange(0xFBD3, 0xFD3D)
  .addRange(0xFD50, 0xFD8F)
  .addRange(0xFD92, 0xFDC7)
  .addRange(0xFDF0, 0xFDFB)
  .addRange(0xFE70, 0xFE74) // Arabic Pres. Forms B
  .addRange(0xFE76, 0xFEFC)
  .addRange(0x200C, 0x200C) // Zero-Width Non-Joiner
  // Thai
  .addRange(0x0E01, 0x0E3A)
  .addRange(0x0E40, 0x0E4E)
  // Hangul (Korean)
  .addRange(0x1100, 0x11FF) // Hangul Jamo
  .addRange(0x3130, 0x3185) // Hangul Compatibility Jamo
  .addRange(0xA960, 0xA97F) // Hangul Jamo Extended-A
  .addRange(0xAC00, 0xD7AF) // Hangul Syllables
  .addRange(0xD7B0, 0xD7FF) // Hangul Jamo Extended-B
  .addRange(0xFFA1, 0xFFDC) // half-width Hangul
  // Japanese and Chinese
  .addRange(0x30A1, 0x30FA) // Katakana (full-width)
  .addRange(0x30FC, 0x30FE) // Katakana Chouon and iteration marks (full-width)
  .addRange(0xFF66, 0xFF9F) // Katakana (half-width)
  .add(0xFF70) // Katakana Chouon (half-width)
  .addRange(0xFF10, 0xFF19) // \
  .addRange(0xFF21, 0xFF3A) //  - Latin (full-width)
  .addRange(0xFF41, 0xFF5A) // /
  .addRange(0x3041, 0x3096) // Hiragana
  .addRange(0x3099, 0x309E) // Hiragana voicing and iteration mark
  .addRange(0x3400, 0x4DBF) // Kanji (CJK Extension A)
  .addRange(0x4E00, 0x9FFF) // Kanji (Unified)
  .addRange(0x20000, 0x2A6DF) // Kanji (CJK Extension B)
  .addRange(0x2A700, 0x2B73F) // Kanji (CJK Extension C)
  .addRange(0x2B740, 0x2B81F) // Kanji (CJK Extension D)
  .addRange(0x2F800, 0x2FA1F) // Kanji (CJK supplement)
  .add(0x3003) // Kanji iteration mark
  .add(0x3005) // Kanji iteration mark
  .add(0x303B); // Han iteration mark

twttr.txt.regexen.nonLatinHashtagChars = nonLatinHashtagChars.toRegExp();

But it would be even better to not do it at runtime, but as part of a build process:

nonLatinHashtagChars.toString();
// returns a string literal that can be injected into a JS file as part of a regular expression literal
// '[\\u0400-\\u0527\\u0591-\\u05BF\\u05C1-\\u05C2\\u05C4-\\u05C5\\u05C7\\u05D0-\\u05EA\\u05F0-\\u05F4\\u0610-\\u061A\\u0620-\\u065F\\u066E-\\u06D3\\u06D5-\\u06DC\\u06DE-\\u06E8\\u06EA-\\u06EF\\u06FA-\\u06FC\\u06FF\\u0750-\\u077F\\u08A0\\u08A2-\\u08AC\\u08E4-\\u08FE\\u0E01-\\u0E3A\\u0E40-\\u0E4E\\u1100-\\u11FF\\u200C\\u2DE0-\\u2DFF\\u3003\\u3005\\u303B\\u3041-\\u3096\\u3099-\\u309E\\u30A1-\\u30FA\\u30FC-\\u30FE\\u3130-\\u3185\\u3400-\\u4DBF\\u4E00-\\u9FFF\\uA640-\\uA69F\\uA960-\\uA97F\\uAC00-\\uD7FF\\uFB12-\\uFB28\\uFB2A-\\uFB36\\uFB38-\\uFB3C\\uFB3E\\uFB40-\\uFB41\\uFB43-\\uFB44\\uFB46-\\uFBB1\\uFBD3-\\uFD3D\\uFD50-\\uFD8F\\uFD92-\\uFDC7\\uFDF0-\\uFDFB\\uFE70-\\uFE74\\uFE76-\\uFEFC\\uFF10-\\uFF19\\uFF21-\\uFF3A\\uFF41-\\uFF5A\\uFF66-\\uFF9F\\uFFA1-\\uFFDC]|[\\uD840-\\uD868\\uD86A-\\uD86D][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDEDF\\uDF00-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1F]|\\uD87E[\\uDC00-\\uDE1F]'

This way, the source code (before building) is still very readable/maintainable, but the built code is optimized for run-time performance.

Note that using Regenerate would also solve this problem with astral symbols:

  // -- Disabled as it breaks the Regex.
  //addCharsToCharClass(nonLatinHashtagChars, 0x20000, 0x2A6DF); // Kanji (CJK Extension B)

Would you be interested in a pull request that ports all the regular expressions to Regenerate + adds a simple build script?
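Some background on why the astral range is awkward: without Unicode-aware regular expressions, JavaScript matches UTF-16 code units, so a code point above U+FFFF has to be written as a surrogate pair rather than as a single character class entry. The sketch below is illustrative only (`toSurrogatePair` is a hypothetical helper, not part of Regenerate) and shows where the `\uD840` and `\uD869` alternatives in the generated string above come from.

```javascript
// Hypothetical helper: split an astral code point (> 0xFFFF) into its
// UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

function hex(n) {
  return n.toString(16).toUpperCase();
}

// The two ends of CJK Extension B:
console.log(toSurrogatePair(0x20000).map(hex)); // [ 'D840', 'DC00' ]
console.log(toSurrogatePair(0x2A6DF).map(hex)); // [ 'D869', 'DEDF' ]
```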

extractUrls doesn't handle Non-Latin characters

This seems to be an intentional decision, but extractUrls does not handle links that contain non-Latin characters. For instance, these URLs would not be properly extracted:

extractUrls would return:

It seems the only reason that this is the case, according to the README, is that in Japanese/Korean/Chinese, sometimes links are not followed by a space. The behavior is consistent with what I see on twitter.com. To me it seems like extractUrls should be simpler and just delimit based on spaces which would allow uncommon characters, as this is a more common use case (correct me if I'm wrong on that though). And for the use case of twitter.com, since links are highlighted as you type them, Asian tweeters will know to stick a space between links and their text.

Was there some discussion on going one way or the other on this?

No way to autoLink URLs with no protocol?

If you want to use autoLink but have it link "google.com", there doesn't appear to be a way to do this even though the option is available in some methods. Is this intentional or can I submit a PR that will check passed-in options?

I'd probably change this line to look in the passed in options and if that option is set, use it instead...

Separate files for hashtags, uri etc.

The library is now pretty large. If I only want to parse hashtags on the client, I don't want to load the whole thing.

There are clearly some different modules which can be logically separated but still share some regexps.

Thanks,
Oleg

RTL Check throwing error on IE7

When used in IE7 (or IE8 in compatibility mode), line 458 throws an error:

if (hashtag[0].match(twttr.txt.regexen.rtl_chars)){

IE7 doesn't support [0] for getting the first character of a string; using charAt(0) should fix it:

if (hashtag.charAt(0).match(twttr.txt.regexen.rtl_chars)){

Document the options

We should document the options.

URL link with a CLASS

Is it possible to generate a URL link with a class attached to it?

Not compatible with Twitter Web Intents (at least in an AMD environment).

The Twitter text library is not compatible with Twitter Web Intents because it overwrites the twttr object on window at the bottom of the file. The environment is an AMD (require.js) module-based setup with both libraries shimmed.

A quick fix I implemented to solve this is I replaced the check for the existence of window and then manually extended the window object:

  if (typeof window != 'undefined') {
    if (window.twttr) {
      var prop;
      for (prop in twttr) {
        window.twttr[prop] = twttr[prop];
      }
    }
    else
      window.twttr = twttr;
  }

Syntax error in javascript

I downloaded the twitter-text.js file and am getting a syntax error around the emoticon autolinking.

Uncaught SyntaxError: Invalid regular expression: /(8-#|8-E|+-(|\@|O|<|:~(|}:o{|:-[|>o<|X-/|[:-]-I-|////Ö\\|(|:|/)|?:*)|( | ))/: Nothing to repeat

It's at line 159.

&#hashtag is a formal hashtag

Line 214:

twttr.txt.regexen.hashtagBoundary = regexSupplant(/(?:^|$|[^&a-z0-9_#{latinAccentChars}#{nonLatinHashtagChars}])/);

In addition, /#hashtag is not a formal hashtag.
How about changing the line like this?

twttr.txt.regexen.hashtagBoundary = regexSupplant(/(?:^|$|[^/a-z0-9_#{latinAccentChars}#{nonLatinHashtagChars}])/);

Escape full-width characters

The twitter-text-js code contains full-width characters such as ＠ (U+FF20) and ＃ (U+FF03), and they seem to cause problems in certain environments (especially some mobile web browsers). Those characters should be escaped.

change behavior of #hash autolinks

Currently, #hash2 in the string "#hash1/#hash2" is not autolinked because / is not a valid boundary. Allowing / as a valid hashtagBoundary is a possible fix, but it would also cause #hash to be mistakenly autolinked in "possible.url/#hash". Still, this might be the more desirable behavior.

Can I use it in PHP projects?

I'm developing a project using the Laravel PHP framework.
Can I use it in PHP projects or Ruby only?

.sx should be treated as a valid ccTLD

.sx domains have been available since mid-November 2012: http://www.register.eu/UK/news/reu-SX-domain-now-available-for-all.asp

Originally reported by a Twitter user here: http://twitter.com/KenCarpenter/status/310219988235612160

API docs

Should document the available methods and options.

Library does not detect valid ccTLDs in extractUrl method

I am trying to use this library to build a prevalidator for tweets before they are sent to the Twitter API. However, it does not recognize URLs such as bit.ly/12345 as valid URLs, while the Twitter API does wrap these non-http ccTLD URLs, producing a longer tweet that is then rejected.

I believe this bug is also present on the Twitter website and can be reproduced by filling the tweet box with up to 140 characters, including a bit.ly link without http. It will be impossible to send the tweet.

Some RegExp optimizations

I give full credit to closure compiler for this.
I cleaned up the output a lot; the warnings all carry the same message: "References to the global RegExp object prevents optimization of regular expressions."

twitter-text.js:787: WARNING - References to the global RegExp object prevents optimization of regular expressions.
        RegExp.rightContext.match(twttr.txt.regexen.endMentionMatch)) {
        ^

twitter-text.js:817: WARNING 
      var before = RegExp.$2, url = RegExp.$3, protocol = RegExp.$4, domain = RegExp.$5, path = RegExp.$7;
                   ^                ^                     ^                   ^                 ^

twitter-text.js:860: WARNING 
          url = RegExp.lastMatch;
                ^

twitter-text.js:1268: WARNING
      return ((typeof string === "string") && string.match(regex) && RegExp["$&"] === string);
                                                                     ^

twitter-text.js:1272: WARNING
    return (!string || (string.match(regex) && RegExp["$&"] === string));
                                               ^
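As an illustration of how the last two warnings could be avoided, the sketch below reads the matched text from the array returned by `match` instead of from the global `RegExp["$&"]`. `isFullMatch` is a hypothetical name, and the sketch assumes a regex without the `g` flag, where element 0 of the result is the matched substring.

```javascript
// Hypothetical helper: true when `regex` matches `string` in its entirety,
// without consulting the global RegExp object.
function isFullMatch(string, regex) {
  if (typeof string !== 'string') {
    return false;
  }
  var match = string.match(regex);
  // match[0] holds the matched text, the same value RegExp["$&"] exposes.
  return match !== null && match[0] === string;
}

console.log(isFullMatch('abc', /[a-z]+/));  // true
console.log(isFullMatch('abc1', /[a-z]+/)); // false
```

The same pattern applies to the `RegExp.$2` style references: the capture groups are available as `match[2]`, `match[3]`, and so on.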