slevithan / xregexp
Extended JavaScript regular expressions
Home Page: http://xregexp.com/
License: MIT License
@mathiasbynens, @walling, this issue picks up from #25, since that's now a closed/merged pull request.
Prior to merging the opt-in astral support from the Unicode Categories Astral addon into the (default) Unicode Categories addon (which is automatically included in xregexp-all.js, and therefore in the npm package), following are the changes I think would be beneficial:
Prerequisite:
Data for base categories like \p{L} needs to be added, not just \p{Ll}, \p{Lu}, etc.
Changes:
The XRegExp.addUnicodePackage function in Unicode Base should change from accepting an object with BMP data and an object with optional aliases to instead accept the following array:
[
{
name: 'Ll',
alias: 'Lowercase_Letter', // optional; used to support full category names
bmp: '0000-FFFF', // compressed BMP data or null
astral: '010000-10FFFF' // compressed astral data or null
},
…
]
The above data will be stored in the private unicode object, without any preprocessing. Two new private lookup objects will be added: bmp and astral. These won't be populated automatically, but instead augmented on first use of each Unicode name in a regex. In other words, these are used to cache generated data.
Astral ranges with surrogate pairs will be built and cached in JavaScript code, on first use.
For scripts and blocks that exist only within astral planes, the bmp property of the objects accepted by addUnicodePackage should be set to null. For addons that include astral support, the astral property should always be included, with null as the value for properties that have no astral code points. The astral property shouldn't be included by addons that don't yet include astral support.
The \p{…} (etc.) syntax token handler in Unicode Base should be updated to check XRegExp.isInstalled("astral") in its handler (main) function. If true, combine data from the bmp and astral objects, and throw a SyntaxError if the match scope is "class". The trigger function currently used in unicode-categories-astral.js will no longer be necessary.
Since the BMP and astral data will be split up, these changes shouldn't inflate unicode-categories.js too much. At least, BMP data will not be included twice.
With these changes in place, separate BMP and all-plane addons won't be needed, and users can opt in or out of astral support at any time.
in loveencounterflow@a81f8b2, i try to make it so that, when printed to the console (in NodeJS, using require('util').inspect), an XRegExp object is represented by its input pattern, not its compiled regular-JS representation.
simple reason: i have to print out a lot of values that may contain XRegExp objects. when you combine a few advanced features, the output quickly gets incredibly long. for example, var x = new XRegExp('^\\p{L}+$'); console.log( x ) will cause an output several hundred characters long that contains characters from scripts all around the world:
{ /^[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ ... ... ... ... ... ... ... ... ... ...
which not only hides the intention of the pattern, it also makes the console (and the textarea i'm writing this in) grind to a halt (almost; i shortened the above quote for fear it could render this very page unusable).
my uninformed patch seems to work when you do console.log( x + '' ), but not without adding that string literal. i think it would be much more helpful to have the input pattern displayed; as it stands,
RegExp
structurally already, so making it more readable and sensible would be a great idea IMHO.
Background info:
Because the XRegExp function returns a nonprimitive value, ES rules dictate that it can't be used as a constructor. I.e., the returned regexes inherit from RegExp.prototype rather than XRegExp.prototype, regardless of whether new is used. XRegExp v2.0.0 attaches XRegExp.prototype methods directly to regex objects when they are created or copied by XRegExp.
Going forward:
In browsers that support __proto__ (all but IE), XRegExp will set the prototype object of regexes created or copied by XRegExp to XRegExp.prototype. The main benefit of this is performance, especially when many properties are added to XRegExp.prototype (the Prototype Methods addon currently adds six, which isn't so bad, but users can add as many as they want). There are minor secondary benefits for instanceof, getPrototypeOf, isPrototypeOf, etc.
Because XRegExp.prototype will itself be a regex created by new RegExp(), regexes with swapped prototype objects will continue to inherit from RegExp, in addition to XRegExp. In browsers that don't support __proto__, regexes will continue to inherit from RegExp and have XRegExp.prototype properties assigned as own properties.
In all cases, XRegExp.isRegExp will continue to work. instanceof, constructor, and Object.prototype.toString tests against RegExp will also continue to work just fine for all regexes, regardless of whether they are native, XRegExp-augmented, or XRegExp-created. In other words, there should be no backward-compatibility issues in any browser.
Idea: Add a new function that works similarly to String.prototype.lastIndexOf, except that it accepts a regex to search for and returns a match array (with index and backreference properties) like that returned by exec.
You can sometimes get the last match by pop-ing the array returned by String.prototype.match when provided a regex with /g, but that's sub-optimal for a variety of reasons:
- match often doesn't work at all since matches can overlap. The last match may be entirely different if it's forced to start after all prior matches.
- The results of match with /g are simple strings, without any extended info (no match index or backreferences).
The proposed execLast function would essentially loop backward from the end of the string, adding one prior character on each iteration, and testing the regex against the starting position of the sliced string. Efficiency would be improved by wrapping the regex in ^(?:...), and by using flag /y in browsers that support it. Alternatively, I could leave off the anchor and /y, and perform something akin to a binary search. Either approach would, in effect, make all quantifiers nongreedy.
Feedback is very welcome, even if just to say that this would or would not be useful to you. Alternative name suggestions are also welcome.
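For illustration, the backward loop described above could look roughly like the following. This is only a sketch under assumptions: execLast is the proposed (not existing) name, it relies on native support for the sticky flag /y, and it is not optimized in any way.

```javascript
// Hypothetical sketch of the proposed execLast; not part of XRegExp's API.
function execLast(str, regex) {
  // Copy the regex with /y and without /g, so each attempt is anchored
  // at exactly one starting position.
  const flags = regex.flags.replace(/[gy]/g, '') + 'y';
  const sticky = new RegExp(regex.source, flags);
  // Walk backward from the end of the string, one start position at a time.
  for (let pos = str.length; pos >= 0; pos--) {
    sticky.lastIndex = pos;
    const match = sticky.exec(str);
    if (match) return match; // exec-style array with index and backreferences
  }
  return null;
}

// The rightmost possible start wins, so this finds '3' at index 8, not '333'.
const lastDigits = execLast('a1b22c333', /\d+/);
console.log(lastDigits[0], lastDigits.index);
```

The usage line shows the side effect mentioned above: because the search stops at the rightmost starting position, quantifiers behave as if they were nongreedy.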
How do you generate and update the Unicode ranges when a new version of Unicode comes along?
Would you be interested in a script that parses UnicodeData.txt
and generates the ranges for you?
Any plans on supporting negative and positive mode modifiers (?letters), such as (?i) and (?-i), in the middle of the regex?
For example: Input : (?i)te(?-i)st
Matches: test, TEst, but not teST or TEST.
http://www.regular-expressions.info/refmodifiers.html
Thanks!
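Until something like this exists, one possible workaround (a sketch only, not an XRegExp feature) is to expand the case-insensitive part of the pattern into character classes by hand. The helper name anyCase is made up for this example, and it assumes the input is plain ASCII literal text without regex metacharacters.

```javascript
// Hypothetical helper: turn literal ASCII text into a case-insensitive pattern
// fragment, e.g. 'te' becomes '[tT][eE]'.
function anyCase(literal) {
  return literal.replace(/[a-z]/gi, (ch) => '[' + ch.toLowerCase() + ch.toUpperCase() + ']');
}

// Rough equivalent of (?i)te(?-i)st
const mixedCase = new RegExp('^' + anyCase('te') + 'st$');

console.log(mixedCase.test('test'), mixedCase.test('TEst')); // both match
console.log(mixedCase.test('teST'), mixedCase.test('TEST')); // neither matches
```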
I need something like regex.match("1a45", "\d", 1), the 3rd parameter is fromIndex which means match the string "1a45" from the second character 'a'. But unlike the lastIndex of JS RegExp, the character at fromIndex must be matched, otherwise it's not successful.
Is this feature supported by XRegExp? Thanks!
Hi slevithan,
first thanks for your great library. Really, really appreciating your work!
Don't know if I'm doing something wrong, but consider the following:
var url = 'page/edit/en/4f55fbbab51bda0df1000001/unnamed'
var re = XRegExp('^page/edit/(?<language>[a-z]{2})/(?<entityId>[a-z0-9]{24})/(?:.*)$');
var match = XRegExp.exec(url, re);
console.log(match);
/**
output:
0: "page/edit/en/4f55fbbab51bda0df1000001/unnamed",
1: "en",
2:"4f55fbbab51bda0df1000001",
entityId:"4f55fbbab51bda0df1000001",
index:0,
input:"page/edit/en/4f55fbbab51bda0df1000001/unnamed",
language:"en"
*/
My problem is that i have to capture the "unnamed" part of the url (?:.*). The match object doesn't hold me the value of "unnamed"...
Is XRegExp not able to mix named and unnamed captures, or am I am missing something?
Thanks!
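For what it's worth, (?:...) is a non-capturing group, so nothing inside it is stored on the match. A minimal sketch of the difference, using native named groups (ES2018) as a stand-in for XRegExp's identical (?<name>...) syntax; the group name slug is made up for the example:

```javascript
const pageUrl = 'page/edit/en/4f55fbbab51bda0df1000001/unnamed';

// Same pattern, but the trailing (?:.*) is replaced with a capturing group.
const pageRe = /^page\/edit\/(?<language>[a-z]{2})\/(?<entityId>[a-z0-9]{24})\/(?<slug>.*)$/;
const pageMatch = pageRe.exec(pageUrl);

console.log(pageMatch.groups.slug); // 'unnamed'
console.log(pageMatch[3]);          // also 'unnamed', by position
```

A plain (.*) would work just as well if the name isn't needed; it would then only be available by number.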
See #28 for related info.
Can be provided via the flags argument or provided inline. Can be combined with any other XRegExp flags. Examples:
// Via flags argument
XRegExp('^\\p{L}+$', 'Am');
// Inline
XRegExp('(?Am)^\\p{L}+$');
PHP allows the short form \pL instead of \p{L} but it isn't working in xregexp 2.
This would take the burden off @walling, who maintains https://github.com/walling/xregexp.
We could write a simple script that concatenates all files together in the right order. It could be used as a post-commit hook.
The only thing that would need to change in the XRegExp source code is the way XRegExp is being exposed. @walling simply uses module.exports for this, which works fine in Node.js, but with just a few more lines we could support exporting to Narwhal, RingoJS, Rhino, and AMD loaders like RequireJS as well. I do this in Punycode.js as follows: https://github.com/bestiejs/punycode.js/blob/a6e30c4e2ce7a9a569bc2c84a3435bd5612be59f/punycode.js#L493-510
Hi, we are trying to make a JavaScript search function handle regexes. We included xregexp-all in our library and tried the following.
regex = XRegExp('\\bβ', 'gi'),
str = 'The the test data has βa:ŋi in it.',
parts;
undefined
regex.test(str);
false
regex = XRegExp('\\bñ', 'gi'),
str = 'The the test data has ña:ŋi in it.',
parts;
undefined
regex.test(str);
false
regex = XRegExp('\\bç', 'gi'),
str = 'The the test data has ça:ŋi in it.',
parts;
undefined
regex.test(str);
false
regex = XRegExp('\\bg', 'gi'),
str = 'The the test data has ga:ŋi in it.',
parts;
undefined
regex.test(str);
true
regex = XRegExp('\\bあ', 'gi'),
str = 'The the test data has あa:ŋi in it.',
parts;
undefined
regex.test(str);
false
it works with plain English alphabet but other characters don't seem to be recognised.
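The likely cause is that \b in JavaScript is defined in terms of \w, which only covers ASCII letters, digits, and underscore, so there is no "boundary" between a space and β. One possible workaround, sketched here with native Unicode property escapes and lookbehind (both ES2018, so this assumes a reasonably modern engine), is to spell out the boundary explicitly. The helper name startsWord is hypothetical, and it assumes the letter needs no regex escaping.

```javascript
// Hypothetical helper: match `letter` only when it is not preceded by a
// Unicode letter, mark, digit, or underscore (a Unicode-aware "\b" at word start).
function startsWord(letter) {
  return new RegExp('(?<![\\p{L}\\p{M}\\p{N}_])' + letter, 'iu');
}

console.log(startsWord('β').test('The the test data has βa:ŋi in it.')); // true
console.log(startsWord('あ').test('The the test data has あa:ŋi in it.')); // true
console.log(startsWord('β').test('alphaβ')); // false, β is inside a word
```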
Providing the string 'all' to XRegExp.install or XRegExp.uninstall currently serves as a shortcut to add or remove all optional features. This shortcut is future hostile since new versions of XRegExp may include new optional features that current users do not mean to install or uninstall. The shortcut should therefore be removed. Users will still be able to add or remove optional features by explicitly naming them.
Back story:
Loading XRegExp v1.5.x twice in the same frame causes a descriptive error to be thrown.
XRegExp v2.0.0 does not throw the error (since the error could be frustrating in the case of browser plugins or libraries that bundle XRegExp), but it still avoids loading twice. It does this by checking whether XRegExp is defined, and if so, it does not overwrite the variable. The script is silently skipped.
Going forward:
I think it would be better to stop guarding against running twice. v2.0.0 already made it easier to avoid the related issues since native methods are no longer overridden by default, and it's easy to rename XRegExp for user-scripts or to include an older version without conflicts, etc.
To go with this change, I'll need to ensure that everything works correctly if you load the script a second time after running XRegExp.install('natives'). In such cases, XRegExp.uninstall('natives') will simply revert to the versions of native methods that were present when XRegExp last loaded.
Note that the private list of added syntax/flag tokens will be tracked per XRegExp load. In other words, if you load XRegExp twice and both instances use the global name XRegExp, you might lose previously added tokens. The tokens can be re-added by reloading the relevant addons. Separate instances of XRegExp that use modules or different global names to avoid bashing each other will be able to use independent token lists (this already works in v2.0.0).
It appears the download link is gone on xregexp.com and I cannot find a tag, so where can we download it? We need it to provide backward compatibility.
This will require wrapping the concatenated source files using an intro.js and outro.js file, to avoid creating the XRegExp global variable when loaded as a RequireJS module.
For various reasons, the XRegExp.exec and XRegExp.replace functions make copies of their provided regexes, sometimes with the addition or removal of flags /g and/or /y. For improved performance, these copies should be cached on a regex's xregexp object. The cached copies can be shared by all XRegExp functions that benefit from their use.
XRegExp.test, XRegExp.forEach, XRegExp.split, and the new XRegExp.replaceEach all rely on XRegExp.exec or XRegExp.replace, so they will share the performance improvements.
This should also make XRegExp.exec fast enough to allow the private and performance-sensitive runTokens function to take advantage of XRegExp.exec's sticky-mode matching, rather than reinventing the sticky wheel.
The polyfill for String.prototype.replace incorrectly throws a SyntaxError: Invalid token for valid replacement values when natives are installed in XRegExp version 3.0.0-pre.
Source: xregexp.js, line 1433
The following code throws a SyntaxError: Invalid token $% error in XRegExp 3.0.0-pre:
XRegExp.install('natives');
'abc'.replace('b', '$%'); // throws "SyntaxError: Invalid token $%"
If you run the above code without installing natives, modern browsers (Chrome and Firefox) return the correct result without throwing an error:
'abc'.replace('b', '$%'); // returns "a$%c"
When XRegExp encounters a $ character in the replacement string that is not followed by $, &, `, ', n, or nn, XRegExp should simply leave that $ in the output as-is instead of throwing a SyntaxError.
According to ECMA-262 (PDF) § 15.5.4.11:
If replaceValue is [not] a function ... let newstring denote the result of converting replaceValue to a string. The result is a string value derived from the original input string by replacing each matched substring with a string derived from newstring by replacing characters in newstring by replacement text as specified in the following table. These $ replacements are done left-to-right, and, once such a replacement is performed, the new replacement text is not subject to further replacements. For example, "$1,$2".replace(/(\$(\d))/g, "$$1-$1$2") returns "$1-$11,$1-$22". A $ in newstring that does not match any of the forms below is left as is.
Characters | Replacement text |
---|---|
$$ | $ |
$& | The matched substring. |
$` | The portion of string that precedes the matched substring. |
$' | The portion of string that follows the matched substring. |
$n | The nth capture, where n is a single digit 1-9 and $n is not followed by a decimal digit. If n≤m and the nth capture is undefined, use the empty string instead. If n>m, the result is implementation-defined. |
$nn | The nnth capture, where nn is a two-digit decimal number 01-99. If nn≤m and the nnth capture is undefined, use the empty string instead. If nn>m, the result is implementation-defined. |
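To make the "left as is" rule concrete, here is a simplified sketch of replacement-text expansion that follows the table above. It is not XRegExp's implementation; the function name expandReplacement is made up, and it simplifies one case: a two-digit $nn that exceeds the number of groups is left alone rather than being retried as $n followed by a digit.

```javascript
// Hypothetical sketch: expand $ tokens in `template` for one exec-style match.
function expandReplacement(template, match) {
  const input = match.input;
  const end = match.index + match[0].length;
  // Only the documented forms are recognized; any other $ is never touched.
  return template.replace(/\$(?:([$&`'])|(\d\d?))/g, (token, symbol, digits) => {
    if (symbol === '$') return '$';
    if (symbol === '&') return match[0];
    if (symbol === '`') return input.slice(0, match.index);
    if (symbol === "'") return input.slice(end);
    const n = Number(digits);
    if (n >= 1 && n < match.length) {
      return match[n] === undefined ? '' : match[n];
    }
    return token; // no such group: leave the token as is
  });
}

const bMatch = /b/.exec('abc');
console.log(expandReplacement('$%', bMatch));   // '$%' (left as is, no error)
console.log(expandReplacement('[$&]', bMatch)); // '[b]'
```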
Edit: Original title: Add XRegExp.matchAll
Create a new function called XRegExp.matchAll. This will work the same as String.prototype.match with /g except for the following details:
- It returns an empty array (rather than null) if no match is found.
- It does not require flag /g. In other words, it works the same for regexes with or without the global flag, and never acts as an alias of exec.
- It does not convert non-RegExp search values to regex objects. Instead, a TypeError is thrown.
- It handles lastIndex more consistently, compared to the native String.prototype.match. In other words, global regexes always have their lastIndex set to 0 upon completion, and non-global regexes never have their lastIndex modified from its original value. When using the native String.prototype.match with /g, IE (<= 8 ?) does not reset lastIndex to 0 upon completion.
This new function should also be mapped/aliased as XRegExp.prototype.matchAll in the XRegExp Prototype Methods addon.
Background details:
XRegExp v2.0.0 already includes a version of String.prototype.match with cross-browser lastIndex fixes, but it cannot be used without first running XRegExp.install('natives'). It does not include the other differences mentioned above. All other fixed/extended natives already have a corresponding XRegExp function that does not require overriding natives (XRegExp.exec/test/replace/split).
XRegExp doesn't need an equivalent of String.prototype.match without /g (i.e., match instead of matchAll), because that is already provided by XRegExp.exec. More details on the rationale for adding matchAll but not match can be found here.
String.prototype.match with /g is the last place where XRegExp users need to use flag /g or fiddle with lastIndex. With XRegExp.matchAll in place, XRegExp really will live up to its claim that it "frees you from worrying about pesky inconsistencies in cross-browser regex handling and the dubious lastIndex property."
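A rough sketch of the lastIndex and TypeError handling this implies, written against native regexes only. The function name matchAllSketch is made up, and this is not the code that shipped; it just illustrates working on a copy so the caller's regex is left in a predictable state.

```javascript
// Hypothetical sketch of the matchAll behavior described in this issue.
function matchAllSketch(str, regex) {
  if (!(regex instanceof RegExp)) {
    throw new TypeError('regex must be a RegExp object');
  }
  // Work on a copy that always has /g, so the caller's flags don't matter.
  const flags = regex.global ? regex.flags : regex.flags + 'g';
  const copy = new RegExp(regex.source, flags);
  const matches = String(str).match(copy) || []; // empty array instead of null
  // Global regexes end at lastIndex 0; non-global ones are never modified.
  if (regex.global) regex.lastIndex = 0;
  return matches;
}

console.log(matchAllSketch('a1b2', /\d/)); // works without /g
console.log(matchAllSketch('abc', /\d/g)); // empty array, not null
```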
Edit: Original title: Remove \p{Assigned}
The combined BMP and astral data for \p{Assigned} is nearly 7 KB uncompressed, making it easily one of the heftiest Unicode properties supported by XRegExp. However, it adds no value since the Unicode Categories addon already supports \p{Cn} (and its full name, \p{Unassigned}), which is the exact inverse of \p{Assigned}. In other words, you can match the same characters as \p{Assigned} by using \P{Cn} or \p{^Cn}.
UTS #18 includes \p{Assigned} as one of the properties required for Level 1 Unicode support. However, unlike all other Level 1 properties, the UnicodeSet application on unicode.org does not support \p{Assigned}.
This change breaks backward compatibility, but is not expected to affect many, if any, XRegExp users. The Unicode Properties addon which includes \p{Assigned} was only added very recently in the XRegExp 2.0.0 release, and most people are more familiar with \P{Cn}, which will continue to work. Java, .NET, and PCRE all support \P{Cn} but not \p{Assigned}. (Perl and Oniguruma support both \P{Cn} and \p{Assigned}.)
(Note that XRegExp cannot support \p{Assigned} via a scripted inversion of the data used by \p{Cn} because of the complexity of the surrogate-pair-based ranges in the astral data.)
Proposed name: XRegExp.replaceSet. Edit: New name: XRegExp.replaceEach.
Create a new function called XRegExp.replaceSet that provides sugar for performing multiple sequential replacements. It will accept two arguments, str {String} and replacements {Array}, and return a new string with all replacements applied.
Details:
- Replacement strings can use ${name}, $0, etc.
- Each replacement can specify scope as 'one' or 'all' via the third item in a replacement array. This follows the XRegExp.replace function, where the optional scope argument overrides the state of /g.
Usage example:
XRegExp.replaceSet(str, [
[XRegExp('(?<z>z)'), 'a${z}'],
[/y/gi, 'b'],
[/x/g, 'c', 'one'], // scope 'one' overrides /g
[/w/, 'd', 'all'], // scope 'all' overrides lack of /g
['v', 'e', 'all'], // scope 'all' allows replace-all for strings
[/u/g, function ($0) {
return 'f' + $0.toUpperCase();
}]
]);
Rationale:
To get the same functionality with XRegExp v2.0.0 (without any custom sugar), you'd have to write a pyramid of doom:
XRegExp.replace(
XRegExp.replace(
XRegExp.replace(
XRegExp.replace(
XRegExp.replace(
XRegExp.replace(
str, XRegExp('(?<z>z)'), 'a${z}'
), /y/gi, 'b'
), /x/g, 'c', 'one'
), /w/, 'd', 'all'
), 'v', 'e', 'all'
), /u/g, function ($0) {
return 'f' + $0.toUpperCase();
}
)
You could avoid this by extending String.prototype with a method that calls XRegExp.replace, but using XRegExp.replaceSet would still be cleaner and shorter.
Implementation:
Something simple like this:
XRegExp.replaceSet = function (str, replacements) {
var i, r;
for (i = 0; i < replacements.length; ++i) {
r = replacements[i];
str = XRegExp.replace(str, r[0], r[1], r[2]);
}
return str;
};
I stripped my problem down to this short test.html:
<!DOCTYPE HTML>
<html>
<body>
<script type="text/javascript" src="xregexp-all.js"></script>
<script>
XRegExp.addToken( /é/, function () {return "[eé]"} );
alert(XRegExp.build("élé",null,"").test("élé"));// true
alert(XRegExp.build(" é",null,"").test(" é"));// true
alert(XRegExp.build(" élé",null,"").test(" élé"));// false
</script>
</body>
</html>
The third regexp tests to false on Firefox Aurora 30.0a2 while it delivers "true" as expected on Firefox 26.
From page 29 of O’Reilly's 2nd edition of Regular Expressions Cookbook by Levithan and Goyvaerts:
\Q suppresses the meaning of all metacharacters, including the backslash, until \E. If you omit \E, all characters after the \Q until the end of the regex are treated as literals.
Example:
/\QI *love* donuts (and pizza).\E/
instead of
/I \*love\* donuts \(and pizza\)\./
This feature is available in Java, PCRE, and Perl, and would make a useful addition to XRegExp, as some client-side Javascript may get regexes that include block quotes from server-side code using one of the aforementioned regex flavors, or just contain literal text that would normally need a lot of manual escaping as in the example sentence above.
Thanks.
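In the meantime, a preprocessing step along these lines might be enough for simple cases. This is only a sketch: the name expandQuotes is made up, it is not an XRegExp feature, and it does not handle a \Q that is itself preceded by an escaping backslash or that appears inside a character class.

```javascript
// Hypothetical sketch: rewrite \Q...\E spans into backslash-escaped literals.
function expandQuotes(pattern) {
  // Lazily grab everything after \Q up to the next \E, or to the end of the
  // pattern when \E is omitted.
  return pattern.replace(/\\Q([\s\S]*?)(?:\\E|$)/g, (whole, literal) =>
    literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
}

const donuts = expandQuotes('\\QI *love* donuts (and pizza).\\E');
console.log(donuts); // I \*love\* donuts \(and pizza\)\.
console.log(new RegExp(donuts).test('I *love* donuts (and pizza).')); // true
```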
XRegExp should support component.
All that is needed is a component.json file.
It seems that npm currently has only 2.0. Are you going to publish 3.0?
Would it be possible to implement partial matching in XRegExp?
This would make real-time validation on web-forms far more user-friendly, as described in comments I posted on this page.
Java's Matcher class apparently supports it, as does this Java library. A number of other libraries for Perl and C++ have this feature, but I was unable to find an implementation in JS.
One possible implementation strategy would be to break down the expression into its individual component expressions, then progressively compare a larger part of the total expression plus a ^ at the end of the expression. If you find a match, as far as I can figure, that should be a partial match. I don't know how difficult it would be to parse and break up the expression into component expressions...
Does this feature seem like a good fit for this library?
This will be used with XRegExp.install and XRegExp.uninstall to enable full 21-bit Unicode support in XRegExp's Unicode addons (which must be loaded separately). See #25 for related information.
From what I can tell, "new XRegExp" will behave correctly for current versions of XRegExp. I was just curious if anyone is using this behavior, and if it will be protected in the future? Are there any plans to override "new XRegExp" or introduce incompatible code?
XRegExp 2.1.0-dev (pre-release) added XRegExp.matchAll (see #16). However, before the release of v2.1.0 final, I plan to change both the name and semantics of the function. XRegExp.matchAll will be removed. In its place, a new XRegExp.match function will offer both match-all and match-first modes. The mode will be set via an optional third scope argument, which works like the scope argument of XRegExp.replace. It will accept the values 'one' (default) or 'all'. Also like XRegExp.replace, the presence or absence of flag /g can be used to set the scope, but an explicitly specified scope will always override /g.
When scope is 'one', XRegExp.match will return the first match as a string, or null if no match is found. (If you want backreference properties, etc., that's what XRegExp.exec is for.) When scope is 'all', XRegExp.match will return an array of strings, or an empty array if no match is found.
This is essentially a more convenient re-implementation of String.prototype.match that gives you the result types you actually want (string instead of exec-style array in match-first mode, and an empty array instead of null when no matches are found in match-all mode), and lets you override/ignore flag /g and lastIndex.
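The planned result types can be sketched with native regexes as follows. The name matchWithScope is a placeholder for illustration, not the final API, and the sketch ignores XRegExp-specific details such as named capture.

```javascript
// Hypothetical sketch of the planned scope semantics ('one' or 'all').
function matchWithScope(str, regex, scope) {
  // An explicit scope wins; otherwise the presence of /g decides.
  const all = scope === undefined ? regex.global : scope === 'all';
  const flags = regex.flags.replace('g', '') + (all ? 'g' : '');
  const copy = new RegExp(regex.source, flags); // caller's lastIndex is untouched
  const result = str.match(copy);
  if (all) return result || [];    // array of strings, possibly empty
  return result ? result[0] : null; // first match as a string, or null
}

console.log(matchWithScope('a1b2', /\d/));         // '1'
console.log(matchWithScope('a1b2', /\d/, 'all'));  // both digits
console.log(matchWithScope('a1b2', /\d/g, 'one')); // '1', scope overrides /g
```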
I am trying to use XRegExp 3.0.0-pre with Rhino 1.6r2 (which is the version of Rhino shipping with Java 6).
Compiling the regex below (taken from http://xregexp.com/ ):
date = XRegExp('(?<year> [0-9]{4} ) -? # year \n' +
'(?<month> [0-9]{2} ) -? # month \n' +
'(?<day> [0-9]{2} ) # day ', 'x');
triggers the following error message:
"Invalid quantifier ?" at script line 517 (which is the line: "return augment(new RegExp(key.pattern, key.flags), key.captures, /*addProto*/ true);")
Inspecting key.pattern reveals that the ?<...> are not being stripped out:
(?<year>(?:)[0-9]{4}(?:))(?:)-?(?:)year(?:)(?<month>(?:)[0-9]{2}(?:))(?:)-?(?:)month(?:)(?<day>(?:)[0-9]{2}(?:))(?:)day
Does anyone have a workaround?
All non-minified JS in this project is incorrectly encoded and is breaking tools such as sstephenson/sprockets. Please ensure that it's all valid UTF-8.
to reflect the changes made in the readme.
It's written so as to use require('xregexp'); however, to access functions one still has to call require('xregexp').XRegExp
Currently, backreferences to a group, when multiple groups use the same name, refer to the last (rightmost) group with that name. See my named capture comparison page to see how this compares to other libraries. Notably, using multiple groups with the same name is an error in PCRE, Python, and Java.
.NET, Perl, and Oniguruma give useful semantics to multiple groups with the same name, but the behavior is different in each case, and XRegExp's current behavior is different than all of them. XRegExp's current behavior is not very useful, so I will change this to a SyntaxError in XRegExp v2.1.0.
Generally, nonbugfix syntax changes are delayed until v3.0.0. This is being treated as a syntax bugfix, even though it is not technically a bug. The current behavior was intentional, but it was chosen without detailed information (recently provided by Jan Goyvaerts) about all the different and noncompatible ways that this is handled in other regex flavors.
This is more a question than an issue.
I have some regex that's customizable and I want to dynamically grab capture names from the configured regex. I see that all capture names are stored in the captureNames array in xregexp. I assume it should be OK to access that field, right? I.e., it's not expected to change anytime soon.
I'm trying to convert this expression '\(((?:(?>[^()]+)|(?R))*)\)' in PCRE (PHP 5.4) to XRegExp. I'm aware it doesn't support atomic groups ((?>...)) and the recursive ?R. It doesn't matter if I need some extra code to get it working, but I'm failing hard to find a substitution for this.
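Since JavaScript regexes have no recursion, the "extra code" route usually means counting nesting depth by hand. A minimal sketch of that idea follows; the function name matchBalanced is made up, and it only handles plain parentheses with no escaping. (The XRegExp.matchRecursive addon targets this kind of problem as well.)

```javascript
// Hypothetical sketch: return the contents of each outermost (...) group.
function matchBalanced(str) {
  const results = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < str.length; i++) {
    if (str[i] === '(') {
      if (depth === 0) start = i; // remember where the outermost group opens
      depth++;
    } else if (str[i] === ')' && depth > 0) {
      depth--;
      if (depth === 0) results.push(str.slice(start + 1, i));
    }
  }
  return results; // unbalanced trailing groups are ignored
}

console.log(matchBalanced('a (b (c) d) e (f)')); // [ 'b (c) d', 'f' ]
```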
Unicode 6.2.0 is currently in beta and won't be released until late September or early October. However, the changes that will affect XRegExp are already well defined (see: What's new in Unicode 6.2?). Specifically, the changes are as follows:
Turkish Lira Sign (U+20BA):
- Add U+20BA to categories \p{S} and \p{Sc}.
- Remove U+20BA from categories \p{C} and \p{Cn}.
- Add U+20BA to property \p{Assigned} (no longer relevant since XRegExp defines Assigned as the inverse of Cn, without separate data).
Arabic Wavy Hamza Below (U+065F):
- Move U+065F from \p{Inherited} to \p{Arabic}.
IMO, it makes sense to go ahead and add these early because XRegExp 3.0.0 is almost ready for release. RegexBuddy 4 will include XRegExp as a supported regex flavor, and in future versions (v4.1?) RegexBuddy will add astral support and treat changes in the supported Unicode version as a separate regex flavor. Including Unicode upgrades in major releases of XRegExp (as with any other nonbugfix syntax changes) would therefore be ideal.
If there are any changes between the Unicode 6.2.0 beta and final release data (this seems unlikely), they can be added in an XRegExp bugfix release and will not require a new major version.
Right now I'm the owner of the NPM package. This is not very practical in the long run. I suggest the following procedure:
- You create an NPM user (npm adduser).
- I add you as an owner of the xregexp package (npm owner add).
- You publish new versions of xregexp on NPM (npm publish . in root dir).
Comments are welcome.
Using XRegExp v.2.0.0 installed through npm, nodejs v0.10.22 on a recent Macbook pro. Doing the following:
var XRegExp = require("xregexp").XRegExp;
var str = "Bonjour, comment allez-vous ? Moi, ça ne va pas très bien à cause d'un gros bug dans l'exécution de mon programme";
var rule = XRegExp("\\P{L}+");
var a = Date.now();
var words = XRegExp.split(str, rule);
console.log(Date.now()-a);
gives me results above 350ms ! I saw in the docs that XRegExp is supposed to compile into native regular expressions with no/little performance hit, so I'm surprised to see such a poor performance.
Did I do anything wrong ? Is the performance better with 3.0 ?
Thanks
When you use the valueNames option to enable the detailed match information mode, XRegExp.matchRecursive 0.2.0 is outputting the name value in the value property, and vice versa. Will fix and add tests immediately.
I need to loop over the named parameters after executing, seems to be pretty problematic. Why are you augmenting the array with properties? Shouldn't you provide a way to give access to clean named object?
Taken regex "(?<type>[a-z]+)/(?<id>\\d+)/(?<tab>[a-z]*)?"
I'd like to have object (after exec):
{ 'type' : 'bar', 'id' : 666, 'tab' : 'ifany' }
That would be trivial to loop over, now I just can't loop over the properties normally cause there is crap preceding the properties (and even worse, if there is update to XRegExp I might get new crap to skip over in loop).
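For comparison, native named groups (ES2018) expose exactly this kind of clean object via match.groups. The sketch below uses a native regex rather than XRegExp, so treat it as an illustration of the desired shape rather than of XRegExp's behavior. Note that captured values are always strings, so id comes back as '666', not 666.

```javascript
const routeRe = /(?<type>[a-z]+)\/(?<id>\d+)\/(?<tab>[a-z]*)?/;
const routeMatch = routeRe.exec('bar/666/ifany');

// Copy the named captures into a plain object and loop over them.
const named = { ...routeMatch.groups };
for (const [name, value] of Object.entries(named)) {
  console.log(name, value);
}
```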
The XRegExp.build addon is only supposed to strip a leading ^ and trailing unescaped $ from subpatterns when both are present.
This is an edge case that is not known to affect any code in use, but nevertheless, I will fix this immediately and add tests.
Using:
> XRegExp.escape("\n");
\
The output is a literal \ followed by the original whitespace character. I'm not sure what the proper behavior for escaping whitespace is and was expecting a literal \n string.
The following example uses the symbol GClef (U+1D11E):
Unfortunately, I don't think this web form allows me to enter those symbols into the text... but under linux, holding CTRL+Shift while typing "1D11E" results in the symbol appearing in the text. I think you actually need to use the character map in windows, and something similar in mac...
XRegExp.install('astral');
XRegExp('^\\pS$').test('\uD834\uDD1E'); //--> true
XRegExp('^\\pS$').test('<G_Clef_Here>'); //--> false
Running this code with firefox 29 and xregexp 3.0pre shows true for both alerts.
The \k<sep> backreference doesn't work as expected, whereas \2 in dateOK does. If we remove the
"| \s* (?<fullmonth> August )" part, the bug no longer shows up.
var dateOK = XRegExp.build(' \
( \
(?<day> [0-3]?\\d) \
\\s* ((?<sep> [./\\s]) ) \\s* \
(?<month> (1[012]|0?\\d)) \
| \\s* (?<fullmonth> August ) ) \
( \\s* \\2 \\s* \
(?<year> (20)?[012]\\d) \
)? ',{},'xni');
var dateBUG = XRegExp.build(' \
( \
(?<day> [0-3]?\\d) \
\\s* ((?<sep> [./\\s]) ) \\s* \
(?<month> (1[012]|0?\\d)) \
| \\s* (?<fullmonth> August ) ) \
( \\s* \\k<sep> \\s* \
(?<year> (20)?[012]\\d) \
)? ',{},'xni');
alert(XRegExp.exec("05/07 09",dateOK).year == "");
alert(XRegExp.exec("05/07 09",dateBUG).year == "09");
Hi all,
I need your help!
I am currently working on a file uploader that should accept alphanumeric, Hiragana, and Katakana characters, as well as . - _, in filenames. The funny part is that the test function returns true when I paste a Japanese string, but when I try to upload a file with the same string as its filename, it returns false.
Here's my regex: XRegExp("^[\p{Hiragana}\p{Katakana}\p{L}\p{N}._-]+$")
Anyone knows what is the issue? =/
Thanks for your time!
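One thing worth checking first: inside an ordinary JavaScript string literal, an unrecognized escape like \p or \s silently drops the backslash before the regex engine ever sees it. A small sketch of the effect, using native Unicode property escapes (flag u) as a stand-in for XRegExp's \p{…} syntax; doubling the backslashes or using String.raw avoids the problem.

```javascript
const lost = "^[\p{L}]+$";   // the string actually contains ^[p{L}]+$
const kept = "^[\\p{L}]+$";  // the backslash survives
const raw = String.raw`^[\p{L}]+$`; // same as `kept`

console.log(lost === "^[p{L}]+$"); // true: the backslash is gone
console.log(raw === kept);         // true

// With the backslash lost, the class only contains the characters p, {, L, }.
console.log(new RegExp(kept, 'u').test('ひらがな')); // true
console.log(new RegExp(lost, 'u').test('ひらがな')); // false
```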
Slight difficulty I found.
This string:
(?:“|"|")([\s\S]+?)(?:”|″|"|")\s_?:\s_?(?:“|"|")([\s\S]+?)(?:”|″|"|")
passed as a parameter to the XRegExp constructor generates the following RegExp:
/(?:“|"|")([\s\S]+?)(?:”|″|"|")s_?:s_?(?:“|"|")([\s\S]+?(?:”|″|"|")/gi
Extracted from the first string:
\s_?:\s_?
and from the second:
s_?:s_?
My current workaround as a replacement for the previous string (brackets for clarity):
[ ]?:[ ]?
Test text which should be matched:
"Lucky charm" : "22.7"
This code will cause a bug: m.index should be a number, but it is overwritten by the value of the capturing group named index.
r = XRegExp('(?<index>\\w)(?<input>\\d)', 'g')
m = XRegExp.exec('a1b2', r)
console.log(m.index)
I hope the XRegExp.exec method can return a plain array rather than an object that is polluted by your custom properties, for example an array like [match, index, input]. Even adding an underscore before the property names would be better, such as _index, _input.
Back story:
XRegExp 0.5.0 added the methods RegExp.prototype.apply/call. XRegExp 2.0.0-beta moved them to XRegExp.prototype, added XRegExp.install('methods') to copy them back to RegExp.prototype, and made XRegExp(regexp)/XRegExp.globalize augment copied regexes with apply and call methods.
Going forward:
I'm considering removing the regex apply and call methods altogether in XRegExp 2.0.0 final. Since the built-in array collection methods (such as Array.prototype.filter) don't use duck-typed apply or call, adding these methods to regexes just doesn't seem useful often enough to justify them.
Add an option to the XRegExp.addToken options object (perhaps a boolean called reparseOutput) that lets a token's output be reprocessed. This would allow chaining new syntax/flag tokens, and provide greater flexibility and simplicity. E.g., [:alnum:] within character classes could return \p{L}\p{M}\p{Nd}, and the actual code point range generation could be deferred to the Unicode Categories addon.
Example usage:
// Allow \pL (etc.) as shorthand for \p{L}
XRegExp.addToken(
/\\([pP])([CLMNPSZ])/,
function (m) {
return '\\' + m[1] + '{' + m[2] + '}';
},
{
scope: 'all',
reparseOutput: true
}
);
Hey Steve!
Just a thought that although astral characters cannot be directly supported within character classes, I think they can be simulated by the likes of:
(<high1><low1>|<high2><low2>)
Even ranges could be calculated by joining appropriate ranges of surrogates, e.g.:
(<high1a>[<low1a>-<low1a>]|[<high1b>-<high1b>][<low1b>-<low1b>]|<high1c>[<low1c>-<low1c>])
whereby the first and third (a,c) alternates might not be necessary if the entire range of surrogates on either end is requested.
Negation would I guess need to compute all non-astral, non-excluded characters/ranges and join that to the inverse of the surrogate pattern above.
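To make the range idea a bit more concrete, here is a rough sketch of turning one astral code point range into an alternation of surrogate-pair ranges. The function names are made up for this example, it assumes both endpoints are in U+10000..U+10FFFF with start <= end, and it makes no attempt to merge the partial first and last alternatives into the middle one when they happen to cover a full low-surrogate range.

```javascript
// Split an astral code point into its UTF-16 high and low surrogates.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  return { high: 0xD800 + (offset >> 10), low: 0xDC00 + (offset & 0x3FF) };
}

// Format one code unit as a \uXXXX escape, and a run of units as a class.
function esc(unit) {
  return '\\u' + unit.toString(16).toUpperCase();
}
function unitRange(from, to) {
  return from === to ? esc(from) : '[' + esc(from) + '-' + esc(to) + ']';
}

// Hypothetical helper: regex source matching any code point in [start, end].
function astralRangeToPattern(start, end) {
  const s = toSurrogates(start);
  const e = toSurrogates(end);
  const parts = [];
  if (s.high === e.high) {
    parts.push(esc(s.high) + unitRange(s.low, e.low));
  } else {
    // First high surrogate: from the starting low surrogate to the last one.
    parts.push(esc(s.high) + unitRange(s.low, 0xDFFF));
    // High surrogates strictly in between: every low surrogate is allowed.
    if (e.high - s.high > 1) {
      parts.push(unitRange(s.high + 1, e.high - 1) + unitRange(0xDC00, 0xDFFF));
    }
    // Last high surrogate: from the first low surrogate to the ending one.
    parts.push(esc(e.high) + unitRange(0xDC00, e.low));
  }
  return '(?:' + parts.join('|') + ')';
}

// Musical Symbols block, which contains the G clef (U+1D11E).
const musical = new RegExp('^' + astralRangeToPattern(0x1D100, 0x1D1FF) + '$');
console.log(musical.test('\uD834\uDD1E')); // true
```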