sicroatgit / regex-engine
A RegEx engine that builds NFA/DFA and always returns the longest match.
License: MIT License
Unicode's character classes: Alphabetic, M, Nd, Pc and Join_Control, but not those exceeding \uFFFF
In addition to character classes, there will also be shorthand character classes. However, I'm not quite sure yet which ones there should be and which characters they should cover.
According to this website, the different RegEx engines cover different characters in the shorthand character classes:
https://www.regular-expressions.info/shorthand.html
The current listing:
\d
for [0-9]
\D
for [^\d]
\t
for the tab character
\r
for carriage return (CR)
\n
for linefeed (LF)
\f
for form feed
\s
for [ \t\r\n\f]
\S
for [^\s]
\w
for [A-Za-z0-9_]
\W
for [^\w]
\h
for [ \t]
\v
for [\r\n\f]
You could rename the Example_codes/ folder to just examples/ and keep it short. After all, are there any examples here other than code? And if there were, it would still be better to put them in subfolders (examples/code/, examples/whatever/, etc.) rather than having Example_codes/, Example_whatever/, etc.
Also, the noun code (as in source code) has no plural form — it's just "code", never "codes" (unlike, e.g. "moral codes").
This feature implements ASCII mode. When activated, the predefined character classes will only match the corresponding ASCII characters. For example, (?a)\w will then match only [a-zA-Z0-9_]. The character encoding remains UCS-2 in this mode, i.e. (?a)\W matches all UCS-2 characters, but not [a-zA-Z0-9_].
New Syntax
(?a): activates ASCII mode.
(?-a): deactivates ASCII mode (default).
New Parameter
AddNfa(..., "\w", #RegExMode_Ascii)
is the same as AddNfa(..., "(?a)\w")
This mode is also useful in combination with #RegExMode_NoCase when you want to lex keywords in source code case-insensitively, but no Unicode case folding beyond ASCII should be applied. Example:
(?i)set
corresponds to [Ss\u017F][Ee][Tt]
(?ia)set
corresponds to [Ss][Ee][Tt]
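Python's built-in `re` module happens to offer a comparable `(?a)` flag, so the difference between the two examples can be tried out there. This is only an analogy in a different engine, not this project's API; the variable names are made up for the illustration.

```python
import re

LONG_S = "\u017F"  # LATIN SMALL LETTER LONG S, which case-folds to "s"

# Unicode case-insensitive matching: the long s counts as a variant of "s".
unicode_match = re.fullmatch(r"(?i)set", LONG_S + "et")

# ASCII mode plus case-insensitivity: only [Ss][Ee][Tt] is accepted.
ascii_match = re.fullmatch(r"(?ia)set", LONG_S + "et")
ascii_upper = re.fullmatch(r"(?ia)set", "SET")

print(unicode_match is not None)  # long s accepted without ASCII mode
print(ascii_match is not None)    # long s rejected in ASCII mode
print(ascii_upper is not None)    # plain ASCII case variants still match
```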
Unicode's character class:
White_Space
but not those exceeding \uFFFF
You could enable Discussions to avoid cluttering Issues with anything that is not a pending task on the repository (e.g. Issue #1 could be converted to a Discussion, so once you've decided which shorthands to implement you can create an Issue for each pending shorthand, and keep it brief and to the point).
Unicode's character class:
Nd
but not those exceeding \uFFFF
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ANd%3A%5D&abb=on&esc=on&g=&i=
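The rule above (General_Category Nd, limited to \uFFFF) can be checked against the Unicode database bundled with Python. This is a minimal sketch with a hypothetical function name; Python's Unicode version may differ from the one the engine's tables are generated from.

```python
import unicodedata


def is_bmp_decimal_digit(ch):
    """True if ch is one character in category Nd with code point <= U+FFFF."""
    return (
        len(ch) == 1
        and ord(ch) <= 0xFFFF
        and unicodedata.category(ch) == "Nd"
    )


print(is_bmp_decimal_digit("7"))           # ASCII digit
print(is_bmp_decimal_digit("\u0663"))      # ARABIC-INDIC DIGIT THREE
print(is_bmp_decimal_digit("\u00B2"))      # SUPERSCRIPT TWO is category No
print(is_bmp_decimal_digit("\U0001D7CE"))  # Nd, but beyond \uFFFF
```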
Any character except those in Unicode's character classes Alphabetic, M, Nd, Pc and Join_Control, and except those exceeding \uFFFF
This feature implements full UTF-16 / Unicode support by correctly interpreting UTF-16 surrogate pairs as single Unicode code points and extending the predefined character classes to the full Unicode code points range.
New syntax
\Uhhhhhhhh (character with hex code hhhhhhhh)
New RegEx mode
(?u): activates UTF-16 mode. The predefined character classes \w, \d etc. are then extended to the full Unicode code point range (0x01 up to 0x10FFFF).
(?-u) (default mode): deactivates UTF-16 mode. The predefined character classes \w, \d etc. then correspond, as before, only to the Unicode code points possible with UCS-2 (0x01 up to 0xFFFF).
New Parameter
AddNfa(..., "\w", #RegExMode_Unicode)
is the same as AddNfa(..., "(?u)\w")
PureBasic's string functions use UCS-2 encoding in Unicode mode according to the official documentation. However, PureBasic uses the operating system's API functions to display strings, and all of these (Windows, Linux and macOS) interpret a PureBasic string as UTF-16, so programs written in PureBasic can display all Unicode characters.
Currently, the RegEx engine also works with this UCS-2 encoding, so UTF-16 surrogate pairs are interpreted as two separate UCS-2 characters.
In order to write Unicode code points outside the range supported by UCS-2 (Unicode's Basic Multilingual Plane only) in the regex, the UTF-16 surrogate pairs currently have to be written separately. Besides the disadvantage that this is inconvenient to write, the case-insensitivity mode then also does not work correctly, because to work correctly it would have to be able to interpret a UTF-16 surrogate pair as a single Unicode code point.
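For reference, the standard UTF-16 arithmetic that turns a code point above U+FFFF into a surrogate pair and back looks as follows. The function names are hypothetical, and the sketch is in Python rather than the engine's PureBasic.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

With UCS-2 semantics the regex has to spell out both halves (here \uD83D\uDE00), which is the inconvenience described above.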
For this feature the tool Graphviz will be used and should replace these codes:
For example, the regex engine creates the following dot code for the regex (a|b):
digraph nfa_state_diagram {
rankdir = LR;
size = "8,5";
node [shape = circle];
"" [shape = none, fixedsize = true, height = 0, width = 0];
"" -> 1;
1 -> 2 [label = "ε", style = dashed];
2 -> 3 [label = "61"];
3 -> 4 [label = "00"];
4 -> 5 [label = "ε", style = dashed];
5 [shape = doublecircle];
1 -> 6 [label = "ε", style = dashed];
6 -> 7 [label = "62"];
7 -> 8 [label = "00"];
8 -> 5 [label = "ε", style = dashed];
}
Graphviz then generates the NFA diagram visualization from this dot code.
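As a rough sketch of how such dot code could be emitted from a transition list, the following Python function reproduces the shape of the listing above. The function name and the edge representation are assumptions for illustration; reading the labels as two-digit hex byte values is also an assumption based on the "61"/"00" labels in the example.

```python
def nfa_to_dot(start, accepting, edges):
    """Build dot code. edges holds (source, target, label) tuples,
    where label is a byte value or None for an epsilon move."""
    lines = [
        "digraph nfa_state_diagram {",
        "rankdir = LR;",
        "node [shape = circle];",
        '"" [shape = none, fixedsize = true, height = 0, width = 0];',
        f'"" -> {start};',
    ]
    for source, target, label in edges:
        if label is None:
            lines.append(f'{source} -> {target} [label = "ε", style = dashed];')
        else:
            lines.append(f'{source} -> {target} [label = "{label:02X}"];')
    for state in sorted(accepting):
        lines.append(f"{state} [shape = doublecircle];")
    lines.append("}")
    return "\n".join(lines)


# Transitions taken from the (a|b) example above.
edges = [
    (1, 2, None), (2, 3, 0x61), (3, 4, 0x00), (4, 5, None),
    (1, 6, None), (6, 7, 0x62), (7, 8, 0x00), (8, 5, None),
]
print(nfa_to_dot(1, {5}, edges))
```

The resulting text can be saved to a file and rendered with Graphviz's `dot` command-line tool.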
It allows writing ISO_8859-1 or Unicode characters with hexadecimal numbers. This makes it easy to write characters for which there are no keyboard keys.
\u0001 to \u00FF (same as \x01 to \xFF)
\u0100 to \uFFFF (hexadecimal numbers correspond here to the Unicode code points)
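A small sketch of how the two escape forms could be parsed into a code point. The function name is hypothetical, and rejecting the value 0 is an assumption derived from the ranges listed above, which start at 01.

```python
import re

# \xhh (two hex digits) or \uhhhh (four hex digits)
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")


def parse_hex_escape(text, pos=0):
    r"""Return (code_point, next_pos) for a \xhh or \uhhhh escape at pos."""
    m = _HEX_ESCAPE.match(text, pos)
    if m is None:
        raise ValueError("no hex escape at this position")
    code_point = int(m.group(1) or m.group(2), 16)
    if code_point == 0:
        # Assumption: the documented ranges begin at 01, so 00 is rejected.
        raise ValueError("code point 0 is outside the documented range")
    return code_point, m.end()


print(parse_hex_escape(r"\xE9"))    # same code point as the next line
print(parse_hex_escape(r"\u00E9"))
```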
Any character except the Unicode's character class:
White_Space
and those exceeding \uFFFF
A RegEx ID number can then be specified for each RegEx, and in the event of a match, the RegEx ID number can be used to determine which RegEx has matched.
If multiple RegExes with different RegEx ID numbers match the same string, the RegEx ID number of the matching RegEx that was added last with the AddNfa() function is used.
Useful for building a lexer where different RegEx ID numbers can then be used for the different token types.
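The tie-breaking rule can be sketched in a few lines of Python. The class and method names are hypothetical, and Python's backtracking `re` module only stands in for the matcher here; the actual engine works on a combined DFA rather than trying each pattern in turn.

```python
import re


class TinyLexer:
    def __init__(self):
        self._rules = []  # (compiled pattern, regex_id) in insertion order

    def add(self, pattern, regex_id):
        self._rules.append((re.compile(pattern), regex_id))

    def match(self, text, pos=0):
        """Return (length, regex_id) of the longest match at pos, or None."""
        best = None
        for compiled, regex_id in self._rules:
            m = compiled.match(text, pos)
            if m is None:
                continue
            length = m.end() - pos
            # ">=" lets a rule that was added later win a tie on length.
            if best is None or length >= best[0]:
                best = (length, regex_id)
        return best


lexer = TinyLexer()
lexer.add(r"[a-z]+", 1)  # identifier
lexer.add(r"if", 2)      # keyword, added last

print(lexer.match("if"))    # both rules match 2 characters, ID 2 wins
print(lexer.match("iffy"))  # the identifier rule matches more, ID 1 wins
```

Adding keyword rules after the general identifier rule is what makes them take priority on equal-length matches.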
[x], where x can be any mix of the following:
a-c for the characters a, b and c
To negate the user-defined character class, the ^ character must follow immediately after the opening square bracket. If the ^ occurs in any other position, it is a normal character without the special meaning.
The metacharacter . (dot) is a literal dot character inside the square brackets, i.e. without the special meaning.
Nesting of character classes is not allowed.
For ranges, the range end symbol must have a larger character code than the range start symbol, i.e. [a-z] instead of [z-a].
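The rules for user-defined character classes can be sketched as a small parser. This is an illustrative Python function with a hypothetical name; it ignores escape sequences and treats a leading or trailing `-` as a literal, both of which are simplifying assumptions.

```python
def parse_char_class(text):
    """Parse '[...]' into (negated, set_of_characters)."""
    if len(text) < 2 or text[0] != "[" or text[-1] != "]":
        raise ValueError("expected a class of the form [...]")
    body = text[1:-1]
    negated = body.startswith("^")  # ^ is special only right after "["
    if negated:
        body = body[1:]
    chars = set()
    i = 0
    while i < len(body):
        if body[i] == "[":
            raise ValueError("nesting of character classes is not allowed")
        if i + 2 < len(body) and body[i + 1] == "-":
            start, end = body[i], body[i + 2]
            if ord(end) <= ord(start):
                raise ValueError("range end must be larger than range start")
            chars.update(chr(c) for c in range(ord(start), ord(end) + 1))
            i += 3
        else:
            chars.add(body[i])  # includes "." and a non-leading "^"
            i += 1
    return negated, chars


print(parse_char_class("[^a-c.]"))
```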
Discussed in #27
New Syntax:
(?i) means that everything after that is case-insensitive.
(?-i) means that everything after that is case-sensitive.
New Parameter
AddNfa(..., regExModes = 0)
AddNfa(..., "a", #RegExMode_NoCase) is the same as AddNfa(..., "(?i)a")
Groups inherit the active modes of the surrounding context. Mode changes within a group have no effect on the context outside this group.
The implementation will use Unicode's Simple Case Folding variant (single character code point to single character code point), but in reverse: Instead of mapping all character variations to a single character (folding), a single character is mapped to all character variations (unfolding). This is necessary because the DFA must know all valid characters.
The source file from which the translation table will be created:
https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
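The idea of unfolding can be sketched in Python by inverting a folding function. Note the assumption: Python's `str.casefold()` implements full case folding, so this sketch keeps only single-character results. That approximates Simple Case Folding but misses the entries that CaseFolding.txt marks with status S, so it is not a substitute for a table built from that file. The function name is hypothetical.

```python
from collections import defaultdict


def build_unfold_table(max_code_point=0xFFFF):
    """Map each character to the set of all its case variants."""
    groups = defaultdict(set)
    for cp in range(1, max_code_point + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters
        ch = chr(cp)
        folded = ch.casefold()
        if len(folded) == 1:  # keep single code point mappings only
            groups[folded].add(ch)
    table = {}
    for members in groups.values():
        if len(members) > 1:  # characters without variants are left out
            for member in members:
                table[member] = members
    return table


table = build_unfold_table()
print(sorted(table["s"]))  # "S", "s" and the long s (U+017F)
```

A DFA builder would then replace a literal `s` in case-insensitive mode with a transition on every member of `table["s"]`.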
Any character following the escape sequence \Q is interpreted as a normal character.
The escape sequence \E can then be used to return to normal behavior.
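One way to picture \Q...\E is as a preprocessing step that escapes every character inside the quoted span. The following Python sketch does that with `re.escape`; the function name is hypothetical, and treating an unterminated \Q as quoting to the end of the pattern is an assumption, not something stated above.

```python
import re


def expand_quoted_spans(pattern):
    r"""Rewrite \Q...\E spans into individually escaped characters."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith(r"\Q", i):
            end = pattern.find(r"\E", i + 2)
            if end == -1:
                end = len(pattern)  # assumption: unterminated \Q runs to the end
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy other escapes untouched
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


print(expand_quoted_spans(r"a\Q.*\Eb+"))
```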
Matches any character up to \uFFFF except \r and \n.
Syntax | Meaning |
---|---|
x{n,m} | at least n of x and at most m of x |
x{n,} | at least n of x |
x{,m} | at most m of x |
x{n} | exactly n of x |
If n is greater than m, an error is triggered.
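The four forms in the table and the error rule can be sketched as a small bounds parser. The function name is hypothetical; representing "no upper limit" as `None` and rejecting `{}` and `{,}` are assumptions made for this illustration.

```python
import re

_BOUNDS = re.compile(r"\{([0-9]*)(,?)([0-9]*)\}")


def parse_bounds(text):
    """Return (minimum, maximum); maximum is None when there is no upper limit."""
    m = _BOUNDS.fullmatch(text)
    if m is None:
        raise ValueError("not a quantifier")
    low_text, comma, high_text = m.groups()
    if not low_text and not high_text:
        raise ValueError("at least one bound is required")  # assumption
    low = int(low_text) if low_text else 0      # x{,m}: at most m
    if comma:
        high = int(high_text) if high_text else None  # x{n,}: at least n
    else:
        high = low                              # x{n}: exactly n
    if high is not None and low > high:
        raise ValueError("n is greater than m")
    return low, high


for quantifier in ("{2,5}", "{2,}", "{,5}", "{3}"):
    print(quantifier, parse_bounds(quantifier))
```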
Matches the carriage return character.
If someone only wants to use precompiled DFAs for matching in their software and does not want to create new NFAs/DFAs at runtime, a reduced module would avoid bloating their software with the large Unicode tables and the other code that is not needed.
Matches the line feed character.
Matches the horizontal tab character.
It allows writing ISO_8859-1 characters with hexadecimal numbers. This makes it easy to write characters for which there are no keyboard keys.
\x01 to \xFF
Any character except the Unicode's character class:
Nd
and those exceeding \uFFFF
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ANd%3A%5D&abb=on&esc=on&g=&i=
Matches the form feed character.
https://keepachangelog.com/en/1.1.0/
The GitHub release descriptions should then contain only links that point to the corresponding version section in the CHANGELOG.md file.
This makes the changelog readable offline and the project less dependent on GitHub services.