noxone / regex-generator

Generate regular expressions from sample texts.

Home Page: https://regex-generator.olafneumann.org

License: MIT License

Kotlin 88.57% CSS 1.80% HTML 7.91% Dockerfile 0.63% JavaScript 0.08% C 1.01%

regex-generator's Issues

Left align checkbox if "copy regex" is hidden

Describe the bug
On browsers that do not support the copy&paste function, the "Copy Regex" button is removed from the DOM. In this case the checkbox "Generate only patterns" is not left-aligned.

To Reproduce

Open the page in a browser that does not support copy&paste (e.g. Safari on iPhone)
Use a page width that forces the layout to use multiple columns for the "Copy Regex" button (e.g. landscape mode on iPhone)
See non-left-aligned checkbox

Expected behavior
The checkbox should be left-aligned if the button is not visible.

Smartphone (please complete the following information):

Device: iPhoneX
OS: iOS 13
stock browser

very slow parsing

first of all - very nice project!!
Would you consider accepting a file as input instead of reading a very long string?
I ran into a problem when I pasted some very large text (like HTML).

Configurable recognizers

Add a possibility to "configure" recognizers... or at least the output of a recognizer.

A first idea would be to be able to generate greedy and lazy patterns. Maybe there are even more options that could be added to some recognizers.
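
To illustrate what a greedy/lazy option would control, here is a minimal Python sketch. The sample text and the helper name first_match are assumptions made for this illustration and are not part of the project.

```python
import re

SAMPLE = "<a><b>"

def first_match(pattern: str, text: str) -> str:
    """Return the text of the first match, or an empty string if there is none."""
    match = re.search(pattern, text)
    return match.group() if match else ""

# Greedy quantifier: consumes as much as possible, then backtracks.
greedy = first_match(r"<.+>", SAMPLE)
# Lazy quantifier: consumes as little as possible.
lazy = first_match(r"<.+?>", SAMPLE)

print(greedy)  # <a><b>
print(lazy)    # <a>
```

A configurable recognizer could expose this as a simple switch that appends "?" to the quantifier it generates.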

Allow late editing of capturing groups

When editing capturing groups #109, enable the user to do the following things:

Allow users to rename capturing groups ✅
Allow conversion of existing groups to capturing groups (and the other way round)
Improve capital letters in caption (in general)
Offer the possibilities for groups (or maybe in general for every part of the regex):
- optional: ? ✅
- repeatable *✅, +✅ or {3,5}
- Lazy/Greedy-Flag ✅
- flags in general (case-sensitivity, multiline, etc...)
only allow valid group names (see: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#groupname) ✅
properly change groups when editing the input
Include an optimizer to eliminate unnecessary groups
Adjust driver.js to new UI
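
The group-name rule linked above (a letter first, then only ASCII letters or digits) could be checked roughly as follows. This is a sketch in Python, not the project's Kotlin code, and the function name is_valid_group_name is hypothetical.

```python
import re

# Java named groups: first character a letter, then letters or digits
# (A-Z, a-z, 0-9), as described in the java.util.regex.Pattern docs.
_JAVA_GROUP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_valid_group_name(name: str) -> bool:
    """Return True if the whole name satisfies the Java group-name rule."""
    return _JAVA_GROUP_NAME.fullmatch(name) is not None

print(is_valid_group_name("year"))      # True
print(is_valid_group_name("1st"))       # False
print(is_valid_group_name("my_group"))  # False
```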

Combine input and display (step 1 and 2)

Add the possibility to edit the sample text directly in step 2 (maybe remove step 1)

Wrong regex generated

Go to https://regex-generator.olafneumann.org/?sampleText=http%3A%5C&flags=&selection=4%7CCharacter
Remove all text from input field
See a character still left in the regex field --> wrong

error when recognizing repetitions:

Text like hgf\n\t < leads to errors.
They are currently suppressed so the users are not bothered, but they should be fixed.

Provide docker for regex-generator

It would be very helpful if developers could run the application locally inside Docker.

Show popover in case of rage click

In case the user rage-clicks in the pattern selection part, the page should display a popover explaining what should be changed...

Maybe if 4 clicks occur within 2 seconds... show a popover.
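
The "4 clicks within 2 seconds" idea can be sketched as a sliding window over click timestamps. This Python sketch only illustrates the logic; the class name RageClickDetector and its parameters are assumptions, and the real page would implement this in its own UI code.

```python
from collections import deque

class RageClickDetector:
    """Flags a rage click when enough clicks fall inside a time window."""

    def __init__(self, max_clicks: int = 4, window_seconds: float = 2.0):
        self.max_clicks = max_clicks
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def register_click(self, timestamp: float) -> bool:
        """Record a click (time in seconds) and report whether to show the popover."""
        self._timestamps.append(timestamp)
        # Drop clicks that are older than the window relative to this click.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.max_clicks

detector = RageClickDetector()
print([detector.register_click(t) for t in (0.0, 0.5, 1.0, 1.5)])
# [False, False, False, True]
```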

Store info, which language is expanded

Is your feature request related to a problem? Please describe.
Currently the language snippets are closed when you open the page. The page should store which snippet boxes are open. Once the page is reloaded, the corresponding boxes should be reopened.

Provide a different UI

Let's provide another type of UI that is more similar to txt2re, that displays the matches in a better way...

Add tooltip on pattern matcher hovers

Is your feature request related to a problem? Please describe.
Nope

Describe the solution you'd like
When hovering over the proposed regexes, the name of the pattern that has been found is shown. It would be cool to have a tooltip there showing a description of the pattern. Or maybe just the actual pattern that will be generated.

Add URL regex

Add regex for

URL encoded characters
URL schemas

maybe other url stuff

Error with webpack-cli when building

Describe the bug
I'm trying to build the project but I get the following error when I run gradle run

[webpack-cli] TypeError: cli.isMultipleCompiler is not a function
shared             |     at Command.<anonymous> (/app/node_modules/@webpack-cli/serve/lib/index.js:146:35)
shared             |     at async Promise.all (index 1)
shared             |     at async Command.<anonymous> (/app/node_modules/webpack-cli/lib/webpack-cli.js:1674:7)

Seems like this is an issue caused by a recent update from webpack. The solution described here is to upgrade webpack-cli to 4.10.0
webpack/webpack#15951

So I included this line in build.gradle under dependencies:
implementation npm("webpack-cli", "4.10.0")

However, then I get this error:

Execution failed for task ':packageJson'.
> There is already declared version of 'webpack-cli' with version '4.10.0' which does not intersects with another declared version '4.9.2'

I've tried to see if there's another dependency that uses webpack-cli 4.9.2 but I can't figure it out. Any help here is much appreciated. This is a great tool and I'd love to try developing with it!

Desktop (please complete the following information):

OS: macOS Monterey
Version 12.2

Using the result in Visual Studio Code (VS-CODE)

Hi,

Great idea this - REGEX by example!

I am trying to use the website (https://regex-generator.olafneumann.org/) to help me work out some non-trivial FIND / REPLACE tasks within VS-CODE Editor.

I appreciate that this is more likely a VS-CODE question but I thought I'd ask here as it raises the possibility of an additional 'REPLACE-WITH' feature on the site.

An example of what I am trying to do ...

Within the text ...

overline{A}

... I wish to replace this with ¬A

Thus, I am replacing the content of the braces and the outer function 'overline' with a ¬ followed by the original contents of the braces.

I am trying to use the web page to do this but the expression that it comes up with doesn't seem to 'find' anything when I use it in the FIND/REPLACE dialogue in VS-CODE

The expression I've used is...

^overline{[a-zA-Z]}$

Stems from a desire to take a Boolean logic equation in Libre Office Math from Mathematic notation to Logical notation.

Thank you
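
A possible approach to the replacement described above, sketched in Python rather than in VS Code itself: escape the braces, capture the content between them, and refer to the capture in the replacement. One likely reason the original expression finds nothing (an assumption, since the surrounding text is not shown) is that the ^ and $ anchors only match when overline{A} is alone on its line. The function name to_logical_notation is hypothetical.

```python
import re

# Escaped braces match literally; the parentheses capture the single letter.
OVERLINE = re.compile(r"overline\{([A-Za-z])\}")

def to_logical_notation(text: str) -> str:
    """Replace every overline{X} with ¬X, keeping the captured letter."""
    return OVERLINE.sub(r"¬\1", text)

print(to_logical_notation("overline{A} + B"))  # ¬A + B
```

In an editor's find/replace dialogue the same idea would use the pattern without anchors and a group reference in the replacement field; the exact reference syntax (for example $1 instead of \1) depends on the editor and should be checked there.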

Full Docker based build fails on unit tests

Describe the bug
The gradle build inside a docker container fails when running the unit tests.

To Reproduce
Steps to reproduce the behavior:
Simply follow the steps from the README file

Clone the repository
Navigate into the project's root folder
Run docker build . -t noxone/regexgenerator
See error:

> Task :browserTest
Cannot start FirefoxHeadless

Command '/usr/bin/firefox' requires the firefox snap to be installed.
Please install it with:

snap install firefox
FirefoxHeadless stdout:
FirefoxHeadless stderr: 
Command '/usr/bin/firefox' requires the firefox snap to be installed.
Please install it with:

snap install firefox
Cannot start FirefoxHeadless

Command '/usr/bin/firefox' requires the firefox snap to be installed.
Please install it with:

snap install firefox
FirefoxHeadless stdout:
FirefoxHeadless stderr: 
Command '/usr/bin/firefox' requires the firefox snap to be installed.
Please install it with:

snap install firefox
Cannot start FirefoxHeadless

Command '/usr/bin/firefox' requires the firefox snap to be installed.
Please install it with:

snap install firefox
FirefoxHeadless stdout:
FirefoxHeadless stderr: 
Command '/usr/bin/firefox' requires the firefox snap to be installed.
Please install it with:

snap install firefox
FirefoxHeadless failed 2 times (cannot start). Giving up.
java.lang.IllegalStateException: Errors occurred during launch of browser for testing.
- FirefoxHeadless
Please make sure that you have installed browsers.
Or change it via
browser {
    testTask {
        useKarma {
            useFirefox()
            useChrome()
            useSafari()
        }
    }
}

> Task :test FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> Failed to execute all tests:
  :browserTest: java.lang.IllegalStateException: Errors occurred during launch of browser for testing.
  - FirefoxHeadless
  Please make sure that you have installed browsers.
  Or change it via
  browser {
      testTask {
          useKarma {
              useFirefox()
              useChrome()
              useSafari()
          }
      }
  }

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 1m 17s

Expected behavior
The build should run without problems and the docker build command should create a usable image.

Desktop (please complete the following information):

OS: macOS Monterey 12.5.1 on an M1 mac (so arm64 platform)

Additional context
The "normal" build works.

New Feature: Test area

Is your feature request related to a problem? Please describe.
I'm always frustrated when I have the output regex and want to test it against even more text samples.

Describe the solution you'd like
Add a new (optional) step to verify the generated regex against several lines of sample text.

Describe alternatives you've considered
Copy the regex into a page like regex101. But then I would need to change the website.

Revert workflow to run on ubuntu-latest

The GitHub runner image ubuntu-latest does not contain Firefox. They want to add it later. Once the runner image is updated, the workflow should be switched back from ubuntu-20.04 to ubuntu-latest.

regex-generator/.github/workflows/build-project.yml

Line 30 in 1b0f523

runs-on: ubuntu-20.04

Issue describing the change: actions/runner-images#6399
Pull request that needs to be merged before solving this issue: actions/runner-images#6528

Alternatively install Firefox later: https://www.omgubuntu.co.uk/2022/04/how-to-install-firefox-deb-apt-ubuntu-22-04

Add more regex recognizers

Currently the Recognizers in the application are very basic.
The application needs a bigger "library" of possible matches.

Don't show "copy" if not available

Add C as language

As requested by @faustfa in #190 :

How is your programming language called?
C

How does the snippet look like that we shall generate?
The snippet should be compilable and runnable as is. The user should be able to just copy and paste the code and have a fully functional application.

#include <regex.h>
#include <stdio.h>

int useRegex(char* textToCheck) {
    regex_t compiledRegex;
    int reti;
    int actualReturnValue = -1;
    char messageBuffer[100];

    /* Compile regular expression */
    reti = regcomp(&compiledRegex, "^asd[0-9]+asd", REG_EXTENDED | REG_ICASE);
    if (reti) {
        fprintf(stderr, "Could not compile regex\n");
        return -2;
    }

    /* Execute compiled regular expression */
    reti = regexec(&compiledRegex, textToCheck, 0, NULL, 0);
    if (!reti) {
        puts("Match");
        actualReturnValue = 0;
    } else if (reti == REG_NOMATCH) {
        puts("No match");
        actualReturnValue = 1;
    } else {
        regerror(reti, &compiledRegex, messageBuffer, sizeof(messageBuffer));
        fprintf(stderr, "Regex match failed: %s\n", messageBuffer);
        actualReturnValue = -3;
    }

    /* Free memory allocated to the pattern buffer by regcomp() */
    regfree(&compiledRegex);
    return actualReturnValue;
}

Anything special about string literals?
Well, standard C-like string literals...

How can we specify options?
The options are part of the regcomp function call.

REG_ICASE
REG_NEWLINE
I found nothing for DOT_ALL

Need a warning?
If there is no DOT_ALL, this is a warning.

Erroneous lone quantifier brackets.

Describe the bug
Inputting [id]=[6] into rg_raw_input_text generates \[id]=\[[^\]]*], whereas it should generate \[id\]=\[[^\]]*\], per https://stackoverflow.com/a/49111429/9731176 and Visual Studio Code.
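
The expected output can be reproduced by escaping every literal bracket, including the closing ones. The following Python sketch uses re.escape for the literal parts; it only illustrates the escaping rule and is not the generator's actual implementation. The variable names are made up for this example.

```python
import re

# Literal parts are escaped completely; the character class in the
# middle is the "anything except a closing bracket" pattern from the issue.
prefix = re.escape("[id]=[")
suffix = re.escape("]")
pattern = prefix + r"[^\]]*" + suffix

print(pattern)  # \[id\]=\[[^\]]*\]
print(re.fullmatch(pattern, "[id]=[6]") is not None)  # True
```

Some regex engines accept an unescaped closing bracket as a literal, while others reject it, so always escaping both brackets is the more portable choice.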

To Reproduce
Steps to reproduce the behavior:

Input [id]=[6] into rg_raw_input_text.
Click on rg_button_copy.
Duplicate into the Find and replace field of Visual Studio Code.
See error:

Screenshots:

Desktop:

PS /home/rokejulianlockhart> uname -a
Linux RQN6C6 6.2.2-1-default #1 SMP PREEMPT_DYNAMIC Thu Mar  9 06:06:13 UTC 2023 (44ca817) x86_64 x86_64 x86_64 GNU/Linux

Browser

PS /home/rokejulianlockhart> firefox -v                                   
Mozilla Firefox 110.0.1

Escape language names properly

In case the name of a language contains a special character the UI will break.
The Kotlin code generates the HTML id from the language name to create the HTML elements to display the language snippets. In case the ID is invalid the page will not work anymore.
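
One way to make such IDs safe is to replace every character outside a conservative allow-list with its hexadecimal code point. This is a Python sketch of the idea only; the function name to_html_id and the "lang-" prefix are assumptions, and the project itself would do this in Kotlin.

```python
import re

def to_html_id(language_name: str) -> str:
    """Build an element ID that contains only letters, digits, '-' and '_'."""
    def encode(match: re.Match) -> str:
        # Encode the offending character as _<hex code point>_.
        return f"_{ord(match.group()):x}_"
    # The underscore itself is encoded too, so the mapping stays unambiguous.
    return "lang-" + re.sub(r"[^A-Za-z0-9-]", encode, language_name)

print(to_html_id("C#"))      # lang-C_23_
print(to_html_id("VB.net"))  # lang-VB_2e_net
print(to_html_id("C++"))     # lang-C_2b__2b_
```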

More obvious click suggestion on matches

Usage stats show that users often click on boxes that do not allow clicking, because they want to select something. If there are several of these clicks, the page should highlight the area where to click to indicate what the user is doing "wrong"... or maybe we could change the way the page is styled so that it is more obvious where to click...

Match plain UUID4 and UUID4 being part of another string (like urls)

What's the new pattern about? What is it able to recognize?
Upper- and lowercase UUID4

How does the pattern look like?

[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}

Is there anything special we need to take care of when recognizing this pattern?
-

Describe the pattern
-

Technical reference
RFC4122. Does not reference the above RegEx (taken from ihateregex.com) but explains each part of a UUID.
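
As a quick check of the proposed pattern, the Python sketch below applies it to a plain UUID and to a UUID embedded in a URL. The sample values and the helper name find_uuids are made up for this illustration. Note that the pattern as written does not inspect the version digit, so it accepts UUIDs of any version, not only version 4.

```python
import re

UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}"
)

def find_uuids(text: str) -> list:
    """Return all UUID-shaped substrings found in the text."""
    return UUID_PATTERN.findall(text)

print(find_uuids("123e4567-e89b-42d3-a456-426614174000"))
print(find_uuids("https://example.com/items/123E4567-E89B-42D3-A456-426614174000/edit"))
```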

Please add Alphanumeric pattern

What's the new pattern about? What is it able to recognize?
Alphanumeric string.

How does the pattern look like?

[A-Za-z0-9]+

Wrong indices for recognizers with searchRegex

Describe the bug
When a recognizer uses a search-regex the position of the match might be wrong in case characters in front of the main match are taken into account.
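
The kind of offset problem described here can be illustrated with a small Python sketch. The search regex below is hypothetical (it is not taken from the project): it consumes a leading whitespace character as context, so the span of the whole match starts one position too early, while the span of the capturing group gives the position of the main match.

```python
import re

# Hypothetical search regex: context (start of text or whitespace),
# followed by the main match in group 1.
SEARCH_REGEX = re.compile(r"(?:^|\s)(\d+)")

def main_match_span(text: str):
    """Return the (start, end) of the main match, ignoring leading context."""
    match = SEARCH_REGEX.search(text)
    return match.span(1) if match else None

whole = SEARCH_REGEX.search("id 42").span()
print(whole)                     # (2, 5) includes the space before the digits
print(main_match_span("id 42"))  # (3, 5) covers only the digits
```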

Add formatted regex for terminal usage.

Is your feature request related to a problem? Please describe.
I'm always frustrated when trying to use regular expression with the terminal command grep.

Describe the solution you'd like
A copy-and-paste regex snippet for terminal use under the Usage in programming languages section
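
Such a snippet mostly needs correct shell quoting of the pattern. The Python sketch below builds a grep command line with shlex.quote; the helper name grep_command, the sample pattern and the file name are assumptions for illustration. Whether a generated pattern is valid for grep -E also depends on the syntax it uses, since shorthand classes such as \d are not part of POSIX extended regular expressions.

```python
import shlex

def grep_command(pattern: str, filename: str) -> str:
    """Build a grep command line with the pattern quoted for a POSIX shell."""
    return f"grep -E {shlex.quote(pattern)} {shlex.quote(filename)}"

print(grep_command(r"[a-zA-Z]+_[0-9]+", "input.txt"))
# grep -E '[a-zA-Z]+_[0-9]+' input.txt
```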

Make detekt workflow use same config as build job

Currently the workflow "Scan with Detekt" https://github.com/noxone/regex-generator/actions/workflows/analyze-with-detekt.yml uses a slightly different configuration than the actual build job. This needs to be consolidated.

Wrong regex generated

Go to https://regex-generator.olafneumann.org/?sampleText=abx%5Bcd%5Def&flags=i&selection=3%7CSquare%20brackets
Add a character before the opening square bracket
See the error in the regex box. The regex is generated wrong, but it can be fixed by enabling and disabling the options...

Prettier CSS for user guide

The user guide created with driver.js should be styled so it fits to the overall page style.

Automatically suggest to summarize matches

Usage analysis shows that users quite often select several "digit" matches instead of selecting "multiple digits". The page should recognize this behavior and then suggest changing the selection.

url overflow css bug

Describe the bug
When I enter a long URL, the page is broken

To Reproduce
Steps to reproduce the behavior:

Go to 'https://regex-generator.olafneumann.org/'
Under 'paste a text sample', enter: "https://shopee.vn/-M%C3%A3-WAMT1005-gi%E1%BA%A3m-10k-%C4%91%C6%A1n-0k-C%E1%BB%B0C-R%E1%BA%BA-D%C3%89P-%C4%90%E1%BA%BE-TR%E1%BA%A4U-NAM-N%E1%BB%AE-2-QUAI-BIRKEN-UNISEX-DA-PU-M%C3%80U-%C4%90EN-DETA21D-i.301144779.6149149324"
Scroll down to part 2 'Which parts of the text are interesting for you?'
See the error

Expected behavior
I don't know whether this is a bug or a feature, but I think it could be improved although I know this is not of priority

Desktop (please complete the following information):

OS: macOS
Browser: Chrome

please insert also perl and PCRE

Hi, could you also add language syntax for Perl, PCRE, PCRE2 and classical C? The world would reward you, and I would too. Thanks so much for your great work.

Create limit for input

Is your feature request related to a problem? Please describe.
If the input is too long the parsing of the input and the recognition of patterns takes very long.

Describe the solution you'd like
Limit the number of characters that can be entered by the user.

Describe alternatives you've considered
Use a faster CPU.

Copy regex not working on Firefox

The "Copy regex" button does not work on Firefox, so I need to copy the regex from the textbox. If I do that, I get 2 newlines which is really bad for pasting into code.

Dynamic recognizer for parentheses

Enhance BracketedRecognizer results:

Look for recognizers within parentheses...
Use all recognizers in the brackets and suggest a combination of all the characters contained as a separate class.
It might make sense to expand the class identified in step 2 so that certain "known" classes are proposed

dot-stars in the wrong order

Describe the bug
Whenever the generated regex needs to (greedy) repeat a single character, the .* that it's supposed to generate comes out as *.

To Reproduce
Steps to reproduce the behavior:
The most straightforward way to reproduce the error is to select some single characters in step 2. It appears when selecting other spans as well.

Expected behavior
Generated regex should've had .* in place of *.

Desktop (please complete the following information):

OS: Fedora 35
Browser: Firefox
Version: 96.0.3

JavaScript-error

An error occurred with these details:

Exception: can't access property "i3c_1", t is undefined
Commit ID: ff31f7729ea786ed917d48ce22c7c05069e9c652
UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/112.0
Vendor:
Language: en-US
Platform: Win32
CPU: Windows NT 10.0; Win64; x64
TouchPoints: 0
Plugins:
- PDF Viewer
- Chrome PDF Viewer
- Chromium PDF Viewer
- Microsoft Edge PDF Viewer
- WebKit built-in PDF
StackTrace: TypeError: can't access property "i3c_1", t is undefined
@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:636683
us@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:557225
@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:637586
@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:638599
@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:625862
@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:648438
@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:643418
@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:643426
654/i/ku/<@https://regex-generator.olafneumann.org/regex-generator.js?commitId=ff31f7729ea786ed917d48ce22c7c05069e9c652:1:584320

Weird, weird character recognition issue that happened 𝘰𝘯𝘤𝘦

Describe the bug
When asked to generate a pattern to match a string plus a trailing underscore, the generator outputs a pattern that matches the string plus a hyphen.

I can't seem to reproduce this issue ever since I reloaded the page, but it consistently did as above as long as I didn't reload it.

To Reproduce
Steps to reproduce the behavior:

Go to ''.
Paste text sample TX_RESP_Q008.
To select the 'interesting parts', on the third line, click the first and second Multiple characters. Then, on the first line, click the first Character (_), which refers to the third character of the string. Then, click the second 'Character (_)', which refers to the eighth character of the string.
See that the pattern generated is either [a-zA-Z]+_[a-zA-Z]+-Q008 or [a-zA-Z]+_[a-zA-Z]+-, depending on whether you select 'Generate only patterns' or not, which I did (these patterns were copied straight from the website using the 'Copy regex' button).

Expected behavior
The pattern should have looked like this:
[a-zA-Z]+_[a-zA-Z]+_

Screenshots
Not necessary.

Desktop (please complete the following information):

OS: Windows 10 21H2
Browser: Edge
Version: 105.0.1343.42

Additional context
It was the first URL I opened when launching Edge. I had previously used the website. I first generated the RegEx without the trailing underscore, then I switched to the program RStudio, where I used the sub() function to replace that pattern on a string, but I realized that it kept the underscore. So I corrected the RegEx, reran the function, then came back to the website to select the second underscore, and that's when it happened. I opened the DevTools to copy the trailing underscore from the source and pasted it into Google to make sure that the Unicode was from an underscore, and not, maybe, a weird Unicode character that the generator would have recognized as a hyphen, but it was your run-of-the-mill underscore (here's the character I copied: _)

Show regex code for different languages

Once the regular expression has been generated, show ready-to-use source code for different languages so you can copy and paste the snippet for your favourite language.

Remove substrings

As mentioned in the readme, I would like to get rid of the substring() calls. Currently there are two source files with substring in the code:

regex-generator/src/main/kotlin/org/olafneumann/regex/generator/ui/parts/P1UserInput.kt

Line 53 in 2919f09

this.inputText.substring(0, maxInputLength)
regex-generator/src/main/kotlin/org/olafneumann/regex/generator/regex/RecognizerCombiner.kt

Line 42 in 094276c

val text = inputText.substring(0, length)
regex-generator/src/main/kotlin/org/olafneumann/regex/generator/regex/RecognizerCombiner.kt

Line 54 in 094276c

val text = inputText.substring(rangesToMatches.last().range.last + 1)
regex-generator/src/main/kotlin/org/olafneumann/regex/generator/regex/RecognizerCombiner.kt

Line 103 in 094276c

val text = inputText.substring(rangeBetween)

I would like to consider more expressive alternatives to substring().

Capturing groups

Describe the solution you'd like
I would like to have a possibility to define capturing groups in the generated regex.

Python sample code is wrong, including the regex.

Describe the bug
The code generated for Python, in section 4, has some mistakes and the regex pattern is also functionally different from the one displayed in section 3.
For the sample [] in section 1, with the "Square brackets" selection in section 2:
the regex in section 3 says: \[\];
the Python code in section 4 says:

import re

def useRegex(input):
    pattern = re.compile(r"\\[\\]", re.IGNORECASE)
    return pattern.match(input)

The Python code issue is that input is the name of a built-in function and should not be shadowed by a variable name (even though the code does work this way).
The regex issue is that the extra backslash is unnecessary because the string is already an r-string (raw string).

To Reproduce
Steps to reproduce the behaviour:
1) Make an expression that requires a character to be escaped with a backslash;
2) Scroll down and view the Python code; copy it and use it in Python;
3) Notice that the r-string part \\ is interpreted by Python as \\, instead of \.

Expected behaviour
The expected behaviour is that the Python code either uses a normal string instead of an r-string, or uses an r-string and does not try to unnecessarily escape the backslash. In the latter option, the same regex text as in section 3 should be used.
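
The difference can be verified with a short Python sketch. The function name use_regex and the parameter name text are choices made for this illustration; the point is only how the raw string is interpreted.

```python
import re

def use_regex(text: str):
    # Raw string: the regex engine receives \[\] exactly as written.
    pattern = re.compile(r"\[\]", re.IGNORECASE)
    return pattern.match(text)

# With doubled backslashes inside a raw string, the regex means
# "a literal backslash followed by one character from the class [\]".
doubled = re.compile(r"\\[\\]")

print(use_regex("[]") is not None)         # True
print(doubled.match("[]") is not None)     # False
print(doubled.match("\\\\") is not None)   # True: matches two backslashes
print("\\[\\]" == r"\[\]")                 # True: normal string needs doubling
```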

Desktop (please complete the following information):

OS: Arch Linux
Browser: Firefox
Version: 103.0.2

python support

Would you consider supporting some Python code?
If you need some help with integrating Python, I can happily help you.

Matching Pipe Character, special meaning characters in input text to be escaped

When I paste in:
| | | 13024a.htm
to generate a regex for it, it uses the pipe as a plain character. The pipe is the alternation operator, so it should be escaped in the regex instead.
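
The effect of the unescaped pipe can be shown with a small Python sketch. It uses re.escape to produce an escaped version of the sample text; this illustrates the expected behaviour and is not the generator's own escaping code.

```python
import re

SAMPLE = "| | | 13024a.htm"

# Unescaped, the text is read as alternatives separated by "|".
# The first alternative is empty, so the match is an empty string.
unescaped = re.match(SAMPLE, SAMPLE)
print(repr(unescaped.group()))  # ''

# Escaped, every special character is treated literally.
escaped = re.fullmatch(re.escape(SAMPLE), SAMPLE)
print(escaped is not None)  # True
```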

Add VB.net

I'm forced to use this stupid language at work, so it would be nice if you could add support for it.

How is your programming language called?
VB.net

How does the snippet look like that we shall generate?

Imports System.Text.RegularExpressions

Public Module SampleModule
    Public Function useRegex(ByVal input As String) As Boolean
        Dim re = New Regex("regex", RegexOptions.IgnoreCase Or RegexOptions.Singleline Or RegexOptions.Multiline)
        Return re.IsMatch(input)
    End Function
End Module

Anything special about string literals?
VB.net uses only " for string literals. If you want to use " in a string you have to use ""

e.g.

MsgBox("Hello "" World")

Will show a MsgBox with Hello " World.

How can we specify options?
See above

Need a warning?
No

Error is received when String is being identified for conversion

Describe the bug
Error is received when String is being identified for conversion

To Reproduce
Steps to reproduce the behavior:

Go to website to REGEX GENERATOR
Paste Sample Text : [2/10/23 11:33:09:829 CST] 00000214 SystemOut O
Create the Regex as Shown :
a) in the 1st screenshot attached below and
b) shown here: [2/10/23\s\d\d:\d\d:\d\d:\d\d\d\sCST]\s+[0-9]+\sSystemOut\s+O
In the second section, move to the text for SystemOut and attempt to identify it as "Multiple Characters"; the error "Houston, We Have A Problem!" is then received.
See error info below and in Screenshot #2

Expected behavior
I expected to get the following regex in Section 4:
[2/10/23\s\d\d:\d\d:\d\d:\d\d\d\sCST]\s+[0-9]+\s[A-Za-z]+\s+O
but instead I got the error reported and in the 2nd screenshot

Screenshots