Comments (16)
Ah, nice! OK, I have some work in progress on this issue, and I'll be sure to support that case as well. That's a great feature.
from ripgrep.
Unfortunately, this is a harder thing to add, since it would require plumbing an Aho-Corasick automaton through all of the searching code so that either a Regex or an AcAutomaton could be used. Doable, but not straight-forward.
Thinking about this just a touch more, I think it might actually be easier than I let on. The search code doesn't actually know anything about regexes---all it cares about is using the relatively simple interface of grep::Grep
. If we put that interface behind a trait and then impl it with Aho-Corasick, we should be pretty much good to go.
It might be nice if the trait and Aho-Corasick implementation were in the grep
sub-crate. And maybe the Grep
type should be renamed to Regex
, that way, the trait could be called Grep
. Hmm, yes, I quite like that!
from ripgrep.
for context, here's the man page entry from grep:
-f file, --file=file
Read one or more newline separated patterns from file. Empty pattern
lines match every input line. Newlines are not considered part of a
pattern. If file is empty, nothing is matched.
🌞 note that neither ag
or ack
have this, so it would be another differentiator!
from ripgrep.
I might have some time next week to take a look at this. Would you be interested in a PR following your outline above?
from ripgrep.
@emk That'd be great! Although re-reading it, I think I'd like to keep the trait private inside of ripgrep
for now. I'd eventually like to move more of the searching code to the grep
crate, so its API is going to need to be re-worked and thought out anyway.
from ripgrep.
Happy news! My top card in Trello at work says we need support for fast -f
-style searching, so I have time to start on this on Monday. I'd love to prepare you a pull request to look at. Beyond what you wrote above, do you have any further advice on how I should tackle this to make the integration process as easy as possible?
from ripgrep.
My top card in Trello at work says we need support for fast -f-style searching
Do you have any requirements more specific than this? Specifically, the number of strings you need to search for?
I'd love to prepare you a pull request to look at. Beyond what you wrote above, do you have any further advice on how I should tackle this to make the integration process as easy as possible?
I think it's probably pretty important to support use of Aho-Corasick when the -F
flag is given, which should bypass the regex engine entirely. Specifically, Aho-Corasick should scale quite a bit better to thousands/millions of literal strings than the regex engine will. A key problem you might face with this is the fact that Aho-Corasick doesn't actually preserve leftmost-first match semantics, and I'm not quite sure how to deal with that.
In any case, I suspect that the easiest first thing you can do here is to forget about the Aho-Corasick optimization and just join all the strings together with |
and feed it to the regex engine. This may not meet your performance requirements however.
from ripgrep.
Thank for the encouragement! (And for the great tool, which I already use dozens of times a day.)
As for the implementation, let's start with the simple case as a warm-up exercise and go from there. :-)
For our purposes, I could easily imagine a couple hundred strings. As far as I know, we're unlikely need to search for thousands of strings, but @seamusabshere might know otherwise. I'm still happy to try to implement the general case, because I agree that the PR just wouldn't be up to ripgrep's high standards otherwise.
In the longer run, our use case might benefit from a ripgrep
API which allowed us to call the searching engine from another binary. But for our first in-house experiments, we'll have a shell script that uses curl
to download a list of strings from a URL, and which then passes them to rg
for a filesystem-wide search. We might also benefit from a basic implementation of #1 that can at least detect UTF-16 files automatically, even if it can't detect more exotic encodings, so that we don't miss data in files created on Windows.
from ripgrep.
As for the implementation, let's start with the simple case as a warm-up exercise and go from there. :-)
To be clear, I'd be happy with a PR that did the simplest thing first! I imagine it should work fine with hundreds of strings. I don't know what it will look like at thousands or hundreds of thousands though. (I do know that my Aho-Corasick implementation can handle a million pretty easily.)
In the longer run, our use case might benefit from a ripgrep API which allowed us to call the searching engine from another binary. But for our first in-house experiments, we'll have a shell script that uses curl to download a list of strings from a URL, and which then passes them to rg for a filesystem-wide search.
Yup. #162 is the tracking issue for this. I'm almost ready to submit a PR that takes another step in that direction.
We might also benefit from a basic implementation of #1 that can at least detect UTF-16 files automatically, even if it can't detect more exotic encodings, so that we don't miss data in files created on Windows.
I honestly expect #1 to be quite hard to implement. At a high level, you just need something that takes a Read
and transcodes the raw input to UTF-8, which in turn also implements Read
. I think that alone is enough to work with the current implementation. The problem then comes with the output though. Do you output in the same encoding as the input? If so, that sounds pretty hairy. (Possibly even intractable.)
I am kind of hoping to delay #1 until the search code gets pushed into its own library. That way we'll have an API to look at and design from, instead of what we have today, which is a bit convoluted.
from ripgrep.
examples of what i want to search for:
- specific uuids
- strings that look like addresses
/\d+ [a-z]+ (?:st|dr)/i
/^[0-9A-F]{8}-[0-9A-F]{4}-[4][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i
in case that's helpful :)
from ripgrep.
@seamusabshere I think that's all part of the plan. Most of the discussion at this point is just implementation details. :-)
Note that ripgrep probably won't ever support "I want to match hundreds of thousands of regexes all at once." If you're at that point, you'll probably need something more specialized. (Hundreds of thousands of literals should be fine though, even if it's not practical in the first rev.)
from ripgrep.
Note that if this supports -f-
for reading from stdin, then it should satisfy #180 (which I'm now closing as a dupe).
from ripgrep.
The trickiest part of this so far is testing -f-
, which can't be done using the existing WorkDir::stdout
and WorkDir::output
APIs, because we need to pipe some standard input to the process. This requires using something like spawn
, as in this example. Do you have any preferences about how I should implement this?
from ripgrep.
@emk If there's a way to add a convenience method or two to WorkDir
to test it, then I think that would be fine. That way, we could add other tests that depend on stdin.
However, since there aren't actually any existing tests for stdin, I wouldn't be offended if you opted to skip testing -f-
. :-)
from ripgrep.
The first piece of this is available in #227. Thank you for getting me started in the right direction, and please feel free to suggest improvements!
from ripgrep.
Closed by #227
from ripgrep.
Related Issues (20)
- Include valid values in --help HOT 2
- Turn on PCRE2 automatically with a warning when look-around or backreferences are detected HOT 2
- consider adding an `rg:` prefix all error messages HOT 1
- issue when using double quotes to search in linux HOT 1
- feat(globset): allow custom component separator characters? HOT 3
- [ignore] Add API for adding glob patterns to ignore HOT 4
- shard regexes when there are many and do multiple passes over the haystack HOT 4
- Ripgrep significantly slower than grep HOT 3
- Ability to ignore non-existent `PATTERNFILE` HOT 2
- multiline search regex wildcard not expanding HOT 1
- Precompiled binary for powerpc64le HOT 1
- Add Vuejs as a file list to type-list HOT 1
- zero or more quantifier does not work HOT 1
- Inconsistent behavior with negation pattern in `.gitignore` HOT 8
- compiling with simd-accel feature is broken due to removal of stdsimd feature in nightly breaking the packed_simd crate HOT 1
- File named `.config` is ignored for no reason HOT 1
- Process substitution search path behavior change in 14.0.0
- [feature] Line masking (ignoring lines or part of lines in matching but displayed in output) HOT 3
- Provide Statically Compiled Binaries for (aarch64|arm64) Linux HOT 1
- Include regex syntax in man page HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ripgrep.