Giter Club home page Giter Club logo

urldedupe's Introduction

urldedupe

urldedupe is a tool to quickly pass in a list of URLs, and get back a list of deduplicated (unique) URL and query string combination. This is useful to ensure you don't have a URL list will hundreds of duplicated parameters with differing qs values. For an example run, take the following URL list passed in:

https://google.com
https://google.com/home?qs=value
https://google.com/home?qs=secondValue
https://google.com/home?qs=newValue&secondQs=anotherValue
https://google.com/home?qs=asd&secondQs=das

Passing through urldedupe will only maintain the non-duplicate URL & query string (ignoring values) combinations:

$ cat urls.txt | urldedupe
https://google.com
https://google.com/home?qs=value
https://google.com/home?qs=newValue&secondQs=anotherValue

It's also possible to deduplicate similar URLs. This is done with -s|--similar flag, to deduplicate endpoints such as API endpoints with different IDs, or assets:

$ cat urls.txt
https://site.com/api/users/123
https://site.com/api/users/222
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg
https://site.com/users/photos/myPhoto.jpg
https://site.com/users/photos/photo.png

Becomes:

$ cat urls.txt | urldedupe -s
https://site.com/api/users/123
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg

Why C++? Because it's super fast?!?! No not really, I'm working on my C++ skills and mostly just wanted to create a real-world C++ project as opposed to educational related work.

Installation

Use the binary already compiled within the repository...Or better yet to not run a random binary from myself who can be very shady, compile from source:

You'll need cmake installed and C++ 17 or higher.

Clone the repository & navigate to it:

git clone https://github.com/ameenmaali/urldedupe.git
cd urldedupe

In the urldedupe directory

cmake CMakeLists.txt

If you don't have cmake installed, do that. On Mac OS X it is:

brew install cmake

Run make:

make

The urldedupe binary should now be created in the same directory. For easy use, you can move it to your bin directory.

Usage

urldedupe takes URLs from stdin, or a file with the -u flag, of which you will most likely want in a file such as:

$ cat urls.txt
https://google.com/home/?q=2&d=asd
https://my.site/profile?param1=1&param2=2
https://my.site/profile?param3=3

Help

$ ./urldedupe -h
(-h|--help) - Usage/help info for urldedupe
(-u|--urls) - Filename containing urls (use this if you don't pipe urls via stdin)
(-V|--version) - Get current version for urldedupe
(-r|--regex-parse) - This is significantly slower than normal parsing, but may be more thorough or accurate
(-s|--similar) - Remove similar URLs (based on integers and image/font files) - i.e. /api/user/1 & /api/user/2 deduplicated

Examples

Very simple, simply pass URLs from stdin or with the -u flag:

./urldedupe -u urls.txt

After moving the urldedupe binary to your bin dir..Pass in list from stdin and save to a file:

cat urls.txt | urldedupe > deduped_urls.txt

Deduplicate similar URLs with -s|--similar flag, such as API endpoints with different IDs, or assets:

cat urls.txt | urldedupe -s

https://site.com/api/users/123
https://site.com/api/users/222
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg
https://site.com/users/photos/myPhoto.jpg
https://site.com/users/photos/photo.png

Becomes:

https://site.com/api/users/123
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg

For all the bug bounty hunters, I recommend chaining with tools such as waybackurls or gau to get back only unique URLs as those sources are prone to have many similar/duplicated URLs:

cat waybackurls | urldedupe > deduped_urls.txt

For max thoroughness (usually not necessary), you can use an RFC complaint regex for URL parsing, but it is significantly slower for large data sets:

cat urls.txt | urldedupe -r > deduped_urls_regex.txt

urldedupe's People

Contributors

ameenmaali avatar larskraemer avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.