WebPalm


What is webpalm?

WebPalm is a command-line tool that enables users to traverse a website and generate a tree of all its webpages and their links. It uses a recursive approach to enter each link found on a webpage and continues to do so until all levels have been explored. In addition to generating a site map, WebPalm can extract data from the body of each page using regular expressions and save the results in a file. This feature can be useful for web scraping or extracting specific information.
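The recursive traversal described above can be sketched as follows. This is a minimal illustration, not WebPalm's actual code: the `Node` type, the `buildTree` function, and the in-memory `fakeSite` map (which stands in for real HTTP fetching) are all assumptions made for the example, and the way levels are counted here may differ from the tool's `-l` flag.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is one page in the tree: its URL plus the pages it links to.
type Node struct {
	URL      string
	Children []*Node
}

// fakeSite stands in for real HTTP fetching: it maps a page to the links on it.
var fakeSite = map[string][]string{
	"https://example.com":   {"https://example.com/a", "https://example.com/b"},
	"https://example.com/a": {"https://example.com/c"},
	"https://example.com/b": {},
	"https://example.com/c": {"https://example.com"}, // cycle back to the root
}

// buildTree follows links recursively until maxLevel is reached.
// visited prevents following the same URL twice (for example, on cycles).
func buildTree(url string, level, maxLevel int, visited map[string]bool) *Node {
	node := &Node{URL: url}
	visited[url] = true
	if level >= maxLevel {
		return node
	}
	for _, link := range fakeSite[url] {
		if visited[link] {
			continue
		}
		node.Children = append(node.Children, buildTree(link, level+1, maxLevel, visited))
	}
	return node
}

// render returns the tree as indented text, one URL per line.
func render(n *Node, depth int) string {
	var sb strings.Builder
	sb.WriteString(strings.Repeat("  ", depth) + n.URL + "\n")
	for _, c := range n.Children {
		sb.WriteString(render(c, depth+1))
	}
	return sb.String()
}

func main() {
	tree := buildTree("https://example.com", 0, 2, map[string]bool{})
	fmt.Print(render(tree, 0))
}
```

The `visited` set is a design choice of this sketch: without it, the cycle from `/c` back to the root would be followed again on deeper levels.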

⚠️ DISCLAIMER ⚠️:

This tool is intended for legal purposes only. You are responsible for your own actions.

Features

  • Generate a palm-tree structure of web URLs
  • Dump data from page bodies using regular expressions
  • Multi-threading and parallelism
  • Export the web tree to JSON, XML, or TXT
  • Fast and easy to use
  • Colorized output and error handling

Installation

From source

git clone https://github.com/Malwarize/webpalm.git
cd webpalm
go build -o webpalm && ./webpalm

From binary

You can download the binary from Releases:

wget https://github.com/Malwarize/webpalm/releases/download/v0.0.1/webpalm_x.x.x_os_arch.tar.gz
tar -xvf webpalm_x.x.x_os_arch.tar.gz
cd webpalm
./webpalm

If you have Go installed:

go install github.com/Malwarize/webpalm/v2@latest

Usage

webpalm -h
Flags:
  -d, --delay int                delay (ms) between each request / ex: -d 200
  -x, --exclude-code ints        status codes to exclude / ex : -x 404,500
  -h, --help                     help for webpalm
  -i, --include strings          include only domains / ex : -i google.com,facebook.com
  -l, --level int                level of palming / ex: -l2
  -o, --output string            file to export the result (f.json, f.xml, f.txt) / ex: -o result.json
  -p, --proxy string             proxy to use / ex: -p http://proxy.com:8080
      --regexes stringToString   regexes to match in each page / ex: --regexes comments="\<\!--.*?-->" (default [])
  -t, --timeout int              timeout in seconds / ex: -t 10 (default 10)
  -u, --url string               target url / ex: -u https://google.com
  -a, --user-agent string        user agent to use / ex: -a chrome, firefox, safari, ie, edge, opera, android, ios, custom
  -v, --version                  version for webpalm
  -w, --worker int               number of workers for multi-threading  / ex: -w 10

Examples

Get the palm tree of a website:

webpalm -u https://google.com -l1
# or
webpalm -u https://google.com -l1 -w 3 # 3 workers (multi-threading)

Get the palm tree of a website and exclude some status codes:

webpalm -u https://google.com -l1 -x 404,500 

Get the palm tree of a website and dump data from the page bodies:

webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->" -o result.json

This dumps the comments found in the body of each page.

webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->",emails="([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"

This dumps the comments and emails found in the body of each page.
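To illustrate what matching named patterns against a page body looks like, here is a minimal Go sketch. The `extract` function and the sample body are hypothetical and are not taken from WebPalm's source. The patterns are simplified versions of the ones above, written without shell escaping because they live in Go source.

```go
package main

import (
	"fmt"
	"regexp"
)

// extract runs every named pattern against body and returns the matches per name.
func extract(body string, patterns map[string]string) (map[string][]string, error) {
	results := make(map[string][]string)
	for name, pattern := range patterns {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", name, err)
		}
		results[name] = re.FindAllString(body, -1)
	}
	return results, nil
}

func main() {
	body := `<html><!-- TODO: remove --><p>Contact: admin@example.com</p></html>`
	patterns := map[string]string{
		"comments": `<!--.*?-->`,
		"emails":   `[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+`,
	}
	found, err := extract(body, patterns)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(found["comments"])
	fmt.Println(found["emails"])
}
```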

Get the palm tree of a website and export it to XML or TXT:

webpalm -u https://google.com -l3 -o result.xml
webpalm -u https://google.com -l2 -o result.txt
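As a rough idea of how a page tree can be serialized to JSON or XML in Go, consider the sketch below. The `Page` type, its field names, and its struct tags are assumptions made for illustration, so the output format WebPalm actually writes may look different.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
)

// Page is a hypothetical tree node; the field names and tags are illustrative.
type Page struct {
	XMLName  xml.Name `json:"-" xml:"page"`
	URL      string   `json:"url" xml:"url,attr"`
	Status   int      `json:"status" xml:"status,attr"`
	Children []*Page  `json:"children,omitempty" xml:"page"`
}

// toJSON serializes the tree rooted at p as compact JSON.
func toJSON(p *Page) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

// toXML serializes the tree rooted at p as nested <page> elements.
func toXML(p *Page) (string, error) {
	b, err := xml.Marshal(p)
	return string(b), err
}

func main() {
	tree := &Page{
		URL:    "https://example.com",
		Status: 200,
		Children: []*Page{
			{URL: "https://example.com/a", Status: 404},
		},
	}
	if js, err := toJSON(tree); err == nil {
		fmt.Println(js)
	}
	if x, err := toXML(tree); err == nil {
		fmt.Println(x)
	}
}
```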

Get the palm tree of a website and include only some domains:

webpalm -u https://google.com -l2 -i google.com,facebook.com

This crawls only the URLs that contain google.com or facebook.com.
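The "contains" rule above can be sketched as a simple substring filter. The `allowed` function is hypothetical, and treating an empty include list as "allow everything" is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether url contains at least one of the include strings.
// An empty include list allows every URL (an assumption of this sketch).
func allowed(url string, include []string) bool {
	if len(include) == 0 {
		return true
	}
	for _, domain := range include {
		if strings.Contains(url, domain) {
			return true
		}
	}
	return false
}

func main() {
	include := []string{"google.com", "facebook.com"}
	for _, u := range []string{
		"https://mail.google.com/inbox",
		"https://example.org/about",
		"https://example.org/?ref=google.com",
	} {
		fmt.Println(u, allowed(u, include))
	}
}
```

Note that a plain substring check also accepts the third URL, where google.com appears only in the query string. A stricter filter would parse the URL and compare its host instead.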

Threading and concurrency

Get the palm tree of a website using 100 workers:

webpalm -u https://google.com -l2 -w 100
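For readers unfamiliar with the worker model, a fixed-size worker pool in Go might look like the sketch below. The `processAll` function is hypothetical and does not come from WebPalm's source. The work function measures the URL length instead of fetching it, so the example runs offline.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs work on every URL using a fixed number of worker goroutines
// and returns the results keyed by URL.
func processAll(urls []string, workers int, work func(string) int) map[string]int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(map[string]int, len(urls))
	var mu sync.Mutex // guards results, which several workers write to
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				r := work(u)
				mu.Lock()
				results[u] = r
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/bb", "https://example.com/ccc"}
	// len stands in for an HTTP fetch so the sketch runs offline.
	lengths := processAll(urls, 2, func(u string) int { return len(u) })
	fmt.Println(len(lengths), "pages processed") // prints: 3 pages processed
}
```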

Regexes Examples

Name       Pattern
emails     ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
comments   \<\!--.*?-->
tokens     [a-zA-Z0-9]{32}
password   \bpassword\b.{0,10}

Remember to escape the regexes for your shell where needed.

Tests

To gain more confidence in your enhancements or changes to the code, run the unit tests with go test -v ./...

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. You can also contact me on Discord: xorbit.

Powered By Malwarize

webpalm's People

Contributors

aavision, coderantidote, elhirchek, jonhadfield, mahdiaw, mhmdk0, xorbit01

webpalm's Issues

Max rows 650

The text file that is created maxes out at 650 rows.

I checked with "wc -l" while the tool was running.

Please remove the row cap.

-bash: webpalm: command not found

From Kali Linux, I'm in the webpalm folder and the webpalm command is highlighted in green and appears executable. I'm in a Google Cloud Shell. Should I try it in a VM instead?

webpalm on URLs which are open to directory listing returns nothing

Running webpalm -u https://www.nammalonline.co.il/wp-content/uploads/ -l3 -i nammalonline.co.il
returns:

┌[https://www.nammalonline.co.il/wp-content/uploads/]
│Level: 3
│Live Mode: false
│Export to: nothing
│Regexes:
│  nothing
│Crawl Only :
│  nammalonline.co.il
│Excluded Status: nothing
└
└── [https://www.nammalonline.co.il/wp-content/uploads/](0)

even though the uploads directory is open for browsing and contains multiple URLs to crawl.

Using a range quantifier breaks the --regexes option

Error:

command:

go run main.go --regexes  "emails=([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+),passwords=\bpassword\b\s*.{0,10}$"

output:

Error: invalid argument "emails=([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+),passwords=\\bpassword\\b\\s*.{0,10}$" for "--regexes" flag: 10}$ must be formatted as key=value

Description:

Upon further investigation, this seems to be related to how the GetStringToString function at line 178 parses its input. Its implementation calls the stringToStringConv function, which treats the string as CSV and therefore splits it on every comma, including the one inside the {0,10} quantifier. As a result, ss ends up as follows:

ss = []string{
    "emails=([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)",
    "passwords=\bpassword\b\s*.{0",
    "10}$",
}

Each element is then parsed as key=value. The last element, 10}$, contains no =, so it is not valid, and that causes the error.
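The behavior described in this report can be reproduced outside the tool with a small sketch that uses only the Go standard library. The `parseKeyValues` function is a hypothetical stand-in that mimics the CSV-then-key=value parsing described above. It is not the flag library's own code.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseKeyValues mimics the parsing described above: read the argument as one
// CSV record, then split each field on the first "=".
func parseKeyValues(arg string) (map[string]string, error) {
	fields, err := csv.NewReader(strings.NewReader(arg)).Read()
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(fields))
	for _, f := range fields {
		kv := strings.SplitN(f, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("%s must be formatted as key=value", f)
		}
		out[kv[0]] = kv[1]
	}
	return out, nil
}

func main() {
	arg := `emails=([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+),passwords=\bpassword\b\s*.{0,10}$`
	_, err := parseKeyValues(arg)
	fmt.Println(err) // prints: 10}$ must be formatted as key=value
}
```

In this sketch, wrapping the second field in CSV double quotes (`"passwords=\bpassword\b\s*.{0,10}$"`) keeps the comma inside the field. Whether the same quoting works on the webpalm command line is untested here and would also depend on shell quoting.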

Long runs do not respond to Ctrl-C

On macOS Ventura 13.4 (22F66), I'm trying to run:

webpalm -u https://www.nammalonline.co.il/ -l2 -i nammalonline.co.il

If I try to stop the run by pressing ^C, it does not respond.

Regex errors

While running the following command from the documentation (README file):
webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->",emails="([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+)"

I'm getting the following error:

Error: invalid argument "comments=\\<\\!--.*?-->,emails=([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+)" for "--regexes"

Build doesn't run

After installing webpalm, both by cloning and by direct go install, I tried to run the tool with "webpalm -u https://google.com -l1 --live", and it just doesn't run. I also tried running it in a Docker container, and it still didn't work.

runtime: failed to create new OS thread

I'm trying to run:

webpalm -u https://www.nammalonline.co.il/ -l3 -i nammalonline.co.il

After a few seconds I'm getting an error message saying

runtime: failed to create new OS thread

This could be due to the number of pages it is trying to crawl simultaneously, or due to new threads being created faster than the existing ones complete their work. The thread count shown in top jumped from 12K to 18K in a few seconds, and then webpalm (which was using 1800% CPU) crashed.

I think some control on the number of simultaneous threads is required.
