imthaghost / goclone

Website Cloner - Utilizes powerful goroutines to clone websites to your computer within seconds.

Home Page: https://goclone.io

License: MIT License

Go 100.00%

cloning crawler go golang website-cloner website-scraper

goclone's Introduction

Boo!

Just your friendly neighborhood Go dabbler who loves network programming, and contributing to open-source projects!

About Me

I'm a software engineer with 3 years of experience in a variety of domains. For the past few years, I've focused on web application performance/scalability, microservice orchestration, and platform development - tinkering with machine learning, and all things Cloud Native.

~ imthaghost

goclone's People

nhokcrazy199 m11m seandunford xtremebeing forkkit ezorfa mayankbaluni icodus tawawhite hubitor nhienthai 0xforked siddhant-k-code fajarardiyanto strogo dz0ny atlasdatatech jiancode priest671 tibebejs ildaviz brandboxmarketplace bifurcados jenksed cgboal airacks zadspat tgulacsi isoluckynik ahmedsakrr lloydtechlead mojocrash8 5l1v3r1 mrnakumar puper mcvarer officialurmi nenoken warifp 0naga nneock vinhhapm xushiwei mohamedahmedaligpaly scriptidiot andreteixeira1998 u2takey giftmbanda bujar101 jamijalibrary benwaldner yezihack koenix lhx11187 slooppe iamshubhdevstuff chomikmarkus phuong39 ketchalegend mesaglio xfye robsondepaula arnaudallogene ljcoopz acreal1234 mmurillo4sec viobae mjzkumar jermainlaforce yvenone crazybanjo yang-wei anuskumar xilu0 gomberg5264 jeffchan69 htts3at jianjunlu walrushat abiao133 anyanfei liumingmin kidandcat sossbxy classicvalues hugorino11 https-instagram-heygoh-oficial-con peterza2019 ycjcl868 rabbithophop malicious-xv i-want-tobelieve ingramali mocheymouse tm288 flazx ankushom19 bb33bb sec-fork dheerajwps

goclone's Issues

It shows extracting but nothing is saved

url: https://shtheme.com/demosd/foliox

After it finishes, 95% of the files are not saved:

the css folder is empty

the js folder is empty

most of the images get saved, but some are missing

Instagram

https://www.instagram.com

Unable to bypass Cloudflare anti-bot

Thanks for the wonderful work. However, when I try to clone a website protected by Cloudflare anti-bot, the software fails to bypass the protection and so cannot retrieve the real website.

HTTP:/ruidosocustombuilders.com

make install instructions clearer

For those who don't use 'go', it's a tricky thing to google. I wasn't aware this was related to the Go language and assumed it was some other utility program.

Then, the instructions list the out-of-date way of using the go get command first, and don't make it immediately clear that alternative instructions follow.

Manual install instructions outdated.

Following the project readme's manual install instructions failed.

Go Version: 1.18.4

Command ran + output:

go get github.com/imthaghost/goclone

go: go.mod file not found in current directory or any parent directory.
        'go get' is no longer supported outside a module.
        To build and install a command, use 'go install' with a version,
        like 'go install example.com/cmd@latest'
        For more information, see https://golang.org/doc/go-get-install-deprecation
        or run 'go help get' or 'go help install'.

Alternative that also failed:

go install github.com/imthaghost/goclone@latest

Where is the website I cloned saved?

Aaaaaaaaaaaaaaaaaaaaa HELPHELPHELP

Where is it saved?

After running: goclone https://configtree.co
which path did goclone save the files to?

How do we do it for websites that need a login?

clone other types of files

I also want to clone files with other extensions. I'm trying to help someone troubleshoot a website they're building on another platform, and it's not easy to work on their stuff and would be easier to work with raw files for me. Their server doesn't have permissive CORS headers, though, and the site owner has no idea how to change that. They specifically have several files with .glb and .gltf extensions.

Would be nice to be able to specify extensions to include.

Does not copy zerohedge

Tried this today on zerohedge.com. It works alright, but the result is not a carbon copy of the website.

This looks like it should clone any website. Did I do something wrong?

HTTPS://ruidosocustombuilders.com

The cloning result is only the home page: no CSS, and no links work

Edit

Usage: login Instagram

Usage:
goclone https://www.instagram.com/ [flags]

Flags:
-C, --cookie strings Pre-set these cookies
-h, --help help for goclone
-o, --open Automatically open project in deafult browser
-p, --proxy_string string Proxy connection string. Support http and socks5 https://pkg.go.dev/github.com/gocolly/colly#Collector.SetProxy
-s, --serve Serve the generated files using Echo.
-u, --user_agent string Custom User Agent

Originally posted by @Hugorino11 in #36 (comment)

403 Error

I get "403 - Forbidden | Access to this page is forbidden." when trying to open the created URL. Any ideas?

Base64 image fails when trying to download

goclone panics when it finds a data:image/png;base64, value and tries to download it.
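
A data: value already carries the image bytes inline, so there is nothing to download. The sketch below shows one way such a value could be decoded with the Go standard library instead of being passed to an HTTP client. The helper name decodeDataURI is hypothetical and not part of goclone, and only the base64 form is handled.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// decodeDataURI splits a base64 data URI such as
// "data:image/png;base64,iVBOR..." into its media type and raw bytes.
// Hypothetical helper for illustration; it is not part of goclone.
func decodeDataURI(uri string) (mediaType string, data []byte, err error) {
	if !strings.HasPrefix(uri, "data:") {
		return "", nil, errors.New("not a data URI")
	}
	rest := strings.TrimPrefix(uri, "data:")
	comma := strings.IndexByte(rest, ',')
	if comma < 0 {
		return "", nil, errors.New("data URI has no comma separator")
	}
	meta, payload := rest[:comma], rest[comma+1:]
	if !strings.HasSuffix(meta, ";base64") {
		return "", nil, errors.New("only base64 data URIs are handled in this sketch")
	}
	mediaType = strings.TrimSuffix(meta, ";base64")
	data, err = base64.StdEncoding.DecodeString(payload)
	if err != nil {
		return "", nil, fmt.Errorf("decoding payload: %w", err)
	}
	return mediaType, data, nil
}

func main() {
	mediaType, data, err := decodeDataURI("data:text/plain;base64,aGVsbG8=")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %q\n", mediaType, data) // prints: text/plain: "hello"
}
```

The decoded bytes could then be written to a file whose extension is chosen from the media type.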

Why can't I find the src directory in my $GOPATH, even though I can find goclone in the bin folder?

My newly installed go version is go1.16.9

I can only download one web page by using the command. Can't the crawler traverse all the pages of the website?

Does not work with relative URLs

Css found --> /css/bootstrap.min.css
Extracting -->  /css/bootstrap.min.css
panic: Get "/css/bootstrap.min.css": unsupported protocol scheme ""

goroutine 6 [running]:
github.com/imthaghost/goclone/crawler.Extractor({0xc0000e86d8, 0x16}, {0xc000018750, 0x27})
        C:/Users/User/go/pkg/mod/github.com/imthaghost/[email protected]/crawler/extractor.go:35 +0x285
github.com/imthaghost/goclone/crawler.Collector.func1(0xc000089860)
        C:/Users/User/go/pkg/mod/github.com/imthaghost/[email protected]/crawler/collector.go:25 +0xe7
github.com/gocolly/colly.(*Collector).handleOnHTML.func1(0xc0000f4010, 0xc0003a43f0)
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:963 +0x78
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc0003a43c0, 0xc000073e40)
        C:/Users/User/go/pkg/mod/github.com/!puerkito!bio/[email protected]/iteration.go:10 +0x46
github.com/gocolly/colly.(*Collector).handleOnHTML(0xc0001a0b60, 0xc0000e6000)
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:953 +0x24d
github.com/gocolly/colly.(*Collector).fetch(0xc0001a0b60, {0x0, 0x0}, {0x107bd31, 0x3}, 0x1, {0x0, 0x0}, 0x0, 0xc0002742a0, ...)
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:623 +0x68d
created by github.com/gocolly/colly.(*Collector).scrape
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:532 +0x556
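
The panic above comes from requesting "/css/bootstrap.min.css" as if it were a full URL. A minimal sketch of the usual remedy, resolving each link against the URL of the page it was found on with net/url, is shown here. The helper name resolveLink and the example.com addresses are assumptions for illustration and do not come from goclone.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns a link found on a page into an absolute URL, using the
// page's own URL as the base. Hypothetical helper, not part of goclone.
func resolveLink(pageURL, link string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parsing page URL: %w", err)
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", fmt.Errorf("parsing link: %w", err)
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/blog/index.html"
	links := []string{
		"/css/bootstrap.min.css",   // root-relative
		"img/logo.png",             // relative to the page's folder
		"//cdn.example.net/app.js", // scheme-relative
	}
	for _, link := range links {
		abs, err := resolveLink(page, link)
		if err != nil {
			fmt.Println("skipping", link, "-", err)
			continue
		}
		fmt.Println(link, "->", abs)
	}
}
```

colly's Request type also has an AbsoluteURL method that performs this kind of resolution inside a callback.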

Recursive clone

I realize this might be dangerous if someone haphazardly attempts Wikipedia or something, but it would be nice to have an -r flag to recursively follow all tags and clone subpages.
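
One way to keep a recursive clone from running away is to stay on the starting host and cap the number of link hops. The sketch below illustrates that idea as a breadth-first walk. The crawl function, its maxDepth parameter, and the linksOf callback are hypothetical and do not describe an existing goclone option; the in-memory site map stands in for real pages.

```go
package main

import (
	"fmt"
	"net/url"
)

// crawl visits start and every same-host page reachable from it, breadth
// first, following links at most maxDepth hops away from start. linksOf is
// supplied by the caller and returns the absolute links found on a page.
// Hypothetical sketch; goclone has no such function.
func crawl(start string, maxDepth int, linksOf func(page string) []string) []string {
	startURL, err := url.Parse(start)
	if err != nil {
		return nil
	}
	type item struct {
		page  string
		depth int
	}
	visited := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var order []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur.page)
		if cur.depth >= maxDepth {
			continue // depth limit reached: do not follow this page's links
		}
		for _, link := range linksOf(cur.page) {
			u, err := url.Parse(link)
			if err != nil || u.Host != startURL.Host || visited[link] {
				continue // skip unparsable, off-site, and already queued links
			}
			visited[link] = true
			queue = append(queue, item{link, cur.depth + 1})
		}
	}
	return order
}

func main() {
	// A tiny in-memory site so the example runs without network access.
	site := map[string][]string{
		"https://example.com/":  {"https://example.com/a", "https://example.com/b", "https://en.wikipedia.org/"},
		"https://example.com/a": {"https://example.com/", "https://example.com/c"},
		"https://example.com/c": {"https://example.com/d"},
	}
	pages := crawl("https://example.com/", 2, func(p string) []string { return site[p] })
	for _, p := range pages {
		fmt.Println(p)
	}
}
```

With a depth of 2 the walk reaches /, /a, /b and /c, never leaves example.com, and stops before /d.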

Error while running

No command goclone found, did you mean:
Command rclone in package rclone

Cloning

Please release a binary for Windows too.

How to use goclone for site search?

How can I use goclone to extract the content of internal websites? I can put it in a struct like the one below, and then send the JSON to Elasticsearch to build site-search functionality.

P.S: I'm new to golang. :)

import (
	"fmt"
	"time"
)

/**
Plan : Import this struct in a crawling program, extract just text content no images/js, index for sitesearch.
**/
type WebPage struct {
	Id      string    `json:"id"`      // some id string or number
	Url     string    `json:"url"`     // URL of the page to index
	Title   string    `json:"title"`   // Title of the page
	Content string    `json:"content"` // Page content //TODO remove javascript, try to extract only core content
	Time    time.Time `json:"time"`    // TODO timestamp of page creation time
}

// Print writes a short summary of the page to stdout with the content redacted.
func (document *WebPage) Print() {
	// enc := json.NewEncoder(os.Stdout)
	// enc.SetIndent("", "  ")
	// document.Content = ""
	// enc.Encode(document)
	fmt.Printf("page:  {\n  title: %s, \n  url : %s, \n  content:%s, \n  time:%s \n}\n", document.Title, document.Url, "-redacted-", document.Time)
}
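
The JSON step mentioned in the question can be sketched with encoding/json. The example below repeats the WebPage struct so that it compiles on its own and adds a hypothetical toJSON helper; how the text content is extracted from a crawled page is not shown, and the field values are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// WebPage mirrors the struct from the question.
type WebPage struct {
	Id      string    `json:"id"`
	Url     string    `json:"url"`
	Title   string    `json:"title"`
	Content string    `json:"content"`
	Time    time.Time `json:"time"`
}

// toJSON serializes one page into the JSON document a search index could
// ingest. Hypothetical helper for illustration.
func toJSON(p WebPage) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", fmt.Errorf("encoding page %s: %w", p.Url, err)
	}
	return string(b), nil
}

func main() {
	page := WebPage{
		Id:      "1",
		Url:     "https://intranet.example/",
		Title:   "Home",
		Content: "Welcome to the intranet",
		Time:    time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC),
	}
	doc, err := toJSON(page)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(doc)
}
```

The struct tags decide the JSON field names, and time.Time values are encoded as RFC 3339 strings.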

goclone: command not found

I think I did all the steps:
1> Installed Go and added the PATH environment variable in $HOME/.profile
go version go1.21.6 linux/amd64

2> Did the manual installation of goclone
go install github.com/imthaghost/goclone/cmd/goclone@latest

3> Then tried 'goclone <example.com>' and got the "command not found" message.

What am I missing?
Thanks!

cookies

how do you pass multiple cookies?

Panic when DNS resolve issues

I am running a Pi-hole and thought I would try goclone against some websites, and encountered the following colly issue

$ goclone https://searchcode.com/
Extracting -->  https://searchcode.com/
Css found --> /static/css/newstyles.css
Extracting -->  https://searchcode.com/static/css/newstyles.css
Js found --> //cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=searchcodecom
Extracting -->  https://cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=searchcodecom
panic: Get "https://cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=searchcodecom": dial tcp 0.0.0.0:443: connect: connection refused

goroutine 35 [running]:
github.com/imthaghost/goclone/pkg/crawler.Extractor({0x14000407c00, 0x55}, {0x140003a6000, 0x21})
	/Users/ghost/go/src/github.com/imthaghost/goclone/pkg/crawler/extractor.go:35 +0x24c
github.com/imthaghost/goclone/pkg/crawler.Collector.func2(0x140004aec60)
	/Users/ghost/go/src/github.com/imthaghost/goclone/pkg/crawler/collector.go:37 +0x120
github.com/gocolly/colly/v2.(*Collector).handleOnHTML.func1(0x0, 0x140004a1560)
	/Users/ghost/go/pkg/mod/github.com/gocolly/colly/[email protected]/colly.go:1074 +0x70
github.com/PuerkitoBio/goquery.(*Selection).Each(0x140004a1530, 0x14000073e30)
	/Users/ghost/go/pkg/mod/github.com/!puerkito!bio/[email protected]/iteration.go:10 +0x50
github.com/gocolly/colly/v2.(*Collector).handleOnHTML(0x140003ac000, 0x140003c06c0)
	/Users/ghost/go/pkg/mod/github.com/gocolly/colly/[email protected]/colly.go:1064 +0x288
github.com/gocolly/colly/v2.(*Collector).fetch(0x140003ac000, {0x140003a4060, 0x17}, {0x10531f364, 0x3}, 0x1, {0x0, 0x0}, 0x0, 0x1400038c210, ...)
	/Users/ghost/go/pkg/mod/github.com/gocolly/colly/[email protected]/colly.go:676 +0x7a0
created by github.com/gocolly/colly/v2.(*Collector).scrape
	/Users/ghost/go/pkg/mod/github.com/gocolly/colly/[email protected]/colly.go:574 +0x43c

This only occurs when running against websites that have blocked content, which then throws the above. While portions of the site are still cloned, such an error seems like something that should be handled.

Disabling the Pi-hole resolves the issue. While I understand Pi-hole is not the expected path, I imagine DNS might be configured in some cases and produce something like the above.
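
A minimal sketch of the kind of handling suggested here: each asset is attempted, a failure is recorded, and the loop moves on instead of panicking. The names downloadAll, httpFetch and errBlocked are hypothetical and are not taken from goclone's crawler package; a stub replaces the network so the example runs offline.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// errBlocked imitates the failure a DNS sinkhole produces for a blocked host.
var errBlocked = errors.New("connect: connection refused")

// downloadAll fetches every URL with fetch. It returns the bodies that were
// retrieved plus one error per URL that failed, so a single unreachable
// asset does not abort the whole clone.
func downloadAll(urls []string, fetch func(string) ([]byte, error)) (map[string][]byte, []error) {
	bodies := make(map[string][]byte)
	var failures []error
	for _, u := range urls {
		body, err := fetch(u)
		if err != nil {
			failures = append(failures, fmt.Errorf("skipping %s: %w", u, err))
			continue
		}
		bodies[u] = body
	}
	return bodies, failures
}

// httpFetch is one possible fetch function built on net/http. It is not
// called in main; pass it to downloadAll to use the network.
func httpFetch(u string) ([]byte, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// The stub fails for the ad host, as in the report above.
	stub := func(u string) ([]byte, error) {
		if strings.Contains(u, "cdn.carbonads.com") {
			return nil, errBlocked
		}
		return []byte("/* asset body */"), nil
	}
	urls := []string{
		"https://searchcode.com/static/css/newstyles.css",
		"https://cdn.carbonads.com/carbon.js",
	}
	bodies, failures := downloadAll(urls, stub)
	fmt.Println("saved:", len(bodies))
	for _, err := range failures {
		fmt.Println(err)
	}
}
```

Collecting the failures also makes it possible to print a summary of missing assets at the end of a run.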

[Feature Request] Support for blob and data

Hey there, thanks for this, looks great so far!

Got this error while downloading a site...

Img found --> blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1
Extracting -->  blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1
panic: Get "blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1": unsupported protocol scheme "blob"

and it died. Feature request, support for "blob" and "data"

unsupported protocol scheme "data"

I added in a string replace for now but I'm not good enough at go to be worthy of a PR yet. :-)

Thanks!
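
Until blob: and data: values are supported, a crawler could at least skip them instead of panicking. The sketch below checks the scheme before any request is made. The helper name isFetchable is hypothetical and is not the string replace mentioned above.

```go
package main

import (
	"fmt"
	"net/url"
)

// isFetchable reports whether link is an absolute http or https URL, the
// only kinds a plain HTTP client can request. Relative links return false
// here and would need to be resolved against the page URL first.
// Hypothetical helper, not part of goclone.
func isFetchable(link string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	return u.Scheme == "http" || u.Scheme == "https"
}

func main() {
	links := []string{
		"https://twitter.com/favicon.ico",
		"blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1",
		"data:image/png;base64,iVBORw0KGgo=",
	}
	for _, link := range links {
		if !isFetchable(link) {
			fmt.Println("skipping", link)
			continue
		}
		fmt.Println("would download", link)
	}
}
```

A blob: URL only has meaning inside the browser tab that created it, so skipping is the most a command-line tool can do with one.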

index.html doesn't show the website

Hey, so instead of showing the website, index.html shows me code. Maybe I'm stupid and don't understand something, but I'm new to coding. Could you help?

goclone doesn't clone a site

I was experimenting with your 'goclone' and it seems to only clone the home page, not an entire site.
It doesn't seem to even attempt to crawl site links to clone more pages of the site.

how to uninstall?

It does not work; it can clone some websites but not all.

How do I uninstall all the installed packages? Where are they downloaded to, just somewhere in the main folder?

I don't know how to even download it

You say brew tap, then download.
What is brew tap? Can someone help me download it?

Not on MacPorts

The installable is not available on MacPorts. Brew isn't compatible with most Macs.

Permission Denied

After installing goclone I tried to copy a website, but it shows the error below. I am using WSL on Windows 10. Is there a way to resolve this, or am I doing something wrong?

2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 open /mnt/c/Windows/system32/tesla.com/index.html: no such file or directory
Extracting --> https://tesla.com
panic: open /mnt/c/Windows/system32/tesla.com/index.html: no such file or directory

goroutine 18 [running]:
github.com/imthaghost/goclone/crawler.HTMLExtractor(0xc0002a42a0, 0x11, 0xc0000fa180, 0x21)
/Users/ghost/go/src/github.com/imthaghost/goclone/crawler/html.go:26 +0x2d3
github.com/imthaghost/goclone/crawler.Collector.func4(0xc000198780)
/Users/ghost/go/src/github.com/imthaghost/goclone/crawler/collector.go:52 +0xc4
github.com/gocolly/colly.(*Collector).handleOnRequest(0xc0000bb860, 0xc000198780)
/Users/ghost/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:924 +0x66
github.com/gocolly/colly.(*Collector).fetch(0xc0000bb860, 0xc0002a4260, 0x11, 0xa85a26, 0x3, 0x1, 0x0, 0x0, 0xc0001df7f0, 0xc0002b8000, ...)
/Users/ghost/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:586 +0x176
created by github.com/gocolly/colly.(*Collector).scrape
/Users/ghost/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:532 +0x3b1
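
In the log above, the mkdir calls fail inside /mnt/c/Windows/system32, the run continues anyway, and the later attempt to open index.html panics. A minimal sketch of stopping at the first failure instead is shown below. The helper name prepareProjectDir and the css, js and imgs folder names are assumptions for illustration, not goclone's actual layout code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// prepareProjectDir creates parent/name with css, js and imgs subfolders and
// returns the first error, so the caller can stop before writing any files.
// Hypothetical helper; the folder names are assumptions.
func prepareProjectDir(parent, name string) (string, error) {
	root := filepath.Join(parent, name)
	for _, sub := range []string{"css", "js", "imgs"} {
		if err := os.MkdirAll(filepath.Join(root, sub), 0o755); err != nil {
			return "", fmt.Errorf("cannot create project folder: %w", err)
		}
	}
	return root, nil
}

// demo exercises the helper inside a throwaway temporary directory.
func demo() error {
	tmp, err := os.MkdirTemp("", "clone-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tmp)

	root, err := prepareProjectDir(tmp, "tesla.com")
	if err != nil {
		return err
	}
	fmt.Println("created", root)

	// A regular file cannot act as a parent directory, so this call must fail.
	blocker := filepath.Join(tmp, "blocker")
	if err := os.WriteFile(blocker, nil, 0o644); err != nil {
		return err
	}
	if _, err := prepareProjectDir(blocker, "tesla.com"); err == nil {
		return fmt.Errorf("expected an error for parent %s", blocker)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
}
```

Running the command from a folder the current user can write to, such as the home directory, avoids the permission error itself.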

It doesn't work when I clone a website that uses an AMP template

When I clone a website built with AMP, the image assets referenced through the AMP tag are not downloaded

panic open

panic: open C:\Users\difar\Downloads\goclone-master\goclone-master\cmd/www.censor.id/imgs/image?url=https%3A%2F%2Fwww.censor.id%2Fsites%2Fdefault%2Ffiles%2F2023-07%2Fcensor_img_230731212052.png: The filename, directory name, or volume label syntax is incorrect.
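
The file name in this path contains a "?", which Windows does not allow in file names. A sketch of deriving a safe local name from an asset URL follows. The helper name safeFileName is hypothetical, not part of goclone, and the example.com URL is made up.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// safeFileName builds a local file name from an asset URL. It keeps the last
// path segment and the query string, then replaces control characters and
// the characters Windows rejects in file names with "_".
// Hypothetical helper, not part of goclone.
func safeFileName(rawURL string) string {
	name := rawURL
	if u, err := url.Parse(rawURL); err == nil {
		name = path.Base(u.Path)
		if u.RawQuery != "" {
			name += "?" + u.RawQuery
		}
	}
	return strings.Map(func(r rune) rune {
		if r < 32 || strings.ContainsRune(`<>:"/\|?*`, r) {
			return '_'
		}
		return r
	}, name)
}

func main() {
	fmt.Println(safeFileName("https://example.com/imgs/image?url=a%2Fb.png")) // image_url=a%2Fb.png
	fmt.Println(safeFileName("https://example.com/css/site.css"))             // site.css
}
```

This sketch does not shorten very long names, so a real implementation might also cap the length or hash the query string.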