
goclone's Introduction

Boo!

Just your friendly neighborhood Go dabbler who loves network programming, and contributing to open-source projects!



About Me

I'm a software engineer with 3 years of experience in a variety of domains. For the past few years, I've focused on web application performance/scalability, microservice orchestration, and platform development - tinkering with machine learning, and all things Cloud Native.

~ imthaghost

goclone's People

Contributors

dependabot[bot], imthaghost, joanbono-bf, koenix, lpmi-13, mesaglio, npalumbo, omarsagoo, tempor1s, thanhkaiba

goclone's Issues

goclone doesn't clone a site

I was experimenting with your 'goclone' and it seems to only clone the home page, not an entire site.
It doesn't seem to even attempt to crawl site links to clone more pages of the site.

Does not copy zerohedge

Tried this today on zerohedge.com. It works alright, but the result is not a carbon copy of the website.

This looks like it should clone any website. Did I do something wrong?

Unable to bypass cloudflare anti-bot

Thanks for the wonderful work. However, when I try to clone a website protected by Cloudflare anti-bot, this software fails to bypass the protection, so it is not able to retrieve the real website content.

It shows extracting but nothing is saved

url: https://shtheme.com/demosd/foliox

Then, after it's done, 95% of the files are not saved.

css folder (screenshot omitted)

The js folder is empty.

Most of the images tend to get saved, but some are missing.

403 Error

I get "403 - Forbidden | Access to this page is forbidden." when trying to open the created URL. Any ideas?

make install instructions clearer

For those who don't use Go, the word 'go' is tricky to google. I wasn't aware this was related to the Go language and assumed it was some other utility program.

Then, the instructions list the out-of-date way to use the go get command first, and don't make it immediately clear that alternative instructions come afterwards.

Does not work with relative URLs

Css found --> /css/bootstrap.min.css
Extracting -->  /css/bootstrap.min.css
panic: Get "/css/bootstrap.min.css": unsupported protocol scheme ""

goroutine 6 [running]:
github.com/imthaghost/goclone/crawler.Extractor({0xc0000e86d8, 0x16}, {0xc000018750, 0x27})
        C:/Users/User/go/pkg/mod/github.com/imthaghost/[email protected]/crawler/extractor.go:35 +0x285
github.com/imthaghost/goclone/crawler.Collector.func1(0xc000089860)
        C:/Users/User/go/pkg/mod/github.com/imthaghost/[email protected]/crawler/collector.go:25 +0xe7
github.com/gocolly/colly.(*Collector).handleOnHTML.func1(0xc0000f4010, 0xc0003a43f0)
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:963 +0x78
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc0003a43c0, 0xc000073e40)
        C:/Users/User/go/pkg/mod/github.com/!puerkito!bio/[email protected]/iteration.go:10 +0x46
github.com/gocolly/colly.(*Collector).handleOnHTML(0xc0001a0b60, 0xc0000e6000)
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:953 +0x24d
github.com/gocolly/colly.(*Collector).fetch(0xc0001a0b60, {0x0, 0x0}, {0x107bd31, 0x3}, 0x1, {0x0, 0x0}, 0x0, 0xc0002742a0, ...)
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:623 +0x68d
created by github.com/gocolly/colly.(*Collector).scrape
        C:/Users/User/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:532 +0x556
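
The panic above comes from passing a path such as /css/bootstrap.min.css straight to an HTTP GET, which has no scheme or host. A minimal sketch of one way to handle this, assuming the crawler knows the URL of the page the asset was found on: resolve the reference against that page URL with Go's net/url package. The function name resolveAsset and the example.com URLs are hypothetical and are not taken from goclone's code.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAsset turns a possibly relative asset reference into an absolute
// URL by resolving it against the URL of the page it was found on.
func resolveAsset(pageURL, ref string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parse page URL: %w", err)
	}
	rel, err := url.Parse(ref)
	if err != nil {
		return "", fmt.Errorf("parse asset reference: %w", err)
	}
	// ResolveReference follows RFC 3986: absolute references are returned
	// unchanged, relative ones inherit scheme and host from base.
	return base.ResolveReference(rel).String(), nil
}

func main() {
	abs, err := resolveAsset("https://example.com/blog/post.html", "/css/bootstrap.min.css")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(abs) // prints https://example.com/css/bootstrap.min.css
}
```

Returning an error instead of panicking also lets a crawler skip one bad reference and keep going.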

goclone: command not found

I think I did all the steps:
1> Installed Go and added the PATH environment variable in $HOME/.profile
go version go1.21.6 linux/amd64

2> Did the manual installation of goclone
go install github.com/imthaghost/goclone/cmd/goclone@latest

3> Then tried the 'goclone <example.com>' and got the "command not found" message.

What am I missing?
Thanks!

Recursive clone

I realize this might be dangerous if someone haphazardly attempts Wikipedia or something, but it would be nice to have an -r flag to recursively follow all tags and clone subpages.
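
A rough sketch of how the danger mentioned above could be bounded, assuming a recursive mode would stay on the starting host and stop after a fixed number of link hops. Nothing here is goclone's implementation: crawl, maxDepth, and the linksOf callback (a stand-in for fetching a page and extracting its links) are all hypothetical, and the site is simulated with an in-memory map.

```go
package main

import (
	"fmt"
	"net/url"
)

// crawl visits pages breadth-first from start, following only links on the
// same host and stopping maxDepth hops away from start. linksOf stands in
// for "download the page and return the links found in it".
func crawl(start string, maxDepth int, linksOf func(page string) []string) []string {
	startURL, err := url.Parse(start)
	if err != nil {
		return nil
	}
	type item struct {
		page  string
		depth int
	}
	seen := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var visited []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.page)
		if cur.depth == maxDepth {
			continue // depth limit reached: do not expand this page
		}
		for _, link := range linksOf(cur.page) {
			u, err := url.Parse(link)
			if err != nil || u.Host != startURL.Host || seen[link] {
				continue // skip bad URLs, other hosts, and repeats
			}
			seen[link] = true
			queue = append(queue, item{link, cur.depth + 1})
		}
	}
	return visited
}

func main() {
	// Simulated site: page URL -> links found on that page.
	site := map[string][]string{
		"https://example.com/":  {"https://example.com/a", "https://other.org/x"},
		"https://example.com/a": {"https://example.com/b", "https://example.com/"},
		"https://example.com/b": {"https://example.com/c"},
	}
	pages := crawl("https://example.com/", 2, func(p string) []string { return site[p] })
	fmt.Println(pages) // prints [https://example.com/ https://example.com/a https://example.com/b]
}
```

The seen set prevents loops between pages that link to each other, and the host check keeps the crawl from wandering off to external sites.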

Where it is saved

After running: goclone https://configtree.co
What path does goclone save the files to?

panic open

panic: open C:\Users\difar\Downloads\goclone-master\goclone-master\cmd/www.censor.id/imgs/image?url=https%3A%2F%2Fwww.censor.id%2Fsites%2Fdefault%2Ffiles%2F2023-07%2Fcensor_img_230731212052.png: The filename, directory name, or volume label syntax is incorrect.
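
The file name in this path contains a "?" taken from the image URL's query string, and Windows does not allow that character in file names. A minimal sketch of one possible workaround, assuming the local file name is derived from the URL: replace the characters Windows rejects before creating the file. The helper sanitizeFileName and the shortened example name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeFileName replaces characters that Windows rejects in file names
// (< > : " / \ | ? *) and ASCII control characters with an underscore.
func sanitizeFileName(name string) string {
	return strings.Map(func(r rune) rune {
		if r < 32 || strings.ContainsRune(`<>:"/\|?*`, r) {
			return '_'
		}
		return r
	}, name)
}

func main() {
	fmt.Println(sanitizeFileName("image?url=https%3A%2F%2Fwww.censor.id%2Fimg.png"))
	// prints image_url=https%3A%2F%2Fwww.censor.id%2Fimg.png
}
```

This only applies to a single path component; directory separators in the rest of the path would need to be handled separately.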

cookies

How do you pass multiple cookies?

Index.html doesn't show website

Hey, so instead of showing the website, index.html shows me code. Maybe I'm stupid and don't understand something, but I'm new to coding. Could you help?
lol

Permission Denied

After installing goclone I tried to copy a website, but it shows the error below. I am using WSL on Windows 10. Is there a way to resolve this, or am I doing something wrong?

2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 mkdir /mnt/c/Windows/system32/tesla.com: permission denied
2021/08/03 10:22:39 open /mnt/c/Windows/system32/tesla.com/index.html: no such file or directory
Extracting --> https://tesla.com
panic: open /mnt/c/Windows/system32/tesla.com/index.html: no such file or directory

goroutine 18 [running]:
github.com/imthaghost/goclone/crawler.HTMLExtractor(0xc0002a42a0, 0x11, 0xc0000fa180, 0x21)
/Users/ghost/go/src/github.com/imthaghost/goclone/crawler/html.go:26 +0x2d3
github.com/imthaghost/goclone/crawler.Collector.func4(0xc000198780)
/Users/ghost/go/src/github.com/imthaghost/goclone/crawler/collector.go:52 +0xc4
github.com/gocolly/colly.(*Collector).handleOnRequest(0xc0000bb860, 0xc000198780)
/Users/ghost/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:924 +0x66
github.com/gocolly/colly.(*Collector).fetch(0xc0000bb860, 0xc0002a4260, 0x11, 0xa85a26, 0x3, 0x1, 0x0, 0x0, 0xc0001df7f0, 0xc0002b8000, ...)
/Users/ghost/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:586 +0x176
created by github.com/gocolly/colly.(*Collector).scrape
/Users/ghost/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:532 +0x3b1

How to use goclone for site-search ?

How can I use goclone to extract the content of internal websites? I can put it in a struct like the one below, and then inject the JSON into Elasticsearch to build site search functionality.

P.S: I'm new to golang. :)

// Plan: import this struct in a crawling program, extract just the text
// content (no images/JS), and index it for site search.
import (
	"fmt"
	"time"
)

// WebPage is one document to be indexed.
type WebPage struct {
	Id      string    `json:"id"`      // some id string or number
	Url     string    `json:"url"`     // URL of the page to index
	Title   string    `json:"title"`   // Title of the page
	Content string    `json:"content"` // Page content // TODO remove javascript, try to extract only core content
	Time    time.Time `json:"time"`    // TODO timestamp of page creation time
}

// Print writes a short summary of the page to stdout with the content redacted.
func (document *WebPage) Print() {
	// JSON alternative (needs encoding/json and os):
	// enc := json.NewEncoder(os.Stdout)
	// enc.SetIndent("", "  ")
	// document.Content = ""
	// enc.Encode(document)
	fmt.Printf("page:  {\n  title: %s, \n  url : %s, \n  content:%s, \n  time:%s \n}\n", document.Title, document.Url, "-redacted-", document.Time)
}

Installation Error

Going step by step through the installation manual, I got stuck on this issue. I attached a screenshot of it.

C:\Users\hp\go/src/github.com/imthaghost/goclone/cmd/goclone

Screenshot 2021-08-05 231006

Manual install instructions outdated.

Following the project readme's manual install instructions failed.

Go Version: 1.18.4

Command run + output:

go get github.com/imthaghost/goclone

go: go.mod file not found in current directory or any parent directory.
        'go get' is no longer supported outside a module.
        To build and install a command, use 'go install' with a version,
        like 'go install example.com/cmd@latest'
        For more information, see https://golang.org/doc/go-get-install-deprecation
        or run 'go help get' or 'go help install'.

Alternative that also failed:

go install github.com/imthaghost/goclone@latest

clone other types of files

I also want to clone files with other extensions. I'm trying to help someone troubleshoot a website they're building on another platform, and it's not easy to work on their stuff and would be easier to work with raw files for me. Their server doesn't have permissive CORS headers, though, and the site owner has no idea how to change that. They specifically have several files with .glb and .gltf extensions.

Would be nice to be able to specify extensions to include.
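
A small sketch of what such an extension filter could look like, assuming the crawler sees each candidate URL before downloading it. The function wantedExtension and its allowed list are hypothetical and do not correspond to an existing goclone option.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// wantedExtension reports whether the path part of rawURL ends in one of
// the allowed extensions. The comparison ignores case and any query string.
func wantedExtension(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	ext := strings.ToLower(path.Ext(u.Path))
	if ext == "" {
		return false
	}
	for _, a := range allowed {
		if ext == strings.ToLower(a) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{".glb", ".gltf"}
	fmt.Println(wantedExtension("https://example.com/models/chair.GLB?v=2", allowed)) // prints true
	fmt.Println(wantedExtension("https://example.com/css/style.css", allowed))        // prints false
}
```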

[Feature Request] Support for blob and data

Hey there, thanks for this, looks great so far!

Got this error while downloading a site...

Img found --> blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1
Extracting -->  blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1
panic: Get "blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1": unsupported protocol scheme "blob"

and it died. Feature request, support for "blob" and "data"

unsupported protocol scheme "data"

I added in a string replace for now but I'm not good enough at go to be worthy of a PR yet. :-)

Thanks!
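
For reference, blob: and data: URLs cannot be fetched over HTTP at all: a blob: URL only exists inside the browser session that created it, and a data: URL already carries its content inline. A minimal sketch of a guard that avoids the panic, assuming references are checked before being handed to an HTTP client and that relative references have already been resolved to absolute ones. The name isFetchable is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// isFetchable reports whether ref is an absolute http or https URL.
// Anything else (blob:, data:, mailto:, javascript:, ...) is rejected so
// the caller can skip it instead of passing it to an HTTP GET.
func isFetchable(ref string) bool {
	u, err := url.Parse(ref)
	if err != nil {
		return false
	}
	// url.Parse lowercases the scheme, so a plain comparison is enough.
	return u.Scheme == "http" || u.Scheme == "https"
}

func main() {
	refs := []string{
		"https://example.com/logo.png",
		"blob:https://twitter.com/c86b1e90-025d-4bfa-9844-adb20bc84bf1",
		"data:image/png;base64,AAAA",
	}
	for _, r := range refs {
		fmt.Println(isFetchable(r))
	}
	// prints true, false, false (one per line)
}
```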

how to uninstall?

It does not work: it can clone some websites but not all.

How do I uninstall all the installed packages? Where are they downloaded to, just somewhere in the main folder?
