go-shiori / obelisk

Go package and CLI tool for saving a web page as a single HTML file

License: MIT License

Go 100.00%

go golang cli archive hacktoberfest

obelisk's Issues

Download media files from the page

Wrapping all resources in a directory should be sufficient for this. Archiving media into a single file, however, needs an approach that has yet to be worked out.

Add support for specifying a path for archive resources

The proposed flag is -directory, with -d as the short form.
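
As a rough illustration of the proposal, the sketch below binds both spellings to one variable using Go's standard flag package. This is an assumption made for the example: it does not show how obelisk actually defines its flags, and the function name parseOutputDir and the usage strings are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// parseOutputDir registers the proposed long flag and its short alias on the
// same variable, so either spelling sets the resource directory.
// Hypothetical helper: obelisk's real CLI wiring may look different.
func parseOutputDir(args []string) (string, error) {
	fs := flag.NewFlagSet("obelisk", flag.ContinueOnError)
	var dir string
	fs.StringVar(&dir, "directory", "", "directory for archive resources")
	fs.StringVar(&dir, "d", "", "shorthand for -directory")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return dir, nil
}

func main() {
	dir, err := parseOutputDir([]string{"-d", "assets", "https://example.com"})
	fmt.Println(dir, err)
}
```

Binding two names to one variable keeps the long and short forms in sync without extra bookkeeping.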

Allow the option to archive with a headless browser

ArchiveBox already does this, and I think ArchiveBox is very nice, but it has two issues:

it is slow, which is not a big deal;
custom automation for special pages (lazy loading, for example) is missing, although there is an issue working on it.

I also found a great Go library, rod. How about adding a mode that uses headless (or headful, depending on the case) Chromium?

--load-cookies doesn't appear to do anything

A site I'm trying to back up has a prompt that prevents you from seeing the actual page until you accept the terms. I saved a Netscape cookie file with the Chrome extension 'Export cookies.txt' and tried both obelisk -c cookies.txt <url> and obelisk <url> -c cookies.txt, but neither worked. The --verbose output doesn't mention using any cookies either.
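
For reference when debugging this, here is a minimal sketch of how a Netscape cookies.txt file could be read in Go. It assumes the common seven tab-separated fields per line (domain, subdomain flag, path, secure, expiry, name, value) and the #HttpOnly_ prefix convention. The function parseNetscapeCookies is hypothetical and is not obelisk's implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// parseNetscapeCookies converts cookies.txt content into http.Cookie values.
// Hypothetical helper for illustration only.
func parseNetscapeCookies(content string) ([]*http.Cookie, error) {
	var cookies []*http.Cookie
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := sc.Text()
		httpOnly := false
		switch {
		case strings.HasPrefix(line, "#HttpOnly_"):
			// Not a comment: marks an HttpOnly cookie.
			httpOnly = true
			line = strings.TrimPrefix(line, "#HttpOnly_")
		case strings.TrimSpace(line) == "", strings.HasPrefix(line, "#"):
			continue // blank line or comment
		}
		f := strings.Split(line, "\t")
		if len(f) != 7 {
			return nil, fmt.Errorf("malformed cookie line: %q", line)
		}
		c := &http.Cookie{
			Domain:   f[0],
			Path:     f[2],
			Secure:   strings.EqualFold(f[3], "TRUE"),
			Name:     f[5],
			Value:    f[6],
			HttpOnly: httpOnly,
		}
		// An expiry of 0 denotes a session cookie; leave Expires unset.
		if sec, err := strconv.ParseInt(f[4], 10, 64); err == nil && sec > 0 {
			c.Expires = time.Unix(sec, 0)
		}
		cookies = append(cookies, c)
	}
	return cookies, sc.Err()
}

func main() {
	sample := "# Netscape HTTP Cookie File\n" +
		".example.com\tTRUE\t/\tTRUE\t0\tconsent\tyes\n"
	cookies, err := parseNetscapeCookies(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cookies {
		fmt.Printf("loaded cookie %s for %s\n", c.Name, c.Domain)
	}
}
```

Printing one line per loaded cookie, as main does here, is the kind of output that would make it obvious in verbose mode whether the file was read at all.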

Do we really need unsafe in the code?

We should try this properly though, unsafe is, well... unsafe. If we see any kind of problem with it we can just remove s2b and properly cast to []byte, but I wonder why they chose to use a memory efficient approach here.

Originally posted by @fmartingr in #96 (comment)

For the record: #96 (comment)
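
To make the trade-off concrete, the sketch below contrasts a zero-copy conversion with the plain cast. It is an assumption about what a helper like s2b typically does, not a copy of the project's code. The names s2bUnsafe and s2bSafe are hypothetical, and the unsafe variant relies on unsafe.StringData and unsafe.Slice, which require Go 1.20 or newer.

```go
package main

import (
	"fmt"
	"unsafe"
)

// s2bUnsafe reinterprets the string's backing bytes as a slice without
// copying. The result must never be modified, because strings are immutable.
func s2bUnsafe(s string) []byte {
	if s == "" {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// s2bSafe is the ordinary conversion: it allocates and copies, so the
// returned slice is independent of the string.
func s2bSafe(s string) []byte {
	return []byte(s)
}

func main() {
	b := s2bSafe("obelisk")
	b[0] = 'O' // fine: b is a private copy
	fmt.Println(string(b))

	ro := s2bUnsafe("obelisk")
	fmt.Println(len(ro)) // read-only use is the only safe use
}
```

The unsafe version saves one allocation per call. Whether that matters for obelisk would need profiling; the safe cast is the simpler default.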

Cannot resolve lazy-loaded images [2]

Continuation of the bug from issue #2.

Most often, the src attribute holds a link to a stub (placeholder) image, so I think these lines are redundant:

obelisk/process-html.go

Lines 366 to 368 in 23c015a

	if (src != "" || srcset != "") && !strings.Contains(strings.ToLower(nodeClass), "lazy") {
		return
	}

However, I would not iterate over every attribute; I would read only the values of data-src and data-original. For example, the Yandex search engine supports exactly these:

Images are downloaded using links from the src attribute of the img tag, as well as the data-src and data-original attributes (in this case, the presence of a link to the image in the src attribute is not necessary).
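
A minimal sketch of that idea, assuming the image's attributes are already available as a map: prefer data-src, then data-original, and fall back to src only when neither is set. The function lazyImageSource is hypothetical and is not part of obelisk.

```go
package main

import (
	"fmt"
	"strings"
)

// lazyImageSource picks the URL to download for an <img> element.
// It checks only data-src and data-original before falling back to src,
// instead of scanning every attribute.
func lazyImageSource(attrs map[string]string) string {
	for _, name := range []string{"data-src", "data-original"} {
		if v := strings.TrimSpace(attrs[name]); v != "" {
			return v
		}
	}
	return strings.TrimSpace(attrs["src"])
}

func main() {
	img := map[string]string{
		"src":      "https://example.com/stub.gif",
		"data-src": "https://example.com/photo.jpg",
	}
	fmt.Println(lazyImageSource(img))
}
```

With this ordering a stub in src no longer blocks the real image, and a missing src is not a problem either.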

Option to skip resource download errors?

Hi @RadhiFadlillah,

When I save a web page with Obelisk, the whole run is interrupted if a few errors occur (e.g. some of the images are missing). I think it should continue downloading the remaining resources in this case.

I've handled this situation by adding an option, skip-resource-url-error, that skips the error when a download fails. By default the run is still interrupted (the original behavior).

If you think this is worth pursuing, I will open a PR.
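
The behavior described above can be sketched as follows. This is only an illustration under stated assumptions: downloadAll, fakeFetch and missingURL are hypothetical names, and the fetch function stands in for whatever obelisk uses to download a resource.

```go
package main

import (
	"errors"
	"fmt"
)

const missingURL = "https://example.com/missing.png"

var errNotFound = errors.New("404 not found")

// fakeFetch is a stand-in downloader that fails for one known URL.
func fakeFetch(u string) error {
	if u == missingURL {
		return errNotFound
	}
	return nil
}

// downloadAll fetches every URL. With skipErrors set it records failures
// and keeps going; otherwise it stops at the first error, which matches
// the default behavior described in the issue.
func downloadAll(urls []string, fetch func(string) error, skipErrors bool) ([]string, error) {
	var failed []string
	for _, u := range urls {
		if err := fetch(u); err != nil {
			if !skipErrors {
				return failed, fmt.Errorf("download %s: %w", u, err)
			}
			failed = append(failed, u)
		}
	}
	return failed, nil
}

func main() {
	urls := []string{"https://example.com/a.css", missingURL, "https://example.com/b.js"}
	failed, err := downloadAll(urls, fakeFetch, true)
	fmt.Println("skipped:", failed, "error:", err)
}
```

Returning the list of skipped URLs lets the caller report what is missing from the archive instead of failing silently.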

Add unit tests

Add unit tests for the project and a workflow to run them.

Broken google.com/about

When I try to download https://google.com/about, I get a broken page (missing links, broken formatting, etc.). I'm not sure what's missing, but as a comparison point: when I download the same page with monolith, I get a 56MB file vs 53MB with obelisk.

Cannot resolve lazy-loaded images

Source: https://mp.weixin.qq.com/s/Xo78wOeoR6RArdcuREdHUQ

The lazy image block:

<img class="rich_pages" data-ratio="1.23875" data-s="300,640" data-src="https://mmbiz.qpic.cn/mmbiz_jpg/qfC4kOufBopzvshib8KowN41pKLiahBe0EmAd8vrevPlIIhDLv16b7F3AbUJBCTLo9Tt7zlx6AyvNoEpNiaZcpJ0g/640?wx_fmt=jpeg" data-type="jpeg" data-w="800" style="">

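I suggest removing the dot before the image format in the regex, because the URL above carries the format only as mmbiz_jpg and wx_fmt=jpeg, with no file extension.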

obelisk/process-html.go

Lines 17 to 19 in e22fddd

	rxLazyImageSrc    = regexp.MustCompile(`(?i)^\s*\S+\.(jpg|jpeg|png|webp)\S*\s*$`)
	rxLazyImageSrcset = regexp.MustCompile(`(?i)\.(jpg|jpeg|png|webp)\s+\d`)
	rxImgExtensions   = regexp.MustCompile(`(?i)\.(jpg|jpeg|png|webp)`)
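
The effect of the suggestion can be checked with a small program. rxLazyImageSrc is copied from the lines quoted above; rxLazyImageSrcNoDot is a hypothetical variant with the dot removed, shown only to illustrate the proposal.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Pattern as quoted from process-html.go.
	rxLazyImageSrc = regexp.MustCompile(`(?i)^\s*\S+\.(jpg|jpeg|png|webp)\S*\s*$`)
	// Hypothetical variant: same pattern without the literal dot.
	rxLazyImageSrcNoDot = regexp.MustCompile(`(?i)^\s*\S+(jpg|jpeg|png|webp)\S*\s*$`)
)

// weixinSrc is the data-src value from the lazy image block in the issue.
const weixinSrc = "https://mmbiz.qpic.cn/mmbiz_jpg/qfC4kOufBopzvshib8KowN41pKLiahBe0EmAd8vrevPlIIhDLv16b7F3AbUJBCTLo9Tt7zlx6AyvNoEpNiaZcpJ0g/640?wx_fmt=jpeg"

func main() {
	fmt.Println("with dot:   ", rxLazyImageSrc.MatchString(weixinSrc))
	fmt.Println("without dot:", rxLazyImageSrcNoDot.MatchString(weixinSrc))
}
```

The looser pattern also accepts any URL that merely contains one of the format names, for example https://example.com/pngtree/page.html, so it trades some precision for coverage.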

Parser plugins

As an idea, I would suggest making it possible to extend the parsing functions for specific domains.

For example, on one domain you need to remove the blurred class from images when saving. On another domain, pictures may be stored in an attribute such as data-json="{image:\"...\"}"

Issue #37 is also a good example.
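
One possible shape for such an extension point is a registry of per-host hooks, sketched below. Everything here is hypothetical (the ImageHook type, hooksByHost, removeClass, applyHooks and the host blurred.example); it only illustrates the idea and does not reflect any existing obelisk API.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageHook edits the attributes of one <img> element in place.
type ImageHook func(attrs map[string]string)

// removeClass returns a hook that drops one class name from the class list.
func removeClass(name string) ImageHook {
	return func(attrs map[string]string) {
		var kept []string
		for _, c := range strings.Fields(attrs["class"]) {
			if c != name {
				kept = append(kept, c)
			}
		}
		attrs["class"] = strings.Join(kept, " ")
	}
}

// hooksByHost maps a hostname to the hooks that run for pages on that host.
var hooksByHost = map[string][]ImageHook{
	"blurred.example": {removeClass("blurred")},
}

// applyHooks runs every hook registered for the host, if any.
func applyHooks(host string, attrs map[string]string) {
	for _, hook := range hooksByHost[host] {
		hook(attrs)
	}
}

func main() {
	img := map[string]string{"class": "photo blurred", "src": "a.jpg"}
	applyHooks("blurred.example", img)
	fmt.Printf("%q\n", img["class"])
}
```

Keeping hooks as plain functions keyed by host means a site-specific fix stays isolated from the generic processing code.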

Page is blank after archive

Hi, I am trying to archive this article:

https://www.foodnavigator.com/Article/2022/09/09/study-linking-deaths-to-red-meat-appears-implausible-and-lacks-transparency

obelisk {above-url}

Regardless of whether I use --no-js, the page is still blank.

Has anyone figured out how to get around this?
