go-shiori / obelisk
Go package and CLI tool for saving web page as single HTML file
License: MIT License
It should be sufficient to wrap all resources in a directory. However, archiving as a single file requires the development of an approach.
It is proposed that the flag be named -directory, with the short flag -d.
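A minimal sketch of how the proposed flag pair could be parsed, using only Go's standard flag package. This is an assumption for illustration: the function name parseOutputDir and the help texts are hypothetical, and the real CLI may wire its flags through a different library.

```go
package main

import (
	"flag"
	"fmt"
)

// parseOutputDir is a hypothetical helper: it parses the proposed
// -directory flag and its short form -d into a single value.
func parseOutputDir(args []string) (string, error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)

	var dir string
	// Both names write to the same variable, so either spelling works.
	fs.StringVar(&dir, "directory", "", "save the page and its resources into this directory")
	fs.StringVar(&dir, "d", "", "shorthand for -directory")

	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return dir, nil
}

func main() {
	dir, err := parseOutputDir([]string{"-d", "archive"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("output directory:", dir)
}
```

An empty value would mean the current single-file behaviour is kept.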
Just like ArchiveBox. I think ArchiveBox is very nice, but there are two issues:
I also found a great Go library, rod. How about adding a mode that uses headless (or headful, depending on the case) Chromium?
A site I'm trying to back up has a prompt that prevents you from seeing the actual page until you accept the terms. I saved a Netscape cookie file with the Chrome extension 'Export cookies.txt' and tried both obelisk -c cookies.txt <url> and obelisk <url> -c cookies.txt, and neither worked. Even with --verbose, the output doesn't mention using any cookies.
We should try this properly though; unsafe is, well... unsafe. If we see any kind of problem with it we can just remove s2b and properly cast to []byte, but I wonder why they chose to use a memory-efficient approach here.
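For comparison, here is a sketch of the two approaches side by side. The bodies are assumptions, not the code under discussion: s2bUnsafe shows one common way to write a zero-copy conversion (Go 1.20+), and s2bSafe is the plain cast. The usual motivation for the unsafe variant is that it avoids allocating and copying the string's bytes, at the cost that the returned slice must never be modified.

```go
package main

import (
	"fmt"
	"unsafe"
)

// s2bUnsafe returns a byte slice that shares memory with s (no copy).
// The result must be treated as read-only: writing to it is undefined
// behaviour because Go strings are immutable.
// Requires Go 1.20 or newer for unsafe.StringData.
func s2bUnsafe(s string) []byte {
	if s == "" {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// s2bSafe copies the string into a fresh slice. It costs one allocation
// and one copy, but the caller may freely modify the result.
func s2bSafe(s string) []byte {
	return []byte(s)
}

func main() {
	s := "hello"

	shared := s2bUnsafe(s) // read-only view over the string's bytes
	copied := s2bSafe(s)   // independent copy

	copied[0] = 'J' // safe: only the copy changes

	fmt.Println(string(shared), string(copied), s)
}
```

If the unsafe variant ever causes trouble, swapping every call to the safe cast is a drop-in change, since both functions have the same signature.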
Originally posted by @fmartingr in #96 (comment)
for record #96 (comment)
Continuation of the bug from issue #2.
Most often, links to stubs are written in src. I think these lines are redundant:
Lines 366 to 368 in 23c015a
However, I would not iterate over all the attributes; I would read only the values from data-src and data-original. For example, the Yandex search engine supports this:
Images are downloaded using links from the src attribute of the img tag, as well as the data-src and data-original attributes (in this case, the presence of a link to the image in the src attribute is not necessary).
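The suggestion above can be sketched as a small lookup that checks only the two lazy-loading attributes and falls back to src. This is an illustration under assumptions: the function name lazyImageURL and the plain attribute map are hypothetical and do not reflect how the project represents DOM nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// lazyImageURL is a hypothetical helper. It returns the URL an <img>
// should be downloaded from, preferring data-src and data-original
// (where lazy-loading pages usually keep the real image) over src
// (which often holds only a stub or placeholder).
func lazyImageURL(attrs map[string]string) string {
	for _, name := range []string{"data-src", "data-original"} {
		if v := strings.TrimSpace(attrs[name]); v != "" {
			return v
		}
	}
	// No lazy-loading attribute present: use src as-is (may be empty).
	return strings.TrimSpace(attrs["src"])
}

func main() {
	img := map[string]string{
		"src":      "placeholder.gif",
		"data-src": "https://example.com/real.jpg",
	}
	fmt.Println(lazyImageURL(img))
}
```

Restricting the lookup to two known attributes keeps the behaviour predictable, compared with scanning every attribute for something that looks like an image URL.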
Hi @RadhiFadlillah,
When I save a webpage using Obelisk, the process is interrupted if there are a few errors (e.g. some of the images are missing). I think it should continue to download the remaining resources in this case.
I've handled this situation by adding a skip-resource-url-error option that skips the error when a download fails; by default the process is still interrupted (the original behavior).
If you think this is worth pursuing, I will make a PR.
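A rough sketch of the behaviour described, assuming a simple sequential download loop. All names here (downloadAll, fetchFunc, fakeFetch, the skipErrors parameter) are hypothetical stand-ins; the actual change may be structured quite differently.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchFunc abstracts the network call so the sketch runs offline.
type fetchFunc func(url string) ([]byte, error)

// downloadAll fetches every URL in order. With skipErrors=false it stops
// at the first failure (the original behaviour). With skipErrors=true it
// records the failure and continues with the remaining resources.
func downloadAll(urls []string, fetch fetchFunc, skipErrors bool) (map[string][]byte, []error, error) {
	results := make(map[string][]byte, len(urls))
	var skipped []error

	for _, u := range urls {
		data, err := fetch(u)
		if err != nil {
			wrapped := fmt.Errorf("download %s: %w", u, err)
			if !skipErrors {
				return results, skipped, wrapped
			}
			skipped = append(skipped, wrapped)
			continue
		}
		results[u] = data
	}
	return results, skipped, nil
}

// fakeFetch simulates one missing resource.
func fakeFetch(url string) ([]byte, error) {
	if url == "missing.png" {
		return nil, errors.New("404 not found")
	}
	return []byte("data:" + url), nil
}

func main() {
	urls := []string{"a.css", "missing.png", "b.js"}

	got, skipped, err := downloadAll(urls, fakeFetch, true)
	fmt.Println(len(got), "downloaded,", len(skipped), "skipped, fatal error:", err)
}
```

Returning the skipped errors separately lets the caller still report them, for example in verbose output, without aborting the archive.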
Add unit tests for the project and a workflow to run them.
When I try to download https://google.com/about, I get a broken page (missing links, broken formatting, etc.). I'm not sure what's missing, but as a comparison point: when I download the same page with monolith, I get a 56MB file vs 53MB with obelisk.
Source: https://mp.weixin.qq.com/s/Xo78wOeoR6RArdcuREdHUQ
The lazy image block:
<img class="rich_pages" data-ratio="1.23875" data-s="300,640" data-src="https://mmbiz.qpic.cn/mmbiz_jpg/qfC4kOufBopzvshib8KowN41pKLiahBe0EmAd8vrevPlIIhDLv16b7F3AbUJBCTLo9Tt7zlx6AyvNoEpNiaZcpJ0g/640?wx_fmt=jpeg" data-type="jpeg" data-w="800" style="">
I suggest removing the dot before image format in the regex.
Lines 17 to 19 in e22fddd
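To illustrate why the dot matters, the sketch below compares two patterns against a shortened form of the data-src URL above. Both patterns are assumptions written for this example; they are not the project's actual expressions from the referenced lines.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns, for illustration only.
var (
	// Requires a literal dot before the format, e.g. "photo.jpeg".
	withDot = regexp.MustCompile(`(?i)\.(jpe?g|png|gif|webp|svg)`)
	// Accepts the format name anywhere, e.g. "wx_fmt=jpeg".
	withoutDot = regexp.MustCompile(`(?i)(jpe?g|png|gif|webp|svg)`)
)

// looksLikeImage reports whether the URL matches the given pattern.
func looksLikeImage(re *regexp.Regexp, url string) bool {
	return re.MatchString(url)
}

func main() {
	// Shortened form of the lazy image URL: the format appears only in
	// the path segment "mmbiz_jpg" and the query "wx_fmt=jpeg".
	url := "https://mmbiz.qpic.cn/mmbiz_jpg/abc/640?wx_fmt=jpeg"

	fmt.Println("with dot:   ", looksLikeImage(withDot, url))
	fmt.Println("without dot:", looksLikeImage(withoutDot, url))
}
```

The trade-off is that a pattern without the dot is looser and can match non-image URLs that merely contain a format name, so it may be worth applying it only to attributes such as data-src.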
As an idea, I would suggest the possibility of extending parsing functions for specific domains.
For example, on one domain you need to remove the blurred class from images when saving. On another domain, pictures may be stored in data-json="{image:\"...\"}".
Another good example is issue #37.
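One possible shape for such an extension point is a registry of per-domain hooks, sketched below. Everything here is hypothetical (the attrHook type, the hooks map, the example.com entry, and the attribute-map representation); it only illustrates the idea of dispatching on the page's host.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// attrHook is a hypothetical per-domain callback that may rewrite the
// attributes of an <img> element before the page is saved.
type attrHook func(attrs map[string]string)

// hooks maps a host name to its callback. example.com is a placeholder.
var hooks = map[string]attrHook{
	"example.com": removeBlurredClass,
}

// removeBlurredClass drops the "blurred" class and keeps all others.
func removeBlurredClass(attrs map[string]string) {
	var kept []string
	for _, c := range strings.Fields(attrs["class"]) {
		if c != "blurred" {
			kept = append(kept, c)
		}
	}
	attrs["class"] = strings.Join(kept, " ")
}

// applyHooks runs the hook registered for the page's host, if any.
func applyHooks(pageURL string, attrs map[string]string) error {
	u, err := url.Parse(pageURL)
	if err != nil {
		return err
	}
	if hook, ok := hooks[u.Hostname()]; ok {
		hook(attrs)
	}
	return nil
}

func main() {
	img := map[string]string{"class": "photo blurred", "src": "a.jpg"}
	if err := applyHooks("https://example.com/post/1", img); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(img["class"])
}
```

A hook for the data-json case would follow the same pattern, reading the attribute and writing the extracted URL into src.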
Hi, I am trying to archive this article:
obelisk {above-url}
Whether or not I use --no-js, the result is still blank.
Has anyone figured out how to get around this?