doc_downloader's Introduction

Multi-site document downloader

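This tool downloads previewable documents from Docin (豆丁), Doc88 (道客巴巴), Taodocs (淘豆网), Book118 (原创力), Sina iAsk (新浪爱问), and Jinchutou (金锄头). Any document that can be previewed can be downloaded. Pages are saved as images, which are then combined into a PDF with the reportlab library.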
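The image-to-PDF step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `page_sort_key` and `images_to_pdf` are hypothetical, and it assumes each page image should become one PDF page with the same pixel dimensions.

```python
import re
from pathlib import Path


def page_sort_key(path):
    """Sort key that puts 'page_10.png' after 'page_2.png' (hypothetical naming)."""
    numbers = re.findall(r"\d+", Path(path).stem)
    return (int(numbers[-1]) if numbers else 0, str(path))


def images_to_pdf(image_paths, pdf_path):
    """Write one PDF page per image, sized to the image's pixel dimensions."""
    # Imported lazily so page_sort_key can be used without reportlab installed.
    from reportlab.lib.utils import ImageReader
    from reportlab.pdfgen import canvas

    pdf = canvas.Canvas(str(pdf_path))
    for path in sorted(image_paths, key=page_sort_key):
        image = ImageReader(str(path))
        width, height = image.getSize()
        pdf.setPageSize((width, height))  # page matches the image size
        pdf.drawImage(image, 0, 0, width=width, height=height)
        pdf.showPage()  # finish this page and start the next
    pdf.save()
```

Sorting by the trailing number rather than by plain string order keeps pages in reading order when the files are named with unpadded page numbers.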

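Sina iAsk serves its pages as SVG files, so converting them to images requires a third-party library. On Windows you can use svg2png; on Linux you can use rsvg. rsvg can also be installed on Windows: install R, then install rsvg from CRAN to handle the SVG conversion.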
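The platform split described above could be expressed like this. It is only a sketch: `pick_svg_converter` is a hypothetical helper, and treating every non-Windows system as an rsvg system is an assumption, not something the project states.

```python
import platform


def pick_svg_converter(system=None):
    """Return the name of the SVG converter suggested for the given OS.

    `system` defaults to platform.system(), which returns values such as
    "Windows", "Linux" or "Darwin".
    """
    system = system or platform.system()
    # Assumption: svg2png on Windows, rsvg everywhere else.
    return "svg2png" if system == "Windows" else "rsvg"
```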

This project also provides a simple online download page: [click to enter]

Installing rsvg

Binary packages for OS-X or Windows can be installed directly from CRAN:

install.packages("rsvg")

Installation from source on Linux or OSX requires librsvg2. On Debian or Ubuntu install librsvg2-dev:

sudo apt-get install -y librsvg2-dev

On Fedora, CentOS or RHEL we need librsvg2-devel:

sudo yum install librsvg2-devel

On OS-X use rsvg from Homebrew:

brew install librsvg

Installing svg2png (Windows only)

1. Install Node.js
2. In the command prompt, enter: npm install -g svg2png
3. In PowerShell, enter: Set-ExecutionPolicy -ExecutionPolicy 

Usage

In a terminal, run:

pip install -r requirements.txt
python docDownloader.py

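If you get an error, first check whether your chromedriver version is compatible with your Chrome version. If it is not, replace the chromedriver.exe in the project with a compatible build. Attached: chromedriver download link.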
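A quick way to reason about this check is to compare major version numbers. The sketch below assumes that "compatible" means the Chrome and chromedriver major versions match; the helper names are hypothetical and the version strings are examples only.

```python
def major_version(version):
    """Return the leading component of a dotted version such as '92.0.4515.159'."""
    return int(version.strip().split(".")[0])


def is_compatible(chrome_version, chromedriver_version):
    """Assumption: the two are compatible when their major versions are equal."""
    return major_version(chrome_version) == major_version(chromedriver_version)
```

For example, `is_compatible("92.0.4515.159", "92.0.4515.107")` is true, while pairing Chrome 93 with a chromedriver 92 build is not.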

doc_downloader's People

Contributors

dependabot[bot], rty813

doc_downloader's Issues

Getting an error

Enter a URL (type exit to quit): https://www.doc88.com/p-28761857292280.html

DevTools listening on ws://127.0.0.1:54380/devtools/browser/c41bbe33-bb5c-4220-ad2a-2049ab2a857c
道客巴巴 (Doc88): 《TB10097-2019 铁路房屋建筑设计标准 - 道客巴巴》
Traceback (most recent call last):
  File "docDownloader.py", line 44, in <module>
  File "fire\core.py", line 141, in Fire
  File "fire\core.py", line 466, in _Fire
  File "fire\core.py", line 681, in _CallAndUpdateTrace
  File "docDownloader.py", line 16, in main
  File "doc88.py", line 43, in download
  File "selenium\webdriver\remote\webelement.py", line 80, in click
  File "selenium\webdriver\remote\webelement.py", line 633, in _execute
  File "selenium\webdriver\remote\webdriver.py", line 321, in execute
  File "selenium\webdriver\remote\errorhandler.py", line 242, in check_response
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <div class="surplus-btn" id="continueButton">...</div> is not clickable at point (439, 9). Other element would receive the click: <a href="javascript:;" title="缩小" id="zoomOutButton">...</a>
  (Session info: chrome=92.0.4515.159)

[28632] Failed to execute script docDownloader
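One common workaround for `ElementClickInterceptedException` is to scroll the element into view and trigger the click from page JavaScript, so an overlapping toolbar cannot receive it. This is a sketch under that assumption, not a confirmed fix for this issue; `click_via_javascript` is a hypothetical helper, and `driver` is any Selenium WebDriver.

```python
def click_via_javascript(driver, element):
    """Scroll the element to the middle of the viewport, then click it via JS."""
    # execute_script passes extra arguments to the page as arguments[0], ...
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
    driver.execute_script("arguments[0].click();", element)
```

In the traceback above, the intercepted element is the one with id `continueButton`, so that would be the element to pass in.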

Jinchutou (金锄头) no longer works

    # Get the page count
    num_of_pages = driver.find_element_by_id('readshop').find_element_by_class_name(
        'mainpart').find_element_by_class_name('shop3').find_elements_by_class_name('text')[-1].get_attribute('innerHTML')

In this code, the shop3 element no longer exists on the page.

Cannot download Docin (豆丁) documents

For the URL https://www.docin.com/p-2300126438.html, the following error appears:

PS D:\docDownloader> .\docDownloader.exe
Enter a URL (type exit to quit): https://www.docin.com/p-2300126438.html
Traceback (most recent call last):
  File "docDownloader.py", line 44, in <module>
  File "fire\core.py", line 141, in Fire
  File "fire\core.py", line 466, in _Fire
  File "fire\core.py", line 681, in _CallAndUpdateTrace
  File "docDownloader.py", line 29, in main
  File "douding.py", line 13, in download
ValueError: substring not found
[27544] Failed to execute script docDownloader
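`ValueError: substring not found` is the error Python's `str.index` raises when the substring is absent. A minimal sketch of a more defensive lookup follows, assuming (the source does not confirm this) that `douding.py` extracts a value between two markers in the page or URL. `extract_between` and its arguments are hypothetical.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing.

    str.find returns -1 instead of raising, unlike str.index.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]
```

Returning `None` lets the caller print a clear "page layout changed" message instead of crashing with a bare traceback.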

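Downloading from doc88 hangs or errors out when the page count is large

1. Hangs
After testing several files, I found that every document over 500 pages hung, usually somewhere between page 300 and 500. The progress bar would show 51%|█████ | 470/746 [09:26<09:05, 1.51s/it]
and never update again.
2. Error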
51%|█████ | 380/746 [09:26<09:05, 1.49s/it]
Traceback (most recent call last):
  File "G:\Programwork\Python\doc_downloader\full_code\doc_downloader-master\docDownloader.py", line 44, in <module>
    fire.Fire(main)
  File "C:\Users\billeyang\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\billeyang\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\billeyang\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "G:\Programwork\Python\doc_downloader\full_code\doc_downloader-master\docDownloader.py", line 16, in main
    doc88.download(url)
  File "G:\Programwork\Python\doc_downloader\full_code\doc_downloader-master\doc88.py", line 78, in download
    img_data = driver.execute_script(js_cmd)
  File "C:\Users\billeyang\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 634, in execute_script
    return self.execute(command, {
  File "C:\Users\billeyang\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\billeyang\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: headless chrome=99.0.4844.51)

Process finished with exit code 1

The error says "target window already closed", but I did not interact with the Chrome window at all.

Chrome opens, but the PDF conversion fails

Here is the output:
Traceback (most recent call last):
  File "docDownloader.py", line 44, in <module>
  File "fire\core.py", line 141, in Fire
  File "fire\core.py", line 466, in _Fire
  File "fire\core.py", line 681, in _CallAndUpdateTrace
  File "docDownloader.py", line 16, in main
  File "doc88.py", line 43, in download
  File "selenium\webdriver\remote\webelement.py", line 80, in click
  File "selenium\webdriver\remote\webelement.py", line 633, in _execute
  File "selenium\webdriver\remote\webdriver.py", line 321, in execute
  File "selenium\webdriver\remote\errorhandler.py", line 242, in check_response
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element ... is not clickable at point (479, 8). Other element would receive the click: ...
  (Session info: chrome=93.0.4577.82)

[4656] Failed to execute script docDownloader

Release build fails to download. Message: unknown error: cannot find Chrome binary

D:\docDownloader>docDownloader.exe
Enter a URL (type exit to quit): https://max.book118.com/html/2019/1210/6155102204002131.shtm
Traceback (most recent call last):
  File "docDownloader.py", line 44, in <module>
  File "site-packages\fire\core.py", line 138, in Fire
  File "site-packages\fire\core.py", line 463, in _Fire
  File "site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
  File "docDownloader.py", line 20, in main
  File "book118.py", line 22, in download
  File "site-packages\selenium\webdriver\chrome\webdriver.py", line 76, in __init__
  File "site-packages\selenium\webdriver\remote\webdriver.py", line 157, in __init__
  File "site-packages\selenium\webdriver\remote\webdriver.py", line 252, in start_session
  File "site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
  File "site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot find Chrome binary

[13500] Failed to execute script docDownloader

===================================================
Python, Node.js, and the R environment are already installed.

Error on macOS when the directory path contains a space

Traceback (most recent call last):
  File "/Volumes/Downloads/doc_downloader-master 3/doc88.py", line 66, in download
    element = driver.find_element_by_id(canvas_id)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 360, in find_element_by_id
    return self.find_element(by=By.ID, value=id_)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 976, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="outer_page_1"]"}
  (Session info: headless chrome=96.0.4664.55)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/Downloads/doc_downloader-master 3/docDownloader.py", line 44, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Volumes/Downloads/doc_downloader-master 3/docDownloader.py", line 16, in main
    doc88.download(url)
  File "/Volumes/Downloads/doc_downloader-master 3/doc88.py", line 69, in download
    element = driver.find_element_by_id(canvas_id)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 360, in find_element_by_id
    return self.find_element(by=By.ID, value=id_)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 976, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="outer_page_1"]"}
  (Session info: headless chrome=96.0.4664.55)

Parameter error at runtime. Am I missing a component?

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
E:\>cd E:\chrome下载\docDownloader1.2.3\docDownloader
E:\chrome下载\docDownloader1.2.3\docDownloader>docdownloader.exe
Traceback (most recent call last):
  File "C:\Program Files\Python38\Lib\site-packages\PyInstaller\hooks\rthooks\pyi_rth_multiprocessing.py", line 17, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "multiprocessing\__init__.py", line 16, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "multiprocessing\context.py", line 6, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "multiprocessing\reduction.py", line 16, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "socket.py", line 49, in <module>
ImportError: DLL load failed while importing _socket: The parameter is incorrect.
[4048] Failed to execute script pyi_rth_multiprocessing

E:\chrome下载\docDownloader1.2.3\docDownloader>

Error

At line:1 char:21
+ Set-ExecutionPolicy -ExecutionPolicy
+                     ~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Set-ExecutionPolicy],ParameterBindingException
    + FullyQualifiedErrorId : MissingArgument,Microsoft.PowerShell.Commands.SetExecutionPolicyCommand
