
tumblr-crawler's Introduction

tumblr-crawler

This is a Python script that lets you easily download all the photos and videos from your favorite Tumblr blogs.

A Chinese version of this tutorial is available here.

How to Discuss

  • Feel free to join our Slack, where you can ask questions and help answer others'.
  • You can also open a new issue on GitHub.

Prerequisite

For Programmers and Developers

You should know how to install Python and pip. Once they are set up, run pip install requests xmltodict

or

$ git clone https://github.com/dixudx/tumblr-crawler.git
$ cd tumblr-crawler
$ pip install -r requirements.txt

For non-programmers

Configuration and Downloading

There are two ways to specify the sites you want to download: by creating a sites.txt file, or by passing them as a command-line parameter.

Use sites.txt

Open the file sites.txt in a text editor and add the sites you want to download, separated by commas, spaces, tabs, or newlines, without the .tumblr.com suffix. For example, to download vogue.tumblr.com and gucci.tumblr.com, compose the file like this:

vogue,gucci
vogue2, gucci2

Save the file, then run python tumblr-photo-video-ripper.py in your terminal, or simply double-click the script to have Python run it automatically.
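For example, on a Unix-like shell the whole flow might look like this (the blog names are just placeholders):

$ echo "vogue,gucci" > sites.txt
$ python tumblr-photo-video-ripper.py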

Use a command-line parameter (only for OS experts)

If you are familiar with the command line on Windows or Unix systems, you may run the script with a parameter to specify the sites:

python tumblr-photo-video-ripper.py site1,site2

Separate the site names with commas, with no spaces; no .tumblr.com suffix is needed.

How the files get downloaded and stored

The photos and videos will be saved into folders named after each Tumblr blog, created in the current directory.

The script will not re-download photos or videos that already exist on disk, so running it several times does no harm; repeated runs simply fetch whatever is still missing.

Use Proxies (Optional)

You may want to use proxies when downloading. Refer to ./proxies_sample1.json and ./proxies_sample2.json, then save your own proxies to ./proxies.json in JSON format. You can validate the file at http://jsonlint.com/.

If ./proxies.json is an empty file, no proxies will be used during downloading.

If you are using Shadowsocks in global mode, your ./proxies.json might look like this:

{
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080"
}
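If you use a plain HTTP proxy instead, the same file format applies; the address and port below are placeholders that you would replace with your own proxy's:

{
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080"
}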

And now you can enjoy your downloads.

More customizations (for programmers only)

# Request timeout in seconds
TIMEOUT = 10

# Number of retries for a failed download
RETRY = 5

# Medium index number to start from
START = 0

# Number of photos/videos per page
MEDIA_NUM = 50

# Number of concurrent download threads
THREADS = 10

You can set TIMEOUT to another value, e.g. 50, depending on your network quality.

The script will retry a failed image or video download several times (five by default).

You can also download only photos or only videos by commenting out the corresponding line:

def download_media(self, site):
    # only download photos
    self.download_photos(site)
    #self.download_videos(site)

or

def download_media(self, site):
    # only download videos
    #self.download_photos(site)
    self.download_videos(site)

tumblr-crawler's People

Contributors

b0unt9, dixudx, gwpl, haoliang-quan, joshuakwan, karlicoss, kyeolee89, railwaycat, tainakadrums, timgates42, whitelok


tumblr-crawler's Issues

an error when downloading videos

Here is the error output:

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 193, in _download_media
    proxies=self.proxies)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 630, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 190, in resolve_redirects
    **adapter_kwargs
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 473, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))

I have no idea what's happening. I hope you can help.
Thank you!

Max retries exceeded with url: /api/read?type=photo&num=50&start=0

Does Tumblr limit the maximum number of requests?

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 193, in _download_media
    proxies=self.proxies)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 513, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 645, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 212, in resolve_redirects
    **adapter_kwargs
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 623, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 504, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='demo.tumblr.com', port=443): Max retries exceeded with url: /api/read?type=photo&num=50&start=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x6ffff8269d0>: Failed to establish a new connection: [Errno 116] Connection timed out',))

Adding timestamps to mark the start and end dates of the download

I suggest adding two timestamps: one for the start date of the posts to download, and one for the end date.
For example: on March 1, 2017 I downloaded all of a Tumblr blogger's posts (51 GiB) and kept only a few of the photos/videos (1 GiB). Now I want to download the posts between March 1, 2017 and April 1, 2017, but I don't want to re-download the posts from before March 1, 2017 (disk space is limited, after all).

About proxies

Hello, I am a Python beginner learning to write crawlers, and I am really confused by the proxy setup. I tried to follow the samples, but it still won't work.

It says:
requests.exceptions.InvalidSchema: Missing dependencies for SOCKS support.

Could you make the sample more specific? The proxy I am using is called greenvpn.
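For reference, this particular error usually means the PySocks package, which requests needs for socks5:// proxy URLs, is not installed; since requirements.txt already lists PySocks, installing it is the likely fix:

$ pip install pysocks   # or: pip install -r requirements.txt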

The program won't run?

After running it, I get:

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 7, in <module>
    from six.moves import queue as Queue
ModuleNotFoundError: No module named 'six'
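Since six is listed in requirements.txt, installing it (or the full requirements file) should resolve this:

$ pip install six   # or: pip install -r requirements.txt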

Video player may fail to load, causing the URL build to fail

In the function _handle_medium_url, post["video-player"][1] might not have the field #text when the video player in a post fails to load. For such a post, post["video-player"][1] is just OrderedDict([(u'@max-width', u'500')]).

Please add some code to handle this error:

try:
    video_player = post["video-player"][1]["#text"]
except KeyError:
    return None

ERROR when downloading tumblr

I'm using ShadowsocksR with socks5 through 127.0.0.1:10323.
When I run the script, the following appears:

You are using proxies.
{'http': 'socks5://127.0.0.1:10323', 'https': 'socks5://127.0.0.1:10323'}
Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 227, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 127, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 141, in scheduling
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 153, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 170, in _download_media
    data = xmltodict.parse(response.content)
  File "C:\Users\jdds1\AppData\Local\Programs\Python\Python36\lib\site-packages\xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28, column 41

I have no idea what happened, and I've confirmed that my proxy is stable because the target website can be visited from my browser.
Hoping for a reply soon.

KeyError: '#text' error

Everything works fine at first, but after 1000-odd posts a KeyError: '#text' appears.
Here is the output:
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/root/tumblr-crawler/tumblr-photo-video-ripper.py", line 38, in run
    self.download(medium_type, post, target_folder)
  File "/root/tumblr-crawler/tumblr-photo-video-ripper.py", line 43, in download
    medium_url = self._handle_medium_url(medium_type, post)
  File "/root/tumblr-crawler/tumblr-photo-video-ripper.py", line 53, in _handle_medium_url
    video_player = post["video-player"][1]["#text"]
KeyError: '#text'
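A plausible guard, echoing the fix suggested in the video-player issue above and assuming the failure is just a post without a usable #text field, is to skip such posts:

try:
    video_player = post["video-player"][1]["#text"]
except (KeyError, IndexError):
    # no usable video URL in this post; skip it
    return None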

"not well-formed" error occurs on some sites

        try:
            data = xmltodict.parse(response.content)
        except Exception as e:
            print(e)
            break

The error occurs at the line that parses data; e reports "not well-formed". Could this be because the site blocks the request?

This error only appears for certain usernames, e.g. tsukimitsuki.
Could Japanese-language blogs be the problem?

<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 291, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/usr/lib/python2.7/dist-packages/xmltodict.py", line 248, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 11, column 2289
Downloading tumblr_nidkcnIqFf1tbp9pfo8_1280.jpg from http://xxxxxxxx/tumblr_nidkcnIqFf1tbp9pfo8_1280.jpg.

Downloading tumblr_nidkcnIqFf1tbp9pfo9_1280.jpg from http://xxxxxxxxxx/tumblr_nidkcnIqFf1tbp9pfo9_1280.jpg.

Exception in thread Thread-2 (most likely raised during interpreter shutdown):Exception in thread Thread-1 (most likely raised during interpreter shutdown):Exception in thread Thread-7 (most likely raised during interpreter shutdown):

Traceback (most recent call last):
Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "tumblr-photo-video-ripper.py", line 65, in run

File "tumblr-photo-video-ripper.py", line 65, in run File "tumblr-photo-video-ripper.py", line 72, in download

File "tumblr-photo-video-ripper.py", line 136, in _download File "tumblr-photo-video-ripper.py", line 72, in download

File "tumblr-photo-video-ripper.py", line 136, in _download<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'

Traceback (most recent call last):
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner

File "tumblr-photo-video-ripper.py", line 65, in run
File "tumblr-photo-video-ripper.py", line 72, in download
File "tumblr-photo-video-ripper.py", line 136, in _download
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'
Exception in thread Thread-6 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "tumblr-photo-video-ripper.py", line 65, in run
File "tumblr-photo-video-ripper.py", line 72, in download
File "tumblr-photo-video-ripper.py", line 136, in _download
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'

No proxy is set; is there some problem with the XML decoding?

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

➜  tumblr-crawler git:(master) ✗ pip install -r requirements.txt
Requirement already satisfied: requests>=2.10.0 in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 1))
Requirement already satisfied: xmltodict in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 2))
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from -r requirements.txt (line 3))
Requirement already satisfied: PySocks>=1.5.6 in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 4))
Collecting defusedexpat (from -r requirements.txt (line 5))
  Downloading http://mirrors.aliyun.com/pypi/packages/2f/cc/56e82058fa3bfbe75b8601f91e0ed2b586fb6aef3105fc0ff734371971e3/defusedexpat-0.4.zip (275kB)
    100% |████████████████████████████████| 276kB 62kB/s
Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Building wheels for collected packages: defusedexpat
  Running setup.py bdist_wheel for defusedexpat ... error
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-jgopl29n/defusedexpat/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpl4kxy_u4pip-wheel- --python-tag cp35:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.5
  copying defusedexpat.py -> build/lib.linux-x86_64-3.5
  running build_ext
  building 'pyexpat' extension
  creating build/temp.linux-x86_64-3.5
  creating build/temp.linux-x86_64-3.5/Modules35
  creating build/temp.linux-x86_64-3.5/expat
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DHAVE_EXPAT_CONFIG_H=1 -DUSE_PYEXPAT_CAPI -I/tmp/pip-build-jgopl29n/defusedexpat/expat -I/usr/include/python3.5m -c Modules35/pyexpat.c -o build/temp.linux-x86_64-3.5/Modules35/pyexpat.o
  x86_64-linux-gnu-gcc: error: Modules35/pyexpat.c: No such file or directory
  x86_64-linux-gnu-gcc: fatal error: no input files
  compilation terminated.
  error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

  ----------------------------------------
  Failed building wheel for defusedexpat
  Running setup.py clean for defusedexpat
Failed to build defusedexpat
Installing collected packages: defusedexpat
  Running setup.py install for defusedexpat ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-jgopl29n/defusedexpat/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-hhaa9vah-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.5
    copying defusedexpat.py -> build/lib.linux-x86_64-3.5
    running build_ext
    building 'pyexpat' extension
    creating build/temp.linux-x86_64-3.5
    creating build/temp.linux-x86_64-3.5/Modules35
    creating build/temp.linux-x86_64-3.5/expat
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DHAVE_EXPAT_CONFIG_H=1 -DUSE_PYEXPAT_CAPI -I/tmp/pip-build-jgopl29n/defusedexpat/expat -I/usr/include/python3.5m -c Modules35/pyexpat.c -o build/temp.linux-x86_64-3.5/Modules35/pyexpat.o
    x86_64-linux-gnu-gcc: error: Modules35/pyexpat.c: No such file or directory
    x86_64-linux-gnu-gcc: fatal error: no input files
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-jgopl29n/defusedexpat/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-hhaa9vah-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-jgopl29n/defusedexpat/
➜  tumblr-crawler git:(master) ✗

Downloading videos sometimes does not work

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 166, in download_media
    self.download_videos(site)
  File "tumblr-photo-video-ripper.py", line 169, in download_videos
    self._download_media(site, "video", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/Library/Python/2.7/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 87: unexpected end of data

Help Me!

I used the script, but it doesn't work; it fails with the alert below:

(screenshot: 2017-09-07 18-54-15)

Failed when the site does not exist

[WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond.',))

I wrote this into proxies.json:

{
    "http": "127.0.0.1:14225",
    "SOCKS": "127.0.0.1:14226"
}
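For reference, requests expects the keys of this mapping to be URL schemes (http/https), so a SOCKS proxy would normally be written like the README's Shadowsocks sample; the ports below are copied from the paste above and may need adjusting:

{
    "http": "socks5://127.0.0.1:14226",
    "https": "socks5://127.0.0.1:14226"
}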

Access Denied / Failed to retrieve video from xxx

file/t:EZ5TduU8a0fT8bzyRGLg2w/139894566409/tumblr_o1er2qBl811ulzyf6.

Access Denied when retrieve https://xxxx.tumblr.com/video_file/t:EZ5TduU8a0fT8bzyRGLg2w/140259088414/tumblr_nx8g10QplS1ugg6cn.

Failed to retrieve video from https://xxxx.tumblr.com/video_file/t:EZ5TduU8a0fT8bzyRGLg2w/140259088414/tumblr_nx8g10QplS1ugg6cn.

Access Denied when retrieve https://xxxx.tumblr.com/video_file/t:EZ5TduU8a0fT8bzyRGLg2w/142448116304/tumblr_o4j2fig0iM1tqr9po.

What do these two errors mean?

Downloading likes from a tumblr

Hello,

First of all, thank you very much; it works great. The issue is that I failed to download likes from a Tumblr blog. I tried writing the site in tumblr.com/liked/by/[tumblr name] format in the sites.txt file, but got no results. Is it possible to download likes from a Tumblr blog using this code? If so, can you please help?

Thank you,

spacekittylasereyes

[BUG] Does not download videos.

I just attempted to copy a Tumblr blog, and none of the videos were downloaded. The blog in question had dozens of videos and only a few non-video posts.

Login feature?

Some blogs may not allow logged-out users to view them (there is an option named "allow logged-out users to see this blog" in the privacy settings), and the script will say 'Site *** does not exist' when it tries to access such a site. If you visit these sites in a browser, xxx.tumblr.com redirects to your dashboard and opens the blog as a sidebar.

The script errors out and exits almost immediately, and re-running it does the same

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 291, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 166, in download_media
    self.download_videos(site)
  File "tumblr-photo-video-ripper.py", line 169, in download_videos
    self._download_media(site, "video", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/usr/lib/python2.7/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 480, column 1496

Error when running on an Ubuntu VPS

Error message:

File "/usr/lib/python2.7/dist-packages/cffi/vengine_cpy.py", line 149, in load_library
    raise ffiplatform.VerificationError(error)
cffi.ffiplatform.VerificationError: importing '/usr/lib/python2.7/dist-packages/cryptography/_Cryptography_cffi_813c10e0x7adb75f8.x86_64-linux-gnu.so': /usr/lib/python2.7/dist-packages/cryptography/_Cryptography_cffi_813c10e0x7adb75f8.x86_64-linux-gnu.so: symbol SSLv2_client_method, version OPENSSL_1.0.0 not defined in file libssl.so.1.0.0 with link time reference

How can I solve this? Thanks.

xml parsing error

I ran into the same problem as @Yodamt in issue #31; it looks like this:

File "tumblr-photo-video-ripper.py", line 199, in _download_media
data = xmltodict.parse(response.content)
File "C:\Python27\lib\site-packages\xmltodict.py", line 330, in parse
parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 21, column 2311

I inspected the data and found that, at the corresponding position in response.content, a quotation mark is followed by a \b character; I am not sure whether Python dropped that quotation mark because of it, which would make the XML fail to parse.

Following the suggestion in issue #31, I changed MEDIA_NUM to 100, but it did not help.

Error when the proxies.json file is empty, Python 2.11

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 206, in <module>
    illegal_json()
  File "tumblr-photo-video-ripper.py", line 190, in illegal_json
    print(u"文件proxies.json格式非法.\n"
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

I don't know why, but I'm using a VPN, and simply commenting out this whole block made it work.
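For reference, this UnicodeEncodeError typically means Python 2 is printing non-ASCII text to a console whose encoding defaults to ASCII; forcing UTF-8 output is one possible workaround (a sketch, assuming a Unix-like shell):

$ PYTHONIOENCODING=utf-8 python tumblr-photo-video-ripper.py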

ERROR

It started throwing errors for no obvious reason, even though it had worked before. It succeeded twice in total; the second time it only downloaded the first blog in sites.txt, and after that it would not download anything anymore.
Error output:

Traceback (most recent call last):
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\contrib\pyopenssl.py", line 438, in wrap_socket
cnx.do_handshake()
File "C:\Users\Administrator\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1638, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1378, in _raise_ssl_error
_raise_current_error()
File "C:\Users\Administrator\Anaconda3\lib\site-packages\OpenSSL_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 810, in prepare_proxy
conn.connect()
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connection.py", line 326, in connect
ssl_context=context)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\util\ssl
.py", line 325, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\contrib\pyopenssl.py", line 445, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\adapters.py", line 438, in send
timeout=timeout
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 630, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tumblr-photo-video-ripper.py", line 288, in
CrawlerScheduler(sites, proxies=proxies)
File "tumblr-photo-video-ripper.py", line 149, in init
self.scheduling()
File "tumblr-photo-video-ripper.py", line 162, in scheduling
self.download_media(site)
File "tumblr-photo-video-ripper.py", line 165, in download_media
self.download_photos(site)
File "tumblr-photo-video-ripper.py", line 176, in download_photos
self._download_media(site, "photo", START)
File "tumblr-photo-video-ripper.py", line 193, in _download_media
proxies=self.proxies)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 518, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 661, in send
history = [resp for resp in gen] if allow_redirects else []
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 661, in
history = [resp for resp in gen] if allow_redirects else []
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 214, in resolve_redirects
**adapter_kwargs
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 639, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\adapters.py", line 512, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",)


Request: download the photo tags

As the title says: after downloading, I found that a huge number of the photos have 'garbled' strings as file names, which makes later management and retrieval very difficult. Please add the ability to use the post name as the file name, for easier management and retrieval. Thanks.

The Slack link doesn't work

It opens to a login page, with no way to sign up. It seems that perhaps only the project owner can log in.

Script error; the Shadowsocks global proxy is definitely working fine

D:\python\tumblr-crawler-master>python tumblr-photo-video-ripper.py
You are using proxies.
{'http': 'socks5://127.0.0.1:1080', 'https': 'socks5://127.0.0.1:1080'}
Traceback (most recent call last):
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 386, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 382, in _make_request
    httplib_response = conn.getresponse()
  File "D:\python\lib\http\client.py", line 1197, in getresponse
    response.begin()
  File "D:\python\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "D:\python\lib\http\client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\lib\site-packages\requests\adapters.py", line 423, in send
    timeout=timeout
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\python\lib\site-packages\requests\packages\urllib3\util\retry.py", line 347, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "D:\python\lib\site-packages\requests\packages\urllib3\packages\six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 386, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 382, in _make_request
    httplib_response = conn.getresponse()
  File "D:\python\lib\http\client.py", line 1197, in getresponse
    response.begin()
  File "D:\python\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "D:\python\lib\http\client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 193, in _download_media
    proxies=self.proxies)
  File "D:\python\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "D:\python\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\python\lib\site-packages\requests\sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\python\lib\site-packages\requests\sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "D:\python\lib\site-packages\requests\adapters.py", line 473, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

D:\python\tumblr-crawler-master>

Cannot decode response data

Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0

It errors out when I run it

D:\Tumblr\tumblr>python tumblr-photo-video-ripper.py
You are using proxies.
{'http': 'socks5://127.0.0.1:1080', 'https': 'socks5://127.0.0.1:1080'}
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
chunked=chunked)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
response.begin()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
version, status, reason = self._read_status()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
timeout=timeout
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 357, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
raise value.with_traceback(tb)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
chunked=chunked)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
response.begin()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
version, status, reason = self._read_status()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tumblr-photo-video-ripper.py", line 291, in
CrawlerScheduler(sites, proxies=proxies)
File "tumblr-photo-video-ripper.py", line 149, in init
self.scheduling()
File "tumblr-photo-video-ripper.py", line 162, in scheduling
self.download_media(site)
File "tumblr-photo-video-ripper.py", line 165, in download_media
self.download_photos(site)
File "tumblr-photo-video-ripper.py", line 176, in download_photos
self._download_media(site, "photo", START)
File "tumblr-photo-video-ripper.py", line 193, in _download_media
proxies=self.proxies)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 490, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

D:\Tumblr\tumblr>

All the photos downloaded, but not the videos

Downloading tumblr_oq86meTiUR1w2sarao5_250.jpg from https://68.media.tumblr.com/3fa3f7c2ad890e3a0a20e3aa95fb77da/tumblr_oq86meTiUR1w2sarao5_250.jpg.

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/usr/local/lib/python3.5/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 19, column 4010
➜ tumblr-crawler git:(master) ✗
