
tumblr-crawler's Introduction

tumblr-crawler

This is a Python script that lets you easily download all the photos and videos from your favorite Tumblr blogs.

A Chinese version of this tutorial is available here.

How to Discuss

  • Feel free to join our Slack, where you can ask questions and help answer others'.
  • You can also open a new issue on GitHub.

Prerequisite

For Programmers and Developers

You should know how to install Python and pip. Once they are set up, run pip install requests xmltodict

or

$ git clone https://github.com/dixudx/tumblr-crawler.git
$ cd tumblr-crawler
$ pip install -r requirements.txt

For non-programmers

Configuration and Downloading

There are two ways to specify the sites you want to download: by creating a sites.txt file, or by passing them as a command-line parameter.

Use sites.txt

Open the file sites.txt in a text editor and add the sites you want to download, separated by commas, spaces, tabs, or newlines, without the .tumblr.com suffix. For example, to download vogue.tumblr.com and gucci.tumblr.com, compose the file like this:

vogue,gucci
vogue2, gucci2

Save the file, then run python tumblr-photo-video-ripper.py in your terminal, or simply double-click the script to have Python run it automatically.
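For example, on a Unix-like shell the whole flow might look like this (the blog names are just placeholders):

$ echo "vogue,gucci" > sites.txt
$ python tumblr-photo-video-ripper.py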

Use a command-line parameter (only for OS experts)

If you are familiar with the command line on Windows or Unix systems, you may run the script with a parameter to specify the sites:

python tumblr-photo-video-ripper.py site1,site2

Separate the site names with commas, with no spaces; no .tumblr.com suffix is needed.

How the files get downloaded and stored

The photos and videos will be saved into folders named after each Tumblr blog, created in the current directory.

The script will not re-download photos or videos that already exist on disk, so running it several times does no harm; repeated runs simply fetch whatever is still missing.

Use Proxies (Optional)

You may want to use proxies when downloading. Refer to ./proxies_sample1.json and ./proxies_sample2.json, then save your own proxies to ./proxies.json in JSON format. You can validate the file at http://jsonlint.com/.

If ./proxies.json is an empty file, no proxies will be used during downloading.

If you are using Shadowsocks in global mode, your ./proxies.json might look like this:

{
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080"
}
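If you use a plain HTTP proxy instead, the same file format applies; the address and port below are placeholders that you would replace with your own proxy's:

{
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080"
}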

And now you can enjoy your downloads.

More customizations (for programmers only)

# Request timeout in seconds
TIMEOUT = 10

# Number of retries for a failed download
RETRY = 5

# Medium index number to start from
START = 0

# Number of photos/videos per page
MEDIA_NUM = 50

# Number of concurrent download threads
THREADS = 10

You can set TIMEOUT to another value, e.g. 50, depending on your network quality.

The script will retry a failed image or video download several times (five by default).

You can also download only photos or only videos by commenting out the corresponding line:

def download_media(self, site):
    # only download photos
    self.download_photos(site)
    #self.download_videos(site)

or

def download_media(self, site):
    # only download videos
    #self.download_photos(site)
    self.download_videos(site)

tumblr-crawler's People

Contributors

b0unt9, dixudx, gwpl, haoliang-quan, joshuakwan, karlicoss, kyeolee89, railwaycat, tainakadrums, timgates42, whitelok


tumblr-crawler's Issues

an error when downloading videos

Here is the error output:

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 193, in _download_media
    proxies=self.proxies)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 630, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 190, in resolve_redirects
    **adapter_kwargs
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/Users/zhang/anaconda/lib/python2.7/site-packages/requests/adapters.py", line 473, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))

I have no idea what's happening. I hope you can help.
Thank you!

Max retries exceeded with url: /api/read?type=photo&num=50&start=0

Does Tumblr limit the maximum number of requests?

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 193, in _download_media
    proxies=self.proxies)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 513, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 645, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 212, in resolve_redirects
    **adapter_kwargs
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 623, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 504, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='demo.tumblr.com', port=443): Max retries exceeded with url: /api/read?type=photo&num=50&start=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x6ffff8269d0>: Failed to establish a new connection: [Errno 116] Connection timed out',))

Adding timestamps to mark the start and end dates of the download

I suggest adding two timestamps: one for the start date of the posts to download, and one for the end date.
For example: on March 1, 2017 I downloaded all of a Tumblr blogger's posts (51 GiB) and kept only a few of the photos/videos (1 GiB). Now I want to download the posts between March 1, 2017 and April 1, 2017, but I don't want to re-download the posts from before March 1, 2017 (disk space is limited, after all).

About proxies

Hello, I am a Python beginner learning to write crawlers, and I am really confused by the proxy setup. I tried to follow the samples, but it still won't work.

It says:
requests.exceptions.InvalidSchema: Missing dependencies for SOCKS support.

Could you make the sample more specific? The proxy I am using is called greenvpn.
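For reference, this particular error usually means the PySocks package, which requests needs for socks5:// proxy URLs, is not installed; since requirements.txt already lists PySocks, installing it is the likely fix:

$ pip install pysocks   # or: pip install -r requirements.txt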

The program won't run?

After running it, I get:

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 7, in <module>
    from six.moves import queue as Queue
ModuleNotFoundError: No module named 'six'
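Since six is listed in requirements.txt, installing it (or the full requirements file) should resolve this:

$ pip install six   # or: pip install -r requirements.txt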

Video player may fail to load, causing the URL build to fail

In the function _handle_medium_url, post["video-player"][1] might not have the field #text when the video player in a post fails to load. For such a post, post["video-player"][1] is just OrderedDict([(u'@max-width', u'500')]).

Please add some code to handle this error:

try:
    video_player = post["video-player"][1]["#text"]
except KeyError:
    return None

ERROR when downloading tumblr

I'm using ShadowsocksR with socks5 through 127.0.0.1:10323.
When I run the script, the following appears:

You are using proxies.
{'http': 'socks5://127.0.0.1:10323', 'https': 'socks5://127.0.0.1:10323'}
Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 227, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 127, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 141, in scheduling
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 153, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 170, in _download_media
    data = xmltodict.parse(response.content)
  File "C:\Users\jdds1\AppData\Local\Programs\Python\Python36\lib\site-packages\xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28, column 41

I have no idea what happened, and I've confirmed that my proxy is stable because the target website can be visited from my browser.
Hoping for a reply soon.

KeyError: '#text' error

Everything works fine at first, but after 1000-odd posts a KeyError: '#text' appears.
Here is the output:
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/root/tumblr-crawler/tumblr-photo-video-ripper.py", line 38, in run
    self.download(medium_type, post, target_folder)
  File "/root/tumblr-crawler/tumblr-photo-video-ripper.py", line 43, in download
    medium_url = self._handle_medium_url(medium_type, post)
  File "/root/tumblr-crawler/tumblr-photo-video-ripper.py", line 53, in _handle_medium_url
    video_player = post["video-player"][1]["#text"]
KeyError: '#text'
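A plausible guard, echoing the fix suggested in the video-player issue above and assuming the failure is just a post without a usable #text field, is to skip such posts:

try:
    video_player = post["video-player"][1]["#text"]
except (KeyError, IndexError):
    # no usable video URL in this post; skip it
    return None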

"not well-formed" error occurs on some sites

        try:
            data = xmltodict.parse(response.content)
        except Exception as e:
            print(e)
            break

The error occurs at the line that parses data; e reports "not well-formed". Could this be because the site blocks the request?

This error only appears for certain usernames, e.g. tsukimitsuki.
Could Japanese-language blogs be the problem?

<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 291, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/usr/lib/python2.7/dist-packages/xmltodict.py", line 248, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 11, column 2289
Downloading tumblr_nidkcnIqFf1tbp9pfo8_1280.jpg from http://xxxxxxxx/tumblr_nidkcnIqFf1tbp9pfo8_1280.jpg.

Downloading tumblr_nidkcnIqFf1tbp9pfo9_1280.jpg from http://xxxxxxxxxx/tumblr_nidkcnIqFf1tbp9pfo9_1280.jpg.

Exception in thread Thread-2 (most likely raised during interpreter shutdown):Exception in thread Thread-1 (most likely raised during interpreter shutdown):Exception in thread Thread-7 (most likely raised during interpreter shutdown):

Traceback (most recent call last):
Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "tumblr-photo-video-ripper.py", line 65, in run

File "tumblr-photo-video-ripper.py", line 65, in run File "tumblr-photo-video-ripper.py", line 72, in download

File "tumblr-photo-video-ripper.py", line 136, in _download File "tumblr-photo-video-ripper.py", line 72, in download

File "tumblr-photo-video-ripper.py", line 136, in _download<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'

Traceback (most recent call last):
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner

File "tumblr-photo-video-ripper.py", line 65, in run
File "tumblr-photo-video-ripper.py", line 72, in download
File "tumblr-photo-video-ripper.py", line 136, in _download
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'
Exception in thread Thread-6 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "tumblr-photo-video-ripper.py", line 65, in run
File "tumblr-photo-video-ripper.py", line 72, in download
File "tumblr-photo-video-ripper.py", line 136, in _download
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'remove'

No proxy is set; is there some problem with the XML decoding?

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

➜  tumblr-crawler git:(master) ✗ pip install -r requirements.txt
Requirement already satisfied: requests>=2.10.0 in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 1))
Requirement already satisfied: xmltodict in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 2))
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from -r requirements.txt (line 3))
Requirement already satisfied: PySocks>=1.5.6 in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 4))
Collecting defusedexpat (from -r requirements.txt (line 5))
  Downloading http://mirrors.aliyun.com/pypi/packages/2f/cc/56e82058fa3bfbe75b8601f91e0ed2b586fb6aef3105fc0ff734371971e3/defusedexpat-0.4.zip (275kB)
    100% |████████████████████████████████| 276kB 62kB/s
Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.5/dist-packages (from requests>=2.10.0->-r requirements.txt (line 1))
Building wheels for collected packages: defusedexpat
  Running setup.py bdist_wheel for defusedexpat ... error
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-jgopl29n/defusedexpat/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpl4kxy_u4pip-wheel- --python-tag cp35:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.5
  copying defusedexpat.py -> build/lib.linux-x86_64-3.5
  running build_ext
  building 'pyexpat' extension
  creating build/temp.linux-x86_64-3.5
  creating build/temp.linux-x86_64-3.5/Modules35
  creating build/temp.linux-x86_64-3.5/expat
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DHAVE_EXPAT_CONFIG_H=1 -DUSE_PYEXPAT_CAPI -I/tmp/pip-build-jgopl29n/defusedexpat/expat -I/usr/include/python3.5m -c Modules35/pyexpat.c -o build/temp.linux-x86_64-3.5/Modules35/pyexpat.o
  x86_64-linux-gnu-gcc: error: Modules35/pyexpat.c: No such file or directory
  x86_64-linux-gnu-gcc: fatal error: no input files
  compilation terminated.
  error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

  ----------------------------------------
  Failed building wheel for defusedexpat
  Running setup.py clean for defusedexpat
Failed to build defusedexpat
Installing collected packages: defusedexpat
  Running setup.py install for defusedexpat ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-jgopl29n/defusedexpat/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-hhaa9vah-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.5
    copying defusedexpat.py -> build/lib.linux-x86_64-3.5
    running build_ext
    building 'pyexpat' extension
    creating build/temp.linux-x86_64-3.5
    creating build/temp.linux-x86_64-3.5/Modules35
    creating build/temp.linux-x86_64-3.5/expat
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DHAVE_EXPAT_CONFIG_H=1 -DUSE_PYEXPAT_CAPI -I/tmp/pip-build-jgopl29n/defusedexpat/expat -I/usr/include/python3.5m -c Modules35/pyexpat.c -o build/temp.linux-x86_64-3.5/Modules35/pyexpat.o
    x86_64-linux-gnu-gcc: error: Modules35/pyexpat.c: No such file or directory
    x86_64-linux-gnu-gcc: fatal error: no input files
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-jgopl29n/defusedexpat/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-hhaa9vah-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-jgopl29n/defusedexpat/
➜  tumblr-crawler git:(master) ✗

Downloading videos sometimes does not work

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 166, in download_media
    self.download_videos(site)
  File "tumblr-photo-video-ripper.py", line 169, in download_videos
    self._download_media(site, "video", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/Library/Python/2.7/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 87: unexpected end of data

Help Me!

I used the script, but it doesn't work; it fails with the alert below:

(screenshot: 2017-09-07 18-54-15)

Failed when the site does not exist

[WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond.',))

I wrote this into proxies.json:

{
    "http": "127.0.0.1:14225",
    "SOCKS": "127.0.0.1:14226"
}
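For reference, requests expects the keys of this mapping to be URL schemes (http/https), so a SOCKS proxy would normally be written like the README's Shadowsocks sample; the ports below are copied from the paste above and may need adjusting:

{
    "http": "socks5://127.0.0.1:14226",
    "https": "socks5://127.0.0.1:14226"
}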

Access Denied / Failed to retrieve video from xxx

file/t:EZ5TduU8a0fT8bzyRGLg2w/139894566409/tumblr_o1er2qBl811ulzyf6.

Access Denied when retrieve https://xxxx.tumblr.com/video_file/t:EZ5TduU8a0fT8bzyRGLg2w/140259088414/tumblr_nx8g10QplS1ugg6cn.

Failed to retrieve video from https://xxxx.tumblr.com/video_file/t:EZ5TduU8a0fT8bzyRGLg2w/140259088414/tumblr_nx8g10QplS1ugg6cn.

Access Denied when retrieve https://xxxx.tumblr.com/video_file/t:EZ5TduU8a0fT8bzyRGLg2w/142448116304/tumblr_o4j2fig0iM1tqr9po.

What do these two errors mean?

Downloading likes from a tumblr

Hello,

First of all, thank you very much; it works great. The issue is that I failed to download likes from a Tumblr blog. I tried writing the site in tumblr.com/liked/by/[tumblr name] format in the sites.txt file, but got no results. Is it possible to download likes from a Tumblr blog using this code? If so, can you please help?

Thank you,

spacekittylasereyes

[BUG] Does not download videos.

I just attempted to copy a Tumblr blog, and none of the videos were downloaded. The blog in question had dozens of videos and only a few non-video posts.

Login feature?

Some blogs may not allow logged-out users to view them (there is an option named "allow logged-out users to see this blog" in the privacy settings), and the script will say 'Site *** does not exist' when it tries to access such a site. If you visit these sites in a browser, xxx.tumblr.com redirects to your dashboard and opens the blog as a sidebar.

The script errors out and exits almost immediately, and re-running it does the same

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 291, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 166, in download_media
    self.download_videos(site)
  File "tumblr-photo-video-ripper.py", line 169, in download_videos
    self._download_media(site, "video", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/usr/lib/python2.7/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 480, column 1496

Error when running on an Ubuntu VPS

Error message:

File "/usr/lib/python2.7/dist-packages/cffi/vengine_cpy.py", line 149, in load_library
    raise ffiplatform.VerificationError(error)
cffi.ffiplatform.VerificationError: importing '/usr/lib/python2.7/dist-packages/cryptography/_Cryptography_cffi_813c10e0x7adb75f8.x86_64-linux-gnu.so': /usr/lib/python2.7/dist-packages/cryptography/_Cryptography_cffi_813c10e0x7adb75f8.x86_64-linux-gnu.so: symbol SSLv2_client_method, version OPENSSL_1.0.0 not defined in file libssl.so.1.0.0 with link time reference

How can I solve this? Thanks.

xml parsing error

I ran into the same problem as @Yodamt in issue #31; it looks like this:

File "tumblr-photo-video-ripper.py", line 199, in _download_media
data = xmltodict.parse(response.content)
File "C:\Python27\lib\site-packages\xmltodict.py", line 330, in parse
parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 21, column 2311

I inspected the data and found that, at the corresponding position in response.content, a quotation mark is followed by a \b character; I am not sure whether Python dropped that quotation mark because of it, which would make the XML fail to parse.

Following the suggestion in issue #31, I changed MEDIA_NUM to 100, but it did not help.

Error when the proxies.json file is empty, Python 2.11

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 206, in <module>
    illegal_json()
  File "tumblr-photo-video-ripper.py", line 190, in illegal_json
    print(u"文件proxies.json格式非法.\n"
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

I don't know why, but I'm using a VPN, and simply commenting out this whole block made it work.
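For reference, this UnicodeEncodeError typically means Python 2 is printing non-ASCII text to a console whose encoding defaults to ASCII; forcing UTF-8 output is one possible workaround (a sketch, assuming a Unix-like shell):

$ PYTHONIOENCODING=utf-8 python tumblr-photo-video-ripper.py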

ERROR

It started throwing errors for no obvious reason, even though it had worked before. It succeeded twice in total; the second time it only downloaded the first blog in sites.txt, and after that it would not download anything anymore.
Error output:

Traceback (most recent call last):
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\contrib\pyopenssl.py", line 438, in wrap_socket
cnx.do_handshake()
File "C:\Users\Administrator\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1638, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1378, in _raise_ssl_error
_raise_current_error()
File "C:\Users\Administrator\Anaconda3\lib\site-packages\OpenSSL_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 810, in prepare_proxy
conn.connect()
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connection.py", line 326, in connect
ssl_context=context)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\util\ssl
.py", line 325, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\contrib\pyopenssl.py", line 445, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\adapters.py", line 438, in send
timeout=timeout
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 630, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tumblr-photo-video-ripper.py", line 288, in
CrawlerScheduler(sites, proxies=proxies)
File "tumblr-photo-video-ripper.py", line 149, in init
self.scheduling()
File "tumblr-photo-video-ripper.py", line 162, in scheduling
self.download_media(site)
File "tumblr-photo-video-ripper.py", line 165, in download_media
self.download_photos(site)
File "tumblr-photo-video-ripper.py", line 176, in download_photos
self._download_media(site, "photo", START)
File "tumblr-photo-video-ripper.py", line 193, in _download_media
proxies=self.proxies)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 518, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 661, in send
history = [resp for resp in gen] if allow_redirects else []
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 661, in
history = [resp for resp in gen] if allow_redirects else []
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 214, in resolve_redirects
**adapter_kwargs
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\sessions.py", line 639, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\requests\adapters.py", line 512, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",)


Request: download the photo tags

As the title says: after downloading, I found that a huge number of the photos have 'garbled' strings as file names, which makes later management and retrieval very difficult. Please add the ability to use the post name as the file name, for easier management and retrieval. Thanks.

The Slack link doesn't work

It opens to a login page, with no way to sign up. It seems that perhaps only the project owner can log in.

Script error; the Shadowsocks global proxy is definitely working fine

D:\python\tumblr-crawler-master>python tumblr-photo-video-ripper.py
You are using proxies.
{'http': 'socks5://127.0.0.1:1080', 'https': 'socks5://127.0.0.1:1080'}
Traceback (most recent call last):
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 386, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 382, in _make_request
    httplib_response = conn.getresponse()
  File "D:\python\lib\http\client.py", line 1197, in getresponse
    response.begin()
  File "D:\python\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "D:\python\lib\http\client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\lib\site-packages\requests\adapters.py", line 423, in send
    timeout=timeout
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\python\lib\site-packages\requests\packages\urllib3\util\retry.py", line 347, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "D:\python\lib\site-packages\requests\packages\urllib3\packages\six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 386, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "D:\python\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 382, in _make_request
    httplib_response = conn.getresponse()
  File "D:\python\lib\http\client.py", line 1197, in getresponse
    response.begin()
  File "D:\python\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "D:\python\lib\http\client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 193, in _download_media
    proxies=self.proxies)
  File "D:\python\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "D:\python\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\python\lib\site-packages\requests\sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\python\lib\site-packages\requests\sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "D:\python\lib\site-packages\requests\adapters.py", line 473, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

D:\python\tumblr-crawler-master>

Cannot decode response data

Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0
Cannot decode response data from URL http://u44002.tumblr.com/api/read?type=video&num=50&start=0

It errors out when I run it

D:\Tumblr\tumblr>python tumblr-photo-video-ripper.py
You are using proxies.
{'http': 'socks5://127.0.0.1:1080', 'https': 'socks5://127.0.0.1:1080'}
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
chunked=chunked)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
response.begin()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
version, status, reason = self._read_status()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
timeout=timeout
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 357, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
raise value.with_traceback(tb)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
chunked=chunked)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
response.begin()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
version, status, reason = self._read_status()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tumblr-photo-video-ripper.py", line 291, in
CrawlerScheduler(sites, proxies=proxies)
File "tumblr-photo-video-ripper.py", line 149, in init
self.scheduling()
File "tumblr-photo-video-ripper.py", line 162, in scheduling
self.download_media(site)
File "tumblr-photo-video-ripper.py", line 165, in download_media
self.download_photos(site)
File "tumblr-photo-video-ripper.py", line 176, in download_photos
self._download_media(site, "photo", START)
File "tumblr-photo-video-ripper.py", line 193, in _download_media
proxies=self.proxies)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 490, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

D:\Tumblr\tumblr>

All the photos downloaded, but not the videos

Downloading tumblr_oq86meTiUR1w2sarao5_250.jpg from https://68.media.tumblr.com/3fa3f7c2ad890e3a0a20e3aa95fb77da/tumblr_oq86meTiUR1w2sarao5_250.jpg.

Traceback (most recent call last):
  File "tumblr-photo-video-ripper.py", line 288, in <module>
    CrawlerScheduler(sites, proxies=proxies)
  File "tumblr-photo-video-ripper.py", line 149, in __init__
    self.scheduling()
  File "tumblr-photo-video-ripper.py", line 162, in scheduling
    self.download_media(site)
  File "tumblr-photo-video-ripper.py", line 165, in download_media
    self.download_photos(site)
  File "tumblr-photo-video-ripper.py", line 176, in download_photos
    self._download_media(site, "photo", START)
  File "tumblr-photo-video-ripper.py", line 199, in _download_media
    data = xmltodict.parse(response.content)
  File "/usr/local/lib/python3.5/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 19, column 4010
➜ tumblr-crawler git:(master) ✗
