fisheepx / douban-to-imdb
Export Douban movie ratings to IMDB, then import the IMDB watch history into Trakt.
After a great many modifications, it no longer reports syntax errors.
But it always fails with:
ModuleNotFoundError: No module named 'bs4'
Installing bs4 and beautifulsoup4, including installing them into a specified directory, did not help.
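One common cause of this error is that the package was installed for a different interpreter than the one running the script. The following is a minimal sketch for checking that, using only the standard library; the helper name `module_report` is made up for illustration and is not part of the project.

```python
import importlib.util
import sys


def module_report(name):
    """Return the file `name` would be imported from, or None if it is missing."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # The interpreter that is actually running this script.
    print("interpreter:", sys.executable)
    # None here means this interpreter cannot see the bs4 package.
    print("bs4 found at:", module_report("bs4"))
```

If the second line prints `None`, running `python -m pip install beautifulsoup4` with that same interpreter is one way to make sure the package lands where the script looks for it.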
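On my own PC, without entering a virtual environment but with the requirements installed, the script ran and successfully printed the following ("Starting to fetch all viewing data..."):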
开始抓取所有观影数据...
But partway through the scraping it raised another error:
Traceback (most recent call last):
File "douban_to_csv.py", line 144, in <module>
export(sys.argv[1])
File "douban_to_csv.py", line 113, in export
info.extend(get_info(url))
TypeError: 'NoneType' object is not iterable
For details, see this post: "TypeError: 'NoneType' object is not iterable" in a python file
Do I need to change the Python code, or did I miss a step or do one wrong?
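The traceback shows that `get_info(url)` returned `None`, and `list.extend(None)` raises exactly this `TypeError`. The body of `get_info` is not shown in this thread, so the reason it returns `None` is unknown; a blocked or empty page is one possibility. Below is a minimal sketch of a guard at the call site. Both `extend_safely` and `fake_get_info` are hypothetical names used only for illustration.

```python
def extend_safely(info, result, url):
    """Extend `info` with `result`; return False if `result` is None."""
    if result is None:
        print("no data returned for", url)
        return False
    info.extend(result)
    return True


def fake_get_info(url):
    # Stand-in for the script's get_info: returns None for a "blocked" page.
    return None if "blocked" in url else [url + "#item"]


if __name__ == "__main__":
    info = []
    for url in ["page1", "blocked-page", "page3"]:
        if not extend_safely(info, fake_get_info(url), url):
            break  # stop cleanly instead of crashing; the caller could retry later
    print(info)
```

Stopping cleanly keeps whatever was collected so far, which matters if the pages fetched before the failure are expensive to fetch again.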
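import requests
from bs4 import BeautifulSoup

# USER_AGENT is assumed to be defined elsewhere in the script.
def get_imdb_id(url):
    r = requests.get(url, headers={'User-Agent': USER_AGENT})
    soup = BeautifulSoup(r.text, 'lxml')
    info_area = soup.find(id='info')
    imdb_id = None
    try:
        if info_area:
            # Douban changed its page layout and the IMDB ID is no longer a link,
            # so scan the span elements from the end for text starting with 'tt'.
            spans = info_area.find_all('span')
            for index in range(-1, -len(spans) + 1, -1):
                imdb_id = spans[index].next_sibling.strip()
                if imdb_id.startswith('tt'):
                    break
        else:
            # Message: "This movie page cannot be accessed without logging in"
            print('不登录无法访问此电影页面:', url)
    except Exception:
        # Message: "Could not get the IMDB ID from this movie page"
        print('无法获得IMDB编号的电影页面:', url)
    finally:
        # Return the ID only if it looks like an IMDB ID; otherwise None.
        return imdb_id if not imdb_id or imdb_id.startswith('tt') else None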
Do I really have to fake the UA and cookies for this to work?
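For reference, attaching a User-Agent and a cookie to a request could look like the sketch below. It uses the standard library's `urllib` rather than `requests` purely to stay dependency-free, and the helper name `build_request`, the URL, and the header values are all placeholders. Whether Douban accepts such a request is not confirmed anywhere in this thread.

```python
import urllib.request


def build_request(url, user_agent, cookie=None):
    """Build a Request carrying a User-Agent header and an optional Cookie header."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request(
        "https://example.com/",           # placeholder URL
        user_agent="Mozilla/5.0 (placeholder)",
        cookie="name=value",              # placeholder; would come from a logged-in browser
    )
    print(req.header_items())
    # urllib.request.urlopen(req) would send it; omitted so the sketch stays offline.
```

With `requests`, the equivalent is passing the same dictionary through the `headers=` argument, as `get_imdb_id` already does for the User-Agent.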
On the first run, somewhere past page ten, it reports:
Traceback (most recent call last):
File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 144, in <module>
export(sys.argv[1])
File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 113, in export
info.extend(get_info(url))
TypeError: 'NoneType' object is not iterable
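After that, every run reports the following (the first two lines read "1 page in total" and "Processing page 1..."):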
总共 1 页
开始处理第 1 页...
Traceback (most recent call last):
File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 144, in <module>
export(sys.argv[1])
File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 113, in export
info.extend(get_info(url))
TypeError: 'NoneType' object is not iterable
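Switching the terminal proxy produces the same error.
When importing movie ratings into IMDB, every entry reports “无法在IMDB上找到” ("cannot be found on IMDB") 😂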
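First of all, thank you for such a practical script; it works perfectly.
The only drawback is that it is slow: rating on IMDB proceeds at roughly one movie per minute. For someone who has watched a lot, say more than 1000 movies, the rating step would probably take a whole day.
This is not really an issue, just a point for discussion.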
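To put a number on that estimate, here is a short calculation that assumes the rate of about one movie per minute quoted above; the function name `rating_time_hours` is made up for illustration.

```python
def rating_time_hours(movie_count, minutes_per_movie=1.0):
    """Estimated time in hours needed to rate `movie_count` movies."""
    return movie_count * minutes_per_movie / 60


if __name__ == "__main__":
    # 1000 movies at one per minute
    print(f"{rating_time_hours(1000):.1f} hours")  # prints "16.7 hours"
```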
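I tried it today and found that Douban limits a single IP to fetching at most 10 pages within a short period,
so I modified the script slightly to add pagination = 1.
The change is as follows: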
def export(user_id):
    urls = url_generator(user_id)
    info = []
    pagination = 1  # page to start from; raise it to resume on a later run
    page_no = pagination
    for idx, url in enumerate(urls, start=1):
        if idx < pagination:
            continue  # skip pages before the starting page
        if IS_OVER:  # or page_no == pagination + 5
            break
        print(f'开始处理第 {page_no} 页...')
        ...
By adjusting the pagination value and switching between different VPN servers, everything can be fetched.
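The same resume-from-a-page idea can also be sketched with `itertools.islice`, which skips the earlier pages and caps the batch size in one step. This is an illustrative alternative, not the script's actual code: `pages_from`, `start_page`, and `max_pages` are hypothetical names, and the generator of page strings stands in for `url_generator`.

```python
from itertools import islice


def pages_from(urls, start_page, max_pages=None):
    """Yield (page_no, url) pairs beginning at the 1-based `start_page`."""
    stop = None if max_pages is None else start_page - 1 + max_pages
    batch = islice(urls, start_page - 1, stop)
    for page_no, url in enumerate(batch, start=start_page):
        yield page_no, url


if __name__ == "__main__":
    all_urls = (f"page-{n}" for n in range(1, 31))  # stand-in for url_generator
    # Fetch pages 11 through 20 only, e.g. after pages 1-10 were done from another IP.
    for page_no, url in pages_from(all_urls, start_page=11, max_pages=10):
        print(page_no, url)
```

Capping each run at the 10-page limit reported above avoids triggering the block partway through a batch.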
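Log in to your Douban account, then click your name in the top-right corner to open your profile page. The ID is in the URL you are redirected to: https://www.douban.com/people/[the number here is your user_id]/
Every time I refresh the page, the number there changes.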
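Given a profile URL of the form described above, the ID is the path segment right after `/people/`. Here is a minimal sketch of pulling it out with a regular expression; `extract_user_id` and the sample ID are hypothetical.

```python
import re

# Matches https://www.douban.com/people/<id>/ and captures <id>.
PROFILE_RE = re.compile(r"^https?://www\.douban\.com/people/([^/]+)/?")


def extract_user_id(profile_url):
    """Return the path segment after /people/, or None if the URL does not match."""
    match = PROFILE_RE.match(profile_url)
    return match.group(1) if match else None


if __name__ == "__main__":
    # "12345678" is a made-up ID used only for illustration.
    print(extract_user_id("https://www.douban.com/people/12345678/"))
```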