douban-to-imdb's People

Contributors

dependabot[bot], fisheepx, librehat, steven1677

douban-to-imdb's Issues

TypeError: 'NoneType' object is not iterable

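After a great many rounds of changes, the script no longer reports syntax errors.
But it still fails every time with:
ModuleNotFoundError: No module named 'bs4'
Installing bs4 and beautifulsoup4, including installing them into a specified directory, did not help either.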

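On my own PC I installed the requirements without entering a virtual environment. After running the script, it successfully printed its "starting to fetch all viewing data" message:
开始抓取所有观影数据...
But partway through the scrape it raised another error: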

Traceback (most recent call last):
  File "douban_to_csv.py", line 144, in <module>
    export(sys.argv[1])
  File "douban_to_csv.py", line 113, in export
    info.extend(get_info(url))
TypeError: 'NoneType' object is not iterable

For details, see this post: "TypeError: 'NoneType' object is not iterable" in a python file

Do I need to change Python, or did I miss a step or get one wrong again?
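
A `ModuleNotFoundError` for `bs4` that persists after installation often means the package was installed for a different interpreter than the one running the script. The snippet below is a minimal sketch for checking that, using only the standard library. The helper name `describe_module` is made up for illustration, and the module list is an assumption based on what the script is shown using.

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file `name` would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # The interpreter that is actually running this script.
    print("interpreter:", sys.executable)
    for module in ("bs4", "lxml", "requests"):
        location = describe_module(module)
        print(module, "->", location or "NOT INSTALLED for this interpreter")
```

If a module is reported as missing, running `python -m pip install beautifulsoup4` with the same `python` that runs the script installs it into the matching environment.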

Movie pages currently cannot be scraped without being logged in

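def get_imdb_id(url):
    # Assumes requests, BeautifulSoup (from bs4) and USER_AGENT are
    # imported/defined at the top of the script.
    r = requests.get(url, headers={'User-Agent': USER_AGENT})
    soup = BeautifulSoup(r.text, 'lxml')
    info_area = soup.find(id='info')
    imdb_id = None
    try:
        if info_area:
            # Douban changed its page layout and the IMDb ID is no longer a link,
            # so walk the <span> elements from the end and read the text after each.
            spans = info_area.find_all('span')
            for index in range(-1, -len(spans) + 1, -1):
                imdb_id = spans[index].next_sibling.strip()
                if imdb_id.startswith('tt'):
                    break
        else:
            # Message: "cannot access this movie page without logging in"
            print('不登录无法访问此电影页面:', url)
    except Exception:
        # Message: "could not get the IMDb ID from this movie page"
        print('无法获得IMDB编号的电影页面:', url)
    # Return the value only if it is empty or looks like an IMDb ID ('tt...').
    return imdb_id if not imdb_id or imdb_id.startswith('tt') else None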

Does this mean the User-Agent and cookies have to be spoofed for it to work?
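
One way to try this is to send a browser User-Agent together with cookies copied from a logged-in browser session. The sketch below only prepares the arguments. The helper names `parse_cookie_header` and `build_request_kwargs` are hypothetical, and whether Douban accepts requests made this way is not confirmed here.

```python
def parse_cookie_header(cookie_header):
    """Turn a raw "name=value; name2=value2" string into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def build_request_kwargs(user_agent, cookie_header):
    """Build keyword arguments suitable for requests.get(url, **kwargs)."""
    return {
        "headers": {"User-Agent": user_agent},
        "cookies": parse_cookie_header(cookie_header),
    }
```

The result could be passed as `requests.get(url, **build_request_kwargs(USER_AGENT, cookie_string))`, where `cookie_string` is a placeholder for the value copied from the browser.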

TypeError: 'NoneType' object is not iterable

On the first run, the following appears somewhere after page ten:

Traceback (most recent call last):
  File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 144, in <module>
    export(sys.argv[1])
  File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 113, in export
    info.extend(get_info(url))
TypeError: 'NoneType' object is not iterable

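Every run after that reports the following (the first two lines read "1 page in total" and "starting to process page 1..."):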

总共 1 页
开始处理第 1 页...
Traceback (most recent call last):
  File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 144, in <module>
    export(sys.argv[1])
  File "/Users/sylvie/Repo/douban-to-imdb/douban_to_csv.py", line 113, in export
    info.extend(get_info(url))
TypeError: 'NoneType' object is not iterable

Switching the terminal proxy produces the same error.
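
The traceback shows that `get_info(url)` returned `None`, which `list.extend` cannot iterate. Why it returned `None` is not visible here. As a minimal sketch, the loop could check for that case and stop with a readable message while keeping the rows gathered so far. `collect` is a hypothetical stand-in for the loop inside `export`, and `get_info` is passed in so the example runs on its own.

```python
def collect(urls, get_info):
    """Gather rows from every page, stopping cleanly when a page yields None."""
    info = []
    for page_no, url in enumerate(urls, start=1):
        rows = get_info(url)
        if rows is None:
            # Treat None as "this page could not be read" instead of crashing.
            print(f"page {page_no} returned no data, stopping: {url}")
            break
        info.extend(rows)
    return info
```

This does not fix whatever makes the page unreadable; it only avoids the crash.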

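Export works, but rating on IMDb is slow, about 1 movie per minute

First of all, thank you for such a practical script. It works perfectly.

The only drawback is speed: rating on IMDb runs at about one movie per minute. For someone who has watched a lot of movies, say more than 1,000, the rating step would probably take a whole day.

This is not really an issue, just a topic for discussion.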
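
As a rough check of that estimate, the arithmetic can be written out. The function name and the default rate are illustrative, taken from the figure reported above.

```python
def estimated_hours(movie_count, movies_per_minute=1.0):
    """Estimate how long the rating step takes, in hours."""
    return movie_count / movies_per_minute / 60


if __name__ == "__main__":
    print(f"{estimated_hours(1000):.1f} hours for 1000 movies")
```

At one movie per minute, 1,000 movies come to roughly 16.7 hours.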

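Douban limits one IP to at most 10 pages at a time

While testing today I found that Douban limits a single IP to at most 10 pages within a short period.

I made a small change, adding a pagination = 1 parameter, as follows: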

def export(user_id):
    urls = url_generator(user_id)
    info = []
    pagination = 1  # first page to fetch on this run; raise it to resume later
    page_no = pagination
    for idx, url in enumerate(urls, start=1):
        if idx < pagination:
            continue  # skip pages before the starting page
        if IS_OVER:  # or page_no == pagination + 5
            break
        print(f'开始处理第 {page_no} 页...')
...

By adjusting the pagination value and switching between different VPN servers, everything can be scraped.
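
The resume-by-batches idea can be sketched as a small helper that lists which pages one run should fetch. `pages_for_run` and `batch_size` are hypothetical names, and the default of 10 is only the limit reported above, which may differ in practice.

```python
def pages_for_run(total_pages, pagination, batch_size=10):
    """Return the 1-based page numbers to fetch in one run, starting at `pagination`."""
    last = min(total_pages, pagination + batch_size - 1)
    return list(range(pagination, last + 1))


if __name__ == "__main__":
    # With 25 pages in total, three runs cover everything.
    for start in (1, 11, 21):
        print(start, "->", pages_for_run(25, start))
```

Each run would then use the next starting value, with the VPN server changed between runs as described.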
