wanzixin / sinaweibo-locationsignin-spider

Crawls Sina Weibo mobile POIs, and the Weibo posts under each POI, city by city.

Python 100.00%

location python sina spider weibo

sinaweibo-locationsignin-spider's Issues

AttributeError: 'NoneType' object has no attribute 'group'

Traceback (most recent call last):
  File "C:/Users/XXJ/PycharmProjects/pythonProject1/poicrawler/crawler.py", line 254, in <module>
    main()
  File "C:/Users/XXJ/PycharmProjects/pythonProject1/poicrawler/crawler.py", line 234, in main
    spider.get_poi(ippool)
  File "C:/Users/XXJ/PycharmProjects/pythonProject1/poicrawler/crawler.py", line 64, in get_poi
    pois_id.append(poi_id.group())
AttributeError: 'NoneType' object has no attribute 'group'

I ran the code and got the error above. Does anyone know the cause? The error points to the following part:

        res = requests.get(cityURL+'&page='+str(page),proxies = proxy_ip,headers = headers)
        if res.status_code == 200:
            info = json.loads(res.text)

            if info['ok'] == 1:
                card_group = info['data']['cards'][0]['card_group']
                print(card_group)
                print(len(card_group))
                for i in range(0,len(card_group)):
                    poi_id = re.search(r'100101B2094[A-Z0-9]{15}',card_group[i]['scheme'])

                    pois_id.append(poi_id.group())
                    pois_name.append(card_group[i]['title_sub'])
            else:
                # Translation: "All POIs for this city have been crawled."
                print('这座城市poi已经爬取完毕了。')
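
The traceback says `poi_id` is `None` when `poi_id.group()` is called. `re.search` returns `None` whenever the pattern does not occur in the string, so at least one card's `scheme` does not contain an ID of the form `100101B2094` followed by 15 characters. One possible guard is sketched below, assuming that cards without a matching ID can simply be skipped. The helper name `extract_pois` is hypothetical and not part of the project.

```python
import re

# Same pattern as in the snippet above, compiled once.
POI_ID_PATTERN = re.compile(r'100101B2094[A-Z0-9]{15}')


def extract_pois(card_group):
    """Return (ids, names) for the cards whose 'scheme' contains a POI id.

    Hypothetical helper: cards with no match (or no 'scheme' key at all)
    are skipped instead of raising AttributeError.
    """
    pois_id, pois_name = [], []
    for card in card_group:
        match = POI_ID_PATTERN.search(card.get('scheme', ''))
        if match is None:
            # re.search found nothing, so there is no .group() to call.
            continue
        pois_id.append(match.group())
        pois_name.append(card.get('title_sub', ''))
    return pois_id, pois_name
```

Inside `get_poi`, the loop body could then become `ids, names = extract_pois(card_group)` followed by `pois_id.extend(ids)` and `pois_name.extend(names)`. If every card ends up skipped, the pattern may no longer fit the `scheme` values the API returns, which would need checking against the printed `card_group`.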

POI coverage

From the code, POIs are currently fetched for 10 pages per city. Is there a way to extend this? Thanks!
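
One way to go beyond a fixed page count is sketched here. It assumes that the endpoint keeps returning `ok == 1` for as long as more pages exist (the snippet in the first issue treats any other value as the end of the city's POIs). `iter_poi_pages` and `fetch_page` are hypothetical names; `fetch_page(page)` stands for whatever performs the request and returns the decoded JSON, or `None` on failure. Whether the server actually serves more than 10 pages is not something this sketch can guarantee.

```python
def iter_poi_pages(fetch_page, max_pages=None):
    """Yield (page_number, info) until the API stops reporting ok == 1.

    fetch_page: hypothetical callable taking a page number and returning
                the decoded JSON dict (or None if the request failed).
    max_pages:  optional safety cap; None means "no fixed limit".
    """
    page = 1
    while max_pages is None or page <= max_pages:
        info = fetch_page(page)
        if not info or info.get('ok') != 1:
            # Treated as "no more POIs for this city".
            break
        yield page, info
        page += 1
```

A caller would loop with `for page, info in iter_poi_pages(fetch_page):` and read `info['data']['cards'][0]['card_group']` as before. Keeping `max_pages` as an optional cap guards against an endpoint that never reports the end.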

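Proxies are dead + the spider fails to crawl correctly

Hello, the Xici proxy source used in the script is no longer available, and the spider still does not work correctly after switching to a different proxy source. I hope the author can update it when time permits.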

Proxies no longer work

Could you switch to a proxy source that currently works? Thank you very much!

Timeout problem

Hello, how should I resolve the following error, which occurs while crawling proxies?
----------------IP used to crawl proxies: {'http': '223.241.119.42:47972'} --------------------
Traceback (most recent call last):
  File "D:\Program Files\python36\lib\urllib\request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "D:\Program Files\python36\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "D:\Program Files\python36\lib\http\client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "D:\Program Files\python36\lib\http\client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "D:\Program Files\python36\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "D:\Program Files\python36\lib\http\client.py", line 964, in send
    self.connect()
  File "D:\Program Files\python36\lib\http\client.py", line 1392, in connect
    super().connect()
  File "D:\Program Files\python36\lib\http\client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "D:\Program Files\python36\lib\socket.py", line 724, in create_connection
    raise err
  File "D:\Program Files\python36\lib\socket.py", line 713, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 67, in get
    context=context,
  File "D:\Program Files\python36\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Program Files\python36\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "D:\Program Files\python36\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "D:\Program Files\python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "D:\Program Files\python36\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "D:\Program Files\python36\lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/Pycharm/weiboqiandao/crawler.py", line 255, in <module>
    main()
  File "E:/Pycharm/weiboqiandao/crawler.py", line 229, in main
    ippool = build_ippool()
  File "E:\Pycharm\weiboqiandao\buildip.py", line 82, in build_ippool
    results = p.get_proxy(page)
  File "E:\Pycharm\weiboqiandao\buildip.py", line 37, in get_proxy
    res = requests.get(url, proxies=proxy_ip, headers={'User-Agent': UserAgent(use_cache_server=False).random})
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\fake.py", line 69, in __init__
    self.load()
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\fake.py", line 78, in load
    verify_ssl=self.verify_ssl,
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 250, in load_cached
    update(path, use_cache_server=use_cache_server, verify_ssl=verify_ssl)
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 245, in update
    write(path, load(use_cache_server=use_cache_server, verify_ssl=verify_ssl))
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 178, in load
    raise exc
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 154, in load
    for item in get_browsers(verify_ssl=verify_ssl):
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 97, in get_browsers
    html = get(settings.BROWSERS_STATS_PAGE, verify_ssl=verify_ssl)
  File "D:\Program Files\python36\lib\site-packages\fake_useragent\utils.py", line 84, in get
    raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
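
Reading the traceback from the bottom up, the failure happens inside `UserAgent(...)` itself: `fake_useragent` times out while downloading its browser data and gives up after its retries, so `requests.get` is never reached. One possible workaround, sketched here rather than taken from the project, is to drop that download and pick from a small static list. `USER_AGENTS` and `random_headers` are hypothetical names, and the strings are just example browser User-Agent values.

```python
import random

# Hypothetical static pool of ordinary browser User-Agent strings.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent.

    No network access is needed, so this cannot fail with
    FakeUserAgentError the way the UserAgent() constructor did above.
    """
    return {'User-Agent': random.choice(USER_AGENTS)}
```

In `get_proxy`, the call would then read `requests.get(url, proxies=proxy_ip, headers=random_headers())`. The trade-off is that the list has to be refreshed by hand from time to time.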

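Request to update the code

Could this code be maintained and updated? Can it still be used for crawling? I tried it and could not crawl anything. Thanks in advance.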

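Project inquiry

Dear developer,
I have recently been following your project in order to analyze location data, but I ran into the following errors when running it:

In the function get_proxy in buildip.py, the line
    res = requests.get(url, proxies=proxy_ip, headers={'User-Agent': UserAgent(use_cache_server=False).random})
raises TypeError: __init__() got an unexpected keyword argument 'use_cache_server'. After removing that unexpected argument, a different error is raised on the same line:
    res = requests.get(url, proxies=proxy_ip, headers={'User-Agent': UserAgent().random})
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
I hope you can help fix this or update the code when you have time. Thank you very much!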
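
The `ConnectionResetError` means the other end closed the connection. From the message alone it is not possible to tell whether the proxy in `proxy_ip` or the proxy-list site itself did that. A minimal sketch of one mitigation follows, assuming a pool of candidate proxies is available: try each in turn and move on when one fails. `fetch_with_proxy_rotation` and the `fetch` parameter are hypothetical names. `OSError` is caught because both `ConnectionResetError` and the exceptions raised by `requests` derive from it.

```python
def fetch_with_proxy_rotation(url, proxies, fetch):
    """Try each proxy in turn and return the first successful response.

    proxies: list of proxy dicts such as {'http': 'host:port'}.
    fetch:   hypothetical callable fetch(url, proxy) that performs the
             request, for example:
             lambda url, proxy: requests.get(url, proxies=proxy, timeout=10)
    """
    last_error = None
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except OSError as exc:
            # Dead or refusing proxy: remember the error, try the next one.
            last_error = exc
    raise RuntimeError(
        'all %d proxies failed for %s' % (len(proxies), url)
    ) from last_error
```

Passing an explicit `timeout` in the `fetch` callable keeps a single unresponsive proxy from stalling the whole run. If every proxy fails, the source site is the more likely culprit, which matches the other reports here that the proxy source has gone offline.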
