kezhenxu94 / house-renting
Possibly the best practice of Scrapy 🕷 and renting a house 🏡
License: Apache License 2.0
Below is the log from startup:
2018-06-01 06:13:15 [scrapy.core.engine] INFO: Spider opened
2018-06-01 06:13:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-01 06:13:15 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/house-renting/crawler/house_renting/base_spider.py", line 12, in start_requests
city_url = city_url_mappings[city]
KeyError: '北'
2018-06-01 06:13:15 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-01 06:13:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 1, 6, 13, 15, 873859),
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 57556992,
'memusage/startup': 57556992,
'start_time': datetime.datetime(2018, 6, 1, 6, 13, 15, 859019)}
2018-06-01 06:13:15 [scrapy.core.engine] INFO: Spider closed (finished)
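A likely cause, judging from a later report in this thread that sets cities = (u'北京'): in Python, parentheses without a trailing comma do not create a tuple, so iterating over such a cities value yields single characters like '北', which then miss the city_url_mappings lookup. A minimal sketch of the pitfall (the mapping below is a placeholder, not the project's real one):

# Placeholder mapping for illustration only.
city_url_mappings = {u'北京': u'http://bj.example.com/zufang/'}

cities = (u'北京')   # parentheses alone -> still a str
print(list(cities))  # ['北', '京'] -- each lookup would raise KeyError: '北'

cities = (u'北京',)  # trailing comma -> one-element tuple
print(list(cities))  # ['北京'] -- city_url_mappings[city] now succeeds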
It seems the http scheme is missing:
ValueError: Missing scheme in request url: //pic8.58cdn.com.cn/anjuke_58/34e15fc377c3154c7af5781352a53540
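The URL is scheme-relative ('//host/path'), which browsers accept but scrapy.Request rejects. A minimal sketch of one way to normalize such URLs before they reach the images pipeline (the helper name is mine, not the project's):

raw_image_urls = [u'//pic8.58cdn.com.cn/anjuke_58/34e15fc377c3154c7af5781352a53540']

def absolutize(url, default_scheme=u'https'):
    # Prepend a scheme to scheme-relative URLs so Scrapy accepts them.
    return default_scheme + u':' + url if url.startswith(u'//') else url

image_urls = [absolutize(u) for u in raw_image_urls]

Inside a spider callback, response.urljoin(url) achieves the same normalization.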
Looks like we are more interested in Python 3 (see other issues), so let's migrate it to Python 3.
A clear and concise description of what the bug is.
The scrapyd and crawler containers started by docker-compose exit immediately. lianjia also exits after a while, though presumably because it has finished crawling.
docker-compose up -d
docker logs -f lianjia
docker logs -f scrapyd
docker logs -f crawler
Steps to reproduce the behavior:
root@ubuntu:/mnt/house-renting# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
16fae3c93371 house-renting/crawler "scrapy crawl 58" 2 hours ago Up 2 hours 58
473fa78fc6a6 house-renting/crawler "scrapy crawl lianjia" 2 hours ago Up 3 minutes lianjia
c1336d24f029 house-renting/crawler "scrapy crawl douban" 2 hours ago Up 2 hours douban
d81f4f5c9c5e house-renting/scrapyd "/bin/bash" 2 hours ago Exited (0) 3 minutes ago scrapyd
69660e516589 vickeywu/kibana-oss:6.3.2 "/docker-entrypoint.…" 2 hours ago Up 2 hours 0.0.0.0:5601->5601/tcp kibana
d88e85587d63 house-renting/crawler "/bin/bash" 2 hours ago Exited (0) 3 minutes ago crawler
8b1e03c93a95 redis "docker-entrypoint.s…" 2 hours ago Up 2 hours 0.0.0.0:6379->6379/tcp redis
2be0615aab21 vickeywu/elasticsearch-oss:6.4.1 "/usr/local/bin/dock…" 2 hours ago Up 2 hours 0.0.0.0:9200->9200/tcp, 9300/tcp elasticsearch
lianjia logs:
2019-04-08 06:19:02 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-08 06:19:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7988,
'downloader/request_count': 22,
'downloader/request_method_count/GET': 22,
'downloader/response_bytes': 404392,
'downloader/response_count': 22,
'downloader/response_status_count/200': 22,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 8, 6, 19, 2, 77559),
'item_dropped_count': 21,
'item_dropped_reasons_count/DropItem': 21,
'log_count/INFO': 33,
'log_count/WARNING': 21,
'memusage/max': 62763008,
'memusage/startup': 56500224,
'request_depth_max': 1,
'response_received_count': 22,
'scheduler/dequeued': 22,
'scheduler/dequeued/memory': 22,
'scheduler/enqueued': 22,
'scheduler/enqueued/memory': 22,
'start_time': datetime.datetime(2019, 4, 8, 6, 14, 47, 67989)}
2019-04-08 06:19:02 [scrapy.core.engine] INFO: Spider closed (finished)
For scrapyd and crawler, docker logs -f scrapyd and docker logs -f crawler show no log output at all.
Ubuntu 16.04
from scrapy.conf import settings
ModuleNotFoundError: No module named 'scrapy.conf'
Built and ran scrapy crawl lianjia on Windows 10.
OS: Windows 10
Python: 3.7
Scrapy: 1.7.3
Redis:
Elasticsearch:
Kibana:
Add any other context about the problem here.
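For the scrapy.conf error above: scrapy.conf was deprecated and has been removed in newer Scrapy releases (hence the ModuleNotFoundError on Scrapy 1.7.3); the supported replacement is get_project_settings (or crawler.settings inside middlewares and pipelines). A minimal sketch of the migration:

# Old (removed): from scrapy.conf import settings
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('BOT_NAME'))  # 'house_renting' for this project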
Is your feature request related to a problem? Please describe.
This project may be unfriendly to users who are not familiar with ES and Kibana.
Describe the solution you'd like
A browsable web page would make this project useful to many more people: users would effectively be using a site that aggregates listings from many other websites, without needing to learn ES queries or Kibana.
Describe alternatives you've considered
Find some ES and Kibana tutorials to help users who are unfamiliar with them.
Additional context
Implementing a basic page with Vue.js was my original idea.
Fuzzy searching with the site field, exact-phrase quotes (""), * wildcards, and the like is already enough to find plenty of housing information.
Describe the bug
Running docker-compose up --build -d fails with the following errors:
Creating elasticsearch ... error
Creating redis ... error
ERROR: for elasticsearch  Cannot create container for service elastic: invalid volume specification: 'F:\python\house-renting\data\elastic:/usr/share/elasticsearch/data:rw'
ERROR: for redis  Cannot create container for service redis: invalid volume specification: 'F:\python\house-renting\data\redis:/data:rw'
ERROR: for elastic  Cannot create container for service elastic: invalid volume specification: 'F:\python\house-renting\data\elastic:/usr/share/elasticsearch/data:rw'
ERROR: for redis  Cannot create container for service redis: invalid volume specification: 'F:\python\house-renting\data\redis:/data:rw'
ERROR: Encountered errors while bringing up the project.
Screenshots
Desktop (please complete the following information):
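Not part of the original report, but this invalid-volume-specification error on Windows is commonly worked around by telling Compose to convert Windows-style paths before retrying, e.g. in cmd.exe:

set COMPOSE_CONVERT_WINDOWS_PATHS=1
docker-compose up --build -d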
ERROR: error pulling image configuration: Get https://d2iks1dkcwqcbx.cloudfront.net/docker/registry/v2/blobs/sha256/38/3822ba554fe95f9ef68baa75cae97974135eb6aa8f8f37cadf11f6a59bde0139/data?Expires=1527842196&Signature=CHe3gZGVa~cy390rPxw-bobSFPU5boKRbQw4SqI6k8OgTvY~8BRuU-91Hbx8qPf~gt47ygwyXzmpidNHkh6eu6UmY0WBtABCFvQcy0cMalC9N5X7tT4LwgwsqykscMf3esBdIPTjQyk6g-c7ZEFGp1ox4OqW1dSOZHl9HG4Ke3L4D6ldtXdMPbdsZjQMlb5x3DObnQd2P4wRJnXHnjyiMaRrf~GUmg~iSuPDNXHcrRC0xbzkCmcj4cF2s4DmWFlXzouTuKZwZRJbJZ1uyo88DW417N~b~Df6jQHtG5P8qoATZ04UsU2R0yAeXlqgwsvlXp68TgUm5vy2uAt7TCN2sQ__&Key-Pair-Id=APKAIVAVKHB6SNHJAJQQ: net/http: TLS handshake timeout
OS: Ubuntu 16.04
Docker:
root@zhanghao-X555LI:/usr/local/src/house-renting# docker version
Client:
Version: 1.13.1
API version: 1.26
Go version: go1.6.2
Git commit: 092cba3
Built: Thu Nov 2 20:40:23 2017
OS/Arch: linux/amd64
Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.6.2
Git commit: 092cba3
Built: Thu Nov 2 20:40:23 2017
OS/Arch: linux/amd64
Experimental: false
Does pulling this image require a proxy to get around the firewall?
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a2a636a67a74 elasticsearch "/docker-entrypoint.…" 4 hours ago Exited (1) 2 seconds ago house-renting_elastic_run_1
9bece9b62acb redis "docker-entrypoint.s…" 4 hours ago Up 4 hours 6379/tcp house-renting_redis_run_1
91a9dd06893b kibana "/docker-entrypoint.…" 4 hours ago Up 4 hours 5601/tcp house-renting_kibana_run_1
65d5c5167e77 house-renting_lianjia "scrapy crawl lianjia" 4 hours ago Up 4 hours house-renting_lianjia_run_1
I have tried many times; elasticsearch still exits on its own.
Output of docker logs [elasticsearch_container_id]:
[2018-05-30T10:12:50,594][INFO ][o.e.n.Node ] [] initializing ...
[2018-05-30T10:12:50,633][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data/elasticsearch]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:134) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.6.9.jar:5.6.9]
Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data/elasticsearch]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:261) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.node.Node.<init>(Node.java:265) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.node.Node.<init>(Node.java:245) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342) ~[elasticsearch-5.6.9.jar:5.6.9]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132) ~[elasticsearch-5.6.9.jar:5.6.9]
... 6 more
My environment is Fedora 27 x86_64; the problems I ran into and their solutions are listed below.
Core error: Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
This error is caused by insufficient permissions on the data directory; changing data/elastic/ under the repository root to mode 777 fixes it. Crude but effective.
Reference: https://stackoverflow.com/q/41497520/4112667
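The equivalent command, assuming the repository root as the working directory:

chmod -R 777 data/elastic

(A narrower fix is to chown the directory to the UID the elasticsearch container runs as, typically 1000, but 777 works.)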
The full error message follows:
[2018-09-06T12:53:10,296][INFO ][o.e.n.Node ] [] initializing ...
[2018-09-06T12:53:10,448][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Failed to create node environment
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:125) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-6.2.4.jar:6.2.4]
at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:85) ~[elasticsearch-6.2.4.jar:6.2.4]
Caused by: java.lang.IllegalStateException: Failed to create node environment
at org.elasticsearch.node.Node.<init>(Node.java:267) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.node.Node.<init>(Node.java:246) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:323) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) ~[elasticsearch-6.2.4.jar:6.2.4]
... 6 more
Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:?]
at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_161]
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_161]
at java.nio.file.Files.createDirectories(Files.java:767) ~[?:1.8.0_161]
at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:204) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.node.Node.<init>(Node.java:264) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.node.Node.<init>(Node.java:246) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:323) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) ~[elasticsearch-6.2.4.jar:6.2.4]
... 6 more
Core error: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
This error means the kernel's limit on per-process virtual memory areas (vm.max_map_count) is too low. Run the following command to fix it:
sysctl -w vm.max_map_count=262144
Reference: docker-library/elasticsearch#111 (comment)
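Note (not in the original reply): sysctl -w only lasts until reboot. To persist the setting, add vm.max_map_count=262144 to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and apply it with sysctl -p.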
The full error message follows:
[2018-09-06T13:31:11,112][INFO ][o.e.n.Node ] [] initializing ...
[2018-09-06T13:31:11,310][INFO ][o.e.e.NodeEnvironment ] [shmG9r4] using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/mapper/fedora-root)]], net usable_space [14.5gb], net total_space [19.5gb], types [ext4]
[2018-09-06T13:31:11,311][INFO ][o.e.e.NodeEnvironment ] [shmG9r4] heap size [494.9mb], compressed ordinary object pointers [true]
[2018-09-06T13:31:11,315][INFO ][o.e.n.Node ] node name [shmG9r4] derived from node ID [shmG9r4iTYWkz7FJUJJIoA]; set [node.name] to override
[2018-09-06T13:31:11,316][INFO ][o.e.n.Node ] version[6.2.4], pid[1], build[ccec39f/2018-04-12T20:37:28.497551Z], OS[Linux/4.17.17-100.fc27.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_161/25.161-b14]
[2018-09-06T13:31:11,316][INFO ][o.e.n.Node ] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.x1oF74x1, -XX:+HeapDumpOnOutOfMemoryError, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Des.cgroups.hierarchy.override=/, -Xms512m, -Xmx512m, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config]
[2018-09-06T13:31:13,480][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [aggs-matrix-stats]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [analysis-common]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [ingest-common]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [lang-expression]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [lang-mustache]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [lang-painless]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [mapper-extras]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [parent-join]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [percolator]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [rank-eval]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [reindex]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [repository-url]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [transport-netty4]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded module [tribe]
[2018-09-06T13:31:13,483][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded plugin [ingest-geoip]
[2018-09-06T13:31:13,483][INFO ][o.e.p.PluginsService ] [shmG9r4] loaded plugin [ingest-user-agent]
[2018-09-06T13:31:18,990][INFO ][o.e.d.DiscoveryModule ] [shmG9r4] using discovery type [zen]
[2018-09-06T13:31:19,756][INFO ][o.e.n.Node ] initialized
[2018-09-06T13:31:19,756][INFO ][o.e.n.Node ] [shmG9r4] starting ...
[2018-09-06T13:31:19,955][INFO ][o.e.t.TransportService ] [shmG9r4] publish_address {172.24.0.2:9300}, bound_addresses {0.0.0.0:9300}
[2018-09-06T13:31:19,974][INFO ][o.e.b.BootstrapChecks ] [shmG9r4] bound or publishing to a non-loopback address, enforcing bootstrap checks
ERROR: [1] bootstrap checks failed
[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2018-09-06T13:31:19,994][INFO ][o.e.n.Node ] [shmG9r4] stopping ...
[2018-09-06T13:31:20,041][INFO ][o.e.n.Node ] [shmG9r4] stopped
[2018-09-06T13:31:20,042][INFO ][o.e.n.Node ] [shmG9r4] closing ...
[2018-09-06T13:31:20,057][INFO ][o.e.n.Node ] [shmG9r4] closed
Error message:
plugin:[email protected] | Unable to connect to Elasticsearch at http://elastic:9200.
kibana | {"type":"log","@timestamp":"2018-11-23T04:29:33Z","tags":["warning","elasticsearch","admin"],"pid":1,"message":"Unable to revive connection: http://elastic:9200/"}
Run pip install -r requirements.txt to install the required dependencies.
Currently douban covers only the Guangzhou Tianhe rental group; it would be good to add city selection.
Since douban has a huge number of rental groups, exposes no exhaustive list or API for them, and is widely considered one of the more reliable places to find housing (most posts come directly from tenants or landlords, shared or direct rentals), we need to find a way to implement city selection for douban.
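One possible shape for this (a sketch only; the group IDs below are placeholders, not verified Douban groups): maintain a per-city list of group IDs and build start URLs from whatever cities are configured.

# Placeholder group IDs -- real ones would have to be collected by hand,
# since Douban exposes no exhaustive list of rental groups.
douban_city_groups = {
    u'广州': [u'placeholder-gz-group'],
    u'北京': [u'placeholder-bj-group'],
}

def douban_start_urls(cities):
    for city in cities:
        for group_id in douban_city_groups.get(city, []):
            yield u'https://www.douban.com/group/%s/discussion' % group_id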
Change to the house-renting/crawler directory and run scrapy crawl lianjia; it fails with the following error:
2018-05-31 20:03:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: house_renting)
2018-05-31 20:03:03 [scrapy.utils.log] INFO: Overridden settings: {'AUTOTHROTTLE_MAX_DELAY': 10, 'NEWSPIDER_MODULE': 'house_renting.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 2.0, 'SPIDER_MODULES': ['house_renting.spiders'], 'AUTOTHROTTLE_START_DELAY': 10, 'RETRY_TIMES': 3, 'BOT_NAME': 'house_renting', 'DOWNLOAD_TIMEOUT': 30, 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15 ', 'TELNETCONSOLE_ENABLED': False, 'COMMANDS_MODULE': 'house_renting.commands', 'AUTOTHROTTLE_ENABLED': True, 'DOWNLOAD_DELAY': 5, 'AUTOTHROTTLE_DEBUG': True}
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in
sys.exit(execute())
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 149, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 156, in _run_command
cmd.run(args, opts)
File "/home/yuchen/House/house-renting/crawler/house_renting/commands/crawl.py", line 17, in run
self.crawler_process.crawl(spider_name, **opts.spargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 167, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 195, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 200, in _create_crawler
return Crawler(spidercls, self.settings)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 52, in init
self.extensions = ExtensionManager.from_crawler(self)
File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/usr/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
File "/usr/local/lib/python2.7/dist-packages/scrapy/extensions/memusage.py", line 16, in
from scrapy.mail import MailSender
File "/usr/local/lib/python2.7/dist-packages/scrapy/mail.py", line 22, in
from twisted.internet import defer, reactor, ssl
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/ssl.py", line 230, in
from twisted.internet._sslverify import (
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/_sslverify.py", line 15, in
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
and I have already run the pip install -r requirements.txt command. How can I fix this?
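Not from this thread, but ImportError: No module named _util almost always indicates that the installed pyOpenSSL and cryptography versions are out of sync; reinstalling the pair usually clears it:

pip install --upgrade pyOpenSSL cryptography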
Docker Hub has explicitly stated that the official elasticsearch image is no longer maintained there, so you should switch the elasticsearch image to one maintained by Elastic themselves.
However, Elastic's official registry seems to require a proxy; it apparently cannot be accessed directly from mainland China.
https://github.com/kezhenxu94/house-renting/blob/master/crawler/house_renting/items.py#L46-L49
In Python 3, you can have a u'string' and an r'string', but a ur'string' is a SyntaxError.
flake8 testing of https://github.com/kezhenxu94/house-renting on Python 3.6.3
$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
./crawler/house_renting/items.py:46:46: E999 SyntaxError: invalid syntax
minutes_ago = re.compile(ur'.*?(\d+)分钟前.*').search(value)
^
1 E999 SyntaxError: invalid syntax
1
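The straightforward Python 3 fix for items.py is to drop the u prefix and keep the raw-string r prefix (all Python 3 str literals are already unicode). A self-contained check:

import re

value = u'3分钟前更新'  # sample input for illustration
minutes_ago = re.compile(r'.*?(\d+)分钟前.*').search(value)
print(minutes_ago.group(1))  # 3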
2019-01-19 11:50:43 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: house_renting)
2019-01-19 11:50:43 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.3.1, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2019-01-19 11:50:43 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_DEBUG': True, 'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_MAX_DELAY': 10, 'AUTOTHROTTLE_START_DELAY': 10, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 2.0, 'BOT_NAME': 'house_renting', 'COMMANDS_MODULE': 'house_renting.commands', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 10, 'DOWNLOAD_TIMEOUT': 30, 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'house_renting.spiders', 'RETRY_TIMES': 3, 'SPIDER_MODULES': ['house_renting.spiders'], 'TELNETCONSOLE_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15 '}
2019-01-19 11:50:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2019-01-19 11:50:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['house_renting.middlewares.HouseRentingAgentMiddleware',
'house_renting.middlewares.HouseRentingProxyMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'house_renting.middlewares.HouseRentingRetryMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-19 11:50:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-19 11:50:44 [scrapy.middleware] INFO: Enabled item pipelines:
['house_renting.pipelines.HouseRentingPipeline',
'house_renting.pipelines.DuplicatesPipeline',
'scrapy.pipelines.images.ImagesPipeline',
'house_renting.pipelines.ESPipeline']
2019-01-19 11:50:44 [scrapy.core.engine] INFO: Spider opened
2019-01-19 11:50:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-19 11:50:45 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 337 ms | size: 40350 bytes
2019-01-19 11:50:57 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 19 ms | size: 258 bytes
2019-01-19 11:51:09 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 38 ms | size: 258 bytes
2019-01-19 11:51:23 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 31 ms | size: 0 bytes
2019-01-19 11:51:35 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 15 ms | size: 258 bytes
2019-01-19 11:51:44 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2019-01-19 13:00:25 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 34 ms | size: 0 bytes
2019-01-19 13:00:36 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 114 ms | size: 14012 bytes
2019-01-19 13:00:44 [scrapy.extensions.logstats] INFO: Crawled 277 pages (at 4 pages/min), scraped 0 items (at 0 items/min)
2019-01-19 13:00:49 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 93 ms | size: 14976 bytes
2019-01-19 13:01:03 [scrapy.extensions.throttle] INFO: slot: hu.58.com | conc: 1 | delay:10000 ms (+0) | latency: 31 ms | size: 0 bytes
Lianjia's pagination buttons are generated by JS, so we need a reliable pagination strategy.
LinkExtractor already extracts the correct pagination links, but Scrapy does not follow them.
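One candidate strategy (a sketch under the assumption that Lianjia's /zufang/pgN/ URL pattern holds; verify against the live site): build the page URLs directly instead of extracting the JS-generated buttons.

def lianjia_page_urls(base_url, max_pages=100):
    # Assumed pattern: https://<city>.lianjia.com/zufang/pg2/, pg3/, ...
    for n in range(2, max_pages + 1):
        yield u'%spg%d/' % (base_url, n)

# e.g. list(lianjia_page_urls(u'https://gz.lianjia.com/zufang/', max_pages=3))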
root@dockerU:/home/house-renting# docker -v
Docker version 1.13.1, build 092cba3
root@dockerU:/home/house-renting# docker-compose up --build -d
Traceback (most recent call last):
File "/usr/bin/docker-compose", line 9, in
load_entry_point('docker-compose==1.5.2', 'console_scripts', 'docker-compose')()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources.py", line 339, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources.py", line 2457, in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources.py", line 2171, in load
['name'])
File "/usr/lib/python2.7/dist-packages/compose/cli/main.py", line 22, in
from ..project import NoSuchService
File "/usr/lib/python2.7/dist-packages/compose/project.py", line 18, in
from .service import ContainerNet
File "/usr/lib/python2.7/dist-packages/compose/service.py", line 13, in
from docker.utils import LogConfig
ImportError: cannot import name LogConfig
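Not from the original report, but ImportError: cannot import name LogConfig is a known symptom of a very old docker-compose (1.5.2 here) running against a newer docker Python SDK; upgrading Compose normally resolves it:

pip install --upgrade docker-compose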
In a58.py and lianjia.py I set only cities = (u'北京'), but the resulting data is still for Guangzhou. (Note that (u'北京') without a trailing comma is just a string, not a tuple; see the KeyError: '北' log and the sketch near the top of this page.)
Also, most of the time there is only douban data; I have never seen any lianjia data. I followed the documented setup steps and nothing reported an error, yet the final data is wrong.
I am not familiar with this toolchain; which command re-fetches the data? Is it the initial docker-compose up --build -d?