scrapyd-django-template's Introduction

Scrapyd-Django-Template

Basic setup to run Scrapyd + Django and save crawled data in Django models. You can be up and running in just a few minutes. This template includes:

  • Basic structure of a Django project.
  • Basic structure of a Scrapy project.
  • Configuration of Scrapy so it can access Django model objects.
  • Basic Scrapy pipeline to save crawled objects to Django models (see the sketch after this list).
  • Basic spider definition.
  • Basic demo from the official Scrapy tutorial that crawls data from http://quotes.toscrape.com
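
As a rough sketch of how these pieces usually fit together (the module, app, and model names below are illustrative, not necessarily the ones used in this template): the Scrapy settings module bootstraps Django before any models are imported, and a pipeline then persists each crawled item.

# scrapy_app/settings.py -- illustrative sketch; exact module names differ per project
import os
import django

# Point Scrapy's process at the Django settings and initialise the app
# registry so Django models can be imported from Scrapy code.
os.environ['DJANGO_SETTINGS_MODULE'] = 'my_django_project.settings'  # hypothetical module
django.setup()

# scrapy_app/pipelines.py -- illustrative sketch
from my_django_app.models import Quote  # hypothetical app and model

class DjangoSavePipeline:
    def process_item(self, item, spider):
        # Persist each crawled item as a Django model instance.
        Quote.objects.create(
            text=item.get('text'),
            author=item.get('author'),
        )
        return item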

Setup

1 - Install requirements

$ pip install -r requirements.txt

2 - Configure the database

$ python manage.py migrate

3 - Create a superuser to log in to the Django admin

$ python manage.py createsuperuser

Start the project

In order to start this project you will need to have Django and Scrapyd running at the same time.

To run Django:

$ python manage.py runserver

To run Scrapyd:

$ cd scrapy_app
$ scrapyd

Demo

Django is running on: http://127.0.0.1:8000
Scrapyd is running on: http://0.0.0.0:6800

At this point you will be able to send job requests to Scrapyd. This project is set up with a demo spider from the official Scrapy tutorial. To run it, send an HTTP request to Scrapyd with the job info:

curl http://127.0.0.1:6800/schedule.json -d project=default -d spider=toscrape-css
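
The same call can be made from Python, for instance from a Django view; a minimal sketch using the requests library:

import requests

# Equivalent of the curl call above: schedule the demo spider on the
# local Scrapyd instance.
response = requests.post(
    'http://127.0.0.1:6800/schedule.json',
    data={'project': 'default', 'spider': 'toscrape-css'},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}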

Now go to http://127.0.0.1:8000/admin and log in using the superuser you created before. The crawled data will automatically be saved in the Django models.
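
If you add models of your own, remember that they only show up in the admin once registered; a minimal sketch (the Quote model name is illustrative):

# admin.py of the Django app -- illustrative sketch
from django.contrib import admin
from .models import Quote  # hypothetical model

admin.site.register(Quote)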

This repo is inspired by an article by Ali Oğuzhan Yıldız: https://medium.com/@ali_oguzhan/how-to-use-scrapy-with-django-application-c16fabd0e62e

scrapyd-django-template's People

Contributors

adriancast, arturba, dependabot[bot], douglara


scrapyd-django-template's Issues

Fix docker-compose

After upgrading Django to the latest version, docker-compose is not working. I think this can be fixed by upgrading the Python version of the container.

> [scrapyd-django-template-scrapyd 4/4] RUN pip install -r requirements.txt:
#0 0.748 Collecting Django==4.1.2 (from -r requirements.txt (line 1))
#0 0.942   ERROR: Could not find a version that satisfies the requirement Django==4.1.2 (from -r requirements.txt (line 1)) (from versions: 1.1.3, 1.1.4, 1.2, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5, 1.2.6, 1.2.7, 1.3, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.3.6, 1.3.7, 1.4, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.7, 1.4.8, 1.4.9, 1.4.10, 1.4.11, 1.4.12, 1.4.13, 1.4.14, 1.4.15, 1.4.16, 1.4.17, 1.4.18, 1.4.19, 1.4.20, 1.4.21, 1.4.22, 1.5, 1.5.1, 1.5.2, 1.5.3, 1.5.4, 1.5.5, 1.5.6, 1.5.7, 1.5.8, 1.5.9, 1.5.10, 1.5.11, 1.5.12, 1.6, 1.6.1, 1.6.2, 1.6.3, 1.6.4, 1.6.5, 1.6.6, 1.6.7, 1.6.8, 1.6.9, 1.6.10, 1.6.11, 1.7, 1.7.1, 1.7.2, 1.7.3, 1.7.4, 1.7.5, 1.7.6, 1.7.7, 1.7.8, 1.7.9, 1.7.10, 1.7.11, 1.8a1, 1.8b1, 1.8b2, 1.8rc1, 1.8, 1.8.1, 1.8.2, 1.8.3, 1.8.4, 1.8.5, 1.8.6, 1.8.7, 1.8.8, 1.8.9, 1.8.10, 1.8.11, 1.8.12, 1.8.13, 1.8.14, 1.8.15, 1.8.16, 1.8.17, 1.8.18, 1.8.19, 1.9a1, 1.9b1, 1.9rc1, 1.9rc2, 1.9, 1.9.1, 1.9.2, 1.9.3, 1.9.4, 1.9.5, 1.9.6, 1.9.7, 1.9.8, 1.9.9, 1.9.10, 1.9.11, 1.9.12, 1.9.13, 1.10a1, 1.10b1, 1.10rc1, 1.10, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6, 1.10.7, 1.10.8, 1.11a1, 1.11b1, 1.11rc1, 1.11, 1.11.1, 1.11.2, 1.11.3, 1.11.4, 1.11.5, 1.11.6, 1.11.7, 1.11.8, 1.11.9, 1.11.10, 1.11.11, 1.11.12, 1.11.13, 1.11.14, 1.11.15, 1.11.16, 1.11.17, 1.11.18, 1.11.20, 1.11.21, 1.11.22, 1.11.23, 1.11.24, 1.11.25, 1.11.26, 1.11.27, 1.11.28, 1.11.29, 2.0a1, 2.0b1, 2.0rc1, 2.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6, 2.0.7, 2.0.8, 2.0.9, 2.0.10, 2.0.12, 2.0.13, 2.1a1, 2.1b1, 2.1rc1, 2.1, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.1.5, 2.1.7, 2.1.8, 2.1.9, 2.1.10, 2.1.11, 2.1.12, 2.1.13, 2.1.14, 2.1.15, 2.2a1, 2.2b1, 2.2rc1, 2.2, 2.2.1, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 2.2.6, 2.2.7, 2.2.8, 2.2.9, 2.2.10, 2.2.11, 2.2.12, 2.2.13, 2.2.14, 2.2.15, 2.2.16, 2.2.17, 2.2.18, 2.2.19, 2.2.20, 2.2.21, 2.2.22, 2.2.23, 2.2.24, 2.2.25, 2.2.26, 2.2.27, 2.2.28, 3.0a1, 3.0b1, 3.0rc1, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.0.4, 3.0.5, 3.0.6, 3.0.7, 3.0.8, 3.0.9, 3.0.10, 3.0.11, 3.0.12, 3.0.13, 3.0.14, 3.1a1, 3.1b1, 3.1rc1, 3.1, 3.1.1, 3.1.2, 3.1.3, 3.1.4, 3.1.5, 3.1.6, 3.1.7, 3.1.8, 3.1.9, 3.1.10, 3.1.11, 3.1.12, 3.1.13, 3.1.14, 3.2a1, 3.2b1, 3.2rc1, 3.2, 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.2.5, 3.2.6, 3.2.7, 3.2.8, 3.2.9, 3.2.10, 3.2.11, 3.2.12, 3.2.13, 3.2.14, 3.2.15, 3.2.16)
#0 0.943 ERROR: No matching distribution found for Django==4.1.2 (from -r requirements.txt (line 1))
#0 1.067 WARNING: You are using pip version 19.1.1, however version 22.2.2 is available.
#0 1.067 You should consider upgrading via the 'pip install --upgrade pip' command.
------
failed to solve: executor failed running [/bin/sh -c pip install -r requirements.txt]: exit code: 1

scrapyd cant find spider

I guess this repo is dead, but I will try to get help anyway.

When I send a POST request to:
http://127.0.0.1:6800/schedule.json?project=scrapy_app&spider=toscrape-css

the response is:

{
    "status": "error",
    "message": "spider 'toscrape-css' not found"
}

I did some research and found people talking about "eggifying", but there is not enough information about that in this repo's documentation. How can I solve this issue?
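
One way to narrow this down is to ask Scrapyd which projects and spiders it actually knows about, using its standard listing endpoints; a quick sketch:

import requests

# If 'toscrape-css' does not appear here, the project was never loaded by
# Scrapyd (for example because scrapyd was not started from the directory
# containing scrapy.cfg, or the project was not deployed/eggified).
print(requests.get('http://127.0.0.1:6800/listprojects.json').json())
print(requests.get('http://127.0.0.1:6800/listspiders.json',
                   params={'project': 'default'}).json())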

curl: (7) Failed to connect to localhost port 6800: Connection refused

I am able to run Scrapyd at 127.0.0.1:6800; however, when I try to schedule the spider using

curl http://localhost:6800/schedule.json -d project=default -d spider=toscrape-css

I get this error instead: curl: (7) Failed to connect to localhost port 6800: Connection refused

I tried to look around but I still can't seem to figure this out, and I feel like this is something dumb or I might just be missing some steps. I am not familiar with either Python or Scrapy since this is the first time I have tried them, so all help is much appreciated.
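
A quick way to check whether anything is actually listening on port 6800 before worrying about the spider itself; a small sketch:

import requests

# If this raises a ConnectionError, nothing is listening on port 6800:
# scrapyd is not running, or it is bound to a different interface than
# the one being called.
try:
    print(requests.get('http://127.0.0.1:6800/listprojects.json', timeout=5).json())
except requests.exceptions.ConnectionError as exc:
    print('Scrapyd is not reachable:', exc)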

Accessing and viewing scraped data stored in database

I ran your code according to the instructions in the README. I can view the responses in the logs directory in scrapy_app, but when I open my sqlite3 prompt there are no tables or databases, despite an SQLite database being configured. How do you access and manipulate the scraped data? Currently I can't verify whether data has been added to my database or not.
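
One quick way to verify whether anything was written is the Django shell; a sketch assuming a Quote model (the actual app and model names may differ):

# Run inside: python manage.py shell
from my_django_app.models import Quote  # hypothetical app and model

print(Quote.objects.count())  # number of rows saved so far
print(Quote.objects.first())  # a sample row, if any exist

If the tables themselves are missing, it is also worth re-checking that python manage.py migrate ran against the same database the Django settings point to.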

Scraped data not getting stored into a database

I'm new to Django and Scrapy. I wrote my code using yours as an example, but for some reason the data I scrape isn't getting saved to the database. From what I understand, the pipeline is supposed to save everything into a Django model. I want the scraped data to be saved in the MySQL database that I use for my Django application, but for some reason it's not happening. Am I missing something?
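
A common cause of this is that the pipeline is never enabled: Scrapy only runs pipelines listed in ITEM_PIPELINES in the Scrapy settings. A minimal sketch (the class path is illustrative):

# Scrapy settings -- the pipeline must be enabled here, otherwise its
# process_item() is never called and nothing reaches the database.
ITEM_PIPELINES = {
    'scrapy_app.pipelines.DjangoSavePipeline': 300,  # hypothetical class path
}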

Error when deployed to heroku

Hi Adriancast,

I read your article on Medium. It was very helpful for me. I tried to use Django REST with Scrapyd to create an API that crawls automatically.
My main idea is to send a URL with a token (for authentication) to the API and get data back, as in your article. It runs well on localhost.
My repo is: https://github.com/fibonacci998/Django-intergrate-scrapy
But after I deployed it to Heroku, I always receive this error:

HTTPConnectionPool(host='localhost', port=6800): Max retries exceeded with url: /schedule.json (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe565b42860>: Failed to establish a new connection: [Errno 111] Connection refused',))

Would you mind helping me?
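
One likely cause: on Heroku the Django app and Scrapyd do not run in the same place, so localhost:6800 points at nothing. A hedged sketch is to make the Scrapyd base URL configurable rather than hardcoding localhost:

import os
import requests

# Read the Scrapyd base URL from the environment so it can point at a
# separately deployed Scrapyd instance instead of localhost.
SCRAPYD_URL = os.environ.get('SCRAPYD_URL', 'http://127.0.0.1:6800')  # hypothetical env var

def schedule(project, spider):
    response = requests.post(f'{SCRAPYD_URL}/schedule.json',
                             data={'project': project, 'spider': spider})
    return response.json()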

License?

Hi, would you be willing to add a LICENSE file to the repository so this can be included in other open-source projects?

The cURL request produces this error on Windows

Hi,

First of all, thanks for your share and work.

I keep getting an error message on the last curl step. Do you have any clue what could be wrong? I followed all the steps as described but can't start the spider. Any help is appreciated.

[screenshot of the curl error]

Just to be sure, both instances are running and I can access them on port 8000 and 6800.

Feedback from Scrapyd:
[screenshot of the Scrapyd response]

Edit: Well, let me correct this. The spider is running and quotes are being stored in the database. It is just not possible to view the 'jobs' page in Scrapyd:
[screenshot of the Scrapyd jobs page error]

Getting the following error when I run the curl command:

2020-05-17T09:59:13+0000 [_GenericHTTPChannelProtocol,11,127.0.0.1] Unhandled Error
Traceback (most recent call last):
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 2284, in allContentReceived
req.requestReceived(command, path, version)
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 946, in requestReceived
self.process()
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 235, in process
self.render(resrc)
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 302, in render
body = resrc.render(self)
--- ---
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/webservice.py", line 21, in render
return JsonResource.render(self, txrequest).encode('utf-8')
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/utils.py", line 21, in render
return self.render_object(r, txrequest)
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/utils.py", line 29, in render_object
txrequest.setHeader('Content-Length', len(r))
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 1314, in setHeader
self.responseHeaders.setRawHeaders(name, [value])
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in setRawHeaders
for v in self._encodeValues(values)]
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in
for v in self._encodeValues(values)]
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 40, in _sanitizeLinearWhitespace
return b' '.join(headerComponent.splitlines())
builtins.AttributeError: 'int' object has no attribute 'splitlines'

2020-05-17T09:59:13+0000 [twisted.web.server.Request#critical]
Traceback (most recent call last):
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 1755, in dataReceived
finishCallback(data[contentLength:])
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 2171, in _finishRequestBody
self.allContentReceived()
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 2284, in allContentReceived
req.requestReceived(command, path, version)
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 946, in requestReceived
self.process()
--- ---
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 235, in process
self.render(resrc)
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/server.py", line 302, in render
body = resrc.render(self)
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/webservice.py", line 27, in render
return self.render_object(r, txrequest).encode('utf-8')
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/scrapyd/utils.py", line 29, in render_object
txrequest.setHeader('Content-Length', len(r))
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http.py", line 1314, in setHeader
self.responseHeaders.setRawHeaders(name, [value])
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in setRawHeaders
for v in self._encodeValues(values)]
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in
for v in self._encodeValues(values)]
File "/home/ubuntu/test/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 40, in _sanitizeLinearWhitespace
return b' '.join(headerComponent.splitlines())

Originally posted by @hanspruim in #1 (comment)
