scrapinghub / portia
Visual scraping for Scrapy
License: BSD 3-Clause "New" or "Revised" License
The portiacrawl command doesn't accept all input arguments,
like .... -a DEPTH_LIMIT=10 (or depth_limit), which we instead have to define in settings.py.
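As a workaround, settings that portiacrawl won't take on the command line can be declared in the project's settings.py. A minimal sketch, assuming a standard Scrapy project layout (DEPTH_LIMIT is a real Scrapy setting; the value just mirrors the example above):

```python
# settings.py (Scrapy project settings module)

# Limit how many levels deep the crawl may follow links from the start URLs.
DEPTH_LIMIT = 10
```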
After correct installation, when I try to run
twistd -n slyd
I get
Traceback (most recent call last):
File "/usr/local/bin/twistd", line 14, in <module>
run()
File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 27, in run
app.run(runApp, ServerOptions)
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 642, in run
runApp(config)
File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 23, in runApp
_SomeApplicationRunner(config).run()
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 376, in run
self.application = self.createOrGetApplication()
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 436, in createOrGetApplication
ser = plg.makeService(self.config.subOptions)
File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 55, in makeService
root = create_root(config)
File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 27, in create_root
from slyd.crawlerspec import (CrawlerSpecManager,
File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/crawlerspec.py", line 12, in <module>
from jsonschema.exceptions import ValidationError
ImportError: No module named jsonschema.exceptions
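This ImportError usually means the installed jsonschema is too old to have an `exceptions` submodule (upgrading jsonschema is the likely fix). A small stdlib-only diagnostic for checking whether a submodule is importable, illustrated here with `json.decoder` rather than jsonschema so it runs anywhere:

```python
import importlib

def has_submodule(package, submodule):
    """Return True if `package.submodule` can be imported."""
    try:
        importlib.import_module(package + "." + submodule)
        return True
    except ImportError:
        return False

# json.decoder ships with the standard library, so this prints True;
# running has_submodule("jsonschema", "exceptions") would reproduce the check above.
print(has_submodule("json", "decoder"))
```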
I use https://athens.indymedia.org/ as a start page. There are links that lead to articles in the form https://athens.indymedia.org/front.php3?lang=el&article_id=1523018.
When I click a link from the start page it redirects me to https://athens.indymedia.org/front.php3. Portia is able to scrape the article page if I input the article URL as a start page.
After some debugging I can see that the fetch request from JavaScript is removing the part after the "?".
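For what it's worth, the query string is a distinct URL component that must be forwarded along with the path; a quick illustration with the article URL from the report (Python 3 `urllib.parse` here purely for demonstration):

```python
from urllib.parse import urlsplit

# The article URL from the report above:
url = "https://athens.indymedia.org/front.php3?lang=el&article_id=1523018"
parts = urlsplit(url)
print(parts.path)   # "/front.php3" -- what the buggy fetch ends up requesting
print(parts.query)  # "lang=el&article_id=1523018" -- the part being dropped
```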
Hello.
I'm testing Portia and I found that there is no actual way to index a table.
By 'index a table' I mean something like selecting an element with varying parents (something like table > tr (which can vary) > td)
based on a given index, or something like that.
I think this is useful in some cases, like indexing a page laid out with tables. However, I haven't tested Slybot to see if it supports that feature.
Can anyone help?
Thanks
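The idea above, picking the Nth cell regardless of how the parent rows vary, can be sketched with the standard library's html.parser (a toy collector, not how Portia/Slybot represent annotations):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> in document order, so a cell
    can be addressed by index even when row structure varies."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data

html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"
p = CellCollector()
p.feed(html)
print(p.cells[1])  # the second cell, "b", addressed purely by index
```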
The visual editor still does not seem to perform the login step before fetching a page. Using portiacrawl works.
Is there a way to log in to a website, or to supply a cookie, to get the spider started?
Let's say I want to scrape a website that requires me to log in before I can access the pages. How could I do this with Portia?
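Conceptually, a logged-in crawl is just a cookie-aware session: the login response sets a session cookie, and every later request replays it. A stdlib sketch of that mechanism (the URLs in the comments are placeholders, not a real site; no request is actually sent here):

```python
import urllib.request
import http.cookiejar

# A cookie-aware opener: cookies set by the login response are stored in
# `jar` and sent automatically on every later request through `opener`.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical flow (example.com is a placeholder):
#   data = urllib.parse.urlencode({"user": "me", "pass": "secret"}).encode()
#   opener.open("https://example.com/login", data)   # server sets session cookie
#   opener.open("https://example.com/protected")     # cookie replayed automatically
print(len(jar))  # 0 -- no requests have been made yet
```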
diego@syrah:/tmp/portia/slyd (master)$ twistd -n slyd
/private/tmp/portia/slyd/slyd/bot.py:40: ScrapyDeprecationWarning: scrapy.spider.BaseSpider is deprecated, instantiate scrapy.spider.Spider instead.
spider = BaseSpider('slyd')
Traceback (most recent call last):
File "/usr/bin/twistd", line 14, in <module>
run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
app.run(runApp, ServerOptions)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
runApp(config)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
_SomeApplicationRunner(config).run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
self.application = self.createOrGetApplication()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
ser = plg.makeService(self.config.subOptions)
File "/private/tmp/portia/slyd/slyd/tap.py", line 55, in makeService
root = create_root(config)
File "/private/tmp/portia/slyd/slyd/tap.py", line 46, in create_root
projects.putChild("bot", create_bot_resource(spec_manager))
File "/private/tmp/portia/slyd/slyd/bot.py", line 34, in create_bot_resource
bot = Bot(spec_manager.settings, spec_manager)
File "/private/tmp/portia/slyd/slyd/bot.py", line 48, in __init__
crawler.configure()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 47, in configure
self.engine = ExecutionEngine(self, self._spider_closed)
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 63, in __init__
self.downloader = Downloader(crawler)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 73, in __init__
self.handlers = DownloadHandlers(crawler)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 18, in __init__
cls = load_object(clspath)
File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 40, in load_object
mod = import_module(module)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/s3.py", line 4, in <module>
from .http import HTTPDownloadHandler
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/http.py", line 5, in <module>
from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 15, in <module>
from scrapy.xlib.tx import Agent, ProxyAgent, ResponseDone, \
File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/__init__.py", line 6, in <module>
from . import client, endpoints
File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/client.py", line 37, in <module>
from .endpoints import TCP4ClientEndpoint, SSL4ClientEndpoint
File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/endpoints.py", line 222, in <module>
interfaces.IProcessTransport, '_process')):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/zope/interface/declarations.py", line 495, in __call__
raise TypeError("Can't use implementer with classes. Use one of "
TypeError: Can't use implementer with classes. Use one of the class-declaration functions instead.
Add a text box on top of the spider list to filter by name (useful when you have a large number of spiders),
like in https://dash.scrapinghub.com/p/1131/jobs/ > Jump to > jump to spider.
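The requested filter is essentially a case-insensitive substring match over spider names; a tiny sketch (spider names are made up):

```python
def filter_spiders(spiders, query):
    """Case-insensitive substring match, as a name-filter box might do."""
    q = query.lower()
    return [s for s in spiders if q in s.lower()]

spiders = ["books", "news", "newspapers", "reviews", "shop"]
print(filter_spiders(spiders, "news"))  # ['news', 'newspapers']
```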
Hi,
First of all thanks for developing this visual scraping tool and making it open source.
We have been using scrapy crawlers with custom configuration for changing proxy and user-agent intermittently.
Creating a project with Portia automatically generates all the required files for the crawler, but I could not find any option to configure project- or crawler-specific settings, like adding proxy details or specifying user agents.
Could someone please guide me with these settings?
Thanks,
Makailol
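One plausible place for such overrides is the generated project's settings.py. A sketch using real Scrapy setting names (USER_AGENT, DOWNLOADER_MIDDLEWARES), but note the middleware module paths are hypothetical placeholders, not classes that ship with Portia:

```python
# settings.py -- hypothetical per-project overrides.
# The middleware paths below are placeholders you would implement yourself.
USER_AGENT = "Mozilla/5.0 (compatible; MyCrawler/1.0)"

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomProxyMiddleware": 750,
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}
```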
Hello, once I open the crawler process page
to see the info of my crawler,
http://localhost:9001/projects/myspidername/
I get this error:
twisted.web.resource.NoResource: <twisted.web.resource.NoResource instance at 0x7f8eeb7be290>
/usr/lib/python2.7/dist-packages/twisted/web/server.py:184 in process
183 try:
184 resrc = self.site.getResourceFor(self)
185 if resource._IEncodingResource.providedBy(resrc):
/usr/lib/python2.7/dist-packages/twisted/web/server.py:701 in getResourceFor
700 request.sitepath = copy.copy(request.prepath)
701 return resource.getChildForRequest(self.resource, request)
702
/usr/lib/python2.7/dist-packages/twisted/web/resource.py:98 in getChildForRequest
97 request.prepath.append(pathElement)
98 resource = resource.getChildWithDefault(pathElement, request)
99 return resource
/crawler/portia/slyd/slyd/projects.py:39 in getChildWithDefault
38 if next_path_element not in self.children:
39 raise NoResource("No such child resource.")
40 request.prepath.append(project_path_element)
twisted.web.resource.NoResource: <twisted.web.resource.NoResource instance at 0x7f8eeb7be290>
Right now I can store the log file when running Portia. But is there a way to store the JSON for the successful results only?
Thanks
Handle link annotations in tag attributes. Currently, Slybot is not handling link annotations in tag attributes as expected: the spider assumes the extracted data are HTML regions, not URLs.
Also, link extraction is not working efficiently in many cases, so let's expand the test case set with more complex cases.
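Treating an extracted attribute as a URL rather than an HTML region mainly means resolving it against the page URL; the stdlib `urljoin` shows the expected behaviour (example.com URLs are illustrative):

```python
from urllib.parse import urljoin

base = "http://example.com/catalog/page1.html"
print(urljoin(base, "../contact.html"))     # http://example.com/contact.html
print(urljoin(base, "item.html"))           # http://example.com/catalog/item.html
print(urljoin(base, "http://other.com/x"))  # absolute URLs pass through unchanged
```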
What am I doing wrong?
➜ ~WORKON_HOME mkvirtualenv portia
New python executable in portia/bin/python
Installing setuptools, pip...done.
➜ ~WORKON_HOME workon portia
➜ ~WORKON_HOME which pip
/Users/josefson/virtualenvs/portia/bin/pip
➜ ~WORKON_HOME cd portia
➜ ~VIRTUAL_ENV git clone https://github.com/scrapinghub/portia
Cloning into 'portia'...
remote: Counting objects: 3210, done.
remote: Compressing objects: 100% (848/848), done.
remote: Total 3210 (delta 2317), reused 3204 (delta 2312)
Receiving objects: 100% (3210/3210), 1.92 MiB | 339.00 KiB/s, done.
Resolving deltas: 100% (2317/2317), done.
Checking connectivity... done.
➜ ~VIRTUAL_ENV pwd
/Users/josefson/virtualenvs/portia
➜ ~VIRTUAL_ENV cd portia/slyd
➜ slyd git:(master) pip install -r requirements.txt
Successfully installed twisted scrapy loginform lxml jsonschema scrapely slybot zope.interface w3lib queuelib pyOpenSSL cssselect six numpy cryptography cffi pycparser
Cleaning up...
➜ slyd git:(master) pwd
/Users/josefson/virtualenvs/portia/portia/slyd
➜ slyd git:(master) twistd -n slyd
Traceback (most recent call last):
File "/usr/bin/twistd", line 14, in <module>
run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
app.run(runApp, ServerOptions)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
runApp(config)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
_SomeApplicationRunner(config).run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
self.application = self.createOrGetApplication()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
ser = plg.makeService(self.config.subOptions)
File "/Users/josefson/virtualenvs/portia/portia/slyd/slyd/tap.py", line 55, in makeService
root = create_root(config)
File "/Users/josefson/virtualenvs/portia/portia/slyd/slyd/tap.py", line 25, in create_root
from scrapy import log
ImportError: No module named scrapy
➜ slyd git:(master) cd slyd
➜ slyd git:(master) twistd -n slyd
Usage: twistd [options]
Options:
--savestats save the Stats object rather than the text output of the
profiler.
-o, --no_save do not save state on shutdown
-e, --encrypted The specified tap/aos file is encrypted.
-n, --nodaemon don't daemonize, don't use default umask of 0077
--originalname Don't try to change the process name
--syslog Log to syslog, not to file
--euid Set only effective user-id rather than real user-id.
(This option has no effect unless the server is running
as root, in which case it means not to shed all
privileges after binding ports, retaining the option to
regain privileges in cases such as spawning processes.
Use with caution.)
-l, --logfile= log to a specified file, - for stdout
--logger= A fully-qualified name to a log observer factory to use
for the initial log observer. Takes precedence over
--logfile and --syslog (when available).
-p, --profile= Run in profile mode, dumping results to specified file
--profiler= Name of the profiler to use (profile, cprofile, hotshot).
[default: hotshot]
-f, --file= read the given .tap file [default: twistd.tap]
-y, --python= read an application from within a Python file (implies
-o)
-s, --source= Read an application from a .tas file (AOT format).
-d, --rundir= Change to a supplied directory before running [default:
.]
--prefix= use the given prefix when syslogging [default: twisted]
--pidfile= Name of the pidfile [default: twistd.pid]
--chroot= Chroot to a supplied directory before running
-u, --uid= The uid to run as.
-g, --gid= The gid to run as.
--umask= The (octal) file creation mask to apply.
--help-reactors Display a list of possibly available reactor names.
--version Print version information and exit.
--spew Print an insanely verbose log of everything that happens.
Useful when debugging freezes or locks in complex code.
-b, --debug Run the application in the Python Debugger (implies
nodaemon), sending SIGUSR2 will drop into debugger
-r, --reactor= Which reactor to use (see --help-reactors for a list of
possibilities)
--help Display this help and exit.
twistd reads a twisted.application.service.Application out of a file and runs
it.
Commands:
conch A Conch SSH service.
dns A domain name server.
ftp An FTP server.
inetd An inetd(8) replacement.
mail An email service
manhole An interactive remote debugger service accessible via
telnet and ssh and providing syntax coloring and basic line
editing functionality.
manhole-old An interactive remote debugger service.
news A news server.
portforward A simple port-forwarder.
procmon A process watchdog / supervisor
socks A SOCKSv4 proxy service.
telnet A simple, telnet-based remote debugging service.
web A general-purpose web server which can serve from a
filesystem or application resource.
words A modern words server
xmpp-router An XMPP Router server
/usr/bin/twistd: Unknown command: slyd
Hi
Is there a benchmark that could help us test the speed of the crawler? I see that I can only crawl 2 pages per second (and I have disabled the delay), while the connection/RAM/CPU on my server is more than fast enough. Also the server I am scraping .
I am asking whether the scrapely library makes crawling slower than usual. I really wonder about speed tests of scrapely (the learning crawler) compared to normal XPath extraction using Scrapy itself.
Usually the user needs to enter a large list of start URLs for a given spider. The current interface only allows adding them one by one, which is impractical for bigger lists of start URLs.
We need a way to copy/paste a list of start URLs, one per line. This does not mean removing the one-by-one method, but just adding the alternative of opening a big text box to paste them into.
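Parsing such a text box is straightforward; a sketch of the "one URL per line" rule (blank lines skipped, duplicates dropped while keeping order; the URLs are placeholders):

```python
def parse_start_urls(raw):
    """One URL per line; skip blanks, de-duplicate while preserving order."""
    urls = []
    for line in raw.splitlines():
        line = line.strip()
        if line and line not in urls:
            urls.append(line)
    return urls

raw = """http://example.com/a
http://example.com/b

http://example.com/a
"""
print(parse_start_urls(raw))  # ['http://example.com/a', 'http://example.com/b']
```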
/Users/lishuo/Developer/portia/portia/slyd/slyd/bot.py:40: ScrapyDeprecationWarning: scrapy.spider.BaseSpider is deprecated, instantiate scrapy.spider.Spider instead.
spider = BaseSpider('slyd')
Hi,
Really enjoying Portia, and am excited about what it can do for our company, but I'm having the following problem: I defined a very simple spider to crawl all the links on a test site I made. When I run the project/spider via "portiacrawl", it executes as expected, crawling and extracting from all pages on the site. However, when I execute the exact same project/spider via the HTTP API (http://localhost:9001/projects/new_project_3/bot/fetch), I only get back the first page of results. Is there a way to fix this? I know in a production scenario it's unlikely that you'd want to run an entire crawl and block while waiting for the results, but it seems like if that's what a spider is configured to do, it should do that regardless of what mechanism is executing it. Any help you could provide would be greatly appreciated.
Cheers,
Landon
I see no mailing list, so sorry about creating a GitHub issue regarding this. I can see this project becoming popular, so a mailing list is probably favourable.
Anyway, to my actual question. Is there a way to scrape multiple items and store them in an array? Let's say I'm scraping a category on an ecommerce site, do I have to go down and click on every single image, every title, etc. and assign them to fields such as product1_image_url, product1_title? Is there a way I can just select all the images and create an array of them?
Right now if I assign all of them to the same field (http://imgur.com/PjAhMbA) I just get all the text joined together when extracted (http://imgur.com/bgam0AK).
Maybe I'm missing something, or is this the intended functionality?
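The desired output, one array entry per matched element rather than one joined string, can be illustrated with a stdlib parser collecting every image URL on a page (toy markup, not Portia's actual extraction model):

```python
from html.parser import HTMLParser

class ImgSrcs(HTMLParser):
    """Collect the src attribute of every <img> tag as a list."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.append(dict(attrs).get("src"))

p = ImgSrcs()
p.feed('<div><img src="a.jpg"><img src="b.jpg"></div>')
print(p.srcs)  # ['a.jpg', 'b.jpg'] -- an array, not one concatenated string
```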
First off, awesome project!
Looking at this sample page, there is a block of data as shown below:
http://tlahuac.wired.com.mx/687770/grupo-escape.html
With the current UI, I haven't found a good way to extract the multiple pieces of data that sit next to the elements.
The best I have come up with is to select the "p" element and apply a regex to the annotation, but that will only allow you to retrieve one value (such as the street or the telephone number).
This pattern of putting "floating" text outside of an HTML element seems pretty common; is there a good way of extracting it?
<p>
<span>Nombre de empresa:</span> Grupo Escape
<br/><br/>
<span>Tel:</span> 5860 1232 1233, 5845 6457 6457
<br/><br/>
<span><input class="DefBtn" type="submit" value="Contáctenos" onclick="location.href='/contact.php?cid=687770';"/></span>
<br/><br/>
<span>Street:</span> Eje 10 mz-32 Lote 3
<br/><br/>
<span>Colonia:</span> colonia Santa Catarina
<br/><br/>
<span>Código postal:</span> 13100
<br/><br/>
<span>Cuidad:</span> Tlahuac, Distrito Federal
<br/><br/> <span>Web:</span> <a href="http://www.grupoescape.com.mx">www.grupoescape.com.mx</a>
<br/><br/> </p> <h2>Mapa</h2>
<p>
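One way to handle this "floating text" pattern is to pair each `<span>Label:</span>` with the free text node that follows it. A stdlib sketch run over a shortened version of the sample markup above (not how Slybot's annotations work, just an illustration of the pairing):

```python
from html.parser import HTMLParser

class LabelValues(HTMLParser):
    """Pair each <span>Label:</span> with the free text that follows it."""
    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._label = None
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._in_span = True
            self._label = ""

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span:
            self._label += data           # accumulate the label text
        elif self._label:
            text = data.strip()
            if text:                      # first non-empty text after the span
                self.pairs[self._label.rstrip(":").strip()] = text
                self._label = None

p = LabelValues()
p.feed("<p><span>Tel:</span> 5860 1232 <br/><br/> <span>Street:</span> Eje 10</p>")
print(p.pairs)  # {'Tel': '5860 1232', 'Street': 'Eje 10'}
```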
History from other ticket system:
Tested "ADD" mode extraction, which consists of using the extraction from more than one template and merging: the first template that validates is used as the main one, and the following ones are validated with the extracted data updated by the data extracted in the previous iteration.
Results are not satisfactory.
I am thinking now that the real problem is that the similarity algorithm is not the best suited for this kind of problem. A previous test using extractors to guide the similarity algorithm showed an interesting approach, but one difficult to maintain and unclear in its unexpected results (in particular the incidence of false positives).
Instead, in a similar approach, I am thinking of switching among different matching algorithms according to the extractor associated with the annotation. In particular, there should be a "keyword" extractor which, instead of using the similarity algorithm, just scans a region and searches for the given keyword, also integrated into the score/prefix/suffix scheme in order to reduce the scan region and help the next match do the same, regardless of the method.
A typical test case for the issue:
http://panel.scrapinghub.com/p/380/jobs/spider/clevelandclinic.org/
Sample pages:
http://my.clevelandclinic.org/staff_directory/staff_display.aspx?DoctorID=16600
http://my.clevelandclinic.org/staff_directory/staff_display.aspx?doctorid=5387
http://my.clevelandclinic.org/staff_directory/staff_display.aspx?doctorid=9028
Try the following:
When the page loads, it does not show the popover preview of the extracted data.
I believe this began happening after this commit, on checking out the previous commit it works:
I ran the "twistd -n syld" command, but when I open "http://localhost:9001/static/main.html" in my browser, it tells me it can't connect.
So what's the matter?
root@do1:~/portia/portia-master/slyd# pip install -r requirements.txt
Obtaining file:///root/portia/portia-master/slybot (from -r requirements.txt (line 7))
Running setup.py egg_info for package from file:///root/portia/portia-master/slybot
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'tests_requires'
warnings.warn(msg)
package init file 'slybot/tests/__init__.py' not found (or not a regular file)
Downloading/unpacking twisted (from -r requirements.txt (line 1))
Downloading Twisted-13.2.0.tar.bz2 (2.7MB): 2.7MB downloaded
Running setup.py egg_info for package twisted
Downloading/unpacking scrapy (from -r requirements.txt (line 2))
Downloading Scrapy-0.22.2.tar.gz (757kB): 757kB downloaded
Running setup.py egg_info for package scrapy
no previously-included directories found matching 'docs/build'
Downloading/unpacking loginform (from -r requirements.txt (line 3))
Downloading loginform-1.0.tar.gz
Running setup.py egg_info for package loginform
Requirement already satisfied (use --upgrade to upgrade): lxml in /usr/lib/python2.7/dist-packages (from -r requirements.txt (line 4))
Downloading/unpacking jsonschema (from -r requirements.txt (line 5))
Downloading jsonschema-2.3.0.tar.gz (43kB): 43kB downloaded
Running setup.py egg_info for package jsonschema
Obtaining scrapely from git+git://github.com/scrapy/scrapely.git#egg=scrapely (from -r requirements.txt (line 6))
Cloning git://github.com/scrapy/scrapely.git to ./src/scrapely
Running setup.py egg_info for package scrapely
Downloading/unpacking zope.interface>=3.6.0 (from twisted->-r requirements.txt (line 1))
Downloading zope.interface-4.1.1.tar.gz (864kB): 864kB downloaded
Running setup.py egg_info for package zope.interface
warning: no previously-included files matching '*.dll' found anywhere in distribution
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
Downloading/unpacking w3lib>=1.2 (from scrapy->-r requirements.txt (line 2))
Downloading w3lib-1.5.tar.gz
Running setup.py egg_info for package w3lib
Downloading/unpacking queuelib (from scrapy->-r requirements.txt (line 2))
Downloading queuelib-1.1.1.tar.gz
Running setup.py egg_info for package queuelib
Downloading/unpacking pyOpenSSL (from scrapy->-r requirements.txt (line 2))
Downloading pyOpenSSL-0.14.tar.gz (128kB): 128kB downloaded
Running setup.py egg_info for package pyOpenSSL
warning: no previously-included files matching '*.pyc' found anywhere in distribution
no previously-included directories found matching 'doc/_build'
Downloading/unpacking cssselect>=0.9 (from scrapy->-r requirements.txt (line 2))
Downloading cssselect-0.9.1.tar.gz
Running setup.py egg_info for package cssselect
no previously-included directories found matching 'docs/_build'
Downloading/unpacking six>=1.5.2 (from scrapy->-r requirements.txt (line 2))
Downloading six-1.6.1.tar.gz
Running setup.py egg_info for package six
no previously-included directories found matching 'documentation/_build'
Downloading/unpacking numpy (from scrapely->-r requirements.txt (line 6))
Downloading numpy-1.8.1.tar.gz (3.8MB): 3.8MB downloaded
Running setup.py egg_info for package numpy
Running from numpy source directory.
warning: no files found matching 'tools/py3tool.py'
warning: no files found matching '*' under directory 'doc/f2py'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): distribute in /usr/lib/python2.7/dist-packages (from zope.interface>=3.6.0->twisted->-r requirements.txt (line 1))
Downloading/unpacking cryptography>=0.2.1 (from pyOpenSSL->scrapy->-r requirements.txt (line 2))
Downloading cryptography-0.3.tar.gz (208kB): 208kB downloaded
Running setup.py egg_info for package cryptography
Traceback (most recent call last):
File "<string>", line 16, in <module>
File "/tmp/pip-build-root/cryptography/setup.py", line 156, in <module>
"test": PyTest,
File "/usr/lib/python2.7/distutils/core.py", line 112, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 221, in __init__
self.fetch_build_eggs(attrs.pop('setup_requires'))
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 245, in fetch_build_eggs
parse_requirements(requires), installer=self.fetch_build_egg
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 598, in resolve
raise VersionConflict(dist,req) # XXX put more info here
pkg_resources.VersionConflict: (six 1.2.0 (/usr/lib/python2.7/dist-packages), Requirement.parse('six>=1.4.1'))
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 16, in <module>
File "/tmp/pip-build-root/cryptography/setup.py", line 156, in <module>
"test": PyTest,
File "/usr/lib/python2.7/distutils/core.py", line 112, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 221, in __init__
self.fetch_build_eggs(attrs.pop('setup_requires'))
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 245, in fetch_build_eggs
parse_requirements(requires), installer=self.fetch_build_egg
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 598, in resolve
raise VersionConflict(dist,req) # XXX put more info here
pkg_resources.VersionConflict: (six 1.2.0 (/usr/lib/python2.7/dist-packages), Requirement.parse('six>=1.4.1'))
Command python setup.py egg_info failed with error code 1 in /tmp/pip-build-root/cryptography
Storing complete log in /root/.pip/pip.log
root@do1:~/portia/portia-master/slyd# python --version
Python 2.7.4
root@do1:~/portia/portia-master/slyd# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.04
Release: 13.04
Codename: raring
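The root cause in the traceback is the system-wide six 1.2.0 failing cryptography's six>=1.4.1 requirement (upgrading six, ideally inside a virtualenv, is the likely fix). Conceptually the check pkg_resources performs is just a version comparison; a rough sketch, not pkg_resources' actual parser:

```python
def version_tuple(v):
    """Very rough version parse; enough for plain dotted versions like 1.2.0."""
    return tuple(int(part) for part in v.split("."))

installed, required = "1.2.0", "1.4.1"  # the six versions from the traceback
satisfies = version_tuple(installed) >= version_tuple(required)
print(satisfies)  # False -> the VersionConflict above
```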
When importing projects from AS, passwords from spiders that require login to a page aren't imported.
Spiders aren't getting sorted by name, and it would be good to have a "jump to (spider name)" option.
To circumvent this issue I used a URL shortener.
Hi, I'm running into some trouble installing on mac os x
I copied Portia to my computer using
git clone https://github.com/scrapinghub/portia/
and then went to the slyd directory
and started a virtual env using
virtualenv slyd
which told me:
New python executable in slyd/bin/python
Installing setuptools, pip...done.
I then tried to install using:
pip install -r requirements.txt
it did something for a while, but then I got a clang error:
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T/pip_build/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.9-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
clang: error: unknown argument: '-mno-fused-madd' [-Wunused-command-line-argument-hard-error-in-future]
clang: note: this will be a hard error (cannot be downgraded to a warning) in the future
error: command 'cc' failed with exit status 1
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;file='/private/var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T//lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T/pip-Yn4LXx-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /private/var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T//lxml
Storing debug log for failure in /Users/Library/Logs/pip.log
Help? I have no idea how to fix this, or what's broken.
I know that this issue was raised before but I am facing the same while working with portia on windows xp.
Kindly help!
Add support for JS based sites. It would be nice to have UI support for configuring instead of having to do it manually at the Scrapy level.
Perhaps we can allow users to enable or disable sending requests via Splash.
Scrapy does allow stopping and resuming the crawler:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
So to do this, we can add arguments to our portiacrawl like:
portiacrawl projectname crawlername -a -s JOBDIR=crawlername, which doesn't work.
I know that portiacrawl is abstracted from Scrapy itself. How can we pass this argument?
Dear everyone,
I'm a newbie for scrapy. I installed slyd and slybot.
When I run twistd -n slyd, it shows:
"from slybot.validation.schma import get_schema_validator
ImportError: No module named validation.schema."
I checked the directory and there are schema.py & schemas.json inside the /portia/slybot/slybot/validation.
I'm using Ubuntu 14.04 in a VirtualBox on an iMac air.
Your help would be most appreciated.
The fields already declared as required in the item definition should at least be pre-marked as required in the required-fields template section, with no possibility to edit. Another choice is simply not to show them, but I think the first alternative is better.
Say we want to gather a list of classifieds from a classifieds site's search results. Do we have to pick all the listings one by one, or is there a smart way of doing this? For instance, the crawler could recognise similar items and pick them from the list, rather than us defining them one by one by the number of tags.
Maybe the crawler could understand which items are the same, and group them in an array.
This first has to be addressed on the Slybot side, then in the UI. The idea is to avoid using variant annotation/split for this.
Hello,
I started a large crawl using the command
portiacrawl .. .... ... & > result.txt
However, I see that the crawl has ended (I am not sure if it was terminated), and I don't know if it has really stopped or not. How can I be sure whether the crawl has ended?
Is there a way of seeing the status of the bot from the shell or a web interface? I tried Scrapy's own web UI, but I don't see the status of the crawler there (port 6080).
Thanks
As an avid Scrapy user, I find the ability to tell Scrapy to only look at a newspaper's pages that match today's date (or this month's) really useful. This is done for two reasons:
I therefore propose to allow the user to insert a few template variables in the allow/deny/start URLs. One way could be to use Python templating such as:
allow: http://www.repubblica.it/esteri/%(year)s/%(month)s/%(day)s/ (match only today's articles)
allow: http://www.repubblica.it/esteri/%(year)s/%(month)s/\d+/ (match only this month's articles)
allow: http://www.repubblica.it/esteri/%(year)s/\d+/\d+/ (match only this year's articles)
I propose to start with some basic variables like:
I'd be glad to patch this if the feature is approved.
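The proposed expansion can be sketched in a few lines; `%(year)s`-style placeholders are filled from the current date (a fixed date is used below so the output is reproducible):

```python
import datetime

# Hypothetical template expansion for the proposed variables.
pattern = "http://www.repubblica.it/esteri/%(year)s/%(month)s/%(day)s/"
today = datetime.date(2014, 5, 7)  # fixed date for illustration
allow = pattern % {
    "year": today.strftime("%Y"),
    "month": today.strftime("%m"),
    "day": today.strftime("%d"),
}
print(allow)  # http://www.repubblica.it/esteri/2014/05/07/
```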
I have a project called "moshtix" that has a spider called "gigs". These have been set up using the localhost URL via twistd. The spider is set up to get the data I want, and the template seems to function on any of the links I'd be aiming for.
When I run
portiacrawl moshtix gigs
from my C:\Users\<name>\portia\slyd\data\projects
directory I get this error shown:
WindowsError: [Error 2] The system could not find the file specified
I've had a look through the code and played around with the __file__
variable in the Python27/Scripts directory and that changes the error to a path filled with double slashes. C:\\Users\\<name>\\portia\\slybot\\bin\\portiacrawl
I've changed the __file__
variable to what I think is the correct path to the portiacrawl file and when I run the command portiacrawl moshtix gigs
I get the help menu displayed.
What should I be looking at changing or adding to get portiacrawl working?
...and if you don't, you really, really need some tooltips. The random icons are undecipherable. Pin means pin, except when it means required. Gear means ... back? ...?
Hi guys, this is a really fascinating tool by its description, but I'm still struggling to get it running on my Windows 8 x64.
I have encountered problems like running the twistd command in the Windows cmd and installing Scrapy properly on the x64 platform.
I've tackled some of them but am still facing a lot of other problems.
Is it possible for the team to provide a specific setup guide for this tool on Windows?
Even after installing scrapy, when I try to run slyd, it says "No module named scrapy"
$ pip install --upgrade scrapy
Requirement already up-to-date: scrapy in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages
Requirement already up-to-date: Twisted>=10.0.0 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: w3lib>=1.2 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: queuelib in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: lxml in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: pyOpenSSL in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: cssselect>=0.9 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: six>=1.5.2 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: zope.interface>=3.6.0 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from Twisted>=10.0.0->scrapy)
Requirement already up-to-date: cryptography>=0.2.1 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from pyOpenSSL->scrapy)
Downloading/unpacking setuptools from https://pypi.python.org/packages/3.4/s/setuptools/setuptools-3.4.4-py2.py3-none-any.whl#md5=46284205a95cf3f9e132bbfe569e1b9d (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Downloading setuptools-3.4.4-py2.py3-none-any.whl (545kB): 545kB downloaded
Requirement already up-to-date: cffi>=0.8 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from cryptography>=0.2.1->pyOpenSSL->scrapy)
Requirement already up-to-date: pycparser in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from cffi>=0.8->cryptography>=0.2.1->pyOpenSSL->scrapy)
Installing collected packages: setuptools
Found existing installation: setuptools 2.2
Uninstalling setuptools:
Successfully uninstalled setuptools
Successfully installed setuptools
Cleaning up...
Now try starting up slyd...
$ twistd -n slyd
Traceback (most recent call last):
File "/usr/bin/twistd", line 14, in <module>
run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
app.run(runApp, ServerOptions)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
runApp(config)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
_SomeApplicationRunner(config).run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
self.application = self.createOrGetApplication()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
ser = plg.makeService(self.config.subOptions)
File "/Users/nateaune/Dropbox/code/portia/slyd/slyd/tap.py", line 55, in makeService
root = create_root(config)
File "/Users/nateaune/Dropbox/code/portia/slyd/slyd/tap.py", line 25, in create_root
from scrapy import log
ImportError: No module named scrapy
Hi, I've installed all the dependencies using pip install -r requirements.txt
in my virtualenv.
But for some reason it can't find scrapy.
Mac OSX 10.9.2.
(venv) slyd [master] pip show scrapy
---
Name: Scrapy
Version: 0.22.2
Location: /Users/geekymartian/git_repos/portia/venv/lib/python2.7/site-packages
Requires: Twisted, w3lib, queuelib, lxml, pyOpenSSL, cssselect, six
(venv) slyd [master] twistd -n slyd
Traceback (most recent call last):
File "/usr/bin/twistd", line 14, in <module>
run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
app.run(runApp, ServerOptions)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
runApp(config)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
_SomeApplicationRunner(config).run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
self.application = self.createOrGetApplication()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
ser = plg.makeService(self.config.subOptions)
File "/Users/geekymartian/git_repos/portia/slyd/slyd/tap.py", line 55, in makeService
root = create_root(config)
File "/Users/geekymartian/git_repos/portia/slyd/slyd/tap.py", line 25, in create_root
from scrapy import log
ImportError: No module named scrapy
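Both tracebacks above show twistd resolving to /usr/bin/twistd and running under the OS X system Python (/System/Library/Frameworks/...), not the virtualenv's Python, so the scrapy installed inside the virtualenv is invisible to it. A quick diagnostic, assuming a POSIX shell with the virtualenv activated:

```shell
# If twistd resolves outside the virtualenv (e.g. /usr/bin/twistd),
# it was picked up from the system Python, which cannot import the
# virtualenv's packages.
command -v twistd || echo "twistd not on PATH"
command -v python3
python3 -c "import scrapy" 2>/dev/null || echo "scrapy not importable by this python"

# Hypothetical fix (assumes pip is the virtualenv's pip):
#   pip install Twisted   # provides a twistd inside the virtualenv
#   hash -r               # make the shell re-scan PATH
#   command -v twistd     # should now point inside the virtualenv
```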
I have followed the steps to install Portia and was able to successfully create annotations on a website. However, when I run the portiacrawl command I get the following error: portiacrawl command not found. I did everything on Ubuntu live 14.04. Thanks
Hi. I'm trying to gather basic description and address information for some business pages on Yahoo Finance. I was able to use the Portia interface to successfully pull metadata for pages such as http://biz.yahoo.com/ic/42/42034.html. However, when I go to the main page where links to all businesses in the same domain are listed, http://biz.yahoo.com/ic/774_cl_all.html, all of the business links are highlighted in red. I believe this is because they are listed under another domain: http://us.rd.yahoo.com/finance/industry/front/industrynav/423/*http://biz.yahoo.com/ic/423.html . I've tried writing a regex for the allowed URLs, but I believe the fact that they are not on the biz.yahoo.com domain is preventing them from being scooped up by the link extractor. Any thoughts on a fix?
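One possible workaround, assuming the link extractor accepts a plain regex as its allow pattern: match the us.rd.yahoo.com wrapper URL itself rather than only the biz.yahoo.com target it redirects to. A sketch in plain Python (the is_allowed helper is illustrative, not Portia code):

```python
import re

# Hypothetical "allow" pattern: Yahoo wraps each business link in a
# redirect on us.rd.yahoo.com whose path ends with a literal '*'
# followed by the real biz.yahoo.com URL, so the pattern must match
# the wrapper, not just biz.yahoo.com pages.
ALLOW = re.compile(
    r"https?://us\.rd\.yahoo\.com/finance/.*"
    r"\*https?://biz\.yahoo\.com/ic/\d+\.html"
)

def is_allowed(url):
    return bool(ALLOW.match(url))
```

With this pattern, the wrapped redirect URLs from the listing page match, while a bare biz.yahoo.com article URL does not (the start page would still cover those).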
This is happening a lot. I cloned the repo yesterday. I'm not sure if this always happens on 200 ['cached'] responses, but here's the spew, FWIW, after I'd already loaded the page a couple times in Portia, and outside Portia in the same Chrome instance:
2014-05-18 22:26:37-0700 [scrapy] Crawled (200) <GET http://REDACTED.com/> (referer: None) ['cached']
2014-05-18 22:26:38-0700 [scrapy] Unhandled Error
Traceback (most recent call last):
File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
self._startRunCallbacks(result)
File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
--- <exception caught here> ---
File "/Users/gordon/portia_test/portia/slyd/slyd/bot.py", line 113, in fetch_callback
spider = self.create_spider(request.project, params)
File "/Users/gordon/portia_test/portia/slyd/slyd/bot.py", line 143, in create_spider
**kwargs)
File "/Users/gordon/portia_test/portia/slybot/slybot/spider.py", line 72, in __init__
schema = item_schemas[itemclass_name]
exceptions.KeyError: u'default'
OS X 10.7, latest Chrome. Clearing images and files in Chrome's 'Clear Browsing Data' does not appear to help.
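The KeyError: u'default' is raised where spider.py looks up the item class named by the template in the project's item schemas, so a plausible guess is that the project's items.json no longer defines a "default" item. A minimal sketch of the shape slybot appears to expect (the field name here is illustrative):

```json
{
    "default": {
        "fields": {
            "title": {"type": "text", "required": true, "vary": false}
        }
    }
}
```

If the "default" key is missing or was renamed in the project spec, every template that still references it would trigger exactly this lookup failure.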