scrapinghub / portia Goto Github PK

Visual scraping for Scrapy

License: BSD 3-Clause "New" or "Revised" License

Shell 0.45% Python 40.03% Makefile 0.22% HTML 39.66% JavaScript 17.69% CSS 1.72% Batchfile 0.18% Dockerfile 0.05%

portia's Introduction

Portia

Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages.

Running Portia

The easiest way to run Portia is using Docker:

You can run Portia using Docker & official Portia-image by running:

docker run -v ~/portia_projects:/app/data/projects:rw -p 9001:9001 scrapinghub/portia

You can also set up a local instance with Docker-compose by cloning this repo & running from the root of the folder:

docker-compose up

For more detailed instructions, and alternatives to using Docker, see the Installation docs.

Documentation

Documentation can be found from Read the docs. Source files can be found in the docs directory.

portia's People

Contributors

Stargazers

Watchers

Forkers

adewinter baatar gregnwosu wavelets sumedh-k ebunt rgaidot amferraz bussiere eiriklv dwohlfahrt michaelevans gcmalloc juju2013 vhagerty tejbirwason chrisryancarter dragon-lex innovatifi sberkley swatkat7 iamjpg landtco taytus mhoad nilnilnil bmvakili zackliscio shobhit eugenepyvovarov jennojenno dmpavlidis shortjared antojoseph andresp99999 pnhoang olpe publicbull techiev2 mwmalinowski codercise carleeto ramanshrivastava tellezhector hihihippp abpin john-kuo bag-of-projects yanlinaung tanitall gmx libin dotpot nvdnkpr wakandan ramaseshan taoh williamfromtexas denismars rontendo-jp sawantuday yar0d nivertech gondo victorreyesh 4nakin poindexterc ismailmechbal yauzz fwaechter bakerl garftalk ultimate010 alex-zhh springbarley luomor bruce2008github jinpeng sanmisanfan easyfmxu provemyself dyxushuai vbs dahaha patrickxcam miloqq winneryong qcdcool imwbb leecade yufi113 inderjeet26 longhongjun bleachyin d4t0u libbum matic0209 lucker6666 sailor96 felipegtx

portia's Issues

How to use crawlera or any other proxy with this?

Hi,

First of all thanks for developing this visual scraping tool and making it open source.

We have been using scrapy crawlers with custom configuration for changing proxy and user-agent intermittently.

Creating project using portia automatically generates all required files for the crawler but I could not find some options to configure project or crawler specific settings like adding proxy details or specifying user-agents etc.

Could someone please guide me with this settings?

Thanks,
Makailol

problem in input arguments

portiacrawl command doesnt accept all input arguments.

like .... -a DEPTH_LIMIT=10 or (depth_limit) which we have to define in settings.py.

Portia browser does not render page

http://www.sears.com/stooshy-junior-s-lurex-blazer/p-002VA52502712P?prdNo=33

Please reopen bug #11

The visual editor still does not seem to perfom the login step before fetching a page. Using portiacrawl works.

[UI] Allow to generate generic form/start requests. Slybot already supports them.

date variables in starting urls/allowed/denied patterns

As an avid scrapy user I find really useful the possibility of telling scrapy to only go and look at pages of a newspaper that match today's date (or this month's). This is done for two reasons:

Newspaper don't like people to crawl their entire index
Indexing a whole newspaper website takes ages

I therefore propose to allow the user to insert a few template variables in the allow/deny/start urls. One way could be to use the python templating such as:

allow: http://www.repubblica.it/esteri/%(year)s/%(month)s/%(day)s/ (match only today's articles)
allow: http://www.repubblica.it/esteri/%(year)s/%(month)s/\d+/ (match only this month's articles)
allow: http://www.repubblica.it/esteri/%(year)s/\d+/\d+/ (match only this year's articles)

I propose to start with some basic variables like:

day, month, year

I'd be glad to patch this if the feature is approved.

ImportError: No module named jsonschema.exceptions

After correct installation, when I try to run

twistd -n slyd

I get

Traceback (most recent call last):
File "/usr/local/bin/twistd", line 14, in
run()
File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 27, in run
app.run(runApp, ServerOptions)
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 642, in run
runApp(config)
File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 23, in runApp
_SomeApplicationRunner(config).run()
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 376, in run
self.application = self.createOrGetApplication()
File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 436, in createOrGetApplication
ser = plg.makeService(self.config.subOptions)
File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 55, in makeService
root = create_root(config)
File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 27, in create_root
from slyd.crawlerspec import (CrawlerSpecManager,
File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/crawlerspec.py", line 12, in
from jsonschema.exceptions import ValidationError
ImportError: No module named jsonschema.exceptions

ImportError: No module named validation.schema

Dear everyone,
I'm a newbie for scrapy. I installed slyd and slybot.
When I run twistd -n slyd, it shows:
"from slybot.validation.schma import get_schema_validator
ImportError: No module named validation.schema."
I checked the directory and there are schema.py & schemas.json inside the /portia/slybot/slybot/validation.
I'm using ubuntu14.4 in a virtual box on iMac air.
Your help would be most appreciated.

Add a text box to filter by spider name

add a text box on top of the spider list to filter by name (usefull when you have a large number of spiders)

like in https://dash.scrapinghub.com/p/1131/jobs/ >Jump to > jump to spider

Support to index tables?

Hello.

I'm testing Portia and I found that there is none actual way to index a table.

By 'index a table', I think something like allowing a element with various parents (something like table > tr (which can vary) > td based on a determinated index, or something like it).

I think that, for some cases, this is useful, like for indexing a page with tables. However, I not tested Slybot to see if it support that feature.

Can anyone help?

Thanks

how to stop/resumt crawler

scrapy does allow to stop and resumt the crawler.

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

so to do this, we can add arguments to our portiacrawler by :
portaicrawl projectname crawlername -a -s JOBDIR=crawlername , which doesnt work.
I know that portiacrawl is abstracted from scrapy itself. how we can pass this argument?

feature request

Say, we want to gather a list of classifieds from a classifieds' search results. All we have to pick all the listings one by one. Is there a smart way of doing this? Like crawler could understand the similar items and pick items from the list, or we have to do it one by one by defining the number of tags?

Maybe crawler might understand the items that are same, and group them in an array.

running portia on macosx

diego@syrah:/tmp/portia/slyd (master)$ twistd -n slyd
/private/tmp/portia/slyd/slyd/bot.py:40: ScrapyDeprecationWarning: scrapy.spider.BaseSpider is deprecated, instantiate scrapy.spider.Spider instead.
  spider = BaseSpider('slyd')
Traceback (most recent call last):
  File "/usr/bin/twistd", line 14, in <module>
    run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
    app.run(runApp, ServerOptions)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
    runApp(config)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
    self.application = self.createOrGetApplication()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
    ser = plg.makeService(self.config.subOptions)
  File "/private/tmp/portia/slyd/slyd/tap.py", line 55, in makeService
    root = create_root(config)
  File "/private/tmp/portia/slyd/slyd/tap.py", line 46, in create_root
    projects.putChild("bot", create_bot_resource(spec_manager))
  File "/private/tmp/portia/slyd/slyd/bot.py", line 34, in create_bot_resource
    bot = Bot(spec_manager.settings, spec_manager)
  File "/private/tmp/portia/slyd/slyd/bot.py", line 48, in __init__
    crawler.configure()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 47, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 63, in __init__
    self.downloader = Downloader(crawler)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 73, in __init__
    self.handlers = DownloadHandlers(crawler)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 18, in __init__
    cls = load_object(clspath)
  File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 40, in load_object
    mod = import_module(module)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/s3.py", line 4, in <module>
    from .http import HTTPDownloadHandler
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/http.py", line 5, in <module>
    from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 15, in <module>
    from scrapy.xlib.tx import Agent, ProxyAgent, ResponseDone, \
  File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/__init__.py", line 6, in <module>
    from . import client, endpoints
  File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/client.py", line 37, in <module>
    from .endpoints import TCP4ClientEndpoint, SSL4ClientEndpoint
  File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/endpoints.py", line 222, in <module>
    interfaces.IProcessTransport, '_process')):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/zope/interface/declarations.py", line 495, in __call__
    raise TypeError("Can't use implementer with classes.  Use one of "
TypeError: Can't use implementer with classes.  Use one of the class-declaration functions instead.

ImportError: No module named validation.schema

I know that this issue was raised before but I am facing the same while working with portia on windows xp.
Kindly help!

bug in web ui of slyd

Hello once i open to see the crawler process
to see the info of my crawler
http://localhost:9001/projects/myspidername/

i get this error:

twisted.web.resource.NoResource: <twisted.web.resource.NoResource instance at 0x7f8eeb7be290>
/usr/lib/python2.7/dist-packages/twisted/web/server.py:184 in process
183 try:
184 resrc = self.site.getResourceFor(self)
185 if resource._IEncodingResource.providedBy(resrc):
/usr/lib/python2.7/dist-packages/twisted/web/server.py:701 in getResourceFor
700 request.sitepath = copy.copy(request.prepath)
701 return resource.getChildForRequest(self.resource, request)
702
/usr/lib/python2.7/dist-packages/twisted/web/resource.py:98 in getChildForRequest
97 request.prepath.append(pathElement)
98 resource = resource.getChildWithDefault(pathElement, request)
99 return resource
/crawler/portia/slyd/slyd/projects.py:39 in getChildWithDefault
38 if next_path_element not in self.children:
39 raise NoResource("No such child resource.")
40 request.prepath.append(project_path_element)
twisted.web.resource.NoResource: <twisted.web.resource.NoResource instance at 0x7f8eeb7be290>

Getting "floating" text outside of html elements

First off, awesome project!

Looking at this sample page, there is a block of data as shown below:
http://tlahuac.wired.com.mx/687770/grupo-escape.html

With the current UI, I haven't found a good way to extract the multiple pieces of data in front of the elements.

The best I have come up with is to select the "p" element and apply a regex on the annotation, but that will only allow you to retrieve one value (such as the street or the telephone number)

This pattern of putting "floating" text outside of an html element seems pretty common, is there a good way of extracting them?

<p>
  <span>Nombre de empresa:</span> Grupo Escape
  <br/><br/>
  <span>Tel:</span> 5860 1232 1233, 5845 6457 6457
  <br/><br/>
  <span><input class="DefBtn" type="submit" value="Contáctenos" onclick="location.href='/contact.php?cid=687770';"/></span>
  <br/><br/>
  <span>Street:</span> Eje 10 mz-32 Lote 3
  <br/><br/>
  <span>Colonia:</span> colonia Santa Catarina
  <br/><br/>
  <span>Código postal:</span> 13100
  <br/><br/>
  <span>Cuidad:</span> Tlahuac, Distrito Federal
  <br/><br/>            <span>Web:</span> <a href="http://www.grupoescape.com.mx">www.grupoescape.com.mx</a>
  <br/><br/>            </p>    <h2>Mapa</h2>
<p>

Installing on Mac OS X

Hi, I'm running into some trouble installing on mac os x

i copied portia to my computer using

git clone https://github.com/scrapinghub/portia/

and then went to the slyd directory

and started a virtual env using

virtualenv slyd

which told me:

New python executable in slyd/bin/python
Installing setuptools, pip...done.

i then tried to install using:
pip install -r requirements.txt

it did something for a while, but then I got a clang error:

cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/include/libxml2 -I/private/var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T/pip_build/lxml/src/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.9-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace

clang: error: unknown argument: '-mno-fused-madd' [-Wunused-command-line-argument-hard-error-in-future]

clang: note: this will be a hard error (cannot be downgraded to a warning) in the future

error: command 'cc' failed with exit status 1

Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;file='/private/var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T//lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T/pip-Yn4LXx-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /private/var/folders/l0/4wm5tl_57j5fv3c_d__gksk40000gn/T//lxml
Storing debug log for failure in /Users/Library/Logs/pip.log

help? i have no idea how to fix this, or what's broken.

unexpected internal error: u'default'

This is happening a lot. I cloned the repo yesterday. I'm not sure if this always happens on 200 ['cached'] responses, but here's the spew, FWIW, after I'd already loaded the page a couple times in Portia, and outside Portia in the same Chrome instance:

2014-05-18 22:26:37-0700 [scrapy] Crawled (200) <GET http://REDACTED.com/> (referer: None) ['cached']
2014-05-18 22:26:38-0700 [scrapy] Unhandled Error
    Traceback (most recent call last):
      File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
        self._startRunCallbacks(result)
      File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
        self._runCallbacks()
      File "/Users/gordon/.pythonbrew/venvs/Python-2.7.3/portia/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
    --- <exception caught here> ---
      File "/Users/gordon/portia_test/portia/slyd/slyd/bot.py", line 113, in fetch_callback
        spider = self.create_spider(request.project, params)
      File "/Users/gordon/portia_test/portia/slyd/slyd/bot.py", line 143, in create_spider
        **kwargs)
      File "/Users/gordon/portia_test/portia/slybot/slybot/spider.py", line 72, in __init__
        schema = item_schemas[itemclass_name]
    exceptions.KeyError: u'default'

OS X 10.7, latest Chrome. Clearing images and files in Chrome's 'Clear Browsing Data' does not appear to help.

Portiacrawl error + unexpected behaviour

I have a project called "moshtix" that has a spider called "gigs". These have been setup using the localhost URL via Twistd. The spider is set up to get the data I want and the template seems to function on any of the links I'd be aiming for.

When I run

portiacrawl moshtix gigs

from my C:\Users\<name>\portia\slyd\data\projects directory I get this error shown:

WindowsError: [Error 2] The system could not find the file specified

I've had a look through the code and played around with the __file__ variable in the Python27/Scripts directory and that changes the error to a path filled with double slashes. C:\\Users\\<name>\\portia\\slybot\\bin\\portiacrawl

I've changed the __file__ variable to what I think is the correct path to the portiacrawl file and when I run the command portiacrawl moshtix gigs I get the help menu displayed.

What should I be looking at changing or adding to get portiacrawl working?

setting up on windows

Hi guys, this is a really fascinating tool by description but i'm still struggling on getting it running on my windows8 x64
i have encountered problems like running twistd command in windows cmd, installing scrapy properly on x64 platform.
tackled some of them but still facing a lot other problems.

is it possible that the team can provide with a specific setting up guide for this tool on windows?

[slybot] Template consistency validations

Validate that all required fields are present in a template

Pip install failed

root@do1:~/portia/portia-master/slyd# pip install -r requirements.txt
Obtaining file:///root/portia/portia-master/slybot (from -r requirements.txt (line 7))
Running setup.py egg_info for package from file:///root/portia/portia-master/slybot
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'tests_requires'
warnings.warn(msg)

package init file 'slybot/tests/__init__.py' not found (or not a regular file)

Downloading/unpacking twisted (from -r requirements.txt (line 1))
Downloading Twisted-13.2.0.tar.bz2 (2.7MB): 2.7MB downloaded
Running setup.py egg_info for package twisted

Downloading/unpacking scrapy (from -r requirements.txt (line 2))
Downloading Scrapy-0.22.2.tar.gz (757kB): 757kB downloaded
Running setup.py egg_info for package scrapy

no previously-included directories found matching 'docs/build'

Downloading/unpacking loginform (from -r requirements.txt (line 3))
Downloading loginform-1.0.tar.gz
Running setup.py egg_info for package loginform

Requirement already satisfied (use --upgrade to upgrade): lxml in /usr/lib/python2.7/dist-packages (from -r requirements.txt (line 4))
Downloading/unpacking jsonschema (from -r requirements.txt (line 5))
Downloading jsonschema-2.3.0.tar.gz (43kB): 43kB downloaded
Running setup.py egg_info for package jsonschema

Obtaining scrapely from git+git://github.com/scrapy/scrapely.git#egg=scrapely (from -r requirements.txt (line 6))
Cloning git://github.com/scrapy/scrapely.git to ./src/scrapely
Running setup.py egg_info for package scrapely

Downloading/unpacking zope.interface>=3.6.0 (from twisted->-r requirements.txt (line 1))
Downloading zope.interface-4.1.1.tar.gz (864kB): 864kB downloaded
Running setup.py egg_info for package zope.interface

warning: no previously-included files matching '*.dll' found anywhere in distribution
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution

Downloading/unpacking w3lib>=1.2 (from scrapy->-r requirements.txt (line 2))
Downloading w3lib-1.5.tar.gz
Running setup.py egg_info for package w3lib

Downloading/unpacking queuelib (from scrapy->-r requirements.txt (line 2))
Downloading queuelib-1.1.1.tar.gz
Running setup.py egg_info for package queuelib

Downloading/unpacking pyOpenSSL (from scrapy->-r requirements.txt (line 2))
Downloading pyOpenSSL-0.14.tar.gz (128kB): 128kB downloaded
Running setup.py egg_info for package pyOpenSSL

warning: no previously-included files matching '*.pyc' found anywhere in distribution
no previously-included directories found matching 'doc/_build'

Downloading/unpacking cssselect>=0.9 (from scrapy->-r requirements.txt (line 2))
Downloading cssselect-0.9.1.tar.gz
Running setup.py egg_info for package cssselect

no previously-included directories found matching 'docs/_build'

Downloading/unpacking six>=1.5.2 (from scrapy->-r requirements.txt (line 2))
Downloading six-1.6.1.tar.gz
Running setup.py egg_info for package six

no previously-included directories found matching 'documentation/_build'

Downloading/unpacking numpy (from scrapely->-r requirements.txt (line 6))
Downloading numpy-1.8.1.tar.gz (3.8MB): 3.8MB downloaded
Running setup.py egg_info for package numpy
Running from numpy source directory.

warning: no files found matching 'tools/py3tool.py'
warning: no files found matching '*' under directory 'doc/f2py'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution

Requirement already satisfied (use --upgrade to upgrade): distribute in /usr/lib/python2.7/dist-packages (from zope.interface>=3.6.0->twisted->-r requirements.txt (line 1))
Downloading/unpacking cryptography>=0.2.1 (from pyOpenSSL->scrapy->-r requirements.txt (line 2))
Downloading cryptography-0.3.tar.gz (208kB): 208kB downloaded
Running setup.py egg_info for package cryptography
Traceback (most recent call last):
File "", line 16, in
File "/tmp/pip-build-root/cryptography/setup.py", line 156, in
"test": PyTest,
File "/usr/lib/python2.7/distutils/core.py", line 112, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 221, in init
self.fetch_build_eggs(attrs.pop('setup_requires'))
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 245, in fetch_build_eggs
parse_requirements(requires), installer=self.fetch_build_egg
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 598, in resolve
raise VersionConflict(dist,req) # XXX put more info here
pkg_resources.VersionConflict: (six 1.2.0 (/usr/lib/python2.7/dist-packages), Requirement.parse('six>=1.4.1'))
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "", line 16, in

File "/tmp/pip-build-root/cryptography/setup.py", line 156, in

"test": PyTest,

File "/usr/lib/python2.7/distutils/core.py", line 112, in setup

_setup_distribution = dist = klass(attrs)

File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 221, in init

self.fetch_build_eggs(attrs.pop('setup_requires'))

File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 245, in fetch_build_eggs

parse_requirements(requires), installer=self.fetch_build_egg

File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 598, in resolve

raise VersionConflict(dist,req) # XXX put more info here

pkg_resources.VersionConflict: (six 1.2.0 (/usr/lib/python2.7/dist-packages), Requirement.parse('six>=1.4.1'))

Command python setup.py egg_info failed with error code 1 in /tmp/pip-build-root/cryptography
Storing complete log in /root/.pip/pip.log
root@do1:~/portia/portia-master/slyd# python --version
Python 2.7.4

root@do1:~/portia/portia-master/slyd# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.04
Release: 13.04
Codename: raring

No module named scrapy

Hi, I've installed all dependencies using pip install -r requirements.txt in my virtualenv.
But for some reason he can't find scrapy.

Mac OSX 10.9.2.

(venv) slyd [master] pip show scrapy

---
Name: Scrapy
Version: 0.22.2
Location: /Users/geekymartian/git_repos/portia/venv/lib/python2.7/site-packages
Requires: Twisted, w3lib, queuelib, lxml, pyOpenSSL, cssselect, six
(venv) slyd [master]  twistd -n slyd 
Traceback (most recent call last):
  File "/usr/bin/twistd", line 14, in <module>
    run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
    app.run(runApp, ServerOptions)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
    runApp(config)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
    self.application = self.createOrGetApplication()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
    ser = plg.makeService(self.config.subOptions)
  File "/Users/geekymartian/git_repos/portia/slyd/slyd/tap.py", line 55, in makeService
    root = create_root(config)
  File "/Users/geekymartian/git_repos/portia/slyd/slyd/tap.py", line 25, in create_root
    from scrapy import log
ImportError: No module named scrapy

Required field in items and template

The fields already declared as required in the item definition should be at least pre marked as required in the required fields template section, with no possibility to edit. Another choice is just not to show them, but I think the first alternative is better.

[slybot] RSS link extractor does not handle correctly relative path

http://support.scrapinghub.com/topic/291452-error-report/

Doesn't fetch php pages with arguments

I use as a start page https://athens.indymedia.org/ .There are links that lead to articles in the form of https://athens.indymedia.org/front.php3?lang=el&article_id=1523018.
When I click a link from the start page it redirects me to https://athens.indymedia.org/front.php3. Portia is able to scrape the article page if I input the article url as a startpage.
After some debugging I can see the fetch request from javascript is removing the part after the " ? ".

[slybot] number extractor does not extract negative numbers correctly

Logging in to website

Is there is way to log in a website, or give some cookie to get the spider started ?

Let's say I want to scrape a website that require me to login before I can access the pages. How could i do this with Portia ?

Scraping array of items

I see no mailing list, so sorry about creating an GitHub issue regarding this. I can see this project becoming popular, so a mailing list is probably favourable.

Anyway, to my actual question. Is there a way to scrape multiple items and store them in an array, let's say I'm scraping a category on a eccomerce site, do I have to go down and click on every single image, every title etc and assign them to a field such asproduct1_image_url, product1_title .Is there a way I can just select all the images and create an array of them?

Right now if I assign all of them to the same field (http://imgur.com/PjAhMbA) then I just get all the text joined together when extracted (http://imgur.com/bgam0AK).

Maybe I'm missing something, or this is this the intended functionality?

Support annotation of multiple items per page

This first has to be issued at slybot side, then to UI. Idea is to avoid to use variant annotation/split for this

portiacrawl command not found

I am have followed the steps to install portia. I was able to succesfully create annations on a website. However, when I run the following command 'portiacrawl' I get the following error: portiacrawl command not found. I am did everything on ubuntu live 14.04. Thanks

big text input for start urls

Usually the user needs to enter a large list of start urls for a given spider. The current interface only allows to add them one by one, which is more impractical for bigger list of start url.

We need a way to allow to copy/paste a list of start urls, one per line. This does not mean to remove the one by one method, but just to add the alternative to open a big text box to copy them inside.

If you have tooltips, they don't work in Chrome

...and if you don't, you really, really need some tooltips. The random icons are undecipherable. Pin means pin, except when it means required. Gear means ... back? ...?

Support for JS

Add support for JS based sites. It would be nice to have UI support for configuring instead of having to do it manually at the Scrapy level.

Perhaps we can allow users to enable or disable sending requests via Splash.

Importing projects from AS

when importing projects from AS, passwords from spiders that require login to a page arent imported.

spiders arent getting sorted by name, and would be good to have a "jump to (spider name )" option

bug/feature request in opening the status of the crawler from web or somewhere?

hello,

i started a large crawl using the command, and
portiacrawl .. .... ... & > result.txt

however i see that the crawl is ended (i am not sure if it is terminated) , and i don't know if it is really stopped or not. How i can be sure that the crawl is ended or not?

is there a way of seeing the status of the bot from shell or web interface? I tried to see the scrapy's own web UI, however i dont see the status of the crawler there (6080 port)

thanks

Can't Run Full Crawl via REST API

HI,

Really enjoying Portia, and am excited about what it can do for our company, but I'm having the following problem: I defined a very simple spider to crawl all the links on a test site I made. When I run the project/spider via "portialcrawl", it executes as expected, crawling and extracting from all pages on the site. However, when I execute the exact same project/spider via the HTTP API (http://localhost:9001/projects/new_project_3/bot/fetch), I only get back the first page of results. Is there a way to fix this? I know in a production scenario it's unlikely that you'd want to run an entire crawl and block while waiting for the results, but it seems like if that's what a spider is configured to do, it should do that regardless of what mechanism is executing it. Any help you could provide would be greatly appreciated.

Cheers,
Landon

Investigations on alternatives to improve extractions with variable field items

History from other ticket system:

Tested "ADD" mode extraction, which consist on using the extraction from more than one template and merge, using as main one the first that validates, and then validate the following ones with the data extracted updated by the data extracted in the previous iteration.

Results are not satisfactory:

does not resolve the problem of field mismatch, very common in this kind of problem. Needs so the help of regex extractors
it is much more difficult to debug, as now there are multiple templates interacting, data extracted from different templates, extra required are not transparent, etc.

I am thinking now that the real problem is that the similarity algorithm is not the best suited for this kind of problem. A previous test using extractors for guiding similarity algorithm shows an interesting approach but difficult to maintain and not clear in unexpected results (in particular the incidence of false positives).

Instead, in a similar approach, i am thinking on switching among different matching algorithm according to extractor associated with the annotation. In particular, there should be a "keyword" extractor which instead of using the similarity algorithm, just scans a region and searches for the given keyword, also integrated in the score/prefix/suffix scheme in order to reduce the scan region and help next matching to do the same, regardless the method.

A typical test case for the issue:

http://panel.scrapinghub.com/p/380/jobs/spider/clevelandclinic.org/

Sample pages:

http://my.clevelandclinic.org/staff_directory/staff_display.aspx?DoctorID=16600
http://my.clevelandclinic.org/staff_directory/staff_display.aspx?doctorid=5387
http://my.clevelandclinic.org/staff_directory/staff_display.aspx?doctorid=9028

Preview window for extracted data not working

Try the following:

Load a previously written template
Click on "Continue Browsing" on the top navbar

When the page loads, it does not show the popover preview of the extracted data.

I believe this began happening after this commit, on checking out the previous commit it works:

2a235ca

benchmarks?

Is there a benchmark that could lead us to test the speed of the crawler. I see that i can only crawl 2 pages per second (which i have disabled delay) and the connection/ram/cpu is too fast in my server. Also the server I am scraping .

I am asking if the scrapely library makes the crawling slower than usual? I really wonder the speed tests of scapely (the learner crawler) than normal xpath using scrapy's itself.

ImportError: No module named scrapy

Even after installing scrapy, when I try to run slyd, it says "No module named scrapy"

$ pip install --upgrade scrapy
Requirement already up-to-date: scrapy in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages
Requirement already up-to-date: Twisted>=10.0.0 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: w3lib>=1.2 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: queuelib in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: lxml in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: pyOpenSSL in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: cssselect>=0.9 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: six>=1.5.2 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from scrapy)
Requirement already up-to-date: zope.interface>=3.6.0 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from Twisted>=10.0.0->scrapy)
Requirement already up-to-date: cryptography>=0.2.1 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from pyOpenSSL->scrapy)
Downloading/unpacking setuptools from https://pypi.python.org/packages/3.4/s/setuptools/setuptools-3.4.4-py2.py3-none-any.whl#md5=46284205a95cf3f9e132bbfe569e1b9d (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
  Downloading setuptools-3.4.4-py2.py3-none-any.whl (545kB): 545kB downloaded
Requirement already up-to-date: cffi>=0.8 in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from cryptography>=0.2.1->pyOpenSSL->scrapy)
Requirement already up-to-date: pycparser in /Users/nateaune/.virtualenvs/portia/lib/python2.7/site-packages (from cffi>=0.8->cryptography>=0.2.1->pyOpenSSL->scrapy)
Installing collected packages: setuptools
  Found existing installation: setuptools 2.2
    Uninstalling setuptools:
      Successfully uninstalled setuptools
Successfully installed setuptools
Cleaning up...

Now try starting up slyd...

$ twistd -n slyd              
Traceback (most recent call last):
  File "/usr/bin/twistd", line 14, in <module>
    run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
    app.run(runApp, ServerOptions)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
    runApp(config)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
    self.application = self.createOrGetApplication()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
    ser = plg.makeService(self.config.subOptions)
  File "/Users/nateaune/Dropbox/code/portia/slyd/slyd/tap.py", line 55, in makeService
    root = create_root(config)
  File "/Users/nateaune/Dropbox/code/portia/slyd/slyd/tap.py", line 25, in create_root
    from scrapy import log
ImportError: No module named scrapy

Allow to copy spiders between projects

URL allowed domain issue blocking onsite links

Hi. I'm trying to gather basic description and address information for some business pages on Yahoo finance. I was able to use the Portia interface to successfully pull metadata for pages such as http://biz.yahoo.com/ic/42/42034.html. However, when i go to the main page where links to all businesses in the same domain are listed http://biz.yahoo.com/ic/774_cl_all.html all of the business links are highlighted in red. I believe this is because they are listed under another domain http://us.rd.yahoo.com/finance/industry/front/industrynav/423/*http://biz.yahoo.com/ic/423.html . I've tried writing a regex for the allowed urls but I believe the fact that they are not on the .biz domain is preventing them from being scooped up by the link extractor. Any thoughts on a fix?

Slybot supports many kind of link extractors definitions. But still UI does not allow to edit/add them

Store successful JSON results only

Right now I can store the log file when running Portia. But is there a way to store the JSON for the successful results only?

Thanks

portia does not function correctly when website name contains '-' character

on http://localhost:9001/static/main.html#/projects enter site url which contains '-' (e.g. http://dat-data.com/)
observe that it doesn't try to retrieve website page

To circumvent this issue I used url shortener

command"twistd -n syld" can't open the site:http://localhost:9001/static/main.html

I did the "twistd -n syld" command, but when i open "http://localhost:9001/static/main.html" on my browser, but it tell me can't not connect to this.
so what's the matter?

Raise an DeprecationWarning and exited

/Users/lishuo/Developer/portia/portia/slyd/slyd/bot.py:40: ScrapyDeprecationWarning: scrapy.spider.BaseSpider is deprecated, instantiate scrapy.spider.Spider instead.
spider = BaseSpider('slyd')

twistd -n slyd error

what am i doing wrong?

➜  ~WORKON_HOME  mkvirtualenv portia
New python executable in portia/bin/python
Installing setuptools, pip...done.
➜  ~WORKON_HOME  workon portia
➜  ~WORKON_HOME  which pip
/Users/josefson/virtualenvs/portia/bin/pip
➜  ~WORKON_HOME  cd portia
➜  ~VIRTUAL_ENV  git clone https://github.com/scrapinghub/portia
Cloning into 'portia'...
remote: Counting objects: 3210, done.
remote: Compressing objects: 100% (848/848), done.
remote: Total 3210 (delta 2317), reused 3204 (delta 2312)
Receiving objects: 100% (3210/3210), 1.92 MiB | 339.00 KiB/s, done.
Resolving deltas: 100% (2317/2317), done.
Checking connectivity... done.
➜  ~VIRTUAL_ENV  pwd
/Users/josefson/virtualenvs/portia
➜  ~VIRTUAL_ENV  cd portia/slyd
➜  slyd git:(master) pip install -r requirements.txt
Successfully installed twisted scrapy loginform lxml jsonschema scrapely slybot zope.interface w3lib queuelib pyOpenSSL cssselect six numpy cryptography cffi pycparser
Cleaning up...
➜  slyd git:(master) pwd
/Users/josefson/virtualenvs/portia/portia/slyd
➜  slyd git:(master) twistd -n slyd
Traceback (most recent call last):
  File "/usr/bin/twistd", line 14, in <module>
    run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 27, in run
    app.run(runApp, ServerOptions)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 652, in run
    runApp(config)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 386, in run
    self.application = self.createOrGetApplication()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/application/app.py", line 446, in createOrGetApplication
    ser = plg.makeService(self.config.subOptions)
  File "/Users/josefson/virtualenvs/portia/portia/slyd/slyd/tap.py", line 55, in makeService
    root = create_root(config)
  File "/Users/josefson/virtualenvs/portia/portia/slyd/slyd/tap.py", line 25, in create_root
    from scrapy import log
ImportError: No module named scrapy
➜  slyd git:(master) cd slyd
➜  slyd git:(master) twistd -n slyd
Usage: twistd [options]
Options:
      --savestats      save the Stats object rather than the text output of the
                       profiler.
  -o, --no_save        do not save state on shutdown
  -e, --encrypted      The specified tap/aos file is encrypted.
  -n, --nodaemon       don't daemonize, don't use default umask of 0077
      --originalname   Don't try to change the process name
      --syslog         Log to syslog, not to file
      --euid           Set only effective user-id rather than real user-id.
                       (This option has no effect unless the server is running
                       as root, in which case it means not to shed all
                       privileges after binding ports, retaining the option to
                       regain privileges in cases such as spawning processes.
                       Use with caution.)
  -l, --logfile=       log to a specified file, - for stdout
      --logger=        A fully-qualified name to a log observer factory to use
                       for the initial log observer. Takes precedence over
                       --logfile and --syslog (when available).
  -p, --profile=       Run in profile mode, dumping results to specified file
      --profiler=      Name of the profiler to use (profile, cprofile, hotshot).
                       [default: hotshot]
  -f, --file=          read the given .tap file [default: twistd.tap]
  -y, --python=        read an application from within a Python file (implies
                       -o)
  -s, --source=        Read an application from a .tas file (AOT format).
  -d, --rundir=        Change to a supplied directory before running [default:
                       .]
      --prefix=        use the given prefix when syslogging [default: twisted]
      --pidfile=       Name of the pidfile [default: twistd.pid]
      --chroot=        Chroot to a supplied directory before running
  -u, --uid=           The uid to run as.
  -g, --gid=           The gid to run as.
      --umask=         The (octal) file creation mask to apply.
      --help-reactors  Display a list of possibly available reactor names.
      --version        Print version information and exit.
      --spew           Print an insanely verbose log of everything that happens.
                       Useful when debugging freezes or locks in complex code.
  -b, --debug          Run the application in the Python Debugger (implies
                       nodaemon), sending SIGUSR2 will drop into debugger
  -r, --reactor=       Which reactor to use (see --help-reactors for a list of
                       possibilities)
      --help           Display this help and exit.

twistd reads a twisted.application.service.Application out of a file and runs
it.
Commands:
    conch            A Conch SSH service.
    dns              A domain name server.
    ftp              An FTP server.
    inetd            An inetd(8) replacement.
    mail             An email service
    manhole          An interactive remote debugger service accessible via
                     telnet and ssh and providing syntax coloring and basic line
                     editing functionality.
    manhole-old      An interactive remote debugger service.
    news             A news server.
    portforward      A simple port-forwarder.
    procmon          A process watchdog / supervisor
    socks            A SOCKSv4 proxy service.
    telnet           A simple, telnet-based remote debugging service.
    web              A general-purpose web server which can serve from a
                     filesystem or application resource.
    words            A modern words server
    xmpp-router      An XMPP Router server

/usr/bin/twistd: Unknown command: slyd

[slybot] Improve link extraction annotations system

Handle link annotations in tag attributes. Currently, slybot is not handling as expected the link annotations in tag attributes, as spider assumes the extracted data are html regions, and not url.

Also, link extraction is not working efficiently in many cases, so lets increase the test case set with more complex cases