
scrapely's Introduction

Scrapely

Build status badge: https://api.travis-ci.org/scrapy/scrapely.svg?branch=master

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

Overview

Scrapinghub wrote a nice blog post explaining how scrapely works and how it's used in Portia.

Installation

Scrapely works in Python 2.7 or 3.3+. It requires the numpy and w3lib Python packages.

To install scrapely on any platform use:

pip install scrapely

If you're using Ubuntu (9.10 or above), you can install scrapely from the Scrapy Ubuntu repos. Just add the Ubuntu repos as described here: http://doc.scrapy.org/en/latest/topics/ubuntu.html

And then install scrapely with:

aptitude install python-scrapely

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows is a quick example of the simplest possible usage, which you can run in a Python shell.

Start by importing and instantiating the Scraper class:

>>> from scrapely import Scraper
>>> s = Scraper()

Then, proceed to train the scraper by adding an example page and the data you expect to scrape from it (note that all keys and values in the data you pass must be strings):

>>> url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
>>> data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
>>> s.train(url1, data)

Finally, tell the scraper to scrape any other similar page and it will return the results:

>>> url2 = 'http://pypi.python.org/pypi/Django/1.3'
>>> s.scrape(url2)
[{u'author': [u'Django Software Foundation <foundation at djangoproject com>'],
  u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],
  u'name': [u'Django 1.3']}]

That's it! No xpaths, regular expressions, or hacky python code.

Usage (command line tool)

There is also a simple script to create and manage Scrapely scrapers.

It supports a command-line interface and an interactive prompt. All commands supported in the interactive prompt are also supported in the command-line interface.

To enter the interactive prompt, run the tool without any command:

python -m scrapely.tool myscraper.json

Example:

$ python -m scrapely.tool myscraper.json
scrapely> help

Documented commands (type help <topic>):
========================================
a  al  s  ta  td  tl

scrapely>

To create a scraper and add a template:

scrapely> ta http://pypi.python.org/pypi/w3lib/1.1
[0] http://pypi.python.org/pypi/w3lib/1.1

This is equivalent to typing the following as a single command:

python -m scrapely.tool myscraper.json ta http://pypi.python.org/pypi/w3lib/1.1

To list available templates from a scraper:

scrapely> tl
[0] http://pypi.python.org/pypi/w3lib/1.1

To add a new annotation, you usually test the selection criteria first:

scrapely> t 0 w3lib 1.1
[0] u'<h1>w3lib 1.1</h1>'
[1] u'<title>Python Package Index : w3lib 1.1</title>'

You can also quote the text, if you need to specify an arbitrary number of spaces, for example:

scrapely> t 0 "w3lib 1.1"

You can refine by position. To take the one in position [0]:

scrapely> a 0 w3lib 1.1 -n 0
[0] u'<h1>w3lib 1.1</h1>'

To annotate some fields on the template:

scrapely> a 0 w3lib 1.1 -n 0 -f name
[new] (name) u'<h1>w3lib 1.1</h1>'
scrapely> a 0 Scrapy project -n 0 -f author
[new] u'<span>Scrapy project</span>'

To list annotations on a template:

scrapely> al 0
[0-0] (name) u'<h1>w3lib 1.1</h1>'
[0-1] (author) u'<span>Scrapy project</span>'

To scrape another similar page with the already added templates:

scrapely> s http://pypi.python.org/pypi/Django/1.3
[{u'author': [u'Django Software Foundation'], u'name': [u'Django 1.3']}]

Tests

tox is the preferred way to run tests. Just run: tox from the root directory.

Support

Scrapely is created and maintained by the Scrapy group, so you can get help through the usual support channels described in the Scrapy community page.

Architecture

Unlike most scraping libraries, Scrapely doesn't work with DOM trees or xpaths so it doesn't depend on libraries such as lxml or libxml2. Instead, it uses an internal pure-python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.
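
As a rough illustration of the tokenization step, here is a minimal sketch of inspecting a page's parsed fragments. It relies on the HtmlPage attributes that appear in the tracebacks further down this page (parsed_body and fragment_data); treat the exact details as an assumption rather than documented API:

from scrapely.htmlpage import HtmlPage

# Parse a small HTML snippet; each fragment is a tag or a text region.
page = HtmlPage('http://example.com/', body=u'<h1>title</h1><p>some text</p>', encoding='utf-8')
for fragment in page.parsed_body:
    # fragment_data() returns the raw source covered by that fragment
    print(page.fragment_data(fragment))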

Scrapely extraction is based upon the Instance Based Learning algorithm [1] and the matched items are combined into complex objects (it supports nested and repeated objects), using a tree of parsers, inspired by A Hierarchical Approach to Wrapper Induction [2].

[1] Yanhong Zhai, Bing Liu, Extracting Web Data Using Instance-Based Learning, World Wide Web, v.10 n.2, p.113-132, June 2007
[2] Ion Muslea, Steve Minton, Craig Knoblock, A hierarchical approach to wrapper induction, Proceedings of the third annual conference on Autonomous Agents, p.190-197, April 1999, Seattle, Washington, United States

Known Issues

The training implementation is currently very simple and is only provided for reference purposes, to make it easier to test Scrapely and play with it. On the other hand, the extraction code is reliable and production-ready. So, if you want to use Scrapely in production, you should use train() with caution and make sure it annotates the area of the page you intended.

Alternatively, you can use the Scrapely command line tool to annotate pages, which provides more manual control for higher accuracy.

How does Scrapely relate to Scrapy?

Despite the similarity in their names, Scrapely and Scrapy are quite different things. The only similarity they share is that they both depend on w3lib, and they are both maintained by the same group of developers (which is why both are hosted on the same Github account).

Scrapy is an application framework for building web crawlers, while Scrapely is a library for extracting structured data from HTML pages. If anything, Scrapely is more similar to BeautifulSoup or lxml than Scrapy.

Scrapely doesn't depend on Scrapy nor the other way around. In fact, it is quite common to use Scrapy without Scrapely, and vice versa.

If you are looking for a complete crawler-scraper solution, there is (at least) one project called Slybot that integrates both, but you can definitely use Scrapely on other web crawlers since it's just a library.

Scrapy has a built-in extraction mechanism called selectors which (unlike Scrapely) is based on XPath.

License

Scrapely library is licensed under the BSD license.

scrapely's People

Contributors

alexriina, ambientlighter, cyberplant, dangra, decause, eliasdorneles, elrull, hackrush01, kalessin, kmike, marekyggdrasil, okey, pablohoffman, plafl, redapple, robsonpeixoto, ruairif, shaneaevans, tpeng, vad


scrapely's Issues

Unable to pull in https

I'm trying to follow the intro documentation. I changed the training url to an https one and get the following.

Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.6/site-packages/scrapely/init.py", line 48, in train
page = url_to_page(url, encoding)
File "/usr/local/lib/python3.6/site-packages/scrapely/htmlpage.py", line 183, in url_to_page
fh = urlopen(url)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

I'm using Python 3.6.5.

URL was:

https://www.amazon.com/Xbox-One-X-1TB-Console/dp/B074WPGYRF/ref=sr_1_3?s=videogames&ie=UTF8&qid=1524486645&sr=1-3&keywords=xbox%2Bone%2Bx&th=1

Thanks!
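
The 503 here is most likely the site rejecting the default urllib User-Agent rather than an HTTPS problem in scrapely itself. One possible workaround (a sketch, not an official fix; the URL below is a hypothetical stand-in) is to fetch the page yourself with a browser-like User-Agent and hand the HTML to scrapely via HtmlPage and train_from_htmlpage / scrape_page:

from urllib.request import Request, urlopen
from scrapely import Scraper
from scrapely.htmlpage import HtmlPage

url = 'https://www.example.com/some-product'  # hypothetical URL standing in for the Amazon one
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # present a browser-like User-Agent
body = urlopen(req).read().decode('utf-8')

page = HtmlPage(url, body=body, encoding='utf-8')
s = Scraper()
s.train_from_htmlpage(page, {'name': 'Xbox One X 1TB Console'})
# On a trained scraper, other pages fetched the same way go through s.scrape_page(page).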

Html page containing more than one single entity. How to annotate?

Let's imagine that an html page contains more than one entity to extract.

Does Scrapely have a direct support for it?

I'm currently handling this situation manually. I may add support for it to scrapely if I don't find any existing support, once I understand the project in more detail.

benchmarks?

Compared to using XPath with Scrapy itself, does scrapely deliver better performance with its own learning system?

Duplicate Values (but valid) in the same html

Hi - I just came across this library and its unsupervised learning mechanism; I guess it is based on wrapper induction. This is a wonderful implementation. I have been playing around with it for a few days.

What I am trying to understand is the case where a value repeats, e.g. a flight ticket with two journeys in the same itinerary (just as a practical example):

Outbound Journey:
Depart : Newyork
Arrive: Dallas

Forward Journey:
Depart: Dallas
Arrive: California

If I train on the "Dallas" value, which can be both Arrive and Depart, scrapely complains that the fragment is already annotated.

data = {'Depart_1': 'Newyork', 'Arrive_1': 'Dallas', 'Depart_2': 'Dallas', 'Arrive_2': 'California'}

This is an example with places; the same applies to times, dates, etc., and also to return journeys in the same itinerary.

How can we achieve this using scrapely?

Anshuk

Is really Python 3 supported?

I have problems running scrapely with Python 3.
Scrapely depends on slybot, which depends on scrapy, which depends on Twisted, which doesn't yet support Python 3.

Please remove the information about supporting Python 3, or give instructions on how it can be made to work.

Extract from javascript?

Would it be possible to pull values out of javascript on a page? For example, I'm looking to pull some content that is contained within a string, such as "Here is some string with my value 1998". I want to annotate the 1998, but a lot of the time I get a bunch of html along with it. In the JS, however, there is a variable that holds exactly what I need:

<script>
data['item'] = {
  "year": "1998"
};
</script>

Would this be possible?

Thx

Random failing doctests

I have run the test suite multiple times and sometimes it fails (and sometimes it doesn't) due to:

======================================================================
FAIL: Doctest: scrapely.extraction.regionextract.RecordExtractor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/doctest.py", line 2226, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for scrapely.extraction.regionextract.RecordExtractor
  File "/home/daniel/src/scrapely/scrapely/extraction/regionextract.py", line 291, in RecordExtractor

----------------------------------------------------------------------
File "/home/daniel/src/scrapely/scrapely/extraction/regionextract.py", line 306, in scrapely.extraction.regionextract.RecordExtractor
Failed example:
    ex.extract(page)
Expected:
    [{u'description': [u'description'], u'name': [u'name']}]
Got:
    [{u'name': [u'name'], u'description': [u'description']}]


----------------------------------------------------------------------
Ran 99 tests in 0.578s

FAILED (failures=1)
ERROR: InvocationError: '/home/daniel/src/scrapely/.tox/py27/bin/nosetests scrapely tests'
_______________________________________________________________________________________________ summary _______________________________________________________________________________________________
ERROR:   py27: commands failed

How to scrape within Python using generated JSON from command line?

After doing:

python -m scrapely.tool myscraper.json
scrapely> ta http://pypi.python.org/pypi/w3lib/1.1
scrapely> a 0 w3lib 1.1 -n 0 -f name

How would I then use the myscraper.json from within Python for scraping?

I tried:

with open('myscraper.json') as f:
     s.fromfile(f)
     m = s.scrape('http://pypi.python.org/pypi/Django/1.3')

But it returns nothing.
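
As far as I can tell (verify against your installed version), Scraper.fromfile is a classmethod that returns a new Scraper, so calling it on an existing instance and discarding the return value leaves s untrained. A sketch of the intended usage:

from scrapely import Scraper

with open('myscraper.json') as f:
    s = Scraper.fromfile(f)  # assumption: fromfile returns a new, fully loaded Scraper

m = s.scrape('http://pypi.python.org/pypi/Django/1.3')
print(m)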

Drop Python 2.6 support

What about dropping Python 2.6 support after a new scrapely version is released, i.e. making scrapely 0.13.0 the last version which supports Python 2.6?

Wrong tag getting annotated

I have the following situation. Can you look into this issue?

    <span class='a-color-price'>  <-------- desired tag to be annotated as this contains the data
           <span class="currencyINR">   <----- actual tag being annotated and hence outputting wrongly as &nbsp;&nbsp;
            	&nbsp;&nbsp;
           </span>
        237.00  <-------- desired data passed for training
    </span>

I tried looking at the code to see if I could understand what's going on, but it was fairly hard to figure out.

Multiple matches?

I am very interested in using scrapely in a project and started playing with it. Is it possible to find multiple matches on a page? It seems to only find one.

What do you mean by "The training implementation is currently very simple and is only provided for references purposes, to make it easier to test Scrapely and play with it."?

Could you explain in more detail the meaning of this sentence in the README.md?

The training implementation is currently very simple and is only provided for references purposes, to make it easier to test Scrapely and play with it. ...you should use train() with caution and make sure it annotates the area of the page you intended

What problems may come up from using the training implementation?

A mismatch between the encoding of the data provided as input and the encoding of the html pages?
Others?

If you can make a list of all the known problems, I may help with the development of one of them.

Slow Extraction Times

It's currently taking me around 2s to run the extraction on a single page.

Following is the output of the line profiler:
Line #, Hits, Time, Per Hit, % Time, Line Contents

53                                           def extract(url, page, scraper):
54                                               """Returns a dictionary containing the extraction output
55                                               """
56        10         2923    292.3      0.1      page = unicode(page, errors = 'ignore')
57        10       704147  70414.7     17.8      html_page = HtmlPage(url, body=page, encoding = 'utf-8')
58                                           
59        10      2604545 260454.5     65.9      ex = InstanceBasedLearningExtractor(scraper.templates)
60        10       640413  64041.3     16.2      records = ex.extract(html_page)[0]
61        10          141     14.1      0.0      return records[0]


Am I doing something wrong? The extraction code is similar to that found in tool.py and __init__.py, but I get faster extraction times when I run scrapely from the command line than with the code above.

Please advise.
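
According to the profile above, roughly two thirds of the time goes into constructing InstanceBasedLearningExtractor on every call. A likely improvement (a sketch, assuming your templates don't change between pages and that `scraper` is an already-trained Scraper) is to build the extractor once and reuse it:

from scrapely.htmlpage import HtmlPage
from scrapely.extraction import InstanceBasedLearningExtractor

# Build the extractor once; per the profile this is the dominant cost.
ex = InstanceBasedLearningExtractor(scraper.templates)  # `scraper` is assumed to be a trained Scraper

def extract(url, page):
    """Extract one page, reusing the prebuilt extractor (page is assumed to be unicode already)."""
    html_page = HtmlPage(url, body=page, encoding='utf-8')
    records = ex.extract(html_page)[0]
    return records[0] if records else None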

problem with bad encoding and BOM?

I'm trying to annotate this page http://departamentos.inmobusqueda.com.ar/alquileres/buenos-aires/san-justo/74082/ that seems to have a UTF8 BOM encoded to iso8859-1.

I cannot find a way to annotate the page; scrapely is unable to annotate it (at least from the console).

I found a workaround: download the file, convert it with iconv from iso8859-1 to utf8, and serve it from localhost (it then has an oddly converted BOM, but it can be annotated with scrapely).

ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

Hi,
I am having the following problem. Not sure if I am following the right steps.
This is the repro.
Regards,

--------------------------------------
root
--------------------------------------
root@tex:/home/scraper# python --version
Python 3.4.3+
root@tex:/home/scraper# virtualenv venv_scrapely
Using base prefix '/usr'
New python executable in /home/scraper/venv_scrapely/bin/python3
Also creating executable in /home/scraper/venv_scrapely/bin/python
Installing setuptools, pip, wheel...done.
root@tex:/home/scraper# ls -lrt
total 4
drwxr-xr-x 5 root root 4096 Feb  6 18:23 venv_scrapely
root@tex:/home/scraper# source ./venv_scrapely/bin/activate
(venv_scrapely) root@tex:/home/scraper# pip install scrapely
Collecting scrapely
Collecting w3lib (from scrapely)
  Using cached w3lib-1.16.0-py2.py3-none-any.whl
Collecting numpy (from scrapely)
  Using cached numpy-1.12.0-cp34-cp34m-manylinux1_i686.whl
Requirement already satisfied: six in ./venv_scrapely/lib/python3.4/site-packages (from scrapely)
Installing collected packages: w3lib, numpy, scrapely
Successfully installed numpy-1.12.0 scrapely-0.13.3 w3lib-1.16.0
(venv_scrapely) root@tex:/home/scraper#
(venv_scrapely) root@tex:/home/scraper# pip list
(1.4.0)
numpy (1.12.0)
packaging (16.8)
pip (9.0.1)
pyparsing (2.1.10)
scrapely (0.13.3)
setuptools (34.1.1)
six (1.10.0)
w3lib (1.16.0)
wheel (0.29.0)
------------------------
with user scraper
------------------------
scraper@tex:$ source ./venv_scrapely/bin/activate
(venv_scrapely) scraper@tex:~$ python --version
Python 3.4.3+
(venv_scrapely) scraper@tex:~$ python
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapely import Scraper
>>> s=Scraper()
>>> url1='https://github.com/ripple/rippled'
>>> data={'name':'ripple/rippled','commits':'11,292','releases':'66','contributors':'56'}
>>> s.train(url1,data)
>>> url2='https://github.com/scrapy/scrapely/'
>>> s.scrape(url2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 53, in scrape
    return self.scrape_page(page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 59, in scrape_page
    return self._ex.extract(page)[0]
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/__init__.py", line 119, in extract
    extracted = extraction_tree.extract(extraction_page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 575, in extract
    items.extend(extractor.extract(page, start_index, end_index, self.template.ignored_regions))
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 351, in extract
    _, _, attributes = self._doextract(page, extractors, start_index, end_index, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 396, in _doextract
    labelled, start_index, end_index_exclusive, self.best_match, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 148, in similar_region
    data_length - range_end, data_length - range_start)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 85, in longest_unique_subsequence
    matches = naive_match_length(to_search, subsequence, range_start, range_end)
  File "scrapely/extraction/_similarity.pyx", line 155, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3845)
  File "scrapely/extraction/_similarity.pyx", line 158, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3648)
  File "scrapely/extraction/_similarity.pyx", line 87, in scrapely.extraction._similarity.np_naive_match_length (scrapely/extraction/_similarity.c:2802)
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

safehtml omits some important (all) attributes of tags

Let's consider that someone (like me) wants to keep an img tag, so the src attribute of that tag would be important to them. But the safehtml() function omits all attributes of the relevant tag.
I think it would be better to keep the attributes of allowed_tags, or to add another param named allowed_attributes to specify which attributes to keep.

how to use the tofile method

I tested scrapely with your example, but I don't know how to store templates to a file (or database). I tried:
I tried

from scrapely import Scraper
s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib'
data = {'name': 'w3lib 1.0', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)

s.tofile('testemplatefile')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "scrapely/__init__.py", line 28, in tofile
json.dump({'templates': tpls}, file)
File "/usr/lib/python2.7/json/__init__.py", line 182, in dump
fp.write(chunk)
AttributeError: 'str' object has no attribute 'write'

so I test

s = Scraper('abc.json')
url1 = 'http://pypi.python.org/pypi/w3lib'
data = {'name': 'w3lib 1.0', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "scrapely/__init__.py", line 41, in train
self.templates.append(tm.get_template())
AttributeError: 'str' object has no attribute 'append'
s.tofile(url1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "scrapely/__init__.py", line 27, in tofile
tpls = [page_to_dict(x) for x in self.templates]
File "scrapely/htmlpage.py", line 32, in page_to_dict
'url': page.url,

What should I do to store a template to a file (or database) and then use it again? Maybe redis will be my database choice...
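
The traceback shows tofile() passing your argument straight to json.dump, so it expects an open, writable file object rather than a filename. A sketch of what should work (fromfile as a classmethod is an assumption; check your installed version):

from scrapely import Scraper

s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib'
data = {'name': 'w3lib 1.0', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)

# tofile() writes JSON to an open file object, not to a path string.
with open('templates.json', 'w') as f:
    s.tofile(f)

# Restore later; fromfile is assumed here to be a classmethod returning a new Scraper.
with open('templates.json') as f:
    s2 = Scraper.fromfile(f)

For a database such as redis, you could write to an io.StringIO buffer instead of a real file and store the resulting JSON string under a key.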

ZeroDivisionError when training with zero-length data

(Minor bug.)
I installed scrapely from pip this morning.

This is a wacky edge case, but I think you could raise a more constructive error.

(Who wants to extract a zero-length string from a document? It's a bit like a magician pulling some atmosphere out of a hat: it's always going to be there...)

Check it out:

In [97]: from scrapely import Scraper

In [98]: s = Scraper()

In [99]: s.train('http://www.google.com', {'image': u''})
- - - - - - - - - - - - - - - - -
ZeroDivisionError                         Traceback (most recent call last)
/home/username/myfolder/<ipython-input-99-233d0ac90e7f> in <module>()
----> 1 s.train('http://www.google.com', {'image': u''})

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train(self, url, data, encoding)
     44     def train(self, url, data, encoding=None):
     45         page = url_to_page(url, encoding)
---> 46         self.train_from_htmlpage(page, data)
     47 
     48     def scrape(self, url, encoding=None):

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train_from_htmlpage(self, htmlpage, data)
     39                 if isinstance(value, str):
     40                     value = value.decode(htmlpage.encoding or 'utf-8')
---> 41                 tm.annotate(field, best_match(value))
     42         self.add_template(tm.get_template())
     43 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in annotate(self, field, score_func, best_match)
     31 
     32         """
---> 33         indexes = self.select(score_func)
     34         if not indexes:
     35             raise FragmentNotFound("Fragment not found annotating %r using: %s" % 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in select(self, score_func)
     46         matches = []
     47         for i, fragment in enumerate(htmlpage.parsed_body):
---> 48             score = score_func(fragment, htmlpage)
     49             if score:
     50                 matches.append((score, i))

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in func(fragment, page)
     95         fdata = page.fragment_data(fragment).strip()
     96         if text in fdata:
---> 97             return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
     98         else:
     99             return 0.0

ZeroDivisionError: float division by zero
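
Until the library raises a clearer error, a cheap caller-side guard (just a sketch) is to drop empty values before training:

def train_safely(scraper, url, data):
    """Reject empty values up front instead of hitting the ZeroDivisionError above."""
    clean = {k: v for k, v in data.items() if v and v.strip()}
    if not clean:
        raise ValueError('no non-empty training values supplied')
    scraper.train(url, clean)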

error [SSL: CERTIFICATE_VERIFY_FAILED] on travel sites

I'm just starting with this tool and I'm trying to scrape travel prices, but I get the error [SSL: CERTIFICATE_VERIFY_FAILED].

from scrapely import Scraper

s = Scraper()
url1 = 'XXXXX' # URL of site
data = {'price': '16.929'}
s.train(url1, data)

url2 = 'XXXXX' # URL of same site but different search params, same destination and origin just one month later
print(s.scrape(url2))

Full console log:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1400, in connect
server_hostname=server_hostname)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 401, in wrap_socket
_context=self, _session=session)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 808, in init
self.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 1061, in do_handshake
self._sslobj.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 683, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "spider.py", line 6, in <module>
s.train(url1, data)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapely/__init__.py", line 48, in train
page = url_to_page(url, encoding)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapely/htmlpage.py", line 183, in url_to_page
fh = urlopen(url)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)>

Any idea what could be the problem?
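
The failure happens inside url_to_page's plain urlopen call, before scrapely does any parsing, so it is really a local certificate problem (on python.org macOS installs, running the bundled "Install Certificates.command" often fixes it). If you just want to experiment, one workaround sketch (insecure, and the URL is a hypothetical placeholder) is to fetch the page yourself with verification disabled and pass it to train_from_htmlpage:

import ssl
from urllib.request import urlopen
from scrapely import Scraper
from scrapely.htmlpage import HtmlPage

url1 = 'https://travel.example.com/search?month=6'  # hypothetical travel-site URL
ctx = ssl._create_unverified_context()              # insecure: skips certificate verification
body = urlopen(url1, context=ctx).read().decode('utf-8')

page = HtmlPage(url1, body=body, encoding='utf-8')
s = Scraper()
s.train_from_htmlpage(page, {'price': '16.929'})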

Question: automate training

Hi,

I was wondering if it would be possible to do automated training using something like boilerpipe or goose to do title, content and date discovery?

I know that by default those libraries don't supply the extracted nodes, just the values... Figured it was better to ask before diving in, to see if somebody has already done this.

Specifying integer values in the data dict

Amazing work! This is really useful.

I ran into a minor issue with the way you provide data. The documentation does not say you can't provide integer values, so I ended up providing this data:

In [1]: from scrapely import Scraper

In [2]: s = Scraper()

In [3]: data = {'name': 'scrapy/scrapely', 'url': 'https://github.com/scrapy/scrapely', 'description': 'A pure-python HTML screen-scraping library', 'watchers': 42, 'forks': 9}

In [4]: url = "https://github.com/scrapy/scrapely"

and ran into this exception:

In [5]: s.train(url, data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

...

/home/ubuntu/scrapely/scrapely/template.py in func(fragment, page)
     93     def func(fragment, page):
     94         fdata = page.fragment_data(fragment).strip()
---> 95         if text in fdata:
     96             return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
     97         else:

TypeError: 'in <string>' requires string as left operand

It took me a while to realize what the issue was: it was the integer values in the data variable.

So, you can either make them all unicode strings:

if unicode(text) in fdata:
    return float(len(unicode(text))) / len(fdata) - (1e-6 * fragment.start)

or specify in the documentation that values should all be strings.
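
A caller-side workaround, until the library coerces the values or documents the requirement, is simply to stringify everything before training (a sketch, reusing the `s`, `url` and `data` from the session above):

# Coerce every value to a string first; scrapely matches text fragments, so it expects string values.
data = {k: v if isinstance(v, str) else str(v) for k, v in data.items()}
s.train(url, data)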

Python 3 support

Just a place for prospective porters to start with their ideas.
Would really love this to be ported but I don't have the time to do it myself.

How to use html data instead of direct URLs

An older issue mentions the 'train_from_htmlpage' method, but is it not working anymore? What I'm trying to do is provide preprocessed html data (with the utf8 conversion already done to make scrapely work) to scrapely.
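
As far as I can tell, train_from_htmlpage is still the way to do this; it takes an HtmlPage rather than raw text, so wrap the preprocessed HTML first. A sketch (the HtmlPage and train_from_htmlpage signatures are taken from tracebacks elsewhere on this page, so double-check them against your version):

from scrapely import Scraper
from scrapely.htmlpage import HtmlPage

html = u'<html><body><h1>w3lib 1.1</h1></body></html>'  # your already-decoded, preprocessed HTML
page = HtmlPage('http://example.com/item', body=html, encoding='utf-8')

s = Scraper()
s.train_from_htmlpage(page, {'name': 'w3lib 1.1'})
# Other pages prepared the same way can then be scraped with s.scrape_page(other_page).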

Use in production

I got very curious about this project. Today I use scrapy a lot, together with BeautifulSoup, and this makes me think scrapely could be used too.

Anybody using this in production?
Any gotchas?

Installing via pip on Python 3.7 fails

➜  beacon-scrapy git:(master) ✗ pip3 install scrapely
Collecting scrapely
  Downloading https://files.pythonhosted.org/packages/5e/8b/dcf53699a4645f39e200956e712180300ec52d2a16a28a51c98e96e76548/scrapely-0.13.4.tar.gz (134kB)
    100% |████████████████████████████████| 143kB 5.1MB/s
Requirement already satisfied: numpy in /usr/local/lib/python3.7/site-packages (from scrapely) (1.15.0)
Requirement already satisfied: w3lib in /usr/local/lib/python3.7/site-packages (from scrapely) (1.19.0)
Requirement already satisfied: six in /usr/local/lib/python3.7/site-packages (from scrapely) (1.11.0)
Building wheels for collected packages: scrapely
  Running setup.py bdist_wheel for scrapely ... error
  Complete output from command /usr/local/opt/python/bin/python3.7 -u -c "import setuptools, tokenize;__file__='/private/var/folders/7c/dm671s4x4v5bm8_6tprr861r0000gn/T/pip-install-p7z3xbo1/scrapely/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/7c/dm671s4x4v5bm8_6tprr861r0000gn/T/pip-wheel-7rl3xgbc --python-tag cp37:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.13-x86_64-3.7
  creating build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/descriptor.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/version.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/extractors.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/__init__.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/template.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/htmlpage.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/tool.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  creating build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/pageobjects.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/similarity.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/__init__.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/regionextract.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/pageparsing.py -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  running egg_info
  writing scrapely.egg-info/PKG-INFO
  writing dependency_links to scrapely.egg-info/dependency_links.txt
  writing requirements to scrapely.egg-info/requires.txt
  writing top-level names to scrapely.egg-info/top_level.txt
  reading manifest file 'scrapely.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  writing manifest file 'scrapely.egg-info/SOURCES.txt'
  copying scrapely/_htmlpage.c -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/_htmlpage.pyx -> build/lib.macosx-10.13-x86_64-3.7/scrapely
  copying scrapely/extraction/_similarity.c -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  copying scrapely/extraction/_similarity.pyx -> build/lib.macosx-10.13-x86_64-3.7/scrapely/extraction
  running build_ext
  building 'scrapely._htmlpage' extension
  creating build/temp.macosx-10.13-x86_64-3.7
  creating build/temp.macosx-10.13-x86_64-3.7/scrapely
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/usr/local/lib/python3.7/site-packages/numpy/core/include -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c scrapely/_htmlpage.c -o build/temp.macosx-10.13-x86_64-3.7/scrapely/_htmlpage.o
  scrapely/_htmlpage.c:7367:65: error: too many arguments to function call, expected 3, have 4
      return (*((__Pyx_PyCFunctionFast)meth)) (self, args, nargs, NULL);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                     ^~~~
  /Library/Developer/CommandLineTools/usr/lib/clang/9.1.0/include/stddef.h:105:16: note: expanded from macro 'NULL'
  #  define NULL ((void*)0)
                 ^~~~~~~~~~
  1 error generated.
  error: command 'clang' failed with exit status 1

README Usage (command line tool) correction

In the Usage (command line tool) section, after the text "To add a new annotation, you usually test the selection criteria first:"

The command says:

scrapely> a 0 w3lib 1.1

which should be corrected to:

scrapely> t 0 w3lib 1.1

Also in scrapy command line:

help t 

prints

ts <template> <text> - test selection text

when it should print:

t <template> <text> - test selection text

Just adding these for the benefit of others who got confused like I did.

How to extract a list of items

How can I extract a list of items from a page like
example.html

Beer red 1.50
Coffee black 2.0
Corn yellow 3.65

I know I can do:

data = {
  'name': 'Beer',
  'color': 'red',
  'price': '1.50'
}

s = scrapely.Scraper()
s.train('http://example.com', data)
...

to train on example.html, but how can I extract the rest of the data? I mean, I need to extract a list of items from that page.

Import Error: Cannot import name 'Scraper'

I'm trying to build something with the Scrapely library. After a bit of fixing I finally got all install issues out of the way.
Running the sample code:

from scrapely import Scraper
s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)

I get the error:

Import Error: Cannot import name 'Scraper'

How would I fix this?

Incorrect cleaning of <img> tag

Hi guys, I was looking for an html cleaner and found one inside the Scrapely lib. After some trials, I found a bug that I believe is critical.

It is expected that the img tag appears in the self-closing form (<img src='github.png' />), but it might appear in this way: <img src='stackoverflow.png'>. In that case, safehtml cleans the text incorrectly. For example, see this test in the terminal:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my'

IMHO, the expected output was my img is <strong>cool</strong>. The same behavior is seen with the <input> tag.

Best regards,

Please, release a version with a better python3 support

I saw a lot of commits, like 58f0886, that solve my problem:

test/pipeline/test_item_validator.py:3: in <module>
    from newscrawler.pipelines import ItemValidatorPipeline
newscrawler/pipelines.py:12: in <module>
    from scrapely.extractors import safehtml, htmlregion, _TAGS_TO_REPLACE
.eggs/scrapely-0.12.0-py3.5.egg/scrapely/__init__.py:4: in <module>
    from scrapely.htmlpage import HtmlPage, page_to_dict, url_to_page
.eggs/scrapely-0.12.0-py3.5.egg/scrapely/htmlpage.py:8: in <module>
    import re, hashlib, urllib2
E   ImportError: No module named 'urllib2'

Is it possible to release a new scrapely version?

How can I help?

Thanks

Interest in other wrapper induction techniques?

Hi all,

I'm sorry if this is not the right place for this discussion. If there is a more appropriate forum, I'd be happy to move over there.

I've been digging into the wrapper induction literature, and have really appreciated the work that y'all have done with this library and pydepta and mdr.

I'd like to build a library using the ideas from the Trinity paper or @AdiOmari's SYNTHIA approach.

It does not seem like your wrapper induction libraries are currently a very active area of interest, but I wanted to know if these would be of interest to y'all (or other methods)?


add a tag for 0.10 release

I'm not sure I can add it without breaking auto-release setup - never done this before. //cc @dangra?

Tag should be added for this commit: 62a46da (I've checked an archive uploaded to pypi for 0.10 versions and compared it with the source code).

remove most Scrapy mentions from the README

I think we should remove the Scrapy mentions from the readme. It is weird that, instead of install instructions or a package description, we start the README with a chapter about Scrapy, essentially describing that scrapely is not related to Scrapy.

Support for passing HTML, not just URLs

http://groups.google.com/group/scraperwiki/browse_thread/thread/d750d093ca5220bf
... was posted, wanting to use Mechanize to download HTML [since the data was behind a login] and Scrapely to parse it.

As far as I can see, Scrapely doesn't support that.

I've made https://scraperwiki.com/scrapers/scrapely-hack/ to try to work around that.

The core change is in Scraper._get_page where:

if html:
    body=html.decode(encoding)
else:

is added before

    body = urllib.urlopen(url).read().decode(encoding)

An optional 'html' parameter is added to Scraper.scrape, .train and _get_page [and passed through to _get_page], and the 'url' parameter is made optional.

Installing via pip on Python 3.7 still fails

When installing with python 3.7 it still fails.

Collecting scrapely
  Using cached https://files.pythonhosted.org/packages/5e/8b/dcf53699a4645f39e200956e712180300ec52d2a16a28a51c98e96e76548/scrapely-0.13.4.tar.gz
Requirement already satisfied: numpy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from scrapely) (1.15.2)
Requirement already satisfied: w3lib in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from scrapely) (1.19.0)
Requirement already satisfied: six in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from scrapely) (1.11.0)
Installing collected packages: scrapely
  Running setup.py install for scrapely ... error
    Complete output from command /Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -u -c "import setuptools, tokenize;__file__='/private/var/folders/p5/w24gg45x3mngmm2nk1v8v18h0000gn/T/pip-install-chwlaolb/scrapely/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/p5/w24gg45x3mngmm2nk1v8v18h0000gn/T/pip-record-l4aa8igy/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.9-x86_64-3.7
    creating build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/descriptor.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/version.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/extractors.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/template.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/htmlpage.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/tool.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    creating build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/pageobjects.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/similarity.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/regionextract.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/pageparsing.py -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    running egg_info
    writing scrapely.egg-info/PKG-INFO
    writing dependency_links to scrapely.egg-info/dependency_links.txt
    writing requirements to scrapely.egg-info/requires.txt
    writing top-level names to scrapely.egg-info/top_level.txt
    reading manifest file 'scrapely.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'scrapely.egg-info/SOURCES.txt'
    copying scrapely/_htmlpage.c -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/_htmlpage.pyx -> build/lib.macosx-10.9-x86_64-3.7/scrapely
    copying scrapely/extraction/_similarity.c -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    copying scrapely/extraction/_similarity.pyx -> build/lib.macosx-10.9-x86_64-3.7/scrapely/extraction
    running build_ext
    building 'scrapely._htmlpage' extension
    creating build/temp.macosx-10.9-x86_64-3.7
    creating build/temp.macosx-10.9-x86_64-3.7/scrapely
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/include -I/Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c scrapely/_htmlpage.c -o build/temp.macosx-10.9-x86_64-3.7/scrapely/_htmlpage.o
    scrapely/_htmlpage.c:7367:65: error: too many arguments to function call, expected 3, have 4
        return (*((__Pyx_PyCFunctionFast)meth)) (self, args, nargs, NULL);
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                     ^~~~
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/10.0.0/include/stddef.h:105:16: note: expanded from macro 'NULL'
    #  define NULL ((void*)0)
                   ^~~~~~~~~~
    1 error generated.
    error: command 'gcc' failed with exit status 1

Correct example at README.rst

w3lib has changed version, so the example should be:

data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}

If not, it raises a FragmentNotFound exception.

iso-8859-1

Trying to scrape pages with a content-encoding of iso-8859-1 throws a unicode error:
>>> url1 = 'http://www[DOT]getmobile[DOT]de/handy/NO68128,Nokia-C3-01-Touch-and-Type.html' #url changed to prevent backlinking
>>> data = {'name': 'Nokia C3-01 Touch and Type', 'price': '129,00'}
>>> s.train(url1,data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.6-universal/egg/scrapely/__init__.py", line 32, in train
File "build/bdist.macosx-10.6-universal/egg/scrapely/__init__.py", line 50, in _get_page
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1512-1514: invalid data

Obtaining sectioned article text

Hello,

I have a project that I was looking to use Scrapely for. From what I've read, this sounds like something I would like to use. I have run into a problem with it, though: when I pass a url that contains sectioned article text (which appears to be almost all of my urls), I only receive the first section of the text.

Here's a site that I tried: http://www.autostraddle.com/12-black-friday-deals-you-can-get-without-having-to-put-pants-on-266850/

and here's what I used to train scrapely:

{'title':'15 Things You Learn When You Move In With Your Girlfriend', 'author': 'by Kate', 'postdate':'November 10, 2014 at 9:00am PST', 'count':'82', 'content':'There comes a point in every relationship when it makes sense for you to think about cohabitation.'}

If I then have scrapely scrape that same url, it only gives me that first paragraph.

So my question is: how would I get scrapely to obtain all of the article's main text (basically the text between the social media icons)?

Any help would be greatly appreciated!

Thanks

tool.parse_criteria normalizes whitespace

Unfortunately, this breaks on templates with criteria that include multiple whitespace characters.

This can be seen on this page: http://lookbook.nu/msha (XPath //h1[@class="left rightspaced inline"]/a/text()) with the following scrapely session:

scrapely> ta http://lookbook.nu/msha
[1] http://lookbook.nu/msha
scrapely> t 0 Melisa  I
scrapely> 

safehtml should ensure tabular content safety

safehtml should ensure that tabular content is safe to display, enforcing <table> tags where needed. Take as an example:

>>> print safehtml(htmlregion(u'<span>pre text</span><tr><td>hello world</td></tr>'))
u'pre text<tr><td>hello world</td></tr>'

That output will break any table layout where the content is rendered.

Does the order of annotations matter - Weird output

I've been playing with scrapely, and this script generates some weird output:

  1. annotate url1
  2. try scraping url1, got the expected output
  3. annotate url2
  4. try scraping url2, got nothing from scraping url2.

I thought it could be train(), since it is not supposed to be reliable, but when I exported the annotated data the annotations seemed alright.

Then I inverted the order:

  1. annotate url2
  2. try scraping url2, got the expected output
  3. annotate url1
  4. try scraping url1, got something different from the annotation (a subset of what was annotated)

Is this expected behaviour?
