
spidermon's Introduction

Spidermon

Overview

Spidermon is an extension for Scrapy spiders. The package provides useful tools for data validation, stats monitoring, and notification messages. This way, you can leave the monitoring to Spidermon and simply review the reports and notifications it produces.

Requirements

  • Python Version: 3.8, 3.9, 3.10 or 3.11

Install

The quick way:

pip install spidermon

For more details see the install section in the documentation: https://spidermon.readthedocs.io/en/latest/installation.html

Documentation

Documentation is available online at https://spidermon.readthedocs.io/ and in the docs directory.

spidermon's Issues

Unable to add model-level validation with schematics

schematics lets you define type-level validation (when you want to validate only the value of a specific field) or model-level validation (when you want to validate one field against the content of another field).

Considering a model and a validation method like the following:

from schematics.exceptions import ValidationError
from schematics.models import Model
from schematics.types import DecimalType

class MyModel(Model):
    sale_price = DecimalType(required=True)
    list_price = DecimalType(required=True)

    # Model-level validation: runs after the field-level validators.
    def validate_list_price(self, data, value):
        if data['sale_price'] > data['list_price']:
            raise ValidationError(
                'List price must be greater or equal to sale price.')
        return value

When I execute the spider, I see this error for each item that fails the validation:

Traceback (most recent call last):
  File "/tmp/crawler/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/tmp/crawler/spidermon/contrib/scrapy/pipelines.py", line 110, in process_item
    ok, errors = validator.validate(data)
  File "/tmp/crawler/spidermon/contrib/validation/validator.py", line 22, in validate
    return not self.has_errors, self.errors
  File "/tmp/crawler/spidermon/contrib/validation/validator.py", line 40, in errors
    for field_name, messages in self._errors.items()])
  File "/tmp/crawler/spidermon/contrib/validation/validator.py", line 40, in <listcomp>
    for field_name, messages in self._errors.items()])
  File "/tmp/crawler/spidermon/contrib/validation/translator.py", line 12, in translate_messages
    return [self.translate_message(m) for m in messages]
  File "/tmp/crawler/spidermon/contrib/validation/translator.py", line 12, in <listcomp>
    return [self.translate_message(m) for m in messages]
  File "/tmp/crawler/spidermon/contrib/validation/translator.py", line 16, in translate_message
    if pattern.search(message):
TypeError: expected string or bytes-like object

How to set a local/custom template in Email actions

Hi,

I would like to know if there is a way to pass a local Jinja template to the actions. I'm setting a local path and an absolute path in SPIDERMON_BODY_HTML_TEMPLATE, but I'm getting a TemplateNotFound error.

I tried SPIDERMON_BODY_HTML_TEMPLATE = 'reports/email/monitors/result.jinja' and it worked, but I would like to use my own template.

It looks like the Jinja2 loader only looks at the /spidermon/contrib/actions/reports/templates/ folder inside the spidermon package.
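
For what it's worth, a minimal sketch of the behavior I'm after, written with plain Jinja2 (the loader composition is my assumption about how this could work, not how Spidermon is currently wired):

from jinja2 import ChoiceLoader, Environment, FileSystemLoader, PackageLoader

# Check a project-local templates/ directory first, then fall back to
# the templates bundled with spidermon.
loader = ChoiceLoader([
    FileSystemLoader('templates'),
    PackageLoader('spidermon.contrib.actions.reports', 'templates'),
])
env = Environment(loader=loader)
template = env.get_template('my_custom_result.jinja')  # hypothetical file name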

Add support for validation of different item types generated by the same spider

Currently, validation can only be defined via settings and is applied to all items generated by a given spider. I'm talking specifically about the SPIDERMON_VALIDATION_SCHEMAS and SPIDERMON_VALIDATION_MODELS settings.

This leads to many validation errors for spiders that generate more than one item type.

We need a way to allow the validation to be specified for each item type.

Options that I can think of right now:

  • Add a new setting that allows us to map the validation schema by item type. Something like:
SPIDERMON_VALIDATION_SCHEMAS_BY_TYPE = {
    'ItemType1': 'schema1.json', 
    'ItemType2': 'schema2.json'
}
  • Map the schema in the items themselves. Something like:
class ItemType1(Item):
    _schema = 'schema1.json'

Thoughts? Alternatives? Improvements? Have I missed something?

Adding new JSON schema format validators

We're exporting to SQL and having datetime objects in the data is the easiest way to send SQL datetimes, but these are rejected by jsonschema. There's no way to set a validator class or pass extra arguments to the current validator class to override the datetime validator AFAICT.
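
For context, plain jsonschema does let you register a custom check on a FormatChecker instance; the sketch below (my own code, not Spidermon's) shows the kind of override I'd like to be able to inject:

from datetime import datetime

from jsonschema import FormatChecker

checker = FormatChecker()

@checker.checks('date-time')
def check_datetime(value):
    # Accept real datetime objects in addition to strings; a proper
    # implementation would also validate the string form.
    return isinstance(value, (datetime, str))

The problem is that Spidermon builds its format checker internally, so there is no hook to pass an instance like this in.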

Document use of custom templates

Related to #126, we need better documentation explaining how to define and use custom templates for e-mails, reports, etc.

This action contains part of the code that handles it.

Including examples in the tutorial would be nice as well.

Make basic monitors part of spidermon

We have to add basic monitors to almost all of our projects, so the goal is to make such monitors part of spidermon itself: upon enabling them, the tool should send notifications automatically when integrated with a project. A sketch of what such a monitor could look like follows the list below.

  • Basic monitors ship inside spidermon.
  • One should be able to add customized monitors in a project in addition to these basic monitors.
  • One should be able to disable these monitors and add their own instead.
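
As a rough illustration of the kind of basic monitor meant here, written against the documented Monitor/MonitorSuite API (the threshold and names are illustrative):

from spidermon import Monitor, MonitorSuite, monitors

@monitors.name('Item count')
class ItemCountMonitor(Monitor):
    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        minimum_threshold = 10  # illustrative threshold
        item_extracted = getattr(self.data.stats, 'item_scraped_count', 0)
        self.assertTrue(
            item_extracted >= minimum_threshold,
            msg='Extracted less than {} items'.format(minimum_threshold))

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [ItemCountMonitor]

Shipping something like this by default, with sensible thresholds, is the point of this issue.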

Action to restart spider in Scrapy Cloud

Discussed in chat a bit: the idea is that if a job meets some conditions (a monitor detects certain website responses, the job stalls, etc.), this action could restart the job.

Ideas for how to count restarts to prevent infinite restarting included (a sketch using the first option follows the list):

  • job metadata
  • spider parameter
  • tag the job + restarts with a uuid or something and look up the previous job(s) with the tag to get the restart count
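
A very rough sketch of such an action, combining a Spidermon Action with python-scrapinghub (the 'restarts' metadata key, the attributes read from self.data, and the restart cap are all assumptions for illustration):

from scrapinghub import ScrapinghubClient

from spidermon.core.actions import Action

MAX_RESTARTS = 3  # illustrative cap to prevent infinite restarting

class RestartJobAction(Action):
    def run_action(self):
        client = ScrapinghubClient()  # API key taken from the environment
        job_key = self.data.job.key  # assumption: job info exposed on self.data
        job = client.get_job(job_key)
        restarts = int(job.metadata.get('restarts') or 0)  # hypothetical key
        if restarts >= MAX_RESTARTS:
            return
        project = client.get_project(job_key.split('/')[0])
        project.jobs.run(
            job.metadata.get('spider'),
            job_args={'restarts': restarts + 1},
        )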

Documentation on how to get started using spidermon

We need some documentation for spidermon. It would be nice to have an explanation of its general goals, a tutorial, and some documentation on how to extend it.

As low-hanging fruits, perhaps we could start with:

  • a tutorial for using it with Scrapy
  • autodocs generated and published somewhere

I've been using it for a use case it wasn't designed for (monitoring script jobs, not spiders), and I've personally never used it for the original intention (monitoring a spider), so I'm not that helpful for the standard use case.

Extend support for a few modules in python expressions

Use case to cover:
Test whether a job ran for more or less than a certain amount of time. So in a monitor test's assert expression the user tries:
($stats['finish_time'] - $stats['start_time']) < datetime.timedelta(0, 7)

and gets:
NameError: name 'datetime' is not defined

The stats' finish_time and start_time are datetime objects, so working with them like this seems like a case that's supposed to work.

We could extend the interpreter's context and pre-import modules like datetime (and maybe a few others as well).
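
A minimal sketch of the idea, assuming the expressions are evaluated against a context dictionary (the function below is illustrative, not the actual interpreter code):

import datetime

# Modules to pre-import into every expression's namespace.
DEFAULT_CONTEXT_MODULES = {'datetime': datetime}

def build_expression_context(stats):
    context = {'stats': stats}
    context.update(DEFAULT_CONTEXT_MODULES)
    return context

With that in place, the expression above would resolve datetime.timedelta without any extra work from the user.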

Are optional dependencies really optional?

This is probably related to #88.

It seems to me that if you want to use spidermon with Scrapy you need to use spidermon.contrib.scrapy.extensions.Spidermon. That requires at least jsonschema (imported via spidermon.python.factory) from the optional validation feature. On the other hand, the contents of the optional monitoring feature look important for any spidermon usage. Are there real use cases where one or both optional features can be skipped?

Disclaimer: I'm not familiar with optional installation features that much, as most modules I've used don't use them.

Python 3 support

Looks like for now the library doesn't support Python 3 (imports, syntax).
It would be great to see it working with both Python 2 and 3.

Integrate spidermon with Sentry

The goal is to integrate spidermon in such a way that notifications are sent to the Sentry dashboard, where they can be viewed, linked, and associated with tickets in order to reduce duplicated effort. This would be in addition to the Slack and email integrations.

Enhance slack notifications

Currently the notifications we receive in Slack are quite simple, without much information.
For example:

*somesite spider finished with errors!* / view job in Scrapy Cloud _(errors=7)_
•  _Job validation/validation errors_

We could add more information there, something like:

*somesite spider finished with errors!* / view job in Scrapy Cloud _(errors=7)_
•  _Job validation/validation errors_
* invalid_string: 3
* missing_required_field/field_name: 4

jsonschema>=3.0.0 compatibility

It looks like jsonschema 3.0.0, released a few days ago, broke something in Spidermon's contrib/validation/jsonschema/formats.py:

Unhandled error in Deferred:
2019-02-28 02:27:27 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 172, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 176, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python3.7/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import

  File "<frozen importlib._bootstrap>", line 983, in _find_and_load

  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked

  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked

  File "<frozen importlib._bootstrap_external>", line 728, in exec_module

  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/scrapy/pipelines.py", line 13, in <module>
    from spidermon.contrib.validation import SchematicsValidator, JSONSchemaValidator
  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/validation/__init__.py", line 2, in <module>
    from .jsonschema.validator import JSONSchemaValidator
  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/validation/jsonschema/validator.py", line 9, in <module>
    from .formats import format_checker
  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/validation/jsonschema/formats.py", line 32, in <module>
    format_checker = FormatChecker(_draft_checkers["draft4"] + list(iterkeys(_spidermon_checkers)))
builtins.TypeError: unsupported operand type(s) for +: 'FormatChecker' and 'list'

Downgrading to jsonschema 2.6.0 prevents this error.
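
One possible direction for a fix, avoiding jsonschema's private _draft_checkers structure entirely (a sketch, not a tested patch):

from jsonschema import FormatChecker

# A bare FormatChecker() already knows every globally registered format,
# including the draft-4 ones, so the custom spidermon formats can be
# registered on top of it instead of being merged into a private list:
format_checker = FormatChecker()

@format_checker.checks('url')  # illustrative custom format name
def is_url(value):
    ...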

Get the JSON Schema file from a URL

It might be interesting to make it possible to download the JSON Schema file from a remote URL.

Pros

  • In cases when people have a dedicated repository for JSON Schemas, this could avoid having to keep two versions of the same file in two different repositories.
  • Some APIs could provide JSON Schema references, their endpoints could be used to live update the Spider requirements.

Cons

  • Sometimes the JSON Schema used in Spidermon is different from the JSON Schema used in QA.
  • The same problem could be solved using a Pipeline to update the Spider repository when the QA repository receives new commits on the JSON Schema file.
  • Single responsibility and UNIX philosophy: maybe this feature is beyond the scope of this project.

(Thanks to @ejulio for discussing pros and cons)
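
If we go this way, a minimal loader sketch could look like the following (the helper name is hypothetical, and requests would become a dependency):

import json

import requests

def load_schema(location):
    # Treat anything with an http(s) scheme as remote; otherwise
    # fall back to reading a local file.
    if location.startswith(('http://', 'https://')):
        response = requests.get(location, timeout=30)
        response.raise_for_status()
        return response.json()
    with open(location) as schema_file:
        return json.load(schema_file)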

Option to stop spider on errors

I think it would be nice to have validation errors treated like other spider errors: increment the error counter and, optionally, stop the spider with an error condition once the errors exceed a configured count.

I'd primarily like to stop the spider so I made that the title, but having both would be nice.
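
As a sketch of the "stop the spider" half, a pipeline along these lines could work (the stats key and threshold are assumptions; closing the spider via crawler.engine.close_spider is standard Scrapy):

class StopOnValidationErrorsPipeline:
    MAX_VALIDATION_ERRORS = 100  # illustrative threshold

    def process_item(self, item, spider):
        # Assumed stats key; the exact name depends on where the
        # validation errors get counted.
        errors = spider.crawler.stats.get_value(
            'spidermon/validation/fields/errors', 0)
        if errors > self.MAX_VALIDATION_ERRORS:
            spider.crawler.engine.close_spider(
                spider, reason='too_many_validation_errors')
        return item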

Follow python end of life https://devguide.python.org/

I propose to follow the Python dev guide's end-of-life schedule. I thought the idea from the start was that we wouldn't officially support Python 2 in this library.

There are well-known benefits to removing Python 2 support altogether.

But in the end it's all the same: the guidelines are there for a reason, and something old will eventually stop working or will block this library's development, sooner or later.

So I propose to actually:

  1. deprecate Python 2
  2. deprecate Python 3.4

Projects that are still alive and need Python 2 support can use an archived version or switch to Python 3.

Document sentry integration

The purpose of this task is to document the integration between Sentry and Spidermon. The following things should be part of the documentation:

  1. How to turn on the integration.
  2. Possible settings which could be used.
  3. Screenshots ???

make this repo publicly available

I have noticed that this repository is accessible only to Scrapinghub staff. Is that intentional? I wonder if we can make it public, or even publish it on PyPI. Some Scrapinghub customers who write their own projects may want to make use of it.

Improve message of validation error in JSON Schema

When we have a schema with additionalProperties set to False, the error message provided by JSONSchemaValidator is not useful, as it doesn't show which unexpected fields were found.

In [1]: from spidermon.contrib.validation import JSONSchemaValidator
   ...:
   ...: schema = {
   ...:     "$schema": "http://json-schema.org/draft-07/schema#",
   ...:     "additionalProperties": False,
   ...:     "type": "object",
   ...:     "properties": {
   ...:         "name": {"type": "string"}
   ...:     }
   ...: }
   ...:
   ...: validator = JSONSchemaValidator(schema)
   ...: validator.validate({"name": "Test Name"})
Out[1]: (True, {})

In [2]: validator.validate({"name": "Test Name", "email": "Unexpected email"})
Out[2]: (False, {'': ['Unexpected field']})

It would be better if we include the unexpected fields:

In [2]: validator.validate({"name": "Test Name", "email": "Unexpected email"})
Out[2]: (False, {'': ['Unexpected fields: "email"']})

Black to handle PEP 8 and all code formatting

I have noticed formatting issues here and there in PRs; I suggest looking at https://github.com/ambv/black.

The setup is simple: add a few lines of settings to tox.ini, autoformat the code with black ./, and add a line to CI to check that everything is OK.

The main benefit: it lets you focus on the code, it brings consistency, and the IDE integrations are good.
The main con for me: it's a pre-release package, and while it's very stable, a pre-release is a special case.

I am ready to help.

Add SPIDERMON_ENGINE_STOP_MONITORS

I want to monitor whether the file was uploaded to S3 or not. I've subclassed FeedExporter and added a flag to the stats when the file is uploaded. But the issue is that SPIDERMON_SPIDER_CLOSE_MONITORS runs all the monitors even before the file is uploaded, because FeedExporter and spidermon's spider-close monitors both run on the spider_closed signal.

To monitor this we need to run some monitors on the engine_stopped signal.
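
Roughly, the requested setting would mean wiring a suite to Scrapy's engine_stopped signal, along these lines (a sketch of the idea, not Spidermon's actual extension code):

from scrapy import signals

class EngineStopMonitoring:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(
            ext.engine_stopped, signal=signals.engine_stopped)
        return ext

    def engine_stopped(self):
        # Run the monitor suites configured in the proposed
        # SPIDERMON_ENGINE_STOP_MONITORS setting here; by the time this
        # signal fires, the spider_closed handlers (including the feed
        # upload) should have completed.
        pass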

Create a base monitor to compare with previous crawls

Often, we compare the current crawling result with a previous one.
The idea here is to create some kind of base monitor that would make it a bit easier.
Basically, we need:

  • Some sort of storage provider (we can use python-scrapinghub by default, but we need to make it extensible to any kind of storage)
  • A base monitor that would load the last crawl's data, provide an entry point to compare it with the current one, and store the current crawl's data.

Anything else you think is worth adding here?
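
To make it concrete, fetching the previous job's stats with python-scrapinghub could look roughly like this (the project/spider lookup details are assumptions):

from scrapinghub import ScrapinghubClient

def get_previous_job_stats(project_id, spider_name):
    client = ScrapinghubClient()  # API key taken from the environment
    project = client.get_project(project_id)
    # Grab the most recent finished job for this spider.
    for summary in project.jobs.iter(
            spider=spider_name, state='finished', count=1):
        job = client.get_job(summary['key'])
        return job.metadata.get('scrapystats')
    return None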

What is the difference between render_text_template and render_template?

Hi,

I would like to know what the intention of the render_text_template and render_template methods in /contrib/actions/templates.py is.

def render_text_template(self, template):
    template = Template(template)
    return template.render(self.get_template_context())

def render_template(self, template):
    template = self.get_template(template)
    return template.render(self.get_template_context())

Both look essentially the same; the only difference is Template(template) versus self.get_template(template).
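
For reference, that one-line difference does change behavior in Jinja2 terms (the calls below are illustrative):

# render_text_template() treats its argument as inline template source:
action.render_text_template('Found {{ errors }} errors')

# render_template() treats its argument as a template name, resolved by
# get_template() through the configured Jinja2 loader:
action.render_template('reports/email/monitors/result.jinja')

So the question stands: when is each one intended to be used?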

Create a CLI to set up very basic spidermon monitors and notifications

Some simple prompt questions that help newcomers set up a basic spidermon installation in an easy and quick way.

$ spidermon setup
$ To whom should I send the notification email: [email protected]
$ Do you want to enable Slack Notifications [Y/n]: y
$ Please provide a Slack API key to send notifications: ab213234_cHJscaJRFVN
Thanks for enabling the amazing Spidermon! You're good to go.

Merge development branch

Is the code in master branch being used somewhere?

If not, I say let's merge the development branch already, to avoid multiple versions being used all around.
