
spidermon's Introduction

Spidermon

Overview

Spidermon is an extension for Scrapy spiders. The package provides useful tools for data validation, stats monitoring, and notification messages. This way, you can leave the monitoring to Spidermon and simply review the reports and notifications it produces.

Requirements

  • Python Version: 3.8, 3.9, 3.10 or 3.11

Install

The quick way:

pip install spidermon

For more details see the install section in the documentation: https://spidermon.readthedocs.io/en/latest/installation.html

Documentation

Documentation is available online at https://spidermon.readthedocs.io/ and in the docs directory.

spidermon's Issues

Unable to add model-level validation with schematics

schematics lets you define type-level validation (when you want to validate only the value of a specific field) or model-level validation (when you want to validate one field against the content of another field).

Considering a model and a validation method like the following:

from schematics.exceptions import ValidationError
from schematics.models import Model
from schematics.types import DecimalType

class MyModel(Model):
    sale_price = DecimalType(required=True)
    list_price = DecimalType(required=True)

    # Model-level validation: runs after the field-level validators.
    def validate_list_price(self, data, value):
        if data['sale_price'] > data['list_price']:
            raise ValidationError(
                'List price must be greater or equal to sale price.')
        return value

When I execute the spider, I see this error for each item that fails the validation:

Traceback (most recent call last):
  File "/tmp/crawler/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/tmp/crawler/spidermon/contrib/scrapy/pipelines.py", line 110, in process_item
    ok, errors = validator.validate(data)
  File "/tmp/crawler/spidermon/contrib/validation/validator.py", line 22, in validate
    return not self.has_errors, self.errors
  File "/tmp/crawler/spidermon/contrib/validation/validator.py", line 40, in errors
    for field_name, messages in self._errors.items()])
  File "/tmp/crawler/spidermon/contrib/validation/validator.py", line 40, in <listcomp>
    for field_name, messages in self._errors.items()])
  File "/tmp/crawler/spidermon/contrib/validation/translator.py", line 12, in translate_messages
    return [self.translate_message(m) for m in messages]
  File "/tmp/crawler/spidermon/contrib/validation/translator.py", line 12, in <listcomp>
    return [self.translate_message(m) for m in messages]
  File "/tmp/crawler/spidermon/contrib/validation/translator.py", line 16, in translate_message
    if pattern.search(message):
TypeError: expected string or bytes-like object

How to set a local/custom template in Email actions

Hi,

I would like to know if there is a way to pass a local Jinja template to the actions. I'm setting a local path and an absolute path in SPIDERMON_BODY_HTML_TEMPLATE, but I'm getting a TemplateNotFound error.

I tried SPIDERMON_BODY_HTML_TEMPLATE = 'reports/email/monitors/result.jinja' and it worked, but I would like to use my own template.

It looks like the Jinja2 loader only looks at the /spidermon/contrib/actions/reports/templates/ folder inside the spidermon package.
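
For what it's worth, a minimal sketch of the behavior I'm after, written with plain Jinja2 (the loader composition is my assumption about how this could work, not how Spidermon is currently wired):

from jinja2 import ChoiceLoader, Environment, FileSystemLoader, PackageLoader

# Check a project-local templates/ directory first, then fall back to
# the templates bundled with spidermon.
loader = ChoiceLoader([
    FileSystemLoader('templates'),
    PackageLoader('spidermon.contrib.actions.reports', 'templates'),
])
env = Environment(loader=loader)
template = env.get_template('my_custom_result.jinja')  # hypothetical file name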

Add support for validation of different item types generated by the same spider

Currently, validation can only be defined via settings and is applied to all items generated by a given spider. I'm talking specifically about the SPIDERMON_VALIDATION_SCHEMAS and SPIDERMON_VALIDATION_MODELS settings.

This leads to many validation errors for spiders that generate more than one item type.

We need a way to allow the validation to be specified for each item type.

Options that I can think of right now:

  • Add a new setting that allows us to map the validation schema by item type. Something like:
SPIDERMON_VALIDATION_SCHEMAS_BY_TYPE = {
    'ItemType1': 'schema1.json', 
    'ItemType2': 'schema2.json'
}
  • Map the schema in the items themselves. Something like:
class ItemType1(Item):
    _schema = 'schema1.json'

Thoughts? Alternatives? Improvements? Have I missed something?

Adding new JSON schema format validators

We're exporting to SQL and having datetime objects in the data is the easiest way to send SQL datetimes, but these are rejected by jsonschema. There's no way to set a validator class or pass extra arguments to the current validator class to override the datetime validator AFAICT.
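
For context, plain jsonschema does let you register a custom check on a FormatChecker instance; the sketch below (my own code, not Spidermon's) shows the kind of override I'd like to be able to inject:

from datetime import datetime

from jsonschema import FormatChecker

checker = FormatChecker()

@checker.checks('date-time')
def check_datetime(value):
    # Accept real datetime objects in addition to strings; a proper
    # implementation would also validate the string form.
    return isinstance(value, (datetime, str))

The problem is that Spidermon builds its format checker internally, so there is no hook to pass an instance like this in.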

Document use of custom templates

Related to #126, we need better documentation explaining how to define and use custom templates for e-mails, reports, etc.

This action contains part of the code that handles it.

Including examples in the tutorial would be nice as well.

Make basic monitors part of spidermon

We have to add basic monitors to almost all of our projects, so the goal is to make such monitors part of spidermon itself: upon enabling them, the tool should send notifications automatically when integrated with a project. A sketch of what such a monitor could look like follows the list below.

  • Basic monitors ship inside spidermon.
  • One should be able to add customized monitors in a project in addition to these basic monitors.
  • One should be able to disable these monitors and add their own instead.
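
As a rough illustration of the kind of basic monitor meant here, written against the documented Monitor/MonitorSuite API (the threshold and names are illustrative):

from spidermon import Monitor, MonitorSuite, monitors

@monitors.name('Item count')
class ItemCountMonitor(Monitor):
    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        minimum_threshold = 10  # illustrative threshold
        item_extracted = getattr(self.data.stats, 'item_scraped_count', 0)
        self.assertTrue(
            item_extracted >= minimum_threshold,
            msg='Extracted less than {} items'.format(minimum_threshold))

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [ItemCountMonitor]

Shipping something like this by default, with sensible thresholds, is the point of this issue.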

Action to restart spider in Scrapy Cloud

Discussed in chat a bit: the idea is that if a job meets some conditions (a monitor detects certain website responses, the job stalls, etc.), this action could restart the job.

Ideas for how to count restarts to prevent infinite restarting included (a sketch using the first option follows the list):

  • job metadata
  • spider parameter
  • tag the job + restarts with a uuid or something and look up the previous job(s) with the tag to get the restart count
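
A very rough sketch of such an action, combining a Spidermon Action with python-scrapinghub (the 'restarts' metadata key, the attributes read from self.data, and the restart cap are all assumptions for illustration):

from scrapinghub import ScrapinghubClient

from spidermon.core.actions import Action

MAX_RESTARTS = 3  # illustrative cap to prevent infinite restarting

class RestartJobAction(Action):
    def run_action(self):
        client = ScrapinghubClient()  # API key taken from the environment
        job_key = self.data.job.key  # assumption: job info exposed on self.data
        job = client.get_job(job_key)
        restarts = int(job.metadata.get('restarts') or 0)  # hypothetical key
        if restarts >= MAX_RESTARTS:
            return
        project = client.get_project(job_key.split('/')[0])
        project.jobs.run(
            job.metadata.get('spider'),
            job_args={'restarts': restarts + 1},
        )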

Documentation on how to get started using spidermon

We need some documentation for spidermon. It would be nice to have an explanation of its general goals, a tutorial, and some documentation on how to extend it.

As low-hanging fruits, perhaps we could start with:

  • a tutorial for using it with Scrapy
  • autodocs generated and published somewhere

I've been using it for a use case it wasn't designed for (monitoring script jobs, not spiders), and I've personally never used it for the original intention (monitoring a spider), so I'm not that helpful for the standard use case.

Extend support for a few modules in python expressions

Use case to cover:
Test whether a job ran for more or less than a certain amount of time. So in a monitor test's assert expression the user tries:
($stats['finish_time'] - $stats['start_time']) < datetime.timedelta(0, 7)

and gets:
NameError: name 'datetime' is not defined

The stats' finish_time and start_time are datetime objects, so working with them like this seems like a case that's supposed to work.

We could extend the interpreter's context and pre-import modules like datetime (and maybe a few others as well).
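
A minimal sketch of the idea, assuming the expressions are evaluated against a context dictionary (the function below is illustrative, not the actual interpreter code):

import datetime

# Modules to pre-import into every expression's namespace.
DEFAULT_CONTEXT_MODULES = {'datetime': datetime}

def build_expression_context(stats):
    context = {'stats': stats}
    context.update(DEFAULT_CONTEXT_MODULES)
    return context

With that in place, the expression above would resolve datetime.timedelta without any extra work from the user.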

Are optional dependencies really optional?

This is probably related to #88.

It seems to me that if you want to use spidermon with Scrapy you need to use spidermon.contrib.scrapy.extensions.Spidermon. That requires at least jsonschema (imported via spidermon.python.factory) from the optional validation feature. On the other hand, the contents of the optional monitoring feature look important for any spidermon usage. Are there real use cases where one or both optional features can be skipped?

Disclaimer: I'm not familiar with optional installation features that much, as most modules I've used don't use them.

Python 3 support

Looks like for now the library doesn't support Python 3 (imports, syntax).
It would be great to see it working with both Python 2 and 3.

Integrate spidermon with Sentry

The goal is to integrate spidermon in such a way that notifications are sent to the Sentry dashboard, where they can be viewed, linked, and associated with tickets in order to reduce duplicated effort. This would be in addition to the Slack and email integrations.

Enhance slack notifications

Currently the notifications we receive in Slack are quite simple, without much information.
For example:

*somesite spider finished with errors!* / view job in Scrapy Cloud _(errors=7)_
•  _Job validation/validation errors_

We could add more information there, something like:

*somesite spider finished with errors!* / view job in Scrapy Cloud _(errors=7)_
•  _Job validation/validation errors_
* invalid_string: 3
* missing_required_field/field_name: 4

jsonschema>=3.0.0 compatibility

It looks like jsonschema 3.0.0, released a few days ago, broke something in Spidermon's contrib/validation/jsonschema/formats.py:

Unhandled error in Deferred:
2019-02-28 02:27:27 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 172, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 176, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python3.7/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import

  File "<frozen importlib._bootstrap>", line 983, in _find_and_load

  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked

  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked

  File "<frozen importlib._bootstrap_external>", line 728, in exec_module

  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/scrapy/pipelines.py", line 13, in <module>
    from spidermon.contrib.validation import SchematicsValidator, JSONSchemaValidator
  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/validation/__init__.py", line 2, in <module>
    from .jsonschema.validator import JSONSchemaValidator
  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/validation/jsonschema/validator.py", line 9, in <module>
    from .formats import format_checker
  File "/usr/local/lib/python3.7/site-packages/spidermon/contrib/validation/jsonschema/formats.py", line 32, in <module>
    format_checker = FormatChecker(_draft_checkers["draft4"] + list(iterkeys(_spidermon_checkers)))
builtins.TypeError: unsupported operand type(s) for +: 'FormatChecker' and 'list'

Downgrading to jsonschema 2.6.0 prevents this error.
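
One possible direction for a fix, avoiding jsonschema's private _draft_checkers structure entirely (a sketch, not a tested patch):

from jsonschema import FormatChecker

# A bare FormatChecker() already knows every globally registered format,
# including the draft-4 ones, so the custom spidermon formats can be
# registered on top of it instead of being merged into a private list:
format_checker = FormatChecker()

@format_checker.checks('url')  # illustrative custom format name
def is_url(value):
    ...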

Get the JSON Schema file from a URL

It might be interesting to make it possible to download the JSON Schema file from a remote URL.

Pros

  • In cases when people have a dedicated repository for JSON Schemas, this could avoid having to keep two versions of the same file in two different repositories.
  • Some APIs could provide JSON Schema references, their endpoints could be used to live update the Spider requirements.

Cons

  • Sometimes the JSON Schema used in Spidermon is different from the JSON Schema used in QA.
  • The same problem could be solved using a Pipeline to update the Spider repository when the QA repository receives new commits on the JSON Schema file.
  • Single responsibility and UNIX philosophy: maybe this feature is beyond the scope of this project.

(Thanks to @ejulio for discussing pros and cons)
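
If we go this way, a minimal loader sketch could look like the following (the helper name is hypothetical, and requests would become a dependency):

import json

import requests

def load_schema(location):
    # Treat anything with an http(s) scheme as remote; otherwise
    # fall back to reading a local file.
    if location.startswith(('http://', 'https://')):
        response = requests.get(location, timeout=30)
        response.raise_for_status()
        return response.json()
    with open(location) as schema_file:
        return json.load(schema_file)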

Option to stop spider on errors

I think it would be nice to have validation errors treated like other spider errors: increment the error counter and, optionally, stop the spider with an error condition once the errors exceed a configured count.

I'd primarily like to stop the spider so I made that the title, but having both would be nice.
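
As a sketch of the "stop the spider" half, a pipeline along these lines could work (the stats key and threshold are assumptions; closing the spider via crawler.engine.close_spider is standard Scrapy):

class StopOnValidationErrorsPipeline:
    MAX_VALIDATION_ERRORS = 100  # illustrative threshold

    def process_item(self, item, spider):
        # Assumed stats key; the exact name depends on where the
        # validation errors get counted.
        errors = spider.crawler.stats.get_value(
            'spidermon/validation/fields/errors', 0)
        if errors > self.MAX_VALIDATION_ERRORS:
            spider.crawler.engine.close_spider(
                spider, reason='too_many_validation_errors')
        return item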

Follow python end of life https://devguide.python.org/

I propose to follow the Python dev guide's end-of-life schedule. I thought the idea from the start was that we wouldn't officially support Python 2 in this library.

There are well-known benefits to removing Python 2 support altogether.

But in the end it's all the same: the guidelines are there for a reason, and something old will eventually stop working or will block this library's development, sooner or later.

So I propose to actually:

  1. deprecate Python 2
  2. deprecate Python 3.4

Projects that are still alive and need Python 2 support can use an archived version or switch to Python 3.

Document sentry integration

The purpose of this task is to document the integration between Sentry and Spidermon. The following things should be part of the documentation:

  1. How to turn on the integration.
  2. Possible settings which could be used.
  3. Screenshots ???

make this repo publicly available

I have noticed that this repository is accessible only to Scrapinghub staff. Is that intentional? I wonder if we can make it public, or even publish it on PyPI. Some Scrapinghub customers who write their own projects may want to make use of it.

Improve message of validation error in JSON Schema

When we have a schema with additionalProperties set to False, the error message provided by JSONSchemaValidator is not useful, as it doesn't show which unexpected fields were found.

In [1]: from spidermon.contrib.validation import JSONSchemaValidator
   ...:
   ...: schema = {
   ...:     "$schema": "http://json-schema.org/draft-07/schema#",
   ...:     "additionalProperties": False,
   ...:     "type": "object",
   ...:     "properties": {
   ...:         "name": {"type": "string"}
   ...:     }
   ...: }
   ...:
   ...: validator = JSONSchemaValidator(schema)
   ...: validator.validate({"name": "Test Name"})
Out[1]: (True, {})

In [2]: validator.validate({"name": "Test Name", "email": "Unexpected email"})
Out[2]: (False, {'': ['Unexpected field']})

It would be better if we include the unexpected fields:

In [2]: validator.validate({"name": "Test Name", "email": "Unexpected email"})
Out[2]: (False, {'': ['Unexpected fields: "email"']})

Black to handle PEP 8 and all code formatting

I have noticed formatting issues here and there in PRs; I suggest looking at https://github.com/ambv/black.

The setup is simple: add a few lines of settings to tox.ini, autoformat the code with black ./, and add a line to CI to check that everything is OK.

The main benefit: it lets you focus on the code, it brings consistency, and the IDE integrations are good.
The main con for me: it's a pre-release package, and while it's very stable, a pre-release is a special case.

I am ready to help.

Add SPIDERMON_ENGINE_STOP_MONITORS

I want to monitor whether the file was uploaded to S3 or not. I've subclassed FeedExporter and added a flag to the stats when the file is uploaded. But the issue is that SPIDERMON_SPIDER_CLOSE_MONITORS runs all the monitors even before the file is uploaded, because FeedExporter and spidermon's spider-close monitors both run on the spider_closed signal.

To monitor this we need to run some monitors on the engine_stopped signal.
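
Roughly, the requested setting would mean wiring a suite to Scrapy's engine_stopped signal, along these lines (a sketch of the idea, not Spidermon's actual extension code):

from scrapy import signals

class EngineStopMonitoring:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(
            ext.engine_stopped, signal=signals.engine_stopped)
        return ext

    def engine_stopped(self):
        # Run the monitor suites configured in the proposed
        # SPIDERMON_ENGINE_STOP_MONITORS setting here; by the time this
        # signal fires, the spider_closed handlers (including the feed
        # upload) should have completed.
        pass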

Create a base monitor to compare with previous crawls

Often, we compare the current crawling result with a previous one.
The idea here is to create some kind of base monitor that would make it a bit easier.
Basically, we need:

  • Some sort of storage provider (we can use python-scrapinghub by default, but we need to make it extensible to any kind of storage)
  • A base monitor that would load the last crawl's data, provide an entry point to compare it with the current one, and store the current crawl's data.

Anything else you think is worth adding here?
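
To make it concrete, fetching the previous job's stats with python-scrapinghub could look roughly like this (the project/spider lookup details are assumptions):

from scrapinghub import ScrapinghubClient

def get_previous_job_stats(project_id, spider_name):
    client = ScrapinghubClient()  # API key taken from the environment
    project = client.get_project(project_id)
    # Grab the most recent finished job for this spider.
    for summary in project.jobs.iter(
            spider=spider_name, state='finished', count=1):
        job = client.get_job(summary['key'])
        return job.metadata.get('scrapystats')
    return None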

What is the difference between render_text_template and render_template?

Hi,

I would like to know what the intention of the render_text_template and render_template methods in /contrib/actions/templates.py is.

def render_text_template(self, template):
    template = Template(template)
    return template.render(self.get_template_context())

def render_template(self, template):
    template = self.get_template(template)
    return template.render(self.get_template_context())

Both look essentially the same; the only difference is Template(template) versus self.get_template(template).
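
For reference, that one-line difference does change behavior in Jinja2 terms (the calls below are illustrative):

# render_text_template() treats its argument as inline template source:
action.render_text_template('Found {{ errors }} errors')

# render_template() treats its argument as a template name, resolved by
# get_template() through the configured Jinja2 loader:
action.render_template('reports/email/monitors/result.jinja')

So the question stands: when is each one intended to be used?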

Create a CLI to set up very basic spidermon monitors and notifications

Some simple prompt questions that help newcomers set up a basic spidermon installation in an easy and quick way.

$ spidermon setup
$ To whom should I send the notification email: [email protected]
$ Do you want to enable Slack Notifications [Y/n]: y
$ Please provide a Slack API key to send notifications: ab213234_cHJscaJRFVN
Thanks for enabling the amazing Spidermon! You're good to go.

Merge development branch

Is the code in master branch being used somewhere?

If not, I say let's merge the development branch already, to avoid multiple versions being used all around.
