
Comments (16)

rshewitt avatar rshewitt commented on July 20, 2024 1

@jbrown-xentity and I spoke about the load function and we think we can nix it in the harvesting logic repo. in other words, a dag won't be calling something like harvester.load(*args). instead, the load function will reside in the dag itself. it could look something like this @btylerburton

def load(harvest_datasets, operation):
    # ckan_url and api_key would come from the dag's config/env
    ckan = harvester.create_ckan_entrypoint(ckan_url, api_key)
    # dispatch table: one harvester function per load operation
    operations = {
        "delete": harvester.purge_ckan_package,
        "create": harvester.create_ckan_package,
        "update": harvester.update_ckan_package,
    }
    for dataset in harvest_datasets:
        operations[operation](ckan, dataset)

from data.gov.

btylerburton avatar btylerburton commented on July 20, 2024 1

> since we don't have a rollback in the event of a partial completion this could be fine? do we need a rollback in the event of let's say 2/8 dataset creations fail?

This brings up a good question around error reporting/tracking. I've been thinking about this in regards to the DCAT pipeline. We should discuss as a team how we want to handle things.

Take this case:

[record to be created] -> validation [fails] -> load [skipped]

or

[record to be transformed] -> transform [fails] -> validation [skipped] -> load [skipped]

Airflow is happiest when I put a skip exception in at the failure step, which allows it to skip any downstream tasks gracefully, but this also means that the pipeline is "green", so we need a way of recording/handling those exceptions.

It's easy enough to log the failures, we just need to know what to do with them.
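a rough sketch of what recording those failures could look like (plain python, not tied to airflow's API; the step and record shapes here are made up for illustration). the point is that each failure gets logged and collected per record, so a summary task can report on them even when the pipeline itself shows green:

```python
import logging

logger = logging.getLogger("harvest")

def run_step(step, record, failures):
    # run one pipeline step; on failure, log it and record it so a
    # downstream summary task can report/notify even though the
    # record itself is skipped
    try:
        return step(record)
    except Exception as exc:
        logger.error("step %s failed for %s: %s",
                     step.__name__, record["identifier"], exc)
        failures.append({
            "identifier": record["identifier"],
            "step": step.__name__,
            "error": str(exc),
        })
        return None  # downstream steps treat None as "skip this record"

def validate(record):
    # toy validation step for the sketch
    if "title" not in record:
        raise ValueError("missing title")
    return record

failures = []
run_step(validate, {"identifier": "id1"}, failures)
failures  # >> [{'identifier': 'id1', 'step': 'validate', 'error': 'missing title'}]
```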


rshewitt avatar rshewitt commented on July 20, 2024 1

i'm gonna process all our current harvest sources through the harvesting logic code to fix any issues on my end ( excluding any real creation, deletion, or updating )


btylerburton avatar btylerburton commented on July 20, 2024

I like that enhancement. Will harvesting logic still return three separate lists of datasets?


rshewitt avatar rshewitt commented on July 20, 2024

yeah that's what i'm thinking. the reason being that we could potentially expand each load-type ( i.e. create, update, delete) like this?

compare_result = compare(harvest_source, ckan_source) 
"""
compare_result = {
  "create": [ id1, id2, id3],
  "update": [ id5, id6, id7],
  "delete": [ id10, id15, id14]
}
"""

load.expand( compare_result["create"], "create" ) 
load.expand( compare_result["delete"], "delete" )
load.expand( compare_result["update"], "update" )

since we don't have a rollback in the event of a partial completion this could be fine? do we need a rollback in the event of let's say 2/8 dataset creations fail?


rshewitt avatar rshewitt commented on July 20, 2024

pipeline test sketch as a reference. the changes are untested. it's meant to serve as an aggregate of @robert-bryson's work with the classes and to show the order of operations similar to how they could be in airflow (not literal, more of a workflow sketch)


jbrown-xentity avatar jbrown-xentity commented on July 20, 2024

> It's easy enough to log the failures, we just need to know what to do with them.

This is what we should do with them: #4582


rshewitt avatar rshewitt commented on July 20, 2024

using a data validator like pydantic against our proposed classes could be valuable. this could add teeth to our type hints as well @robert-bryson
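as a rough stdlib-only sketch of the idea (pydantic would do this for us properly at runtime; the `Record` fields here are hypothetical, not our actual design), a dataclass can enforce its own annotations in `__post_init__`:

```python
from dataclasses import dataclass

@dataclass
class Record:
    identifier: str
    metadata: dict

    def __post_init__(self):
        # poor man's pydantic: check each value against its annotation
        for name, expected in self.__annotations__.items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )

Record("abc-123", {"title": "test"})  # fine
# Record("abc-123", "not a dict")     # raises TypeError
```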


rshewitt avatar rshewitt commented on July 20, 2024

using dataclasses could be a nice way to treat our classes. the equality and hash methods that come with them seem relevant
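for example (toy `Record`, not our actual class), a dataclass gets field-by-field equality by default, and `frozen=True` makes instances hashable, which is handy for deduping and set arithmetic on records:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    identifier: str
    source_hash: str

a = Record("id1", "abc")
b = Record("id1", "abc")

a == b       # True: compared field-by-field, not by identity
len({a, b})  # 1: frozen=True makes instances hashable, so sets dedupe them
```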


rshewitt avatar rshewitt commented on July 20, 2024

making use of the property decorator seems like it could give us more control of how we set, get, and/or delete attributes in our classes. the more restrictive we are with setting attributes, the fewer headaches down the road ( or will it cause more? )
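a minimal sketch of that (hypothetical `HarvestSource` with a validated url setter):

```python
class HarvestSource:
    def __init__(self, url):
        self.url = url  # goes through the setter below

    @property
    def url(self):
        return self._url

    @url.setter
    def url(self, value):
        # reject bad values at assignment time instead of failing later
        if not isinstance(value, str) or not value.startswith("http"):
            raise ValueError(f"invalid url: {value!r}")
        self._url = value

s = HarvestSource("https://catalog.data.gov")
# s.url = "not-a-url"  # raises ValueError
```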


rshewitt avatar rshewitt commented on July 20, 2024

if we intend to store an instance of a class we need to make sure it's serializable. the dataclasses module comes with an asdict function which may do what we want. take for example

import dataclasses
import json
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class A:
    age: int

@dataclass
class B:
    records: Dict = field(default_factory=dict)

a = A(25)
b = B()

b.records["a"] = a
dataclasses.asdict(b)  # >> {'records': {'a': {'age': 25}}}
json.dumps(dataclasses.asdict(b))  # >> '{"records": {"a": {"age": 25}}}'

the asdict function will conveniently unpack the nested instance of A by its fields. we can then serialize this output into a json str. this example is meant to be a simplified version of our Record in Source design.

here's an example of what happens when I try to use asdict on b when its records dict contains something that's not serializable:

ckan = ckanapi.RemoteCKAN("path\to\ckan\endpoint", apikey="api_key")
b.records["ckan"] = ckan  # >> b.records is now {'a': A(age=25), 'ckan': <ckanapi.remoteckan.RemoteCKAN object at 0x1010b4cd0>}
dataclasses.asdict(b)  # >> raises "TypeError: ActionShortcut.__getattr__.<locals>.action() takes 0 positional arguments but 1 was given"


rshewitt avatar rshewitt commented on July 20, 2024

for continuity and organization i'm going to convert our tests to classes. i've also added a ckan extract test class. some tests could be...

  • using the wrong url or apikey in ckanapi.RemoteCKAN( url, apikey )
    • there's no verification when you run this function, so I can pass ckanapi.RemoteCKAN( "a", "b" ) and it will return just fine. it will only complain when i try to do something with the return ( e.g. ckanapi.RemoteCKAN( "a", "b" ).action.package_create(**kwargs) won't work ).
  • ckan being unavailable
  • getting a proper return and converting it to {"identifier": "hash"} format

a test that could be useful for the compare is when ckan returns nothing, suggesting everything needs to be created
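the {"identifier": "hash"} conversion could look something like this (field names and hashing scheme are assumptions for illustration; the idea is a stable content hash per record so compare can diff harvest source vs. ckan cheaply):

```python
import hashlib
import json

def to_identifier_hash(packages):
    # map each dataset's identifier to a stable hash of its contents;
    # sort_keys makes the json (and therefore the hash) deterministic
    out = {}
    for pkg in packages:
        blob = json.dumps(pkg, sort_keys=True).encode("utf-8")
        out[pkg["identifier"]] = hashlib.sha256(blob).hexdigest()
    return out

to_identifier_hash([{"identifier": "id1", "title": "a"}])
# >> {'id1': '<sha256 hex digest>'}
```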


rshewitt avatar rshewitt commented on July 20, 2024

draft pr. this pr may be open for an extended period of time to allow for fixes addressing issues @btylerburton encounters as he tests the module out in airflow. or maybe that should be a separate ticket?


rshewitt avatar rshewitt commented on July 20, 2024

ran a test load on catalog-dev admin. number of datasets has increased from 345 to 720. this test was run on my machine and not in airflow.


rshewitt avatar rshewitt commented on July 20, 2024

log of last night's load test of dcatus. processing seems to have been held up at 00:03:16,472
harvest_load.log


nickumia avatar nickumia commented on July 20, 2024

> Airflow is happiest when I put a skip exception in at the failure step, which allows it to skip any downstream tasks gracefully, but this also means that the pipeline is "green", so we need a way of recording/handling those exceptions.

@btylerburton What happens when you use an AirflowFailException? It still runs the tasks afterwards?? 😱 Or is the concern that the subsequent tasks show up as failed (even when they're not run)? I ran into a similar issue when running github actions. This doesn't completely answer the question, but having robust conditions and clear logic is important.

Maybe this reference might provide some ideas to the team: https://www.restack.io/docs/airflow-knowledge-airflow-skip-exception-guide (i.e. having a fallback mechanism to do the email notification might meet the needs?)

