
Comments (16)

rshewitt avatar rshewitt commented on July 20, 2024 1

@jbrown-xentity and I spoke about the load function and we think we can nix it in the harvesting logic repo. in other words, a dag won't be calling something like harvester.load(*args). instead, the load function will reside in the dag itself. it could look something like this @btylerburton

def load(harvest_datasets, operation):
    # ckan_url and api_key would come from the dag's config/env
    ckan = harvester.create_ckan_entrypoint(ckan_url, api_key)
    # dispatch table: one harvester function per load operation
    operations = {
        "delete": harvester.purge_ckan_package,
        "create": harvester.create_ckan_package,
        "update": harvester.update_ckan_package,
    }
    for dataset in harvest_datasets:
        operations[operation](ckan, dataset)

from data.gov.

btylerburton avatar btylerburton commented on July 20, 2024 1

> since we don't have a rollback in the event of a partial completion this could be fine? do we need a rollback in the event of let's say 2/8 dataset creations fail?

This brings up a good question around error reporting/tracking. I've been thinking about this in regards to the DCAT pipeline. We should discuss as a team how we want to handle things.

Take this case:

[record to be created] -> validation [fails] -> load [skipped]

or

[record to be transformed] -> transform [fails] -> validation [skipped] -> load [skipped]

Airflow is happiest when I put a skip exception in at the failure step, which allows it to skip any downstream tasks gracefully, but this also means that the pipeline is "green", so we need a way of recording/handling those exceptions.

It's easy enough to log the failures, we just need to know what to do with them.
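a rough sketch of what recording those failures could look like (plain python, not tied to airflow's API; the step and record shapes here are made up for illustration). the point is that each failure gets logged and collected per record, so a summary task can report on them even when the pipeline itself shows green:

```python
import logging

logger = logging.getLogger("harvest")

def run_step(step, record, failures):
    # run one pipeline step; on failure, log it and record it so a
    # downstream summary task can report/notify even though the
    # record itself is skipped
    try:
        return step(record)
    except Exception as exc:
        logger.error("step %s failed for %s: %s",
                     step.__name__, record["identifier"], exc)
        failures.append({
            "identifier": record["identifier"],
            "step": step.__name__,
            "error": str(exc),
        })
        return None  # downstream steps treat None as "skip this record"

def validate(record):
    # toy validation step for the sketch
    if "title" not in record:
        raise ValueError("missing title")
    return record

failures = []
run_step(validate, {"identifier": "id1"}, failures)
failures  # >> [{'identifier': 'id1', 'step': 'validate', 'error': 'missing title'}]
```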


rshewitt avatar rshewitt commented on July 20, 2024 1

i'm gonna process all our current harvest sources through the harvesting logic code to fix any issues on my end ( excluding any real creation, deletion, or updating )


btylerburton avatar btylerburton commented on July 20, 2024

I like that enhancement. Will harvesting logic still return three separate lists of datasets?


rshewitt avatar rshewitt commented on July 20, 2024

yeah that's what i'm thinking. the reason being that we could potentially expand each load-type ( i.e. create, update, delete) like this?

compare_result = compare(harvest_source, ckan_source) 
"""
compare_result = {
  "create": [ id1, id2, id3],
  "update": [ id5, id6, id7],
  "delete": [ id10, id15, id14]
}
"""

load.expand( compare_result["create"], "create" ) 
load.expand( compare_result["delete"], "delete" )
load.expand( compare_result["update"], "update" )

since we don't have a rollback in the event of a partial completion this could be fine? do we need a rollback in the event of let's say 2/8 dataset creations fail?


rshewitt avatar rshewitt commented on July 20, 2024

pipeline test sketch as a reference. the changes are untested. it's meant to serve as an aggregate of @robert-bryson's work with the classes and to show the order of operations similar to how they could be in airflow (not literal, more of a workflow sketch)


jbrown-xentity avatar jbrown-xentity commented on July 20, 2024

> It's easy enough to log the failures, we just need to know what to do with them.

This is what we should do with them: #4582


rshewitt avatar rshewitt commented on July 20, 2024

using a data validator like pydantic against our proposed classes could be valuable. this could add teeth to our type hints as well @robert-bryson
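as a rough stdlib-only sketch of the idea (pydantic would do this for us properly at runtime; the `Record` fields here are hypothetical, not our actual design), a dataclass can enforce its own annotations in `__post_init__`:

```python
from dataclasses import dataclass

@dataclass
class Record:
    identifier: str
    metadata: dict

    def __post_init__(self):
        # poor man's pydantic: check each value against its annotation
        for name, expected in self.__annotations__.items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )

Record("abc-123", {"title": "test"})  # fine
# Record("abc-123", "not a dict")     # raises TypeError
```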


rshewitt avatar rshewitt commented on July 20, 2024

using dataclasses could be a nice way to treat our classes. the equality and hash methods that come with them seem relevant
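for example (toy `Record`, not our actual class), a dataclass gets field-by-field equality by default, and `frozen=True` makes instances hashable, which is handy for deduping and set arithmetic on records:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    identifier: str
    source_hash: str

a = Record("id1", "abc")
b = Record("id1", "abc")

a == b       # True: compared field-by-field, not by identity
len({a, b})  # 1: frozen=True makes instances hashable, so sets dedupe them
```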


rshewitt avatar rshewitt commented on July 20, 2024

making use of the property decorator seems like it could give us more control of how we set, get, and/or delete attributes in our classes. the more restrictive we are with setting attributes, the fewer headaches down the road ( or will it cause more? )
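a minimal sketch of that (hypothetical `HarvestSource` with a validated url setter):

```python
class HarvestSource:
    def __init__(self, url):
        self.url = url  # goes through the setter below

    @property
    def url(self):
        return self._url

    @url.setter
    def url(self, value):
        # reject bad values at assignment time instead of failing later
        if not isinstance(value, str) or not value.startswith("http"):
            raise ValueError(f"invalid url: {value!r}")
        self._url = value

s = HarvestSource("https://catalog.data.gov")
# s.url = "not-a-url"  # raises ValueError
```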


rshewitt avatar rshewitt commented on July 20, 2024

if we intend to store an instance of a class we need to make sure it's serializable. the dataclasses module comes with an asdict function which may do what we want. take for example

import dataclasses
import json
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class A:
    age: int

@dataclass
class B:
    records: Dict = field(default_factory=dict)

a = A(25)
b = B()

b.records["a"] = a
dataclasses.asdict(b)  # >> {'records': {'a': {'age': 25}}}
json.dumps(dataclasses.asdict(b))  # >> '{"records": {"a": {"age": 25}}}'

the asdict function will conveniently unpack the nested instance of A by its fields. we can then serialize this output into a json str. this example is meant to be a simplified version of our Record in Source design.

here's an example of what happens when I try to use asdict on b when its records dict contains something that's not serializable:

ckan = ckanapi.RemoteCKAN("path\to\ckan\endpoint", apikey="api_key")
b.records["ckan"] = ckan  # >> b.records is now {'a': A(age=25), 'ckan': <ckanapi.remoteckan.RemoteCKAN object at 0x1010b4cd0>}
dataclasses.asdict(b)  # >> raises "TypeError: ActionShortcut.__getattr__.<locals>.action() takes 0 positional arguments but 1 was given"


rshewitt avatar rshewitt commented on July 20, 2024

for continuity and organization i'm going to convert our tests to classes. i've also added a ckan extract test class. some tests could be...

  • using the wrong url or apikey in ckanapi.RemoteCKAN( url, apikey )
    • there's no verification when you run this function, so I can pass ckanapi.RemoteCKAN( "a", "b" ) and it will return just fine. it will only complain when i try to do something with the return ( e.g. ckanapi.RemoteCKAN( "a", "b" ).action.package_create(**kwargs) won't work ).
  • ckan being unavailable
  • getting a proper return and converting it to {"identifier": "hash"} format

a test that could be useful for the compare is when ckan returns nothing, suggesting everything needs to be created
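the {"identifier": "hash"} conversion could look something like this (field names and hashing scheme are assumptions for illustration; the idea is a stable content hash per record so compare can diff harvest source vs. ckan cheaply):

```python
import hashlib
import json

def to_identifier_hash(packages):
    # map each dataset's identifier to a stable hash of its contents;
    # sort_keys makes the json (and therefore the hash) deterministic
    out = {}
    for pkg in packages:
        blob = json.dumps(pkg, sort_keys=True).encode("utf-8")
        out[pkg["identifier"]] = hashlib.sha256(blob).hexdigest()
    return out

to_identifier_hash([{"identifier": "id1", "title": "a"}])
# >> {'id1': '<sha256 hex digest>'}
```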


rshewitt avatar rshewitt commented on July 20, 2024

draft pr. this pr may be open for an extended period of time to allow for fixes addressing issues @btylerburton encounters as he tests the module out in airflow. or maybe that should be a separate ticket?


rshewitt avatar rshewitt commented on July 20, 2024

ran a test load on catalog-dev admin. number of datasets has increased from 345 to 720. this test was run on my machine and not in airflow.


rshewitt avatar rshewitt commented on July 20, 2024

log of last night's load test of dcatus. processing seems to have been held up at 00:03:16,472
harvest_load.log


nickumia avatar nickumia commented on July 20, 2024

> Airflow is happiest when I put a skip exception in at the failure step, which allows it to skip any downstream tasks gracefully, but this also means that the pipeline is "green", so we need a way of recording/handling those exceptions.

@btylerburton What happens when you use an AirflowFailException? It still runs the tasks afterwards?? 😱 Or is the concern that the subsequent tasks show up as failed (even when they're not run)? I ran into a similar issue when running github actions. This doesn't completely answer the question, but having robust conditions and clear logic is important.

Maybe this reference might provide some ideas to the team: https://www.restack.io/docs/airflow-knowledge-airflow-skip-exception-guide (i.e. having a fallback mechanism to do the email notification might meet the needs?)

