nike-inc / brickflow
Pythonic Programming Framework to orchestrate jobs in Databricks Workflow
Home Page: https://engineering.nike.com/brickflow/
License: Apache License 2.0
Describe the bug
Wheel artifact building should be disabled unless the project has a wheel task or some other setting indicating that a build is required.
Error: artifacts.whl.Build(databricks-bdr-llms): Failed exit status 1, output: /Users/sri.tikkireddy/PycharmProjects/BDR-AI-Chatbot/venv/lib/python3.10/site-packages/setuptools/installer.py:27: SetuptoolsDeprecationWarning: setuptools.installer is deprecated. Requirements should be satisfied by a PEP 517 installer.
warnings.warn(
error in databricks-bdr-llms setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers; Parse error at "'+https:/'": Expected string_end
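As a hedged aside, the parse error at "'+https:/'" is what setuptools produces when an install_requires entry is a bare VCS URL. PEP 508 requires the direct-reference form (a package name, then @, then the URL); a minimal illustration with a hypothetical package name:

```python
# A bare VCS URL such as "git+https://..." is not a valid PEP 508 requirement
# specifier and triggers the Parse error above. The direct-reference form is:
install_requires = [
    "somepackage @ git+https://example.com/org/somepackage.git",  # hypothetical package
]

# setuptools accepts "name @ url" entries; the name before "@" is what pip resolves.
```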
Databricks bundles now support wheel and artifact building; support that so we can submit wheel tasks.
Describe the bug
Since version 0.10.0, the command bf projects deploy no longer has the -w flag used to specify a single workflow name. The flag is also not mentioned in the docs.
To Reproduce
Steps to reproduce the behavior:
bf projects deploy -w myworkflow.py
Error: No such option: -w
Expected behavior
It should deploy only the single workflow "myworkflow.py".
Is your feature request related to a problem? Please describe.
As per the conversation in this PR, we need to deprecate BrickflowTriggerRule, since we now have native support in Bundles.
Cloud Information
Describe the solution you'd like
We will add a deprecation message in the next version, and in later releases we will remove it from the code base.
Describe alternatives you've considered
run_if is now natively available in Databricks bundles.
Is your feature request related to a problem? Please describe.
When a workflow run raises an error, capture the error in a wrapper and print the brickflow version alongside it.
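A minimal sketch of such a wrapper, assuming a decorator around task entrypoints; the version constant and decorator name are illustrative, not the actual brickflow API:

```python
import functools

BRICKFLOW_VERSION = "x.y.z"  # illustrative; the real value would come from package metadata


def report_version_on_error(fn):
    """Re-raise task failures with the brickflow version attached for easier triage."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as err:
            # Chain the original exception so the full traceback is preserved.
            raise RuntimeError(f"brickflow {BRICKFLOW_VERSION}: {err}") from err
    return wrapper
```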
Is your feature request related to a problem? Please describe.
Brickflow does not support the spark_jar_task feature yet. This is needed for any Databricks job that is executed using a JAR file.
Cloud Information
Describe the solution you'd like
Need to enable a task called spark_jar_task in task.py
Example:
@wf.spark_jar_task(libraries=[JarTaskLibrary(jar="dbfs:<location>.jar or s3://<location>.jar")])
def example_jar():
    return SparkJarTask(
        main_class_name="com.example.Main",
    )
Describe alternatives you've considered
Tried using a bash operator, but that solution is too complex.
Is your feature request related to a problem? Please describe.
When setting up alerting on top of brickflow, any manual pause or cancellation of the job will always trigger an alert, even when a job is cancelled manually just to start a new run with a new configuration. Databricks has an option in the UI, "Mute notifications for [canceled/skipped] runs", that would be useful in this case.
Cloud Information
Describe the solution you'd like
In the notification settings code, handle these specific mute parameters.
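The Databricks Jobs API exposes these mutes as boolean fields on the job's notification_settings object (no_alert_for_canceled_runs, no_alert_for_skipped_runs). A hedged sketch of a brickflow-side mapping; the dataclass and helper names are illustrative, not the actual brickflow API:

```python
from dataclasses import dataclass, asdict


@dataclass
class NotificationSettings:
    # Field names mirror the Jobs API notification_settings object.
    no_alert_for_canceled_runs: bool = False
    no_alert_for_skipped_runs: bool = False


def to_payload(settings: NotificationSettings) -> dict:
    # Only send flags that are enabled; defaults stay implicit in the job definition.
    return {k: v for k, v in asdict(settings).items() if v}
```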
Describe the bug
When instance_pool_id is already specified in the brickflow.engine.compute.Cluster dataclass, node_type_id should not be required.
Expected behavior
node_type_id should be Optional when instance_pool_id is specified.
To Reproduce
Create a new cluster with the configuration below:
Cluster(
    name="some_cluster_name",
    spark_version="any_spark_version",
    node_type_id="some_node_type_id",
    instance_pool_id="some_instance_pool_id",
    num_workers=1,
)
Running brickflow deploy produces the error below.
Starting resource deployment
Error: terraform apply: exit status 1
Error: cannot create job: The field 'node_type_id' cannot be supplied when an instance pool ID is provided.
  with databricks_job.ccs_atr_posted_to_doc_store,
  on bundle.tf.json line 323, in resource.databricks_job.ccs_atr_posted_to_doc_store:
 323:         },
Cause
Error: cannot create job: The field 'node_type_id' cannot be supplied when an instance pool ID is provided.
When node_type_id is omitted:
Cluster(
    name="some_cluster_name",
    spark_version="any_spark_version",
    instance_pool_id="some_instance_pool_id",
    num_workers=1,
)
The error below says node_type_id is required.
Traceback (most recent call last):
  File "/Users/BBraec/Documents/GitHub/nike-glix/trade-customs-emea-outbound/src/workflows/entrypoint.py", line 19, in main
    f.add_pkg(src.workflows)
  File "/Users/BBraec/Documents/GitHub/nike-glix/trade-customs-emea-outbound/.venv/lib/python3.9/site-packages/brickflow/engine/project.py", line 104, in add_pkg
    spec.loader.exec_module(mod)  # type: ignore
  File "<frozen importlib._bootstrap_external>", line 855, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/Users/BBraec/Documents/GitHub/nike-glix/trade-customs-emea-outbound/src/workflows/ccs_treq_released_workflow.py", line 40, in <module>
    job_cluster = create_job_cluster(
  File "/Users/BBraec/Documents/GitHub/nike-glix/trade-customs-emea-outbound/src/glix/clusters/create_cluster.py", line 25, in create_job_cluster
    job_cluster = Cluster(
TypeError: __init__() missing 1 required positional argument: 'node_type_id'
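A hedged sketch of the validation the dataclass could do instead: make both fields optional and enforce the mutual-exclusion rule at construction time. This is a simplified stand-in for brickflow.engine.compute.Cluster, not the actual class:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    # Simplified stand-in for brickflow.engine.compute.Cluster, for illustration only.
    name: str
    spark_version: str
    node_type_id: Optional[str] = None
    instance_pool_id: Optional[str] = None
    num_workers: int = 1

    def __post_init__(self) -> None:
        # Exactly one of node_type_id / instance_pool_id should drive node selection,
        # mirroring the Terraform-side error the deploy produced.
        if self.instance_pool_id and self.node_type_id:
            raise ValueError("node_type_id cannot be supplied when instance_pool_id is set")
        if not self.instance_pool_id and not self.node_type_id:
            raise ValueError("either node_type_id or instance_pool_id is required")
```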
Describe the bug
Brickflow should be removed as a dependency from entrypoint.py when the project is initialized.
To Reproduce
Steps to reproduce the behavior:
Create a new project using the command bf projects add and say yes to entrypoint file creation.
Expected behavior
Brickflow should not be added as a dependency to the entrypoint.py file
Cloud Information
Additional context
NA
Refactor task_dependency_sensor to support various auth mechanisms for Airflow, not just OAuth.
This concerns a Databricks task waiting for an Airflow task to finish executing.
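A minimal sketch of what pluggable auth could look like: a small protocol that each mechanism implements, so the sensor only asks for headers. The class and method names here are illustrative assumptions, not the current brickflow API:

```python
import base64
from typing import Protocol


class AirflowAuth(Protocol):
    # Hypothetical interface: any auth mechanism just needs to produce request headers.
    def auth_headers(self) -> dict: ...


class BasicAuth:
    def __init__(self, user: str, password: str) -> None:
        self.user, self.password = user, password

    def auth_headers(self) -> dict:
        # Standard HTTP Basic auth: base64("user:password").
        token = base64.b64encode(f"{self.user}:{self.password}".encode()).decode()
        return {"Authorization": f"Basic {token}"}


class BearerAuth:
    def __init__(self, token: str) -> None:
        self.token = token

    def auth_headers(self) -> dict:
        return {"Authorization": f"Bearer {self.token}"}
```

The sensor would then accept any AirflowAuth implementation, so adding a new mechanism (e.g. a session cookie) does not require touching the sensor itself.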
Is your feature request related to a problem? Please describe.
As a deploying engineer, I want to have control over the run state of a new workflow, rather than allowing the job to assume schedule on publish.
Cloud Information
Describe the solution you'd like
On deployment, I would like to specify the state of the workflow as it's published to Databricks. This is to accommodate a trust-but-verify approach to higher-environment deployments, where an engineer can validate the state of the workflow before setting it to the UNPAUSED state.
Describe alternatives you've considered
I have a custom deployment tool that accommodates this status configuration. Additionally, I see that the pydantic models in brickflow.bundles.model acknowledge that the scheduling state is configurable.
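In Databricks job schedules this state is the pause_status field (PAUSED / UNPAUSED). A hedged sketch of a deploy-time override, assuming a hypothetical environment variable as the control knob:

```python
import os


def resolve_pause_status(default: str = "PAUSED") -> str:
    # Defaulting to PAUSED on deploy lets an engineer verify the workflow
    # before enabling it. WORKFLOW_PAUSE_STATUS is an illustrative variable name.
    status = os.environ.get("WORKFLOW_PAUSE_STATUS", default).upper()
    if status not in ("PAUSED", "UNPAUSED"):
        raise ValueError(f"invalid pause_status: {status}")
    return status
```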
Is your feature request related to a problem? Please describe.
Having an Autosys sensor operator in the workflow would help us place a dependency on Autosys jobs when necessary.
Cloud Information
Describe the solution you'd like
An Autosys sensor operator that takes the necessary parameters, pokes the Autosys API, checks the API response, and exits the process marking the task as successful if the specified conditions are met.
If not, it waits for the given poke interval and then repeats the same process until the conditions are satisfied or it times out.
Describe alternatives you've considered
NA
Additional context
NA
Is your feature request related to a problem? Please describe.
Databricks now supports creating a "Run Job" task, which can trigger another job by its job_id. It would be nice to have this feature in BrickFlow.
Cloud Information
Describe the solution you'd like
Add a new run_job_task type. It would be nice to have something like this:
@wf.run_job_task
def trigger_downstream_job():
    return RunJobTask(
        job_id="12345",
    )
Describe alternatives you've considered
I've looked at invoking dependent jobs with the Databricks API.
Describe the bug
The Task Dependency sensor (brickflow_plugins/airflow/operators/external_tasks/TaskDependencySensor), which pings upstream Airflow clusters for state (success, failure, etc.), doesn't account for the execution delta window: when we ping the Airflow cluster, it pokes for a success at the same timestamp instead of within the execution delta window.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It should succeed once a successful execution is found for the DAG within the execution window.
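The expected behavior amounts to shifting the lookup timestamp before poking. A minimal sketch of the Airflow-style execution_delta convention, for illustration only:

```python
from datetime import datetime, timedelta


def target_execution_date(run_date: datetime, execution_delta: timedelta) -> datetime:
    # Airflow-style convention: look for the upstream run shifted back by
    # execution_delta, rather than an upstream run at the exact same timestamp.
    return run_date - execution_delta
```

For example, a downstream run at 06:00 with an execution_delta of 25 hours should poke for the upstream run of the previous day at 05:00.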
Cloud Information
Describe the bug
When we use WorkFlowDependencySensor, it should ideally trigger the task once the dependent workflow has succeeded, but it is not working as expected.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
If wf2 is poking wf1 and wf1 is in a failed state, wf2 should wait until the wf1 issue is fixed and wf1 has run successfully.
Describe the bug
bf deploy --force-acquire-lock isn't mapping to the right bundles arg.
Fix
brickflow/brickflow/cli/bundles.py
Lines 89 to 93 in 273fe7e
To Reproduce
The bundle flags are --force-deploy and --force, not --force-lock.
Deploy bundle
Usage:
databricks bundle deploy [flags]
Flags:
-c, --compute-id string Override compute in the deployment with the given compute ID.
--force Force-override Git branch validation.
--force-deploy Force acquisition of deployment lock.
-h, --help help for deploy
Global Flags:
-e, --environment string bundle environment to use (if applicable)
--log-file file file to write logs to (default stderr)
--log-format type log output format (text or json) (default text)
--log-level format log level (default disabled)
-o, --output type output type: text or json (default text)
-p, --profile string ~/.databrickscfg profile
--progress-format format format for progress logs (append, inplace, json) (default default)
--var strings set values for variables defined in bundle config. Example: --var="foo=bar"
Is your feature request related to a problem? Please describe.
As a data engineer, I would like the ability to execute a Python script directly from a task, without additional wrapping via a notebook (entrypoint.py).
Describe the solution you'd like
A Python script task type is available and can be used to run a Python script.
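Databricks jobs expose this as a spark_python_task. A hedged sketch of the payload brickflow would need to emit for such a task (the helper function and file path are illustrative):

```python
def spark_python_task(python_file: str, parameters: list) -> dict:
    # Minimal sketch of the Jobs API task payload for running a script directly,
    # with no notebook wrapper in between. Paths here are hypothetical.
    return {"spark_python_task": {"python_file": python_file, "parameters": parameters}}


task = spark_python_task("src/scripts/my_script.py", ["--env", "dev"])
```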
Describe the bug
When a new project is created with just the entrypoint.py file, brickflow deploy fails instead of skipping the deployment.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Ideally it should skip the deployment or mark it as successful when no files are synthesized.
bf projects deploy --env dev --project <<project_name>> --auto-approve
Is your feature request related to a problem? Please describe.
I would like to invoke Delta Live Table from brickflow
Cloud Information
Describe the solution you'd like
Currently, DLT is deployed in Databricks as a wheel file. I would like to deploy the same DLT wheel file using brickflow.
Describe alternatives you've considered
Additional context
I am facing a certificate validation error while deploying the workflow to Databricks using the command below from my Databricks CLI terminal.
"brickflow projects deploy --project hello-world-brickflow -e local"
I followed the steps mentioned under the Getting started section on GitHub (https://github.com/Nike-Inc/brickflow/tree/main).
Steps 1 through 6 completed without any error, but while deploying the workflow using the above command I get a certificate validation error. Databricks token configuration with the CLI was successful, and I am able to list the databricks-dbfs files from the CLI without any issues, but I am not sure why I get this error while deploying the workflow when all the other steps complete as expected. I haven't used Docker for the setup. Kindly advise what could be the possible resolution for this error.
Please find below the screenshot of the error.
Is your feature request related to a problem? Please describe.
JobsHealth and JobsHealthRules are implemented in the Bundles model but are missing from Brickflow's util core (src/core/brickflow_utils.py); this means Brickflow cannot be used to implement runtime timeout warning notifications.
Cloud Information
Describe alternatives you've considered
The alternative to this would be to not use brickflow
Additional context
health: Optional[JobsHealth] = None
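For context, a hedged sketch of the Jobs API health block that the Optional[JobsHealth] field above would carry through; RUN_DURATION_SECONDS and GREATER_THAN are the Jobs API metric and operator names, and the threshold value here is illustrative:

```python
# Sketch of the Jobs API `health` block brickflow would need to pass through
# so a job can emit a runtime warning notification before it actually fails.
health = {
    "rules": [
        {
            "metric": "RUN_DURATION_SECONDS",
            "op": "GREATER_THAN",
            "value": 3600,  # warn when a run exceeds one hour (example threshold)
        }
    ]
}
```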
Describe the bug
Sometimes you may get a permissions issue when trying to list envs in UC shared clusters
Fix
Ignore failures when trying to list directories. Assume that if a folder is restricted, the whole folder tree can be ignored.
Importing twice fixes it because the module is then cached.
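A minimal sketch of the proposed fix, assuming a small wrapper around Path.iterdir that swallows permission failures (the function name is illustrative):

```python
from pathlib import Path
from typing import Iterator


def safe_iterdir(path: Path) -> Iterator[Path]:
    # Yield directory entries, silently skipping trees we are not allowed to list
    # (e.g. restricted /local_disk0/.ephemeral_nfs/envs dirs on UC shared clusters).
    try:
        yield from path.iterdir()
    except PermissionError:
        return
```

The resolver's root-search loop could then iterate with safe_iterdir instead of path.iterdir(), so a restricted folder simply contributes no candidates rather than aborting the import.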
To Reproduce
Import brickflow to reproduce on shared cluster
PermissionError: [Errno 13] Permission denied: '/local_disk0/.ephemeral_nfs/envs'
PermissionError Traceback (most recent call last)
File , line 2
1 from click.testing import CliRunner
----> 2 from brickflow.cli import projects
3 # runner = CliRunner()
4 # projects.add
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-d104ca58-ca0a-4058-9a03-985dc06bd6ae/lib/python3.10/site-packages/brickflow/__init__.py:332
285 all: List[str] = [
286 "ctx",
287 "get_bundles_project_env",
(...)
327 "BrickflowProjectDeploymentSettings",
328 ]
330 # auto path resolver
--> 332 get_relative_path_to_brickflow_root()
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-d104ca58-ca0a-4058-9a03-985dc06bd6ae/lib/python3.10/site-packages/brickflow/resolver/__init__.py:68, in get_relative_path_to_brickflow_root()
66 for path in paths:
67 try:
---> 68 resolved_path = go_up_till_brickflow_root(path)
69 _ilog.info("Brickflow root input path - %s", path)
70 _ilog.info("Brickflow root found - %s", resolved_path)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-d104ca58-ca0a-4058-9a03-985dc06bd6ae/lib/python3.10/site-packages/brickflow/resolver/__init__.py:45, in go_up_till_brickflow_root(cur_path)
39 valid_roots = [
40 BrickflowProjectConstants.DEFAULT_MULTI_PROJECT_ROOT_FILE_NAME.value,
41 BrickflowProjectConstants.DEFAULT_MULTI_PROJECT_CONFIG_FILE_NAME.value,
42 ]
44 # recurse to see if there is a brickflow root and return the path
---> 45 while not path.is_dir() or not any(
46 file.name in valid_roots for file in path.iterdir()
47 ):
48 path = path.parent
50 if path == path.parent:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-d104ca58-ca0a-4058-9a03-985dc06bd6ae/lib/python3.10/site-packages/brickflow/resolver/__init__.py:45, in <genexpr>(.0)
39 valid_roots = [
40 BrickflowProjectConstants.DEFAULT_MULTI_PROJECT_ROOT_FILE_NAME.value,
41 BrickflowProjectConstants.DEFAULT_MULTI_PROJECT_CONFIG_FILE_NAME.value,
42 ]
44 # recurse to see if there is a brickflow root and return the path
---> 45 while not path.is_dir() or not any(
46 file.name in valid_roots for file in path.iterdir()
47 ):
48 path = path.parent
50 if path == path.parent:
File /usr/lib/python3.10/pathlib.py:1017, in Path.iterdir(self)
1013 def iterdir(self):
1014 """Iterate over the files in this directory. Does not yield any
1015 result for the special paths '.' and '..'.
1016 """
-> 1017 for name in self._accessor.listdir(self):
1018 if name in {'.', '..'}:
1019 # Yielding a path object for these makes little sense
1020 continue
Task Type is: run_job_task
If run_job_task, indicates that this task must execute another job.
https://docs.databricks.com/api/workspace/jobs/create
Job managed by brickflow
wf = Workflow(...)

@wf.job_task
def run_job():
    from some_some_folder import wf
    return wf
Jobs not managed by brickflow
wf = Workflow(...)

@wf.job_task
def run_job():
    return Workflows.from_name("...")
Is your feature request related to a problem? Please describe.
Reading through the project, it appears that workflow deployments are intended for execution within the Databricks environment (phrasing of docs/highlevel.md: "...help deploy and manage workflows *in* Databricks"). I would like to leverage this framework to handle deployments coming from an execution environment such as GitHub Actions or another serverless execution function.
Cloud Information
Describe the solution you'd like
To fit the mantra of code/git-first deployments, it would be helpful to further elaborate on deployments outside of the Databricks environment (leveraging GitHub Actions to handle deploys that are less state-dependent on the compute engine and more state-dependent on the generated workflows).
If deployments leveraging such an external compute environment are achievable, further documenting deployment execution with this framework will greatly improve onboarding for interested end users.
Describe alternatives you've considered & Additional context
I am currently using a YAML configuration framework that determines deployment go/no-go for projects by leveraging process-generated tags on the workflow object. It checks whether the tag in the workflow's repo configuration matches the tag tied to the active workflow in Databricks: if the tags match, deployment of the workflow is skipped; if they differ, the workflow is deployed by overwriting the existing workflow in Databricks. Because the evaluation happens at runtime, no additional state capture is required, as the versions are captured in code.
^^ Reason for the case summary: I want to know whether I can employ this framework in a similar manner, where configuration happens outside of Databricks and deployment state details captured in the name/tags of the active workflow can drive new workflow creation, updates, and deletions.
Additional context
This is as much a question ticket for the project as it is a feature request. I am looking to understand whether it is feasible to employ this framework as the engine for our workflows, given the criteria it needs to meet for my use case.
Is your feature request related to a problem? Please describe.
Right now we can only deploy one project at a time. We should be able to deploy/destroy all the projects at once.
Cloud Information
Describe the solution you'd like
We should be able to run the deployment of all projects in a git repo using the below commands
bf projects deploy-all --env {}
- This would deploy all the workflows in all projects in the git repo
bf projects destroy-all --env {}
- This would destroy all the workflows in all projects in the git repo
Describe alternatives you've considered
NA
Additional context
NA
Is your feature request related to a problem? Please describe.
Indicate clearly in the docs how to update poetry, and anywhere else brickflow versions may be used, such as the pyproject.toml and poetry.lock files [Upgrade md]
Indicate which files should be gitignored and which files must not be gitignored
Add an FAQ entry for "brickflow is not imported" and import errors for modules