cuebook / cuelake

Use SQL to build ELT pipelines on a data lakehouse.

Home Page: https://cuelake.cuebook.ai

License: Apache License 2.0

Python 39.30% JavaScript 44.35% HTML 0.19% CSS 11.04% Dockerfile 0.21% Shell 2.62% SCSS 2.30%
apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook

cuelake's People

Contributors

ankitkpandey, prabhu31, praveencuebook, sachinkbansal, vikrantcue, vincue


cuelake's Issues

Fix hive metastore issues

Test the following scenarios on the Hive metastore for Iceberg, Delta, and Parquet tables:

  • Data should be created in the warehouse directory given via the environment variable
  • When a table is dropped, its data should be deleted

Test and fix the behaviour of the Hive metastore on both S3 and GCS.

Use the latest versions of the Iceberg and Delta jars, and upgrade the Spark version if required.
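For reference, a minimal Spark configuration for exercising these scenarios might look like the following sketch. The catalog name `iceberg` and the warehouse path are placeholders, and the warehouse value is what would be injected from the environment variable at deploy time:

```
# spark-defaults.conf (sketch)
spark.sql.extensions                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.iceberg            org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type       hive
# Warehouse directory taken from an env variable at deploy time
spark.sql.catalog.iceberg.warehouse  s3a://my-bucket/warehouse
spark.sql.catalog.spark_catalog      org.apache.spark.sql.delta.catalog.DeltaCatalog
```

With this in place, `DROP TABLE iceberg.db.t` can be checked against the warehouse directory to confirm the data files are actually removed.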

Improve logs UI

Currently, logs are rendered as raw JSON dumps. Port the parser code from Zeppelin into CueLake so that the logs look the same as they do in Zeppelin.
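As a minimal sketch of the idea, the renderer would flatten a Zeppelin paragraph result into readable lines instead of dumping the JSON. The result shape assumed here (`{"code": ..., "msg": [{"type": ..., "data": ...}]}`) is based on Zeppelin's REST API and is an assumption, not CueLake code:

```python
import json

def render_zeppelin_result(raw: str) -> str:
    """Flatten a Zeppelin paragraph result (JSON) into plain text.

    Assumes the Zeppelin REST result shape:
    {"code": ..., "msg": [{"type": ..., "data": ...}]}.
    """
    result = json.loads(raw)
    lines = [f"status: {result.get('code', 'UNKNOWN')}"]
    for chunk in result.get("msg", []):
        # Each message chunk carries its payload in "data"
        lines.append(chunk.get("data", ""))
    return "\n".join(lines)

raw = '{"code": "SUCCESS", "msg": [{"type": "TEXT", "data": "42 rows affected"}]}'
print(render_zeppelin_result(raw))
```

A fuller port would also dispatch on the chunk `type` (TEXT, TABLE, HTML) the way Zeppelin's own frontend does.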

Support for Jupyter Notebook

Is your feature request related to a problem? Please describe.
The current system supports Zeppelin notebooks. We have a lot of notebooks designed with Jupyter, and tons of tooling built around them. Migrating these would be a tremendous effort. Requesting support for Jupyter notebooks in addition to Zeppelin.

Describe the solution you'd like
The ability to run Jupyter notebooks.

Describe alternatives you've considered
Tools that convert from Jupyter to Zeppelin, but that is a lot of internal work.
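The conversion alternative can be sketched in a few lines: map each Jupyter code cell to a Zeppelin paragraph. The Zeppelin note layout used here (`{"name": ..., "paragraphs": [{"text": ...}]}`) is an assumption based on Zeppelin's exported-note format, and the `%pyspark` magic is a placeholder for whichever interpreter the paragraphs should run under:

```python
def ipynb_to_zeppelin(nb: dict, name: str) -> dict:
    """Convert a minimal Jupyter notebook dict into a Zeppelin note dict.

    Only code cells are mapped; markdown cells could map to %md
    paragraphs in a fuller version. Shapes are illustrative assumptions.
    """
    paragraphs = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        # Jupyter stores cell source as a list of lines
        source = "".join(cell.get("source", []))
        paragraphs.append({"text": "%pyspark\n" + source})
    return {"name": name, "paragraphs": paragraphs}

nb = {"cells": [{"cell_type": "code", "source": ["print('hi')\n"]}]}
print(ipynb_to_zeppelin(nb, "demo"))
```

The hard part in practice is not the cell mapping but the surrounding tooling (outputs, widgets, kernels), which is presumably why the issue asks for native support instead.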

Can we use MinIO as S3-compatible storage for Apache Iceberg?

Is your feature request related to a problem? Please describe.
Can we use MinIO as S3-compatible storage for Apache Iceberg?

Describe the solution you'd like
Can we use MinIO as S3-compatible storage for Apache Iceberg?

Describe alternatives you've considered
If MinIO can be used, we need the steps to configure it with CueLake.

Additional context
Can we use MinIO as S3-compatible storage for Apache Iceberg?
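For context, pointing Spark's S3A filesystem at MinIO generally requires endpoint and path-style settings along these lines. The endpoint, bucket, and credentials below are placeholders; whether CueLake exposes these knobs is exactly what this issue asks:

```
# Spark/Hadoop S3A settings for a MinIO endpoint (sketch)
spark.hadoop.fs.s3a.endpoint                http://minio.example.local:9000
spark.hadoop.fs.s3a.path.style.access       true
spark.hadoop.fs.s3a.connection.ssl.enabled  false
spark.hadoop.fs.s3a.access.key              MINIO_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key              MINIO_SECRET_KEY
# The Iceberg warehouse can then live on a MinIO bucket
spark.sql.catalog.iceberg.warehouse         s3a://warehouse-bucket/path
```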

Syntax error in interpreter.json

There is a syntax error (missing comma) on line 1275 in https://raw.githubusercontent.com/cuebook/cuelake/main/zeppelinConf/interpreter.json

Also, there is a trailing \t on lines 1271 and 1272 that I suspect is incorrect.

And finally, the C in the word "Comma" on line 201 is not an ASCII C but a Cyrillic С (as the diff below shows), so viewers such as less display it as an escape sequence.

Below is a diff of the changes that I made to the file.

201c201
<           "description": "Сomma separated schema (schema \u003d catalog \u003d database) filters to get metadata for completions. Supports \u0027%\u0027 symbol is equivalent to any set of characters. (ex. prod_v_%,public%,info)"
---
>           "description": "Comma separated schema (schema \u003d catalog \u003d database) filters to get metadata for completions. Supports \u0027%\u0027 symbol is equivalent to any set of characters. (ex. prod_v_%,public%,info)"
1271,1272c1271,1272
<         "spark.executor.extraJavaOptions\t": {
<           "name": "spark.executor.extraJavaOptions\t",
---
>         "spark.executor.extraJavaOptions": {
>           "name": "spark.executor.extraJavaOptions",
1275c1275
<         }
---
>         },
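All three problems could be caught automatically. A small sketch of a linter (not CueLake code) that flags invalid JSON, keys with stray whitespace such as the trailing \t, and keys containing non-ASCII lookalike characters:

```python
import json

def lint_config(raw: str) -> list:
    """Report problems like those found in interpreter.json: invalid JSON,
    keys with embedded whitespace, and non-ASCII lookalike characters."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        # e.g. the missing comma on line 1275
        return [f"syntax error: {e}"]

    def walk(obj):
        if isinstance(obj, dict):
            for key, value in obj.items():
                if key != key.strip():
                    # e.g. "spark.executor.extraJavaOptions\t"
                    problems.append(f"key has stray whitespace: {key!r}")
                if not key.isascii():
                    # e.g. Cyrillic С in "Сomma"
                    problems.append(f"key has non-ASCII characters: {key!r}")
                walk(value)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)

    walk(data)
    return problems

print(lint_config('{"spark.executor.extraJavaOptions\\t": {}}'))
```

Running a check like this in CI would keep regressions like these out of zeppelinConf/interpreter.json.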

Rename models

Some model names are not apt. Rename the following models:
RunStatus -> NotebookRunLogs
WorkflowRuns -> WorkflowRunLogs

Dashboard V1

The dashboard will show all the workspaces and their resources.

CueLake will start with 0 workspaces.

User can add a workspace from dashboard.

For each workspace, the following info will be shown:

  • Resources currently running for the workspace: the Zeppelin server and all interpreters
  • A restart button for the Zeppelin server
  • Name and description of the workspace

Workspaces v1

Ask for the following info while creating a workspace:

  • Name and description
  • Storage (S3, GCS, AZFS, PV)
  • Storage credentials, if required
  • Inactivity timeout after which resources are shut down
  • Spark and interpreter Docker images (show CueLake's default values and a link for creating custom images)
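As a rough sketch, the creation form above maps to a payload like the following. All field names, defaults, and image names here are illustrative placeholders, not CueLake's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class WorkspaceSpec:
    """Hypothetical workspace-creation payload; names are illustrative only."""
    name: str
    description: str = ""
    storage: str = "S3"  # one of S3, GCS, AZFS, PV
    storage_credentials: dict = field(default_factory=dict)
    inactivity_timeout_minutes: int = 30  # shut down resources after this
    spark_image: str = "cuelake/spark:default"  # placeholder image names
    interpreter_image: str = "cuelake/interpreter:default"

    def __post_init__(self):
        # Validate the storage backend against the supported set
        if self.storage not in {"S3", "GCS", "AZFS", "PV"}:
            raise ValueError(f"unsupported storage backend: {self.storage}")

spec = WorkspaceSpec(name="analytics", storage="GCS")
print(spec)
```

Keeping defaults for the images and timeout matches the requirement to show CueLake's default values in the form.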

The default RBAC role is missing pods as a resource

Describe the bug
The default RBAC role is missing pods as a resource, which causes exceptions in lakehouse as shown below.

```
127.0.0.1 - - [27/May/2021:06:14:14 +0000] "GET /api/genie/notebooks/0 HTTP/1.1" 200 68 "http://127.0.0.1:8080/notebooks" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
Internal Server Error: /api/genie/driverAndExecutorStatus/
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 509, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 469, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
    raise exc
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/code/genie/views.py", line 243, in get
    res = KubernetesServices.getDriversCount()
  File "/code/genie/services/services.py", line 657, in getDriversCount
    ret = v1.list_namespaced_pod(POD_NAMESPACE, watch=False)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 15302, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 15427, in list_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '96c45951-281d-41d5-908d-b6429974a4dd', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Thu, 27 May 2021 06:14:14 GMT', 'Content-Length': '282'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:cuelake:default\" cannot list resource \"pods\" in API group \"\" in the namespace \"cuelake\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
```

***Workaround***

A workaround is to add "pods" as a resource in the default-role in cuelake.yaml.

```
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: default-role
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["create", "get", "update", "patch", "list", "delete", "watch"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["roles", "rolebindings"]
  verbs: ["bind", "create", "get", "update", "patch", "list", "delete", "watch"]
```
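Note that the Role only takes effect if it is bound to the service account the error message names (system:serviceaccount:cuelake:default). If cuelake.yaml does not already include one, a matching RoleBinding would look roughly like this (the binding name is a placeholder):

```
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: default-role-binding
  namespace: cuelake
subjects:
- kind: ServiceAccount
  name: default
  namespace: cuelake
roleRef:
  kind: Role
  name: default-role
  apiGroup: rbac.authorization.k8s.io
```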
