
ytsaurus-k8s-operator's Introduction



YTsaurus

Website | Documentation | YouTube

YTsaurus is a distributed storage and processing platform for big data with support for the MapReduce model, a distributed file system, and a NoSQL key-value database.

You can read the post about YTsaurus or watch the video:

video about YTsaurus

Advantages of the platform

Multitenant ecosystem

  • A set of interrelated subsystems: MapReduce, an SQL query engine, a job scheduler, and a key-value store for OLTP workloads.
  • Support for a large number of users, which eliminates the need for multiple installations and streamlines hardware usage

Reliability and stability

  • No single point of failure
  • Automated replication between servers
  • Updates with no loss of computing progress

Scalability

  • Up to 1 million CPU cores and thousands of GPUs
  • Exabytes of data on different media: HDD, SSD, NVMe, RAM
  • Tens of thousands of nodes
  • Automated server up- and down-scaling

Rich functionality

  • Expansive MapReduce module
  • Distributed ACID transactions
  • A variety of SDKs and APIs
  • Secure isolation for compute resources and storage
  • User-friendly, easy-to-use UI

CHYT powered by ClickHouse®

  • A well-known SQL dialect and familiar functionality
  • Fast analytic queries
  • Integration with popular BI solutions via JDBC and ODBC

SPYT powered by Apache Spark

  • A set of popular tools for writing ETL processes
  • Launch and support for multiple mini SPYT clusters
  • Easy migration for ready-made solutions

Getting Started

Try a YTsaurus cluster using Kubernetes or try our online demo.

How to Build from Source Code

How to Contribute

We are glad to welcome new contributors!

Please read the contributor's guide and the style guide for more details.

ytsaurus-k8s-operator's People

Contributors

achulkov2, alexvsalexvsalex, amrivkin, andrewlanin, chizhonkova, dpoluyanov, e1pp, gritukan, hdnpth, kaikash, kmalov, koct9i, kontakter, kozubaeff, krisha11, kristevalex, krock21, kruftik, l0kix2, omgronny, psushin, qurname2, rebenkoy, renadeen, sashamelentyev, savnadya, sgburtsev, tinarsky, verytable, zlobober


ytsaurus-k8s-operator's Issues

Automatically allow everyone to use ch_public CHYT clique

Currently, the ch_public clique is created with the default access control (//sys/access_control_object_namespaces/chyt/ch_public):

"principal_acl": [
        {
            "action": "allow",
            "inheritance_mode": "object_and_descendants",
            "permissions": [
                "use"
            ],
            "subjects": [
                "chyt_releaser"
            ]
        },
        {
            "action": "allow",
            "inheritance_mode": "object_and_descendants",
            "permissions": [
                "read",
                "remove",
                "manage"
            ],
            "subjects": [
                "chyt_releaser"
            ]
        }
    ],

It requires manual actions (and superuser permissions) to allow everyone to use it.
Can we add an "allow everyone to use" ACL entry automatically during the ch_public init job?

Support rolling update for masters

We want masters to update via leader switches, with minimal downtime.

The implementation could look like this (a sketch of the partition-stepping step follows the list):

  • we add some flag/enum (at the ytsaurus or master level) to support a new rolling strategy for masters instead of a full update
  • we call yt_execute with switch_leader to ms-0 and wait for everything to become healthy
  • we set the masters' sts updateStrategy to RollingUpdate with .spec.updateStrategy.rollingUpdate.partition=len(masters)-1, apply the sts update and wait until ms-2 is updated and has caught up
  • (if possible) we build snapshots and compare the md5 of all the snapshots (research convergence_check)
  • we proceed with .spec.updateStrategy.rollingUpdate.partition=len(masters)-2 and so on until only ms-0 is left un-updated
  • we call switch_leader to len(masters)-1
  • we remove updateStrategy.rollingUpdate/sync and wait until ms-0 is updated
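A minimal sketch of the partition-stepping step, using plain Kubernetes API types and a controller-runtime client (function and variable names here are illustrative, not the operator's actual code):

package components

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// stepMasterPartition decrements .spec.updateStrategy.rollingUpdate.partition by one,
// letting the StatefulSet controller update exactly one more master pod.
func stepMasterPartition(ctx context.Context, c client.Client, key client.ObjectKey) error {
    var sts appsv1.StatefulSet
    if err := c.Get(ctx, key, &sts); err != nil {
        return err
    }
    ru := sts.Spec.UpdateStrategy.RollingUpdate
    if ru == nil || ru.Partition == nil || *ru.Partition == 0 {
        return nil // nothing left to step
    }
    next := *ru.Partition - 1
    ru.Partition = &next
    return c.Update(ctx, &sts)
}

Each call rolls one more pod from the highest ordinal downwards; between calls the operator would wait for the updated master to catch up, as described above.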

Research memory consumption in operator

As shown in #90, operator memory consumption can exceed 256 MB even for a cluster with hundreds of nodes.
We need to investigate if there are memory leaks or ineffective memory usage.

One option could be to add pprof to the operator and a regular process to dump samples to a configured PVC.
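For illustration, pprof endpoints can be exposed with a few lines of Go (a sketch; the listen address and the dump-to-PVC mechanism are assumptions):

package diagnostics

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

// startPprof serves pprof on a side listener; heap samples could then be dumped
// periodically (for example by a sidecar or cron job) into a mounted PVC.
func startPprof(addr string) {
    go func() {
        if err := http.ListenAndServe(addr, nil); err != nil {
            log.Printf("pprof listener stopped: %v", err)
        }
    }()
}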

Job images must be in sync with the component they update

Not sure if it is a problem for all the components, but it is a bug for the master exit-read-only job: if you run the master on a 23.1 ytsaurus image and the master exit-read-only job on a 23.2 ytsaurus image, the job hangs because there is no such command in 23.1.

Maybe all nil instance spec images should be set to the core image in the defaulting webhook, and then we could use instanceSpec.Image in both the job and the server.
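A sketch of that defaulting idea (the helper and its callers are hypothetical, not existing operator code):

// defaultImage makes an instance spec inherit the core image when its own Image
// field is nil, so jobs and servers can always rely on instanceSpec.Image.
func defaultImage(instanceImage *string, coreImage string) *string {
    if instanceImage == nil {
        img := coreImage
        return &img
    }
    return instanceImage
}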

Refactor operator reconciliation flow

Currently we have an implementation where the ytsaurus cluster update flow is hard to understand because it is rather complex and scattered across the codebase. As a consequence, it is error-prone, hard to maintain and hard to operate on real clusters.

We want to simplify things by implementing an approach where components are brought to the desired state one after another in a strict linear sequence. That is, each reconciliation loop goes through that sequence (well known and explicitly described in a single place in the codebase), finds the first component which is NOT in the desired state, and executes one action which should bring that component closer to the desired state.

We expect that such a flow would be easier to understand and would also make it easy to see at which stage of an update the operator and the cluster currently are.

As a downside of this approach, we expect that a linear update of components would be slower, which might be something to improve in the future.
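A sketch of the proposed linear flow (the component interface below is illustrative, not the operator's current API):

package flows

import "context"

// component is one entry of the well-known, explicitly ordered update sequence.
type component interface {
    Name() string
    InDesiredState(ctx context.Context) (bool, error)
    // Step performs exactly one action that moves the component closer to the desired state.
    Step(ctx context.Context) error
}

// reconcileOnce walks the sequence, finds the first out-of-sync component and
// performs a single action on it; the next reconciliation re-evaluates everything.
func reconcileOnce(ctx context.Context, sequence []component) error {
    for _, c := range sequence {
        ok, err := c.InDesiredState(ctx)
        if err != nil {
            return err
        }
        if !ok {
            return c.Step(ctx)
        }
    }
    return nil // the whole cluster is in the desired state
}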

Strawberry controller fails to start with an error of missing `jupyt` path

An error in strawberry controller logs:

panic: call a4d6e19-3ee832b4-a54d258b-718c84e7 failed: error resolving path //sys/strawberry/jupyt: node //sys/strawberry has no child with key "jupyt"

goroutine 1 [running]:
go.ytsaurus.tech/yt/chyt/controller/internal/agent.(*Agent).Start(0xc0003b6600)
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/agent/agent.go:355 +0x3b0
go.ytsaurus.tech/yt/chyt/controller/internal/app.(*App).Run(0xc00058fc78, 0xc0004b80c0)
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/app/app.go:187 +0x555
main.doRun()
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/run.go:59 +0x1d5
main.wrapRun.func1(0x10f6ce0?, {0xb60355?, 0x2?, 0x2?})
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:30 +0x1d
github.com/spf13/cobra.(*Command).execute(0x10f6ce0, {0xc000356660, 0x2, 0x2})
	/actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:944 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0x11004c0)
	/actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:992
main.main()
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:38 +0x25

Seems like we just need to add "jupyt" here: https://github.com/ytsaurus/yt-k8s-operator/blob/main/pkg/ytconfig/chyt.go#L48

Replace EnableFullUpdate field with something better.

Currently we have the EnableFullUpdate field in the main ytsaurus CRD, and if it is set to true the operator considers that it can recreate all the pods and fully update the YT cluster.
The idea is that a full update should be controlled by a human, but the problem is that on deploy we often forget to change EnableFullUpdate back to false.

It would be better to replace it with something that the operator itself can change back after a full update has been triggered and approved by a human.

Ideas are appreciated.

Ytop-chart-controller crashes

The pod ytsaurus-ytop-chart-controller-manager is in an infinite CrashLoopBackOff state.

Configuration used:

  • ytop-chart v 0.4.0
  • coreImage: ytsaurus/ytsaurus:stable-23.1.0-relwithdebinfo

The log is attached.
chart_fail.log

Setting nodePort

Setting the node port for all components via the configuration file.
For example:

httpProxies:
  - serviceType: NodePort
    instanceCount: 3
    nodePort: 31860

When I apply the config, I get an error:
strict decoding error: unknown field "spec.httpProxies[0].nodePort", unknown field "spec.rpcProxies[0].nodePort"

Operation archive init job should wait for sys bundle to be healthy

When looking at cluster initialization during any of the e2e tests, one can see the following errors in the init-job-op-archive pod. They get retried eventually, but the backoff duration increases the length of the already very slow tests.

achulkov2@nebius-yt-dev:~$ kubectl logs yt-scheduler-init-job-op-archive-btqb6  -nquerytrackeraco
++ export YT_DRIVER_CONFIG_PATH=/config/client.yson
++ YT_DRIVER_CONFIG_PATH=/config/client.yson
+++ /usr/bin/ytserver-all --version
+++ head -c4
++ export YTSAURUS_VERSION=23.1
++ YTSAURUS_VERSION=23.1
++ /usr/bin/init_operation_archive --force --latest --proxy http-proxies.querytrackeraco.svc.cluster.local
2024-01-10 19:37:20,124 - INFO - Transforming archive from 48 to 48 version
2024-01-10 19:37:20,134 - INFO - Mounting table //sys/operations_archive/jobs
Traceback (most recent call last):
  File "/usr/bin/init_operation_archive", line 749, in <module>
    main()
  File "/usr/bin/init_operation_archive", line 744, in main
    force=args.force,
  File "/usr/bin/init_operation_archive", line 731, in run
    transform_archive(client, next_version, target_version, force, archive_path, shard_count=shard_count)
  File "/usr/bin/init_operation_archive", line 639, in transform_archive
    mount_table(client, path)
  File "/usr/bin/init_operation_archive", line 55, in mount_table
    client.mount_table(path, sync=True)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/client_impl_yandex.py", line 1394, in mount_table
    freeze=freeze, sync=sync, target_cell_ids=target_cell_ids)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/dynamic_table_commands.py", line 524, in mount_table
    response = make_request("mount_table", params, client=client)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/driver.py", line 126, in make_request
    client=client)
  File "<decorator-gen-3>", line 2, in make_request
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/common.py", line 422, in forbidden_inside_job
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_driver.py", line 301, in make_request
    client=client)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 455, in make_request_with_retries
    return RequestRetrier(method=method, url=url, **kwargs).run()
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/retries.py", line 79, in run
    return self.action()
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 410, in action
    _raise_for_status(response, request_info)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 290, in _raise_for_status
    raise error_exc
yt.common.YtResponseError: Error committing transaction 1-44d-10001-b753
    Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361
        No healthy tablet cells in bundle "sys"

***** Details:
Received HTTP response with error    
    origin          yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.203965Z    
    url             http://http-proxies.querytrackeraco.svc.cluster.local/api/v4/mount_table    
    request_headers {
                      "User-Agent": "Python wrapper 0.13-dev-5f8638fc66f6e59c7a06708ed508804986a6579f",
                      "Accept-Encoding": "gzip, identity",
                      "X-Started-By": "{\"pid\"=17;\"user\"=\"root\";}",
                      "X-YT-Header-Format": "<format=text>yson",
                      "Content-Type": "application/x-yt-yson-text",
                      "X-YT-Correlation-Id": "d71f4e98-4f2880b3-9213c0d0-9a5a9336"
                    }    
    response_headers {
                      "Content-Length": "1242",
                      "X-YT-Response-Message": "Error committing transaction 1-44d-10001-b753",
                      "X-YT-Response-Code": "1",
                      "X-YT-Response-Parameters": {},
                      "X-YT-Trace-Id": "c0235705-98e9c7a-369cf397-97d28dd7",
                      "X-YT-Error": "{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202367Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515,\"cluster_id\":\"Native(Name=test-ytsaurus)\",\"path\":\"//sys/operations_archive/jobs\"},\"inner_errors\":[{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202206Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515},\"inner_errors\":[{\"code\":1,\"message\":\"No healthy tablet cells in bundle \\\"sys\\\"\",\"attributes\":{\"request_id\":\"dc5643d9-124e57a5-cf4b0583-8753d056\",\"connection_id\":\"6b2e13-a3e8b3e0-314a5f40-69069dfd\",\"verification_mode\":\"none\",\"realm_id\":\"65726e65-ad6b7562-10259-79747361\",\"timeout\":30000,\"method\":\"CommitTransaction\",\"address\":\"ms-0.masters.querytrackeraco.svc.cluster.local:9010\",\"encryption_mode\":\"optional\",\"service\":\"TransactionSupervisorService\"}}]}]}",
                      "X-YT-Request-Id": "93a09617-71caa1ec-cbfe7e46-922f5a1f",
                      "Content-Type": "application/json",
                      "Cache-Control": "no-store",
                      "X-YT-Proxy": "hp-0.http-proxies.querytrackeraco.svc.cluster.local",
                      "Authorization": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
                    }    
    params          {
                      "suppress_transaction_coordinator_sync": false,
                      "path": "//sys/operations_archive/jobs",
                      "freeze": false,
                      "mutation_id": "124ef88f-86123fd-62afd823-512f2084",
                      "retry": false
                    }    
    transparent     True
Error committing transaction 1-44d-10001-b753    
    origin          hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202367Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)    
    trace_id        c0235705-98e9c7a-369cf397-97d28dd7    
    span_id         1636727892750608515    
    cluster_id      Native(Name=test-ytsaurus)    
    path            //sys/operations_archive/jobs
Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361    
    origin          hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202206Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)    
    trace_id        c0235705-98e9c7a-369cf397-97d28dd7    
    span_id         1636727892750608515
No healthy tablet cells in bundle "sys"    
    origin          yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.204007Z    
    request_id      dc5643d9-124e57a5-cf4b0583-8753d056    
    connection_id   6b2e13-a3e8b3e0-314a5f40-69069dfd    
    verification_mode none    
    realm_id        65726e65-ad6b7562-10259-79747361    
    timeout         30000    
    method          CommitTransaction    
    address         ms-0.masters.querytrackeraco.svc.cluster.local:9010    
    encryption_mode optional    
    service         TransactionSupervisorService

We should wait for the tablet cells to be healthy before running the init job.
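A sketch of such a pre-flight check, assuming the go.ytsaurus.tech/yt/go client and that bundle health is readable via the @health attribute (both assumptions here):

package initjob

import (
    "context"
    "time"

    "go.ytsaurus.tech/yt/go/ypath"
    "go.ytsaurus.tech/yt/go/yt"
)

// waitForSysBundle polls the sys tablet cell bundle until it reports "good" health.
func waitForSysBundle(ctx context.Context, yc yt.Client) error {
    for {
        var health string
        err := yc.GetNode(ctx, ypath.Path("//sys/tablet_cell_bundles/sys/@health"), &health, nil)
        if err == nil && health == "good" {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(5 * time.Second):
        }
    }
}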

Fixed host addresses for masters

Let us recall that each pod has two addresses: pod address (which looks like ms-2.masters.<namespace>.svc.cluster.local) and host address, which is an FQDN of underlying k8s node.

Problem: in hostNetwork = true setting we are exposing master in a very hacky way.

  1. We specify hostNetwork = true for pods, making server processes bind to host network interfaces (making them available from outside the k8s cluster). Actually, all our server processes bind to wildcard network interfaces, so masters bind to CNI network interfaces too, allowing access to them via their pod domain names.
  2. We still use pod names in the cluster_connection section of other components, as well as in master configs describing master cells; this works because of the remark in p. 1.
  3. In order for a master to identify itself in the list of peers in the cell, it searches for its FQDN in that list. This does not work well because the FQDN resolves to a host address while the list of peers contains pod addresses, so we use an address resolver hack: we set /address_resolver/localhost_name_override to the master's pod address via env variable substitution. This lets the master start and work.

This approach has a lot of downsides.

  1. We are mixing pod addresses and host addresses for different components; when the k8s overlay network uses IPv4 and the host network uses IPv6, this requires setting up dual-stack address resolution in all components, which is in general a bad practice.
  2. We are relying on KubeDNS for pod addresses, which has a number of disadvantages (the most notorious of which is the c-ares DNS spoofing issue, which sometimes breaks DNS resolution altogether).
  3. Masters register under pod addresses in //sys/primary_masters, which makes it hard to find their host addresses from outside the cluster (still possible via annotations, but not convenient at all). This complicates integration with outside tools like pull-based monitoring.
  4. Finally, the incorrect addresses of masters get into the cluster connection, which makes it impossible to connect components outside of the cluster using the config //sys/@cluster_connection.

The proposal is to fix the k8s nodes for masters a priori by their FQDNs. If we know the host FQDNs of these nodes, the user must specify them as follows:

masterHostAddresses:
- 4932:
  - host1.external.address
  - host2.external.address
  - host3.external.address

Here 4932 is the cell tag of a master cell (for potential extension to the multicell case).

This option should be specified if and only if hostNetwork: true holds. In that case, three changes in logic must apply.

  1. Specified host addresses must be put in master configs.
  2. Specified host addresses must be put in cluster connection configs of all components.
  3. The master statefulset must get an additional affinity constraint of the form externalHostnameLabel IN (host1.external.address, host2.external.address, host3.external.address), where externalHostnameLabel is kubernetes.io/hostname by default but may be overridden when the value of kubernetes.io/hostname does not resolve outside and some other label contains the actual external addresses of the nodes (see the sketch after this list).
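For illustration, the constraint from p. 3 expressed with Kubernetes API types (the label key and host list are the ones from the example above; the helper name is illustrative):

package components

import corev1 "k8s.io/api/core/v1"

// masterNodeAffinity pins master pods to the pre-declared hosts; the label key
// defaults to kubernetes.io/hostname but may be overridden via externalHostnameLabel.
func masterNodeAffinity(externalHostnameLabel string, hosts []string) *corev1.Affinity {
    return &corev1.Affinity{
        NodeAffinity: &corev1.NodeAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
                NodeSelectorTerms: []corev1.NodeSelectorTerm{{
                    MatchExpressions: []corev1.NodeSelectorRequirement{{
                        Key:      externalHostnameLabel,
                        Operator: corev1.NodeSelectorOpIn,
                        Values:   hosts,
                    }},
                }},
            },
        },
    }
}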

Also, the address_resolver hack must be dropped unconditionally. It is ugly.

Operator should react to all out-of-sync k8s components

Currently the operator only detects a diff in yson configs and images, but if we deploy a new version of the operator that wants to add something to pod specs, the change won't be applied.

It is expected that the operator should compare the entire desired state with the state of the world and try to bring components to the desired state. Changes that the operator makes should cause the minimum possible downtime.
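One possible direction is to compare the whole desired spec with the observed object instead of only config hashes and images; a rough sketch (defaulted fields would need normalization before such a comparison is practical):

package components

import (
    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/api/equality"
)

// needsSync reports whether the observed StatefulSet has drifted from the desired one.
// Comparing full specs also catches changes a new operator version makes to pod
// specs, not only image and config updates.
func needsSync(desired, observed *appsv1.StatefulSet) bool {
    return !equality.Semantic.DeepEqual(desired.Spec, observed.Spec)
}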

[Feature] Add ability to update Ytsaurus spec

I would like to propose an enhancement to the Ytsaurus CRD by adding advanced configuration options. These options will enable more granular control over Ytsaurus specifications and cater to varying use cases. The features I am proposing to add are:

  • Update Core Image: Introduce an ability to update the coreImage field under the spec to change the core image of Ytsaurus.
  • Add Instance specs: Allow users to add new instance specifications dynamically.
  • Edit Instance specs: Allow users to edit existing instance specifications. This includes:
    • Instance Count
    • Resources
    • Affinity
    • Volumes, VolumeMounts, Locations
    • Custom labels?
  • Remove Instance specs: Implement an option to remove instance specifications. Special care should be taken while removing data node and master specifications to prevent data loss or inconsistency.

Remote node CRD

We'd like to have an opportunity to describe a group of nodes (data/exec/tablet) as a separate CRD, providing the endpoints of a remote cluster. Such a group of nodes (called remote nodes) will work the same as if they were described within the ytsaurus CRD, but they can belong to a different k8s cluster, which may be useful.

I'd suggest having a CRD RemoteNodeGroup, which looks exactly the same as dataNodes/execNodes/tabletNodes (and, ideally, reuses as much of existing code as possible).

The main difference is that there should be a field "remoteClusterSpec" which must be a reference to a resource (CRD) of type RemoteYtsaurusSpec.

RemoteYtsaurusSpec must describe the information necessary for connecting a node to a remote cluster. remoteClusterSpec must be a structure, currently containing one field:

MasterAddresses: []string

But in the future it will probably be extended by various other fields.
When ytsaurus/ytsaurus#248 completes:

RemoteClusterUrl: string // allows taking the cluster connection via an HTTP call to RemoteClusterUrl + "/cluster_connection".

When authentication in native protocol is employed:

Credentials: SomeKindOfCredentials
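A possible shape of the new types (the type and field names are taken from this issue; the struct tags and the reference type are assumptions):

package v1

import corev1 "k8s.io/api/core/v1"

// RemoteYtsaurusSpec describes how to reach the remote cluster.
type RemoteYtsaurusSpec struct {
    MasterAddresses []string `json:"masterAddresses"`
    // Later: RemoteClusterUrl, Credentials, ...
}

// RemoteNodeGroupSpec mirrors the dataNodes/execNodes/tabletNodes specs, plus a
// reference to the RemoteYtsaurus resource describing the cluster to connect to.
type RemoteNodeGroupSpec struct {
    RemoteClusterSpec corev1.LocalObjectReference `json:"remoteClusterSpec"`
    // ... the usual instance spec fields (instanceCount, resources, locations, ...).
}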

Strawberry not starting

Operator version: af42b5642038e9d0b2ee6746a96403a16c78e845

Strawberry version: 0.0.5 e6ba5a781e10717a4ce2611c4c623ed1087137e1e359f2b2af1391dd7bfc04aa

Error:

panic: call 718cac8d-86e46ba7-354f650c-9db060c9 failed: error resolving path //sys/strawberry/chyt: node //sys/strawberry has no child with key "chyt"

goroutine 1 [running]:
go.ytsaurus.tech/yt/chyt/controller/internal/agent.(*Agent).Start(0xc000000600)
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/agent/agent.go:355 +0x390
go.ytsaurus.tech/yt/chyt/controller/internal/app.(*App).Run(0xc0002d5c78)
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/app/app.go:186 +0x466
main.doRun()
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/run.go:58 +0x1ab
main.wrapRun.func1(0x10bd260?, {0xb3de4a?, 0x2?, 0x2?})
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:30 +0x1d
github.com/spf13/cobra.(*Command).execute(0x10bd260, {0xc0001a0640, 0x2, 0x2})
        /actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0x10c6780)
        /actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:1044 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:968
main.main()
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:38 +0x25

Ytsaurus spec:

 strawberry:
    image: ytsaurus/strawberry:0.0.5
    resources:
      requests:
        cpu: 4
        memory: 8Gi
      limits:
        cpu: 8
        memory: 12Gi

There is a //sys/strawberry directory in Cypress, but there is nothing in it.

Incorrect jupyter FQDN breaks SPYT examples

Details:

Helm operator: 0.4.1
Cluster config: https://github.com/ytsaurus/yt-k8s-operator/blob/main/config/samples/0.4.0/cluster_v1_demo.yaml
Jupyter config: https://github.com/ytsaurus/yt-k8s-operator/tree/main/config/samples/jupyter

Steps to reproduce:

  1. (additional step) Change CHYT to strawberry
  2. Setup YTsaurus cluster
  3. Update SPYT library to match cluster: pip install -U ytsaurus-spyt==1.72.0 --user and restart kernel
  4. Open SPYT examples.ipynb in jupyter
  5. Change table path to any existing table in YTsaurus
  6. Run cells down to reading table

Expected behavior:

The table is successfully read by SPYT and displayed.

Current behavior:

SPYT executor nodes can't access the driver. SPYT returns WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.

Reason:

The Spark executor gets an incorrect FQDN for jupyter: --driver-url spark://CoarseGrainedScheduler@jupyterlab-0.jupyter-headless.default.svc.cluster.local:27001. This FQDN is the output of hostname, but the k8s cluster doesn't have such an FQDN. See the stderr log from the executor. The behavior is the same on each executor node.

I also made a clean Ubuntu setup with ytsaurus-spyt. In that case the Spark executor also gets the hostname output, i.e. plain ubuntu for the default image.

Proposal:

  1. Provide a way to pass a custom driver URL to Spark executors. This would allow bypassing the hostname lookup entirely.
  2. Fix the jupyter configuration so that SPYT works out of the box.
  3. Add information about such potential confusion with hostname to docs.

Thank you!

QT ACOs are not created during operator update

After updating the operator 0.5.0 -> 0.6.0, the cluster to 23.2, and QT to 0.0.5, this error appears when running QT queries:

Access control object "nobody" does not exist

This is because the necessary ACOs are created only when QT is initializing, not when it is updating:
https://github.com/ytsaurus/yt-k8s-operator/blob/main/pkg/components/query_tracker.go#L193
We need to start running this job on every update.

As a temporary solution, they can be created manually:

yt create access_control_object_namespace --attr '{name=queries}'
yt create access_control_object --attr '{namespace=queries;name=nobody}'

Init job script is not recreated with operator update

I've updated the operator to a version containing commit 7ff25c0
and expected that the update op archive job would have the new updated script, but it turns out the script is the same as before the update.

I've localized the issue: the init script is not set directly in the job but in a config map, which, it seems, is not being recreated.

kubectl get cm op-archive-yt-scheduler-init-job-config -oyaml | grep -A10 init-cluster.sh
  init-cluster.sh: |2-

    set -e
    set -x

    export YT_DRIVER_CONFIG_PATH=/config/client.yson
    /usr/bin/init_operation_archive --force --latest --proxy http-proxies.xxx.svc.cluster.local
    /usr/bin/yt set //sys/cluster_nodes/@config '{"%true" = {job_agent={enable_job_reporter=%true}}}'
kind: ConfigMap
metadata:
  creationTimestamp: "2023-11-09T18:44:06Z"

Implement some signal from the operator that means "all good" from the operator's perspective

Currently we have the Running cluster status, but it doesn't mean that the operator has finished all reconciliations.
For example, the cluster stays Running if a full update couldn't start (e.g. because EnableFullUpdate is not set), and there should be a way for a human and/or monitoring to detect such situations without checking the logs.

Maybe we should just use a status other than Running when the operator is blocked, and that would be enough.

  • There is a separate, related issue about having some signal that the operator has fully satisfied the applied spec; currently the status can be Running when only a minimal number of pods have been created, but I consider that an independent issue. Though it could also be resolved with some extra status meaning "the cluster is usable, but not everything is running".

QT ACO namespace configuration

We need to set good default settings for the queries ACO namespace that will be created during the init job.

Suggested format:

$ yt get //sys/access_control_object_namespaces/queries/@acl
[
    {
        "action" = "allow";
        "subjects" = [
            "owner";
        ];
        "permissions" = [
            "read";
            "write";
            "administer";
            "remove";
        ];
        "inheritance_mode" = "immediate_descendants_only";
    };
    {
        "action" = "allow";
        "subjects" = [
            "users";
        ];
        "permissions" = [
            "modify_children";
        ];
        "inheritance_mode" = "object_only";
    };
]

$ yt get //sys/access_control_object_namespaces/queries/nobody/@principal_acl
[]

$ yt get //sys/access_control_object_namespaces/queries/everyone/@principal_acl
[
    {
        "action" = "allow";
        "subjects" = [
            "everyone";
        ];
        "permissions" = [
            "read";
            "use";
        ];
        "inheritance_mode" = "object_and_descendants";
    };
]

$ yt get //sys/access_control_object_namespaces/queries/everyone-use/@principal_acl
[
    {
        "action" = "allow";
        "subjects" = [
            "everyone";
        ];
        "permissions" = [
            "use";
        ];
        "inheritance_mode" = "object_and_descendants";
    };
]
