submariner-io / lighthouse
DNS service discovery across connected Kubernetes clusters.
Home Page: https://submariner-io.github.io/architecture/service-discovery/
License: Apache License 2.0
Align the Lighthouse CRDs with the upstream multi-cluster CRD design proposal[1] and make the required changes in the Lighthouse controller to match the design proposal.
[1] https://docs.google.com/document/d/1hFtp8X7dzVS-JbfA5xuPvI_DNISctEbJSorFnY-nz6o/edit
Currently, ServiceImports are distributed as individual copies from each source cluster and remain that way on the destination. Aggregation is done in the plugin code, which is suboptimal and not compliant with the MCS API spec. The agent should aggregate ServiceImports into a single resource on destination clusters, and the plugin should use the aggregated resource. This will also make troubleshooting easier.
See enhancement proposal for details.
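The aggregation the issue asks for can be sketched roughly as follows. This is a minimal illustration, not the actual agent code: the type and field names here are invented stand-ins for the real MCS API types.

```go
package main

import "fmt"

// clusterBackend is an illustrative stand-in for one source cluster's
// contribution to an exported service.
type clusterBackend struct {
	Cluster string
	IP      string
}

// aggregate merges the per-cluster entries for a service into a single
// record keyed by service name, instead of keeping one copy per cluster.
func aggregate(name string, backends []clusterBackend) map[string][]string {
	merged := map[string][]string{}
	for _, b := range backends {
		merged[name] = append(merged[name], b.IP)
	}
	return merged
}

func main() {
	got := aggregate("nginx", []clusterBackend{
		{Cluster: "cluster2", IP: "100.92.0.5"},
		{Cluster: "cluster3", IP: "100.93.0.7"},
	})
	fmt.Println(len(got["nginx"])) // prints 2: one merged record, two backends
}
```

The key point is that destination clusters would hold one ServiceImport per service, with all clusters' endpoints folded into it, so the DNS plugin no longer has to merge copies at query time.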
When the Lighthouse E2E tests were initially added, a snapshot of the submariner project's framework module was copied into Lighthouse. We should reuse the submariner framework code instead of maintaining a separate copy.
kubernetes/enhancements#1646 proposes a solution for multi-cluster service discovery that requires a central controller to aggregate and distribute resources to all the clusters.
Currently, Lighthouse distributes each cluster's individual MCS CR and local agents aggregate them. We need a central controller to support distribution of an aggregated MCS CR to clusters.
Update the lighthouse documentation with the new architecture.
Sub task of #78
The MCS API [https://github.com/kubernetes/enhancements/pull/1646] proposes the use of EndpointSlices and changes to, or use of an alternative, kube-proxy. With Lighthouse we want to try things a bit differently, with a more DNS-centered solution. Create a PoC exploring this solution to see how it compares to https://github.com/JeremyOT/mcs-demo and whether it can be done within the constraints of the current API.
Dependabot couldn't parse the go.mod found at /go.mod.
To better reuse deployment logic, submariner-io/armada has been created to abstract multicluster K8s deployments with kind under the hood. It would help the maintainability of Lighthouse to move to this shared tooling.
This work is parallel to submariner-io/submariner#317, which added Armada support to the main Submariner repo. Also related to submariner-io/submariner#369, which will involve sharing scripting around Armada between various submariner-io/* repos.
Base the Dockerfile describing Lighthouse's CI base container (Dockerfile.dapper) on the shared/common base container maintained in submariner-io/submariner. This will allow de-duplicating the tooling installs and version maintenance. It will also allow future sharing/deduplication, like with the e2e scripting per submariner-io/submariner#369.
Add an e2e test to verify that Internet connectivity still works after the Lighthouse DNS server is deployed.
I opened this issue to track subctl support for Lighthouse + Globalnet testing when using a Headless service.
On my env with Globalnet, the Headless service could not be exported:
https://qe-jenkins-csb-skynet.cloud.paas.psi.redhat.com/job/Submariner-OSP-AWS/797/Test-Report/
Status:
Conditions:
Last Transition Time: 2020-08-28T07:56:54Z
Message: Service doesn't have a global IP yet
Reason: ServiceGlobalIPUnavailable
Status: False
Type: Initialized
@vthapar
We should update the website docs and point to submariner-io/submariner#732.
Originally posted by @manosnoam in #271 (comment)
Sub task of #78
Dependabot can't resolve your Go dependency files.
As a result, Dependabot couldn't update your dependencies.
The error Dependabot encountered was:
verifying github.com/submariner-io/[email protected]/go.mod: checksum mismatch
downloaded: h1:cPwX5Xwr6tZs7qQZmCPKNFL5LxOHR1W4MlRSZgwVBcw=
go.sum: h1:5vxFEjdLY3+kBeXLvxixXRRmcaemptjzJQeYUvmks9A=
SECURITY ERROR
This download does NOT match an earlier download recorded in go.sum.
The bits may have been replaced on the origin server, or an attacker may
have intercepted the download attempt.
For more information, see 'go help module-auth'.
If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.
When we update the ServiceExport status, we always overwrite the entire ServiceExportCondition list. Since it's a list, we should append if the new ServiceExportCondition differs from the previous entry. We probably also want to truncate the list to keep the last 10 or so entries.
A test covering Headless services needs to be added to the e2e tests.
Modify the Lighthouse plugin to use round-robin selection from the list of IPs for a given multi-cluster service.
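A round-robin selector over a service's IP list might look like the following sketch; the type and field names are illustrative, not the actual plugin's.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rrPool rotates through a service's resolved IPs.
type rrPool struct {
	ips     []string
	counter uint64
}

// Next returns the next IP in rotation; an atomic counter keeps
// concurrent DNS queries rotating fairly without a mutex.
func (p *rrPool) Next() string {
	if len(p.ips) == 0 {
		return ""
	}
	n := atomic.AddUint64(&p.counter, 1)
	return p.ips[(n-1)%uint64(len(p.ips))]
}

func main() {
	p := &rrPool{ips: []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}}
	for i := 0; i < 4; i++ {
		fmt.Println(p.Next())
	}
	// prints 10.0.0.1, 10.0.0.2, 10.0.0.3, then wraps to 10.0.0.1
}
```

Compared with always returning the first IP, this spreads load across all clusters exporting the service.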
Use the Gateway status information to determine if a cluster is connected or not. If it isn't connected, don't return IPs of services running in that cluster.
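The connectivity filtering described above amounts to dropping IPs whose source cluster's Gateway reports no active connection. A minimal sketch, with hypothetical types standing in for the real Gateway status data:

```go
package main

import "fmt"

// serviceIP pairs a resolved IP with its source cluster (illustrative type).
type serviceIP struct {
	IP      string
	Cluster string
}

// connectedOnly keeps only IPs from clusters the Gateway reports as
// connected, so DNS never hands out unreachable endpoints.
func connectedOnly(ips []serviceIP, connected map[string]bool) []string {
	var out []string
	for _, s := range ips {
		if connected[s.Cluster] {
			out = append(out, s.IP)
		}
	}
	return out
}

func main() {
	ips := []serviceIP{{"10.1.0.5", "east"}, {"10.2.0.7", "west"}}
	// "west" has no active Gateway connection, so its IP is filtered out
	fmt.Println(connectedOnly(ips, map[string]bool{"east": true, "west": false}))
}
```

In the real plugin, the `connected` map would be derived from the Gateway resource's connection status rather than passed in directly.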
Installing Submariner with service-discovery on OSP (private cluster) and AWS (public cluster), the lighthouse-controller seems to be unreachable between clusters:
Testing the connection between nginx <--> netshoot works with direct IPs, but does not work with the domain name:
export KUBECONFIG=/home/nmanos/automation/ocp-install/nmanos-cluster-a/auth/kubeconfig
/home/nmanos/automation/ocp-install/oc exec netshoot-58785d5fc7-82kc7 -- curl --output /dev/null --verbose --head --fail 100.96.144.67
* Trying 100.96.144.67:80...
* TCP_NODELAY set
* Connected to 100.96.144.67 (100.96.144.67) port 80 (#0)
> HEAD / HTTP/1.1
> Host: 100.96.144.67
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.17.8
< Date: Thu, 12 Mar 2020 12:56:09 GMT
< Content-Type: text/html
< Content-Length: 612
< Last-Modified: Tue, 21 Jan 2020 14:39:00 GMT
< Connection: keep-alive
< ETag: "5e270d04-264"
< Accept-Ranges: bytes
<
* Connection #0 to host 100.96.144.67 left intact
nginx_service_cluster_b=nginx-demo
# Nginx service on Cluster B, will be identified by its Domain Name (with --service-discovery): nginx-demo
/home/nmanos/automation/ocp-install/oc exec netshoot-58785d5fc7-82kc7 -- curl --output /dev/null -m 30 --verbose --head --fail nginx-demo
* Could not resolve host: nginx-demo
* Closing connection 0
curl: (6) Could not resolve host: nginx-demo
command terminated with exit code 6
Lighthouse pod log shows:
{Name:"nmanos-cluster-b-tunnel-jsvwf"}}, Status:v1beta1.KubeFedClusterStatus{Conditions:[]v1beta1.ClusterCondition{v1beta1.ClusterCondition{Type:"Offline", Status:"True", LastProbeTime:v1.Time{Time:time.Time{wall:0x0, ext:63719526907, loc:(*time.Location)(0x1dc4b60)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63719526907, loc:(*time.Location)(0x1dc4b60)}}, Reason:"ClusterNotReachable", Message:"cluster is not reachable"}}, Zones:[]string(nil), Region:""}}
ServiceExport has a status field which is currently not being used. Update the status correctly so it can be used for automation and diagnostics.
I would expect to see in LH pod logs:
lighthouse/pkg/agent/controller/agent.go
Line 86 in c39bbe5
However, it seems the LH pod log is flooded with "transform function returned nil", and there's no indication of "Lighthouse agent syncer started", as seen in the following pod log:
lh.log
It's a 5 MB LH pod log that was generated in only 2 hours after Submariner was deployed and joined to a completely new cluster.
To support StatefulSets [https://kubernetes.io/docs/concepts/workloads/controllers/statefulset] we need to provide a means to access individual pods in the set using hostnames, rather than just returning a list of Pod IPs as we do for Headless Services. This is currently not possible with what we have in ServiceImports and will likely need EndpointSlice support.
An EndpointSlice needs to be created for each Headless Service in the cluster.
Currently, the Service must already exist for ServiceExport to work. Allow the ServiceExport to be created before the Service. This will help with automation.
We no longer use MultiClusterServices; they have been replaced with ServiceImports. Deprecate the spec and clean up any references and usages of it in code, scripts and deployments.
This is to test Headless Service on Submariner 0.6.0, as documented in:
submariner-io/submariner-website#251
Originally posted by @aswinsuryan in #252 (comment)
In a setup with two clusters, both exporting the same service, the imports are synchronised:
NAMESPACE NAME AGE
submariner-k8s-broker nginx-default-gn2 4d16h
submariner-k8s-broker submariner-test-app-quattro-submariner-test-gn2 15m
submariner-k8s-broker submariner-test-app-quattro-submariner-test-gn3 35m
submariner-operator nginx-default-gn2 4d16h
submariner-operator submariner-test-app-quattro-submariner-test-gn2 15m
submariner-operator submariner-test-app-quattro-submariner-test-gn3 35m
(on the broker cluster) but dig only ever returns one of the IP addresses — the first one to be exported in this case.
I'm looking at options for deploying submariner and lighthouse in pre-existing clusters. If two clusters east and west have the same (default) DNS suffix cluster.local, is there a way for pods in cluster east to access services in cluster west explicitly, using some form of a forward plugin config for CoreDNS in east, so that lookups for *.svc.west.local (or something else) are sent to lighthouse and will only match services exported by the west cluster?
Note that this is a little different from, but not necessarily incompatible with, the current (0.4.0) design of Lighthouse, which supports multi-cluster services that can be exported from (and therefore serviced by) one or more clusters. Such multi-cluster services are visible in all clusters, and discoverable via a new supercluster.local domain name, and it's assumed that namespaces that export such services will be globally unique across the set of clusters. (I assume this looks like _service_._namespace_.svc.supercluster.local?)
However, if the clusters are pre-existing, it's possible that they may have existing namespaces that are not globally unique, such as kube-system or monitoring (or in my case kafka). So I'm trying to connect to a specific remote service in a specific remote cluster, not a "multi-cluster" service that may exist in one or more clusters. This makes the use case more similar to the pre-0.4.0 design of Lighthouse, but since both clusters are pre-existing, they do not have unique DNS names and both consider themselves to be .cluster.local.
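The per-cluster zone the asker describes could be sketched as a Corefile fragment like the one below. This is a hypothetical configuration, not a supported Lighthouse feature; the zone name and the Lighthouse service IP (100.92.253.164) are assumptions for illustration.

```
# Hypothetical CoreDNS Corefile fragment in cluster "east":
# forward only lookups under west.local to the Lighthouse DNS server,
# so *.svc.west.local matches services exported by cluster "west".
west.local:53 {
    forward . 100.92.253.164
}
```

It mirrors the supercluster.local forward stanza shown later in this document, just scoped to a single remote cluster's zone.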
Lighthouse should remove the cluster IP from the ServiceImport if that cluster does not have any active backend pods for the service.
The Lighthouse Documentation needs to be updated with 0.6.0 changes
See -> openshift/coredns#27
The list of plugins doesn't match the list in CoreDNS 1.5.2. Review and prune, as most of them are not really in use.
If Globalnet is enabled, the Service IP should not be used for an exported service. Only once the GlobalIP shows up should we distribute the Service and mark the ServiceExport as ready.
The lighthouse implementation shall be changed to make use of the forward plugin, available in CoreDNS.
Sub task of #78
Currently, Lighthouse does not support kubectl/oc wait, as it does not populate the ready status (submariner-io/submariner#640).
oc wait --timeout=3m --for=condition=ready serviceexport "nginx-cl-b"
This shall be populated when the service is exported as below.
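A status carrying a condition that the oc wait command above could match might look like the following sketch. The condition type and message are assumptions modeled on the ServiceExport conditions shown elsewhere in this document; the exact shape would be decided as part of this issue.

```yaml
# Hypothetical ServiceExport status populated on successful export
status:
  conditions:
  - type: Ready            # assumed condition type matched by --for=condition=ready
    status: "True"
    reason: ""
    message: Service was successfully exported
    lastTransitionTime: "2020-08-31T05:30:31Z"
```

With such a condition in place, `oc wait --for=condition=ready serviceexport <name>` would block until the export completes instead of failing immediately.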
While working on #127 I discovered that globalnet isn't supported by the E2E tests.
The failures are recorded in the log - log.txt
Excerpt:
2020-05-11T09:45:50.2149430Z • Failure [136.424 seconds]
2020-05-11T09:45:50.2149647Z [dataplane] Test Service Discovery Across Clusters
2020-05-11T09:45:50.2154525Z /go/src/github.com/submariner-io/lighthouse/test/e2e/dataplane/service_discovery.go:16
2020-05-11T09:45:50.2154684Z when a pod tries to resolve a service in a remote cluster
2020-05-11T09:45:50.2155033Z /go/src/github.com/submariner-io/lighthouse/test/e2e/dataplane/service_discovery.go:19
2020-05-11T09:45:50.2155343Z should be able to discover the remote service successfully [It]
2020-05-11T09:45:50.2155633Z /go/src/github.com/submariner-io/lighthouse/test/e2e/dataplane/service_discovery.go:20
2020-05-11T09:45:50.2155716Z
2020-05-11T09:45:50.2156603Z Failed to verify if service IP is discoverable. expected execution result "; <<>> DiG 9.14.8 <<>> @100.90.0.10 nginx-demo.e2e-tests-dataplane-sd-pmx54.svc.cluster2.local +short\n; (1 server found)\n;; global options: +cmd\n;; connection timed out; no servers could be reached\n169.254.3.81" to contain "100.90.209.156"
2020-05-11T09:45:50.2156764Z Unexpected error:
2020-05-11T09:45:50.2156844Z <exec.CodeExitError>: {
2020-05-11T09:45:50.2156938Z Err: {
2020-05-11T09:45:50.2157034Z s: "command terminated with exit code 9",
2020-05-11T09:45:50.2157133Z },
2020-05-11T09:45:50.2157472Z Code: 9,
2020-05-11T09:45:50.2157573Z }
2020-05-11T09:45:50.2157872Z command terminated with exit code 9
2020-05-11T09:45:50.2158333Z occurred
...
2020-05-11T09:47:46.3741029Z • Failure [116.159 seconds]
2020-05-11T09:47:46.3742070Z [dataplane] Test Service Discovery Across Clusters
2020-05-11T09:47:46.3744907Z /go/src/github.com/submariner-io/lighthouse/test/e2e/dataplane/service_discovery.go:16
2020-05-11T09:47:46.3746916Z when a pod tries to resolve a service which is present locally and in a remote cluster
2020-05-11T09:47:46.3750226Z /go/src/github.com/submariner-io/lighthouse/test/e2e/dataplane/service_discovery.go:25
2020-05-11T09:47:46.3751495Z should resolve the local service [It]
2020-05-11T09:47:46.3755627Z /go/src/github.com/submariner-io/lighthouse/test/e2e/dataplane/service_discovery.go:26
2020-05-11T09:47:46.3757270Z
2020-05-11T09:47:46.3760909Z Failed to verify if service IP is discoverable
2020-05-11T09:47:46.3761064Z Unexpected error:
2020-05-11T09:47:46.3761437Z <exec.CodeExitError>: {
2020-05-11T09:47:46.3761536Z Err: {
2020-05-11T09:47:46.3761666Z s: "command terminated with exit code 9",
2020-05-11T09:47:46.3761768Z },
2020-05-11T09:47:46.3761844Z Code: 9,
2020-05-11T09:47:46.3761940Z }
2020-05-11T09:47:46.3762034Z command terminated with exit code 9
2020-05-11T09:47:46.3762339Z occurred
With a "headless" service, instead of a single Service IP with load-balancing, the platform should return the IPs of the associated Pods. This allows clients to interact directly with the Pods instead of via a proxy.
A headless service can be configured by explicitly specifying "None" for the clusterIP (.spec.clusterIP), and can be used with or without selectors: https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
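For reference, a minimal headless Service manifest looks like the following; the names and port are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-headless   # illustrative name
spec:
  clusterIP: None        # "None" makes the service headless
  selector:
    app: nginx           # with a selector, Endpoints are created for matching Pods
  ports:
  - port: 80
```

A DNS lookup for such a service returns the Pod IPs directly rather than a single virtual Service IP.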
This issue is related to #271
I've installed Submariner version v0.6.0-1-gcb7275d with service-discovery (but on non-overlapping cluster CIDRs and without Globalnet) and exported a headless service, but its DNS name nginx-cl-b.test-submariner-headless.svc.clusterset.local could not be resolved, even after 3 minutes:
08:30:35 $ oc get serviceexport "nginx-cl-b" -n test-submariner-headless -o yaml
apiVersion: lighthouse.submariner.io/v2alpha1
kind: ServiceExport
metadata:
creationTimestamp: "2020-08-31T05:27:21Z"
generation: 3
name: nginx-cl-b
namespace: test-submariner-headless
resourceVersion: "11622968"
selfLink: /apis/lighthouse.submariner.io/v2alpha1/namespaces/test-submariner-headless/serviceexports/nginx-cl-b
uid: b4daec8f-7541-4366-ac6f-c8d17ebdb0f9
status:
conditions:
- lastTransitionTime: "2020-08-31T05:30:31Z"
message: Awaiting sync of the ServiceImport to the broker
reason: AwaitingSync
status: "True"
type: Initialized
- lastTransitionTime: "2020-08-31T05:30:31Z"
message: Service was successfully synced to the broker
reason: ""
status: "True"
type: Exported
### After 3 minutes:
08:33:36 $ oc exec netshoot-cl-a-new -n test-submariner -- ping -c 1 nginx-cl-b.test-submariner-headless.svc.clusterset.local
ping: nginx-cl-b.test-submariner-headless.svc.clusterset.local: Name does not resolve
Full test report:
https://qe-jenkins-csb-skynet.cloud.paas.psi.redhat.com/job/Submariner-OSP-AWS/800/Test-Report/
The last step includes pods logs, and subctl info.
Note that in the same test, but with a regular (non-headless) service, the connection works fine:
$ oc exec netshoot-cl-a -n test-submariner -- /bin/bash -c "curl --max-time 30 --verbose nginx-cl-b.test-submariner.svc.clusterset.local:8080"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 100.96.76.75:8080...
* Connected to nginx-cl-b.test-submariner.svc.clusterset.local (100.96.76.75) port 8080 (#0)
Sub task of #78
@Vishal Thapar @mkolesni @tom Pantelis
Hi, do you guys know how/where these tests are run? Via GitHub Actions / subctl verify? Thanks.
test/e2e/discovery/service_discovery.go:21
var _ = Describe("[discovery] Test Service Discovery Across Clusters", func() {
https://github.com/submariner-io/lighthouse | Added by GitHub
WIP:
Vishal Thapar 11:45 AM
@pkomarov via both. Github Actions runs these in lighthouse CI e.g. https://github.com/submariner-io/lighthouse/actions/runs/170918828
Example of using subctl verify is in operator repo's CI e.g. https://github.com/submariner-io/submariner-operator/runs/876513305?check_suite_focus=true#step:5:8436
Currently, all services (except for a few default services) are discoverable across clusters.
An opt-in feature shall be added, where only services with a specific label (we could explore other options) are discoverable.
Deploy three KIND clusters (you can use submariner repo and execute "make clusters")
[sgaddam@localhost submariner]$ export KUBECONFIG=output/kubeconfigs/kind-config-cluster2
[sgaddam@localhost submariner]$ kubectl get svc -n submariner-operator
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
submariner-lighthouse-coredns ClusterIP 100.92.253.164 53/UDP 42m
submariner-operator-metrics ClusterIP 100.92.66.37 8383/TCP,8686/TCP 42m
[sgaddam@localhost submariner]$
ConfigMap of CoreDNS
apiVersion: v1
data:
Corefile: |
#lighthouse
supercluster.local:53 {
forward . 100.92.253.164 <--- This matches with the serviceip of lighthouse-coredns
}
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster2.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
After removing the submariner.io/gateway label on the active gateway node of cluster2, the engine pod is terminated, and when we try DNS resolution from the pod, it's very erratic (PSB).
In a client pod (deployed via kubectl run netshoot-2-1 -i --tty --image nicolaka/netshoot -- /bin/bash) on cluster-2:
bash-5.0# The ping requests below were issued without any delay.
bash-5.0# ping nginx.default.svc.supercluster.local
PING nginx.default.svc.supercluster.local (100.93.129.152) 56(84) bytes of data. <--- Here ping resolves
^C
--- nginx.default.svc.supercluster.local ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
bash-5.0#
bash-5.0# ping nginx.default.svc.supercluster.local
ping: nginx.default.svc.supercluster.local: Name does not resolve <--- ping does not resolve
bash-5.0#
bash-5.0# ping nginx.default.svc.supercluster.local
ping: nginx.default.svc.supercluster.local: Name does not resolve
bash-5.0#
bash-5.0# ping nginx.default.svc.supercluster.local
PING nginx.default.svc.supercluster.local (100.93.129.152) 56(84) bytes of data. <--- ping resolves again
^C
--- nginx.default.svc.supercluster.local ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 8215ms
After leaving the setup for some 20 minutes, when I tried to run ping/dig, DNS did not resolve at all.
bash-5.0# ping nginx.default.svc.supercluster.local
ping: nginx.default.svc.supercluster.local: Name does not resolve
bash-5.0#
Re-verified the lighthouse-coredns service IP; it matches the IP address in the CoreDNS ConfigMap, but DNS still does not work.
Validation, tests and other targets were migrated to Shipyard; use them instead of the cloned scripts.