mattf commented on May 20, 2024

three things should be present for something like this to work.

  1. data donated to the community, for the benefit of the community, needs to be available to the community. for instance, data readily available to anyone going to kubeflow.org.
  2. there should be a clear value proposition for the community. for instance, being able to connect with others who are using similar projects or are in similar locations, or clear use of the data for improvement of the project, which may take some time to demonstrate.
  3. it should be opt-in.

the first two go to the social contract established.

the last is my personal position, and i'm usually mollified by a strong social contract, a clear indication that the data is collected, and a trivial opt-out option.

from kubeflow.

mattf commented on May 20, 2024

opt-in is my personal view.

i agree that opt-out is a reasonable starting point for the community, especially if we make it clear we're collecting, make it clear how to opt out, share the data with the community, and demonstrate ways we use the data to benefit the community.

i don't think all those things must, or even can, be done before proceeding.

let's proceed in good faith.

the kubeflow-discuss post has given this heightened attention for a week now. i propose this be on the agenda for the next community meeting and give until the following day for comments before proceeding w/ opt-out.

jlewi commented on May 20, 2024

@mattf Sounds good. I've updated the PR to make it opt in for now and updated the instructions to include the commands to opt in (and make it clear you can skip them).
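
A minimal sketch of what "opt in" means for the deployment, assuming a hypothetical config object: the reporting component is added only when the user explicitly enables it, and skipping the opt-in commands leaves it out. The names `TelemetryConfig`, `report_usage`, `usage_id`, and `usage-reporter` are illustrative assumptions, not the actual Kubeflow parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TelemetryConfig:
    # Hypothetical setting names; reporting stays off unless explicitly enabled.
    report_usage: bool = False
    usage_id: str = ""


def components_to_deploy(config: TelemetryConfig) -> List[str]:
    """Return the component list, adding the usage reporter only on explicit opt-in."""
    components = ["core"]
    # Both the flag and an identifier must be set; the default config reports nothing.
    if config.report_usage and config.usage_id:
        components.append("usage-reporter")
    return components


default_components = components_to_deploy(TelemetryConfig())
opted_in_components = components_to_deploy(
    TelemetryConfig(report_usage=True, usage_id="example-id")
)
```

With the default config, only the core component is deployed, which is the behaviour a user gets by skipping the opt-in commands.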

mhausenblas commented on May 20, 2024

I'm for opt-in (with a very clear, strong, red-blink notice at install time). While a questionnaire like the one suggested by @inc0 sounds nice, I believe the point is automation, so I don't think the folks who want the data for planning or whatever reasons would prefer that option (understandably so).

mhausenblas commented on May 20, 2024

After having reviewed the kubernetes-incubator/spartakus source code now I do have a question: given that it has a hard dependency on BigQuery, how are folks supposed to use this behind a firewall, in an on-premises setup?

Don't get me wrong, I love and admire BigQuery (heck, a long time ago I even contributed to the open source version of the underlying engine called Dremel, that is, Apache Drill), but I really wouldn't know how I'd explain to someone who wanted to set up Kubeflow in a stand-alone fashion that in order to do so she needs a BigQuery account and can't really use Kubeflow "off-line" with telemetry enabled. Please tell me I'm missing something obvious here?

aronchick commented on May 20, 2024

jlewi commented on May 20, 2024

Per the discussion in this thread, we are now collecting metrics opt-in. This is described in our instructions: https://github.com/kubeflow/kubeflow#steps

So I'm closing this issue.

@mhausenblas thanks for chipping in on spartakus that will be very useful.

jlewi commented on May 20, 2024

This looks pretty easy to set up.

  1. Set up a GKE cluster running the collector.
  2. Add the volunteer component to our ksonnet core package with an option to disable it.

jlewi commented on May 20, 2024

Created the project kubeflow.org/kubeflow-usage

Create the cluster

gcloud container clusters create --project=kubeflow-usage reporting --zone=us-central1-c

Reserve a static IP

gcloud compute --project=kubeflow-usage addresses create stats-collector --global

jlewi commented on May 20, 2024

Created a DNS record to associate stats-collector.kubeflow.org with the static IP.

erikerlandson commented on May 20, 2024

I feel obligated to mention that modern ML technology (irony!) has demonstrated the ability to infer PII from patterns in data that have no literal PII in them. To be clear, when I look at the information currently broadcast by spartakus, I can't off the top of my head imagine a scenario for how that would happen here. OTOH that's what ML is good at, exploiting patterns humans can't directly perceive.

And yes, users can opt out :)

elmiko commented on May 20, 2024

> And yes, users can opt out :)

+1

erikerlandson commented on May 20, 2024

Is there a writeup anywhere that gives examples of the various stats that spartakus will collect, and how we plan to use those to improve Kubeflow roadmapping?

jlewi commented on May 20, 2024

@erikerlandson https://github.com/kubernetes-incubator/spartakus describes the basic metrics collected; these are all generic K8s metrics that aren't Kubeflow specific.

So I think the immediate use for these metrics is so that contributors to Kubeflow can demonstrate impact and justify further investment.

I think the next step would be to collect more specific Kubeflow metrics to see which components are being used.

erikerlandson commented on May 20, 2024

@jlewi so iiuc, the idea is to demonstrate that Kubeflow is being used in the wild? As in "our metrics show that xxx Kubeflow clusters are reporting in, and here is a plot of Kubeflow cluster reports over time"

If I'm reading the report definitions right, it's reporting total resources available on nodes in a cluster. Like "here is a node that has 1TB of RAM" as opposed to "here is a pod using 200MB of RAM"

aronchick commented on May 20, 2024

jlewi commented on May 20, 2024

@erikerlandson An obvious metric to track would be deployments of different versions of Kubeflow. This will help us make informed decisions about breaking changes and how much effort to spend supporting older versions.
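
As a rough illustration of that metric, the sketch below tallies deployments per version from a list of reports. It assumes each report carries a cluster identifier and a version string; the field names `cluster_id` and `version` and the sample data are made up for this example and are not what Spartakus reports today.

```python
from collections import Counter
from typing import Dict, Iterable


def deployments_by_version(reports: Iterable[Dict[str, str]]) -> Counter:
    """Count clusters per version, using only each cluster's most recent report."""
    latest: Dict[str, str] = {}
    # Reports are assumed to be ordered oldest to newest, so later entries win.
    for report in reports:
        latest[report["cluster_id"]] = report["version"]
    return Counter(latest.values())


# Made-up sample: cluster "a" upgraded between its two reports.
sample_reports = [
    {"cluster_id": "a", "version": "0.1"},
    {"cluster_id": "b", "version": "0.1"},
    {"cluster_id": "a", "version": "0.2"},
]
version_counts = deployments_by_version(sample_reports)
```

Counting each cluster once, by its latest report, keeps a frequently reporting cluster from inflating the numbers for its version.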

jlewi commented on May 20, 2024

100% on board with the first 2. One of the main reasons we want to collect this data is to build trust in Kubeflow by showing that companies/individuals investing in Kubeflow are extending their reach.

I'm strongly in favor of starting with opt out and seeing what users think. We're still in alpha/experimental so I think that's very reasonable.

If we're opt out we'll get much higher participation just because it's the default option.

aronchick commented on May 20, 2024

+1 with 100% about the first two - this should absolutely be available and build trust.

I think we're saying the same thing on #3 - specifically, Matthew has said (which I support), that we have a strong social contract and trivial opt out.

Trivial opt out is done (just one command, and it's gone). What does a clear social contract look like?

mattf commented on May 20, 2024

the social contract is embodied in doing (1) and (2).

gsunner commented on May 20, 2024

I agree that trust and transparency should be the main goal.

We are also looking to get some basic usage tracking on our project seldon-core using spartakus.
We also have the same issue of whether to have the usage tracking on by default with an easy opt-out.

As we are in the process of integrating Seldon and Kubeflow, we would also want to take advantage of any global flag for an 'opt-out' of all tracking.

Also as you are proposing to share collected data with the community - we may not need to collect the same data as long as usage of Kubeflow related components such as Seldon is also available.

jlewi commented on May 20, 2024

It seems like the consensus is that collecting metrics is a good thing.

Let's start with opt in opt out and see what users say. If people would strongly prefer opt out we can change.

@gsunner My hope is that in follow on PRs we can include additional metadata to break down usage by component.

Does someone want to approve the actual PR?

aronchick commented on May 20, 2024

jlewi commented on May 20, 2024

@aronchick That was a typo on my part. I agree with you about making it opt out by default.

aronchick commented on May 20, 2024

elmiko commented on May 20, 2024

> opt-in is my personal view.

same for me, thanks for updating the PR @jlewi

jlewi commented on May 20, 2024

PR has been submitted with opt in.

I have created a group, [email protected], to give access to the data in BigQuery to folks preparing reports. I've given access to @chrisheecho, who's been doing some of our data analysis and who I'm going to ask to prepare some initial reports.

I can share access with other folks who will be working on preparing reports for the community.

I'll also open up an issue on whether we should make the raw data open to all.

inc0 commented on May 20, 2024

As I said in the meeting, even opt-in is iffy for me. This can be a security risk and, well, the damage from these can be hard to recover from. Another concern is the usefulness of this data. We can see the scale of the clusters people use, but how much of that is Kubeflow? We could add a footnote saying that if you're willing to run Spartakus, that's our endpoint, and thank you :)

I'd rather create a Google Doc (?) questionnaire that we can modify, and ask open questions tailored to actually improving our project. If we ask for scale brackets rather than the exact number of nodes, it's easier to convince operators to share this info, etc.
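
To make the "scale brackets" idea concrete, here is a small sketch that maps an exact node count to a coarse bracket, so only the bracket label would need to be shared. The boundaries and labels are arbitrary assumptions chosen for illustration; nothing in this thread fixes them.

```python
import bisect

# Hypothetical bracket boundaries; each upper bound is inclusive.
UPPER_BOUNDS = [1, 10, 100, 1000]
LABELS = ["1", "2-10", "11-100", "101-1000", "1001+"]


def scale_bracket(node_count: int) -> str:
    """Map an exact node count to a coarse bracket label."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # bisect_left returns the index of the first upper bound >= node_count,
    # or len(UPPER_BOUNDS) when the count exceeds every bound.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, node_count)]


small_cluster = scale_bracket(7)
large_cluster = scale_bracket(250)
```

Reporting the label instead of the number hides the exact cluster size while still showing the rough scale at which people run.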

mhausenblas commented on May 20, 2024

I asked around a bit and Tim confirmed Spartakus is a PoC and so I think, since we've apparently decided to adopt it, it would make sense to do it properly ;)

I've reached out to Tim to see how I can get involved so that if we have needs (for example, my interest for on-prem deployments is to allow for alternative back-ends) we can meet them in a timely fashion. WDYT @jlewi @aronchick?

aronchick commented on May 20, 2024

mhausenblas commented on May 20, 2024

Thanks @aronchick.

> In re: on-prem, part of the idea is that we're able to track how this is
> being used even on-prem. The fact that it uses a centralized logging system
> (BQ) is a feature, not a bug, because otherwise how would we aggregate?
> Because opting out is SO trivial, we're hoping that it doesn't cause any
> issues.

Yes, I get that and I hope you remember that we actually decided on an opt-in policy ;)

> I think I might be missing the point in re: using KF offline - did you mean
> you think that users would like to aggregate all the KF deployed across
> their enterprise in an offline way? What an interesting (and exciting)
> proposition! I love the idea of exploring that.

That is exactly what I mean, apologies for not being able to communicate that better. We're all guilty of having a bit of a tunnel vision as we're living in a bubble where we take the tools in our org for granted, but you can trust me, I've been in enough situations with users/customers that went like: "what do you mean, technology X is hard-wired and can't be replaced?" not gonna use/buy it …

FWIW, I'm in touch with @thockin concerning Spartakus, will raise issues there and see how I can help in refactoring and extending the plug-able backend stuff with the goal to have a reliable component we can ship with Kubeflow. Hope that makes sense?

mhausenblas commented on May 20, 2024

@aronchick for now I think we should be good, thanks. I'm trying to get involved in Spartakus to ensure that it's a stable and reliable component for our needs, for starters I'm focusing on improving the docs, see kubernetes-retired/spartakus#31 and then we'll see how merciful Mr @thockin is with my refactoring PRs ;)

jlewi commented on May 20, 2024

@mhausenblas The spartakus collector defines an interface that abstracts away the database. So if someone wanted to support a DB other than BigQuery it should be pretty straightforward.
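
The general pattern can be sketched as follows. This is a Python illustration of a collector that depends only on an abstract storage interface, not Spartakus's actual Go code; the names `ReportStore`, `InMemoryStore`, `Collector`, and the `cluster_id` field are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ReportStore(ABC):
    """Hypothetical storage interface; the collector knows nothing beyond this."""

    @abstractmethod
    def store(self, report: Dict[str, object]) -> None:
        """Persist one usage report."""


class InMemoryStore(ReportStore):
    """Stand-in backend for setups without access to a hosted database."""

    def __init__(self) -> None:
        self.reports: List[Dict[str, object]] = []

    def store(self, report: Dict[str, object]) -> None:
        # Copy the report so later changes by the caller don't alter stored data.
        self.reports.append(dict(report))


class Collector:
    """Validates incoming reports and hands them to whichever backend was injected."""

    def __init__(self, backend: ReportStore) -> None:
        self._backend = backend

    def handle(self, report: Dict[str, object]) -> None:
        if "cluster_id" not in report:
            raise ValueError("report is missing cluster_id")
        self._backend.store(report)


backend = InMemoryStore()
Collector(backend).handle({"cluster_id": "example-cluster", "nodes": 3})
```

Supporting another database would then mean writing one more `ReportStore` subclass, with no change to the collector itself.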
