
cluster-data's Introduction

Overview

This repository describes various traces from parts of the Google cluster management software and systems.

  • Please join our (low volume) discussion group, so we can send you announcements, and you can let us know about any issues, insights, or papers you publish using these traces. Important: to avoid spammers, you MUST fill out the "reason" field, or your application will be rejected. Once you are a member, you can send email to [email protected] to:

    • Announce tools and techniques that can help others analyze or decode the trace data.
    • Share insights and surprises.
    • Ask questions (the group has a few hundred members) and get help. If you ask for help, please include concrete examples of issues you run into; screen shots; error codes; and a list of what you have already tried. Don't just say "I can't download the data"!
  • We provide a trace bibliography of papers that have used and/or analyzed the traces, and encourage anybody who publishes one to add it to the bibliography using a github pull request [preferred], or by emailing the bibtex entry to [email protected]. In either case, please mimic the existing format exactly.

Borg cluster workload traces

These are traces of workloads running on Google compute cells that are managed by the cluster management software internally known as Borg.

  • version 3 (aka ClusterData2019) provides data from eight Borg cells over the month of May 2019.
  • version 2 (aka ClusterData2011) provides data from a single 12.5k-machine Borg cell from May 2011.
  • version 1 is an older, short trace that describes a 7-hour period from one cell in 2009. It is deprecated; we strongly recommend using the version 2 or version 3 traces instead.

ETA traces

In addition, this site hosts a set of execution traces from ETA (Exploratory Testing Architecture), a testing framework that explores interactions between distributed, concurrently-executing components, with an eye toward improving how they are tested.

Power traces

This site also hosts power traces for 57 power domains during the month of May 2019. These traces complement the ClusterData2019 dataset.

License

Creative Commons CC-BY license. The data and trace documentation are made available under the CC-BY license. By downloading or using them, you agree to the terms of this license.

cluster-data's People

Contributors

ajajoo, charlesreiss, fjxmlzn, johnwilkes, lordbarker, lsliwko, monnand, moonlightdrive, nikhil96sher


cluster-data's Issues

Unable to download Data ClusterData2011_10

Hi, I'm a student trying to download the trace data clusterData2011_10. When I use
gsutil with the command "gsutil ls gs://clusterdata-2011-10", the
following error occurs:

GSResponseError:: status=404, code=NoSuchBucket, reason=Not Found.

What is the problem?

I need your help. Thanks!

Original issue reported on code.google.com by [email protected] on 29 Oct 2011 at 8:11
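For reference, the error above means the named bucket does not exist. A minimal sketch of downloading with gsutil, assuming the version-2 trace is hosted in the gs://clusterdata-2011-2 bucket and that the table files follow the part-NNNNN-of-NNNNN naming described in the trace documentation (verify both against the current docs before downloading):

```shell
# Sketch: inspect and fetch trace files with gsutil (part of the Google Cloud SDK).
# Bucket and file names are assumptions taken from the version-2 trace docs.
gsutil ls gs://clusterdata-2011-2

# Copy one shard of the task_events table to a local directory;
# -m enables parallel transfers for larger copies.
mkdir -p ./task_events
gsutil -m cp 'gs://clusterdata-2011-2/task_events/part-00000-of-00500.csv.gz' ./task_events/
```

The bucket is public, so no special credentials should be needed beyond a working gsutil installation.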

Questions about job dependencies

Hi, I want to ask about job dependencies in the 2019 trace.
In the "Collections and instances" section, the doc mentioned

A common pattern is to run masters (controllers) and workers in separate jobs (e.g., this is used in MapReduce and similar systems). A worker job that has a master job as a parent​ will automatically be terminated when the ​parent​ exits, even if its workers are still running. A job can have multiple child jobs but only one ​parent​.

Another pattern is to run jobs in a pipeline (this doesn’t apply to alloc sets). If job A says that it should ​run after​ job B, then job A will only be scheduled (made READY) after job B successfully finishes. A job can list multiple jobs that it should ​run after,​ and will only be scheduled after all those jobs finish successfully. 

Could you provide a few examples of the second pattern?

For example, if a Spark program were executed on Borg in the trace, would its different stages be scheduled as separate jobs in a pipeline? Or would they be scheduled as tasks, with their order decided by an external scheduler?

can't download

I am in China and cannot access the download site. Is there another way to download the data?

Clarification of cores and memory usage

From my initial analysis of the data set, 51.61% of the tasks have a CPU
column value of 0 -- what does this mean?  Obviously each task must use at
least some fraction of the CPU.  The same goes for memory usage.  0.94% of
all tasks have a memory column value of 0.

A clarification of this would be much appreciated.

Original issue reported on code.google.com by [email protected] on 18 Mar 2010 at 11:12

data center architecture

Hi,
May I ask about the architecture of the data center network (DCN)?
What DCN architecture was in use when these traces were collected?

Thank you so much!

Identification of MapReduce jobs/tasks?

I assume that these traces contain many MapReduce jobs. Since my research topic 
is the performance modeling of MapReduce jobs, I am very interested in 
identifying the MapReduce jobs in these traces, and in being able to 
distinguish the map and reduce tasks.

Would you (Google) be able and willing to provide a mapping for all MapReduce 
tasks to task type (e.g., (job ID, task index) -> (map|reduce|...)), or if that 
is not feasible maybe just a list of logical jobnames (or job IDs) of MapReduce 
jobs?

Original issue reported on code.google.com by [email protected] on 8 Dec 2011 at 3:38

Sample portion for all records equal to zero

Hello,

I have noticed that no task-usage record contains a non-zero sample portion; I have verified this programmatically. According to the format & schema document, this field represents the ratio of expected to observed samples during a measurement period, so it should not be zero for all records. Why is that the case?

the units for some values

Hello, what are the units for the values in the 'CPU request' and 'memory request' columns of the "task events" table?

Priority of the Best-effort Batch (beb) tier jobs

In the Google cluster-usage traces v3 document, the priority range of best-effort batch (beb) jobs is 100-115, whereas it is given as 110-115 in the "Borg: the Next Generation" paper, and the paper does not mention how jobs with priorities between 100 and 110 are handled. Which range is correct? And does that mean jobs with priorities between 100 and 110 were not included in the paper's analysis?

Thanks!

User does not have bigquery.jobs.create permission in project google.com:google-cluster-data

I am trying to query in the bq shell from a VM on Google Cloud Platform and get a permission error. Here is what I did:

bash-5.0# bq shell --project_id=google.com:google-cluster-data
Welcome to BigQuery! (Type help for more information.)
google.com:google-cluster-data> ls
      datasetId       
 -------------------- 
  clusterdata_2011_1  
  clusterdata_2019_a  
  clusterdata_2019_b  
  clusterdata_2019_c  
  clusterdata_2019_d  
  clusterdata_2019_e  
  clusterdata_2019_f  
  clusterdata_2019_g  
  clusterdata_2019_h  
google.com:google-cluster-data> query 'select count(*) from clusterdata_2011_1'
BigQuery error in query operation: Access Denied: Project google.com:google-cluster-data: User does not have bigquery.jobs.create permission in project google.com:google-cluster-data.

Is this not the way to do it?
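For what it's worth, this error usually arises because BigQuery query jobs must be created in a project you control (with billing or the free quota attached), not in the project that merely hosts the dataset. A sketch of the usual workaround, where MY_PROJECT is a placeholder for your own project and the table name is taken from the 2019 trace schema (verify it against the docs):

```shell
# Run the query job under your own project, referencing the public dataset
# by its fully qualified name. MY_PROJECT is a placeholder.
bq query --project_id=MY_PROJECT --nouse_legacy_sql \
  'SELECT COUNT(*) FROM `google.com:google-cluster-data.clusterdata_2019_a.instance_events`'
```

The query job (and any charges for bytes scanned) is then billed to MY_PROJECT, while the data stays read-only in the host project.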

Link to research blog at project home is broken

The "research blog" external link on the project home page is linking to

http://www.blogger.com/post-create.g?blogID=1901025479979892432

which is a Blogger post-creation URL rather than the blog itself.

Original issue reported on code.google.com by [email protected] on 12 Jun 2012 at 1:52

wrong timestamp format

Hi,
Could you please explain how to interpret the timestamps (start and end time)? When I query a start time from a table, it returns Unix-style timestamps such as the following:

2076900000000 -> UTC Thu Oct 25 2035 04:40:00
1500120000000 -> UTC Sat Jul 15 2017 12:00:00

According to the documentation, the data should contain observations from 2019.
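A likely explanation is that trace time fields are microsecond offsets from the start of the trace, not Unix timestamps in milliseconds. A minimal sketch in Python, assuming a placeholder epoch of 2019-05-01 00:00 UTC (the actual epoch is defined in the ClusterData2019 documentation and should be substituted):

```python
from datetime import datetime, timedelta, timezone

# Placeholder epoch -- the actual trace start is defined in the
# ClusterData2019 documentation; substitute the exact value from there.
TRACE_EPOCH = datetime(2019, 5, 1, tzinfo=timezone.utc)

def trace_time_to_utc(micros: int) -> datetime:
    """Interpret a trace timestamp as microseconds after the trace epoch,
    rather than as a Unix timestamp in milliseconds."""
    return TRACE_EPOCH + timedelta(microseconds=micros)

# 2076900000000 us is about 24 days, which lands in late May 2019 as expected:
print(trace_time_to_utc(2076900000000))  # 2019-05-25 00:55:00+00:00
```

Read as an offset, the second sample value (1500120000000 us, roughly 17 days) also falls in mid-May 2019, consistent with the trace period.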

Question about job_name and logic_name

I found that job_name and logic_name have a many-to-one relationship. So, for two jobs with the same logic_name but different job_names, what is the same and what is different? I know they run the same code, but is there a difference in their input data sizes?
Looking forward to your reply. Thanks!

Misleading column names

The data I downloaded has columns

Time ParentID TaskID JobType NrmlTaskCores NrmlTaskMem

But the description says there should be

Time (int)
JobID (int)
TaskID (int)
Job Type (0, 1, 2, 3)
Normalized Task Cores (float)
Normalized Task Memory (float)

JobID is named ParentID?

Original issue reported on code.google.com by [email protected] on 13 Jun 2010 at 1:22

download time out

Hi, it always times out when I use gsutil to download. Can you help me? Thank you very much.

Task durations

It would be quite useful if task durations were included in the data set. 
This way, the data could be used as input to test, say, a job placement
strategy.

Original issue reported on code.google.com by [email protected] on 18 Mar 2010 at 11:10
