
cluster-data's Introduction

Overview

This repository describes various traces from parts of the Google cluster management software and systems.

  • Please join our (low volume) discussion group, so we can send you announcements, and you can let us know about any issues, insights, or papers you publish using these traces. Important: to avoid spammers, you MUST fill out the "reason" field, or your application will be rejected. Once you are a member, you can send email to [email protected] to:

    • Announce tools and techniques that can help others analyze or decode the trace data.
    • Share insights and surprises.
    • Ask questions (the group has a few hundred members) and get help. If you ask for help, please include concrete examples of issues you run into; screen shots; error codes; and a list of what you have already tried. Don't just say "I can't download the data"!
  • We provide a trace bibliography of papers that have used and/or analyzed the traces, and encourage anybody who publishes one to add it to the bibliography using a github pull request [preferred], or by emailing the bibtex entry to [email protected]. In either case, please mimic the existing format exactly.

Borg cluster workload traces

These are traces of workloads running on Google compute cells that are managed by the cluster management software internally known as Borg.

  • version 3 (aka ClusterData2019) provides data from eight Borg cells over the month of May 2019.
  • version 2 (aka ClusterData2011) provides data from a single 12.5k-machine Borg cell from May 2011.
  • version 1 is an older, short trace that describes a 7-hour period from one cell in 2009. It is deprecated; we strongly recommend using the version 2 or version 3 traces instead.

ETA traces

In addition, this site hosts a set of execution traces from ETA (Exploratory Testing Architecture), a testing framework that explores interactions between distributed, concurrently-executing components, with an eye toward improving how they are tested.

Power traces

This site also hosts power traces for 57 power domains during the month of May 2019. These traces complement the ClusterData2019 dataset.

License

Creative Commons CC-BY license. The data and trace documentation are made available under the CC-BY license. By downloading or using them, you agree to the terms of this license.

cluster-data's People

Contributors

ajajoo, charlesreiss, fjxmlzn, johnwilkes, lordbarker, lsliwko, monnand, moonlightdrive, nikhil96sher


cluster-data's Issues

Unable to download Data ClusterData2011_10

Hi, I'm a student trying to download the trace data clusterData2011_10. When I use
gsutil with the command "gsutil ls gs://clusterdata-2011-10", the
following error occurs:

GSResponseError:: status=404, code=NoSuchBucket, reason=Not Found.

What is the problem?

I need your help. Thanks!

Original issue reported on code.google.com by [email protected] on 29 Oct 2011 at 8:11
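For reference, the error above means the named bucket does not exist. A minimal sketch of downloading with gsutil, assuming the version-2 trace is hosted in the gs://clusterdata-2011-2 bucket and that the table files follow the part-NNNNN-of-NNNNN naming described in the trace documentation (verify both against the current docs before downloading):

```shell
# Sketch: inspect and fetch trace files with gsutil (part of the Google Cloud SDK).
# Bucket and file names are assumptions taken from the version-2 trace docs.
gsutil ls gs://clusterdata-2011-2

# Copy one shard of the task_events table to a local directory;
# -m enables parallel transfers for larger copies.
mkdir -p ./task_events
gsutil -m cp 'gs://clusterdata-2011-2/task_events/part-00000-of-00500.csv.gz' ./task_events/
```

The bucket is public, so no special credentials should be needed beyond a working gsutil installation.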

Questions about job dependencies

Hi, I want to ask about job dependencies in the 2019 trace.
In the "Collections and instances" section, the doc mentioned

A common pattern is to run masters (controllers) and workers in separate jobs (e.g., this is used in MapReduce and similar systems). A worker job that has a master job as a parent​ will automatically be terminated when the ​parent​ exits, even if its workers are still running. A job can have multiple child jobs but only one ​parent​.

Another pattern is to run jobs in a pipeline (this doesn’t apply to alloc sets). If job A says that it should ​run after​ job B, then job A will only be scheduled (made READY) after job B successfully finishes. A job can list multiple jobs that it should ​run after,​ and will only be scheduled after all those jobs finish successfully. 

Could you provide a few examples of the second pattern?

For example, if a Spark program were executed on Borg in the trace, would its different stages be scheduled as separate jobs in a pipeline? Or would they be scheduled as tasks, with their order decided by an external scheduler?

can't download

I am in China and cannot access the download site. Is there another way to download the data?

Clarification of cores and memory usage

From my initial analysis of the data set, 51.61% of the tasks have a CPU
column value of 0 -- what does this mean?  Obviously each task must use at
least some fraction of the CPU.  The same goes for memory usage.  0.94% of
all tasks have a memory column value of 0.

A clarification of this would be much appreciated.

Original issue reported on code.google.com by [email protected] on 18 Mar 2010 at 11:12

data center architecture

Hi,
May I ask about the architecture of the data center network (DCN)?
What DCN architecture was in use when these traces were collected?

Thank you so much!

Identification of MapReduce jobs/tasks?

I assume that these traces contain many MapReduce jobs. Since my research topic 
is the performance modeling of MapReduce jobs, I am very interested in 
identifying the MapReduce jobs in these traces, and in being able to 
distinguish the map and reduce tasks.

Would you (Google) be able and willing to provide a mapping for all MapReduce 
tasks to task type (e.g., (job ID, task index) -> (map|reduce|...)), or if that 
is not feasible maybe just a list of logical jobnames (or job IDs) of MapReduce 
jobs?

Original issue reported on code.google.com by [email protected] on 8 Dec 2011 at 3:38

Sample portion for all records equal to zero

Hello,

I have noticed that no task-usage record contains a non-zero sample portion; I have verified this programmatically. According to the format & schema document, this field represents the ratio of expected to observed samples during a measurement period, so it should not be zero for all records. Why is that the case?

the units for some values

Hello, what are the units for the values in the 'CPU request' and 'memory request' columns of the "task events" table?

Priority of the Best-effort Batch (beb) tier jobs

In the Google cluster-usage traces v3 document, the priority range of best-effort batch (beb) jobs is 100-115, whereas it is given as 110-115 in the "Borg: the Next Generation" paper, and the paper does not mention how jobs with priorities between 100 and 110 are handled. Which range is correct? And does that mean jobs with priorities between 100 and 110 were not included in the paper's analysis?

Thanks!

User does not have bigquery.jobs.create permission in project google.com:google-cluster-data

I am trying to query in the bq shell from a VM on Google Cloud Platform and get a permission error. Here is what I did:

bash-5.0# bq shell --project_id=google.com:google-cluster-data
Welcome to BigQuery! (Type help for more information.)
google.com:google-cluster-data> ls
      datasetId       
 -------------------- 
  clusterdata_2011_1  
  clusterdata_2019_a  
  clusterdata_2019_b  
  clusterdata_2019_c  
  clusterdata_2019_d  
  clusterdata_2019_e  
  clusterdata_2019_f  
  clusterdata_2019_g  
  clusterdata_2019_h  
google.com:google-cluster-data> query 'select count(*) from clusterdata_2011_1'
BigQuery error in query operation: Access Denied: Project google.com:google-cluster-data: User does not have bigquery.jobs.create permission in project google.com:google-cluster-data.

Is this not the way to do it?
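For what it's worth, this error usually arises because BigQuery query jobs must be created in a project you control (with billing or the free quota attached), not in the project that merely hosts the dataset. A sketch of the usual workaround, where MY_PROJECT is a placeholder for your own project and the table name is taken from the 2019 trace schema (verify it against the docs):

```shell
# Run the query job under your own project, referencing the public dataset
# by its fully qualified name. MY_PROJECT is a placeholder.
bq query --project_id=MY_PROJECT --nouse_legacy_sql \
  'SELECT COUNT(*) FROM `google.com:google-cluster-data.clusterdata_2019_a.instance_events`'
```

The query job (and any charges for bytes scanned) is then billed to MY_PROJECT, while the data stays read-only in the host project.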

Link to research blog at project home is broken

The "research blog" external link on the project home page is linking to

http://www.blogger.com/post-create.g?blogID=1901025479979892432

which is a Blogger post-creation URL rather than the blog itself.

Original issue reported on code.google.com by [email protected] on 12 Jun 2012 at 1:52

wrong timestamp format

Hi,
Could you please explain how to interpret the timestamps (start and end time)? When I query a start time from a table, it returns Unix-style timestamps such as the following:

2076900000000 -> UTC Thu Oct 25 2035 04:40:00
1500120000000 -> UTC Sat Jul 15 2017 12:00:00

According to the documentation, the data should contain observations from 2019.
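A likely explanation is that trace time fields are microsecond offsets from the start of the trace, not Unix timestamps in milliseconds. A minimal sketch in Python, assuming a placeholder epoch of 2019-05-01 00:00 UTC (the actual epoch is defined in the ClusterData2019 documentation and should be substituted):

```python
from datetime import datetime, timedelta, timezone

# Placeholder epoch -- the actual trace start is defined in the
# ClusterData2019 documentation; substitute the exact value from there.
TRACE_EPOCH = datetime(2019, 5, 1, tzinfo=timezone.utc)

def trace_time_to_utc(micros: int) -> datetime:
    """Interpret a trace timestamp as microseconds after the trace epoch,
    rather than as a Unix timestamp in milliseconds."""
    return TRACE_EPOCH + timedelta(microseconds=micros)

# 2076900000000 us is about 24 days, which lands in late May 2019 as expected:
print(trace_time_to_utc(2076900000000))  # 2019-05-25 00:55:00+00:00
```

Read as an offset, the second sample value (1500120000000 us, roughly 17 days) also falls in mid-May 2019, consistent with the trace period.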

Question about job_name and logic_name

I found that job_name and logic_name have a many-to-one relationship. So, for two jobs with the same logic_name but different job_names, what is the same and what is different? I know they run the same code, but is there a difference in their input data sizes?
Looking forward to your reply. Thanks!

Misleading column names

The data I downloaded has columns

Time ParentID TaskID JobType NrmlTaskCores NrmlTaskMem

But the description says there should be

Time (int)
JobID (int)
TaskID (int)
Job Type (0, 1, 2, 3)
Normalized Task Cores (float)
Normalized Task Memory (float)

JobID is named ParentID?

Original issue reported on code.google.com by [email protected] on 13 Jun 2010 at 1:22

download time out

Hi, it always times out when I use gsutil to download. Can you help me? Thank you very much.

Task durations

It would be quite useful if task durations were included in the data set. 
This way, the data could be used as input to test, say, a job placement
strategy.

Original issue reported on code.google.com by [email protected] on 18 Mar 2010 at 11:10
