|
The computer failure data repository (CFDR)
With the growing scale of today's IT installations, component failure
is becoming an ever larger problem.
Yet virtually no data on failures in real systems is publicly available, forcing
researchers working on system reliability to base their work on anecdotes and back-of-the-envelope
calculations rather than empirical data.
The computer failure data repository (CFDR) aims to
accelerate research on system reliability by filling the nearly
empty collection of public data with
detailed failure data from a variety of large production systems.
Please join us, either by contributing data,
downloading data, or joining our mailing lists.
News
You are viewing a first draft of the CFDR. For feedback and
comments, please contact the moderators.
Available data
The table below provides an overview of the available data sets.
| Name | Time period | System type | Type of data |
|---|---|---|---|
| LANL | Dec 96 - Nov 05 | HPC clusters | The data covers node outages at 22 cluster systems at LANL, including a total of 4,750 nodes and 24,101 processors. Usage logs and error logs are available as well. |
| HPC1 | Aug 01 - May 06 | HPC cluster | The data covers hardware replacements at a 765 node cluster with more than 3,000 hard drives. |
| HPC2 | Jan 04 - Jul 06 | HPC cluster | Hard drive replacements in a 256 node cluster with 520 drives. |
| HPC3 | Dec 05 - Nov 06 | HPC cluster | Hard drive replacements observed in a 1,532-node HPC cluster with more than 14,000 drives. |
| HPC4 | 2004 - 2006 | HPC cluster | Error logs collected at 5 supercomputing systems at SNL and LLNL, ranging from 512 to 131,072 processors. |
| PNNL | Nov 03 - Sep 07 | HPC cluster | Hardware failures recorded on the MPP2 system (a 980 node HPC cluster) at PNNL. |
| NERSC | 2001 - 2006 | HPC cluster | I/O specific failures collected at a number of production systems at NERSC. |
| COM1 | May 2006 | Internet services cluster | Hardware failures recorded by an internet service provider and drawing from multiple distributed sites. |
| COM2 | Sep 04 - Apr 06 | Internet services cluster | Warranty service log of hardware failures aggregating events in multiple distributed sites. |
| COM3 | Jan 05 - Dec 05 | Internet services cluster | Aggregate quarterly statistics of disk failures at a large external storage system. |
| ask.com | Dec 06 - Feb 07 | Internet services cluster | Memory error data collected on a 212 node server farm at ask.com. |
How to contribute
First of all, thank you for your interest in contributing to the CFDR.
If your data is already public on your reference web page
so that anyone can download it, then all you need to do is send us a pointer to your
reference web page and a brief description of the data.
Otherwise, if you want to make the first release of your data
through the CFDR, the data contribution procedure is as follows:
1. We need to have the necessary paperwork on file to show that we actually have
permission to host this data. You need to sign, or find someone
at your organization to sign, our contributor's agreement.
2. If the data contains sensitive
information such as user or vendor names, you need to sanitize (anonymize) it.
If you don't have proper sanitization tools, we will try to help you.
3. Please provide any available documentation or description of the data you are contributing. If no documentation is readily available, it would be helpful to create some in the form
of a FAQ with answers to frequently asked questions about the data. You can take a look at the
FAQ accompanying the LANL data sets to get an idea
of the kind of questions people commonly ask about failure data.
4. Make your data accessible to us, and we will host it on the CFDR server.
Thanks!
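As a rough illustration of the sanitization step above, here is a minimal sketch in Python. It is not a CFDR tool, and the field names ("user", "vendor"), the record layout, and the 12-character pseudonym length are all assumptions made for this example. It replaces sensitive values with keyed-hash pseudonyms, so the same user or vendor always maps to the same pseudonym while the original name cannot be read from the released data.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes, prefix: str) -> str:
    """Map a sensitive value to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating to 12 hex characters is an arbitrary choice for readability.
    return f"{prefix}_{digest[:12]}"


def sanitize_record(record: dict, sensitive_fields: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` with sensitive fields replaced by pseudonyms.

    `sensitive_fields` maps a (hypothetical) field name to the prefix
    used for pseudonyms of that field.
    """
    clean = dict(record)
    for field, prefix in sensitive_fields.items():
        # Skip fields that are missing or empty in this record.
        if field in clean and clean[field] != "":
            clean[field] = pseudonymize(str(clean[field]), secret_key, prefix)
    return clean


if __name__ == "__main__":
    # The key must stay private and must not be released with the data.
    key = b"replace-with-a-long-random-secret"
    fields = {"user": "user", "vendor": "vendor"}
    record = {"node": "n17", "user": "alice", "vendor": "ExampleCorp",
              "event": "disk replaced"}
    print(sanitize_record(record, fields, key))
```

A keyed hash is used rather than a plain hash because short names are easy to guess and re-hash. Whether pseudonyms are sufficient for a given data set is a judgment for the contributing site. Free-text fields such as log messages may also contain names and would need separate treatment.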
Best Practices
Currently, data collection and analysis are
complicated by the fact that there is no widely accepted
format for anomaly data and there are no guidelines on what data to collect and how.
We hope that the experiences from working with a variety of sites on
collecting and analyzing failure data will lead to some best practices
for failure data collection.
Providing such guidelines will make it easier for sites to collect data that is
useful and comparable across sites.
If you would like to contribute your experiences on collecting or working with failure
data, please contact the moderators.
FAQ
Access to CFDR
How can I access CFDR data or tools?
You first need to register
for full access to CFDR data/tools to obtain a user id/password. You can then go to the data overview page
to download data. Clicking a link and entering your id and password will start the download.
Contribution to CFDR
How can I upload and release my data or tools through CFDR?
If your data or tools are already public on your reference web page so that anyone can download them, then just let us know. We will download your data, create metadata and host them in CFDR. Otherwise, if you want to make the first release of your data through CFDR, see the "How To Contribute" page.
Registration
You first need to register
for full access to CFDR data/tools to obtain a user id/password. You can then go to the data overview page
to download data.
About the computer failure data repository
The Computer Failure Data Repository (CFDR) started as an initiative at CMU in 2006 and was
motivated by the fact that hardly any failure data from real, large-scale production systems
is available to researchers.
The goal of the CFDR is to collect and make available failure data from a large variety
of sites, enabling researchers to gain a better understanding of the characteristics of
failures in the real world.
The CFDR started to become reality when Los Alamos National Laboratory (LANL)
decided to publicly release a large set of failure data collected at LANL's HPC systems.
The data was collected over 9 years, covers more than 23,000 outages, and was the first
to become publicly available as part of the CFDR.
The current moderators of this site are Garth A. Gibson and Bianca Schroeder. You can contact
them by e-mail.
Contact us
We would like to hear your feedback, comments, insights and experiences!
Please e-mail them to the moderators.
|