| System: | Various HPC clusters |
| Duration: | 2001 - 2006 |
| Data Type: | Database with I/O specific failures |
About the data:
This data was collected to provide failure specifics for I/O-related systems and components in as much detail as possible, so that
analysis might produce some useful findings. Data were collected for storage, networking, computational machines, and file systems
in production use at NERSC from 2001 to 2006. The data was extracted from a database used for tracking system troubles,
called Remedy, and is currently stored in a MySQL database and available for export to Excel format. There are also some basic
query and graph capabilities available. For more information on the data, please visit the
NERSC web site hosting the raw data or contact the PDSI researcher at NERSC:
Akbar Mokhtarani or the Principal Investigator for PDSI at
NERSC: Bill Kramer .
Downloads:
The data and more information are available for download here.
Papers using this data:
This data has not yet been reported on in any paper.
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank Bill Kramer and Akbar Mokhtarani from NERSC for collecting the data and sharing it.
If you use these data in your work, please use a similar acknowledgment.
The COM1 data
| System: | Internet services clusters |
| Duration: | May 2006 |
| Data Type: | Hardware replacement log |
About the data:
COM1 is a log of hardware failures recorded by an internet service provider and drawn from multiple distributed sites.
Each record in the data
contains a timestamp of when the failure was repaired,
information on the failure symptoms, and a list of steps that were
taken to diagnose and repair the problem. The data does
not contain information on when each failure actually happened, only
when repair took place. The data covers a population of 26,734 10K rpm
SCSI disk drives. The total number of servers in the monitored sites is not known.
Downloads:
The data will soon become available for download.
Papers using this data:
A first analysis of the COM1 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson.
"Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?".
5th Usenix Conference on File and Storage Technologies (FAST 2007).
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank the people at the organization that provided us with the data, and that
wishes to remain unnamed, for collecting the data and helping us to interpret it.
The COM2 data
| System: | Internet services cluster |
| Duration: | September 2004 thru April 2006 |
| Data Type: | Warranty service log of hardware failures |
About the data:
COM2 is a warranty service log of hardware failures recorded on behalf of an internet service provider
aggregating events in multiple distributed sites.
Each failure record contains a repair code (e.g. "Replace hard drive") and the time when
the repair was finished. As with COM1, there is no information on the start time of each failure.
The log does not contain entries for failures of disks that were replaced at the customer site by hot-swapping in a spare disk,
since the data was created by warranty processing, which does not participate in on-site hot-swap replacements.
To account for the missing disk replacements we obtained numbers for the periodic replenishments of on-site spare disks
from the internet service provider.
The size of the underlying system changed significantly
during the measurement period, starting with
420 servers in 2004 and ending with 9,232 servers in 2006. We obtained quarterly hardware purchase records
covering this time period that make it possible to estimate the size of the disk population.
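As a rough illustration of how purchase records can be turned into a population estimate, the sketch below accumulates per-quarter purchases into a running system size and sums the resulting exposure in unit-years. The function names, the 0.25-year quarter length, the assumption that purchased hardware is deployed at the start of its quarter and never retired, and the sample numbers are all hypothetical; none of them come from the COM2 data.

```python
from itertools import accumulate


def population_by_quarter(initial, purchases):
    """Return the running system size for each quarter.

    Assumes (hypothetically) that every purchase is deployed at the start
    of its quarter and that no hardware is retired.
    """
    return [initial + total for total in accumulate(purchases)]


def exposure_years(populations, years_per_period=0.25):
    """Sum unit-years of exposure over equally long periods."""
    return sum(size * years_per_period for size in populations)


# Made-up example: 100 units at the start, then three quarters of purchases.
sizes = population_by_quarter(100, [50, 0, 150])
total_exposure = exposure_years(sizes)
```

Dividing the number of replacements observed in the same period by an exposure figure of this kind gives a replacement rate per unit per year, which is easier to compare across systems whose size changes over time.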
Downloads:
The data will soon become available for download.
Papers using this data:
A first analysis of the COM2 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson.
"Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?".
5th Usenix Conference on File and Storage Technologies (FAST 2007).
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank the people at the organization that provided us with the data, and that
wishes to remain unnamed, for collecting the data and helping us to interpret it.
The COM3 data
| System: | Internet services clusters |
| Duration: | January 2005 thru December 2005 |
| Data Type: | Aggregate hard drive replacement statistics |
About the data:
The COM3 data set comes from a large external storage system used by an internet service provider and
comprises four populations of different types of FC disks.
While this data was gathered in 2005, the system has some legacy
components that date back as far as 1998 and were known to have been physically moved after initial installation.
COM3 differs from the other data sets in that it
provides only aggregate statistics of disk failures, rather than individual records
for each failure. The data contains the counts of disks that failed
and were replaced in 2005 for each of the four disk populations.
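Because only per-population counts are available, a natural quantity to derive is a replacement rate per disk per year. The sketch below shows that arithmetic; the function name, the population labels, and all numbers are hypothetical placeholders and are not taken from the COM3 data. It also assumes each population size stayed constant over the observation period.

```python
def annual_replacement_rate(replaced, population, years=1.0):
    """Replacements per disk per year for one disk population.

    Assumes the population size was constant over the observation period.
    """
    if population <= 0 or years <= 0:
        raise ValueError("population and years must be positive")
    return replaced / (population * years)


# Made-up counts for two illustrative populations observed for one year:
# label -> (disks replaced, disks in population).
example_counts = {
    "population_a": (30, 1000),
    "population_b": (5, 200),
}

rates = {
    name: annual_replacement_rate(replaced, size)
    for name, (replaced, size) in example_counts.items()
}
```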
Downloads:
The data will soon become available for download.
Papers using this data:
A first analysis of the COM3 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson.
"Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?".
5th Usenix Conference on File and Storage Technologies (FAST 2007).
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank the people at the organization that provided us with the data, and that
wishes to remain unnamed, for collecting the data and helping us to interpret it.
The Cray data
| System: | Cray system |
| Duration: | N/A |
| Data Type: | Event logs, syslog, console logs |
About the data:
These data sets come from one or more Cray XT series machines, running Linux. They include the syslog, the event log and the console log.
The syslog contains messages produced by various Linux daemons, drivers and utilities that use the syslog protocol.
The event log is an XT-specific log that records actions on the XT control network, such as transfers of boot images.
The console log is the aggregate console output of all the nodes; the console log may partially overlap with the syslog and event log.
In addition to the system events, we will also provide a failure log that will contain a description of the failures that were encountered. This will include information about which log entry was critical in identifying this error if this information was known.
The files for download are .tar.gz archives; each one, when unpacked, contains a README file.
There are 6 dumps.
The directory name of each dump is the YYMMDDHHMM at which the dump started (basically, a unique ID).
The first section of the README explains what Cray thought happened and (very) briefly why that was thought to be the problem.
The next sections are the same in all the READMEs. They explain a bit about how the machine is put together, the naming convention
for nodes, and finally, what all the log files present in each dump are.
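Since each dump directory is named after its YYMMDDHHMM start time, the name can be converted into a timestamp when ordering or cross-referencing dumps. The following is a minimal sketch using Python's standard library; the helper name and the sample directory name are hypothetical, and the two-digit year is interpreted according to Python's `%y` rules.

```python
from datetime import datetime


def dump_start_time(dirname):
    """Parse a YYMMDDHHMM directory name into a datetime.

    Raises ValueError if the name does not match the expected format.
    """
    if len(dirname) != 10 or not dirname.isdigit():
        raise ValueError(f"not a YYMMDDHHMM name: {dirname!r}")
    return datetime.strptime(dirname, "%y%m%d%H%M")


# Hypothetical directory name, used for illustration only.
started = dump_start_time("0703151230")
```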
Downloads:
Data set 1
Data set 2
Data set 3
Data set 4
Data set 5
Data set 6
Papers using this data:
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We thank Forest Godfrey at Cray for making this data available.