About the data:
This data was collected to provide failure specifics for I/O-related systems and components in as much detail as possible, so that
analysis might produce useful findings. It covers storage, networking, computational machines, and file systems
in production use at NERSC during 2001-2006. The data was extracted from Remedy, a database used for tracking system problems,
and is currently stored in a MySQL database and available for export to Excel format. Some basic
query and graphing capabilities are also available. For more information on the data, please visit the
NERSC web site hosting the raw data or contact the PDSI researcher at NERSC,
Akbar Mokhtarani, or the Principal Investigator for PDSI at
NERSC, Bill Kramer.
Downloads:
The data and more information are available for download here.
Papers using this data:
This data has not yet been reported on in any paper.
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank Bill Kramer and Akbar Mokhtarani from NERSC for collecting the data and sharing it.
If you use these data in your work, please use a similar acknowledgment.
The COM1 data
| System: | Internet services clusters |
| Duration: | May 2006 |
| Data Type: | Hardware replacement log |
About the data:
COM1 is a log of hardware failures recorded by an internet service provider, drawing on multiple distributed sites.
Each record in the data
contains a timestamp of when the failure was repaired,
information on the failure symptoms, and a list of steps that were
taken to diagnose and repair the problem. The data does
not contain information on when each failure actually happened, only
when repair took place. The data covers a population of 26,734 10K rpm
SCSI disk drives. The total number of servers in the monitored sites is not known.
Downloads:
The data will soon become available for download.
Papers using this data:
A first analysis of the COM1 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson.
"Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?".
5th USENIX Conference on File and Storage Technologies (FAST 2007).
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank the people at the organization that provided us with the data, and that would
like to remain unnamed, for collecting the data and helping us to interpret it.
The COM2 data
| System: | Internet services cluster |
| Duration: | September 2004 through April 2006 |
| Data Type: | Warranty service log of hardware failures |
About the data:
COM2 is a warranty service log of hardware failures recorded on behalf of an internet service provider
aggregating events in multiple distributed sites.
Each failure record contains a repair code (e.g., "Replace hard drive") and the time when
the repair was finished. As with COM1, there is no information on the start time of each failure.
The log does not contain entries for failed disks that were replaced at the customer site by hot-swapping in a spare disk,
since the data was created by warranty processing, which does not participate in on-site hot-swap replacements.
To account for the missing disk replacements we obtained numbers for the periodic replenishments of on-site spare disks
from the internet service provider.
The size of the underlying system changed significantly
during the measurement period, starting with
420 servers in 2004 and ending with 9,232 servers in 2006. We obtained quarterly hardware purchase records
covering this time period that make it possible to estimate the size of the disk population.
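As a rough illustration of how the purchase records and the spare replenishment counts could be combined, the sketch below accumulates an estimated disk population quarter by quarter and divides the total number of replacements by the resulting disk-years of exposure. The record layout, the field names, and all numbers are hypothetical and are not taken from the COM2 data. The sketch also assumes that every disk purchased in a quarter is in service for that whole quarter and that no disks are retired.

```python
from dataclasses import dataclass


@dataclass
class Quarter:
    """One quarter of (hypothetical) input data."""
    label: str
    disks_purchased: int       # assumed to come from purchase records
    logged_replacements: int   # disk replacements found in the warranty log
    spare_replenishments: int  # on-site spares restocked (hot-swap replacements)


def annual_replacement_rate(quarters):
    """Return replacements per disk-year over the given quarters."""
    population = 0      # estimated disks in service
    disk_years = 0.0    # accumulated exposure
    replacements = 0
    for q in quarters:
        population += q.disks_purchased
        disk_years += population * 0.25  # one quarter = 0.25 years
        replacements += q.logged_replacements + q.spare_replenishments
    if disk_years == 0:
        raise ValueError("no disk exposure in the given quarters")
    return replacements / disk_years


# Hypothetical example values, for illustration only.
example = [
    Quarter("Q1", disks_purchased=1000, logged_replacements=5, spare_replenishments=5),
    Quarter("Q2", disks_purchased=1000, logged_replacements=10, spare_replenishments=10),
]
rate = annual_replacement_rate(example)
print(f"Estimated annual replacement rate: {rate:.2%}")
```

Counting spare replenishments alongside logged replacements reflects the fact that hot-swapped disks never appear in the warranty log; leaving them out would understate the rate.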
Downloads:
The data will soon become available for download.
Papers using this data:
A first analysis of the COM2 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson.
"Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?".
5th USENIX Conference on File and Storage Technologies (FAST 2007).
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank the people at the organization that provided us with the data, and that would
like to remain unnamed, for collecting the data and helping us to interpret it.
The COM3 data
| System: | Internet services clusters |
| Duration: | January 2005 through December 2005 |
| Data Type: | Aggregate hard drive replacement statistics |
About the data:
The COM3 data set comes from a large external storage system used by an internet service provider and
comprises four populations of different types of FC disks.
While this data was gathered in 2005, the system has some legacy
components that date back as far as 1998 and are known to have been physically moved after initial installation.
COM3 differs from the other data sets in that it
provides only aggregate statistics of disk failures, rather than individual records
for each failure. The data contains the counts of disks that failed
and were replaced in 2005 for each of the four disk populations.
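Because COM3 provides only yearly counts, the natural summary is a per-population replacement rate. The sketch below shows that arithmetic under stated assumptions: the population names, population sizes, and replacement counts are hypothetical placeholders, not values from the COM3 data, and each population size is assumed to be constant over the year.

```python
def replacement_rate(replaced: int, population: int, years: float = 1.0) -> float:
    """Return replacements per disk-year for one disk population."""
    if population <= 0 or years <= 0:
        raise ValueError("population and years must be positive")
    return replaced / (population * years)


# Hypothetical populations: name -> (disks installed, disks replaced in the year)
populations = {
    "population_a": (1000, 20),
    "population_b": (500, 5),
}

rates = {
    name: replacement_rate(replaced, installed)
    for name, (installed, replaced) in populations.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.2%} replaced per year")
```

With aggregate counts like these, only rates can be compared across populations; questions about time between failures need the per-record data available in the other data sets.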
Downloads:
The data will soon become available for download.
Papers using this data:
A first analysis of the COM3 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson.
"Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?".
5th USENIX Conference on File and Storage Technologies (FAST 2007).
If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.
Acknowledgments:
We would like to thank the people at the organization that provided us with the data, and that would
like to remain unnamed, for collecting the data and helping us to interpret it.