Before Terabytes Fall: Disk reliability in Windows Vista and beyond
Windows Storage Devices Strategic pillars
Session Outline
What Matters Most To Our Users?
The Answer Is…
Protecting Data:  Windows Vista disk diagnostics
Quantifying Disk Failures
Disk Failure Mitigations
Windows Vista Disk Diagnostics
Disk Diagnostics Details
Startup Repair/Windows Recovery Environment
Corrupted File Recovery
Windows Vista Disk Diagnostics
Opportunities For Future Technology
Future Technology: Protecting User Data And Preventing Hard Drive Failure Proactively
What Is PRCS?
Why Is PRCS Important?
Goals Of PRCS
PRCS Features
PRCS Advantages
Proactive Disk Diagnostics
HDD Reliability 101
Reliability Versus Temperature
Performance Versus Vibration
Reliability Versus Shock
Reliability Design Guidelines
PRCS
Call To Action
Additional Resources

1. Before Terabytes Fall: Disk reliability in Windows Vista and beyond

Before Terabytes Fall
Disk reliability in Windows Vista
and beyond
Frank Shu
Program Manager
WDEG-Storage
Microsoft Corporation
Matthew Kerner
Program Manager
Windows Diagnosis
Microsoft Corporation

2. Windows Storage Devices Strategic pillars

Storage Fabrics
Server/Enterprise
Personal Storage
Client/Consumer
Optical Platform
Client/Consumer
Preferred
Storage Platform
Partner/Customer
Leading platform enabling
storage fabric adoption
Optimized platform features
enabling your Windows
experience, here and now
Timely, comprehensive, quality
platform support for optical devices
Preferred platform for developing,
deploying, and using
storage devices

3. Session Outline

Introduction (Frank Shu)
Windows Vista Disk Diagnostics
(Matthew Kerner)
Future Technology (Frank Shu)
Demo (Microsoft and Samsung)

4. What Matters Most To Our Users?

What Matters Most
To Our Users?
A consumer bought a new computer and it
works great at work and at home. She
couldn’t do her everyday tasks without it.
What matters most to her?
a) CPU power
b) Network connection
c) Battery life
d) Something else…

5. The Answer Is…

The Data

6. Protecting Data: Windows Vista disk diagnostics

Protecting Data:
Windows Vista disk diagnostics
Matthew Kerner

7. Quantifying Disk Failures

Catastrophic disk failures
~200 disks replaced per week at Microsoft
in 2003
Top driver of Microsoft support’s hardware-related support calls in both client and server
Based on Microsoft figures, disk failures cost
many millions of dollars per year in enterprises
Localized failures (bad blocks)
Kernel and user-mode crashes
1.7% of customer-reported Microsoft Online Crash
Analysis crashes are due to disk errors
Application hangs during read recovery

8. Disk Failure Mitigations

Prevention
Hybrid hard disks (mobile systems)
RAID
Catastrophic failure recovery
Data backup
Disk replacement
Localized failure recovery
Repair from redundant copy
Restore from backup

9. Windows Vista Disk Diagnostics

Windows Vista
Disk Diagnostics
Purpose: Save user data before
catastrophic disk failure
Client SKUs
Self-Monitoring, Analysis and Reporting Technology
(S.M.A.R.T.) polling triggers diagnostic
Uses S.M.A.R.T. trip status – no
threshold/attribute comparison
Warns user of impending failure and walks
them through backup and replacement
Windows Vista backup improvements

10. Disk Diagnostics Details

Disk class driver polls S.M.A.R.T. status hourly
as it has done since Windows 2000
Based on industry feedback, no use of Disk
Self-Test or attribute comparison
Failure triggers user-mode code
Filter out duplicate failures
Log SMART READ LOG details to OS event log
Device error count from summary error log sector
Life timestamp from most recent error log entry
Trigger user-context interactive resolution
Customizable by Group Policy
Print instructions, walk user through backup
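
The flow on this slide (poll the trip status, filter duplicates, log the error details, then start the interactive resolution) can be sketched roughly as follows. This is an illustration only, not the Windows implementation: the class names, field names, and log format are hypothetical, and the S.M.A.R.T. status is passed in as plain data rather than read from a drive.

```python
from dataclasses import dataclass, field


@dataclass
class SmartStatus:
    """Hypothetical snapshot of one drive's S.M.A.R.T. state."""
    disk_id: str
    tripped: bool            # the drive's own pass/fail verdict
    device_error_count: int  # from the summary error log sector
    life_timestamp: int      # from the most recent error log entry


@dataclass
class DiskDiagnostic:
    """Illustrative user-mode handler (assumed design, not the real one)."""
    reported: set = field(default_factory=set)
    event_log: list = field(default_factory=list)

    def on_poll(self, status: SmartStatus) -> bool:
        """Handle one hourly poll. Returns True if the user should be warned."""
        if not status.tripped:
            return False  # trip status only; no threshold/attribute comparison
        if status.disk_id in self.reported:
            return False  # filter out duplicate failures
        self.reported.add(status.disk_id)
        self.event_log.append(
            f"{status.disk_id}: errors={status.device_error_count} "
            f"life_timestamp={status.life_timestamp}"
        )
        return True  # caller would start the interactive backup walkthrough


diag = DiskDiagnostic()
failing = SmartStatus("disk0", tripped=True, device_error_count=12, life_timestamp=8760)
print(diag.on_poll(failing))  # True: first report of this failure
print(diag.on_poll(failing))  # False: duplicate on the next poll
```

Keeping the duplicate filter in front of the logging step means a drive that stays tripped produces one event log entry and one warning, rather than one per hourly poll.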

11. Startup Repair/Windows Recovery Environment

Purpose: Recover from non-bootable
states, including those caused by
disk failures
Automatic failover on boot failure
to recovery partition
Optionally deployed by OEM
Available on installation media
Hands-free diagnosis and repair
of top non-boot issues

12. Corrupted File Recovery

Purpose: Turn repeat user-mode crashes
caused by corrupted system binaries into
one-time crash with silent repair
from cache
Windows Error Reporting crash handler
triggers diagnostic on inpage error
crashes due to bad blocks
Diagnoses corrupted system files
Silent repair from System File Cache
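
A minimal sketch of the repair-from-cache idea follows. The slide does not say how a corrupted file is diagnosed, so comparing a hash of the live file against a known-good cached copy is an assumption made here for illustration; the function names and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repair_if_corrupted(target: Path, cached_copy: Path) -> bool:
    """Restore target from cached_copy when their contents differ.

    Returns True when a repair was performed, False when the file was intact.
    """
    if sha256_of(target) == sha256_of(cached_copy):
        return False
    shutil.copyfile(cached_copy, target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache.bin"
    live = Path(tmp) / "live.bin"
    cache.write_bytes(b"known good binary")
    live.write_bytes(b"known g\x00\x00d binary")  # simulate a bad block
    print(repair_if_corrupted(live, cache))  # True: file was repaired
    print(repair_if_corrupted(live, cache))  # False: nothing left to repair
```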

13. Windows Vista Disk Diagnostics

Windows Vista
Disk Diagnostics
Matthew Kerner
Program Manager
Windows Diagnosis

14. Opportunities For Future Technology

Opportunities For
Future Technology
Proactive failure prevention
Reduce scenario pain by enabling
resolutions other than just data recovery
Requires finer-grained failure description
to help host choose the best resolution
Increase warning time before failures
to allow users to save data

15. Future Technology: Protecting User Data And Preventing Hard Drive Failure Proactively

Future Technology:
Protecting User Data
And Preventing Hard
Drive Failure Proactively
Frank Shu

16. What Is PRCS?

Proactive Reporting and Correcting
Safeguard (PRCS) enables a device and
host to correct failure conditions proactively
Device can report hostile conditions before
damage or failure occurs
Host reacts to a device event in real time
based on policy and user preference
A proposal for the PRCS protocol has
been submitted to T13

17. Why Is PRCS Important?

User’s digital data is more valuable than
ever before
Disk drive capacity continues to increase
Not every PC user can afford RAID
Deliver on opportunities for improvements
beyond S.M.A.R.T.

18. Goals Of PRCS

Proactively protect user data
Improve the user experience
when data is at risk
Reduce OEM’s customer support costs
Reduce warranty costs for disk
drive vendors

19. PRCS Features

Device monitors its own conditions
in real time
Reduce host monitoring performance impact
Device sends meaningful PRCS events to
the host for correction of hostile conditions
and data protection
No translations or guesses required
Host acts on device’s PRCS event
proactively according to policy and
user preference
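
One way to picture "host acts on the event according to policy and user preference" is a small lookup in which the user's choice overrides the default policy. The condition names below are taken from the operating conditions discussed in this session; the action names and the function itself are hypothetical and are not part of the PRCS proposal.

```python
from enum import Enum


class Condition(Enum):
    # Hypothetical event names; the PRCS proposal's actual events are not listed here.
    HIGH_TEMPERATURE = "high_temperature"
    EXCESSIVE_VIBRATION = "excessive_vibration"
    SHOCK = "shock"


# Assumed default policy mapping each reported condition to a host action.
DEFAULT_POLICY = {
    Condition.HIGH_TEMPERATURE: "throttle_io",
    Condition.EXCESSIVE_VIBRATION: "defer_background_io",
    Condition.SHOCK: "notify_user",
}


def choose_action(condition, user_preferences=None, policy=DEFAULT_POLICY):
    """Pick the host action: user preference first, then policy, then notify."""
    prefs = user_preferences or {}
    return prefs.get(condition, policy.get(condition, "notify_user"))


print(choose_action(Condition.HIGH_TEMPERATURE))                      # throttle_io
print(choose_action(Condition.SHOCK, {Condition.SHOCK: "start_backup"}))  # start_backup
```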

20. PRCS Advantages

PRCS is proactive
Taking a corrective action before errors occur
Protecting data when it is at risk
PRCS is designed for end users, not just
computer experts
No need to understand a cryptic message to
benefit from PRCS. For example: “The previous
self-test completed having the electrical element
of the test failed”
PRCS enables transparent mitigation of a hostile
condition or a recovery process
Users do not need to configure a self-test mode or
reporting method
Users control policy as desired

21. Proactive Disk Diagnostics

Proactive
Disk Diagnostics
Debasis Baral
Vice President of Engineering
Samsung

22. HDD Reliability 101

HDD reliability and performance
is negatively impacted by extremes
in the following operating conditions
Temperature
Demo
Vibration
Demo
Shock
Demo
Duty cycle
Altitude
Humidity
A combination of the above conditions
A history of the above combinations

23. Reliability Versus Temperature

HDD life decreases with temperature
Failure rates increase exponentially with temperature
for all HDD suppliers
Environmental temperature increase from 25 °C to 100 °C
could translate into a 10x to 50x shorter life
Ref.: Samsung reliability tests
Samsung HDD Lab Engineering Sample Data
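
To get a feel for these numbers, the 10x to 50x range over 75 degrees can be converted into a per-10-degree factor. This assumes a single exponential holds across the whole 25 to 100 degree range, which the slide implies but does not quantify; only the two endpoints come from the source, and the helper function is illustrative.

```python
def per_step_factor(total_factor: float, span_c: float, step_c: float = 10.0) -> float:
    """Life-shortening factor per step_c degrees, given the factor over span_c degrees.

    Under an exponential model, factors multiply, so the per-step factor is
    total_factor raised to the fraction of the span that one step covers.
    """
    return total_factor ** (step_c / span_c)


span = 100 - 25  # degrees C, from the slide
for total in (10, 50):
    print(total, round(per_step_factor(total, span), 2))
# 10 1.36
# 50 1.68
```

Under that assumption, every additional 10 degrees shortens expected life by roughly 1.4x to 1.7x.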

24. Performance Versus Vibration

Performance Versus Vibration
Data throughput or drive performance can be
significantly affected in the presence of vibration
Effect of vibration is reversible
Cumulative effects of vibration on long-term drive
reliability are a subject of ongoing research
[Figure: Performance Loss With Vibration. Throughput (MB/s) and off-track (% of track pitch) plotted against vibration level (arbitrary units). Source: Samsung HDD Lab Engineering Sample Data]

25. Reliability Versus Shock

Excessive shock is the major
cause of failure in both PC
and consumer electronics
environments
[Figure: Shock Modeling. Panels: operating shock damage (scratches; damage by corners, leading edge, and side edges of the slider) and non-operating shock damage. Courtesy: E. Jayson and Frank Talke, UC San Diego]

26. Reliability Design Guidelines

Failure modes and failure rates
of disk drives depend on their
operating environments
Temperature and Handling
(shock and vibration) are major factors
impacting HDD reliability
HDD reliability will be enhanced if OS
detects and manages reliability risks
and stress events intelligently (PRCS)
Users can improve HDD data reliability
by correctly responding to PRCS events

27. PRCS

Kai Chen
Microsoft Corporation
Debasis Baral
Samsung

28. Call To Action

Test your drives with Windows Vista Disk
Diagnostics and send feedback
Ensure your drives comply with ATA-7
specs to surface device error count and
life timestamp
Engage with the Startup Repair team to
build a plan for Startup Repair in OEM
factory processes
Participate in T13 discussions on PRCS
Plan your device designs in line with
PRCS guidelines

29. Additional Resources

Whitepapers
Windows Recovery Environment/Startup
Repair/Built-in Diagnostics:
http://www.microsoft.com/technet/windowsvista/evaluate/feat/relperf.mspx
Feedback/Questions
Windows Vista Disk Diagnosis:
Dfdfeed @ microsoft.com
Corrupt File Recovery: Dfdfeed @ microsoft.com
Windows Recovery Environment/Startup Repair:
Recovery @ microsoft.com
PRCS: Prcsdisc @ microsoft.com

30.

© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.