5.1 Uptime Guarantees
Uptime guarantees are the cornerstone of service level agreements. Understanding how to define, measure, and enforce availability commitments is essential for ensuring service reliability meets business requirements.
Defining Availability
Availability Levels and Business Impact
| Availability | Annual Downtime | Monthly Downtime | Typical Use Case |
|---|---|---|---|
| 99.0% (Two 9s) | 3.65 days | 7.31 hours | Non-critical internal applications |
| 99.5% | 1.83 days | 3.65 hours | Standard business applications |
| 99.9% (Three 9s) | 8.76 hours | 43.8 minutes | Important business services |
| 99.95% | 4.38 hours | 21.9 minutes | Critical applications |
| 99.99% (Four 9s) | 52.6 minutes | 4.38 minutes | Mission-critical systems |
| 99.999% (Five 9s) | 5.26 minutes | 26 seconds | High-availability infrastructure |
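The figures in the table follow from simple arithmetic: downtime budget = period length × (1 − availability). A minimal sketch that reproduces them is below. The function name `allowed_downtime` and the period constants are illustrative assumptions: a 365-day year for the annual column and an average month of 730.5 hours for the monthly column.

```python
from datetime import timedelta

# Assumed period lengths (not specified in the table itself).
HOURS_PER_YEAR = 365 * 24            # 8,760 hours
HOURS_PER_MONTH = 365.25 * 24 / 12   # 730.5 hours, an average month


def allowed_downtime(availability_pct: float, period_hours: float) -> timedelta:
    """Return the downtime budget for a period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return timedelta(hours=period_hours * (1 - availability_pct / 100))


for pct in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    yearly = allowed_downtime(pct, HOURS_PER_YEAR)
    monthly = allowed_downtime(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {yearly} per year, {monthly} per month")
```

Running the loop makes the non-linear effect of each extra "9" concrete: every additional nine cuts the downtime budget by a factor of ten.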
Measurement Methodology
How availability is measured significantly impacts SLA outcomes:
- Measurement period: Monthly is standard; annual may mask monthly volatility
- Measurement points: Define where availability is measured (user endpoint vs. data center)
- Monitoring tools: Specify approved tools and measurement frequency
- Partial outages: Define how degraded performance is calculated
Vendors can inflate reported availability by: (1) measuring at favorable points in the infrastructure, (2) excluding degraded-performance states, (3) using measurement intervals long enough to miss brief outages, and (4) defining "availability" narrowly. Require clear definitions that cover all service components and performance thresholds.
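To illustrate how much the definition of "available" matters, the sketch below scores the same set of monitoring probes two ways: once counting only hard failures, and once also counting slow responses as downtime. The `Probe` type, the `availability` function, the sample data, and the 2,000 ms threshold are all hypothetical choices for illustration, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Probe:
    ok: bool            # did the check succeed at all?
    latency_ms: float   # observed response time


def availability(probes: Sequence[Probe],
                 max_latency_ms: Optional[float] = None) -> float:
    """Percentage of probes counted as available.

    With max_latency_ms=None only hard failures count as downtime;
    with a threshold, responses slower than it count as downtime too.
    """
    if not probes:
        raise ValueError("no probes to evaluate")
    good = sum(
        1 for p in probes
        if p.ok and (max_latency_ms is None or p.latency_ms <= max_latency_ms)
    )
    return 100 * good / len(probes)


# Hypothetical sample: 95 healthy probes, 4 very slow ones, 1 hard failure.
samples = [Probe(True, 120)] * 95 + [Probe(True, 4000)] * 4 + [Probe(False, 0)]

print(availability(samples))         # narrow definition: hard failures only
print(availability(samples, 2000))   # degraded responses count as downtime
```

With this sample the narrow definition reports 99.0% while the threshold-based definition reports 95.0%, which is why the performance threshold belongs in the contract text.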
Exclusions from Availability Calculation
Standard SLAs exclude certain events from downtime calculations. Negotiate exclusion limits:
| Exclusion Type | Vendor Position | Customer Negotiation Target |
|---|---|---|
| Scheduled Maintenance | Unlimited exclusion | Cap hours/month; require off-peak scheduling |
| Force Majeure | Broad definition | Narrow to truly unforeseeable events |
| Customer-caused issues | All customer actions | Only issues from documented customer fault |
| Third-party services | Full exclusion | Include critical dependencies in SLA |
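The effect of capping the scheduled-maintenance exclusion can be sketched numerically. The example assumes one common convention: excluded time is removed from both downtime and the measurement period, and maintenance beyond the cap counts as ordinary downtime. The function name, parameters, and figures are illustrative; an actual SLA should state its own formula.

```python
def measured_availability(
    period_minutes: float,
    outage_minutes: float,
    maintenance_minutes: float,
    maintenance_cap_minutes: float,
) -> float:
    """Availability where only maintenance up to the cap is excluded.

    Assumption: excluded minutes are removed from the measurement period,
    and maintenance beyond the cap is treated as downtime.
    """
    excluded = min(maintenance_minutes, maintenance_cap_minutes)
    over_cap = maintenance_minutes - excluded
    measured_period = period_minutes - excluded
    downtime = outage_minutes + over_cap
    return 100 * (measured_period - downtime) / measured_period


MONTH = 30 * 24 * 60  # a 30-day month, in minutes

# 30 minutes of unplanned outage plus 10 hours of maintenance.
unlimited = measured_availability(MONTH, 30, 600, float("inf"))
capped = measured_availability(MONTH, 30, 600, 240)  # 4-hour cap
print(f"unlimited exclusion: {unlimited:.2f}%")
print(f"4-hour cap:          {capped:.2f}%")
```

Under these assumptions the same month scores about 99.93% with an unlimited exclusion and about 99.09% with a 4-hour cap, so the cap can decide whether a credit is owed.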
5.2 Response Time Metrics
Response time metrics define how quickly providers must respond to and resolve incidents. Well-structured response metrics ensure timely issue resolution while aligning urgency levels with business impact.
Incident Priority Classification
| Priority | Definition | Business Impact |
|---|---|---|
| P1 - Critical | Complete service outage or major function unavailable | Business operations halted; revenue impact |
| P2 - High | Significant degradation; workaround available | Major productivity impact; customer-facing issues |
| P3 - Medium | Partial impact; acceptable workaround exists | Moderate productivity impact; non-critical functions |
| P4 - Low | Minor issue or enhancement request | Minimal impact; cosmetic or convenience issues |
Response and Resolution Targets
| Priority | Response Target | Resolution Target | Escalation |
|---|---|---|---|
| P1 - Critical | 15 minutes | 4 hours | Immediate to management |
| P2 - High | 30 minutes | 8 hours | 2 hours to management |
| P3 - Medium | 4 hours | 24 hours | 8 hours to senior support |
| P4 - Low | 8 hours | 72 hours | 24 hours to senior support |
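The targets in the table can be encoded as data and checked mechanically against incident timestamps. The sketch below assumes a 24x7 clock (no pauses) and uses hypothetical names (`TARGETS`, `check_incident`); it is one way to express the check, not a prescribed implementation.

```python
from datetime import datetime, timedelta

# (response target, resolution target) per priority, from the table above.
TARGETS = {
    "P1": (timedelta(minutes=15), timedelta(hours=4)),
    "P2": (timedelta(minutes=30), timedelta(hours=8)),
    "P3": (timedelta(hours=4), timedelta(hours=24)),
    "P4": (timedelta(hours=8), timedelta(hours=72)),
}


def check_incident(priority: str, opened: datetime,
                   responded: datetime, resolved: datetime) -> dict:
    """Compare elapsed wall-clock time against the targets for a priority."""
    response_target, resolution_target = TARGETS[priority]
    return {
        "response_met": responded - opened <= response_target,
        "resolution_met": resolved - opened <= resolution_target,
    }


# Hypothetical P1 incident: answered in 10 minutes, resolved after 5 hours.
result = check_incident(
    "P1",
    opened=datetime(2024, 3, 1, 9, 0),
    responded=datetime(2024, 3, 1, 9, 10),
    resolved=datetime(2024, 3, 1, 14, 0),
)
print(result)
```

Here the response target is met but the 4-hour resolution target is missed, which is the kind of split result a monthly incident report should surface.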
24x7 vs. Business Hours Coverage
Support coverage hours significantly impact effective service levels:
- 24x7x365: Round-the-clock support including holidays; essential for critical systems
- 24x5 (Business Days): Round-the-clock support on weekdays; reduced weekend coverage
- Business Hours: Standard working hours (e.g., 9 AM - 6 PM IST); the SLA clock pauses outside these hours
- Follow-the-Sun: Global support with handoffs between time zones
Match coverage to business needs: (1) P1/P2 incidents on business-critical systems should always have 24x7 coverage, (2) clarify clock-stopping rules for customer dependencies, (3) define holiday coverage explicitly, (4) ensure escalation contacts are available during off-hours coverage.
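The practical effect of a clock that pauses outside business hours can be seen by comparing wall-clock time with SLA-clock time. The sketch below assumes a Monday to Friday, 9 AM to 6 PM window, naive datetimes in a single time zone, and no holiday calendar; `business_elapsed` is a hypothetical helper name.

```python
from datetime import datetime, time, timedelta


def business_elapsed(start: datetime, end: datetime,
                     day_start: time = time(9),
                     day_end: time = time(18)) -> timedelta:
    """Time between start and end that falls inside weekday business hours.

    Assumptions: naive datetimes in one time zone, Monday-Friday working
    week, holidays ignored.
    """
    if end <= start:
        return timedelta(0)
    total = timedelta(0)
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            window_open = datetime.combine(day, day_start)
            window_close = datetime.combine(day, day_end)
            overlap_start = max(start, window_open)
            overlap_end = min(end, window_close)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total


opened = datetime(2024, 3, 1, 17, 0)    # Friday, 5 PM
resolved = datetime(2024, 3, 4, 10, 0)  # Monday, 10 AM
print("wall clock:", resolved - opened)
print("SLA clock: ", business_elapsed(opened, resolved))
```

An incident that was open for 65 wall-clock hours consumes only 2 hours on a business-hours SLA clock, so an 8-hour resolution target would still be reported as met.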
5.3 Measurement and Reporting
Objective measurement and transparent reporting are essential for SLA enforcement. This section covers measurement methodologies, reporting requirements, and dispute resolution mechanisms.
Measurement Principles
- Objectivity: Metrics must be objectively measurable, not subjective assessments
- Independence: Prefer independent measurement or customer verification rights
- Granularity: Measurement frequency must capture brief outages
- Transparency: Methodology and raw data available for customer review
- Auditability: Measurement systems subject to audit
Reporting Requirements
| Report Type | Frequency | Contents |
|---|---|---|
| Availability Report | Monthly | Uptime percentage, outage list, exclusions claimed |
| Incident Report | Monthly | Incident count by priority, response/resolution times |
| Performance Dashboard | Real-time | Current status, recent incidents, trending |
| Executive Summary | Quarterly | Trends, improvements, issues, credit summary |
| Root Cause Analysis | Per P1/P2 | Cause, impact, remediation, prevention |
Dispute Resolution
Establish clear procedures for SLA measurement disputes:
- Initial review: Customer raises dispute within 30 days of report
- Data exchange: Provider shares raw measurement data
- Technical review: Joint technical team evaluates dispute
- Management escalation: Unresolved disputes escalate to account managers
- Independent arbiter: For material disputes, engage an agreed third party
Protect measurement integrity: (1) Deploy independent monitoring tools, (2) Require provider to retain raw data for 12 months, (3) Include audit rights for measurement systems, (4) Define specific dispute timelines to prevent stale claims.
5.4 Penalty Structures
Penalty structures create financial incentives for SLA compliance. Well-designed remedies balance meaningful consequences with commercial viability while providing escalating pressure for chronic underperformance.
Service Credit Models
Standard Service Credit Structure
| Availability Achieved | Standard Credit | Enhanced Credit (Negotiated) |
|---|---|---|
| 99.0% to below 99.9% | 10% of monthly fee | 15-25% |
| 95.0% to below 99.0% | 25% of monthly fee | 30-50% |
| 90.0% to below 95.0% | 50% of monthly fee | 75-100% |
| Below 90.0% | 100% of monthly fee | 100% + termination right |
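The standard credit column can be read as a tier lookup. The sketch below makes two assumptions that the table leaves implicit: the committed target is 99.9% (no credit at or above it), and each tier includes its lower bound. The function names are hypothetical.

```python
def standard_credit_pct(availability_pct: float) -> int:
    """Standard credit tier, as a percentage of the monthly fee.

    Assumptions: 99.9% is the committed target, and each tier's lower
    bound is inclusive (e.g. exactly 99.0% earns the 10% tier).
    """
    if availability_pct >= 99.9:
        return 0
    if availability_pct >= 99.0:
        return 10
    if availability_pct >= 95.0:
        return 25
    if availability_pct >= 90.0:
        return 50
    return 100


def credit_amount(monthly_fee: float, availability_pct: float) -> float:
    """Credit owed for the month under the standard structure."""
    return monthly_fee * standard_credit_pct(availability_pct) / 100


print(credit_amount(10_000, 99.5))   # 10% tier
print(credit_amount(10_000, 98.2))   # 25% tier
```

Writing the tiers out this way also exposes boundary questions, such as which tier applies at exactly 99.0%, that are worth settling in the contract language.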
Beyond Service Credits
Service credits alone may be insufficient. Negotiate additional remedies:
- Termination rights: Right to terminate for chronic SLA failures (e.g., 3 consecutive months)
- Root cause requirements: Mandatory RCA for significant outages
- Remediation plans: Provider must submit improvement plans after failures
- Staffing additions: Provider adds resources for chronic issues
- Fee reductions: Permanent fee reduction for sustained underperformance
- Liability carve-outs: Damages arising from SLA failures may be exempted from liability caps
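A chronic-failure trigger such as "3 consecutive months below target" is straightforward to test against the monthly availability history. The sketch below uses a hypothetical function name and sample figures, and assumes the trigger is based on strictly consecutive misses.

```python
from typing import Iterable


def has_chronic_failure(monthly_availability: Iterable[float],
                        target_pct: float,
                        consecutive_months: int = 3) -> bool:
    """True if availability fell below target for N months in a row."""
    streak = 0
    for value in monthly_availability:
        streak = streak + 1 if value < target_pct else 0
        if streak >= consecutive_months:
            return True
    return False


# Hypothetical history against a 99.9% target.
history = [99.95, 99.80, 99.70, 99.85, 99.95]
print(has_chronic_failure(history, 99.9))
```

Some contracts use a rolling window instead (for example, any 3 misses in 6 months), which is harder for a provider to reset with a single good month.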
Service credits are often defined as the "sole and exclusive remedy" for SLA breaches, which means credits are the only compensation available regardless of actual damages. Negotiate: (1) credits as a minimum remedy, not an exclusive one, (2) preserved termination rights, (3) a carve-out from liability caps for gross negligence causing outages.
Credit Caps and Claim Procedures
Standard SLAs cap credit exposure. Key provisions include:
- Caps: Negotiate higher caps (50-100% vs. standard 25-30%)
- Claim window: Minimum 60 days to claim credits after report
- Automatic credits: Require automatic credit application without claim filing
- Credit form: Cash refund option in addition to invoice credit
- Aggregation: Allow multiple SLA failures in one month to aggregate
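How the cap and aggregation provisions interact can be shown with a short calculation. The sketch assumes that credits for separate SLA failures in one month are summed and that the cap is expressed as a percentage of the monthly fee; the function name and the figures are illustrative.

```python
from typing import Iterable


def monthly_credit(monthly_fee: float,
                   credit_pcts: Iterable[float],
                   cap_pct: float) -> float:
    """Aggregate the month's credits, then apply the cap.

    Assumptions: credits are summed across failures, and cap_pct is a
    percentage of the monthly fee.
    """
    uncapped = sum(credit_pcts)
    applied = min(uncapped, cap_pct)
    return monthly_fee * applied / 100


# Hypothetical month: a 10% availability credit and a 25% response-time credit.
print(monthly_credit(10_000, [10, 25], cap_pct=30))    # standard-style cap
print(monthly_credit(10_000, [10, 25], cap_pct=100))   # negotiated cap
```

With a 30% cap the customer receives 3,000 on a 10,000 fee; with a 100% cap the full 3,500 applies, which shows why aggregation rights are worth little without a higher cap.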
Key Takeaways
- Uptime percentages have dramatically different real-world downtime impacts
- Response and resolution times should align with incident priority and business impact
- Measurement methodology and exclusions significantly affect effective SLA levels
- Service credits alone are often insufficient; negotiate termination rights and additional remedies
- Automatic credit application prevents administrative credit losses