Alerting Guide

Configure alerts to get notified when issues arise.

Overview

SkySignal alerts notify you when:

  • Performance degrades beyond thresholds
  • Error rates spike
  • System resources are constrained
  • Custom metric thresholds are crossed

Creating Alerts

Basic Alert Setup

  1. Navigate to Settings → Alerts
  2. Click Create Alert
  3. Configure the alert (see the example after this list):
    • Name - Descriptive name
    • Metric - What to monitor
    • Condition - When to trigger
    • Threshold - Value that triggers alert
    • Duration - How long condition must persist
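
For example, these fields map directly onto the config notation used in the Example Alert Configurations section at the end of this guide (the values here mirror the Slow Response Time example shown there):

- name: Slow Response Time
  metric: method_p95
  condition: greater_than
  threshold: 500ms
  duration: 5m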

Alert Types

Performance Alerts

Alert: Slow Methods
Metric: Method Response Time (P95)
Condition: Greater than
Threshold: 500ms
Duration: 5 minutes

Alert: High Error Rate
Metric: Error Rate
Condition: Greater than
Threshold: 5%
Duration: 2 minutes

System Alerts

Alert: High Memory Usage
Metric: Memory Percentage
Condition: Greater than
Threshold: 85%
Duration: 10 minutes

Alert: Event Loop Lag
Metric: Event Loop Lag
Condition: Greater than
Threshold: 100ms
Duration: 5 minutes

Custom Metric Alerts

Alert: Low Order Rate
Metric: orders.created (Counter Rate)
Condition: Less than
Threshold: 10/hour
Duration: 30 minutes

Notification Channels

Email Notifications

Email is the default notification method for all accounts:

  1. Go to Settings → Notifications
  2. Add team member emails
  3. Configure notification preferences

Email notifications include:

  • Alert name and severity
  • Current value vs threshold
  • Link to affected resource
  • Quick investigation steps
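
If you prefer to keep channel settings alongside alert config, the same preferences might be sketched like this (the notifications block and its field names are hypothetical, shown only to illustrate the shape, not a documented schema):

notifications:
  channels:
    - type: email
      recipients:
        - oncall@company.com      # example address; use your team's emails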

Alert Configuration

Severity Levels

Severity  | Use Case                 | Notification
Critical  | Service down, data loss  | Immediate, all channels
Warning   | Degraded performance     | Email + Slack
Info      | Informational            | Email digest

Alert Conditions

Condition     | Description
Greater than  | Value exceeds threshold
Less than     | Value below threshold
Equals        | Value matches exactly
Changes       | Value changes from baseline
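
In the config notation used later in this guide, these appear as condition keys. greater_than and less_than are used in the Example Alert Configurations section; equals and changes below are assumed to follow the same naming pattern:

condition: greater_than    # value exceeds threshold
condition: less_than       # value below threshold
condition: equals          # assumed key name; value matches exactly
condition: changes         # assumed key name; value changes from baseline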

Duration Settings

Prevent alert fatigue with duration requirements:

  • Immediate - Trigger on first occurrence
  • 1 minute - Filters out brief transients
  • 5 minutes - Recommended for most alerts
  • 15 minutes - For gradual degradation
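
In config form (matching the duration values used in the examples later in this guide), these look like:

duration: immediate   # trigger on first occurrence
duration: 1m          # filters out brief transients
duration: 5m          # recommended for most alerts
duration: 15m         # for gradual degradation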

Alert Management

Muting Alerts

Temporarily silence alerts:

Mute Duration: 1 hour | 4 hours | 24 hours | Custom
Reason: Planned maintenance (optional)

Muted alerts are still recorded, but no notifications are sent.
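
If you track mutes alongside alert definitions, a mute might be sketched like this (the mute block and its field names are assumptions for illustration; the UI flow above is the documented path):

- name: High Memory Usage
  mute:
    duration: 4h                   # assumed field; matches the "4 hours" option above
    reason: Planned maintenance    # optional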

Alert History

View alert timeline:

  • Trigger and resolution times
  • Duration of incidents
  • Notification delivery status

Alert Escalation

Configure escalation policies:

  1. Initial - Notify primary channel
  2. After 15 min - Notify secondary channel
  3. After 30 min - Page on-call engineer
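
Sketched in the same config notation, an escalation policy like the one above might look as follows (the escalation block and its field names are illustrative assumptions, not a documented schema):

- name: High Error Rate
  severity: critical
  escalation:
    - after: 0m
      notify: primary_channel      # e.g. email
    - after: 15m
      notify: secondary_channel    # e.g. Slack
    - after: 30m
      notify: on_call              # page the on-call engineer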

Best Practices

1. Start with Baselines

Before setting thresholds, observe normal values:

# First: Monitor for a week
# Then: Set threshold at 2x normal P95

Normal P95 Response Time: 150ms
Alert Threshold: 300ms

2. Avoid Alert Fatigue

# ❌ Bad: Too sensitive
Threshold: 100ms
Duration: Immediate

# ✅ Good: Meaningful alerts
Threshold: 300ms
Duration: 5 minutes

3. Use Composite Alerts

Combine conditions to reduce noise:

Alert: Service Degradation
Conditions:
- Error Rate > 5% AND
- Response Time P95 > 500ms
Duration: 5 minutes

4. Document Runbooks

Include response instructions in alert descriptions:

## High Error Rate Alert

### Investigation Steps
1. Check recent deployments
2. Review error details in SkySignal
3. Check database connectivity
4. Review external service status

### Escalation
Contact: oncall@company.com
Slack: #incidents

5. Regular Review

Monthly alert hygiene:

  • Disable alerts that never trigger
  • Tighten thresholds on frequent alerts
  • Add alerts for recent incidents

Example Alert Configurations

Production Essentials

# Error Rate
- name: High Error Rate
  metric: error_rate
  condition: greater_than
  threshold: 5%
  duration: 2m
  severity: critical

# Response Time
- name: Slow Response Time
  metric: method_p95
  condition: greater_than
  threshold: 500ms
  duration: 5m
  severity: warning

# Memory
- name: High Memory Usage
  metric: memory_percentage
  condition: greater_than
  threshold: 85%
  duration: 10m
  severity: warning

# New Errors
- name: New Error Type
  metric: new_error
  condition: exists
  duration: immediate
  severity: info

E-commerce Specific

# Checkout failures
- name: Checkout Errors
  metric: method_errors
  filter: method:checkout.*
  condition: greater_than
  threshold: 1%
  duration: 2m
  severity: critical

# Order rate
- name: Low Order Rate
  metric: orders.created
  condition: less_than
  threshold: 10/hour
  duration: 30m
  severity: warning

Troubleshooting

Alert Not Triggering

  1. Verify metric is being collected
  2. Check threshold and duration settings
  3. Ensure alert is not muted
  4. Verify notification channel is configured

Too Many Alerts

  1. Increase threshold values
  2. Extend duration requirements
  3. Use composite conditions
  4. Review and consolidate similar alerts

Next Steps