# Alerting Guide
Configure alerts to get notified when issues arise.
## Overview
SkySignal alerts notify you when:
- Performance degrades beyond thresholds
- Error rates spike
- System resources are constrained
- Custom metric thresholds are crossed
## Creating Alerts

### Basic Alert Setup
1. Navigate to **Settings → Alerts**
2. Click **Create Alert**
3. Configure the alert:
   - **Name** - Descriptive name
   - **Metric** - What to monitor
   - **Condition** - When to trigger
   - **Threshold** - Value that triggers the alert
   - **Duration** - How long the condition must persist
## Alert Types

### Performance Alerts
```
Alert: Slow Methods
Metric: Method Response Time (P95)
Condition: Greater than
Threshold: 500ms
Duration: 5 minutes
```

```
Alert: High Error Rate
Metric: Error Rate
Condition: Greater than
Threshold: 5%
Duration: 2 minutes
```
### System Alerts
```
Alert: High Memory Usage
Metric: Memory Percentage
Condition: Greater than
Threshold: 85%
Duration: 10 minutes
```

```
Alert: Event Loop Lag
Metric: Event Loop Lag
Condition: Greater than
Threshold: 100ms
Duration: 5 minutes
```
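Event loop lag measures how late scheduled work runs when the loop is blocked by synchronous code. The sketch below shows one way to measure it in Python with asyncio; it is illustrative only, not SkySignal's implementation, and the 100ms threshold comes from the example above.

```python
import asyncio
import time

async def measure_event_loop_lag(interval: float = 0.05) -> float:
    """Sleep for `interval` seconds and report how much later than
    expected the wakeup arrived (the loop's lag), in milliseconds."""
    start = time.perf_counter()
    await asyncio.sleep(interval)
    elapsed = time.perf_counter() - start
    return max(0.0, (elapsed - interval) * 1000)

async def main() -> float:
    task = asyncio.ensure_future(measure_event_loop_lag())
    await asyncio.sleep(0)  # let the measurement task start its timer
    time.sleep(0.2)         # synchronous sleep: simulates a blocked loop
    return await task

lag_ms = asyncio.run(main())
breached = lag_ms > 100  # the threshold from the example above
```

In steady state the same coroutine would run on a schedule and feed each measurement into the alerting pipeline rather than a one-shot check.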
### Custom Metric Alerts
```
Alert: Low Order Rate
Metric: orders.created (Counter Rate)
Condition: Less than
Threshold: 10/hour
Duration: 30 minutes
```
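A counter-rate condition like this one counts events in a trailing window and extrapolates to an hourly rate. A small Python sketch of the arithmetic — the `hourly_rate` function and the sample data are hypothetical, not part of SkySignal:

```python
from datetime import datetime, timedelta
from typing import List

def hourly_rate(event_times: List[datetime], now: datetime,
                window: timedelta = timedelta(minutes=30)) -> float:
    """Count events inside the trailing window and scale to an hourly rate."""
    in_window = [t for t in event_times if now - window <= t <= now]
    hours = window.total_seconds() / 3600
    return len(in_window) / hours

now = datetime(2024, 1, 1, 12, 0)
# 3 orders in the last 30 minutes -> 6/hour, below the 10/hour threshold
events = [now - timedelta(minutes=m) for m in (5, 12, 25, 45, 50)]
rate = hourly_rate(events, now)
should_alert = rate < 10  # the "Less than" condition from the example
```

Matching the window to the alert's 30-minute duration is what keeps a brief lull from firing the alert immediately.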
## Notification Channels

### Email Notifications
Default notification method for all accounts:
1. Go to **Settings → Notifications**
2. Add team member emails
3. Configure notification preferences
Email notifications include:
- Alert name and severity
- Current value vs threshold
- Link to affected resource
- Quick investigation steps
## Alert Configuration

### Severity Levels
| Severity | Use Case | Notification |
|---|---|---|
| Critical | Service down, data loss | Immediate, all channels |
| Warning | Degraded performance | Email + Slack |
| Info | Informational | Email digest |
### Alert Conditions
| Condition | Description |
|---|---|
| Greater than | Value exceeds threshold |
| Less than | Value below threshold |
| Equals | Value matches exactly |
| Changes | Value changes from baseline |
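The threshold-based conditions map naturally onto small predicates, while "Changes" compares against a recorded baseline rather than a fixed threshold. A hedged Python sketch — the names are hypothetical, not SkySignal's API:

```python
from typing import Callable, Dict, Optional

# Threshold-based conditions from the table above.
CONDITIONS: Dict[str, Callable[[float, float], bool]] = {
    "greater_than": lambda value, threshold: value > threshold,
    "less_than":    lambda value, threshold: value < threshold,
    "equals":       lambda value, threshold: value == threshold,
}

def changes(value: float, baseline: Optional[float]) -> bool:
    """'Changes' fires when the current value differs from the baseline;
    with no baseline recorded yet, it never fires."""
    return baseline is not None and value != baseline
```

For example, `CONDITIONS["greater_than"](650, 500)` evaluates the "Slow Methods" check from the Performance Alerts section.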
### Duration Settings
Prevent alert fatigue with duration requirements:
- **Immediate** - Trigger on the first occurrence
- **1 minute** - Filters out brief transients
- **5 minutes** - Recommended for most alerts
- **15 minutes** - For gradual degradation
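The duration requirement behaves like a gate that only opens once a breach has persisted continuously; any sample back under the threshold resets the clock. A minimal sketch of that logic, assuming per-sample evaluation (the `DurationGate` class is hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DurationGate:
    """Fires only after the condition has held for `duration_s` seconds."""
    duration_s: float
    breach_started: Optional[float] = None

    def update(self, breached: bool, now_s: float) -> bool:
        if not breached:
            self.breach_started = None   # condition cleared: reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now_s  # first sample over the threshold
        return now_s - self.breach_started >= self.duration_s

gate = DurationGate(duration_s=300)  # the recommended 5-minute duration
```

A 4-minute spike that recovers never fires; a sustained breach fires exactly once the full duration has elapsed.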
## Alert Management

### Muting Alerts
Temporarily silence alerts:
```
Mute Duration: 1 hour | 4 hours | 24 hours | Custom
Reason: Planned maintenance (optional)
```
Muted alerts are still recorded in the alert history, but no notifications are sent.
### Alert History
View alert timeline:
- Trigger and resolution times
- Duration of incidents
- Notification delivery status
### Alert Escalation
Configure escalation policies:
1. **Initial** - Notify the primary channel
2. **After 15 min** - Notify the secondary channel
3. **After 30 min** - Page the on-call engineer
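The schedule above amounts to a list of (delay, channel) pairs checked against how long the alert has gone unacknowledged. A hypothetical Python sketch of that lookup:

```python
from typing import List, Tuple

# Hypothetical policy mirroring the steps above: (minutes of delay, channel).
POLICY: List[Tuple[int, str]] = [
    (0, "primary"),
    (15, "secondary"),
    (30, "on-call pager"),
]

def channels_due(minutes_unacknowledged: int) -> List[str]:
    """Every channel whose escalation delay has already elapsed."""
    return [channel for delay, channel in POLICY
            if minutes_unacknowledged >= delay]
```

At 20 minutes unacknowledged, both the primary and secondary channels have been notified but the on-call engineer has not yet been paged.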
## Best Practices

### 1. Start with Baselines
Before setting thresholds, observe normal values:
```
# First: Monitor for a week
# Then: Set the threshold at 2x the normal P95
Normal P95 Response Time: 150ms
Alert Threshold: 300ms
```
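The 2x-P95 rule is simple arithmetic once a week of samples is in hand. A sketch using the nearest-rank percentile method; the sample data is made up to match the 150ms example:

```python
import math
from typing import List

def p95(samples: List[float]) -> float:
    """95th percentile via the nearest-rank method on sorted samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered)) - 1  # 0-based index
    return ordered[rank]

# Hypothetical week of response times (ms): mostly ~100ms with a slow tail.
samples = [100.0] * 94 + [150.0] * 6
normal_p95 = p95(samples)
threshold = 2 * normal_p95  # the 2x-normal rule of thumb
```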
### 2. Avoid Alert Fatigue
```
# ❌ Bad: Too sensitive
Threshold: 100ms
Duration: Immediate

# ✅ Good: Meaningful alerts
Threshold: 300ms
Duration: 5 minutes
```
### 3. Use Composite Alerts
Combine conditions to reduce noise:
```
Alert: Service Degradation
Conditions:
  - Error Rate > 5% AND
  - Response Time P95 > 500ms
Duration: 5 minutes
```
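Evaluating the composite condition is a plain AND of the two single-metric checks. A minimal Python sketch; the function name and default thresholds are illustrative:

```python
def service_degraded(error_rate_pct: float, p95_ms: float,
                     error_threshold: float = 5.0,
                     latency_threshold_ms: float = 500.0) -> bool:
    """Both conditions must hold before the composite alert fires,
    which is what cuts the noise versus two independent alerts."""
    return error_rate_pct > error_threshold and p95_ms > latency_threshold_ms
```

With this gate, an error spike alone (say 6% with healthy latency) no longer pages anyone.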
### 4. Document Runbooks
Include response instructions in alert descriptions:
```markdown
## High Error Rate Alert

### Investigation Steps
1. Check recent deployments
2. Review error details in SkySignal
3. Check database connectivity
4. Review external service status

### Escalation
Contact: oncall@company.com
Slack: #incidents
```
### 5. Regular Review
Monthly alert hygiene:
- Disable alerts that never trigger
- Tighten thresholds on frequent alerts
- Add alerts for recent incidents
## Example Alert Configurations

### Production Essentials
```yaml
# Error Rate
- name: High Error Rate
  metric: error_rate
  condition: greater_than
  threshold: 5%
  duration: 2m
  severity: critical

# Response Time
- name: Slow Response Time
  metric: method_p95
  condition: greater_than
  threshold: 500ms
  duration: 5m
  severity: warning

# Memory
- name: High Memory Usage
  metric: memory_percentage
  condition: greater_than
  threshold: 85%
  duration: 10m
  severity: warning

# New Errors
- name: New Error Type
  metric: new_error
  condition: exists
  duration: immediate
  severity: info
```
### E-commerce Specific
```yaml
# Checkout failures
- name: Checkout Errors
  metric: method_errors
  filter: method:checkout.*
  condition: greater_than
  threshold: 1%
  duration: 2m
  severity: critical

# Order rate
- name: Low Order Rate
  metric: orders.created
  condition: less_than
  threshold: 10/hour
  duration: 30m
  severity: warning
```
## Troubleshooting

### Alert Not Triggering
- Verify metric is being collected
- Check threshold and duration settings
- Ensure alert is not muted
- Verify notification channel is configured
### Too Many Alerts
- Increase threshold values
- Extend duration requirements
- Use composite conditions
- Review and consolidate similar alerts
## Next Steps
- **Error Tracking** - Investigate triggered alerts
- **Performance Optimization** - Fix performance issues