We classify three types of production issues:
outages - An outage is any user-impacting disruption of service.
bugs - Bugs are not always outages. A bug generally affects a single version of the codebase.
critical bugs - Bugs labeled priority::high, priority::higher, or priority::highest.
Examples of outages include:
meltano.com website is down.
hub.meltano.com website is down.
discovery.yml web endpoint is down.
pipx install meltano is failing for any reason, including upstream package dependency breakages, PyPI outages, etc.
(urgency::highest)
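The website outage examples above could be checked with a small availability probe. The following is only a minimal sketch, not a description of Meltano's actual monitoring: the hostnames come from the examples, while the https:// scheme and the names `http_probe` and `find_down_sites` are assumptions made for illustration.

```python
import urllib.request

# Hostnames come from the outage examples; the https:// scheme is an assumption.
SITES = ["https://meltano.com", "https://hub.meltano.com"]


def http_probe(url, timeout=10.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), and timeouts are all OSError subclasses.
        return False


def find_down_sites(urls, probe=http_probe):
    """Return the subset of urls for which the probe reports a failure."""
    return [url for url in urls if not probe(url)]
```

Calling `find_down_sites(SITES)` performs real HTTP requests. Passing a different `probe` function keeps the logic testable without network access.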
Examples of critical bugs include:
Bugs labeled priority::highest should be alerted ASAP and resolved within 24 hours or sooner. With approval from a Staff Engineer or higher, the problem version may optionally be yanked from PyPI.
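The triage rule above can be sketched in a few lines. This is an illustration under stated assumptions, not an official tool: it assumes the three priority labels named earlier mark a bug as critical, and the helper names `is_critical` and `resolution_deadline` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Label names come from the text; treating all three as "critical" is an assumption.
CRITICAL_LABELS = {"priority::high", "priority::higher", "priority::highest"}

# The 24-hour window applies to priority::highest bugs, per the text.
HIGHEST_PRIORITY_WINDOW = timedelta(hours=24)


def is_critical(labels):
    """Return True if any label marks the bug as critical."""
    return any(label in CRITICAL_LABELS for label in labels)


def resolution_deadline(labels, reported_at):
    """Return the latest resolution time for a priority::highest bug, else None."""
    if "priority::highest" in labels:
        return reported_at + HIGHEST_PRIORITY_WINDOW
    return None
```

For example, a bug labeled priority::highest and reported at `datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)` would have a deadline exactly 24 hours later. Bugs with lower priority labels get no deadline from this sketch because the text does not state one.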
Always tag Taylor and Will Da Silva when a critical bug is identified.
The #meltano-alerts Slack channel receives alerts for outages and high-priority bugs.
The #troubleshooting channel is the primary place we notify users of outages and critical bugs. Depending on severity and percentage of users impacted, we may also notify users in the #announcements channel.
If you are responding to an alert in #meltano-alerts:
If you have identified a production outage or a critical bug and no alert is yet logged to #meltano-alerts:
#meltano-alerts.
When outages are expected to impact users, please share the alert or create a new notification in the #troubleshooting channel. Users who would otherwise inquire in #troubleshooting should discover your notification and know that the Meltano team is addressing the issue.
Occasionally we observe outages due to upstream service failures.
If the issue requires action from us or is otherwise worthy of investigation, we should log an issue for tracking our work and then proceed with the alerting process.
If the issue does not require any action from us, such as a significant PyPI or GitLab service outage, we may not need to open an issue but we should nevertheless notify users as appropriate.
The Information Security Manager (ISM), as described in Meltano’s policies, is the primary on-call engineer. Should the primary on-call engineer be unavailable, the ISM is the secondary on-call engineer.
Potential security incidents must be reported promptly to the ISM and other on-call engineers, either via email or Slack. The notification channel #internal-infra-alerts should be used to report any incidents. Refer to the Incident Response Policy document (in Drata) for additional details.
All staff must complete training for the “Procedure For Executing Incident Response” as outlined in the internal Meltano Incident Response Policy document at least once per year.
The incident response plan is tested annually during the month of July.