Monitoring¶
Reporting channels¶
Users of ConsDB, ConsDBClient (pqserver
) should report issues via #consolidated-database in rubin-obs.slack.com.
ConsDB operators monitor this channel and #ops-usdf, #ops-usdf-alerts for issues and outages reported, as well as escalate verified database issues.
Database¶
The ConsDB team is responsible for verifying whether or not the database is up when issues are reported.
They can check the method reported by the users, check using psql
/ pgcli
, and check in the #ops-usdf slack channel for currently reported issues.
Once the ConsDB team has confirmed there is an issue with the database, they should notify #ops-usdf slack channel and USDF DBAs should be responsible for fixing/restarting.
REST API Server¶
If we suspect the API server died, the ConsDB team should be responsible for checking and restarting it.
Use the appropriate argo-cd deployment graph to check deployment logs, and potentially restart the service.
Other issues¶
If the K8s infrastructure died then the ConsDB team can verify the problem, but there are likely to be wider issues seen.
USDF or Summit K8s/IT support should be responsible for fixing.