Monitoring
Observe platform health in real time — service uptime, error rates, background job queues, webhook delivery, and structured application logs.
Overview
The Monitoring section of the Admin Portal is the operational health center for the DramWell platform. It surfaces real-time and historical data about service availability, API error rates, background job execution, webhook delivery, and application logs — giving the engineering and operations teams what they need to detect, diagnose, and resolve incidents quickly.
Key Concepts
Health Check — A periodic ping to each DramWell service endpoint that reports its status (Up, Degraded, Down) and response latency.
Error Rate — The percentage of API requests that returned a 4xx or 5xx response in a given time window. Tracked per service and per route.
Job Queue — A list of background jobs pending execution. Queues can accumulate during high-load periods or when a worker is down. Deep queues indicate a processing backlog.
Webhook Event — An outbound HTTP request sent from DramWell to a customer-configured endpoint when a platform event occurs. Failed webhook events are retried with exponential backoff.
Log — A structured application log entry emitted by any DramWell service. Logs include severity, service name, trace ID, and a structured payload.
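As a rough illustration of the Error Rate definition above, the sketch below computes the percentage of 4xx and 5xx responses in a window of status codes. The function name and the sample window are made up for the example; this is not the platform's actual implementation.

```python
def error_rate(status_codes):
    """Return the percentage of responses with a 4xx or 5xx status."""
    if not status_codes:
        return 0.0  # an empty window has no errors to report
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return 100.0 * errors / len(status_codes)


# Hypothetical window of ten API responses.
window = [200, 200, 404, 500, 201, 200, 200, 503, 200, 200]
print(f"{error_rate(window):.1f}%")  # 3 of 10 requests failed -> 30.0%
```

Running the same function over requests grouped by service or by route would give the per-service and per-route figures.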
Service Health
Go to Monitoring > Health. Each row represents a service with its current status and a 24-hour uptime percentage:
| Service | Description |
|---|---|
| API | Core REST/tRPC API (api.dramwell.ai) |
| Dashboard | DramGuest/DramPulse/DramTrade app server |
| Admin | Admin Portal app server |
| Supabase | Database and authentication layer |
| Twilio Relay | Telephony webhook handler |
| Edge Functions | Supabase edge function runtime |
Click any service to see a response-time histogram for the last 24 hours and a list of recent health check failures.
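One way to picture how individual health checks could roll up into a status and a 24-hour uptime figure is sketched below. The latency threshold, the `HealthCheck` type, and the choice to count Degraded checks as available are all assumptions made for illustration, not documented DramWell behavior.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    ok: bool           # did the endpoint answer successfully?
    latency_ms: float  # measured response latency


# Hypothetical threshold, not a documented DramWell value.
DEGRADED_AFTER_MS = 1000.0


def classify(check):
    """Map one health check to Up, Degraded, or Down."""
    if not check.ok:
        return "Down"
    if check.latency_ms > DEGRADED_AFTER_MS:
        return "Degraded"
    return "Up"


def uptime_percentage(checks):
    """Percentage of checks that were not Down (assumption: Degraded counts as available)."""
    if not checks:
        return 0.0
    available = sum(1 for c in checks if classify(c) != "Down")
    return 100.0 * available / len(checks)


checks = [
    HealthCheck(True, 120.0),
    HealthCheck(True, 1500.0),
    HealthCheck(False, 0.0),
    HealthCheck(True, 95.0),
]
print([classify(c) for c in checks])  # ['Up', 'Degraded', 'Down', 'Up']
print(uptime_percentage(checks))      # 75.0
```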
Error Tracking
Go to Monitoring > Errors. The top panel shows a sparkline of error rate over time. Below it, errors are grouped by type and sorted by occurrence count. Each error group shows:
- Error message and stack trace sample
- First seen / last seen timestamps
- Occurrence count
- Affected service and route
- A sample of affected request IDs for correlation
Click any error group to see the full trace and all recent occurrences. Use the Resolve button to mark an error group as investigated. Resolving does not suppress future occurrences: if the error recurs, the group reopens.
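The grouping described in this section can be sketched roughly as follows. The event field names (`service`, `route`, `message`, `ts`, `request_id`), the sample routes, and the choice of grouping key are assumptions for the example; the portal's real grouping logic may differ.

```python
def group_errors(events, sample_size=3):
    """Group error events and sort the groups by occurrence count, highest first."""
    groups = {}
    for e in events:
        key = (e["service"], e["route"], e["message"])
        g = groups.setdefault(
            key,
            {"count": 0, "first_seen": e["ts"], "last_seen": e["ts"], "request_ids": []},
        )
        g["count"] += 1
        g["first_seen"] = min(g["first_seen"], e["ts"])
        g["last_seen"] = max(g["last_seen"], e["ts"])
        if len(g["request_ids"]) < sample_size:
            g["request_ids"].append(e["request_id"])  # keep a small sample for correlation
    return sorted(groups.items(), key=lambda kv: kv[1]["count"], reverse=True)


# Hypothetical events; "ts" is an epoch timestamp in seconds.
events = [
    {"service": "API", "route": "/v1/orders", "message": "TimeoutError", "ts": 1000, "request_id": "req-1"},
    {"service": "API", "route": "/v1/orders", "message": "TimeoutError", "ts": 1300, "request_id": "req-2"},
    {"service": "Admin", "route": "/login", "message": "ValueError", "ts": 1100, "request_id": "req-3"},
]

key, top = group_errors(events)[0]
print(key, top["count"], top["first_seen"], top["last_seen"])
# ('API', '/v1/orders', 'TimeoutError') 2 1000 1300
```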
Job Queues
Go to Monitoring > Queues. Each queue is listed with its current depth, processing rate (jobs per minute), and oldest job age. Normal depth for all queues is under 100 jobs. A queue with depth over 500 or oldest job age over 5 minutes indicates a problem.
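The thresholds above translate into a simple check, sketched here. The function name and the "elevated" label for depths between 100 and 500 are assumptions; the text only defines the normal and problem ranges.

```python
def queue_state(depth, oldest_job_age_seconds):
    """Classify a queue using the thresholds described above."""
    if depth > 500 or oldest_job_age_seconds > 5 * 60:
        return "problem"
    if depth < 100:
        return "normal"
    return "elevated"  # assumed label for the range the text leaves undefined


print(queue_state(40, 12))    # normal
print(queue_state(250, 90))   # elevated
print(queue_state(80, 420))   # problem (oldest job is older than 5 minutes)
print(queue_state(650, 30))   # problem (depth is over 500)
```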
Actions available per queue:
- Pause — Stops workers from pulling new jobs. Use while diagnosing a poison-pill job.
- Resume — Re-enables processing after a pause.
- Purge — Deletes all jobs in the queue. Irreversible — use only when you have confirmed the jobs are safe to discard.
- Retry Failed — Re-enqueues all jobs in the Failed state for another processing attempt.
Webhook Delivery
Go to Monitoring > Webhooks. The table shows every webhook event in the last 7 days with its delivery status:
- Delivered — Customer endpoint returned a 2xx response
- Pending — Awaiting first delivery attempt or retry
- Failed — All retry attempts exhausted
Click any failed event to see the full delivery history including each attempt's timestamp, HTTP status returned, and response body. From the detail view, click Force Retry to immediately attempt delivery again outside the normal retry schedule.
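For intuition about what an exponential backoff retry schedule looks like, here is a minimal sketch. The base delay, multiplier, and attempt limit are placeholder values chosen for the example, since the actual DramWell retry parameters are not stated here.

```python
def backoff_delays(base_seconds=30.0, factor=2.0, max_attempts=5):
    """Return the wait before each retry, growing by `factor` every attempt."""
    return [base_seconds * factor ** attempt for attempt in range(max_attempts)]


# Placeholder parameters, not documented platform values.
print(backoff_delays())  # [30.0, 60.0, 120.0, 240.0, 480.0]
```

Under these assumed values, an event that never succeeds would move to Failed after the fifth retry. Force Retry sits outside any such schedule and attempts delivery immediately.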
Logs
Go to Monitoring > Logs. The log viewer streams live structured logs from all services. Use the filter bar to scope by:
- Service — API, Dashboard, Admin, Edge Functions
- Severity — Debug, Info, Warning, Error
- Trace ID — Correlate all logs from a single request
- Time range — Up to 7 days of historical logs are queryable
Click any log entry to expand the full structured payload. Use the Copy Trace ID button to pull all logs associated with a request into a filtered view.
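If logs are exported for offline analysis, the same filters can be approximated in a few lines. The entry shape below (`severity`, `service`, `trace_id`, `payload`) mirrors the fields listed under Key Concepts, but the exact key names are assumed, and treating severity as a minimum level rather than an exact match is also an assumption.

```python
SEVERITY_ORDER = ["Debug", "Info", "Warning", "Error"]


def filter_logs(entries, trace_id=None, service=None, min_severity="Debug"):
    """Return entries matching the given trace ID, service, and minimum severity."""
    floor = SEVERITY_ORDER.index(min_severity)
    return [
        e for e in entries
        if (trace_id is None or e["trace_id"] == trace_id)
        and (service is None or e["service"] == service)
        and SEVERITY_ORDER.index(e["severity"]) >= floor
    ]


# Hypothetical log entries.
logs = [
    {"severity": "Info", "service": "API", "trace_id": "abc123", "payload": {"msg": "request started"}},
    {"severity": "Error", "service": "API", "trace_id": "abc123", "payload": {"msg": "upstream timeout"}},
    {"severity": "Info", "service": "Admin", "trace_id": "def456", "payload": {"msg": "login"}},
]

for entry in filter_logs(logs, trace_id="abc123", min_severity="Warning"):
    print(entry["service"], entry["severity"], entry["payload"]["msg"])
# API Error upstream timeout
```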
Tips
- Set a browser bookmark directly to the Health page. It is the fastest way to confirm whether a reported issue is a platform outage or a configuration problem on the customer's end.
- Correlate error spikes with recent deployments using the deployment timestamps overlaid on the Error Rate chart (toggle Show Deployments in the chart header).
- The Logs search is most effective when you have a trace ID. Every API response includes an X-Trace-Id header, so ask customers to provide it when reporting issues.
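Customers who call the API from their own code can capture the trace ID programmatically. The helper below is a minimal sketch that reads it from a mapping of response headers; the function name and the sample values are hypothetical. HTTP header names are case-insensitive, so the lookup ignores case.

```python
def extract_trace_id(headers):
    """Return the X-Trace-Id value from a mapping of response headers, or None."""
    for name, value in headers.items():
        if name.lower() == "x-trace-id":
            return value
    return None


# Hypothetical response headers.
response_headers = {"Content-Type": "application/json", "x-trace-id": "abc123"}
print(extract_trace_id(response_headers))  # abc123
```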