Chat Shell Graceful Shutdown
Chat Shell's /v1/responses endpoint is a long-lived streaming API. When Kubernetes terminates a Pod, the service must stop accepting new traffic before waiting for existing streams to finish. Otherwise users can see interrupted SSE responses.
Shutdown Flow
Use /shutdown/wait from the preStop hook:
- /shutdown/wait marks the service as shutting_down.
- /ready returns 503, so Kubernetes stops routing regular new traffic to the Pod.
- /v1/responses returns 503 while shutting_down, covering load balancer endpoint propagation delays.
- Existing streams keep running until the active stream count reaches zero.
- If the wait exceeds GRACEFUL_SHUTDOWN_TIMEOUT, Chat Shell cancels the remaining streams and returns a timeout result.
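The flow above can be sketched with asyncio. This is a minimal illustration under stated assumptions, not Chat Shell's actual implementation: the ShutdownManager class, its method names, and the shape of the result dictionary are all hypothetical.

```python
import asyncio
from typing import Dict, Optional


class ShutdownManager:
    """Hypothetical sketch of the drain-then-cancel flow."""

    def __init__(self) -> None:
        self.shutting_down = False
        # stream_id -> cancel event owned by that stream
        self._streams: Dict[str, asyncio.Event] = {}
        self._drained = asyncio.Event()
        self._drained.set()  # nothing is streaming yet

    def register(self, stream_id: str) -> Optional[asyncio.Event]:
        """Return a cancel event, or None when new streams must get a 503."""
        if self.shutting_down:
            return None
        event = asyncio.Event()
        self._streams[stream_id] = event
        self._drained.clear()
        return event

    def unregister(self, stream_id: str) -> None:
        self._streams.pop(stream_id, None)
        if not self._streams:
            self._drained.set()

    async def wait(self, timeout: float) -> dict:
        """Enter shutdown state, then wait for active streams to finish."""
        self.shutting_down = True
        try:
            await asyncio.wait_for(self._drained.wait(), timeout)
            return {"status": "drained", "cancelled": 0}
        except asyncio.TimeoutError:
            remaining = list(self._streams.values())
            for event in remaining:
                event.set()  # ask each remaining stream to stop
            return {"status": "timeout", "cancelled": len(remaining)}


async def demo() -> dict:
    manager = ShutdownManager()
    cancel = manager.register("stream-1")

    async def stream() -> None:
        # Simulated stream: five short chunks, stopping early if cancelled.
        for _ in range(5):
            if cancel.is_set():
                break
            await asyncio.sleep(0.01)
        manager.unregister("stream-1")

    task = asyncio.create_task(stream())
    result = await manager.wait(timeout=1.0)
    await task
    return result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the stream finishes well inside the timeout, so wait() reports a drained result. With a timeout shorter than the stream, the same call would set the stream's cancel event and report a timeout instead.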
Kubernetes Configuration
terminationGracePeriodSeconds must be greater than GRACEFUL_SHUTDOWN_TIMEOUT so the preStop hook and process shutdown have enough time.
spec:
  terminationGracePeriodSeconds: 330
  containers:
    - name: chat-shell
      lifecycle:
        preStop:
          httpGet:
            path: /shutdown/wait
            port: 8001
      readinessProbe:
        httpGet:
          path: /ready
          port: 8001
      livenessProbe:
        httpGet:
          path: /health
          port: 8001
If GRACEFUL_SHUTDOWN_TIMEOUT=300, set terminationGracePeriodSeconds to at least 330.
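The sizing rule can be checked mechanically, for example in a deployment lint step. The sketch below is only an illustration: the function names are hypothetical, and the 30-second buffer is inferred from the 300/330 example above rather than taken from a documented constant.

```python
def min_termination_grace_period(shutdown_timeout_s: int, buffer_s: int = 30) -> int:
    """Smallest grace period that leaves buffer_s after the preStop wait.

    buffer_s=30 is an assumption based on the 300 -> 330 example.
    """
    return shutdown_timeout_s + buffer_s


def grace_period_is_safe(
    termination_grace_period_s: int,
    shutdown_timeout_s: int,
    buffer_s: int = 30,
) -> bool:
    """True when the Pod's grace period covers the wait plus the buffer."""
    required = min_termination_grace_period(shutdown_timeout_s, buffer_s)
    return termination_grace_period_s >= required


if __name__ == "__main__":
    print(min_termination_grace_period(300))
    print(grace_period_is_safe(330, 300))
    print(grace_period_is_safe(300, 300))
```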
Endpoints
- GET /health: liveness probe. It still returns 200 during shutdown and includes shutting_down and active_streams.
- GET /ready: readiness probe. It returns 503 during shutdown.
- POST /shutdown/initiate: manually enter shutdown state.
- POST /shutdown/wait: enter shutdown state and wait for active streams to finish; intended for Kubernetes preStop.
- GET /v1/streams/active-count: return the active /v1/responses stream count for debugging and monitoring.
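The status codes described above can be summarized as plain functions of the shutdown state. This is a sketch for illustration only: the State class and function names are hypothetical, and the response bodies are reduced to the two fields the /health description names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass
class State:
    """Hypothetical snapshot of the service's shutdown state."""
    shutting_down: bool = False
    active_streams: int = 0


def health(state: State) -> Tuple[int, Dict[str, Union[bool, int]]]:
    # Liveness stays 200 during shutdown so the Pod is not restarted
    # while it is still draining streams.
    body = {
        "shutting_down": state.shutting_down,
        "active_streams": state.active_streams,
    }
    return 200, body


def ready(state: State) -> int:
    # Readiness flips to 503 so the Pod is taken out of rotation.
    return 503 if state.shutting_down else 200


def accept_new_response(state: State) -> int:
    # New /v1/responses requests are refused once shutdown has started.
    return 503 if state.shutting_down else 200


def initiate_shutdown(state: State) -> None:
    # Mirrors the effect of POST /shutdown/initiate: flag only, no waiting.
    state.shutting_down = True


if __name__ == "__main__":
    state = State(active_streams=2)
    print(ready(state), accept_new_response(state))
    initiate_shutdown(state)
    print(health(state), ready(state), accept_new_response(state))
```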
Design Constraints
- shutdown_manager is the single source of truth for active stream counts, keeping /health, /v1/health, and shutdown waiting consistent.
- /v1/responses rejects only new streaming requests; already registered streams continue.
- Timeout cancellation uses the cancel event registered by each stream instead of module-level temporary state.
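To illustrate the last constraint, the sketch below shows a stream that registers its own cancel event with a registry object and checks it between chunks. Everything here is an assumption made for illustration: StreamRegistry stands in for shutdown_manager, and sse_stream, cancel_all, and the error event text are hypothetical names, not Chat Shell's API.

```python
import asyncio
from typing import AsyncIterator, Dict, List


class StreamRegistry:
    """Hypothetical stand-in for shutdown_manager: owns counts and cancel events."""

    def __init__(self) -> None:
        self._cancel_events: Dict[str, asyncio.Event] = {}

    @property
    def active_count(self) -> int:
        # Every endpoint that reports a count would read this one value.
        return len(self._cancel_events)

    def register(self, stream_id: str) -> asyncio.Event:
        event = asyncio.Event()
        self._cancel_events[stream_id] = event
        return event

    def unregister(self, stream_id: str) -> None:
        self._cancel_events.pop(stream_id, None)

    def cancel_all(self) -> int:
        # What a shutdown timeout would trigger: signal each stream's own event.
        for event in self._cancel_events.values():
            event.set()
        return len(self._cancel_events)


async def sse_stream(
    registry: StreamRegistry, stream_id: str, chunks: List[str]
) -> AsyncIterator[str]:
    cancel = registry.register(stream_id)
    try:
        for chunk in chunks:
            if cancel.is_set():
                # Assumed error payload; the real wire format may differ.
                yield "event: error\ndata: shutdown timeout\n\n"
                return
            yield f"data: {chunk}\n\n"
            await asyncio.sleep(0)  # let other tasks run between chunks
    finally:
        # Runs on normal completion, cancellation, or client disconnect.
        registry.unregister(stream_id)


async def demo() -> tuple:
    registry = StreamRegistry()
    stream = sse_stream(registry, "s1", ["a", "b", "c"])
    first = await stream.__anext__()
    count_while_streaming = registry.active_count
    registry.cancel_all()
    second = await stream.__anext__()
    rest = [item async for item in stream]
    return first, count_while_streaming, second, rest, registry.active_count


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the event lives with the registered stream, cancelling one stream cannot leak into another, and the count drops in the same place the event is removed.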