Knowledge Indexing Protection
Backgroundβ
Document RAG indexing previously relied mostly on Celery delivery and retry behavior, without a business-level idempotency guard.
In production, the following situations could cause repeated embedding for the same large file:
- worker / Pod restarts that trigger broker redelivery
- duplicate enqueue for the same document
- an old task resuming after a newer indexing attempt has already started
Goalsβ
- Only one active indexing generation should be valid for a document at a time
- Old redelivery / retry tasks must not overwrite newer results
- Even if Celery re-delivers a task, the business layer should reject it quickly
- Indexing state should be explicit for debugging and future UI usage
Designβ
The protection has three layers.
1. Document indexing state machineβ
The knowledge_documents table now stores:
index_status:not_indexed | queued | indexing | success | failedindex_generation
index_generation is the key field. Every valid new indexing attempt gets a new generation. Any older task becomes stale automatically.
not_indexed is the explicit initial state. It separates "never indexed yet" from a real execution failure instead of overloading both into failed.
2. Business dedupe before enqueueβ
Before sending a Celery task, the orchestrator updates database state first:
- if the document is already
queued / indexing, duplicate enqueue is skipped - if
queued / indexinghas been stuck too long, a new request can take over with a new generation - if the document is already
success, normal retry requests are skipped - only flows that explicitly need a new attempt create a new generation
Current policy:
- new document: create a new generation
- retry for failed document: create a new generation
- content update / web refresh: replace the active generation so old tasks become stale
- long-lived
queued / indexing: allow takeover after stale detection based onupdated_at
Current default thresholds:
queuedolder than 10 minutes can be replaced by a new generationindexingolder than 45 minutes can be replaced by a new generation
3. Final guard before worker executionβ
Before calling the embedding model, the Celery worker:
- acquires a Redis distributed lock based on
document_id - validates the current
index_generationandindex_statusin the database
The task only executes when:
- the task generation matches the current database generation
- the current status is still
queuedorindexing
Otherwise it returns skipped and does not call the embedding model again.
Why generation solves redeliveryβ
A lock alone is not enough.
After a worker crash, the lock will eventually expire, while the old broker message may still be redelivered. Without generation checks, that old message could acquire the lock later and run again.
With generation:
- a new indexing attempt increments the generation
- old redelivered messages become stale immediately
- even if an old task reaches the final write phase, it cannot overwrite the latest generation state
Long-running lock watchdogβ
Embedding can run for many minutes, so the document-level Redis lock uses a watchdog extension pattern:
- short initial TTL
- background extension while the task is alive
- natural expiration after process exit or crash
This avoids both failure modes:
- lock expires too early during a long task
- lock stays too long after a crashed worker
Relationship with Celery retryβ
This change still focuses on business-side protection first, so index_document no longer uses generic retry for all failures.
The current behavior is:
lock_held: limited short-delay retries only, to cover the leftover lock window after worker restarts- duplicate / stale tasks that business logic can explain:
skipped - actual execution failure: persisted as
failed - if the indexing layer returns a business
skippedresult during execution: it is also persisted asfailedinstead ofsuccess - any further retry must create a new business generation explicitly
This prevents the same failed logical task from being replayed repeatedly between broker and workers.
Current default lock settings:
- document lock TTL: 120 seconds
- watchdog extend interval: 30 seconds
lock_heldretry delay: 15 secondslock_heldmax retries: 10
Existing data migrationβ
The migration initializes historical documents as follows:
is_active = truebecomesindex_status = success- all other historical documents default to
index_status = not_indexed
This lets existing rows enter the new state machine immediately.
Follow-up workβ
This change fixes business idempotency and repeated execution, but it is not the end state for Celery operations. Production still should move toward:
- separate Web / Worker / Beat deployments
- disabling embedded Celery in production
- dedicated queues for knowledge indexing
- timeout handling, stale recovery, and monitoring metrics
Business protection is the safety net. Celery deployment hardening should still continue.