How We Cut Chat Latency: 7 Phases of Optimization
From 12 seconds to 2.8: how we turned a slow legal chat into a tool that is a pleasure to use
When a lawyer asks an AI system a question, every second of waiting erodes their trust in the technology. Here is how we cut response time from 12 seconds to 2.8.
Starting Point: Why the Chat Was Slow
LEX AI does not work like a regular chatbot. Our ChatService implements an agentic loop: upon receiving a user request, the LLM decides on its own which tools to call, analyzes results, and may run up to 5 iterations before forming a final response. A typical query like "What is the court practice on compensation for moral damages from traffic accidents?" goes through this path:
- LLM analyzes the query and selects tools
- Calls search_court_decisions (semantic search in Qdrant + PostgreSQL)
- Calls get_court_decision for 3-5 found decisions
- LLM analyzes the texts and forms a response
- SSE streaming of the result to the client
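The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual ChatService code: the `LlmStep` type and the `callLlm` and `runTool` callbacks are hypothetical stand-ins, and the fallback message is an assumption. Only the 5-iteration cap and the sequential tool execution come from the description.

```typescript
// Hypothetical types: the real ChatService interfaces are not shown in this article.
type ToolCall = { name: string; args: Record<string, unknown> };
type LlmStep =
  | { kind: "tools"; calls: ToolCall[] }
  | { kind: "answer"; text: string };
type Llm = (history: string[]) => Promise<LlmStep>;
type ToolRunner = (call: ToolCall) => Promise<string>;

const MAX_ITERATIONS = 5; // the iteration cap mentioned above

async function runAgentLoop(
  question: string,
  callLlm: Llm,
  runTool: ToolRunner,
): Promise<string> {
  const history: string[] = [`user: ${question}`];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const step = await callLlm(history);
    if (step.kind === "answer") return step.text;
    // Sequential execution, as in the pre-optimization flow.
    for (const call of step.calls) {
      history.push(`tool ${call.name}: ${await runTool(call)}`);
    }
  }
  // Assumption: return a fixed message when the cap is reached.
  return "Iteration limit reached without a final answer.";
}
```

Every `await` inside the loop is a network round trip in the real system, which is why the stage timings add up rather than overlap.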
Each step is a network request, and they were executed sequentially. We profiled a typical query and got this breakdown:
| Stage | Time (ms) | Share |
|-------|-----------|-------|
| First LLM call (tool selection) | 2,400 | 20% |
| Qdrant search (embedding + query) | 1,800 | 15% |
| Loading 4 decisions from ZakonOnline | 4,200 | 35% |
| Second LLM call (analysis + response) | 3,100 | 26% |
| Serialization, SSE, overhead | 500 | 4% |
| Total | 12,000 | 100% |
Median response time was 12 seconds; P95 was 18.4 seconds. For an interactive chat, this is unacceptable.
Phase 1: Parallel Tool Execution
Problem: When the LLM requested multiple tool calls simultaneously (e.g., search_court_decisions + get_legislation_section), we executed them sequentially via a simple for...of loop.
Solution: Replaced sequential execution with Promise.allSettled():
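// Before: tool calls run one after another
for (const toolCall of toolCalls) {
  const result = await this.executeTool(toolCall);
  results.push(result);
}
// After: all tool calls start at once; allSettled never rejects,
// so one failed tool does not discard the other results
const promises = toolCalls.map(tc => this.executeTool(tc));
const settled = await Promise.allSettled(promises);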
We added a semaphore with a limit of 6 parallel calls to avoid overloading either the ZakonOnline API or the database. Each call received an individual timeout of 8 seconds instead of a shared one.
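The semaphore and per-call timeout are not shown in the snippet above, so here is one way they could look. This is a sketch under assumptions: the `Semaphore` class, `withTimeout`, and `executeAll` are illustrative names, not the production code. Only the limit of 6 and the 8-second timeout come from the text.

```typescript
// Minimal counting semaphore; finished tasks hand their slot directly to the next waiter.
class Semaphore {
  private readonly limit: number;
  private active = 0;
  private readonly queue: Array<() => void> = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  private async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  private release(): void {
    const next = this.queue.shift();
    if (next) next(); // slot passes to the waiter, so `active` stays unchanged
    else this.active--;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Rejects if `p` does not settle within `ms`; the underlying work is not cancelled.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}

async function executeAll<T>(
  tasks: Array<() => Promise<T>>,
  limit = 6, // parallelism cap from the text
  timeoutMs = 8000, // individual timeout from the text
): Promise<PromiseSettledResult<T>[]> {
  const sem = new Semaphore(limit);
  return Promise.allSettled(
    tasks.map((task) => sem.run(() => withTimeout(task(), timeoutMs))),
  );
}
```

In this sketch the timeout clock starts when a task gets its slot, not while it waits in the queue, so a queued call is not penalized for the calls ahead of it.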
Result: -2,100 ms on queries with 3+ tools. The gain is largest when the LLM requests 4-5 court decisions at once.
Phase 2: SSE Streaming from the First Token
Problem: We waited for the complete response from the LLM and only then sent it to the client as a single SSE message. The user saw a blank screen for 3+ seconds during text generation.
Solution: Switched the OpenAI API to stream: true mode and piped tokens directly into SSE:
// SSE events now fly as they are generated
for await (const chunk of openaiStream) {
  const token = chunk.choices[0]?.delta?.content;
  if (token) {
    res.write(`data: ${JSON.stringify({ type: 'token', content: token })}\n\n`);
  }
}
On the frontend, the useAIChat() hook now updates the UI on every received token. First text appears within 200-400 ms after generation starts.
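On the receiving side, the client has to split the byte stream back into events, because a network chunk can end in the middle of one. The internals of useAIChat() are not shown here, so the following is only a sketch of that parsing step. `createSseTokenParser` is a hypothetical helper that assumes the `data: {...}\n\n` framing and the `{ type: 'token', content }` payload from the server snippet above.

```typescript
// Hypothetical client-side helper; not the actual useAIChat() implementation.
type TokenEvent = { type: "token"; content: string };

function createSseTokenParser(onToken: (token: string) => void) {
  let buffer = "";
  return (chunk: string): void => {
    buffer += chunk;
    let boundary: number;
    // SSE events are terminated by a blank line ("\n\n").
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      for (const line of rawEvent.split("\n")) {
        if (!line.startsWith("data: ")) continue;
        const event = JSON.parse(line.slice(6)) as TokenEvent;
        if (event.type === "token") onToken(event.content);
      }
    }
  };
}

// Usage: feed decoded chunks as they arrive and append tokens to the UI state.
let answer = "";
const feed = createSseTokenParser((t) => {
  answer += t;
});
feed('data: {"type":"token","content":"Hel');
feed('lo"}\n\n');
```

After the two `feed` calls, `answer` holds the single token "Hello", even though the event arrived split across chunks.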
Result: Perceived latency (Time to First Token) dropped from 3,100 ms to 380 ms. Total time did not change, but UX improved dramatically.
Phase 3: Tool-Level Caching
Problem: The same get_court_decision call for a popular Supreme Court decision was made dozens of times per day, each time hitting the ZakonOnline API.
Solution: Added three-tier caching: Redis (TTL 4 hours) -> PostgreSQL (TTL 30 days) -> API:
async getDocumentFullText(docId: string): Promise<string> {
  const cached = await this.redis.get(`doc:fulltext:${docId}`);
  if (cached) return cached; // ~2ms
  const pgCached = await this.db.query(
    'SELECT full_text FROM document_cache WHERE zakononline_id = $1', [docId]
  );
  if (pgCached.rows[0]) {
    await this.redis.setex(`doc:fulltext:${docId}`, 14400, pgCached.rows[0].full_text);
    return pgCached.rows[0].full_text; // ~15ms
  }
  const text = await this.zoAdapter.fetchFullText(docId); // ~800ms
  // ... save to both caches
  return text;
}
After a week of operation, cache hit rate stabilized at 73% for Redis and 91% for PostgreSQL.
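The read-through pattern in the method above, including the elided write-back step, can be expressed generically. The sketch below is an illustration under assumptions: `CacheTier`, `MemoryTier`, and `readThrough` are hypothetical names, and in-memory maps stand in for Redis and PostgreSQL so the example runs on its own.

```typescript
// Hypothetical tier interface; in the article the tiers are Redis and PostgreSQL.
interface CacheTier {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in with a TTL, used only to keep the sketch runnable.
class MemoryTier implements CacheTier {
  private readonly ttlMs: number;
  private readonly store = new Map<string, { value: string; expiresAt: number }>();

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  async get(key: string): Promise<string | undefined> {
    const hit = this.store.get(key);
    if (!hit || hit.expiresAt <= Date.now()) return undefined;
    return hit.value;
  }

  async set(key: string, value: string): Promise<void> {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

async function readThrough(
  key: string,
  tiers: CacheTier[], // fastest tier first
  fetchFromSource: () => Promise<string>,
): Promise<string> {
  const missed: CacheTier[] = [];
  for (const tier of tiers) {
    const value = await tier.get(key);
    if (value !== undefined) {
      // Backfill the faster tiers that missed.
      await Promise.all(missed.map((t) => t.set(key, value)));
      return value;
    }
    missed.push(tier);
  }
  const fresh = await fetchFromSource();
  // Write the fresh value to every tier.
  await Promise.all(tiers.map((t) => t.set(key, fresh)));
  return fresh;
}
```

A call such as `readThrough(key, [redisTier, pgTier], () => fetchFullText(id))` would then reach the API only when both tiers miss.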
Result: -1,900 ms on repeated queries (most of them). Traffic savings to ZakonOnline: ~68%.
Phase 4: Connection Pooling and Keep-Alive
Problem: Every HTTP request to ZakonOnline opened a new TCP connection. TLS handshake added 120-180 ms per call.
Solution: Configured an HTTP Agent with keep-alive and pooling:
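import * as https from 'node:https';

// Shared agent: sockets are reused across requests instead of reopened
const zoAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 15,
  maxFreeSockets: 5,
  timeout: 10000,
});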
We also increased the PostgreSQL connection pool from 10 to 25 (via PgBouncer in transaction mode) and enabled Redis pipelining.
Result: -380 ms per external call after the first. With 4 calls per query, the 3 calls that reuse a connection save about 1,100 ms in total.
Phase 5: Prompt Optimization
Problem: The ChatService system prompt ran to 2,800 tokens: a detailed description of all 36 tools, the response format, and legal terminology. The LLM spent time processing this context on every iteration.
Solution: We restructured the prompt:
- Shortened tool descriptions to key parameters (from 2,800 to 1,400 tokens)
- Added DOMAIN_TOOL_MAP, a compact domain-based routing map instead of the full list
- Moved usage examples from the system prompt to a few-shot section, added only on the first call
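The contents of DOMAIN_TOOL_MAP are not shown in this article, so the following is only a guess at its general shape. The domain keys are invented for illustration, the tool names are the ones mentioned earlier, and `renderToolRouting` and `buildSystemPrompt` are hypothetical helpers.

```typescript
// Hypothetical shape: domain keys are invented, tool names come from this article.
const DOMAIN_TOOL_MAP: Record<string, readonly string[]> = {
  court_practice: ["search_court_decisions", "get_court_decision"],
  legislation: ["get_legislation_section"],
};

// Render the map as a compact block for the system prompt.
function renderToolRouting(map: Record<string, readonly string[]>): string {
  return Object.entries(map)
    .map(([domain, tools]) => `${domain}: ${tools.join(", ")}`)
    .join("\n");
}

// Few-shot examples are appended only on the first LLM call.
function buildSystemPrompt(base: string, fewShot: string, isFirstCall: boolean): string {
  const parts = [base, renderToolRouting(DOMAIN_TOOL_MAP)];
  if (isFirstCall) parts.push(fewShot);
  return parts.join("\n\n");
}
```

The second and later iterations of the agentic loop then pay only for the base prompt and the routing block.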
Result: -420 ms per LLM call. With 2 calls per query, that is -840 ms.
Phase 6: Pre-computed Embeddings
Problem: Every search query generated an embedding via OpenAI text-embedding-ada-002, which costs 300-600 ms per API call.
Solution: Introduced an embedding cache in Redis with query normalization:
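import * as crypto from 'node:crypto';

function normalizeQuery(q: string): string {
  return q.toLowerCase()
    .replace(/[\u00AB\u00BB"']/g, '') // strip guillemets and quotes
    .replace(/\s+/g, ' ')
    .trim(); // trim last, so spaces left behind by stripped quotes are removed too
}
const cacheKey = `emb:${crypto.createHash('md5')
  .update(normalizeQuery(query)).digest('hex')}`;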
Additionally, we implemented a nightly background job that pre-computes embeddings for the top 200 most frequent queries from analytics.
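The lookup itself and the warm-up job are not shown, so here is a minimal sketch of how they might fit together. It rests on assumptions: a `Map` stands in for Redis, the normalized query is used directly as the key instead of its MD5 hash, and `getEmbedding`, `warmCache`, and the `Embedder` type are hypothetical names. The normalization follows the lines of the snippet above.

```typescript
type Embedding = number[];
type Embedder = (text: string) => Promise<Embedding>;

// Normalization along the lines of the snippet above.
function normalizeQuery(q: string): string {
  return q
    .toLowerCase()
    .replace(/[\u00AB\u00BB"']/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Cache-aside lookup: call the embedding API only on a miss.
async function getEmbedding(
  query: string,
  cache: Map<string, Embedding>, // stand-in for Redis
  embed: Embedder,
): Promise<Embedding> {
  const key = `emb:${normalizeQuery(query)}`;
  const cached = cache.get(key);
  if (cached) return cached;
  const fresh = await embed(query);
  cache.set(key, fresh);
  return fresh;
}

// Pre-computation pass over a list of frequent queries, e.g. from a nightly job.
async function warmCache(
  queries: string[],
  cache: Map<string, Embedding>,
  embed: Embedder,
): Promise<void> {
  for (const q of queries) await getEmbedding(q, cache, embed);
}
```

One trade-off of this design is that all spellings that normalize to the same key share the embedding of whichever variant was seen first.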
Result: -450 ms for repeated queries (cache hit ~41% in the first week, ~58% after a month).
Phase 7: Materialized Search Results
Problem: Semantic search in Qdrant returned document IDs, after which we made N queries to PostgreSQL to fetch metadata (court name, date, case number).
Solution: Created a materialized view that refreshes every 15 minutes:
CREATE MATERIALIZED VIEW mv_court_decision_search AS
SELECT d.zakononline_id, d.title, d.court_name, d.case_number,
       d.judgment_date, d.justice_kind, d.doc_type,
       LEFT(d.full_text, 500) AS snippet
FROM court_decisions d
WHERE d.full_text IS NOT NULL;

CREATE INDEX idx_mv_search_zoid ON mv_court_decision_search(zakononline_id);
Now after receiving IDs from Qdrant, we make one batch query to the materialized view instead of N separate ones.
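The batch query could look roughly like the sketch below. It assumes that zakononline_id is handled as a string and that the database client accepts an array as a single query parameter (node-postgres does). `BATCH_SQL` and `orderByRanking` are hypothetical names. Because SQL does not return rows in the order of the ID array, the Qdrant ranking has to be restored afterwards.

```typescript
// One round trip: all IDs go in as a single array parameter.
const BATCH_SQL = `
  SELECT zakononline_id, title, court_name, case_number, judgment_date, snippet
  FROM mv_court_decision_search
  WHERE zakononline_id = ANY($1)`;

// Restore the ranking returned by the vector search.
function orderByRanking<T extends { zakononline_id: string }>(
  rankedIds: string[],
  rows: T[],
): T[] {
  const byId = new Map<string, T>();
  for (const row of rows) byId.set(row.zakononline_id, row);
  const ordered: T[] = [];
  for (const id of rankedIds) {
    const row = byId.get(id);
    if (row) ordered.push(row); // IDs missing from the view are skipped
  }
  return ordered;
}
```

With a client like the `this.db` used earlier, the call would be along the lines of `db.query(BATCH_SQL, [ids])`, followed by `orderByRanking(ids, result.rows)`.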
Result: -680 ms on searches with 10+ results.
Summary: Before and After
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Median response (p50) | 12.0 s | 2.8 s | -77% |
| P95 | 18.4 s | 5.2 s | -72% |
| Time to First Token | 3,100 ms | 380 ms | -88% |
| Cache hit rate (Redis) | 0% | 73% | – |
| External API calls/query | 6.2 | 2.1 | -66% |
| OpenAI cost per query | 0.034 | 0.021 | -38% |
The biggest impact came from three things: parallel tool execution (phase 1), caching (phase 3), and streaming (phase 2, for perception). The remaining phases gave smaller but consistent gains that accumulate.
Conclusion
Latency optimization in LLM systems is not a single silver bullet, but a combination of approaches at every level of the stack. Paradoxically, the biggest impact on user satisfaction came not from reducing total time, but from streaming the first token. A lawyer who sees the system "thinking" and gradually forming a response is willing to wait significantly longer than one staring at a blank screen.