System Design Interviews in 2025: What CTOs Are Actually Asking
If you're still practicing "Design a URL Shortener" or "Design Twitter's Timeline," you're preparing for 2019.
We reviewed hundreds of system design interviews this year. The pattern is clear: CTOs are now asking about LLM integration, vector database scaling, and cost-aware architecture: topics that didn't exist in interview prep guides three years ago.
The technology landscape has shifted. Engineering teams are building AI-native applications, wrestling with cloud costs that spiral out of control, and architecting real-time systems that process millions of events per second. The interview questions have followed.
What you'll learn:
- The 6 question categories dominating 2025 interviews
- Specific example questions with what interviewers actually evaluate
- The 4-step framework that structures winning answers
- Common mistakes that sink senior candidates
Whether you're preparing for your next role or designing interview questions for your team, this is the playbook.
How System Design Interviews Have Evolved
The shift didn't happen overnight. Here's how the focus has changed over the past decade:
| Era | Primary Focus | Typical Question |
|---|---|---|
| 2015-2019 | Scaling CRUD applications | "Design Twitter's timeline" |
| 2020-2022 | Distributed systems fundamentals | "Design a distributed cache" |
| 2023-2024 | Real-time + ML serving | "Design a recommendation engine" |
| 2025 | AI-native, cost-aware, event-driven | "Design a RAG system with cost caps" |
Three forces are driving this evolution:
1. AI/ML integration is no longer optional. Every product team is experimenting with LLMs, embeddings, or some form of AI-assisted feature. Candidates who can't discuss RAG architectures, vector databases, or model serving patterns are at a disadvantage.
2. Cloud costs have become a first-class constraint. With engineering budgets under scrutiny, CTOs expect candidates to discuss unit economics, not just theoretical scalability. "It scales" isn't enough; "It scales at $0.003 per request" is.
3. Real-time is the default expectation. Users expect instant feedback. Event-driven architectures using Kafka, Flink, and WebSockets are foundational knowledge, not specialized skills.
What Different Companies Emphasize
The interview focus varies by company type:
| Company Type | Primary Focus | Unique Considerations |
|---|---|---|
| FAANG / Big Tech | Scale (billions of users), internal tooling | Deep dives into one component |
| High-Growth Startups | Speed to market, cost efficiency | MVP thinking, technical debt trade-offs |
| Fintech | Consistency, compliance (PCI-DSS, GDPR) | Transactional integrity, audit trails |
| Healthtech | HIPAA compliance, reliability | Zero-downtime requirements |
| AI-Native Companies | LLM orchestration, embeddings | Cost-per-query optimization |
Before your interview: Research the company's tech blog, engineering posts on LinkedIn, or conference talks. A candidate who says "I noticed you're using Kafka for your event pipeline based on your engineering blog; here's how I'd approach..." immediately stands out.
The 6 Question Categories You'll Face
These are the categories we see repeatedly in senior and staff-level interviews. Master these, and you'll be prepared for 90% of what you'll encounter.
1. LLM Integration & RAG Architectures
This is the biggest shift from previous years. CTOs want to know if you can build production AI systems, not just call an API.
Example Question:
"Design a RAG (Retrieval-Augmented Generation) system for a customer support chatbot that handles 10,000 queries per day."
What interviewers are evaluating:
- Document ingestion pipeline: How do you chunk documents (see the sketch after this list)? What's your embedding strategy? How do you handle updates?
- Vector storage selection: Can you articulate when to use Pinecone vs. pgvector vs. Weaviate?
- Retrieval strategy: Do you understand hybrid search (dense vectors + sparse BM25)? When would you add a reranking step?
- Cost optimization: Can you discuss model cascading (cheap models for simple queries, expensive models for complex ones)? Semantic caching?
- Failure modes: What happens when the LLM hallucinates? How do you handle context window limits?
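To make the ingestion answer concrete, here's a minimal chunking sketch in Python. The window and overlap sizes are illustrative starting points, not recommendations from any particular framework, and real pipelines often split on semantic boundaries (headings, paragraphs) rather than raw character counts:

```python
# A minimal sketch of fixed-size chunking with overlap; 800/200 are
# illustrative defaults, not tuned values.
def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so context survives chunk borders."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk("some long document " * 200)
print(len(chunks), len(chunks[0]))  # number of chunks, size of the first
```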
Numbers you should know:
| Model | Cost (per 1M input tokens) | Use Case |
|---|---|---|
| GPT-4 Turbo | $10-30 | Complex reasoning |
| Claude 3.5 Sonnet | ~$3 | Balanced performance |
| Self-hosted Llama 3 70B | $0.50-1.00 (GPU costs) | High-volume, predictable load |
Cost optimization patterns:
- Prompt compression can reduce token count by 20-40%
- Semantic caching (caching responses for semantically similar queries) reduces LLM calls by 30-50% (sketched below)
- Model cascading routes simple queries to cheaper models
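Here's a minimal sketch of the semantic caching pattern from the list above, assuming unit-normalized embeddings. The `embed` stub, the linear scan, and the 0.95 threshold are all stand-ins: a real system would use an actual embedding model, the vector DB itself as the cache index, and a threshold tuned per workload.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; returns a unit-normalized vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)

def lookup(query_vec: np.ndarray, threshold: float = 0.95) -> str | None:
    """Return a cached response if any past query is semantically close enough."""
    for vec, response in cache:
        if float(vec @ query_vec) >= threshold:  # cosine similarity (unit vectors)
            return response
    return None

def answer(query: str, call_llm) -> str:
    vec = embed(query)
    if (hit := lookup(vec)) is not None:
        return hit                  # cache hit: no LLM call, near-zero cost
    response = call_llm(query)      # cache miss: pay for the expensive call
    cache.append((vec, response))
    return response

if __name__ == "__main__":
    fake_llm = lambda q: f"answer({q})"
    print(answer("How do I reset my password?", fake_llm))  # miss: calls the LLM
    print(answer("How do I reset my password?", fake_llm))  # repeat: cache hit
```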
Architecture to sketch:
+----------------------------------------------------------------+
|                       INGESTION PIPELINE                       |
|     Documents -> Chunking -> Embedding Model -> Vector DB      |
+----------------------------------------------------------------+
                                |
                                v
+----------------------------------------------------------------+
|                         QUERY PIPELINE                         |
|  User Query -> Embedding -> Vector Search -> Retrieved Context |
|                                                      |         |
|                                                      v         |
|          Prompt Assembly -> LLM -> Response -> Cache           |
+----------------------------------------------------------------+
Pro tip: When sketching this in an interview, separate ingestion from query pipelines. It shows you understand that these have different scaling characteristics and failure modes.
2. Vector Database Scaling
With embeddings powering search, recommendations, and RAG systems, vector database architecture has become a core competency.
Example Question:
"Design a semantic search system that handles 100M documents with sub-100ms P99 latency."
Technology trade-offs to discuss:
| Database | Best For | Scaling Approach | Trade-off |
|---|---|---|---|
| Pinecone | Low ops overhead, serverless | Automatic | Higher cost at scale |
| Weaviate | Flexible deployments, multi-modal | Kubernetes-native | More operational complexity |
| Qdrant | High performance workloads | Distributed clusters | Requires tuning |
| Milvus | Enterprise, massive scale | GPU-accelerated | Complex setup |
| pgvector | Teams already on Postgres | Standard Postgres scaling | Limited at very high scale |
Key concepts to demonstrate:
- Indexing algorithms: HNSW (Hierarchical Navigable Small World) offers the best recall/speed trade-off for most use cases. Mention Product Quantization for memory efficiency.
- Sharding strategies: Partition embeddings by namespace or tenant to isolate workloads.
- Hybrid search: Combine dense vector search with sparse BM25 for better recall on keyword-heavy queries (see the fusion sketch after this list).
- Metadata filtering: Pre-filter by metadata before vector search to reduce the search space and improve latency.
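As a concrete illustration of the hybrid-search point, here's a minimal Reciprocal Rank Fusion (RRF) sketch that merges a dense-vector result list with a BM25 result list. The doc IDs are made up; k=60 is the conventional RRF constant.

```python
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of doc IDs; docs ranked well by both retrievers rise."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# d1 and d3 appear in both lists, so they outrank single-retriever hits.
print(rrf_fuse(["d3", "d1", "d7"], ["d1", "d9", "d3"]))  # ['d1', 'd3', 'd9', 'd7']
```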
3. Cost-Aware Architecture (FinOps)
This is where many senior candidates fall short. CTOs are increasingly asking candidates to design with explicit cost constraints.
Example Question:
"Design a system that processes 10M events per day and stays within a $50K/month cloud budget. Walk me through your cost model."
What this tests:
- Unit economics thinking: Can you calculate cost-per-transaction?
- Procurement strategy: When do you use reserved instances vs. spot vs. on-demand?
- Right-sizing discipline: Are you monitoring P95 utilization, not just P99?
- Cost allocation: In multi-tenant systems, how do you attribute costs?
The serverless vs. containers decision framework:
| Monthly Request Volume | Recommendation | Rationale |
|---|---|---|
| < 1M requests | Serverless | Pay-per-use wins |
| 1-10M requests | Hybrid | Base load on containers, burst on serverless |
| > 10M requests | Containers/VMs | Fixed costs become more economical |
Phrases that signal senior thinking:
- "Each API call costs approximately $0.003, so at 1M requests per day, we're looking at $90K/month in compute alone."
- "We'd run our base load on reserved instances for 40% savings, with auto-scaling to spot for burst traffic."
- "This design assumes a 3:1 read-to-write ratio. If that changes, we'd need to revisit the caching layer."
4. Real-Time & Event-Driven Systems
Users expect instant feedback. Batch processing is no longer acceptable for most user-facing features.
Example Questions:
"Design a real-time fraud detection system that processes transactions with sub-200ms latency." "Design a collaborative editing system like Google Docs."
Patterns you must know:
| Pattern | Use Case | Key Consideration |
|---|---|---|
| Event Sourcing | Audit trails, replay capability | Storage costs grow; need compaction strategy |
| CQRS | Read/write optimization | Eventual consistency handling |
| CDC (Change Data Capture) | Database-to-stream sync | Debezium, schema evolution |
| Saga Pattern | Distributed transactions | Compensation logic for rollbacks |
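As a toy illustration of the event-sourcing row above: state lives in an append-only log and is rebuilt by replay. The account and event shapes here are invented for the example, not from any framework.

```python
from dataclasses import dataclass, field

@dataclass
class Account:
    """Toy event-sourced aggregate: balance is derived, the log is the truth."""
    balance: float = 0.0
    log: list[dict] = field(default_factory=list)  # append-only event log

    def apply(self, event: dict) -> None:
        delta = event["amount"] if event["type"] == "deposited" else -event["amount"]
        self.balance += delta

    def record(self, event: dict) -> None:
        self.log.append(event)  # persist the fact first...
        self.apply(event)       # ...then update the derived state

    @classmethod
    def replay(cls, log: list[dict]) -> "Account":
        acct = cls(log=list(log))
        for event in log:       # rebuilding state = re-applying history
            acct.apply(event)
        return acct

a = Account()
a.record({"type": "deposited", "amount": 100.0})
a.record({"type": "withdrawn", "amount": 30.0})
assert Account.replay(a.log).balance == a.balance == 70.0
```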
WebSocket scaling architecture:
For real-time collaborative systems, you'll need to address:
- Connection layer: WebSocket servers behind a load balancer with sticky sessions or connection-aware routing
- State synchronization: Redis Pub/Sub or Kafka for cross-server message routing (sketched after this list)
- Presence management: Distributed presence with heartbeats (typically 30-second intervals)
- Delivery guarantees: At-least-once delivery with client-side deduplication
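Here's a minimal sketch of the Redis Pub/Sub routing approach, assuming a recent version of the `websockets` library and the async client in `redis-py`. The channel name, port, and single-room model are illustrative; a production system would add auth, heartbeats, and the deduplication mentioned above.

```python
import asyncio

import redis.asyncio as redis
import websockets

r = redis.Redis()
local_clients: set = set()  # connections attached to THIS server only

async def handle_client(ws):
    """Register a connection and publish its messages to the shared channel."""
    local_clients.add(ws)
    try:
        async for message in ws:
            # Publish via Redis so every server in the fleet can fan out,
            # not just the one holding this socket.
            await r.publish("room:main", message)
    finally:
        local_clients.discard(ws)

async def fan_out():
    """Relay Pub/Sub messages to every locally connected client."""
    pubsub = r.pubsub()
    await pubsub.subscribe("room:main")
    async for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        data = msg["data"].decode()
        await asyncio.gather(
            *(ws.send(data) for ws in local_clients),
            return_exceptions=True,  # one dead socket shouldn't block the rest
        )

async def main():
    asyncio.create_task(fan_out())
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```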
Capacity planning example:
Given: 1M concurrent WebSocket connections
Memory per connection: ~10KB
Base memory requirement: 10KB × 1M = 10GB
Buffer for message queues (3x): ~30KB per connection, 30GB fleet-wide
Per-server capacity: ~50K connections (CPU and file-descriptor limits bind long before a 64GB server's memory does)
Servers needed: 1M ÷ 50K = 20 servers (+ redundancy)
Result: Plan for 25-30 servers in the connection layer
Showing this math in an interview demonstrates that you think about infrastructure as a cost center, not an abstraction.
5. Observability & Modern Infrastructure
"How would you know if this system is healthy?" is now a standard follow-up question.
Example Question:
"Design the observability stack for this system. How would you debug a latency spike at 3 AM?"
The three pillars:
| Pillar | What It Answers | Key Implementation |
|---|---|---|
| Metrics | "What's happening right now?" | Prometheus, RED method (Rate, Errors, Duration) |
| Logs | "What happened and why?" | Structured JSON logs, correlation IDs |
| Traces | "How did the request flow?" | OpenTelemetry, distributed trace context |
Modern observability stack:
Services -> OpenTelemetry Collector -> Prometheus (metrics)
                                    -> Jaeger/Tempo (traces)
                                    -> Loki/Elasticsearch (logs)
                                    -> Grafana (visualization)
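A minimal tracing sketch with the OpenTelemetry Python SDK, exporting to the console for brevity rather than to the collector in the diagram above; the service and span names are invented.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter keeps the example self-contained; production would use
# an OTLP exporter pointed at the collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # invented service name

with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("user.id", "u-12345")        # searchable trace attribute
    with tracer.start_as_current_span("db_query"):  # child span shows request flow
        pass  # the actual query would run here
```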
Service mesh considerations (Istio/Linkerd):
Interviewers may ask when to introduce a service mesh. Key triggers:
- You need mTLS between all services (zero-trust requirement)
- You want traffic splitting for canary deployments without code changes
- You need consistent observability across polyglot services
- You're implementing circuit breakers and retries at the infrastructure level
Red flag: Proposing a service mesh for a 5-service application. The operational overhead rarely justifies it below 20-30 services.
Interview insight: When asked about observability, start with "What are we trying to detect?" before jumping to tools. This shows you think about outcomes, not just technology.
6. Security & Compliance
Security is no longer an afterthought section. Expect direct questions about your security architecture.
What's expected:
- Zero-trust principles: Never trust, always verify. Service-to-service authentication via mTLS or SPIFFE/SPIRE.
- Data residency: For GDPR, can you articulate where data is stored and processed?
- Encryption: At-rest (AES-256) and in-transit (TLS 1.3) as baseline expectations.
- Rate limiting patterns: Token bucket for API rate limiting, sliding window for more granular control.
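To ground the rate-limiting bullet, here's a minimal in-process token bucket sketch; a distributed deployment would keep the bucket state in Redis or similar rather than in one process's memory.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_rate` tokens/second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should respond 429 Too Many Requests

limiter = TokenBucket(capacity=10, refill_rate=5)  # burst of 10, 5 req/s sustained
print([limiter.allow() for _ in range(12)].count(True))  # ~10 allowed immediately
```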
For regulated industries:
| Industry | Key Compliance | System Design Impact |
|---|---|---|
| Fintech | PCI-DSS, SOX | Audit logging, encryption, access controls |
| Healthtech | HIPAA | PHI isolation, BAA requirements, audit trails |
| Enterprise SaaS | SOC 2, GDPR | Data residency, right to deletion, consent management |
Red flags interviewers watch for:
- Security mentioned only at the end as an afterthought
- No discussion of authentication/authorization
- Storing secrets in environment variables without a secrets manager
- Ignoring compliance requirements mentioned in the problem statement
What Separates Senior from Staff+ Answers
The same question can yield a passing senior answer or an exceptional staff-level answer. Here's what differentiates them:
| Aspect | Senior Answer | Staff+ Answer |
|---|---|---|
| Problem framing | Solves the given problem | Questions whether it's the right problem |
| Trade-offs | Acknowledges them | Quantifies them with data |
| Scale planning | Handles stated requirements | Plans for 10x growth |
| Operational thinking | Mentions monitoring | Designs for on-call experience |
| Cost awareness | Considers it | Optimizes for unit economics |
| Scope management | Covers everything superficially | Goes deep on 2-3 critical components |
The #1 differentiator: Making decisions.
"A common failure point occurs when candidates don't make decisions. Often, candidates will say things like: 'We could use this type of DB, or this other...' and then move on. It's good practice to discuss trade-offs, but then you have to commit."
Staff-level candidates state their choice, justify it with constraints from the problem, and move forward. They might say: "Given our 100ms latency requirement and 10M daily queries, I'd choose Qdrant over pgvector. Pgvector would work at lower scale, but the HNSW implementation in Qdrant gives us better P99 performance. Let me show you how I'd deploy it."
The 10 Most Common Mistakes
We've seen these patterns repeatedly in candidates who underperform:
- Jumping to solutions: not spending 5 minutes gathering requirements
- Over-engineering: adding Kafka to a system that processes 100 requests per hour
- Under-engineering: ignoring the "10M users" requirement in the problem
- Skipping capacity math: no back-of-envelope calculations for storage, bandwidth, or compute
- Happy path only: not discussing what happens when the database is down
- Name-dropping without depth: "We'd use Kafka" without explaining why or how
- Siloed thinking: designing the write path perfectly but forgetting about reads
- Ignoring cost: proposing a solution that would cost $500K/month without acknowledging it
- Security as afterthought: "We'd add auth later"
- Poor communication: designing in silence instead of thinking aloud
The fix for most of these: slow down, structure your approach, and verbalize your reasoning.
The next section gives you that structure.
The 4-Step Interview Framework
Structure separates candidates who pass from those who ramble. Use this framework:
Step 1: Requirements Gathering (5 minutes)
Before drawing anything, clarify:
- Functional requirements: What exactly should the system do? What are the core use cases?
- Non-functional requirements: What's the expected scale? Latency requirements? Consistency vs. availability preference?
- Constraints: Budget limits? Compliance requirements? Existing tech stack to integrate with?
Sample questions to ask:
- "Are we optimizing for read-heavy or write-heavy workloads?"
- "What's our target latency for the critical path?"
- "Is this a greenfield system or integrating with existing infrastructure?"
Step 2: High-Level Design (10-15 minutes)
Sketch the core components and data flow:
- Identify 5-7 major components (clients, load balancers, services, databases, caches)
- Draw the request flow for the primary use case
- Identify the data model at a high level
Don't go deep yet. Get the skeleton on the board so you can discuss trade-offs.
Step 3: Deep Dive (20-25 minutes)
Pick 2-3 components and go deep. The interviewer may guide you, or you may need to choose.
For each component:
- Discuss specific technology choices and why
- Address scaling bottlenecks
- Cover failure modes and mitigation
- Calculate capacity requirements
This is where you demonstrate expertise. It's better to go deep on two components than shallow on five.
Step 4: Scale & Wrap-up (5-10 minutes)
- Identify remaining bottlenecks
- Discuss how the system evolves at 10x scale
- Mention monitoring, alerting, and operational considerations
- Propose future enhancements
Preparation Resources
Top Resources for 2025
| Resource | Format | Best For | Investment |
|---|---|---|---|
| System Design Primer (GitHub) | Open source | Foundational concepts | Free |
| Hello Interview | Interactive course | Structured prep, mock interviews | Free tier available |
| ByteByteGo (Alex Xu) | Newsletter + course | Visual learning, breadth | $79-199/year |
| Designing Data-Intensive Applications | Book | Deep fundamentals | ~$40 |
| System Design Interview Vol. 1 & 2 | Books | Structured problem walkthroughs | ~$40 each |
Effective Practice Strategy
- Time yourself: 45 minutes per problem, simulating real conditions
- Use a drawing tool: Excalidraw, Miro, or a physical whiteboard
- Verbalize your thoughts: Practice explaining as you design
- Record yourself: Review for filler words, long pauses, or unclear explanations
- Get feedback: Practice with experienced engineers who can critique your approach
A useful mental model:
"Pretend it's 1999. A lot of the tools we have today don't exist. You and your team are in a garage. How would you design this so your friends could start coding it today?"
This forces you to focus on fundamentals rather than relying on managed services as a crutch.
Key Takeaways
If you remember nothing else from this guide:
- 2025 interviews test AI integration, cost awareness, and real-time systems, not just scale
- Use the 4-step framework: Requirements → High-Level Design → Deep Dive → Wrap-up
- Make decisions and commit: the #1 differentiator between senior and staff+ candidates
- Quantify everything: latency budgets, cost-per-request, capacity requirements
- Practice out loud; silent design is a red flag
Conclusion
System design interviews in 2025 test a different skill set than they did five years ago. LLM integration, vector databases, cost-aware architecture, and real-time systems are now table stakes. The candidates who succeed are those who:
- Stay current with how production systems are actually built
- Think in terms of trade-offs and constraints, not just "best practices"
- Communicate their reasoning clearly and make decisive choices
- Understand that cost, security, and operability matter as much as functionality
The 4-step framework (requirements, high-level design, deep dive, and wrap-up) provides structure. But structure alone isn't enough. You need reps. Practice with modern problems, get feedback, and iterate.
The interview is a conversation, not an exam. The best candidates treat it as a collaborative design session with a future colleague. That mindset shift alone can transform your performance.
Frequently Asked Questions
How long should I prepare for system design interviews?
For senior roles, plan for 4-6 weeks of focused preparation if you have production experience with distributed systems. If you're transitioning from smaller-scale systems, budget 8-12 weeks. The key is consistent practice (2-3 problems per week with full 45-minute simulations) rather than cramming.
Do I need to know specific technologies like Kafka or Kubernetes?
You don't need to be an expert, but you should understand when and why to use them. Interviewers want to see that you can evaluate trade-offs, not recite documentation. Know 2-3 options for each category (message queues, databases, caches) and articulate when you'd choose each.
How important is cost estimation in these interviews?
Increasingly critical. Most interviewers now expect back-of-envelope cost calculations. You don't need exact AWS pricing, but you should be able to say "This design would cost roughly $X per month at Y scale" and explain your assumptions. Ignoring cost entirely is a red flag at senior levels.
Should I memorize solutions to common problems?
Memorizing hurts more than it helps. Interviewers can tell when you're reciting a rehearsed answer rather than thinking through the problem. Instead, internalize the patterns (caching strategies, sharding approaches, consistency models) and apply them to the specific constraints given. Every problem has unique requirements that change the optimal solution.
How do LLM-related questions differ from traditional system design?
LLM questions add new dimensions: token costs, latency budgets for inference, embedding storage, and handling non-deterministic outputs. Traditional questions focus on data consistency and throughput. LLM questions also require discussing failure modes unique to AI: hallucinations, context window limits, and model degradation. The core system design principles apply, but you need familiarity with the AI-specific components.
What's the best way to practice if I don't have a study partner?
Record yourself. Set a 45-minute timer, pick a problem, and talk through your solution as if someone were in the room. Review the recording for dead air, unclear explanations, or moments where you got stuck. Many candidates are surprised by how different they sound compared to how they think they sound. Combine this with written practice (sketching architectures in Excalidraw or on paper) to build muscle memory for the visual component.