Implementations & Operations
Popular API Gateway implementations compared — Kong, AWS API Gateway, Envoy, NGINX, Traefik — plus deployment, HA, and production operations.
Kong
Kong is one of the most popular open-source API Gateways. Built on NGINX and OpenResty (LuaJIT), it combines NGINX's raw performance with a plugin architecture that adds API management features. It can run with a PostgreSQL or Cassandra database (traditional mode) or DB-less with declarative YAML configuration.
| Aspect | Details |
|---|---|
| Core | NGINX + OpenResty (LuaJIT) — handles millions of requests/sec |
| Plugin system | 100+ plugins: auth, rate limiting, logging, transformation |
| Configuration | DB-backed (PostgreSQL/Cassandra) or DB-less (declarative YAML) |
| Admin API | RESTful API for dynamic configuration changes |
| Kubernetes | Kong Ingress Controller (KIC) — native K8s integration |
| Enterprise | Kong Enterprise adds: Dev Portal, RBAC, OIDC, Vitals analytics |
```yaml
# Kong DB-less declarative configuration
_format_version: "3.0"

services:
  - name: user-service
    url: http://user-service:8080
    connect_timeout: 5000
    read_timeout: 30000
    retries: 2
    routes:
      - name: users-route
        paths:
          - /api/v1/users
        methods:
          - GET
          - POST
          - PUT
        strip_path: true
    plugins:
      - name: rate-limiting
        config:
          minute: 100
          policy: redis
          redis_host: redis
      - name: jwt
        config:
          claims_to_verify:
            - exp

consumers:
  - username: mobile-app
    plugins:
      - name: rate-limiting
        config:
          minute: 5000  # Higher limit for mobile app
```
DB-less Mode for Kubernetes
In Kubernetes, DB-less mode is preferred. Configuration lives in declarative YAML (stored in Git), applied via Kong Ingress Controller or deck sync. No database dependency means simpler operations, faster startup, and GitOps-friendly workflows. The trade-off: no Admin API for dynamic changes — all changes go through config files.
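As an illustration, a minimal Docker Compose fragment for running Kong in DB-less mode might look like this (the image tag and file paths are assumptions, not a recommended setup):

```yaml
# Sketch: Kong in DB-less mode via Docker Compose (illustrative tag/paths)
services:
  kong:
    image: kong:3.6
    environment:
      KONG_DATABASE: "off"                      # no database — DB-less mode
      KONG_DECLARATIVE_CONFIG: /kong/kong.yaml  # declarative config file
      KONG_PROXY_LISTEN: "0.0.0.0:8000"
    volumes:
      - ./kong.yaml:/kong/kong.yaml:ro          # config lives in Git, mounted read-only
    ports:
      - "8000:8000"
```

Because the config file is mounted read-only and versioned in Git, every change is a commit plus a redeploy, which is exactly the GitOps trade-off described above.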
AWS API Gateway
AWS API Gateway is a fully managed service — no infrastructure to operate. It comes in two flavors: REST API (full-featured, more expensive) and HTTP API (simpler, cheaper, faster). Both integrate deeply with AWS services (Lambda, IAM, Cognito).
| Aspect | REST API | HTTP API |
|---|---|---|
| Price | $3.50/million requests | $1.00/million requests |
| Latency | Higher (more features) | Lower (optimized path) |
| Auth | IAM, Cognito, Lambda authorizer, API keys | IAM, Cognito, JWT authorizer |
| Throttling | Per-method, per-stage, per-key | Per-route, per-stage |
| WebSocket | Separate WebSocket API type | Not supported |
| Caching | Built-in response caching | Not available |
| Transformation | VTL request/response mapping | Simple parameter mapping |
| Best for | Full API management, complex transformations | Simple proxy, Lambda backends, cost-sensitive |
```jsonc
// CloudFormation (ApiGatewayV2) — HTTP API with CORS configuration
{
  "Type": "AWS::ApiGatewayV2::Api",
  "Properties": {
    "Name": "order-api",
    "ProtocolType": "HTTP",
    "CorsConfiguration": {
      "AllowOrigins": ["https://app.example.com"],
      "AllowMethods": ["GET", "POST", "PUT", "DELETE"],
      "AllowHeaders": ["Authorization", "Content-Type"],
      "MaxAge": 86400
    }
  }
}

// Route with JWT authorizer
{
  "Type": "AWS::ApiGatewayV2::Route",
  "Properties": {
    "ApiId": {"Ref": "OrderApi"},
    "RouteKey": "POST /orders",
    "AuthorizationType": "JWT",
    "AuthorizerId": {"Ref": "CognitoAuthorizer"},
    "Target": {"Fn::Join": ["/", ["integrations", {"Ref": "OrderLambdaIntegration"}]]}
  }
}
```
AWS API Gateway Limitations
- ❌ 29-second timeout maximum — not suitable for long-running operations
- ❌ 10 MB payload limit — large file uploads need presigned S3 URLs
- ❌ No WebSocket support on HTTP API type
- ❌ Vendor lock-in — deeply tied to AWS ecosystem
- ❌ Cold start latency when backed by Lambda
- ❌ Limited plugin/extension model compared to Kong or Envoy
Envoy & NGINX
Envoy Proxy
Envoy is a modern, high-performance proxy designed for cloud-native architectures. It's the data plane for Istio and the foundation of many API gateways (Ambassador/Emissary, Gloo). Its killer feature is dynamic configuration via xDS APIs — no restarts needed for config changes.
| Envoy Feature | Description |
|---|---|
| xDS API | Dynamic configuration — routes, clusters, listeners updated without restart |
| gRPC-native | First-class gRPC support including streaming and transcoding |
| Observability | Built-in stats, tracing (Zipkin/Jaeger), access logging |
| Filters | Extensible filter chain — Lua, Wasm, external processing |
| HTTP/2 & HTTP/3 | Full HTTP/2 support, experimental HTTP/3 (QUIC) |
| Service mesh | Foundation of Istio, used as sidecar proxy |
NGINX
NGINX is the battle-tested workhorse and one of the most widely deployed reverse proxies in the world. As a gateway it's extremely fast and stable, but configuration-driven: static config files, with a reload required for changes. NGINX Plus adds dynamic configuration, active health checks, and a monitoring dashboard.
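For contrast with the dynamic gateways, here is a minimal NGINX gateway sketch (upstream, zone, and host names are illustrative); every change to a file like this requires an `nginx -s reload`:

```nginx
# Minimal gateway config sketch — static files, graceful reload on change
upstream user_service {
    server user-service-1:8080;
    server user-service-2:8080;
    keepalive 32;                     # connection pool to upstreams
}

server {
    listen 80;
    server_name api.example.com;

    location /api/v1/users {
        limit_req zone=api burst=20;  # assumes a limit_req_zone named "api" in http {}
        proxy_pass http://user_service;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required for upstream keepalive
    }
}
```

The reload model is graceful (old workers drain, new workers pick up the config), but routes and upstream membership cannot change without touching files, which is the core operational difference from Envoy's xDS.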
```yaml
# Envoy bootstrap — connects to xDS control plane for dynamic config
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

dynamic_resources:
  lds_config:  # Listener Discovery Service
    api_config_source:
      api_type: GRPC
      grpc_services:
        - envoy_grpc:
            cluster_name: xds-cluster
  cds_config:  # Cluster Discovery Service
    api_config_source:
      api_type: GRPC
      grpc_services:
        - envoy_grpc:
            cluster_name: xds-cluster

static_resources:
  clusters:
    - name: xds-cluster
      connect_timeout: 5s
      type: STRICT_DNS
      load_assignment:
        cluster_name: xds-cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: control-plane
                      port_value: 18000
```
Envoy vs NGINX — When to Choose
Choose Envoy when: you need dynamic configuration (xDS), gRPC support, service mesh integration, or Wasm extensibility. Choose NGINX when: you need raw performance for simple proxying, have existing NGINX expertise, or want the simplest possible configuration. Envoy is more capable but more complex; NGINX is simpler but less dynamic.
Traefik & Others
Traefik
Traefik is a Kubernetes-native reverse proxy and API gateway. Its standout features are automatic service discovery and built-in Let's Encrypt certificate management. It watches Kubernetes Ingress resources and configures itself — zero manual route configuration.
| Gateway | Key Strength | Best For |
|---|---|---|
| Traefik | Auto-discovery, auto-TLS, Kubernetes-native | Kubernetes clusters, small-medium APIs |
| Ambassador/Emissary | Envoy-based, Kubernetes CRDs, developer-friendly | Kubernetes with Envoy features |
| Azure API Management | Full lifecycle management, Azure integration | Azure-native architectures |
| Apigee (Google) | Enterprise API management, analytics, monetization | Large enterprises, API-as-product |
| Tyk | Open-source, Go-based, GraphQL-native | GraphQL APIs, open-source preference |
| KrakenD | Ultra-high performance, stateless, no DB | Performance-critical, simple routing |
```yaml
# Traefik auto-discovers routes from Kubernetes Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: default-rate-limit@kubernetescrd,default-auth@kubernetescrd
spec:
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls  # Auto-managed by cert-manager
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/users
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080
          - path: /api/orders
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 8080
# Traefik automatically picks this up — no restart needed
```
Choosing a Gateway
For most teams: Kong (full API management, large plugin ecosystem) or AWS API Gateway (fully managed, serverless). For Kubernetes-native: Traefik (simplest) or Ambassador (Envoy-powered). For service mesh: Envoy (Istio data plane). For enterprise API-as-product: Apigee or Azure APIM. Start simple — you can always migrate to a more complex solution when you outgrow the simple one.
Comparison Table
A feature comparison across the major API Gateway implementations to help you choose the right tool for your requirements.
| Feature | Kong | AWS API GW | Envoy | NGINX | Traefik |
|---|---|---|---|---|---|
| Open Source | ✅ (Apache 2.0) | ❌ (Managed) | ✅ (Apache 2.0) | ✅ (BSD) / Plus | ✅ (MIT) |
| Path Routing | ✅ | ✅ | ✅ | ✅ | ✅ |
| Header Routing | ✅ | ✅ | ✅ | ✅ | ✅ |
| JWT Auth | ✅ Plugin | ✅ Built-in | ✅ Filter | ⚠️ Plus only | ✅ Middleware |
| Rate Limiting | ✅ Plugin | ✅ Built-in | ✅ Filter | ✅ Built-in | ✅ Middleware |
| gRPC | ✅ | ⚠️ Limited | ✅ Native | ✅ | ✅ |
| WebSocket | ✅ | ✅ (REST API) | ✅ | ✅ | ✅ |
| Dynamic Config | ✅ Admin API | ✅ (Managed) | ✅ xDS API | ❌ Reload | ✅ Auto-discovery |
| Kubernetes | ✅ KIC | ⚠️ External | ✅ (Istio) | ✅ Ingress | ✅ Native |
| Scaling Model | Horizontal | Auto (managed) | Horizontal | Horizontal | Horizontal |
Decision Framework
Ask these questions: (1) Do you need full API management (portal, plans, analytics)? → Kong Enterprise or Apigee. (2) Do you want zero ops? → AWS API Gateway. (3) Do you need gRPC-native + service mesh? → Envoy. (4) Do you want simplest Kubernetes setup? → Traefik. (5) Do you need maximum raw performance with minimal features? → NGINX or KrakenD.
High Availability & Scaling
The gateway is the most critical infrastructure component — if it goes down, everything goes down. High availability is non-negotiable.
| Principle | Implementation |
|---|---|
| No single point of failure | Minimum 2 instances across availability zones |
| Stateless gateway | All state in external stores (Redis, DB) — any instance can serve any request |
| Active-active | All instances serve traffic simultaneously (not active-passive) |
| Health-checked | Network load balancer health-checks gateway instances |
| Auto-scaling | Scale gateway instances based on CPU/connections/request rate |
| Graceful shutdown | Drain connections before terminating an instance |
```yaml
# Kubernetes deployment for HA gateway
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 3  # Minimum 3 for HA
  selector:
    matchLabels:
      app: api-gateway
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero downtime during updates
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      # Spread across availability zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-gateway
      # Anti-affinity — don't schedule on same node
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: api-gateway
              topologyKey: kubernetes.io/hostname
      containers:
        - name: gateway
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8001
            initialDelaySeconds: 5
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]  # Drain connections
```
The Hospital Emergency Room
Your gateway should be like a hospital ER — always open, always staffed, with backup generators and redundant systems. You don't have one doctor on call; you have a team across shifts. If one doctor gets sick, others cover. The gateway needs the same resilience: multiple instances, across zones, with automatic failover. Downtime is not an option for the front door of your system.
Capacity Planning
Size your gateway for 3x normal peak traffic. Why 3x? (1) Normal peak handles expected load. (2) 2x handles a traffic spike or one AZ going down. (3) 3x handles a spike during a partial outage. If your gateway can't absorb unexpected load, it becomes the bottleneck that causes the outage instead of preventing it.
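The 3x sizing rule reduces to a quick calculation. In this sketch, the per-instance throughput and utilization target are assumptions; replace them with numbers from your own benchmarks:

```python
import math

def gateway_instances(peak_rps: float, per_instance_rps: float,
                      headroom: float = 3.0, target_util: float = 0.6) -> int:
    """Instances needed so that headroom * peak fits at the target utilization."""
    required_capacity = peak_rps * headroom          # e.g. 3x normal peak
    usable_per_instance = per_instance_rps * target_util  # don't run instances hot
    return math.ceil(required_capacity / usable_per_instance)

# Assumed numbers: 100K req/s peak, ~40K req/s per instance, 60% CPU target
print(gateway_instances(100_000, 40_000))  # → 13
```

With no headroom and 100% utilization the same inputs give 3 instances, which illustrates why "it handles peak in the load test" is not a capacity plan.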
Configuration & Zero-Downtime Deployments
Gateway configuration changes (new routes, updated rate limits, plugin changes) must be applied without dropping requests. Zero-downtime configuration updates are essential for a component that handles all traffic.
| Approach | How It Works | Downtime |
|---|---|---|
| Hot reload | Gateway reloads config without restarting (NGINX: nginx -s reload) | Zero — existing connections maintained |
| Dynamic API | Push changes via Admin API (Kong, Envoy xDS) | Zero — applied immediately |
| Rolling update | Deploy new config to instances one at a time | Zero — if done correctly with drain |
| GitOps | Config in Git, CI/CD applies changes automatically | Zero — uses rolling update underneath |
| Blue-green config | Deploy new config to green, switch traffic | Zero — atomic switch |
```yaml
# GitOps workflow for gateway configuration
# 1. Developer pushes config change to Git
# 2. CI validates config (syntax, schema, dry-run)
# 3. CD applies to staging gateway
# 4. Automated tests verify staging
# 5. CD applies to production gateway (rolling)

# CI validation step
validate:
  script:
    # Kong deck validates declarative config
    - deck validate --state kong.yaml
    # Dry-run against staging
    - deck diff --state kong.yaml --kong-addr http://staging-gateway:8001

# CD deployment step
deploy:
  script:
    # Apply config with zero downtime
    - deck sync --state kong.yaml --kong-addr http://gateway:8001
    # Or for Kubernetes:
    - kubectl apply -f gateway-config.yaml  # Rolling update handles the rest
```
Graceful Shutdown
```text
# Graceful shutdown sequence for gateway instance:
# 1. Remove from load balancer (stop receiving new connections)
#    - Kubernetes: pod enters Terminating state
#    - NLB: health check fails → deregisters target
# 2. Wait for in-flight requests to complete
#    - preStop hook: sleep 15 (allow LB to deregister)
#    - Gateway drains: finish active requests (up to 30s)
# 3. Close idle connections
#    - Send Connection: close on keep-alive connections
# 4. Terminate process
#    - SIGTERM → graceful shutdown
#    - SIGKILL after grace period (30s) if still running

# Key: the sleep in preStop gives the load balancer time to
# stop sending new requests before the gateway starts draining.
```
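The SIGTERM half of this sequence can be sketched in application code. The handler and probe names below are illustrative, not any specific gateway's API:

```python
import signal
import threading

# Set once SIGTERM arrives; checked by the readiness probe and accept loop.
shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    # Step 1: flip readiness to failing so the LB deregisters this instance.
    # Step 2 (not shown): stop accepting connections, drain in-flight requests.
    shutting_down.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def is_ready() -> bool:
    """Readiness probe handler: report unhealthy once draining starts."""
    return not shutting_down.is_set()
```

The key design choice is that SIGTERM does not exit; it only starts the drain. The process exits on its own once in-flight work finishes, or the orchestrator sends SIGKILL after the grace period.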
Config Validation in CI
Never apply gateway config directly to production. Always: (1) Validate syntax in CI. (2) Dry-run against staging (deck diff, envoy validate). (3) Apply to staging and run integration tests. (4) Apply to production with rolling update. A bad gateway config (invalid route, broken plugin) can take down all traffic instantly. Treat gateway config with the same rigor as application code.
Interview Questions
Q: Compare Kong and AWS API Gateway. When would you choose each?
A: Kong: open-source, self-managed, runs anywhere (cloud, on-prem, Kubernetes), extensive plugin ecosystem, full control over configuration and scaling. AWS API Gateway: fully managed (zero ops), deep AWS integration (Lambda, IAM, Cognito), pay-per-request pricing, but vendor lock-in and limited customization. Choose Kong when: you need portability, custom plugins, or run on-prem. Choose AWS when: you're all-in on AWS, want zero ops, and your API patterns fit within its limitations (29s timeout, 10MB payload).
Q: How do you achieve zero-downtime gateway deployments?
A: Multiple layers: (1) Stateless gateway — no local state, all shared state in Redis/DB. (2) Rolling updates — update one instance at a time, never all simultaneously. (3) Graceful shutdown — drain in-flight requests before terminating (preStop hook + SIGTERM handling). (4) Health check integration — NLB stops sending traffic to terminating instances. (5) maxUnavailable: 0 in Kubernetes — never reduce below desired replica count during update. (6) Config validation in CI — catch bad configs before they reach production.
Q: Why is Envoy's xDS protocol significant for API Gateways?
A: xDS (discovery services) allows Envoy to receive configuration updates dynamically via gRPC — without restarts or config file reloads. This means: (1) New routes added instantly when services deploy. (2) Upstream endpoints updated in real-time as pods scale. (3) Rate limits and policies changed without touching the proxy. (4) A control plane (Istio, custom) manages configuration centrally and pushes to all Envoy instances. This is fundamentally different from NGINX's 'edit file, reload' model and enables true GitOps and automation at scale.
Q: How would you size and scale an API Gateway for 100K requests/second?
A: Sizing: (1) Benchmark single instance throughput (Kong on modern hardware: ~30-50K req/s). (2) Need 3-4 instances for 100K req/s at normal load. (3) Plan for 3x headroom: 9-12 instances. (4) Each instance: 4 CPU cores, 8GB RAM minimum. Scaling: (1) Horizontal auto-scaling on CPU utilization (target 60%). (2) Spread across 3 AZs. (3) Network load balancer in front (L4, not L7 — avoid double processing). (4) Connection pooling to upstreams (avoid connection storms). (5) Monitor: request rate, latency p99, connection count, CPU.
Q: What's the operational difference between DB-backed and DB-less gateway modes?
A: DB-backed (Kong + PostgreSQL): Admin API for dynamic changes, multiple nodes sync via DB, good for teams that need runtime flexibility. Operational cost: must manage and HA the database. DB-less (declarative YAML): config in Git, applied via CI/CD, no database dependency, faster startup, GitOps-friendly. Operational cost: no dynamic changes — all updates go through Git + deploy pipeline. Choose DB-less for Kubernetes (GitOps natural fit). Choose DB-backed for teams that need to make quick runtime changes without a deploy.
Common Mistakes
No graceful shutdown during deployments
Gateway instances are killed immediately during rolling updates — dropping in-flight requests and returning 502 errors to clients.
✅ Implement graceful shutdown: (1) preStop hook with sleep (15s) to allow LB deregistration. (2) Handle SIGTERM by stopping new connections and draining existing ones. (3) Set terminationGracePeriodSeconds high enough for long requests to complete. (4) Set maxUnavailable: 0 to never reduce capacity during updates.
Gateway config applied directly to production
Pushing gateway configuration changes directly to production without validation — a typo in a route regex takes down all traffic.
✅ Treat gateway config like code: validate in CI (syntax + schema), dry-run against staging, apply to staging with integration tests, then rolling deploy to production. A single bad config line can cause a total outage. Never skip validation.
Single availability zone deployment
All gateway instances run in one AZ — an AZ outage takes down the entire API.
✅ Deploy gateway instances across at minimum 2 (preferably 3) availability zones. Use topology spread constraints in Kubernetes or multi-AZ target groups in AWS. The gateway must survive a full AZ failure without degradation.
Choosing a gateway based on features alone
Selecting the most feature-rich gateway without considering operational complexity, team expertise, or actual requirements.
✅ Choose based on: (1) What you actually need today (not might need someday). (2) Team expertise — a gateway your team can't operate is worse than a simpler one they can. (3) Operational model — managed (AWS) vs self-managed (Kong). (4) Ecosystem fit — Kubernetes-native if you're on K8s. Start simple, migrate when you outgrow it.