# Scaling toward large fleets (design target: 600k+ devices, e.g. Teltonika FMC920 class)
Platform-wide capacity goals (including ~40M users on the application plane) are summarized in `../PLATFORM-MASTER.md`. This document focuses on the telematics ingest path (TCP gateway, parsers, storage, observability).
## TCP gateway horizontal scaling
- Run multiple `telematics-gateway` instances behind a TCP load balancer.
- Use session affinity (sticky sessions) so the same IMEI lands on the same node while the TCP connection stays open. If a node dies, devices reconnect; design parsers to be stateless except for the per-socket buffer.
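To keep parsers stateless beyond the per-socket buffer, each connection can own a small framing buffer that turns a raw TCP byte stream into complete AVL frames. The sketch below assumes a Teltonika-style TCP frame layout (4-byte zero preamble, 4-byte big-endian data length, payload, 4-byte CRC field); the exact framing and size cap are illustrative assumptions, not values taken from this document.

```python
import struct

MAX_BUFFER = 64 * 1024  # illustrative per-socket cap; oversized senders get dropped

class FrameBuffer:
    """Accumulates raw TCP bytes for one socket and yields complete frames.

    Assumed frame layout (Teltonika-style, illustrative):
    4-byte zero preamble + 4-byte big-endian payload length + payload + 4-byte CRC.
    """

    def __init__(self, max_size: int = MAX_BUFFER):
        self.buf = bytearray()
        self.max_size = max_size

    def feed(self, data: bytes) -> list[bytes]:
        """Add received bytes; return any complete payloads now available."""
        self.buf.extend(data)
        if len(self.buf) > self.max_size:
            # Backpressure: abusive or broken senders lose the connection.
            raise ConnectionAbortedError("per-socket buffer cap exceeded")
        frames = []
        while len(self.buf) >= 8:
            preamble, length = struct.unpack_from(">II", self.buf, 0)
            if preamble != 0:
                raise ValueError("bad preamble; reset connection")
            total = 8 + length + 4  # header + payload + CRC field
            if len(self.buf) < total:
                break  # wait for more bytes; this is the only per-socket state
            frames.append(bytes(self.buf[8:8 + length]))
            del self.buf[:total]
        return frames
```

Because all state lives in the buffer object tied to one socket, any gateway node can accept a reconnecting device with no shared session store.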
## Database
- Partition `device_positions` by time, or use TimescaleDB hypertables.
- Archive cold data to object storage if compliance allows.
- Use read replicas for reporting APIs; writes stay on primary (or sharded by tenant in multi-tenant designs).
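Time partitioning of `device_positions` can be automated by generating one range partition per month. The helper below emits illustrative PostgreSQL DDL; it assumes native range partitioning on a hypothetical `recorded_at` timestamp column (the actual column name and partition granularity are not specified in this document).

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """Emit illustrative DDL for one monthly range partition of device_positions.

    Assumes the parent table was created with
    PARTITION BY RANGE (recorded_at); names are hypothetical.
    """
    start = date(year, month, 1)
    # First day of the following month is the exclusive upper bound.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"device_positions_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} "
        f"PARTITION OF device_positions "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

A scheduled job can run this a month or two ahead so inserts never hit a missing partition; archiving then becomes detaching and exporting old partitions.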
## Metrics and SLOs
The gateway exposes Prometheus-style metrics on `METRICS_PORT` (default 9092):

- `telematics_tcp_connections_total`
- `telematics_avl_records_total`
- `telematics_parse_errors_total`
- `telematics_imei_rejected_total`
Define SLOs such as: p99 ingest latency < 2s from TCP receive to DB commit in steady state.
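For illustration, the counters above can be kept in a tiny registry and rendered in the Prometheus text exposition format; this is a sketch of what the endpoint on `METRICS_PORT` serves, not the gateway's actual implementation (a real deployment would use an official Prometheus client library).

```python
class Counters:
    """Minimal Prometheus-style counter registry with text exposition output."""

    def __init__(self):
        self.values: dict[str, int] = {}

    def inc(self, name: str, by: int = 1) -> None:
        """Increment a counter, creating it at zero on first use."""
        self.values[name] = self.values.get(name, 0) + by

    def render(self) -> str:
        """Render all counters in Prometheus text exposition format."""
        lines = []
        for name in sorted(self.values):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {self.values[name]}")
        return "\n".join(lines) + "\n"
```

The p99 ingest-latency SLO is then measured as a histogram over the same path (TCP receive timestamp to DB commit timestamp), alerted on when the steady-state quantile exceeds 2s.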
## Backpressure
- Per-socket maximum buffer size (the gateway drops or resets connections that exceed it).
- Rate limits at network edge (firewall, cloud security group).
- Queue depth alerts on `device_commands` if commands back up.
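Edge rate limits are normally enforced by the firewall or cloud security group, but the mechanism is worth sketching: a token bucket per source that allows short bursts while capping sustained rate. The parameters and the in-process placement below are illustrative assumptions.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens/second refill, `burst` capacity.

    A sketch of per-source edge rate limiting; in production this lives in
    the firewall or load balancer, not the gateway process.
    """

    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate = rate        # tokens refilled per second
        self.capacity = burst   # maximum burst size
        self.tokens = burst     # start full so reconnect storms get one burst
        self.now = now          # injectable clock for testing
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; False means throttle the source."""
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per source IP (or per IMEI after identification) bounds how fast any single misbehaving device can push records into the parse path.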
## Multi-tenant (future)
- Add `tenant_id` to `devices` and enforce row-level security in PostgreSQL.
- Separate API scopes per tenant; never mix tenants' data in dashboards.
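The row-level-security step can be expressed as DDL generated per tenant-scoped table. The helper below assumes one common PostgreSQL pattern, carrying the tenant id in an `app.tenant_id` session setting and a `uuid` `tenant_id` column; this document does not prescribe that pattern, so treat the names as hypothetical.

```python
def rls_policy_ddl(table: str) -> str:
    """Illustrative PostgreSQL row-level-security DDL for one tenant-scoped table.

    Assumes a uuid tenant_id column and that the application sets the
    app.tenant_id session variable per connection (hypothetical convention).
    """
    return "\n".join([
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        f"CREATE POLICY tenant_isolation ON {table}",
        "  USING (tenant_id = current_setting('app.tenant_id')::uuid);",
    ])
```

With the policy in place, every query on the table is filtered to the session's tenant even if application code forgets a `WHERE tenant_id = ...` clause, which is what keeps dashboards from mixing data.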
## What not to optimize prematurely
- Full codec parity with the vendor's cloud before you have staging device volume and packet captures.