# Scaling toward large fleets (design target: 600k+ devices, e.g. Teltonika FMC920 class)
Platform-wide capacity goals (including ~40M users on the application plane) are summarized in `../PLATFORM-MASTER.md`. This document focuses on the telematics ingest path (TCP gateway, parsers, storage, observability).
## TCP gateway horizontal scaling
- Run multiple `telematics-gateway` instances behind a TCP load balancer.
- Use session affinity (sticky sessions) so the same IMEI lands on the same node while the TCP connection stays open. If a node dies, devices reconnect; design parsers to be stateless except for the per-socket buffer.
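To keep parsers stateless beyond the per-socket buffer, each connection can own a small framing buffer that turns a raw TCP byte stream into complete AVL frames. The sketch below assumes a Teltonika-style TCP frame layout (4-byte zero preamble, 4-byte big-endian data length, payload, 4-byte CRC field); the exact framing and size cap are illustrative assumptions, not values taken from this document.

```python
import struct

MAX_BUFFER = 64 * 1024  # illustrative per-socket cap; oversized senders get dropped

class FrameBuffer:
    """Accumulates raw TCP bytes for one socket and yields complete frames.

    Assumed frame layout (Teltonika-style, illustrative):
    4-byte zero preamble + 4-byte big-endian payload length + payload + 4-byte CRC.
    """

    def __init__(self, max_size: int = MAX_BUFFER):
        self.buf = bytearray()
        self.max_size = max_size

    def feed(self, data: bytes) -> list[bytes]:
        """Add received bytes; return any complete payloads now available."""
        self.buf.extend(data)
        if len(self.buf) > self.max_size:
            # Backpressure: abusive or broken senders lose the connection.
            raise ConnectionAbortedError("per-socket buffer cap exceeded")
        frames = []
        while len(self.buf) >= 8:
            preamble, length = struct.unpack_from(">II", self.buf, 0)
            if preamble != 0:
                raise ValueError("bad preamble; reset connection")
            total = 8 + length + 4  # header + payload + CRC field
            if len(self.buf) < total:
                break  # wait for more bytes; this is the only per-socket state
            frames.append(bytes(self.buf[8:8 + length]))
            del self.buf[:total]
        return frames
```

Because all state lives in the buffer object tied to one socket, any gateway node can accept a reconnecting device with no shared session store.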
## Database
- Partition `device_positions` by time, or use TimescaleDB hypertables.
- Archive cold data to object storage if compliance allows.
- Use read replicas for reporting APIs; writes stay on primary (or sharded by tenant in multi-tenant designs).
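Time partitioning of `device_positions` can be automated by generating one range partition per month. The helper below emits illustrative PostgreSQL DDL; it assumes native range partitioning on a hypothetical `recorded_at` timestamp column (the actual column name and partition granularity are not specified in this document).

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """Emit illustrative DDL for one monthly range partition of device_positions.

    Assumes the parent table was created with
    PARTITION BY RANGE (recorded_at); names are hypothetical.
    """
    start = date(year, month, 1)
    # First day of the following month is the exclusive upper bound.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"device_positions_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} "
        f"PARTITION OF device_positions "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

A scheduled job can run this a month or two ahead so inserts never hit a missing partition; archiving then becomes detaching and exporting old partitions.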
## Metrics and SLOs
The gateway exposes Prometheus-style metrics on `METRICS_PORT` (default 9092):

- `telematics_tcp_connections_total`
- `telematics_avl_records_total`
- `telematics_parse_errors_total`
- `telematics_imei_rejected_total`
Define SLOs such as: p99 ingest latency < 2s from TCP receive to DB commit in steady state.
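For illustration, the counters above can be kept in a tiny registry and rendered in the Prometheus text exposition format; this is a sketch of what the endpoint on `METRICS_PORT` serves, not the gateway's actual implementation (a real deployment would use an official Prometheus client library).

```python
class Counters:
    """Minimal Prometheus-style counter registry with text exposition output."""

    def __init__(self):
        self.values: dict[str, int] = {}

    def inc(self, name: str, by: int = 1) -> None:
        """Increment a counter, creating it at zero on first use."""
        self.values[name] = self.values.get(name, 0) + by

    def render(self) -> str:
        """Render all counters in Prometheus text exposition format."""
        lines = []
        for name in sorted(self.values):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {self.values[name]}")
        return "\n".join(lines) + "\n"
```

The p99 ingest-latency SLO is then measured as a histogram over the same path (TCP receive timestamp to DB commit timestamp), alerted on when the steady-state quantile exceeds 2s.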
## Backpressure
- Per-socket maximum buffer size (the gateway drops or resets connections that exceed it).
- Rate limits at network edge (firewall, cloud security group).
- Queue depth alerts on `device_commands` if commands back up.
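Edge rate limits are normally enforced by the firewall or cloud security group, but the mechanism is worth sketching: a token bucket per source that allows short bursts while capping sustained rate. The parameters and the in-process placement below are illustrative assumptions.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens/second refill, `burst` capacity.

    A sketch of per-source edge rate limiting; in production this lives in
    the firewall or load balancer, not the gateway process.
    """

    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate = rate        # tokens refilled per second
        self.capacity = burst   # maximum burst size
        self.tokens = burst     # start full so reconnect storms get one burst
        self.now = now          # injectable clock for testing
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; False means throttle the source."""
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per source IP (or per IMEI after identification) bounds how fast any single misbehaving device can push records into the parse path.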
## Multi-tenant (future)
- Add `tenant_id` to `devices` and enforce row-level security in PostgreSQL.
- Separate API scopes per tenant; never mix tenants' data in dashboards.
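The row-level-security step can be expressed as DDL generated per tenant-scoped table. The helper below assumes one common PostgreSQL pattern, carrying the tenant id in an `app.tenant_id` session setting and a `uuid` `tenant_id` column; this document does not prescribe that pattern, so treat the names as hypothetical.

```python
def rls_policy_ddl(table: str) -> str:
    """Illustrative PostgreSQL row-level-security DDL for one tenant-scoped table.

    Assumes a uuid tenant_id column and that the application sets the
    app.tenant_id session variable per connection (hypothetical convention).
    """
    return "\n".join([
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        f"CREATE POLICY tenant_isolation ON {table}",
        "  USING (tenant_id = current_setting('app.tenant_id')::uuid);",
    ])
```

With the policy in place, every query on the table is filtered to the session's tenant even if application code forgets a `WHERE tenant_id = ...` clause, which is what keeps dashboards from mixing data.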
## What not to optimize prematurely
- Full codec parity with the vendor's cloud before you have staging device volume and packet captures.