Architecture
UDProbe consists of two main components: the Reflector and the Collector. These work together to measure network loss and latency between distributed endpoints.
Overview
+----------- + +-----------+ +-----------+
| | | | UDP Probes | |
| | | | =============> | |
| | =======> | | | |
| Prometheus | Scrapes | COLLECTOR | Reflected Probes | REFLECTOR |
| | | | <============= | |
| | | | | |
| | | | | |
+----------- + +-----------+ +-----------+
:5200/metrics :8100 (for probes)
:8200/metrics (for reflector metrics)
Components
Reflector
The reflector is a lightweight daemon that receives UDP probes and immediately sends them back to their source. It's designed to have minimal overhead and fast response times.
Responsibilities: - Listen for incoming UDP probes on a configurable port (default 8100) - Unmarshal the probe to validate it's a valid UDProbe packet - Add a receive timestamp to the probe - Re-marshal and reflect the probe back to the sender - Expose Prometheus metrics on port 8200
Prometheus Metrics:
| Metric | Type | Description |
|---|---|---|
udprobe_reflector_packets_received_total |
Counter | Total UDP packets received |
udprobe_reflector_packets_reflected_total |
Counter | Packets successfully reflected |
udprobe_reflector_packets_bad_data_total |
Counter | Malformed packets received |
udprobe_reflector_packets_throttled_total |
Counter | Packets dropped due to rate limiting |
udprobe_reflector_tos_changes_total |
Counter | ToS bit changes on the socket |
udprobe_reflector_up |
Gauge | Health status |
Collector
The collector is responsible for sending probes, tracking their responses, and exposing metrics for Prometheus to scrape.
Responsibilities: - Load and parse configuration (targets, ports, rate limits) - Send UDP probes to configured reflectors at defined rates - Track in-flight probes and their send times - Receive reflected probes and calculate round-trip times - Summarize results at configurable intervals - Expose Prometheus metrics on port 5200
Data Flow:
- Configuration Loading - Collector reads YAML config defining targets, ports, and rate limits
- TestRunner Creation - Creates TestRunner instances for each configured test
- Probe Sending - TestRunner sends probes at configured rate using multiple ports
- Probe Tracking - Sent probes are tracked with timestamps in an in-flight map
- Reflection Receipt - Returned probes are matched with sent probes
- Summarization - Results are aggregated and metrics updated at configured intervals
Prometheus Metrics:
| Metric | Type | Description |
|---|---|---|
udprobe_packet_loss_percentage |
Gauge | Packet loss percentage |
udprobe_packets_sent |
Gauge | Packets sent in period |
udprobe_packets_lost |
Gauge | Packets lost in period |
udprobe_rtt |
Gauge | Average RTT in milliseconds |
Metric Labels:
src_ip- Source IP address of the collectordst_ip- Destination IP address of the reflectortos- The TOS byte value of the probes being sentsrc_hostname- Source hostname (from config tags)dst_hostname- Destination hostname (from config tags)
Prometheus Server
The prometheus server is repsonsible for scraping stats from the Collector (and Reflector if you want those stats)
Protocol
UDProbe uses Protocol Buffers for packet serialization. The probe message contains:
- UUID - Unique identifier for tracking
- Sent - Timestamp when probe was sent (Unix nanoseconds)
- Rcvd - Timestamp when probe was received by reflector (Unix nanoseconds)
- Tos - Type of Service byte
Design Decisions
Why UDP?
- ECMP Hashing - UDP allows for ECMP path hashing, useful for testing multiple network paths
- Per-packet Granularity - Unlike TCP, each UDP packet is independent
- No Setup/Teardown - No handshake overhead, faster probing
- ICMP Limitations - ICMP can be rate-limited or routed differently than data traffic
- No Extra Permissions - UDP packets can be crafted and sent without extra permission unlike ICMP which typically requires root privlege.
Why Prometheus?
- Widely Used - De facto standard for infrastructure and application monitoring
- Supported Datasource for Grafana - Allows for easy dashboarding and querying
- Powerful Query Language - PromQL allows for querying and transformation of data in powerful ways. Useful for alerting and SLA/SLO monitoring
- Alertmanager - Bundled alerting with most features / integrations teams commonly want baked in