Early-stage software. Shurli is experimental and built with AI assistance. It will have bugs. Not recommended for production or safety-critical use. Read the disclaimer.
From Broken to Bulletproof: How Chaos Testing Transformed Shurli's Network Layer

From Broken to Bulletproof: How Chaos Testing Transformed Shurli's Network Layer

March 14, 2026·
Satinder Grewal

Before and after: network transitions went from broken (manual restarts needed) to bulletproof (zero restarts, all automatic)

The problem we solved

Before this work, switching networks could break Shurli. Move from your home WiFi to your phone’s hotspot? The daemon might stay stuck on a dead connection. Connect a VPN? Total isolation. Switch between two mobile hotspots? The daemon didn’t even notice. In each case, the only recovery was restarting the daemon.

This is the kind of problem you don’t find in unit tests. It only shows up when real hardware switches between real networks.

What this means for you

The short version: Switch networks freely. Shurli handles it.

You can walk from your home WiFi to a coffee shop, tether to your phone on the train, plug into a wired LAN at the office, connect a VPN for privacy, and Shurli stays connected through all of it. No restarts. No manual intervention. The daemon detects every change, cleans up, and reconnects automatically.

Every transition is now automatic: what you do on the left, what Shurli does on the right

Every one of these required a daemon restart before. Now: zero restarts across 16 test cases.

The numbers

MetricValue
Networks tested5 ISPs + 3 VPN providers
Test cases16 distinct transitions
Root causes found11 (all fixed)
Post-chaos flags8 investigated (6 fixed, 2 informational)
Engineering time40+ hours across 4 days
Code commits14
Daemon restarts needed after fixes0

How we got here

The testing method

No simulations. A MacBook switching between real networks, a home server on a local network, two relay servers in different countries. Real satellite WiFi, real 5G hotspots, real VPN tunnels. Switch a network, watch the daemon logs, check if it recovered, fix what broke, rebuild, switch again.

Each fix exposed the next layer - the layered discovery pattern

Why it took 40+ hours

Each fix exposed the next problem. Fixing the transport blocker revealed a dial cache bug. Fixing the dial cache revealed a VPN blind spot. Fixing VPN detection revealed that the daemon couldn’t see switches between two mobile hotspots (both private IPv4, no change visible to the monitor). Every layer peeled back to reveal something deeper.

We also hit hurdles that ate real time:

Operating system surprises: macOS 15 silently blocks daemon TCP connections to local network IPs. The daemon code was correct, the network was reachable, but connections failed with “no route to host.” Hours of investigation before identifying it as a macOS privacy setting, not a code bug.

Wrong assumptions: We initially thought satellite WiFi ARP resolution (60-80 seconds to resolve a LAN address) was causing slow reconnections. After tracing through libp2p’s source code, the real cause was completely different: a shared hashmap caching stale dial errors. The wrong assumption cost investigation time but led us to a deeper, more important fix.

Platform-specific behavior: The macOS route socket was listening for the right events for IPv6 switches, but WiFi hotspot switches fire different message types. And the system command needed its full path (/sbin/route) because the daemon runs under launchd, which has a minimal PATH. Two separate discoveries, each requiring a deploy-test-discover cycle.

Technical highlights

How the recovery chain works: detect, clean, reset, reconnect - all automatic

For the technically curious: here’s what we found inside libp2p that needed fixing, and how we fixed each one.

Transport state that outlives network switches

libp2p tracks which transports work and which don’t. After enough IPv6 failures (e.g., on a CGNAT network), it decides “IPv6 is broken” and blocks all IPv6 dials. This state persists even after you switch to a network with working IPv6. The daemon stays on relay when a direct path exists.

Fix: reset the transport state on every network change. A WiFi switch is a definitive event that invalidates all prior transport knowledge.

Shared dial cache poisoning

libp2p creates one dial worker per peer. All dial attempts share a cache. If any subsystem (PeerManager, DHT discovery, autonat) dials a stale address and fails, that error is cached. When mDNS discovers the peer on the new network and tries to dial, it gets the cached error without ever actually dialing.

Fix: three coordinated changes. Strip stale addresses from the peerstore before reconnect (prevents cache poisoning by other subsystems). Probe TCP reachability before calling libp2p’s dial path (prevents self-poisoning). Retry with backoff when the probe fails during WiFi settling.

Invisible network switches

The network monitor only watched global IP addresses. Three categories of switches were invisible: VPN tunnels (private IPv4 only), CGNAT-to-CGNAT carrier switches (no global IPs on either), and WiFi hotspot switches on macOS (fires route changes, not address events).

Fix: three new detection methods. VPN tunnel interface detection by name pattern. Default gateway tracking by parsing the routing table. Route socket expanded to listen for route changes alongside address events.

Relay reconnection defaults

libp2p’s autorelay subsystem is tuned for DHT-discovered relays: 1-hour retry backoff, 30-second query interval, 3-minute boot delay. For known static relays, these defaults meant 30+ second reconnection after a network change.

Fix: backoff reduced to 30 seconds, query interval to 5 seconds, boot delay eliminated. Relay reconnection now takes 5-10 seconds.

All 11 root causes

#Root CauseFix
1Transport blocker persists across network switchesReset on network change
2Probe tests relay server IP instead of peer IPSkip circuit addresses in probe targets
3Direct dial tries all addresses, cascade failureConstrain to confirmed address only
4LAN cleanup fights remote node’s reconnectLet direct and relay coexist
5Stale connection checker misses private IPsCheck all interface IPs
6Relay dropped on networks with public IPv6Permanent relay fallback
7LAN upgrade poisoned by transport blockerTCP-only filter for LAN paths
8Valid IPv6 killed during address verification windowGrace period for new addresses
91-hour relay retry backoffReduced to 30 seconds
10Reconnect cooldown blocks recovery after disconnectClear cooldown on disconnect and network change
11Closed connection reported as live for 57 secondsCheck if local IP still exists on any interface

What’s next

The network layer is now tested against every physical transition we can reproduce. File transfer testing across these same transitions comes next. After that: wiring the reputation system for autonomous peer trust decisions, and the Path Memory feature for scored path caching per network fingerprint.