Network Topology Testing: The Edge Cases Teams Skip

April 05, 2026 · 9 min read · Performance

Production Networks Are Hostile Environments, Not Slow Wi-Fi

Your test suite probably validates 200ms latency against a local WireMock instance. That's not network testing; that's testing whether JSON parsing works. Real users traverse IPv6-only cellular backhauls behind captive portals with 1420-byte MTU VPN tunnels that drop every third packet during tower handoffs. If your test plan doesn't include programmatically enabling airplane mode mid-transaction or forcing a Radio Access Technology (RAT) switch from 5G NSA to LTE during an SSL handshake, you're not testing network topology—you're testing localhost with extra steps.

The gap between "works on my machine" and "works in a Mumbai metro station" isn't bandwidth. It's state machines. Network topology testing isn't about speed; it's about Byzantine failure modes that occur when layer 3 assumptions meet layer 2 reality.

Captive Portals: The Walled Garden You Can't Detect

iOS and Android handle captive portals through CNA (Captive Network Assistant) and CaptivePortalLogin respectively, but your app's network stack lives in a parallel universe. When a user connects to Starbucks_WiFi, the OS launches a browser overlay to authenticate, but your OkHttp 4.12.0 client with a 30-second read timeout has already fired a POST request. Over plain HTTP, that request hits a WISPr 2.0 redirect to http://10.1.10.1/login.html, returns a 200 OK with HTML instead of JSON, and your Gson parser throws a JsonSyntaxException. Over HTTPS, the portal cannot present a valid certificate for your API host, so the same request fails in the TLS handshake instead. Either way, the user sees "Unexpected Error," not "Please sign into Wi-Fi."

The detection mechanism matters. On Android 11+, ConnectivityManager provides getCaptivePortalUrl(), but it's only reliable if you request NetworkCapabilities.NET_CAPABILITY_VALIDATED. Prior to API 29, you had to probe known endpoints like http://clients3.google.com/generate_204 yourself, because the DHCP captive-portal option from RFC 7710 (option 160, moved to option 114 by its successor RFC 8910) was not surfaced to apps. iOS is murkier: CNCopyCurrentNetworkInfo returns SSID data only if your app has the Access WiFi Information entitlement, and NEHotspotConfiguration doesn't expose portal state at all.

The Test Vector:

Don't mock the portal. Use a real one. Configure a Ubiquiti UniFi Dream Machine (UDM-Pro v3.2.12) with a guest network enabled and 1-hour authentication timeouts. Programmatically connect test devices using adb shell cmd wifi connect-network (Android 10+) or XCUITest for iOS (the older UIAutomation framework is no longer shipped with Xcode). Then trigger the race condition:


// Android: The naive implementation that fails in portals
val request = Request.Builder()
    .url("https://api.yourservice.com/v1/payment")
    .post(paymentBody)
    .build()

client.newCall(request).execute().use { response ->
    // If behind portal, response.body is HTML, not JSON
    val payment = gson.fromJson(response.body?.string(), Payment::class.java)
}

The fix isn't better parsing; it's pre-flight validation using NetworkCapabilities.NET_CAPABILITY_CAPTIVE_PORTAL:


val network = connectivityManager.activeNetwork
val caps = connectivityManager.getNetworkCapabilities(network)
// Set while the OS still sees a sign-in page between the device and the internet
if (caps?.hasCapability(NetworkCapabilities.NET_CAPABILITY_CAPTIVE_PORTAL) == true) {
    // Queue request until NetworkCallback detects NET_CAPABILITY_VALIDATED
}
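
On devices or code paths where the capability flag is not available, the probe approach mentioned earlier can be sketched as follows. This is a minimal illustration, not the OS's own validation logic: ProbeResult, classifyProbe, and probe are hypothetical names, and on Android the cleartext request would additionally need to be allowed by the app's network security configuration.

```kotlin
import java.io.IOException
import java.net.HttpURLConnection
import java.net.URL

/** Outcome of a connectivity probe. Illustrative names, not an Android API. */
enum class ProbeResult { VALIDATED, CAPTIVE_PORTAL, INCONCLUSIVE, NO_CONNECTIVITY }

/**
 * A generate_204-style endpoint answers with HTTP 204. A 2xx/3xx answer of
 * any other kind suggests something on the path rewrote the response.
 */
fun classifyProbe(statusCode: Int): ProbeResult = when (statusCode) {
    204 -> ProbeResult.VALIDATED
    in 200..399 -> ProbeResult.CAPTIVE_PORTAL // portal page or redirect to one
    else -> ProbeResult.INCONCLUSIVE
}

/** Performs the probe over plain HTTP so a portal is able to intercept it. */
fun probe(url: String = "http://clients3.google.com/generate_204"): ProbeResult =
    try {
        val conn = URL(url).openConnection() as HttpURLConnection
        conn.instanceFollowRedirects = false // a redirect is itself the signal
        conn.connectTimeout = 5_000
        conn.readTimeout = 5_000
        conn.useCaches = false
        try {
            classifyProbe(conn.responseCode)
        } finally {
            conn.disconnect()
        }
    } catch (e: IOException) {
        ProbeResult.NO_CONNECTIVITY
    }
```

Running the probe before a payment request, and queueing the request when it reports CAPTIVE_PORTAL, keeps HTML out of the JSON parser.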

SUSA's autonomous personas encounter this during mall Wi-Fi sessions. One persona connects, authenticates through the captive portal, then disconnects 15 minutes later when the session expires mid-upload. The platform logs the SSL handshake failure against the portal's IP, not your API's, distinguishing between server errors and network topology traps.

VPN Intermittency: Split Tunneling and MTU Black Holes

Corporate VPNs don't just add latency; they rewrite routing tables. When a user enables WireGuard 1.0.15 with a 0.0.0.0/0 allowed-ips configuration on Android, the OS creates a tunnel interface with MTU 1420 (1500 minus WireGuard's worst-case 80-byte overhead over IPv6). Any packet that still exceeds a smaller-MTU hop on the path goes out with Don't Fragment set, the router at that hop replies with ICMP "Fragmentation Needed," and a middlebox drops the reply. The sender never learns the path MTU, so you get a TCP timeout, not an error.

Worse is split tunneling. A finance app might route api.bankservice.com through the VPN tunnel (tun0) while Firebase Cloud Messaging uses the physical Wi-Fi interface (wlan0). If the VPN drops during a request (common during cellular-to-WiFi handoffs), the socket doesn't immediately error out. A new connection hangs in SYN_SENT until the SYN retries are exhausted: roughly 127 seconds with the mainline Linux default (sysctl net.ipv4.tcp_syn_retries = 6), or about 63 seconds on kernels tuned to 5 retries.
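
The length of that hang can be estimated with a little arithmetic. The sketch below assumes a simplified model: an initial retransmission timeout of 1 second that doubles after every unanswered SYN. synTimeoutSeconds is a hypothetical helper, and real kernels (especially vendor-tuned Android kernels) may use different values.

```kotlin
/**
 * Total seconds connect() waits before giving up, assuming an initial
 * retransmission timeout that doubles after every unanswered SYN.
 */
fun synTimeoutSeconds(synRetries: Int, initialRtoSeconds: Long = 1): Long {
    require(synRetries >= 0) { "synRetries must be non-negative" }
    // One wait for the initial SYN plus one for each retry: 1 + 2 + 4 + ...
    return (0..synRetries).sumOf { initialRtoSeconds shl it }
}

fun main() {
    println(synTimeoutSeconds(5)) // 63
    println(synTimeoutSeconds(6)) // 127
}
```

Under this model the retry count alone can roughly double the hang, which is why an application-level connect timeout should not be left to the kernel default.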

Testing the Transition:

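Use adb shell cmd connectivity airplane-mode enable (recent Android releases) to kill all radios; adb shell settings put global airplane_mode_on 1 on its own only writes the setting and does not toggle the radios until the system is notified. Then toggle VPN states via adb shell cmd connectivity set-vpn-state. But that's binary. Real intermittency requires packet loss injection.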

On macOS, use Network Link Conditioner (Hardware IO Tools for Xcode 15). On Linux, tc (iproute2 v6.5+) provides surgical precision:


# Add 30% packet loss to VPN interface only
tc qdisc add dev tun0 root netem loss 30%

# Latency spike during RAT change simulation
tc qdisc change dev tun0 root netem delay 2000ms 500ms distribution normal

Android's emulator supports extended network controls via telnet (telnet localhost 5554), but it doesn't simulate VPN interface creation. For that, use a physical device with a test VPN profile and the VpnService API to programmatically establish/terminate tunnels during Espresso tests.

The critical assertion: verify that IOException subclasses are distinguishable. An SSLPeerUnverifiedException (certificate pinned to corp CA) requires different UX than a SocketTimeoutException (MTU black hole). If your retry logic treats both as transient network errors, you'll hammer the VPN gateway with retries during a certificate mismatch, draining battery and burning CPU.
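
One way to make that distinction explicit is a small classifier that the retry logic consults. This is a sketch only: FailureKind, classify, and isRetryable are hypothetical names, and the mapping from exception to policy is an assumption to adapt per app.

```kotlin
import java.io.IOException
import java.net.SocketTimeoutException
import java.net.UnknownHostException
import javax.net.ssl.SSLPeerUnverifiedException

/** Coarse failure categories; illustrative, not part of OkHttp or Android. */
enum class FailureKind { TRUST_FAILURE, DNS_FAILURE, TIMEOUT, OTHER }

/** Maps an IOException to a category using its concrete subclass. */
fun classify(e: IOException): FailureKind = when (e) {
    is SSLPeerUnverifiedException -> FailureKind.TRUST_FAILURE // pin or CA mismatch
    is UnknownHostException -> FailureKind.DNS_FAILURE
    is SocketTimeoutException -> FailureKind.TIMEOUT           // e.g. MTU black hole
    else -> FailureKind.OTHER
}

/** A certificate mismatch will not fix itself, so retrying only wastes battery. */
fun isRetryable(kind: FailureKind): Boolean = kind != FailureKind.TRUST_FAILURE
```

A test can then assert on the category rather than on exception messages, which tend to vary between Android releases.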

Dual-SIM Failover: The SubscriptionId Trap

Modern flagships (Pixel 8, Galaxy S24, iPhone 15 Pro) support Dual SIM Dual Standby (DSDS), but the failover logic is rarely tested. When the primary data SIM (SubscriptionId 1) loses signal—say, entering a parking garage—the OS switches data to SubscriptionId 2. However, existing TCP connections bound to the first SIM's network interface don't migrate. They stall.

Android's SubscriptionManager (API 22+) exposes getActiveSubscriptionInfoList(), but ConnectivityManager returns separate Network objects for each SIM. If your app uses bindProcessToNetwork() or setNetworkPreference(), you're responsible for socket migration.

The Edge Case:

User starts a 100MB file upload on Wi-Fi. Wi-Fi drops; OS fails over to SIM 1 (AT&T). Mid-upload, the AT&T connection degrades to an unusable data rate due to congestion, and the OS switches data to SIM 2 (T-Mobile). The upload socket was bound to the AT&T interface via Network.getSocketFactory(). It hangs. The user sees 0KB/s indefinitely because your ProgressListener tracks bytes written to the buffer, not ACKs received from the server.
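
A stall detector driven by server-confirmed progress, rather than buffer writes, makes this failure observable. The sketch below is an assumption about how such a detector could look: StallDetector is a hypothetical class, and "confirmed bytes" is whatever your protocol can actually acknowledge (for example, chunk-level 2xx responses in a resumable upload).

```kotlin
/**
 * Flags a transfer as stalled when the server-confirmed byte count has not
 * advanced within [stallThresholdMs]. Feed it confirmed offsets, not bytes
 * written to the socket buffer.
 */
class StallDetector(
    private val stallThresholdMs: Long,
    private val clock: () -> Long = System::currentTimeMillis,
) {
    private var lastConfirmedBytes = 0L
    private var lastProgressAtMs = clock()

    /** Records progress only when the confirmed total actually grows. */
    fun onBytesConfirmed(totalConfirmedBytes: Long) {
        if (totalConfirmedBytes > lastConfirmedBytes) {
            lastConfirmedBytes = totalConfirmedBytes
            lastProgressAtMs = clock()
        }
    }

    /** True once no confirmed progress has been seen for the threshold. */
    fun isStalled(): Boolean = clock() - lastProgressAtMs >= stallThresholdMs
}
```

When isStalled() turns true, the app can close the bound socket and reopen the transfer on the currently active network instead of waiting for TCP to give up. The injectable clock keeps the detector testable without sleeping.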

Test this with adb shell cmd phone data-enable 0 (disable SIM 1) during an active transfer. iOS is trickier; you need the CoreTelephony framework and a jailbroken device or automated UI testing to toggle cellular plans via Settings.app automation.

The fix requires listening to TelephonyCallback.ServiceStateListener and gracefully closing sockets when dataNetworkType changes from LTE to UNKNOWN, rather than waiting for TCP timeouts. On iOS, CTCellularData provides limited state, but NWPathMonitor (Network.framework) is more reliable for detecting interface viability changes.

IPv6-Only Networks: 464XLAT and Happy Eyeballs

T-Mobile US and Verizon LTE are IPv6-only in many markets. They deploy 464XLAT (RFC 6877) with CLAT (Customer-side Translator) on the device. Your app sees an IPv6 address, but the destination server is IPv4-only. The OS translates IPv4 packets to IPv6 using the well-known prefix 64:ff9b::/96, routes them to the PLAT (Provider-side Translator), and masquerades the return traffic.
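
The address mapping itself is mechanical: with a /96 prefix, the 32-bit IPv4 address is placed in the last four bytes of the IPv6 address (RFC 6052). A minimal sketch, assuming the well-known prefix rather than a carrier-specific one; synthesizeNat64 is a hypothetical helper name.

```kotlin
import java.net.Inet4Address
import java.net.InetAddress

/** First 12 bytes of the well-known NAT64 prefix 64:ff9b::/96. */
private val WELL_KNOWN_PREFIX = byteArrayOf(
    0x00, 0x64, 0xff.toByte(), 0x9b.toByte(), 0, 0, 0, 0, 0, 0, 0, 0
)

/** Embeds an IPv4 address in the last 32 bits of the /96 prefix. */
fun synthesizeNat64(ipv4: Inet4Address): InetAddress =
    InetAddress.getByAddress(WELL_KNOWN_PREFIX + ipv4.address)

fun main() {
    // 192.0.2.33 is a documentation address; parsing a literal performs no DNS lookup
    val v4 = InetAddress.getByName("192.0.2.33") as Inet4Address
    println(synthesizeNat64(v4).hostAddress)
}
```

In tests, this is useful for checking that logged connection targets on an IPv6-only network are synthesized forms of the server's IPv4 address rather than unrelated hosts.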

Things break when:

  1. DNS64 synthesis fails: If your DNS resolver (not the carrier's) returns A records only, and the app attempts IPv4 connection on an IPv6-only network, the socket fails immediately with ENETUNREACH.
  2. Happy Eyeballs race conditions: RFC 8305 recommends a 250ms Connection Attempt Delay between the IPv6 and IPv4 attempts, and URLSession implements a variant of it; OkHttp 4.12.0 tries addresses sequentially unless you move to the fast-fallback mode in the 5.x line. If the IPv6 path has 50% packet loss (common in congested cells), a single lucky SYN can win the race, and the client commits to the broken path before IPv4 is attempted.
  3. CLAT overhead: The translation adds ~20-40ms latency per RTT, but more critically, it reduces effective MTU by 20 bytes, because each 20-byte IPv4 header is translated into a 40-byte IPv6 header. If your TLS ClientHello with ALPN extensions exceeds 1420 bytes, you hit the same MTU black hole as VPN scenarios.
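
The MTU arithmetic behind the third item is easy to check. The sketch below ignores TCP options and uses the standard header sizes (20 bytes for IPv4, 40 for IPv6, 20 for TCP); maxSegmentSize is a hypothetical helper.

```kotlin
const val IPV4_HEADER = 20
const val IPV6_HEADER = 40
const val TCP_HEADER = 20

/** Largest TCP payload per segment for a given link MTU, ignoring TCP options. */
fun maxSegmentSize(mtu: Int, ipv6: Boolean): Int =
    mtu - (if (ipv6) IPV6_HEADER else IPV4_HEADER) - TCP_HEADER

fun main() {
    println(maxSegmentSize(1500, ipv6 = false)) // 1460
    println(maxSegmentSize(1500, ipv6 = true))  // 1440, the 20 bytes translation costs
    println(maxSegmentSize(1420, ipv6 = false)) // 1380 inside a 1420-byte tunnel
}
```

A payload sized for the first case does not fit in a single segment in the other two, which is where a dropped ICMP message turns into a silent stall.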

Testing IPv6-Only:

Don't rely on NAT64 in Docker; it doesn't simulate CLAT/PLAT latency. Use a real IPv6-only VLAN with a Linux router running Tayga 0.9.2 for 464XLAT:


# Ubuntu 22.04 Tayga configuration
# /etc/tayga.conf
tun-device nat64
ipv4-addr 192.168.255.1
prefix 64:ff9b::/96
dynamic-pool 192.168.255.0/24

Force your test device onto this network, then verify behavior when the PLAT is temporarily disabled (systemctl stop tayga). The app should fall back to IPv4 literal attempts if DNS64 is unavailable, but many modern HTTP clients disable IPv4 entirely when the interface lacks an IPv4 address.

Verify your HTTP client's address selection. In OkHttp, enable event listener logging:


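import android.util.Log
import java.net.Inet6Address
import java.net.InetSocketAddress
import java.net.Proxy
import okhttp3.Call
import okhttp3.EventListener
class NetworkTopologyEventListener : EventListener() {
    override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
        Log.d("NetworkTest", "Connecting to: ${inetSocketAddress.address} " +
              "type: ${if (inetSocketAddress.address is Inet6Address) "IPv6" else "IPv4"}")
    }
}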

If you see IPv4 addresses being attempted on an IPv6-only interface, your DNS resolver is bypassing the carrier's DNS64, or you've hardcoded IPv4 literals in your codebase, which violates App Store Review Guideline 2.5.5 (apps must be fully functional on IPv6-only networks).

The State Machine of Connectivity

The deadliest bugs occur during transitions, not steady states. The sequence: Wi-Fi ON → VPN Connect → Airplane Mode ON → Wi-Fi OFF → Airplane Mode OFF → VPN Reconnect → Cellular Data ON represents a plausible 30-second window in a user's commute. Most apps handle the endpoints (Wi-Fi only, Cellular only) but fail during the chasm.
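
Modeling that sequence as a tiny state machine shows how long the chasm is. The sketch below is deliberately simplified and every name in it is hypothetical: it tracks only flags, and it assumes the app still believes the VPN is up after airplane mode is enabled, which is exactly the stale state worth testing.

```kotlin
/** Simplified connectivity flags; not an Android type. */
data class Connectivity(
    val wifi: Boolean = false,
    val cellular: Boolean = false,
    val vpn: Boolean = false,
    val airplane: Boolean = false,
) {
    /** True when at least one radio can carry traffic. */
    val hasTransport: Boolean get() = !airplane && (wifi || cellular)
}

enum class Event {
    WIFI_ON, WIFI_OFF, VPN_CONNECT, VPN_RECONNECT, AIRPLANE_ON, AIRPLANE_OFF, CELLULAR_ON
}

/** Returns the state that results from applying one event. */
fun Connectivity.after(event: Event): Connectivity = when (event) {
    Event.WIFI_ON -> copy(wifi = true)
    Event.WIFI_OFF -> copy(wifi = false)
    Event.VPN_CONNECT, Event.VPN_RECONNECT -> copy(vpn = true)
    Event.AIRPLANE_ON -> copy(airplane = true)
    Event.AIRPLANE_OFF -> copy(airplane = false)
    Event.CELLULAR_ON -> copy(cellular = true)
}

/** Events after which the device has no usable transport. */
fun offlineSteps(events: List<Event>): List<Event> {
    var state = Connectivity()
    return events.filter { event ->
        state = state.after(event)
        !state.hasTransport
    }
}
```

For the commute sequence, four of the seven steps leave the device without transport, including the VPN reconnect attempt itself. Each of those steps is a point where an in-flight request should be injected by the test harness.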

Automating Chaos:

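Use Clumsy (WinDivert 2.2.0) on Windows hosts to introduce lag, drop, or throttle on specific ports. On macOS, iOS simulators share the host's network stack, so a Network Link Conditioner profile enabled on the Mac applies to them as well; xcrun simctl status_bar only overrides what the simulator's status bar displays and does not shape traffic. Physical devices expose Network Link Conditioner under Settings > Developer once they have been used with Xcode.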

For Android, adb provides granular control:


# Simulate specific network type (LTE with poor signal)
# Note: forcing a RAT generally needs a rooted or emulated device
adb shell cmd phone data-network-type LTE
adb shell svc wifi disable

# Introduce 20% packet loss on HTTPS responses (requires rooted device + iptables)
# Inbound packets from the server carry source port 443, hence --sport
adb shell iptables -A INPUT -p tcp --sport 443 -m statistic --mode random --probability 0.2 -j DROP

Programmatic State Injection:

Write a test harness that uses the AccessibilityService (Android) or XCUITest (iOS) to physically toggle settings while the app runs. This isn't unit testing; it's exploratory automation. SUSA implements this via device farm orchestration—one persona enables a WireGuard profile during checkout, another toggles between SIMs while uploading photos, logging the specific UnknownHostException vs SSLHandshakeException patterns that emerge.

Tooling Matrix: From Emulation to Real Radios

Tool | Platform | Capability | Limitation
Clumsy 0.3 | Windows | Lag/Drop/Throttle per process | WinDivert requires admin
Network Link Conditioner | macOS/iOS | 3G, Edge, DSL profiles | iOS 17+ requires hardware, not sim
tc/netem | Linux | Reorder, duplicate, corrupt | Kernel module dependencies
Charles Proxy 4.6.4 | Cross-platform | Throttle, Breakpoints, Map Remote | TLS 1.3 middlebox issues on Android 14
ADB Shell | Android | svc wifi/data, cmd phone | Cannot simulate specific RAT types on unrooted devices
Tayga 0.9.2 | Linux | 464XLAT simulation | Requires dedicated network namespace
WireGuard + tc | Linux | VPN + packet mangling | Complex orchestration

For CI integration, avoid emulators for network topology tests. Use device farms (Firebase Test Lab, AWS Device Farm) with specific network profiles enabled via host-side tc rules on the farm's Linux routers. Export JUnit XML with custom properties indicating the network state during failure:


<testcase name="testPaymentFlow" time="45.2">
  <failure message="SSLHandshakeException: Connection closed by peer">
    Network State: VPN_MTU_1420, IPv6_Only, 30%_Packet_Loss
  </failure>
</testcase>

Observability: Logging What Actually Failed

When a crash report arrives with java.net.UnknownHostException, the stack trace is useless without network topology context. Implement a NetworkDiagnostics singleton that captures:

  1. Active Network Capabilities: TRANSPORT_WIFI, NET_CAPABILITY_VALIDATED, NET_CAPABILITY_NOT_METERED
  2. Interface Addresses: Whether eth0 has an IPv4 address or only fe80:: link-local
  3. Proxy Configuration: ProxySelector.getDefault() return values (corporate PAC files break more apps than firewalls)
  4. DNS Resolution Time: Separate logs for A vs AAAA record queries
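
The interface-address item can be sketched with plain java.net calls, which also work on Android. The names here (AddressFamilyState, classifyAddresses, snapshotInterfaces) are hypothetical, and a production version would combine this with the NetworkCapabilities data from the first item.

```kotlin
import java.net.Inet4Address
import java.net.Inet6Address
import java.net.InetAddress
import java.net.NetworkInterface

enum class AddressFamilyState { DUAL_STACK, IPV4_ONLY, IPV6_ONLY, LINK_LOCAL_ONLY, NONE }

/** Summarizes which address families an interface can actually route with. */
fun classifyAddresses(addresses: List<InetAddress>): AddressFamilyState {
    val routable = addresses.filterNot { it.isLinkLocalAddress || it.isLoopbackAddress }
    val hasV4 = routable.any { it is Inet4Address }
    val hasV6 = routable.any { it is Inet6Address }
    return when {
        hasV4 && hasV6 -> AddressFamilyState.DUAL_STACK
        hasV4 -> AddressFamilyState.IPV4_ONLY
        hasV6 -> AddressFamilyState.IPV6_ONLY
        addresses.any { it.isLinkLocalAddress } -> AddressFamilyState.LINK_LOCAL_ONLY
        else -> AddressFamilyState.NONE
    }
}

/** Snapshot of every interface that is currently up, keyed by interface name. */
fun snapshotInterfaces(): Map<String, AddressFamilyState> =
    NetworkInterface.getNetworkInterfaces().toList()
        .filter { it.isUp }
        .associate { it.name to classifyAddresses(it.inetAddresses.toList()) }
```

Attaching the resulting map to a crash report turns a bare UnknownHostException into something like "wlan0 was LINK_LOCAL_ONLY," which points at DHCP or the portal rather than at your backend.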

On iOS, use NWPathMonitor to snapshot the path status at crash time:


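import Network
import FirebaseCrashlytics

let monitor = NWPathMonitor()
monitor.pathUpdateHandler = { path in
    // Mixed value types need an explicit [String: Any] annotation
    let topology: [String: Any] = [
        "usesCellular": path.usesInterfaceType(.cellular),
        "supportsDNS": path.supportsDNS,
        "isExpensive": path.isExpensive,
        // unsatisfiedReason needs iOS 14.2+; stored as text for the report
        "reason": String(describing: path.unsatisfiedReason)
    ]
    Crashlytics.crashlytics().setCustomKeysAndValues(topology)
}
// The handler never fires until the monitor is started on a queue
monitor.start(queue: DispatchQueue(label: "network.diagnostics"))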

Distinguish between "no network" (airplane mode), "captive portal" (DNS resolves but TCP to port 443 fails), and "broken middlebox" (MTU black hole). Your retry policy should differ: captive portals require user intervention; MTU issues require smaller packets; DNS failures require exponential backoff.

When Autonomous Testing Encounters Topology

Manual QA can't cover the combinatorial explosion of network states. A 10-persona autonomous testing approach—where each persona represents a user archetype (commuter, corporate VPN user, international roaming)—can systematically explore the state space. One persona maintains a persistent WebSocket connection while switching from 5G to Wi-Fi to VPN; another uploads large binaries during simulated tower handoffs.

The value isn't just finding crashes; it's discovering non-deterministic ANRs (Application Not Responding) that occur when ConnectivityManager callbacks deadlock with OkHttp connection pool evictions. SUSA captures these as regression scripts—Appium tests that reproduce the exact sequence of adb shell commands that triggered the race condition, ensuring the fix persists across releases.

Concrete Validation Strategy

Stop testing "offline mode." Test "degraded topology." Create a test schedule that includes:

  1. IPv6-only degradation: Deploy Tayga, disable IPv4 DHCP, verify file uploads don't stall at 0% due to CLAT MTU issues
  2. VPN churn: Automate WireGuard connect/disconnect every 30 seconds during a 5-minute stress test; assert no IllegalStateException in your network callback handlers
  3. Captive portal resurrection: Connect to a MikroTik RouterOS v7.x hotspot, authenticate, let the session expire mid-API-call, verify the app surfaces a "Network Login Required" message rather than JSON parsing errors
  4. Dual-SIM ping-pong: Use adb shell cmd phone switch-data-subscription (Android 14+) during active downloads; assert resume capability via Range headers rather than full restart
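
For the resume assertion in the last item, the request side is a single header. A minimal sketch, with resumeRangeHeader as a hypothetical helper; a real client should also confirm that the server answers 206 Partial Content and that the resource has not changed (for example with If-Range) before appending to the partial file.

```kotlin
/**
 * Builds the Range header value for resuming a download, or null when no
 * ranged request is needed. [totalBytes] may be unknown (null).
 */
fun resumeRangeHeader(bytesOnDisk: Long, totalBytes: Long?): String? {
    require(bytesOnDisk >= 0) { "bytesOnDisk must be non-negative" }
    if (bytesOnDisk == 0L) return null // nothing saved yet, fetch from the start
    if (totalBytes != null && bytesOnDisk >= totalBytes) return null // already complete
    return "bytes=$bytesOnDisk-" // open-ended range: from this offset to the end
}

fun main() {
    println(resumeRangeHeader(1024, 4096)) // bytes=1024-
}
```

The test then asserts that the request issued after the SIM switch carries this header instead of starting again from byte zero.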

If your test suite passes on a stable gigabit fiber connection, it proves nothing except that your JSON serializers work. Ship a feature only after it survives a test device strapped to a moving vehicle traversing cell tower boundaries while tethered to a laptop running tc qdisc add dev usb0 root netem delay 500ms 100ms loss 5%.
