Recently, data team members at Doximity encountered an issue. This article is the story of the journey the Data Infrastructure team embarked on to identify and address the problem. Our voyage will take us through capturing syscalls on a live process, building a testing binary, and examining raw network packets to find a quirky hack with unforeseen consequences.

A bit of background is in order. At Doximity, we have around 50 contributors working every day through our AWS-based data platform. One of the most common activities is accessing our data warehouse. A detail that will be important later is that, for legacy reasons, connecting to our data warehouse currently requires routing network connections through us-east-1 to us-west-2.

Our story begins with an issue report from a single person, who saw occasional timeouts when attempting to connect to our data warehouse through our internal tool named Mortar. Mortar is responsible for automating a range of data-related activities, one of which is establishing connections to our data warehouse through the open-source project sshuttle. The purpose is to grant easy access to our internal data warehouse without explicitly managing ssh tunnels. At first, the issue seemed isolated to a single contributor, and we erroneously brushed it off as a networking issue between the contributor and us-east-1.

On the following day, several contributors started to report connectivity issues, prompting us to investigate. We began by examining login failures with our data warehousing vendor in us-west-2 and tested connectivity from key nodes in us-east-1. An early hypothesis was that a round-robin between specific nodes in us-east-1 would occasionally take a connection through a misconfigured node. Our investigations showed this was not the case: all nodes successfully established connections to our data warehousing vendor.

Already feeling a bit confused, we reverted a few recent changes to our configuration and Mortar itself. We collected system data from contributors impacted by the issue (such as OS version, Python version, sshuttle version, etc.). After several tests, we still had nothing.

Our next focus became sshuttle, since it is responsible for managing the connections between data team members' laptops and our data warehouse. At a high level, sshuttle operates through Python scripts running on both the data team members' laptops and servers in us-east-1. The sshuttle script running on the us-east-1 servers is responsible for managing connections to internal services, such as our data warehouse. Since sshuttle ran without errors, we knew the connection from the data team members' laptops to the us-east-1 servers was operating as expected. Thus, our inclination at the time became to better understand what was happening within sshuttle when it established connections between our us-east-1 servers and our data warehouse.

Since sshuttle is a 3rd party tool, we had no experience deploying it in development mode, nor did we know of any best practices for debugging it, so we turned to strace, a tool for tracing system calls and signals. More importantly, you can attach strace to a running process to see what's going on inside it by inspecting the syscalls it executes. Running strace with timing options, we learned the following about sshuttle attempting to connect to our data warehouse:

14:03:27.998433 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 4 <0.000042>
...
14:03:27.999071 connect(4, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("*.*.*.*")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000042>
...
14:03:28.000082 select(5, [0], [4], [], NULL) = 1 (in [0]) <129.505638>

This observation highlighted that the connection between our us-east-1 servers and our data warehouse was the culprit for our timeouts. Sshuttle created a socket, initialized a connection, and then became stuck in a select (for 129.5 seconds here) waiting on the file descriptor to become ready for writing.

At this point, we wanted to run several tests. But reproducing the issue from locations outside of us-east-1 was tricky. Proper end-to-end testing would require new infrastructure, network policy changes, Mortar code changes, and sshuttle/ssh reconfigurations. Instead, we pulled out the syscalls that we had just observed and compiled them into a portable testing binary. We built this testing binary in Go by mapping the captured syscalls responsible for establishing the network connection to the appropriate Go syscall functions. This testing binary made it easy to run tests accessing any service from anywhere with the same logic and socket configuration as the problematic sshuttle connection.

Using our test binary, we quickly realized that the issue was more widespread than initially thought. Any connection established using our testing binary from any server in us-east-1 to any server and service in us-west-2 exhibited the problem. At this point, we escalated to AWS support and started working with them in tandem on investigating the issue.

Motivated by observing a failure to connect to a regular HTTPS server, we made another discovery that deepened the mystery: curl and telnet would establish connections from servers in us-east-1 to us-west-2 without the timeout issue. What was going on?!

It was time to understand how curl and telnet were different from sshuttle. In particular, if connections were failing from sshuttle but succeeding from other tools, then surely there must be a difference in the packets sent by the various tools.

Using tcpdump, we watched a curl invocation create a perfectly fine TCP connection from us-east-1 to a testing HTTP host we had configured in us-west-2:

14:14:28.573233 IP us-east-1 > us-west-2: Flags [S], seq 1854244865, win 26883, options [mss 8961,sackOK,TS val 1862324119 ecr 0,nop,wscale 7], length 0
14:14:28.645327 IP us-west-2 > us-east-1: Flags [S.], seq 2375826929, ack 1854244866, win 26847, options [mss 1460,sackOK,TS val 2111547314 ecr 1862324119,nop,wscale 8], length 0
14:14:28.645378 IP us-east-1 > us-west-2: Flags [.], ack 1, win 211, options [nop,nop,TS val 1862324191 ecr 2111547314], length 0
14:14:28.645450 IP us-east-1 > us-west-2: Flags [P.], seq 1:78, ack 1, win 211, options [nop,nop,TS val 1862324191 ecr 2111547314], length 77: HTTP
14:14:28.717508 IP us-west-2 > us-east-1: Flags [.], ack 78, win 105, options [nop,nop,TS val 2111547386 ecr 1862324191], length 0
14:14:28.719729 IP us-west-2 > us-east-1: Flags [P.], seq 1:451, ack 78, win 105, options [nop,nop,TS val 2111547388 ecr 1862324191], length 450
14:14:28.719751 IP us-east-1 > us-west-2: Flags [.], ack 451, win 219, options [nop,nop,TS val 1862324266 ecr 2111547388], length 0
14:14:28.719934 IP us-east-1 > us-west-2: Flags [F.], seq 78, ack 451, win 219, options [nop,nop,TS val 1862324266 ecr 2111547388], length 0
14:14:28.792101 IP us-west-2 > us-east-1: Flags [F.], seq 451, ack 79, win 105, options [nop,nop,TS val 2111547461 ecr 1862324266], length 0
14:14:28.792130 IP us-east-1 > us-west-2: Flags [.], ack 452, win 219, options [nop,nop,TS val 1862324338 ecr 2111547461], length 0

At the same time, we observed an invocation of our sshuttle testing binary timing out after several SYN retransmits:

14:11:13.015619 IP us-east-1 > us-west-2: Flags [S], seq 824638074, win 26883, options [mss 8961,sackOK,TS val 1862128566 ecr 0,nop,wscale 7], length 0
14:11:14.042399 IP us-east-1 > us-west-2: Flags [S], seq 824638074, win 26883, options [mss 8961,sackOK,TS val 1862129593 ecr 0,nop,wscale 7], length 0
14:11:16.058348 IP us-east-1 > us-west-2: Flags [S], seq 824638074, win 26883, options [mss 8961,sackOK,TS val 1862131609 ecr 0,nop,wscale 7], length 0
14:11:20.186343 IP us-east-1 > us-west-2: Flags [S], seq 824638074, win 26883, options [mss 8961,sackOK,TS val 1862135737 ecr 0,nop,wscale 7], length 0
14:11:28.378346 IP us-east-1 > us-west-2: Flags [S], seq 824638074, win 26883, options [mss 8961,sackOK,TS val 1862143929 ecr 0,nop,wscale 7], length 0
14:11:44.506332 IP us-east-1 > us-west-2: Flags [S], seq 824638074, win 26883, options [mss 8961,sackOK,TS val 1862160056 ecr 0,nop,wscale 7], length 0
14:12:18.298341 IP us-east-1 > us-west-2: Flags [S], seq 824638074, win 26883, options [mss 8961,sackOK,TS val 1862193848 ecr 0,nop,wscale 7], length 0

A SYN packet is the first packet sent when establishing a TCP connection. If the sender does not receive a response (a SYN-ACK packet) within a specific time, it retransmits. Therefore, the behavior we observed was consistent with either there being no server at the other end (we knew there was) or the SYN packets being dropped or lost in transit. Since the packets from the curl call all made it through, the only remaining hypothesis was that something caused the SYN packets from the sshuttle testing binary to get dropped. Furthermore, the only way this could happen is if they contained data that differed from the SYN packet of the curl connection, triggering a hop in the route between us-east-1 and us-west-2 to drop them.

To understand the difference between the curl SYN packet and the sshuttle SYN packet, we used tcpdump again to capture the working and non-working SYN packets' raw data.

curl SYN packet

45 00 00 3C 12 D5 40 00 40 06 EC F0 ** ** ** ** ** ** ** ** 9B 66 01 BB 4F 86 57 F7 00 00 00 00 A0 02 F5 07 3B 25 00 00 02 04 23 01 04 02 08 0A 27 C3 9D A3 00 00 00 00 01 03 03 07

sshuttle SYN packet

45 00 00 3C 11 4E 40 00 2A 06 04 78 ** ** ** ** ** ** ** ** 9B 64 01 BB 95 29 7A 5E 00 00 00 00 A0 02 F5 07 3B 25 00 00 02 04 23 01 04 02 08 0A 27 C3 78 F6 00 00 00 00 01 03 03 07

We are now at the part of our story where everything starts to make sense. If you're unfamiliar with SYN packets, let me highlight the difference between the two we captured. First, several fields should be (and are) different: identification values, header checksums, source port, sequence number, and timestamp. But one field unexpectedly stood out as different. The 9th byte of an IPv4 packet stores the IP_TTL, which configures the number of hops a packet is allowed to take before it gets dropped. From the SYN packets we captured, we noticed that the IP_TTL on the curl SYN packet was set to 0x40 = 64, the Linux default. The sshuttle SYN packet had it set to 0x2A = 42: lower than 64, and a non-default value.

A reasonable hypothesis is starting to form. Perhaps the routing between us-east-1 and us-west-2 occasionally takes our connection through a route with more than 42 hops but fewer than 64. If that were the case, connections from curl and other tools using the Linux default would work fine. But connections from sshuttle with its non-default IP_TTL of 42 would get dropped.

This hypothesis was fairly easy to test. We ran a few tcptraceroute probes from us-east-1 to us-west-2 and found that about 50% of our connections got routed through paths one or two hops longer than 42! This result was looking promising. Furthermore, returning to the syscalls we had captured while stracing sshuttle, we discovered this one:

setsockopt(4, SOL_IP, IP_TTL, [42], 4)

And there it was! This syscall confirmed that sshuttle explicitly sets a non-default IP_TTL. Jumping into the sshuttle source code, we found that sshuttle uses a non-default IP_TTL value as a hack to avoid infinite loops. A quick update to the syscall in our testing binary further confirmed that this was the issue.

Finally, we had a path forward. We forked sshuttle, patched it to use a non-default IP_TTL value of 63 instead of 42, and built, tested, and deployed Mortar with our patched version of sshuttle. This patch solved the timeout issue. We concluded our adventure by contributing our fix back to the sshuttle project.

It has been a great journey.


Thanks to all the reviewers and Hannah Gambino for the illustrations.

Be sure to follow @doximity_tech if you'd like to be notified about new blog posts.