Unraveling Network Woes: The Mystery of Missing TCP Window Updates

Introduction

Have you ever encountered the vexing message “There appears to be trouble with your network connection. Retrying…” while working on your projects? Something like this:

=> [deps 5/5] RUN npm install --omit=dev --production=true                                                                                                                        86.5s
=> => # yarn install v1.22.19                                                                                                                                                 
=> => # [1/5] Validating package.json...                                                                                                                                      
=> => # [2/5] Resolving packages...                                                                                                                                           
=> => # [3/5] Fetching packages...                                                                                                                                            
=> => # info There appears to be trouble with your network connection. Retrying...                                                                                            
=> => # info There appears to be trouble with your network connection. Retrying...

Solution (immediate fix)

I understand you might be as frustrated with this as I was when I faced the situation. So, without delay: what worked for me was increasing the network timeout for the package install step in the `Dockerfile`:

-RUN npm install --omit=dev --production=true
+RUN npm install --omit=dev --production=true --network-timeout 100000000

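--network-timeout raises how long the package manager waits on a network request before giving up. It is a Yarn option (the log above shows Yarn 1.22.19 performing the install) and the value is in milliseconds, so 100000000 is roughly 27.8 hours.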

Root cause

It’s a frustrating situation that many of us have faced, and often. This issue was particularly interesting because I know for a fact that my internet connection is not slow (speed test result below).

https://www.speedtest.net/result/15250717552.png

I checked with my friends to see whether they were running into the same behaviour, but to my surprise, they were not.

Let’s delve deeper into the issue. Even after resolving it, I found myself unsatisfied with the solution. I had a hunch that the network was involved, because even on a supposedly high-speed link the package download was really slow. When things like this happen, it is usually the network. However, the specifics eluded me. That’s when I decided to conduct a more thorough examination using a packet capture tool (Wireshark).

Upon scrutinizing the data in Wireshark, I noticed a significant number of RST packets originating from my PC and being sent to the npm server. This puzzled me, especially since this configuration was working smoothly for a friend of mine.

[Screenshot: Wireshark capture showing the RST packets]

Here in North America, there is a whole lot of IPv6 going on, so please don’t get confused by the source and destination addresses ;).

Before we proceed, let’s touch on RST packets. These packets abruptly terminate a TCP connection, and either endpoint can send one. In my scenario they were being dispatched from my PC to the npm server, which was peculiar, as I had expected any resets to come from the server’s end.
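To make the RST flag concrete, here is a minimal Python sketch that decodes the flags byte of a TCP header. The bit values are the standard TCP flag bits; the helper names `decode_tcp_flags` and `is_reset` are my own, made up for illustration, and are not part of Wireshark or any library.

```python
# Standard TCP flag bits (low six bits of the flags field in the TCP header).
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
}


def decode_tcp_flags(flags: int) -> list:
    """Return the names of the flags set in a TCP flags byte (hypothetical helper)."""
    return [name for bit, name in TCP_FLAGS.items() if flags & bit]


def is_reset(flags: int) -> bool:
    """True if the RST bit is set, i.e. the segment tears the connection down."""
    return bool(flags & 0x04)


if __name__ == "__main__":
    print(decode_tcp_flags(0x14))  # RST together with ACK
    print(is_reset(0x12))          # SYN + ACK is not a reset
```

In Wireshark itself, the display filter `tcp.flags.reset == 1` shows the same set of packets.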

Upon closer inspection, a pattern emerged. Preceding the dispatch of the RST packet, there was a succession of packets indicating ‘TCP ZeroWindow’. Essentially, this was my client PC signaling to the sender that it was currently unable to receive any further data.

Here lies the heart of the matter: the absence of ‘TCP Window Update’ packets. According to protocol, once a receiver that has advertised a zero window frees up buffer space, it sends a ‘TCP Window Update’ to the sender so that the transfer can resume. In this case, these vital update packets were nowhere to be found.

This riddle led to a crucial question: why wasn’t my client PC sending out the necessary ‘TCP Window Update’ packets? The npm server, eager to resume the exchange of data, was left in limbo, and the stalled connection eventually ended with that troublesome RST packet.
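The pattern described above (zero-window advertisements, no window update, then a reset) can be sketched as a small check over a list of segments. This is only an illustration under assumptions: the `Segment` record, the "client"/"server" labels, and the function names are hypothetical, and the zero-window rule mirrors how Wireshark describes its ‘TCP ZeroWindow’ flag (window of zero with none of SYN, FIN, or RST set).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A simplified view of one captured TCP segment (hypothetical record type)."""
    sender: str        # "client" or "server", labels chosen for this sketch
    window: int        # raw 16-bit window field from the TCP header
    syn: bool = False
    fin: bool = False
    rst: bool = False


def is_zero_window(seg: Segment) -> bool:
    """Zero window advertisement: window is 0 and SYN, FIN, RST are all clear."""
    return seg.window == 0 and not (seg.syn or seg.fin or seg.rst)


def stalled_on_zero_window(segments, receiver: str = "client") -> bool:
    """True if the receiver's last advertisement before its reset
    (or before the end of the trace) was a zero window, i.e. no
    window update ever reopened the connection."""
    stalled = False
    for seg in segments:
        if seg.sender != receiver:
            continue
        if seg.rst:
            return stalled          # connection torn down; report state so far
        if is_zero_window(seg):
            stalled = True          # receiver says "stop sending"
        elif seg.window > 0:
            stalled = False         # a non-zero window reopens the flow
    return stalled


if __name__ == "__main__":
    trace = [
        Segment("client", 64240, syn=True),
        Segment("server", 65535, syn=True),
        Segment("client", 1024),
        Segment("client", 0),
        Segment("client", 0),
        Segment("client", 0, rst=True),
    ]
    print(stalled_on_zero_window(trace))
```

In a real capture, the Wireshark display filters `tcp.analysis.zero_window` and `tcp.analysis.window_update` give the two sides of this comparison.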

The question persisted: why was my computer not sending window update packets as expected?

Looking further into the PCAP file, I discovered that when my computer sent the SYN packet, it did include a WS of 256, meaning a shift count of 8 and a scale factor of 2^8. However, the SYN-ACK that came back did not carry a window scale option in return. For those unfamiliar with the term, window scaling is a TCP option, exchanged only in the SYN and SYN-ACK, that extends the effective window size beyond its traditional 64KB limit. It takes effect only when both sides send it, which led me to believe that window scaling was not being used and that the receive window was capped at 65,535 bytes.

[Screenshot: Wireshark view of the TCP handshake]
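The window scale arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the 50 ms round-trip time in the example is an assumed figure rather than something measured in my capture. The sketch relies on two facts: scaling is only in effect when both the SYN and the SYN-ACK carry the option, and a sender can have at most one window of data in flight per round trip.

```python
from typing import Optional, Tuple

MAX_UNSCALED_WINDOW = 65_535  # largest value the 16-bit window field can hold


def negotiated_shifts(syn_shift: Optional[int],
                      synack_shift: Optional[int]) -> Tuple[int, int]:
    """Return (client_shift, server_shift) actually in effect.

    None means the option was absent from that packet. If either side
    omits it, scaling is off in both directions."""
    if syn_shift is None or synack_shift is None:
        return (0, 0)
    return (syn_shift, synack_shift)


def effective_window(raw_window: int, shift: int) -> int:
    """Effective receive window in bytes: raw field shifted left by the scale."""
    return raw_window << shift


def throughput_ceiling_mbit(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


if __name__ == "__main__":
    rtt = 0.05  # assumed 50 ms round trip, for illustration only

    # My case: the SYN offered shift 8, the SYN-ACK carried no option.
    client_shift, _ = negotiated_shifts(8, None)
    capped = effective_window(MAX_UNSCALED_WINDOW, client_shift)
    print(capped, f"{throughput_ceiling_mbit(capped, rtt):.1f} Mbit/s")

    # Same link with scaling honoured on both sides.
    client_shift, _ = negotiated_shifts(8, 7)
    scaled = effective_window(MAX_UNSCALED_WINDOW, client_shift)
    print(scaled, f"{throughput_ceiling_mbit(scaled, rtt):.1f} Mbit/s")
```

Under that assumed round-trip time, an unscaled window limits a single connection to about 10.5 Mbit/s regardless of how fast the link is, which fits the picture of a fast connection and a slow download.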

This revelation led me to suspect that something on my computer or my proxy was suppressing the window scale option. Having pinpointed the issue, the next step was to identify the culprit. After some investigation, I determined that my router had TCP window scaling disabled by default. Once I enabled it and rebooted, everything started working seamlessly.

If you are seeing a similar issue, you can adjust the TCP window scaling configuration on your proxy or router by following the thread below:

https://serverfault.com/questions/1039212/how-to-adjust-tcp-window-size
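Before touching the router, it can be worth ruling out your own machine. The sketch below assumes a Linux host, where the kernel exposes the setting at /proc/sys/net/ipv4/tcp_window_scaling (1 means enabled); the function name is hypothetical. It says nothing about a router or proxy in the path, which is where my problem actually was.

```python
from pathlib import Path
from typing import Optional

# Linux-only: kernel switch for TCP window scaling ("1" = enabled, "0" = disabled).
SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_window_scaling")


def window_scaling_enabled(path: Path = SYSCTL_PATH) -> Optional[bool]:
    """Return True/False for the local setting, or None if it cannot be read
    (for example on a non-Linux system)."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        return None


if __name__ == "__main__":
    state = window_scaling_enabled()
    if state is None:
        print("Could not read the setting on this system")
    else:
        print("TCP window scaling enabled locally:", state)
```

If this reports True and the SYN-ACK in your capture still lacks the option, the cause is more likely a device between you and the server.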

Coming back to the initial solution

After reading the above, you might still be wondering how adding --network-timeout in the original solution helped work around the issue. Raising the network timeout helps when connections are struggling: a transfer that is slow or stalled can hit the client’s timeout and be abandoned. With a larger value, you’re essentially allowing more time for the connection to recover and for the data to be successfully transmitted, which prevents premature timeouts and failed installs.

It’s important to note that while increasing the network timeout can be a helpful immediate fix in some situations, it is not a permanent solution. In complex network environments, other factors may need to be addressed, such as network configurations, firewalls, or routing issues. The increased timeout provides a buffer of extra time for slow transfers to finish, but it doesn’t address the root cause of the slowness.

Conclusion

As you can see, what initially appeared to be an application issue was, in fact, a network issue. Sometimes the apparent problem is not the underlying one, which is why the 5 Whys technique is so useful for uncovering the root cause. If you found this article helpful, please consider giving it a thumbs-up and sharing it.