Coder.com3mo ago
whizz

Unable to make direct connection between two NATs, even with STUN

Hi, our Coder deployment is hosted in an on-prem k8s cluster (one subnet), and the client devices sit on another subnet. I have also installed an internal STUN server. We are having trouble establishing a direct connection, I suspect due to a restrictive firewall rule. So I started with something simple: create a pod in the same k8s cluster and run coder login in it to act as a client. To my surprise, despite being on the same subnet, the coder client still won't direct connect. coder netcheck output is attached.
34 Replies
Codercord
Codercord3mo ago
Category
Help needed
Product
N/A
Platform
N/A
Logs
Please post any relevant logs/error messages.
What product are you using?
Phorcys
Phorcys3mo ago
is that 1000stun (10.20.20.113) part of your cluster net? ah yeah, i assume it's the custom internal STUN server. what STUN solution are you using?
whizz
whizzOP3mo ago
coturn. But I'm surprised: with the pod inside k8s and coder hosted in k8s, why does it need STUN at all? They can ping each other. Not really, but I have now changed it to a STUN server that really is part of the cluster (same subnet, 10.244.x.x). Same result:
whizz
whizzOP3mo ago
log with STUN in 10.244.x.x range
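For readers wondering what the STUN server contributes here: a client sends a Binding Request over UDP and the server replies with the address and port it saw the request come from. The following is a minimal sketch of that exchange per RFC 5389, not how Coder or coturn implement it. The function names are hypothetical, only IPv4 is handled, and port 3478 is assumed as the conventional STUN port.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(txid: bytes) -> bytes:
    """Return a 20-byte STUN Binding Request with no attributes."""
    if len(txid) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txid


def parse_xor_mapped_address(resp: bytes, txid: bytes):
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    if len(resp) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", resp[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE or resp[8:20] != txid:
        return None
    pos, end = 20, min(len(resp), 20 + msg_len)
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", resp[pos:pos + 4])
        value = resp[pos + 4:pos + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            # Port and address are XORed with the magic cookie on the wire.
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


def stun_query(server: str, port: int = 3478, timeout: float = 2.0):
    """Send one Binding Request over UDP; return the reflected (ip, port) or None."""
    txid = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_binding_request(txid), (server, port))
            resp, _ = sock.recvfrom(2048)
        except OSError:  # timeout, unreachable, or refused
            return None
    return parse_xor_mapped_address(resp, txid)
```

Calling `stun_query("10.244.x.x")` with the real server address from inside the client pod would show which source address and port the STUN server observes. A reply only proves that UDP to the STUN port works; it says nothing about the other UDP ports the peers later use to reach each other.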
whizz
whizzOP3mo ago
Hi @Phorcys, sorry to bother you, any idea where I should look next? 😄
matifali
matifali3mo ago
Maybe @Ethan knows what's happening here?
zounce
zounce3mo ago
I'm honestly not sure. Do you have anything that might be blocking UDP? Actually, the netcheck says udp: true on both.
whizz
whizzOP3mo ago
Within the same subnet inside the cluster? No I don't think so. Yes, but what does can_exchange_messages = false mean?
zounce
zounce3mo ago
can_exchange_messages just means whether the node was able to send a message over the DERP protocol, and since it's on the STUN server node (STUNOnly = true) in this case, it can't possibly be true, so it's unfortunately not relevant or helpful here. to answer your question in #general: the DERP protocol is wireguard UDP over TCP, but DERPs listen on the HTTPS port (443), i'm pretty sure
whizz
whizzOP3mo ago
I see, thanks for the explanation. Then I'd still like to find a solution. I don't understand why node 999 is preferred over 1000, since 1000 is healthy, and why direct connection also doesn't work on the same subnet. Is it because the workspace URL points to a load balancer? What TCP port?
Phorcys
Phorcys3mo ago
DERP uses TCP port 443 after upgrading from an HTTP/s connection
whizz
whizzOP3mo ago
That is definitely NOT blocked 😆
Phorcys
Phorcys3mo ago
make sure you're letting Upgrade: derp through though whoops
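To illustrate the point about letting the upgrade through: a DERP client starts with an ordinary HTTPS request that asks to switch protocols, so a proxy or ingress that strips the Upgrade header would break it. The sketch below builds such a request and reads back the status line. It is a rough check under assumptions, not the Coder client's actual behavior: the `/derp` path, the helper names, and the host are assumptions for illustration.

```python
import socket
import ssl


def build_upgrade_request(host: str, path: str = "/derp") -> bytes:
    """Build a minimal HTTP/1.1 request asking to upgrade to DERP."""
    lines = [
        f"GET {path} HTTP/1.1",   # path is an assumption, adjust as needed
        f"Host: {host}",
        "Connection: Upgrade",
        "Upgrade: derp",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def upgrade_status_line(host: str, port: int = 443,
                        path: str = "/derp", timeout: float = 5.0) -> str:
    """Send the upgrade request over TLS and return the first response line."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_upgrade_request(host, path))
            data = tls.recv(4096)
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

If the ingress passes the header, the status line would normally be a 101 Switching Protocols response; any other status suggests something in front of the control plane is handling the request differently.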
Phorcys
Phorcys3mo ago
nevermind though, your healthcheck page would be telling you this
whizz
whizzOP3mo ago
You mean for the ingress of the main coder frontend? Or the agent running on workspace? Yes all green there
Phorcys
Phorcys3mo ago
it would've been the ingress of the Coder control plane, but i realized that'd be unrelated to your issue anyways 😅 i am not well versed with our networking stack, which is why i'm not answering this thread a lot 😅 but thankfully @zounce is here to take care of you :-)
whizz
whizzOP3mo ago
Thanks for helping so far. It is appreciated 👍 Hi @zounce, is there a way to trace or debug why the client prefers the DERP relay node over a direct connection when they are on the same subnet?
Phorcys
Phorcys2mo ago
From DMs with @whizz:
By the way, the relay approach kinda introduces a SPoF. We are still interested in getting a direct connection. Not sure what needs to be done next.
Ethan
Ethan2mo ago
Unfortunately, being able to establish direct connections doesn't eliminate the dependency on the Coder deployment. It's the control plane for all connections, so it is just necessary. We do support high availability for premium deployments, however. There's a short list of diagnostics that might appear at the end of a coder ping <workspace> (many of the issues are listed here), but unfortunately it's just netcheck and that.
whizz
whizzOP2mo ago
Thanks for answering, @Ethan. I see. In this case, does a direct connection offer any latency/responsiveness advantage compared to relay? Yes, I'm aware; still waiting for availability of smaller-tier licenses, as we are below 50 seats. 😉 Our use case is a single-location deployment (at the moment). When outside the intranet, people VPN into this location to access coder.
Ethan
Ethan2mo ago
ahhh the VPN is a super important detail. if you run coder ping <workspace> with your VPN on, do you get any warnings about MTU?
whizz
whizzOP2mo ago
No no no, the log I uploaded is without VPN, and with the network on the same subnet.
Ethan
Ethan2mo ago
ah I see, so you can't get direct in either case
whizz
whizzOP2mo ago
I'm just explaining the end use case 😉
Ethan
Ethan2mo ago
in any case, you won't be able to get direct connections over the VPN
whizz
whizzOP2mo ago
Yeah, I figured I'd start simple: deploy a pod and connect to the control plane (both in the same network segment). Somehow still no direct connect. Pod and control plane can ping each other, UDP traffic can pass, etc. I see, not even with a STUN server and NAT traversal?
whizz
whizzOP2mo ago
Perhaps before spending more effort, I should confirm this: is there any network performance/latency advantage with direct connect, for intranet usage?🤔
Ethan
Ethan2mo ago
yeah you'll definitely see better latency and bandwidth with a direct connection, though it'll depend how much
whizz
whizzOP2mo ago
Hello, just to give an update. Apparently our cluster had some weird restrictive firewall rule that blocked certain traffic between pods. After removing that, direct connection now works! For some reason direct connection also works with the VPN, maybe because of the way our VPN is set up: it has enough MTU, and each client gets an IP that can route bidirectionally without NAT to the coder pod.
Ethan
Ethan5w ago
ah interesting. do you know what sort of traffic was being blocked? we might be able to improve the diagnostics in the product (coder ping) to pick up on that and warn. yeah, if you're not seeing any MTU warnings when doing a coder ping with the deployment available over a VPN, then there shouldn't be any issues with the VPN. a too small MTU is the biggest reason a lot of customers have issues with direct connections & VPNs
whizz
whizzOP5w ago
Hi, it's the ephemeral high UDP ports, which I think are needed by tailscale. So previously we could ping, but port 43000-something was blocked.
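The resolution above shows why a successful ping is not enough: ICMP can pass while UDP on a high port is dropped. A minimal sketch of a UDP reachability check follows, assuming you can run a small echo listener on one pod and a probe on the other. The function names are hypothetical and the port is whatever you choose to test; this is not part of coder ping or netcheck.

```python
import socket
import threading


def udp_echo_once(sock: socket.socket) -> None:
    """Echo a single datagram back to its sender (run on the target side)."""
    data, peer = sock.recvfrom(2048)
    sock.sendto(data, peer)


def udp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a datagram sent to host:port is echoed back in time."""
    payload = b"udp-probe"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(2048)
        except OSError:  # timeout, refused, or unreachable
            return False
    return data == payload


def loopback_demo() -> bool:
    """Run listener and probe on 127.0.0.1 to show the round trip."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))          # let the OS pick a free port
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_once, args=(server,), daemon=True)
    worker.start()
    try:
        return udp_probe("127.0.0.1", port)
    finally:
        worker.join(timeout=3.0)
        server.close()
```

In a real check, the listener would bind to a high port on one pod (for example something in the 43000 range mentioned above) and `udp_probe` would run from the other pod. A False result with working ping points at a rule filtering UDP on that port range.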
