Unable to make direct connection between two NATs, even with STUN
Hi, our Coder deployment is hosted in an on-prem k8s cluster (one subnet), and the client devices sit on another subnet.
I have installed an internal STUN server as well.
We are having trouble establishing a direct connection, I suspect due to restrictive firewall rules.
So I started with something simple: create a pod in the same k8s cluster, and run
coder login
to act as a client.
To my surprise, despite being on the same subnet, the Coder client still won't direct connect.
coder netcheck
output is attached.
is that 1000stun (10.20.20.113) part of your cluster net?
ah yeah, I assume it's the custom internal STUN server
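(Editor's note: a STUN-only region like "1000stun" is typically wired into Coder via a custom DERP map passed to the control plane. A minimal sketch of what that config could look like, using the 10.20.20.113 address mentioned above; the region/node names and the STUN port 3478 are illustrative assumptions:)

```json
{
  "Regions": {
    "1000": {
      "RegionID": 1000,
      "RegionCode": "stun",
      "RegionName": "internal STUN",
      "Nodes": [
        {
          "Name": "1000stun0",
          "RegionID": 1000,
          "HostName": "10.20.20.113",
          "STUNPort": 3478,
          "STUNOnly": true
        }
      ]
    }
  }
}
```

The `STUNOnly: true` flag is why this node can never relay DERP traffic itself, which matters later in the thread.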
what STUN solution are you using?
coturn
but I'm surprised, with pod inside k8s and coder hosted in k8s, why does it need STUN at all? they can ping each other.
not really, but I changed it now to a STUN server that is really part of the cluster (same subnet, 10.244.x.x)
same result:
log with STUN in 10.244.x.x range
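(Editor's note: one way to rule the STUN server in or out is to poke it directly from a pod. Below is a minimal sketch of an RFC 5389 Binding Request in Python; the host/port you pass in are placeholders for your coturn address, assuming the default STUN port 3478:)

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442  # fixed value from RFC 5389


def build_binding_request() -> bytes:
    """STUN header: type 0x0001 (Binding Request), length 0, cookie, 96-bit txn id."""
    return struct.pack("!HHI", 0x0001, 0, MAGIC_COOKIE) + os.urandom(12)


def parse_xor_mapped_address(response: bytes):
    """Walk the attributes of a STUN response, decoding XOR-MAPPED-ADDRESS (0x0020)."""
    pos = 20  # skip the 20-byte header
    while pos + 4 <= len(response):
        attr_type, attr_len = struct.unpack_from("!HH", response, pos)
        if attr_type == 0x0020:
            port = struct.unpack_from("!H", response, pos + 6)[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack_from("!I", response, pos + 8)[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attribute values are 32-bit aligned


def stun_check(host: str, port: int = 3478, timeout: float = 2.0):
    """Return the (ip, port) the STUN server saw us as, or None on no answer."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_binding_request(), (host, port))
        try:
            data, _ = sock.recvfrom(2048)
        except OSError:
            return None
    return parse_xor_mapped_address(data)
```

If `stun_check` returns the pod's own IP and source port unchanged, there is no NAT between the pod and the STUN server, which is what you'd expect inside one subnet.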
Hi @Phorcys, sorry to bother you, any idea where I should look next? 😄
Maybe @Ethan knows what's happening here?
I'm honestly not sure - do you have anything that might be blocking UDP?
actually the netcheck says udp: true on both
Within the same subnet inside the cluster? No I don't think so.
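(Editor's note: `udp: true` in netcheck only covers the STUN path, so a throwaway pod-to-pod echo test can confirm that arbitrary UDP actually passes. This is just a sketch; the bind address and port are whatever you pick:)

```python
import socket


def udp_echo_once(sock: socket.socket) -> None:
    """One-shot echo: wait for a single datagram on an already-bound socket, bounce it back."""
    data, peer = sock.recvfrom(1500)
    sock.sendto(data, peer)


def udp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a datagram sent to host:port comes back; False on timeout or ICMP error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(b"probe", (host, port))
        try:
            return sock.recvfrom(1500)[0] == b"probe"
        except OSError:  # timed out, or port-unreachable surfaced as an error
            return False
```

Run the echo side in one pod (bound to a high ephemeral port) and `udp_probe` it from the other; if ping works but this times out, something is filtering UDP between the pods.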
Yes, but what does can_exchange_messages = false mean?
can_exchange_messages
just means whether the node was able to send a message over the DERP protocol
and since it's on the STUN server node (STUNOnly = true) in this case, it can't possibly be true
so it's unfortunately not relevant or helpful here
to answer your question in #general, the DERP protocol is wireguard UDP over TCP
but DERPs listen on the HTTPS port (443), i'm pretty sure
I see, thanks for the explanation
Then I still would like to find a solution. I don't understand why node 999 is preferred over 1000, since 1000 is healthy.
And why direct connection also doesn't work on same subnet
Is it because of the workspace URL pointing to a load balancer ?
What TCP port?
DERP uses TCP port 443 after upgrading from an HTTP/s connection
That is definitely NOT blocked 😆
make sure you're letting
Upgrade: derp
through though
whoops, nevermind though, your healthcheck page would be telling you this
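(Editor's note: if you ever do need to verify that an ingress passes the upgrade header through, you can speak the handshake by hand. A rough sketch, assuming Coder's embedded DERP is served at `/derp` on the access URL; the host name is a placeholder:)

```python
import socket
import ssl


def derp_upgrade_request(host: str, path: str = "/derp") -> bytes:
    """The HTTP/1.1 request a DERP client starts with before the protocol upgrade."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: Upgrade\r\n"
        "Upgrade: derp\r\n"
        "\r\n"
    ).encode()


def check_derp_upgrade(host: str, port: int = 443) -> str:
    """Return the server's status line; expect a 101 if `Upgrade: derp` survives the ingress."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(derp_upgrade_request(host))
            return tls.recv(4096).split(b"\r\n", 1)[0].decode()
```

Anything other than a `101 Switching Protocols` style answer suggests a proxy in the middle is stripping the `Upgrade` header.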
You mean for the ingress of the main coder frontend?
Or the agent running on workspace?
Yes all green there
it would've been the ingress of the Coder control plane, but i realized that'd be unrelated to your issue anyways 😅
i am not well versed with our networking stack, which is why i'm not answering this thread a lot 😅
but thankfully @zounce is here to take care of you :-)
Thanks for helping so far. It is appreciated 👍
Hi @zounce, is there a way to trace or debug why the client prefers a DERP relay node rather than a direct connection, when they are on the same subnet?
From DMs with @whizz:
By the way, the relay approach kinda introduces a SPoF… We are still interested in getting a direct connection. Not sure what needs to be done next.
Unfortunately being able to establish direct connections doesn't eliminate the dependency on the Coder deployment
it's the control plane for all connections, so it is just necessary. We do support high availability for premium deployments, however.
There's a short list of diagnostics that might appear at the end of a
coder ping <workspace>
(many of the issues are listed here) but unfortunately it's just netcheck
and that
Thanks for answering @Ethan. I see, in this case, does a direct connection offer any latency/responsiveness advantage compared to relay?
Yes I'm aware, still waiting for availability of smaller tier licenses, as we are below 50 seats. 😉
Our use case is a single-location deployment (at the moment). When outside the intranet, people VPN into this location to access Coder.
ahhh the VPN is a super important detail
if you run
coder ping <workspace>
with your VPN on, do you get any warnings about MTU?
No no no, the log I uploaded is without VPN, and the network is on the same subnet
ah I see, so you can't get direct in either case
I'm just explaining the end use case 😉
in any case, you won't be able to get direct connections over the VPN
Yeah, I figured I'd start simple: deploy a pod and connect to the control plane (both in the same network segment), and somehow still no direct connect. The pod and control plane can ping each other, UDP traffic can pass, etc.
I see, not even with STUN server and NAT traversal?
Yeah, full context is in https://github.com/coder/coder/issues/15523#issuecomment-2480014377
Perhaps before spending more effort, I should confirm this: is there any network performance/latency advantage to direct connect for intranet usage? 🤔
yeah, you'll definitely see better latency and bandwidth with a direct connection, though how much depends on your setup
Hello just to give an update.
Apparently our cluster had some weird restrictive firewall rule that blocks certain traffic between pods.
After removing that, now direct connection works!
For some reason direct connection also works with VPN, maybe because of the way our VPN is set up: it has a large enough MTU, and each client gets an IP that can route bidirectionally to the coder pod without NAT.
ah interesting. do you know what sort of traffic was being blocked? we might be able to improve the diagnostics in the product (
coder ping
) to pick up on that and warn
yeah if you're not seeing any MTU warnings when doing a coder ping
with the deployment available over a VPN then there shouldn't be any issues with the VPN
a too-small MTU is the biggest reason a lot of customers have issues with direct connections & VPNs
Hi, it's the ephemeral high UDP ports, I think needed by Tailscale. So previously we could ping, but the port 43000-something was blocked.
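(Editor's note: for anyone hitting the same thing, the fix amounts to letting pod-to-pod UDP through on the high ephemeral range. As an illustrative sketch only: the policy name, namespace, and exact port range below are assumptions, and whether a NetworkPolicy or an external firewall applies depends on what was doing the blocking in your cluster:)

```yaml
# Hypothetical sketch: allow pod-to-pod UDP on the ephemeral range that
# WireGuard/Tailscale-style NAT traversal picks its source ports from.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ephemeral-udp   # name is illustrative
  namespace: coder            # assumes Coder workspaces run here
spec:
  podSelector: {}             # apply to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
      ports:
        - protocol: UDP
          port: 32768
          endPort: 60999      # typical Linux ephemeral port range
```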