Coder.com•5mo ago

Unable to make direct connection between two NATs, even with STUN

Hi, our Coder deployment is hosted in an on-prem k8s cluster (one subnet), and the client devices sit on another subnet. I have also installed an internal STUN server. We are having trouble establishing a direct connection, I suspect due to a restrictive firewall rule. So I started with something simple: create a pod in the same k8s cluster and run coder login so it acts as a client. To my surprise, despite being on the same subnet, the coder client still won't connect directly. The coder netcheck output is attached.

message.txt

34 Replies

Codercord•5mo ago

Category

Help needed

Product

N/A

Platform

N/A

Logs

Please post any relevant logs/error messages.

What product are you using?

Phorcys•5mo ago

Is that 1000stun (10.20.20.113) part of your cluster net?

Ah yeah, I assume it's the custom internal STUN server.

What STUN solution are you using?

whizzOP•5mo ago

coturn

But I'm surprised: with the pod inside k8s and coder hosted in k8s, why does it need STUN at all? They can ping each other.

Not really, but I changed it now to a STUN server that is really part of the cluster (same subnet, 10.244.x.x). Same result:

whizzOP•5mo ago

log with STUN in 10.244.x.x range

message.txt
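For readers who want to check a self-hosted STUN server such as coturn independently of coder netcheck, here is a minimal sketch of a STUN Binding Request over UDP, following the RFC 5389 message layout. The function names (`build_binding_request`, `parse_binding_response`, `stun_probe`) and the default port 3478 are assumptions for illustration, not anything Coder or coturn ships; it only handles IPv4 and performs no authentication.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(txn_id=None):
    """Return (packet, txn_id) for a Binding Request with no attributes."""
    txn_id = txn_id or os.urandom(12)
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + txn_id, txn_id


def parse_binding_response(data, txn_id):
    """Return (ip, port) from XOR-MAPPED-ADDRESS, or None if absent/invalid."""
    if len(data) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", data[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE or data[8:20] != txn_id:
        return None
    pos, end = 20, min(len(data), 20 + msg_len)
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", data[pos:pos + 4])
        value = data[pos + 4:pos + 4 + attr_len]
        # Family 0x01 means IPv4; port and address are XORed with the cookie.
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


def stun_probe(host, port=3478, timeout=2.0):
    """Send one Binding Request over UDP; return the reflected (ip, port) or None."""
    packet, txn_id = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (host, port))
            data, _ = sock.recvfrom(2048)
        except OSError:  # timeout, unreachable host, refused port
            return None
    return parse_binding_response(data, txn_id)
```

Calling `stun_probe("10.244.x.x")` with the real server address from inside the client pod would show the address the STUN server sees for that pod. If it matches the pod's own IP, there is no NAT on that path, and a failing direct connection points at something other than STUN.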

whizzOP•5mo ago

Hi @Phorcys, sorry to bother you, any idea where I should look next? 😄

matifali•5mo ago

Maybe @Ethan knows what's happening here?

zounce•5mo ago

I'm honestly not sure - do you have anything that might be blocking UDP?

Actually, the netcheck says udp: true on both.

whizzOP•5mo ago

Within the same subnet inside the cluster? No, I don't think so.

Yes, but what does can_exchange_messages = false mean?

zounce•5mo ago

can_exchange_messages just indicates whether the node was able to send a message over the DERP protocol. Since it's reported on the STUN server node (STUNOnly = true) in this case, it can't possibly be true, so it's unfortunately not relevant or helpful here.

To answer your question in #general: the DERP protocol is WireGuard UDP over TCP, but DERPs listen on the HTTPS port (443), I'm pretty sure.

whizzOP•5mo ago

I see, thanks for the explanation.

Then I would still like to find a solution. I don't understand why node 999 is preferred over 1000, since 1000 is healthy, and why direct connection also doesn't work on the same subnet. Is it because the workspace URL points to a load balancer?

What TCP port?

Phorcys•5mo ago

DERP uses TCP port 443 after upgrading from an HTTP/s connection

whizzOP•5mo ago

That is definitely NOT blocked 😆

Phorcys•5mo ago

make sure you're letting Upgrade: derp through though

whoops

Phorcys•5mo ago

https://coder.com/docs/admin/monitoring/health-check#derp-node-uses-websocket

Coder

Health Check | Coder Docs

Learn about Coder's automated health checks
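As an illustration of what "letting Upgrade: derp through" means, here is a minimal sketch of the HTTP upgrade handshake described in the messages above: the client sends a request with an `Upgrade: derp` header and expects a `101` response that echoes it. The `/derp` path, the host name, and both function names are assumptions made for this example; the sketch only builds and inspects the HTTP text and does not speak the DERP protocol itself.

```python
def build_derp_upgrade_request(host, path="/derp"):
    """Build an HTTP/1.1 request asking the server to switch to DERP.

    The "/derp" default path is an assumption for illustration.
    """
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: Upgrade",
        "Upgrade: derp",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def upgrade_accepted(response_head):
    """Return True if the response head is a 101 that upgrades to DERP."""
    text = response_head.decode("latin-1")
    status_line, _, rest = text.partition("\r\n")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or parts[1] != "101":
        return False
    headers = {}
    for line in rest.split("\r\n"):
        if not line:  # blank line ends the header section
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip().lower()
    return headers.get("upgrade") == "derp"
```

A proxy or ingress that strips the `Upgrade` header would typically turn the `101` into an ordinary response, which `upgrade_accepted` reports as `False`. That is the situation the linked health check entry about WebSocket usage is meant to flag.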

Phorcys•5mo ago

nevermind though, your healthcheck page would be telling you this

whizzOP•5mo ago

You mean for the ingress of the main coder frontend? Or the agent running on the workspace?

Yes, all green there.

Phorcys•5mo ago

it would've been the ingress of the Coder control plane, but I realized that'd be unrelated to your issue anyway 😅

I am not well versed in our networking stack, which is why I'm not answering this thread a lot 😅

but thankfully @zounce is here to take care of you :-)

whizzOP•5mo ago

Thanks for helping so far. It is appreciated 👍

Hi @zounce, is there a way to trace or debug why the client prefers the DERP relay node over a direct connection when they are on the same subnet?

Phorcys•4mo ago

From DMs with @whizz:

By the way, the relay approach kind of introduces a SPoF. We are still interested in getting a direct connection. Not sure what needs to be done next.

Ethan•4mo ago

Unfortunately, being able to establish direct connections doesn't eliminate the dependency on the Coder deployment: it's the control plane for all connections, so it is simply necessary. We do support high availability for premium deployments, however.

There's a short list of diagnostics that might appear at the end of a coder ping <workspace> (many of the issues are listed here), but unfortunately it's just netcheck and that.

whizzOP•3mo ago

Thanks for answering @Ethan. I see. In this case, does a direct connection offer any latency/responsiveness advantage compared to relay?

Yes, I'm aware, still waiting for availability of smaller-tier licenses, as we are below 50 seats. 😉

Our use case is a single-location deployment (at the moment). When outside the intranet, people VPN into this location to access coder.

Ethan•3mo ago

ahhh the VPN is a super important detail

if you run coder ping <workspace> with your VPN on, do you get any warnings about MTU?

whizzOP•3mo ago

No no no, the log I uploaded is without VPN, and the network is on the same subnet

Ethan•3mo ago

ah I see, so you can't get direct in either case

whizzOP•3mo ago

I'm just explaining the end use case 😉

Ethan•3mo ago

in any case, you won't be able to get direct connections over the VPN

whizzOP•3mo ago

Yeah, I figured I'd start simple: deploy a pod and connect to the control plane (both in the same network segment). Somehow still no direct connection. Pod and control plane can ping each other, UDP traffic can pass, etc.

I see, not even with a STUN server and NAT traversal?

Ethan•3mo ago

Yeah, full context is in https://github.com/coder/coder/issues/15523#issuecomment-2480014377

whizzOP•3mo ago

Perhaps before spending more effort, I should confirm this: is there any network performance/latency advantage with direct connect, for intranet usage?🤔

Ethan•3mo ago

yeah you'll definitely see better latency and bandwidth with a direct connection, though how much will vary

whizzOP•3mo ago

Hello, just to give an update. Apparently our cluster had some weird restrictive firewall rule that blocked certain traffic between pods. After removing that, direct connection now works!

For some reason direct connection also works over the VPN, maybe because of the way our VPN is set up: it has a large enough MTU, and each client gets an IP that can route bidirectionally, without NAT, to the coder pod.

Ethan•3mo ago

ah interesting. do you know what sort of traffic was being blocked? we might be able to improve the diagnostics in the product (coder ping) to pick up on that and warn

yeah, if you're not seeing any MTU warnings when doing a coder ping with the deployment available over a VPN, then there shouldn't be any issues with the VPN

a too-small MTU is the biggest reason a lot of customers have issues with direct connections & VPNs

whizzOP•3mo ago

Hi, it's the ephemeral high UDP ports, I think needed by Tailscale. So previously we could ping, but port 43000-something was blocked.
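The resolution above shows that a successful ping (ICMP) says nothing about whether high UDP ports are open. A minimal sketch of a check for that specific case: run a one-shot UDP echo on one pod and probe it from another. The function names and the payload are made up for this example, and the port you test is your own choice (the thread only mentions "43000-something").

```python
import socket
import threading


def udp_echo_once(sock):
    """Answer a single datagram on an already-bound UDP socket."""
    data, peer = sock.recvfrom(2048)
    sock.sendto(data, peer)


def udp_port_reachable(host, port, timeout=2.0):
    """Send a datagram and report whether the same payload comes back."""
    payload = b"direct-connection-probe"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(2048)
        except OSError:  # timeout or ICMP port-unreachable
            return False
    return data == payload
```

On the receiving pod, bind a UDP socket to `("0.0.0.0", <high port>)` and pass it to `udp_echo_once`; on the other pod, call `udp_port_reachable(<pod ip>, <high port>)`. A `False` result while ping still works would be consistent with the kind of firewall rule described here, although a timeout alone cannot tell a dropped request from a dropped reply.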
