
When one go2rtc isn't enough — moving the streaming server to the edge

The symptom that kicked this off was a single camera. One feed, on a sub-stream at 512 kbps, at a branch office I was tunneled into over a site-to-site VPN. The camera wouldn't play. "Stream stalled." Every other camera at that branch loaded on SD; HD was hit-or-miss across the board.

The camera was fine. The NVR was fine. I could pull RTSP from the camera directly when I was sitting at the branch. The problem was that the streaming server — the one converting RTSP to WebRTC for the dashboard — lived in our central office. To watch a camera at a remote branch, the RTSP feed crossed the VPN to the central office, got transcoded, and the WebRTC output crossed the VPN back to me.

Two VPN hops. One of them carrying multi-megabit RTSP. That one camera was just flaky enough over the double hop to consistently stall.

This is the post about how I fixed it, what I broke along the way, and the engineering decision that turned out to be the whole answer.

#The architecture that worked, until it didn't

The dashboard streams dozens of cameras across multiple NVRs spread across multiple branch offices. All of them, originally, funneled through a single go2rtc instance running on a server in the central office:

Browser → VPN → go2rtc (central) → VPN → NVR (branch)

For the cameras that lived at the central office, this was perfect. The NVR and go2rtc were on the same LAN — RTSP was a local connection, WebRTC went out to the browser, no VPN involved on the streaming path. Everything was as fast as it would ever be.

For every other branch, every RTSP stream made the round trip. go2rtc pulled the feed from the remote NVR over the VPN, transcoded if needed, then served the WebRTC output back over the VPN to the viewer. The expensive hop was the RTSP one: multi-megabit, latency-sensitive, no adaptive bitrate. WebRTC was the polite one: it adapts to available bandwidth, drops resolution gracefully, and recovers from packet loss without restarting the stream.

I had it backwards. The forgiving protocol was crossing the VPN; the unforgiving protocol was crossing it too.

#The realization

If go2rtc ran on the same LAN as the NVR, the RTSP pull would be local — fast, reliable, full-bandwidth. Only the WebRTC output would cross the VPN. And WebRTC is the protocol designed to cross unreliable links. It would adapt. It would degrade gracefully on a saturated VPN. It would recover from a moment of packet loss without restarting the whole stream.

One hop instead of two, and the expensive hop becomes the one that's adaptive.

Browser → VPN → go2rtc (branch LAN) → NVR (branch LAN)

This wasn't a software problem. It wasn't a code problem. It was a network topology problem that I'd been trying to fix in the wrong layer for weeks.

The hardware question answered itself. We already had a small i7 box at one of the remote branches running BIND DNS and Chrony for NTP. It was using almost no CPU and had 15 GB of RAM doing nothing. go2rtc is a single static binary that takes about as much memory as htop. It could run alongside DNS without either noticing the other.

#The pilot

The pilot deployment was almost embarrassingly straightforward. SSH in, drop the binary, install ffmpeg, write a YAML config that points at the local NVR's RTSP streams, write a systemd unit, open the API port and the WebRTC port in the firewall, start it.

The whole thing — from the moment I decided to try it to the moment a frame capture confirmed the local go2rtc was serving JPEGs from the local NVR — took about ninety minutes. Most of that was reading the go2rtc docs to make sure I was setting up the API auth correctly. (Spoiler: I wasn't. More on that.)

The config is mostly a list of streams. Each stream gets two entries — a primary RTSP source, and an ffmpeg fallback that re-encodes the source to baseline H.264 if the original is something the browser doesn't speak natively (older IP cameras love to ship H.265 as the default main stream). go2rtc tries the first source; if it fails, it falls back to the ffmpeg pipeline. That's the whole control flow.
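As a rough Go model of that control flow (not go2rtc's actual code), the fallback amounts to "try each source in order, keep the first that opens". The firstWorking function, the open callback, and the example source strings are all hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSourceFailed is a stand-in for any connection or codec failure.
var ErrSourceFailed = errors.New("source failed")

// firstWorking tries each source in order and returns the first one that
// open accepts. open is a hypothetical callback standing in for whatever
// actually connects to the source.
func firstWorking(sources []string, open func(string) error) (string, error) {
	if len(sources) == 0 {
		return "", errors.New("no sources configured")
	}
	var lastErr error
	for _, src := range sources {
		if err := open(src); err != nil {
			lastErr = err // remember the failure and move on to the next source
			continue
		}
		return src, nil
	}
	return "", fmt.Errorf("all %d sources failed, last error: %w", len(sources), lastErr)
}

func main() {
	// Placeholder source strings: a direct RTSP pull, then an ffmpeg re-encode.
	sources := []string{
		"rtsp://nvr.example/cam1/main",
		"ffmpeg:rtsp://nvr.example/cam1/main#video=h264",
	}
	// Pretend the primary fails, for example because it is H.265.
	open := func(src string) error {
		if src == sources[0] {
			return ErrSourceFailed
		}
		return nil
	}
	chosen, err := firstWorking(sources, open)
	fmt.Println(chosen, err)
}
```

The order of the list is the whole policy: putting the pass-through source first means ffmpeg only runs for cameras that need it.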

#The dashboard knew about one go2rtc, not many

The interesting work wasn't on the branch box. It was in the dashboard.

The dashboard already proxied all go2rtc traffic through a single path — /go2rtc/ — that added the dashboard's own auth on top of the streaming server's API. To support per-branch go2rtc instances, I needed three changes:

Per-NVR proxy routes. Each NVR's config now optionally carries a go2rtcUrl field. NVRs with a URL get their own reverse proxy registered at startup at a per-NVR path — /go2rtc-nvr/{branchId}/. NVRs without a URL fall through to the central proxy.

Per-NVR API clients. The Go server uses a typed client for stream registration, health checks, and frame captures. NVRs with a local go2rtc get their own client instance with their own auth credentials. The credentials are parsed from the URL — http://user:pass@host:port — at config load.

Skip central registration. The dashboard's startup loop registers all known streams with the central go2rtc. NVRs with a local go2rtc don't need that — their streams are already configured in the local YAML. The loop now skips any NVR with a Go2RTCURL set.
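Sketched in Go, the routing and the skip logic might look roughly like this. The NVR struct, the function names, and the example address are assumptions for illustration; only the optional go2rtc URL and the /go2rtc/ and /go2rtc-nvr/{branchId}/ paths come from the setup described above. Note that this sketch does not forward credentials embedded in the URL, and the dashboard's own auth middleware is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// NVR is a hypothetical config entry. Go2RTCURL is empty when the NVR
// still streams through the central go2rtc.
type NVR struct {
	ID        string
	Go2RTCURL string
	Streams   []string
}

const centralProxyPath = "/go2rtc/"

// proxyPathFor returns the dashboard path a camera tile should use.
func proxyPathFor(n NVR) string {
	if n.Go2RTCURL == "" {
		return centralProxyPath
	}
	return "/go2rtc-nvr/" + n.ID + "/"
}

// registerProxies mounts one reverse proxy per NVR that has its own go2rtc.
func registerProxies(mux *http.ServeMux, nvrs []NVR) error {
	for _, n := range nvrs {
		if n.Go2RTCURL == "" {
			continue // falls through to the central proxy
		}
		target, err := url.Parse(n.Go2RTCURL)
		if err != nil {
			return fmt.Errorf("nvr %s: %w", n.ID, err)
		}
		prefix := proxyPathFor(n)
		proxy := httputil.NewSingleHostReverseProxy(target)
		// Strip the per-NVR prefix so go2rtc sees plain /api/... paths.
		mux.Handle(prefix, http.StripPrefix(strings.TrimSuffix(prefix, "/"), proxy))
	}
	return nil
}

// streamsForCentral lists the streams the startup loop should still
// register with the central go2rtc; NVRs with a local instance are skipped.
func streamsForCentral(nvrs []NVR) []string {
	var out []string
	for _, n := range nvrs {
		if n.Go2RTCURL != "" {
			continue
		}
		out = append(out, n.Streams...)
	}
	return out
}

func main() {
	nvrs := []NVR{
		{ID: "central", Streams: []string{"lobby", "dock"}},
		{ID: "branch-a", Go2RTCURL: "http://10.0.5.20:1984", Streams: []string{"gate"}},
	}
	mux := http.NewServeMux()
	if err := registerProxies(mux, nvrs); err != nil {
		panic(err)
	}
	fmt.Println(proxyPathFor(nvrs[1]), streamsForCentral(nvrs))
}
```

Keeping the "does this NVR have its own go2rtc?" check in one field means the proxy path, the client choice, and the registration skip cannot drift apart.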

On the front end, each camera tile carries its own proxy path as a data-proxy attribute. The grid view supports cameras from multiple NVRs in the same grid, so the JavaScript reads each tile's proxy individually:

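function initGridStream(el) {
    // Per-tile proxy path from data-proxy; fall back to the central go2rtc proxy.
    var proxy = el.dataset.proxy || defaultGo2rtcProxy;
    // Resolve against the page URL so the stream endpoint stays same-origin.
    el.src = new URL(proxy + '/api/ws?src=' + encodeURIComponent(el.dataset.stream), location.href);
}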

That's the whole client-side change. One line decides which streaming server a tile talks to, and the rest of the player code is identical.

#The auth problem

First production deploy: black screens on every camera at the branch. The dashboard could ping the branch go2rtc box. It could even reach the API endpoint. It just couldn't get any frames.

I'd added basic auth to the branch go2rtc's web UI as an afterthought — "just to avoid someone wandering in if they get to the IP." That auth turned out to apply to the API too, which I hadn't tested for, because in development I'd been running everything on localhost without auth. The dashboard's API client wasn't sending an Authorization header. The streaming server was politely refusing every request.

The fix was two changes:

The API client needed to parse credentials from the URL and send basic auth on every outbound request. Easy.

The WebSocket proxy needed it too. This one was less obvious — the proxy builds raw HTTP upgrade requests by hand to negotiate the WebSocket handshake. I had to inject the Authorization header into the upgrade request explicitly. Once I did, the WebSocket negotiation succeeded and frames started flowing.
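A minimal Go sketch of both fixes, under stated assumptions: splitCredentials and upgradeRequest are hypothetical helper names, the address is an example, and the handshake shown is the generic RFC 6455 request rather than the proxy's real code. The point is only that the same Authorization value has to reach both the API calls and the hand-built upgrade request.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
	"strings"
)

// splitCredentials separates http://user:pass@host:port into a
// credential-free base URL and a ready-to-send Authorization value.
// The returned header is empty when the URL carries no userinfo.
func splitCredentials(raw string) (base, authHeader string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if u.User != nil {
		pass, _ := u.User.Password()
		token := base64.StdEncoding.EncodeToString([]byte(u.User.Username() + ":" + pass))
		authHeader = "Basic " + token
		u.User = nil // keep credentials out of logs and proxied URLs
	}
	return u.String(), authHeader, nil
}

// upgradeRequest builds a raw WebSocket handshake by hand. key is the
// Sec-WebSocket-Key; real code should generate a fresh random one.
func upgradeRequest(host, path, key, authHeader string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "GET %s HTTP/1.1\r\n", path)
	fmt.Fprintf(&b, "Host: %s\r\n", host)
	b.WriteString("Upgrade: websocket\r\n")
	b.WriteString("Connection: Upgrade\r\n")
	fmt.Fprintf(&b, "Sec-WebSocket-Key: %s\r\n", key)
	b.WriteString("Sec-WebSocket-Version: 13\r\n")
	if authHeader != "" {
		// The line that was missing: without it the upstream answers 401.
		fmt.Fprintf(&b, "Authorization: %s\r\n", authHeader)
	}
	b.WriteString("\r\n")
	return b.String()
}

func main() {
	base, auth, err := splitCredentials("http://user:pass@10.0.5.20:1984")
	if err != nil {
		panic(err)
	}
	fmt.Println(base)
	fmt.Print(upgradeRequest("10.0.5.20:1984", "/api/ws?src=gate", "dGhlIHNhbXBsZSBub25jZQ==", auth))
}
```

For ordinary API requests the same value can be attached with req.Header.Set("Authorization", authHeader) before the request is sent.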

The deeper lesson here was about the dev-prod gap. I had verified everything worked from one server. I deployed to a different server with different VPN routing and different auth assumptions, and got black screens. Test from production's network path, not your own. Especially when the difference between paths is a VPN.

#The performance tuning

The first connections were slow. Five or six seconds of "Connecting to stream..." before video appeared. Tracing it, the cold-start chain was:

  1. RTSP handshake — 1-2 seconds
  2. Wait for a keyframe from the NVR — 1-4 seconds, depending on the NVR's GOP
  3. ffmpeg transcode startup if the source is H.265 — another second or two

The second one was the biggest. The browser cannot render anything until it gets the first I-frame, and the NVR was set to a long GOP: keyframes every 50 to 100 frames at 30 fps, so a worst case of a little over three seconds of waiting. That's an NVR-side fix, not a software one. Reducing the keyframe interval to 30 frames (one per second) cut the wait in half at the cost of slightly more storage. Worth it.
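The arithmetic is simple enough to check in a few lines of Go. This is a sketch that assumes a constant frame rate and a viewer who connects just after a keyframe has gone by; the function name is made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseKeyframeWait returns how long a new viewer waits for the next
// I-frame if they connect right after one was sent: GOP length / frame rate.
func worstCaseKeyframeWait(gopFrames int, fps float64) time.Duration {
	return time.Duration(float64(gopFrames) / fps * float64(time.Second))
}

func main() {
	for _, gop := range []int{100, 50, 30} {
		wait := worstCaseKeyframeWait(gop, 30).Round(10 * time.Millisecond)
		fmt.Printf("GOP %3d at 30 fps: up to %v\n", gop, wait)
	}
}
```

A viewer who connects at a random moment waits about half the worst case on average, which is why the shorter GOP is noticeable on every cold start and not only the unlucky ones.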

The ffmpeg startup was the next biggest. The default ffmpeg preset accumulates frames before encoding to do quality analysis. For live streaming, that buffering is the enemy. The right preset is:

ffmpeg:
  h264: "-c:v libx264 -g 30 -preset ultrafast -tune zerolatency"

-preset ultrafast skips the analysis, encoding immediately. -tune zerolatency disables look-ahead and frame reordering. -g 30 matches the NVR's new keyframe cadence so the encoder isn't inserting redundant keyframes the source already provided. The net effect was about a second off the cold-start time across every stream.

On the client side I dropped the stall-detection timeout from five seconds to three. If a stream hasn't produced a frame in three seconds, the player reconnects automatically rather than sitting on an empty <video> element. The loading spinner timeout came down accordingly.

End result: cold-start time on a remote branch went from five-or-six seconds to roughly two. Reconnect time after a network hiccup went from ten seconds to four. Two changes I should have made on day one.

#The ghost configs

After the local go2rtc was handling its branch, the central server still had every one of that branch's stream definitions in its YAML config. go2rtc connects to RTSP on demand, so it wouldn't have actively pulled them, but the definitions were still there, and any stale reference in the dashboard could have triggered a pull over the VPN through the central server, exactly the path I was trying to eliminate.

I removed the migrated streams from the central config and restarted. Total stream count on the central server dropped accordingly.

One gotcha caught me here. The go2rtc systemd service on the staging box pointed at a config file from a different project's directory — left over from an older deployment. Both files had the same stream definitions. I had to clean both, restart the right service, and verify the central server was no longer trying to reach those branches over the VPN.

Stale configs are a quiet trap. They don't break anything immediately. They just create paths through your system that nobody remembers exist, and the next person debugging a problem assumes that path is still load-bearing.
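One cheap guard is to diff the stream names still defined centrally against the ones that have moved to a branch. A minimal Go sketch, assuming both lists have already been read out of the respective configs; staleStreams and the example names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// staleStreams returns the stream names that are still defined on the
// central server even though they now live on a branch instance.
func staleStreams(central, migrated []string) []string {
	moved := make(map[string]bool, len(migrated))
	for _, name := range migrated {
		moved[name] = true
	}
	var stale []string
	for _, name := range central {
		if moved[name] {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale) // stable output for logs and diffs
	return stale
}

func main() {
	central := []string{"lobby", "gate", "dock", "yard"}
	migrated := []string{"yard", "gate"}
	fmt.Println(staleStreams(central, migrated))
}
```

Run against every config file the service could be pointed at, a check like this would have flagged the second, forgotten file as well.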

#What I learned

The VPN is the bottleneck, not the code. I spent weeks before this trying to optimize transcoding settings, browser-side reconnect logic, and cache strategies. None of that addressed the actual problem, because the actual problem was every remote stream crossing a flaky VPN twice, once of them as a multi-megabit RTSP feed that cannot adapt. Architecture eats optimization for breakfast. Move the work to the right place and the optimization need goes away.

Adaptive protocols cross flaky networks; non-adaptive ones don't. WebRTC is forgiving. RTSP is not. If your topology forces one of them to cross the slow link, make sure it's the forgiving one. This sounds obvious and isn't — most camera dashboards I've seen put the streaming server in the central office because that's where the dashboard is, not because that's where the streams should originate.

Auth-on-by-default beats auth-when-you-remember. I added basic auth to the branch go2rtc as an afterthought, and it broke the first deploy. But after the fix, having auth on every box from day one is plainly correct. An unauthenticated streaming API on the branch network is the kind of thing that ends up in a "how I got into a corporate network" conference talk.

Test from production's network path, not your own. I verified everything worked from staging, deployed to production, and got black screens. Staging had different VPN routing. The fix was obvious once I tested from production. The lesson is that "it works" is meaningless without specifying "from where, with which routing, against which credentials."

Don't leave ghost configs behind. Old stream definitions on the central server. Old systemd unit files pointing at the wrong directory. Old DNS entries for retired branches. After a migration, clean up the source you migrated from, or someone will spend a Saturday debugging a problem that was solved months ago.

#The state now

One branch is running on its own go2rtc instance. The streams that used to cross the VPN twice now cross it once, in the protocol that's designed for it. Cold-start time is less than half what it was. The flaky camera that started this whole project plays cleanly.

The remaining branches are queued for the same migration. The current one is roughly two hours of work: fifteen minutes for the deployment itself, and the rest for the deploy guide I'm now writing while the gotchas are fresh. The next branch should take fifteen minutes. The one after that, ten.

The architecture diagram I started with has changed. There's no longer a single streaming server doing all the work. There's one per site, each handling its local NVR, all funneled into a dashboard that knows where each stream lives. The dashboard is the thin coordinating layer. The streaming work happens at the edge, where the streams are.

That was the entire fix.

