This work was led by my student Sadjad Fouladi. If you liked Salsify, you might really like Sadjad's talk in June at USENIX ATC about "gg", his system for letting people use AWS Lambda as a rented supercomputer (e.g. he can compile inkscape and Chromium really fast by outsourcing the computation to 8,000 Lambda nodes that all talk to each other directly over UDP): https://www.youtube.com/watch?v=Cc_MVldSijA (code here: https://github.com/StanfordSNR/gg)
You might also be interested in our current video project, led by my student Francis Yan, on trying to improve live video streaming. If you visit and watch some live TV you can help contribute to our study: https://puffer.stanford.edu
Amazing work, I am really impressed with what you are doing. I have never found any good reading in this area besides your work; the only things I have seen are GStreamer's rtpjitterbuffer and libwebrtc, and I haven't really felt confident in what I have learned from either.
Have you thought about testing/comparing against other WebRTC implementations? It makes me a little sad to see WebRTC get a bad name just because one implementation has issues. WebRTC is a huge opportunity to get companies to invest in one thing, instead of reinventing the wheel and locking people in. Do you think things could be improved and eventually match Salsify?
Somewhat unrelated, but I am working on Pion WebRTC. By design it is not coupled tightly with the encoder/decoder (though I want to give feedback to the user so it could influence it if they wanted). Do you have any suggested readings, and would you be OK with me contacting you directly? I really want to build something elegant that allows people to build amazing things with WebRTC. Right now I have been having people do things manually, and I don't want to ship something without doing my due diligence.
My main advice for implementers would be, benchmark your end-to-end glass-to-glass video latency (including the time spent waiting for the next frame to be captured & encoded, and then the end-to-end latency through to the display of that frame) over varied/unpredictable networks. In my experience, implementers sometimes get caught up focusing too much on network-layer measurements (IP latency) and can end up missing what in my view is the bottom line: glass-to-glass video latency. You can use our mahimahi mm-delay/mm-link/mm-onoff/mm-loss tools (part of Debian/Ubuntu) and the included traces to model some bad networks. And then see how you do on the same traces we use in our paper. If you can make a plot like our Figure 8(a) and it all looks good, that seems like good progress in my book.
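For concreteness, here is one way to boil that benchmark down to a single number: a small Python helper (the function name and the nearest-rank percentile choice are mine, not anything from mahimahi or Salsify) that turns per-frame (capture, display) timestamp pairs into the p95 glass-to-glass latency you could compare across traces:

```python
import math

def glass_to_glass_p95(frames):
    """frames: iterable of (capture_ts, display_ts) pairs in seconds.
    Returns the 95th-percentile glass-to-glass latency in milliseconds."""
    latencies = sorted((display - capture) * 1000.0
                       for capture, display in frames)
    rank = math.ceil(0.95 * len(latencies))  # nearest-rank percentile
    return latencies[rank - 1]
```

You'd collect the timestamp pairs while the sender or receiver runs inside a mahimahi shell (e.g., mm-delay nested with mm-link over one of the included traces), then compare the resulting number across traces and implementations.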
'Course, I still kind of miss scan-lines, and had hoped we'd finally all have enough bandwidth to just blast raw frames over the network by now.
(1) Are there any benefits over H.264/RTP/UDP in a point-to-point streaming scenario (like two non-stationary nodes directly connected via WiFi)?
(2) If it were implemented on fast enough hardware, are there any obstacles to achieving sub-20 ms latency in a scenario like the one described in (1)?
(1) Salsify is mostly about how you control the video encoder (e.g., a VP8/VP9/AV1/H.264/H.265 encoder) and transport protocol (e.g., RTP/UDP). H.264, RTP, and UDP themselves don't say anything about the control part, i.e., how to (a) estimate the network path's varying capacity, (b) adapt the desired frame size to match that capacity, and then (c) actually encode a frame of video to match the desired compressed frame size or bitrate.
If you have an unpredictable/variable network path, Salsify is probably better than the control strategies in systems with less tightly controlled encoders and transports, e.g. WebRTC.org/Chrome, Hangouts, Skype, or FaceTime. If the network is mostly constant or at least predictable, Salsify's not going to be helpful. So, bottom line is "maybe."
(2) Getting 20ms p95 glass-to-glass latency with compressed digital video is pretty difficult even under ideal circumstances, and basically impossible with a consumer webcam. At 60 Hz, just the interval between two frames is already 17 ms! So if you have one frame's worth of buffer between the camera/USB/encoder/sender and a frame's worth of buffer in the receiver/video card/display, you're already toast. You would really have to work hard to pipeline everything. Even the very best gaming monitors have latencies on the order of 10 milliseconds, and it's hard to buy a USB camera that even starts giving you any bits from the picture before the exposure is over. (And v4l2 doesn't return from VIDIOC_DQBUF until the frame is completely received, so you're probably looking at changing the kernel if you want to use UVC.) So <20ms I think is really hard. DJI just released an end-to-end custom-engineered system that claims 28ms latency (https://www.dji.com/fpv/info#specs).
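To make the budget arithmetic concrete, a quick back-of-the-envelope using only the figures above (the two-stage breakdown is my own rough simplification, not a measured pipeline):

```python
# Rough latency budget (all values in ms); stages beyond these two --
# camera readout, encode, network, decode -- aren't even counted yet.
FRAME_INTERVAL = 1000.0 / 60.0  # ~16.7 ms between frames at 60 Hz

budget = {
    "worst-case wait for the next capture": FRAME_INTERVAL,
    "display latency (very good gaming monitor)": 10.0,
}
total = sum(budget.values())
# total is already ~26.7 ms: past a 20 ms target before the camera
# readout, encoder, network, and decoder have contributed anything.
```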
If you look at Figure 8(a) from the Salsify paper, Chrome (using the WebRTC.org codebase) is getting per-frame latencies on the order of 600 milliseconds even when the network is totally constant and perfect, and Salsify's per-frame latencies are around 250 milliseconds but less consistent. Then you have to factor in the extra latency associated with waiting for the next frame -- if these systems are running at 10 fps (look at the extreme sparsity of the dots after the network hiccup, especially for WebRTC), there's 100 ms of extra glass-to-glass latency right there. Getting to <20ms @p95 is just another world from where these programs are today.
AFAIK codecs exposing rate control functionality to applications is not new, and most of WebRTC is functionally equivalent to any protocol of its sort; so before I go reading your paper it'd be nice to know the reasoning behind the seeming ground-up approach.
Salsify is mainly about the benefits you can get if you (a) extract the control loop out of the video codec, and make it expose a functional-style API, (b) use a Sprout-like congestion control algorithm that tries to follow evolving network capacity quickly and estimate "how many bytes can we send right now while trying to maintain a bound on end-to-end delay", and (c) have a single control loop that works every frame and has the choice of whether to send (1) a frame whose coded length is already known and is about equal to what we think the network can handle [it is very hard to get this from any existing video encoder on a single pass!], or (2) no frame at all.
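A minimal sketch of what that single per-frame control loop might look like, covering (b) and (c); the crude linear drain model and all the names here are my assumptions, not Salsify's actual algorithm:

```python
def choose_frame(rate_bytes_per_s, queued_bytes, delay_budget_s,
                 bigger, smaller):
    """bigger/smaller: candidate coded frames as byte strings (or None).
    Returns the frame to send this tick, or None to send no frame at all."""
    # (b) estimate how many bytes the path can absorb right now while
    # keeping end-to-end delay under the budget (simple drain model)
    drain_time = queued_bytes / rate_bytes_per_s
    headroom = max(0.0, delay_budget_s - drain_time)
    budget = int(rate_bytes_per_s * headroom)
    # (c) send the higher-quality candidate if its (already-known) length
    # fits, else the smaller one, else nothing at all this frame
    for candidate in (bigger, smaller):
        if candidate is not None and len(candidate) <= budget:
            return candidate
    return None
```

The key point the sketch tries to capture is that the decision is made after both candidate lengths are known exactly, and "send nothing" is a first-class outcome.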
You can certainly do this within the WebRTC protocol, but doing it within the WebRTC.org codebase is going to be a lot of work. :-( I think unfortunately the codebase has some pretty deeply ingrained assumptions about how the control flow is going to go. Inverting that (as we propose), I don't think is an easy incremental change.
Now, you might ask whether there are some incremental improvements you could make to WebRTC.org to get 80% of the benefits of Salsify without a major refactor. E.g., maybe you don't need the functional API if you just make some tweaks to the rate-control algorithm and congestion control. We don't know for sure, but I suspect that for somebody already experienced with that codebase, there probably are gains to be had. There's another question about whether the gains of Salsify (which are on a particular type of flaky cellular network) might come at the expense of costs on other types of (dependable?) networks. To be confident about that we'd probably need to try this stuff much more broadly and on real people (this is the motivation for Puffer).
I guess a complementary question is in order: since you believe that the WebRTC protocol itself is not incompatible with the mechanisms that enable Salsify's performance, how much work would it be to adapt Salsify incrementally into a WebRTC implementation? (Ignoring mandatory codec support for a moment).
I've been developing an Opus codec mode for Bluetooth A2DP, and I have a similar sort of situation. I've been looking (casually) into using more interesting rate control based on channel performance to improve QoS. I think Opus has a property you could only dream of: strict limits on frame size. :-)
Added: just found that the video codec uses the VP8 bitstream; that was not immediately clear to me. It seems to me that is a substantial selling point that should be made clearer on the webpage.
Since your interface doesn't require any change to the bitstream of a conventional codec, it's a heck of a lot closer to public use than I initially thought!
I don't think that means that a Salsify sender would be interoperable with, like, a Chrome receiver, though (even though we're just using VP8) -- we'd have to implement at least the receiver side of Salsify's congestion-control protocol inside WebRTC.org/Chrome, and Salsify sometimes likes to encode a VP8 frame that has to be interpreted relative to a certain (prior) decoder state.
On #2, honestly I've worked with libopus a bit for Puffer and it's just really pleasant to work with. As you say, a lot of the difficulties of interacting with a video encoder you just don't have in this context. It's also pretty easy to get "gapless/clickless" back-to-back playback of audio excerpts that were encoded completely independently, unlike with video where this is a huge pain in the neck and usually requires a SAP/IDR/closed GOP/sequence header+I-frame+P-frame (which takes a lot of bytes so you can't do it very often). See https://github.com/StanfordSNR/puffer/blob/master/src/opus-e... if you are interested for more.
One of the challenges here is that it's hard to know how big a compressed frame will be until you compress it, and that takes time. Ideally an encoder would be able to produce a compressed frame that's the exact size that can be accommodated by the network path, right away. In practice that's not so easy. Salsify gets around this by having the encoder encode two candidate coded frames (one bigger, one smaller) and then giving the application the option of sending either one (after encoding is finished and it knows their exact length and quality) or no frame at all. If the encoder could just do the job perfectly and instantly the first time, you wouldn't need all that.
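To illustrate the shape of that idea: with a functional-style encoder -- a pure function from (state, raw frame, quality) to (coded bytes, new state) -- the sender can encode the same frame at two quality settings and know each candidate's exact length before committing to either. The `encode` below is a toy stand-in, not the real VP8 encoder:

```python
def encode(state, raw_frame, quality):
    """Toy stand-in codec: higher quality keeps more bytes per frame.
    Pure function: returns the coded frame and a new state, leaving the
    old state untouched (so either successor state can be adopted)."""
    coded = bytes(raw_frame[: max(1, int(len(raw_frame) * quality))])
    return coded, state + [len(coded)]

def two_candidates(state, raw_frame, lo_q=0.25, hi_q=0.5):
    smaller, state_lo = encode(state, raw_frame, lo_q)
    bigger, state_hi = encode(state, raw_frame, hi_q)
    # Both lengths are now known exactly; the control loop can pick one
    # (adopting the matching successor state) or send nothing at all.
    return (smaller, state_lo), (bigger, state_hi)
```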
Like maybe salsify.com?
So by toggling the pause / play button, you get the video control in a state that finally mirrors the browser's state.
So it's only natural that it should be possible to greatly improve upon what we're using today. I'm glad to see advances in this field, because it's sad that video calls are still a hugely worse experience compared with good old phone calls...
I'm taking guitar lessons over Skype and for the most part it works just fine. However, every once in a while we have to resort to recording and sending snippets over the wire to get the required fidelity. With digital amp simulators this is thankfully a trivial exercise, but it would be great to do without.
I've looked for other alternatives, but couldn't really find any that fit the bill. The last thing I looked at was a few Audio over IP products, but all of them were designed to be run on a LAN.
I wonder if we can do this with WebSockets? (The decoder might need to be in WASM for performance to be worthwhile.)
RTSP is mostly about control and framing. It doesn't specify any particular algorithm to estimate the network's capacity, how a video encoder should try to match that estimated capacity, or how to recover from lost packets.
H.265 is a format for compressed video that defines a bit-exact decoder. It doesn't specify the way that an encoder encodes anything into the compressed format, how the encoder should try to match an externally supplied target frame size / bitrate target (while also meeting a latency target), or what the API should be.
The Salsify techniques could work fine with RTSP, and could work fine using H.265 as the coded video format. The special thing about Salsify is really about where the control lies.
Traditionally (in Skype, FaceTime, or the WebRTC.org codebase), there is a drop-in codec with its own control loop (making frame-by-frame decisions), and a congestion-control protocol with its own control loop (making packet-by-packet decisions), and these control loops are at close enough timescales that they end up doing poorly together. And the API to the codec is generally too limited (especially when it's a very general API, as in WebRTC.org, that tries to abstract across pre-existing implementations of VP9/H.264/H.265 to give the application agility across different formats) to achieve the kind of rapid adaptation to network flakiness that you need over these cellular or bad Wi-Fi networks.
Salsify basically says, "hey, if your codec supports a functional-style API, and your transport protocol too, and you can extract the long-lived control state from each of those individual modules and just have one control loop that jointly controls both the codec module and the transport module, you can do a heck of a lot better."
More to the point, we don't have an empirical end-to-end measurement of Zoom the way we do for Skype, FaceTime, Google Hangouts, and WebRTC-in-Chrome (with and without VP9-SVC). But my understanding is that Zoom is architected similarly to those systems and can be expected to behave within the same envelope. Would need to measure it to know for sure.
On the other hand, Zoom is an actual product with users, and Salsify is a research prototype that doesn't even have audio, much less users. So hard to compare outside the narrow technical questions of video quality and latency over imperfect networks.
Not really a breakthrough though, JPEG is just an old format. But maybe machine learning stuff will get the 4x video compression seen in the show.