Lyra audio codec enables high-quality voice calls at 3 kbps bitrate (cnx-software.com)
bscphil 1138 days ago [-]
I've just taken a minute to confirm in Audacity what my ears told me. Please have a look at this screenshot: https://cloudflare-ipfs.com/ipfs/Qma41RMzieQ6ZGdGem9rLxnxEL1...

The Lyra version is clearly much louder. This is a serious problem, and it's borderline reasonable to call it "cheating".

It's well known in the audio biz that if you ask people to compare two experiences, and one of them is a bit louder than the other, people will say that the louder one was better, or came through more clearly, or whatever it is you're trying to market for. For the purpose of comparing artifacts in two samples, it's absolutely crucial that they be the same volume. You might as well compare two image compression codecs where one of them "enhances" the colors of the original image.

Note: I took the clips for this comparison from the "clean speech" examples at the original source on Googleblog, not the blogspam.

adrianmonk 1138 days ago [-]
Definitely a real effect, but it seems like Google accounted for that in their listening tests.

The Google blog post links to the Lyra paper[1], and Section 5.2 of the paper says:

> To evaluate the absolute quality of the different systems on different SNRs a Mean Opinion Score (MOS) listening test was performed. Except for data collection, we followed the ITU-T P.800 (ACR) recommendation.

You can download those ITU test procedures[2], and skimming through that, it does mention making "the necessary gain adjustments, so as to bring each group of sentences to the standardized active speech level" and a 1000 Hz calibration test tone related to that. (See sections B.1.7 and B.1.8.)

So, if I skimmed correctly, and if the ITU's method of distilling speech loudness into a single number is an effective way to match the volume levels[3], then it seems like they did what they could to avoid cheating at the listening tests.

It is still interesting that Lyra makes things louder, though.

---

[1] https://arxiv.org/pdf/2102.09660.pdf

[2] https://www.itu.int/rec/T-REC-P.800-199608-I

[3] and even for speech that passes through different codecs before its loudness is determined

bscphil 1138 days ago [-]
That's good information, thanks. My comment is mostly directed at the misleading blog post. I have no direct reason to believe that the study itself was compromised, though it would be great to have confirmation from the authors that it was not.

The part about matching volume levels in the ITU recommendation seems to be talking about making sure the source recordings were balanced. All their clips might well have been exactly at the ITU recommended level of -26 dB, but if Lyra introduced a level mismatch this would have to have been corrected at a later stage, and it's at least possible that it might not have been. The Lyra paper does explicitly say that they didn't follow the ITU rec for "data collection".

Interestingly, the Opus and Reference sources are almost exactly -26 dB relative to full scale (according to several measurements of loudness), but the Lyra clip is about 6 dB hotter. So the source (the reference clip) exactly follows the ITU rec. Did they remember to fix the levels on the Lyra clips? I hope so!

rectang 1138 days ago [-]
Excellent catch. To be precise you need a measure of perceptual loudness rather than raw waveform excursion, but I would expect the results to be in line with what you've found.

> It's well known in the audio biz that if you ask people to compare two experiences, and one of them is a bit louder than the other, people will say that the louder one was better, or came through more clearly, or whatever it is you're trying to market for.

As a former mastering engineer, you're absolutely right that this is well understood in the audio industry. I used to present my clients with level-matched comparisons of source audio vs. processed so they would understand exactly what was being done, aesthetically.

bscphil 1138 days ago [-]
Here's an EBU R 128 measure using r128gain:

    File 'reference.flac': loudness = -25.7 LUFS, sample peak = -9.2 dBFS
    File 'lyra.flac': loudness = -19.9 LUFS, sample peak = -4.1 dBFS
    File 'opus.flac': loudness = -25.9 LUFS, sample peak = -9.7 dBFS
So that also matches pretty closely what my ears heard.
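If anyone wants to double-check the numbers or listen to a level-matched copy, here's a rough Python sketch using pyloudnorm and soundfile (the filenames are just my local ones, and this isn't the exact tool used above, so treat it as illustrative):

    # Sketch only: measure integrated loudness and level-match the Lyra clip.
    # Assumes pyloudnorm and soundfile are installed and these FLACs exist locally.
    import soundfile as sf
    import pyloudnorm as pyln

    def loudness(path):
        data, rate = sf.read(path)           # float samples + sample rate
        meter = pyln.Meter(rate)              # ITU-R BS.1770 / EBU R 128 meter
        return data, rate, meter.integrated_loudness(data)

    for name in ("reference.flac", "lyra.flac", "opus.flac"):
        _, _, lufs = loudness(name)
        print(f"{name}: {lufs:.1f} LUFS")

    # Level-match the Lyra clip to the reference before comparing by ear
    ref, _, ref_lufs = loudness("reference.flac")
    lyra, rate, lyra_lufs = loudness("lyra.flac")
    sf.write("lyra_matched.flac",
             pyln.normalize.loudness(lyra, lyra_lufs, ref_lufs), rate)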
ChuckMcM 1138 days ago [-]
So nearly 6dB louder. That is quite a bit.
bscphil 1138 days ago [-]
Yes. Assuming this was done by the Lyra encoder directly, and not the person who wrote the blog post pushing the slider, you have to wonder how it would respond to an input with a peak around -3 dB. Would it clip? Is it performing some kind of normalization? Who knows!

It's also interesting that the Lyra clip is ever so slightly longer than the other two. The Opus clip has exactly the same number of samples as the reference. Maybe they didn't use a decoder for Lyra at all, just played the file on one system and recorded it using a line-in on another?

ChuckMcM 1137 days ago [-]
Well, the blog post states they use a generative model. If that means what I think it means, they are doing in audio what folks have done in images: sketch where a rabbit should be and have the model generate a rabbit there. Great encoding, because the notion of 'rabbit-ness' is in the model, not the data.

Again, assuming I understand correctly, that isn't a "transcoder"; it's a "here is a seed, generate what this seed creates" kind of thing.

Another way to look at it would be to think about text to speech. That takes words, as characters, and applies a model for how the speech is spoken, and generates audio. You could think of that as really low bit rate audio but the result doesn't "sound" like the person who wrote the text, it sounds like the model. If instead, you did speech to text and captured the person's timbre and allophones as a model, sent the model and then sent the text, you would get what they said, and it would sound like them.

It is a pretty neat trick if they are doing it that way, since it seems reasonably obvious for speech that if you could do it this way, the combination of model deltas and phonemes would be a VERY dense encoding.

cycomanic 1137 days ago [-]
From that I would naively expect that the performance of the codec could be very language dependent.

It would be interesting to see how well it does in other languages.

ChuckMcM 1137 days ago [-]
But that is the thing: what if it isn't a codec? What if it is simply a set of model parameters, a generative model, and a stream of fiducial bits which trigger the model? We have already seen some of this with generative models that let you generate voices that sound like the speaker data used to train the model, right? What if, instead of say "i-frames" (or whatever their equivalent would be in an audio codec), you sent "m-frames" which were tweaks to the generative model for the next few bits of data.
p1esk 1137 days ago [-]
I think he's saying that if it is in fact a generative model, we will see significant differences when we try different languages.
ChuckMcM 1137 days ago [-]
I think I understand what he is saying, what I am struggling with is why would a 'sound' GAN care about different languages when an 'image' GAN doesn't care about different images?

What I'm getting at is this, do they use the sample as a training data set with a streamlined model generation algorithm so that they can send new initial model parameters as a blob before the rest of the data arrives?

It has my head spinning but the possibilities seem pretty tantalizing here.

p1esk 1137 days ago [-]
I think you would agree that a GAN, or any generative model, can only generate something in the same domain as what it was trained on. If you trained it mostly on human faces with a little bit of rabbits, it's not going to generate rabbits well. If you trained it mostly on English text and a little on Mandarin, it's not going to generate good text in Mandarin. Same with sounds. Different languages use different sounds.

If they use any generative model in their codec, they had to train it first, offline, on some dataset. They can't possibly train it equally well on all languages, so we should be able to tell the difference in quality when comparing English to more exotic languages.

ChuckMcM 1137 days ago [-]
I agree with you 100%! This is where I am wondering:

> If they use any generative model in their codec, they had to train it first, offline, on some dataset.

One thing I'm wondering is whether they have a model that can be "retrained" on the fly.

Let's assume for this discussion that you've got a model with 1024 weights in it. You train it on spoken text, all languages; just throw anything at it that is speech. That gets you a generalized model that isn't specialized for any particular kind of speech, and the results will be predictably mixed when you generate random speech from it. But then you take it and run a "mini" training pass on just the sample of interest: you have this general model, you digitize the speech, you run it through your trainer, and now the generalized model is better at generating exactly this kind of speech, agreed? So now you take the weights and generate a set of changes from the previous "generic" set, bundle those changes in the header of the data you are sending, and label them appropriately. Now you send only the data bits from the training set that were needed to activate those parts of the model that are updated. Your data product becomes (<model deltas>, <sound deltas>).

What I'm wondering is this: if every digitization is used to train the model, and you can send the model deltas in a way that the receiver can incorporate into its local model predictably, can you then send just the essential features of the digitized sound and have the model on the other end (which has incorporated the deltas you sent) regenerate it?
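To make the shape of that idea concrete, here's a purely hypothetical sketch; none of this is how Lyra actually works, and every name in it is made up, it's just the (<model deltas>, <sound deltas>) packet I'm describing:

    # Purely speculative illustration -- NOT Lyra's actual format. Each chunk
    # carries (a) sparse updates to a shared generative model and (b) the
    # compact features needed to drive it on the receiving end.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ModelDelta:
        weight_index: int   # which weight of the shared generative model changed
        new_value: float    # its value after the on-the-fly "mini" training pass

    @dataclass
    class Chunk:
        model_deltas: List[ModelDelta] = field(default_factory=list)
        sound_features: bytes = b""   # quantized features that seed generation

    def apply_deltas(weights: Dict[int, float], deltas: List[ModelDelta]) -> None:
        # Receiver folds the sender's weight changes into its local model copy
        for d in deltas:
            weights[d.weight_index] = d.new_value

    shared_weights = {i: 0.0 for i in range(1024)}   # the "1024 weights" example
    chunk = Chunk([ModelDelta(3, 0.42)], b"\x01\x02\x03")
    apply_deltas(shared_weights, chunk.model_deltas)
    # audio = generate(shared_weights, chunk.sound_features)  # hypothetical decoder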

Here is an analogy for how I'm thinking about this, and it can be completely wrong, just speculating. If you wanted to "transport" a human with the least number of bits you could simply take their DNA and their mental state and transmit THAT to a cloning facility. No need to digitize every scar, every bit of tissue, instead a model is used to regenerate the person and their 'state' is sent as state of mind.

That is clearly science fiction, but some of the GAN models I've played with have this "feel" where they will produce reliably consistent results from the same seed. Not exact results necessarily, but very consistent.

From that, and this article, I'm wondering if they figured out how to compute the 'seed' + 'initial conditions', given the model, that will reproduce what was just digitized. If they have, then it's a pretty amazing result.

p1esk 1136 days ago [-]
What you described could work in principle, but in practice, "mini" training on a single sample is not likely to produce good results, unless the sample is very large. Also, this finetuning would most likely be quite resource intensive. I recall older speech recognition systems where they would ask you to read a specific text sample to adapt the model to your voice, so yes, this can work.

If you can fit a large generative model (e.g. an rnn or a transformer) in the codec, you might be able to offer something like "prompt engineering" [1], where the weights of the model don't change, but the hidden state vectors are adjusted using the current input. So, using your analogy, weights would be DNA, and the hidden state vectors would be the "mental state". By talking to this person you adjust their mental state to hopefully steer the conversation in the right direction.

[1] https://www.gwern.net/GPT-3#prompts-as-programming

perfmode 1138 days ago [-]
how do the samples compare when loudness is made constant/normalized?
bscphil 1138 days ago [-]
I do still prefer Lyra overall, though not as much as some others (see sibling comment). To me, Lyra is cleaner and easier to understand, but the artifacts it introduces are more annoying and fatiguing than those introduced by Opus. Some people in this thread have reported trouble understanding Lyra, which I attribute to the strange artifacts it introduces.
0-_-0 1138 days ago [-]
Just reduce the volume of the lyra one by hand. Doesn't change the fact that it sounds leagues above the others.
tshaddox 1138 days ago [-]
When I was doing amateur audio engineering from my parents' basement 15 years ago this phenomenon was easily noticeable and extremely difficult to avoid, particularly when doing things where the entire point is to change the loudness of everything (one aspect of mastering) or to change the loudness of things relative to other things (mixing). My "solution" was to simply take a long break (perhaps overnight) and see if I still thought the newer version sounded better with clear ears than I remember the old version sounding with clear ears.
fireattack 1138 days ago [-]
Just as a reference

    Title                   RMS     Peak    Diff
    clean_p257_011_lyra     -20.07  -1.13   18.93
    clean_p257_011_opus     -26.07  -6.65   19.41
    clean_p257_011_refer    -25.77  -6.15   19.63
PSD (Welch's method, window=2^13)

https://i.imgur.com/Y8A4kkx.png

dkjaudyeqooe 1138 days ago [-]
Isn't that due to audio (frequency) compression coming out of the generative model?

I guess that can be tweaked either way but they're going to tend towards that exactly because it sounds louder and thus clearer.

bscphil 1138 days ago [-]
There are a couple of effects here:

1. Lossy codecs will use a low-pass filter to get rid of hard-to-compress higher frequencies. This is often inaudible, but even when it is audible, it should lower the volume, not raise it, unless you're applying some kind of compensation for it.

2. It's true that lossy codecs compress different frequencies differently, but that's not usually done in such a way that amounts to applying EQ to the frequencies.

3. Even if the relative balance of frequencies did shift as a result of applying lossy compression, this is still done in a way that the overall loudness of the audio does not change. In this case the Lyra output has changed significantly and in an easily audible way (about +6 dB). You could easily get the same effect in Opus just by amplifying (or applying compression to) the result, but Opus is doing things correctly.

jhoechtl 1137 days ago [-]
I wouldn't call this cheating though. Audio compression makes use of the way we mentally perceive sound. If a sound artefact is perceived as clearer when it is louder, compared to another one at the same compression bitrate but lower volume, I would say this falls into the category of psychoacoustic compression.
rcthompson 1137 days ago [-]
If simply turning up the volume made it easier to understand the speech, then not turning up the volume on the other codecs would make for an unfair comparison.
domoritz 1138 days ago [-]
Off topic but how do you put images on ipfs and what’s the advantage over e.g. Imgur?
bscphil 1138 days ago [-]
Cloudflare are kindly hosting [1] a free HTTP gateway for the IPFS [2] network. So I can host an image myself on a server with IPFS, and Cloudflare will cache it for me. This is better than Imgur because the latter has been redirecting users to annoying "social" pages with ads instead of showing them the actual image, at least in some cases. I also can't be sure whether Imgur recompresses your uploads or not - I assume they usually do.

It's also more generally useful because I can host other files too, not just images.

[1] https://www.cloudflare.com/distributed-web-gateway/

[2] https://ipfs.io/

javajosh 1138 days ago [-]
Is hosting the image yourself, on like a $5 Digital Ocean Droplet and a $10 personal domain, out of the question? This would seem to be the ideal situation in terms of simple, decentralized file hosting solution. What are the downsides of this approach?

(I can imagine a server package that can modify index.html sub-resource URLs depending on current server load, preferring private, locally hosted sub-resources but willing to use 3rd party solutions like Cloudflare, too, if required by a black swan event.)

bscphil 1138 days ago [-]
Out of the question? No. As convenient as running one command on a desktop computer? Also no.

> the ideal situation in terms of simple, decentralized file hosting solution

Not sure what you mean by "decentralized" if you are in fact hosting it yourself.

> What are the downsides of this approach?

Well, for the casual person it has the obvious downside that you have to have your own VPS. Most people don't have those. Even if you do, IPFS has a couple of advantages: you can host images anonymously, and anyone anywhere in the world can "pin" the image to make sure it stays live. If you're using a server and you forget to pay DO your $5 one month, all your images go poof into the ether.

StavrosK 1136 days ago [-]
There's also https://imgz.org, that doesn't have annoying social stuff (I made it specifically for that!).
svnpenn 1137 days ago [-]
I keep getting 524 errors when trying to access files I uploaded. What am I doing wrong?
jeroenhd 1138 days ago [-]
Not OP, and I haven't done it myself yet, but it makes a lot of sense. It's basically free image hosting if you can get the file cached by Cloudflare.

Imgur these days is slow and riddled with ads. A page view will sometimes load many times the image size in Javascript, stylesheets and images. It also doesn't allow the user to just view the raw image, going as far as redirecting requests for the raw image to a web page if you access the URL directly.

The only downside I see is that the URL is less user friendly without the IPFS toolset installed. Sounds like a pretty good idea to me.

Semaphor 1137 days ago [-]
Are these imgur problems a USA thing? Because I literally never had any of the behavior described. Direct image links always go to the image, there is no JS or HTML or anything.
tokamak-teapot 1137 days ago [-]
Yep same here. Maybe the issue is about non-direct links, but it could be that imgur changes what it responds with depending on the request. If the url ends with .jpg it can still serve an HTML page.
superkuh 1138 days ago [-]
If you upload and then link to an image on Imgur and the person clicking the link has not run Imgur's javascript yet within $timeperiod, the image will not display. Instead you'll be given javascript to run.

Cloudflare as a gateway is distasteful and this won't last long, but for now at least when you click an ipfs image over cloudflare you get an image and not javascript code.

manigandham 1138 days ago [-]
Why is CF distasteful?
SilverRed 1137 days ago [-]
Not OP, but I assume it's because it kind of defeats the purpose of IPFS. IPFS is all about links that refer to content and not location; a Cloudflare link is back to a location, and when the CF mirror goes down, the link will be broken.

But it's also the only way normal users can see the content.

dwild 1136 days ago [-]
I lowered the volume on the Lyra one and the sound is still clearly WAY clearer than the other two.
notretarded 1137 days ago [-]
If it sounds better who cares what methods are used?
rectang 1138 days ago [-]
What I want to know is whether Lyra takes any longer to encode than the alternatives.

Because as far as I can tell, nobody cares in the slightest about latency.

Phone calls are getting to be like writing postcards to each other. Speak in a whole paragraph. Wait several seconds for the latency to clear. Then the other party responds with a whole paragraph, waits several seconds for the latency to clear...

Improvements to fidelity are nice-to-have, but I would like some real-time in my real-time communications, please.

reaperducer 1138 days ago [-]
I had the pleasure of using a real landline just before the pandemic. Honest wire-to-wire connection between two ranches, so no silly VOIP steps between.

It was fantastic.

You don't appreciate how much latency is destroying our ability to communicate verbally until you go back to the old way.

One example is arguing. It's no wonder people used to be able to argue with one another on a telephone. You could raise your voice and still hear the other side and adjust your speech in real time. Today it's just one party shouting over the other to drown the opponent out.

rectang 1138 days ago [-]
Between miserable latency, not-so-great fidelity, and the fecklessness of phone companies in the face of the robocall epidemic, I have come to hate phone calls.

I'm rooting for something to replace phone communications. Any chance that Matrix can do better on any of those fronts? Especially on fidelity and latency since they're germane to the high-level subject of this discussion.

colordrops 1138 days ago [-]
One conspiracy theory is that tech companies have lobbied to prevent real action on robo-calls, in order to get people like you to hate calls and migrate to online services.
908B64B197 1138 days ago [-]
I don't think the FCC needed any lobbying to do nothing for the consumer these last few years.
Majestic121 1137 days ago [-]
That seems unlikely. The robocall phenomenon doesn't really exist in France, and many people still dislike calling.
AshamedCaptain 1137 days ago [-]
It definitely exists in France. I still receive a robo call every single day on my land-line, liste rouge/Robinson or not. I have stopped bothering reporting them. The prevalence of "who called me/is it important" services makes me think it is not an uncommon problem.
eeZah7Ux 1138 days ago [-]
> conspiracy theory

Sounds like a run-of-the-mill everyday business decision.

tgv 1137 days ago [-]
Another thing I clearly experienced in the beginning of the mobile phone era, when one still could compare those things, was dynamic range. When speaking via a good landline, you could hear the presence of the other, but on mobile, there was only speech or silence. Noise suppression saves bandwidth, but it is so aggressive that the nuances in the other's utterances (breathing, hesitations, etc.) simply disappear. For a simple transaction that's not a problem, but when your SO lives far away, miscommunication arises too easily.
hackmiester 1138 days ago [-]
VoIP is definitely not the issue here. Codecs can be, and are, fast. I have no idea what cellular providers are doing to mangle the voice path so bad, but it certainly isn’t inherent to VoIP.
dguaraglia 1137 days ago [-]
I have no idea what they are doing, but it sucks. I have on many occasions found that making a call using WhatsApp works much better - both in quality and latency - than making a phone call. Phone lines are so atrociously bad, they've put me off from making a voice call unless it's absolutely necessary.
andai 1138 days ago [-]
I thought the delay would be because VoIP uses packet-based communications rather than a direct connection.
cnorthwood 1138 days ago [-]
Most landline networks are also digital now for exchange to exchange communications, so even those will be packet based
nayuki 1137 days ago [-]
Not all packets are created equal. ATM (Asynchronous Transfer Mode) cells have a fixed 48-byte payload, but IP (Internet Protocol) packets have longer headers and variable-length payloads up to about 1500 bytes. ATM is designed for low latency audio transfer, whereas IP incentivizes using bigger packets to reduce overhead at the expense of more latency.
comex 1137 days ago [-]
Historically, sure. But at a realistic modern internet speed of, say, 15 MB/s, 1500 bytes takes 0.1ms to transfer. Even at 1.5 MB/s it takes 1ms. Typical VoIP call latency apparently ranges from 20ms to 200ms, so packet size is not a major contributor.
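The arithmetic behind those figures, as a quick sanity check:

    # Back-of-the-envelope serialization delay for a 1500-byte packet
    def serialization_delay_ms(packet_bytes, bytes_per_second):
        return packet_bytes / bytes_per_second * 1000

    print(serialization_delay_ms(1500, 15e6))   # 0.1 ms at 15 MB/s
    print(serialization_delay_ms(1500, 1.5e6))  # 1.0 ms at 1.5 MB/s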
LargoLasskhyfv 1138 days ago [-]
Ever experienced European ISDN to European ISDN at 64 kb/s?

I know Americans said 'I still don't need (it)', but I still do miss it :-)

Funny thing was I had better (and cheaper!) calls to the US using calling cards, dialing into Frankfurt and from there to the US, than using the native offer of my telco.

IshKebab 1137 days ago [-]
Arguing is definitely affected by latency but it is at least possible to reduce latency, e.g. by getting fibre internet or using ethernet.

A more annoying "feature" of many VoIP systems is that they mute the other person while you're talking. You literally can't interrupt people because they won't hear you.

I presume this is done in order to reduce feedback, but it still sucks.

bentcorner 1138 days ago [-]
I'm glad you bring up latency - I've experienced several second latency in discord and it's really terrible when it happens (diagnosed via side-channel). The worst thing is that the app does nothing to try to salvage the conversation and the app doesn't tell you it's in a degraded state, so you'd never know this was a problem.
kroltan 1138 days ago [-]
It has a quality indicator on the bottom right, and if you click it there's a nice chart that shows the latency.

https://i.imgur.com/vR7NSpG.png

Or are you talking about something else?

bentcorner 1138 days ago [-]
Well, I stand corrected. I'll take a look at that chart next time I experience something like that happening. I haven't seen this UI before. Thanks!
mbar84 1137 days ago [-]
Add to that when the other party is hearing you over speakers, and the echo cancellation kicks in whenever you start to speak so that you can't hear what they're saying when they try to interject. You just see lips moving, stop speaking and you start to hear them mid sentence.
908B64B197 1138 days ago [-]
I remember switching to VoIP and noticing the voice quality getting better compared to the old twisted pair. No noticeable latency.
Aloha 1138 days ago [-]
A cell phone, tbh, is about the same latency as a landline in most end-to-end call circumstances. Latency only really is noticeable above 600ms. (And only a real problem over 1000.)
acdha 1138 days ago [-]
Do you have a citation for that? I've heard 200ms as the key threshold and the ITU uses 100ms for their default delay sensitivity class[1]. One key concept is that this isn't fixed but situational: if you're watching a TV show, the threshold is higher than if you're trying to react to something which is higher than simply noticing a delay and speech is more forgiving than, say, music hitting a precise tempo (I believe musicians have been tested as noticing delays down into the 10-20ms range).

One other big factor is consistency: if the delay is due to compressor overhead, which is constant, the effect will be noticeable but less distracting than if it's varying due to something like wireless conditions.

1. https://www.itu.int/rec/T-REC-G.107-201506-I/en

neltnerb 1138 days ago [-]
I know that if the delay is constant, a good musician can compensate for it. I think it's a lot easier to hear that two things don't happen at exactly the same time than it is to tell whether two things happen 200ms or 300ms apart.

Musicians have that internal metronome to compare things with.

kwindla 1138 days ago [-]
I disagree with these numbers, in general. Though of course "noticeable" is subjective and varies by use case as well as by person.

For many people, end-to-end audio latency in a 1:1 conversation becomes noticeable/annoying at 200ms. And in a multi-participant conversation, talking over each other becomes noticeably more common even at 100ms compared to 50ms.

xellisx 1136 days ago [-]
IIRC the Bell standard calls for no more than 50ms.
lynndotpy 1138 days ago [-]
I think latency is noticeable at even lower values. As a basic example, try to sing a song with someone over a voice call. Consider using Airpods or similar bluetooth headphones to make it more apparent.
ubercow13 1138 days ago [-]
Imagine trying to have a face-to-face conversation with 600ms of latency...
TheRealSteel 1138 days ago [-]
Do you have a source for that? Gaming latency is noticeable in the ~100ms range, and humans are highly sensitive to small adjustments in speech.

I would've thought it would be closer to the 200ms range, but I don't have any data to support that.

jeffbee 1138 days ago [-]
There is a huge quantity of research opposing your statements. The ITU considers 300ms round-trip latency to be catastrophic.
kwindla 1138 days ago [-]
This is a fantastic question. I agree with you that we're slowly boiling the frog (and the frog is ourselves) in accepting more and more latency in our real-time communications.

I think the answer for Lyra is that latency is a concern, but maybe at this stage not as much of a concern as it could be. I'm only guessing, though, based on this [0]:

> The basic architecture of the Lyra codec is quite simple. Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission.

That sounds like the minimum frame size for Lyra is 40ms. For Opus (the audio codec used for most WebRTC applications), the default frame size is 20ms [1], and most implementations support frame sizes of 10ms [2].

Of course, your favorite web browser might not default to 20ms frames for Opus. And by "most implementations" I meant Google Chrome. :-)

[0] https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...

[1] https://tools.ietf.org/html/rfc7587#section-6.1

[2] https://chromium.googlesource.com/external/webrtc/+/HEAD/mod...
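As a rough back-of-the-envelope (my own arithmetic, not figures from the paper, and assuming the 6 kbps Opus setting used in the comparison), the frame length is also a hard floor on algorithmic delay, since a full frame has to be buffered before it can be encoded, and it determines how little payload each packet carries:

    # Rough arithmetic only: payload bytes per frame at these bitrates/frame sizes
    def frame_payload_bytes(bitrate_bps, frame_ms):
        return bitrate_bps * (frame_ms / 1000) / 8

    for name, bitrate, frame_ms in [("Lyra", 3000, 40),
                                    ("Opus default", 6000, 20),
                                    ("Opus low-delay", 6000, 10)]:
        print(f"{name}: {frame_ms} ms buffered per frame, "
              f"{frame_payload_bytes(bitrate, frame_ms):g} payload bytes")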

toomim 1138 days ago [-]
Google Chrome has a latency of 20ms to just repeat back audio on the local device.

That is, with no networking and no processing, it takes 20ms for any information to go from the microphone back out to the speakers.

pjc50 1137 days ago [-]
The default Windows Audio Engine buffer size is 10ms, so that will undoubtedly be one buffer delay in and one buffer delay out.

https://docs.microsoft.com/en-us/windows-hardware/drivers/au...

dwild 1136 days ago [-]
I can't click a button I see flash faster than 200 ms, and that's when 100% of my attention is on watching a light change color and clicking. If 20ms is an issue for a conversation, where there is so much more processing and understanding on both sides, I'm clearly a subhuman...
Miraste 1138 days ago [-]
Even the default Windows audio device "listen to yourself" option has some serious latency.
revenant3-2 1137 days ago [-]
oh yeah it's bad, worse than cl_loopback in csgo even. one of the reasons i decided i wanted an XLR mic + amp
TheRealSteel 1138 days ago [-]
I spent the last year and a half living about as far from my home city as possible (Melbourne > Edinburgh) (Currently in quarantine) and the latency of phone calls drives me nuts.

The huge gaps between people speaking, and the complete change in conversation flow since you have to speak in huge continuous paragraphs. The constant "go ahead" , "no you go" etc... Ugh.

But I agree with you... nobody seems to care except me.

qayxc 1137 days ago [-]
> Melbourne > Edinburgh

Well, there's part of your problem right there. No need to even mention Bluetooth or device-related delays.

Southeastern Australia to northern Britain is the better part of literally half way across the globe. It's either ~64,000 km at lightspeed (via satellite) or roughly 35,000 km via optic fibre cable at 60% of lightspeed (equivalent to ~60,000 km at lightspeed).

That's 200 ms of one-way latency from the distance alone (best-case scenario, no less), so 400 ms round trip just from distance. Even with something like Starlink we'd still be talking about at least 100 ms latency.

The latency from wireless protocols and codecs is just the cherry on top.
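The speed-of-light arithmetic, under the same assumptions as above:

    # Same assumptions as above: ~35,000 km of fibre at ~60% of c,
    # or ~64,000 km via satellite at c.
    C_KM_PER_S = 299_792

    def one_way_ms(path_km, fraction_of_c):
        return path_km / (C_KM_PER_S * fraction_of_c) * 1000

    print(one_way_ms(35_000, 0.6))   # ~195 ms one way over fibre
    print(one_way_ms(64_000, 1.0))   # ~213 ms one way via satellite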

robmsmt 1138 days ago [-]
I also agree. Sometimes I speak to work colleagues who are on BT headsets and it must add what feels like 100ms to the call. I can sometimes hear the end of my sentence as I speak. Infuriating.
Miraste 1137 days ago [-]
I don't understand how Bluetooth has made it this far for phone calls. It has two modes: unintelligible quality and bad latency, or decent quality and ludicrous latency. The entire standard is an advertisement for Airpods.
viraptor 1137 days ago [-]
And that's on top of a ridiculously complex stack of protocols. I don't understand it either.
kozak 1137 days ago [-]
Is LE Audio going to change this for the better?
sodality2 1137 days ago [-]
What platform are you on? iOS/android/windows/macos/linux?
Miraste 1137 days ago [-]
Mostly Windows and Android. Although both audio pipelines aren't great I don't think it's a platform problem--I've never seen a measurement of even 100ms lag with Bluetooth (excluding aptX-LL) and it's usually in the 300ms range. It also still only does two-way audio at 8kHz, which is unbelievable to me in 2021.
yholio 1137 days ago [-]
I would go the exact opposite route for this exact reason: something like 24 Kbit ADPCM or 16 Kbit G.728 can provide absolutely decent quality with only a 5 ms delay. That kind of bandwidth is now available in the vast majority of VoIP scenarios; a 100ms, 3Kbps codec is only relevant for extreme niches.
ksec 1138 days ago [-]
Discussed a few days ago [1]. Copying my comment here:

Even 3G AMR, which I think was a pre-2000 speech codec, started at 5Kbps, with a latency of only ~20ms. If I am reading correctly, the encode for Lyra, due to its ML nature, would take at least 40ms and up to 90ms.

I am sure there are some specific usages where it would be a great fit. But for most consumer consumption I can't think of one off the top of my head. One should also be aware of the current roadmap in 5G and the ongoing work in 6G. We still have a long way to go in maximising bandwidth / transfer per capita, i.e. more bandwidth for everybody.

It seems to be the case with ML that they want to take these speech codecs to new low bitrates. While it is fun as research, I would much rather they push the envelope at 6 / 8Kbps, if not even higher, closer to perfection with even lower latency (10ms if not lower).

[1] https://news.ycombinator.com/item?id=26279891

Causality1 1138 days ago [-]
Providers just don't seem to care. My internet connection is a hundred times faster than it was in 2004 but I'm still playing games with the same 50ms-60ms of lag. Sure the speed of electrical signals in wire is a good chunk of it but there's still so much room for improvement.
qayxc 1137 days ago [-]
> but I'm still playing games with the same 50ms-60ms of lag

What kind of lag, though? Input lag has actually gone up in the past 15 years (e.g. due to displays and USB device polling). These things add up quickly and it doesn't even have to be just the network that introduces lag.

kixiQu 1138 days ago [-]
In the article,

> This trick enables Lyra to not only run on cloud servers, but also on-device on mid-range phones in real time (with a processing latency of 90ms, which is in line with other traditional speech codecs).

Does that not cover it?

bscphil 1138 days ago [-]
Worth noting that the paper itself explicitly says 90ms algorithmic delay. Seems plausible to me that there might be additional processing required on top of that, and then you have to add network transit time... suffice it to say that the vast majority of use cases are going to be better served by Opus, for the time being.

> The overall algorithmic delay is 90 ms

https://arxiv.org/pdf/2102.09660.pdf

regularfry 1138 days ago [-]
It covers it, but it's not exactly brilliant. 200ms is the point at which conversation breaks down. If half that budget has gone on the codec, not much has to happen on the wire for it to be noticeable.
mlyle 1138 days ago [-]
If your 200ms number is round trip, 90ms + 90ms is more than half that budget.
volta83 1137 days ago [-]
The pipeline is basically:

- stream your voice through an encoder: X ms

- send encoded packets over the network: 20-100ms latency (fiber vs mobile phone)

- potential decoding + encoding (if receiver does not support the senders codec, e.g., a landline phone using old codec)

- stream packets through a decoder: Y ms

If you are aiming for 60ms audio latency, which is what I would consider "good", then in the best scenario (20ms network latency; both using same codec) the latency of the encoder+decoder has to be max 40ms (e.g. 20ms for encoder, and 20ms for decoder).

It should be obvious that a codec that does not meet the 20 ms budget, but takes 90 ms instead (more than 3x the budget), can produce better audio (ideally 3-4x better).

Latency-wise, everything below 60 ms is really good, 60 ms is good, and the 60-200 ms range goes from good to unusable. That is, the 200 ms this new codec would hit under ideal conditions is a latency humans consider "unusable", because it is too high to allow a fluent conversation.
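Spelled out with the numbers above (treating the 90 ms algorithmic delay as applying on each side, which is this thread's reading rather than a measurement):

    # Rough mouth-to-ear budget under the assumptions in this comment
    def mouth_to_ear_ms(encode_ms, network_ms, decode_ms):
        return encode_ms + network_ms + decode_ms

    print(mouth_to_ear_ms(20, 20, 20))   # 60 ms: the "good" target above
    print(mouth_to_ear_ms(90, 20, 90))   # 200 ms: Lyra-style delay, ideal network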

For me, personally, if latency is higher than 120ms, I really don't care about how good a codec "sounds". I use a phone to talk to people, and if we start speaking over each other, cutting each other, etc. because latency is too high, then the function of the phone is gone.

It's like having a super nice car that cannot drive. Sure, it's nice, but when I want to use a car, I actually want to drive it somewhere. If it cannot drive anywhere, then it is not very useful to me.

regularfry 1137 days ago [-]
It's not, that's mouth-to-ear.
jtsiskin 1138 days ago [-]
The “mid-range phones” claim has me suspicious; I wish they defined that better. And 90ms is much higher than what Opus is supposed to achieve.
amluto 1138 days ago [-]
That covers it, but 90ms sucks. Opus has much lower latency.
morsch 1137 days ago [-]
There is voip software that is designed to be low latency, eg Jamulus. It's not as easy to use as Jitsi and Zoom are, though.
HeadsUpHigh 1137 days ago [-]
This must be a local (American?) issue. It's all VoIP here and there's only a minor difference in latency.
kevin_thibedeau 1138 days ago [-]
Just disable VoIP or disconnect from WiFi.
chrisseaton 1138 days ago [-]
Isn't the actual phone network packet-switched and running over fibre optics now anyway? I don't think you can get a literally analog phone call anymore can you?
giantrobot 1138 days ago [-]
You haven't had "analog" phone calls for decades. Your analog line was converted to a digital signal at the central office or concentrator (ugly green boxes).

From the CO to a tandem through the core network your call was digitally switched to the endpoint where it was converted back to analog for the callee's phone.

wrs 1138 days ago [-]
However, those were circuit-switched, not packet-switched. (At least for a while.)
giantrobot 1137 days ago [-]
I should have been clearer: the OP doesn't want "analog" service but PSTN (TDM-switched) service. Unfortunately that is dying out as carriers are moving to all-IP core networks. As customers have moved away from fixed landline service, it ends up costing more per customer to keep the PSTN running. The last-mile equipment is the same for TDM or IP switching, but they can dump that directly into an IP network rather than maintain a hierarchy of COs/tandems/toll switches with dedicated links between them.
kevin_thibedeau 1137 days ago [-]
Telco networks have QoS guarantees that your ISP will never meet. Bypass the ISP and you won't have to deal with their latency.
jeffbee 1138 days ago [-]
You probably can still get a real phone, but plugging a plain old phone into an optical network terminal, which converts to VoIP, has most of the virtue of POTS. I use a regular POTS device with Sonic fiber service and it sounds fantastic, with no noticeable latency.

Most of the problems with voice over mobile networks are caused by frame drops and wacky inter-arrival times. A wired IP network just doesn't have those problems.

Going back to the article/paper, I'd love to hear more about how Lyra interacts with Duo's machine-learned voice interpolation that fills in for dropped frames. Do they complement each other, or interfere?

temp-dude-87844 1138 days ago [-]
When Google's announcement [1] was posted a few days ago, I listened to their samples and heard an odd effect in the "chocolate bread" sample (the video chat example) [1], which is not mirrored in this article.

On that sample, I felt [2] that the Lyra version exaggerates the pronunciation of the phrase 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness, and overshoots both the lead consonant and first vowel of 'choc', and then proceeds to wash the entire rest of the sentence with a peculiar brightened voice that's high, lacks consonant definition, and is close to ringing.

I'm guessing it's actually style transfer, because though the result sounds not much like the speaker's original, the result is reminiscent of the speech pattern and accent that people with East Asian and Southeast Asian ancestry adopt when speaking American English. It was surprising, given that the speaker doesn't sound like that in the original. I wonder if others hear this too.

While Lyra sounds richer and wider-band than Opus or Speex at these bitrates, the degradations and artifacts of those codecs are universally recognized (through years of familiarity with telephones) as compression artifacts and not innate features of the speaker themselves. Therefore listeners can be expected to be sympathetic to the quality issues and not attribute the whole of the sound to the speaker personally.

If AI-trained voice synthesizer codecs become the norm, and they perform well on most speakers, that expectation will go away, and the resulting audio will be attributed wholly to the speaker. That increases the impact of mistakes and misrepresentations introduced by the codec, unbeknownst to the speaker and listener.

[1] https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...

[2] https://news.ycombinator.com/item?id=26282519

BugsJustFindMe 1138 days ago [-]
> 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness

I honestly don't hear a 'th' in the original.

> It was surprising, given that the speaker doesn't sound like that in the original.

I disagree. Note that the speaker says "these bread". The three possibilities for those two words—"these bread", "thiiiis bread", and "these breads" with a dropped "s"—would all be weird things for a native english speaker to say for different reasons relating to either wrong pronunciation of "this" or "breads" or the fact that bread is its own collective noun and therefore we typically require separate qualifiers like "these buns" or "these loaves" when separating multiple individual "pieces" (another) into a non-collective. We ask for "some bread" or "a piece of bread", but we don't say "a bread" or "some breads" unless we are discussing categorical types of bread ("ciabatta and rye are breads") rather than instances of such, and only one type of bread is represented in the video.

The Lyra reproduction has a band-pass filtered quality to it, but I find it still remarkably representative of the reference.

cbdumas 1138 days ago [-]
I agree completely, I've listened to the reference sample probably ten times now and I can only hear /wɪ/
hackpert 1138 days ago [-]
Yes yes yes please somebody look into latency with these fancy ML methods! You can quite literally have most ML models approximated to a very good degree as very fast DSP using very few processor cycles given modern-day CPU optimizations. Or heck use an ISA simulator plugged into another fancy ML model, and have it also optimize to minimize instruction count while recreating the same signal! (having a model optimize on its own inference is a neat trick, but I digress.) I’m sure ML is just one bottleneck among many (looking at you, Chromium) but I so desperately wish people started caring about latency again.
scotty79 1138 days ago [-]
"When a man looks for something beyond his reach ..."

The word "looks" sounds completely wrong for me with Lyra. To the point of completely not understanding what this word is supposed to be (first example with your [1] link).

est31 1138 days ago [-]
For me "looks" sounds fine but the word before, "Man", sounds like "Lan". So to me the opus sample sounds more understandable. Even though the "quality" of Lyra is better, that shouldn't be the score to optimize for, but fidelity of the compression. It's not helpful if the compression algorithm generates a beautiful flower from a flower image but it's a red flower instead of a blue one like the original. Gives me Xerox vibes...
ghusbands 1137 days ago [-]
Similarly, for me, the word "miracle" in the noisy environment becomes like "vericle" with Lyra, where in Opus it is clearer. (Speex does fairly badly, but in a way that's a clear failure overall rather than making it sound like something else.)
com2kid 1138 days ago [-]
To me, "man" sounds like "lan", but "looks" sounds correct.

I actually listened to the Lyra version first, and thought the speaker said "when a lad looks for something beyond his reach"

dkjaudyeqooe 1138 days ago [-]
Depending on your accent and the model there is a fine line between "can't" and "cunt" or "six" and "sex".
ampdepolymerase 1138 days ago [-]
Are the speech models sufficiently generic across all languages?
qayxc 1137 days ago [-]
That remains to be seen. In my experience the performance with anything other than (US) English is mediocre at best, and the less common the language, the worse the results get.

So while Spanish, French, or German might get there eventually, don't even try Polish, Czech, or Farsi (Persian) dialects.

cityzen 1138 days ago [-]
I read your comment before I watched that video and I can't stop laughing. It sounds ridiculous!
motiejus 1138 days ago [-]
Nothing about licensing or patents. I assume the worst (read: unusable for small businesses)?

10+ years ago I worked in a small voip shop, where we had very high quality (low jitter), but low bandwidth connection. I researched many codecs of the time (2010-ish).

We liked speex because it can be used "without strings attached". Also, I could choose the quality depending on the bandwidth. Although for low bandwidth g729 was better, which we couldn't use because of royalties (though I allowed myself to test it).

We chose alaw/ulaw when bandwidth was not a concern, and speex when it was.

Since it does not mention usability outside of google, I also find this comparison unfair or incomplete: if you are comparing a proprietary codec, compare it to g729. If you are comparing a codec to speex, it should be open/free.

Edit: grammar

bscphil 1138 days ago [-]
These days the correct comparison would be to Opus, which is similarly unencumbered and performs fantastically at low bitrates (and has a speech specific mode for even lower bitrates, because it's a hybrid of two codecs). It's also extremely low latency, so there's now no reason to accept trade-offs. (For the same bitrate as alaw/ulaw, you can get high quality full band music with Opus.)

These days it's more or less the standard for realtime voice. WebRTC uses it, most of the popular realtime voice applications use it as well, as does Signal.

IshKebab 1137 days ago [-]
They did compare to Opus.
lights0123 1138 days ago [-]
I wouldn't be as pessimistic—they invented VP9 and were important to AV1, and continue to promote both and use them everywhere.
LinuxBender 1137 days ago [-]
I am curious about this as well. With all the debates in the threads here about this codec, I think it would be worth having the folks at Mumble [1] (murmur server) incorporate it as an option so people could fire up an instance and put it to the test. That of course only works if the license is compatible. They compare the performance to Opus which is used currently by Mumble.

[1] - https://www.mumble.info/

1996 1138 days ago [-]
> Nothing about licensing or patents. I assume the worst (read: unusable for small businesses)?

If there's a free software implementation, and a company offering the service based in the EU (or shop around and find any other jurisdiction where software patents don't matter), it's often YOLO - but call that "legal arbitrage" if you want to sound fancy :)

kreetx 1138 days ago [-]
What are the current choices for "CD quality" speech compression (lossy but indiscernible)? I just had a discussion with a friend about keeping an always-on speech recorder running and wondered about disk space consumption.
p1mrx 1138 days ago [-]
Basically Opus (Vorbis successor) or AAC (MP3 successor).
dkjaudyeqooe 1138 days ago [-]
A fair point if you're evaluating actual usage rather than just quality, but Google isn't going to be the only one with this sort of codec. We can expect better free versions to come along in time.
marcodiego 1138 days ago [-]
With good enough licensing it could possibly replace speex.

It is sad that we have to think about licensing and patents of technologies instead of only how good or advanced they are.

Aloha 1138 days ago [-]
I'm over here preferring g711/ulaw because I prefer the hard roll off at 4kc.
londons_explore 1138 days ago [-]
This is blogspam that doesn't add meaningfully to the original[1].

[1]: https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...

boneitis 1138 days ago [-]
I found the Lyra clips in both examples easily the most difficult to comprehend, even compared against the scratchy Speex.

Am I the only one? It is a little odd to me to see the praise here and on the previous discussion.

To be fair, I am convinced I have APD (and, to be fair again, I have never got it checked out).

E: Just realized there is a third example. Perhaps it is not as strong a statement due to Opus' doubled bitrate, but it is still far scratchier. Yet, it is more decipherable than the Lyra codec to me.

contravariant 1138 days ago [-]
It sounds 'clean' but the remaining artefacts are quite weird. Somehow the intonation/timing feels off, which just sounds plain odd. The compression artefacts are 'ugly' but at least recognisable as compression artefacts.
bscphil 1138 days ago [-]
I personally wouldn't say "difficult to comprehend". I would say that the Lyra audio is "cleaner" but that the artifacts that remain are louder and more annoying in Lyra. There's a very bad ringing effect and some flutter. If you personally find these artifacts distracting or confusing, I could very easily see the Lyra examples being harder to understand.

I'm almost certain that Lyra has increased the volume on the first sample too. It's quite audible, although I haven't confirmed this with Audacity.

Through good quality headphones, I actually find the Lyra artifacts rather piercing and think I'd pretty quickly get fatigued through having it in my ears over a long conversation. Maybe they would handle this better with a bit of a lowpass filter added.

acdha 1138 days ago [-]
I noticed the loudness, too, followed by a bit of a letdown with the subsequent odd artifacting. I was wondering whether that was also a factor in the user-perception ratings similar to how FM radio stations all started deploying compression to sound louder because it increased the odds of people favorably picking out their station while scanning through the spectrum.
mfkp 1137 days ago [-]
After reading vidarh's comment below ("The biggest challenge with evaluating all of these, is that once you've listened to a comprehensible version of one of these samples, they all sound more intelligible."), I recommend trying to listen to the Lyra version first, then listen to the original next. It actually does sound very strange when you don't have the true audio parsed in your head already. I'm not sure if I would like having to do the extra work to translate this in my head on every conversation.

It's more apparent in the video example from the Google post, I won't spoil it but there's a word that starts with B that sounds very funny in the Lyra version (listen to that one first): https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...

IshKebab 1137 days ago [-]
Definitely a real effect, but Lyra is still clearly an improvement and I assume their MOS testing accounts for that.
jbluepolarbear 1137 days ago [-]
I did as you said and watched the Lyra version first. I couldn’t understand her and the tone wasn’t matching her lips and mouth. It sounds like someone with a speech impediment is dubbing someone who is not. I wonder how much miscommunication is caused by poor compression and latency.
hatsunearu 1138 days ago [-]
http://www.rowetel.com/wordpress/?page_id=452

Take a look at this too. Also runnable on low power devices. And there was some work of using AI to enhance the codec2 encoded bits too.

ddevault 1138 days ago [-]
Yeah, with Codec 2 setting the gold standard, I don't find this very impressive. I find this more intelligible at one-third of the Lyra bitrate:

http://www.rowetel.com/downloads/codec2/hts2a_1300.wav

Codec 2 does a better job of isolating the parts of sound which are most necessary to intelligible speech, without necessarily caring too much about preserving the original qualities of the speaker's voice or environment.

Fun fact: Codec 2 can be used to transmit voice over IRC:

https://github.com/asiekierka/voirc

londons_explore 1138 days ago [-]
I had to listen to your wav sample 4 times before understanding what it was saying... To me, that isn't intelligible... Perhaps with practice one could learn to understand it, but that isn't really what I want from my audio codec.
vidarh 1138 days ago [-]
The biggest challenge with evaluating all of these, is that once you've listened to a comprehensible version of one of these samples, they all sound more intelligible. I had problems with the example too. After hearing the original it's now easy. It makes it really hard to properly assess the intelligibility for developers without decent sized panels of people to help evaluating them.
IshKebab 1137 days ago [-]
It's definitely impressive but your ears are broken if you think that is more intelligible.
qayxc 1137 days ago [-]
Or maybe they are visually impaired. I'm always impressed when I sit next to a blind person in a train and hear them using a smartphone.

The screen readers run at 2x normal speed (at least) and to my untrained ears the robotic noises just sound like a garbled mess instead of intelligible speech. The blind person using the phone, however, seems to have no problem understanding it. Fascinates me every time.

ekelsen 1138 days ago [-]
Could not understand that sample at all, I don't think this makes your point.
alvarlagerlof 1138 days ago [-]
Completely unrelated, but damn do they need to update the illustration at the top of that page. It's hideous.
ajb 1137 days ago [-]
Funny thing: voice codecs below 2.4kbps are export controlled by ITAR, right alongside military technology and nuclear devices. David Rowe, who made the open source codec2 voice codec, had to get confirmation from his government that they would not enforce it on his project: https://www.mail-archive.com/freetel-codec2@lists.sourceforg...

I'm not sure exactly why - someone mentioned that it might be because deep submarines have very little bandwidth and need to use them, but I don't have a reference.

Nevertheless, I wouldn't be surprised if Google didn't want the hassle.

charlesdaniels 1138 days ago [-]
This reminds me of a technology in Vernor Vinge's "Zones of Thought" series. I think they called it "evocations": at the beginning of a call, a model is transmitted that allows the other end to reconstruct what the sender would look/sound like from severely abridged data. It sure sounds plausible - the semantically meaningful parts of a conversation (video or audio) would appear to have significantly less entropy than all of the details captured by a mic/webcam. The fact that things like JPEG and MP3 exist is proof enough, and those (to my knowledge) aren't even feature-based.

Maybe N years from now, your {Skype,FaceTime,Zoom,Jitsi} call starts by transmitting a pre-trained autoencoder that can reproduce your speech and visual appearance with a "good enough" margin of error from a few kbps worth of data.

ryukafalz 1138 days ago [-]
It's been done, at least for video: https://www.youtube.com/watch?v=NqmMnjJ6GEg&feature=emb_titl...

Not for audio yet, I think?

walrus01 1138 days ago [-]
Microsoft also just announced a 6kbps advanced audio codec:

https://techcommunity.microsoft.com/t5/microsoft-teams-blog/...

perryizgr8 1137 days ago [-]
And it sounds significantly better to me, which is expected since they use double the bitrate.
Jowsey 1137 days ago [-]
Who could've possibly guessed that doubling the bitrate would increase the quality? :P
drudu 1137 days ago [-]
hn allows meaningless statements (as above) but bans calling anyone out for being dumb
esturk 1138 days ago [-]
As one post alluded to, this has the side effect of creating a "synthetic accent", and I wonder what kind of social implications that will create.

Over time, as these "synthetic accents" gain widespread adoption, would mainstream pronunciation also adopt such an accent? It's certainly interesting to think about.

kristofferR 1138 days ago [-]
The Lyra examples are way harder to understand than the rest of them, though. It injects sounds that weren't there.

It sounds like he said "Someuve" instead of the "Some" clearly audible in the other versions.

RasmusLarsen 1138 days ago [-]
I'm not going to say it's not impressive given the limitations, but am I an outlier if I say it sounds unacceptably bad for a voice call? If someone was on discord/zoom/hangouts with that quality I'd ask them to check if they had some hardware issues or their connection was borked.
peterhil 1137 days ago [-]
I find the glitches in Lyra codec very unpleasant compared to the other codecs.

They remind me of the visual glitches of AI-generated images, and I really hope Lyra does not become a common codec!

drudu 1137 days ago [-]
To clarify, do you find Lyra @ 3 kbps to be more unpleasant overall than Speex @ 3 kbps?
annoyingnoob 1138 days ago [-]
Looks interesting. What is the use case for a 3k codec in 2021?

Having the Lyra sample louder than the others is cheating and gives a false impression.

walrus01 1138 days ago [-]
The original Iridium satellite phone voice codec did voice calls at approximately 2800 bps 22 years ago, but the quality is quite poor, and it's very rudimentary. Encoding power and CPUs were quite different in 1998/1999.
jimktrains2 1138 days ago [-]
I wonder how this compares with codec2[0] which is decent at around 3kbps and can go even lower.

[0] https://en.m.wikipedia.org/wiki/Codec_2

donpark 1138 days ago [-]
According to someone who listened to both, Lyra beats Codec2 on quality. YMMV
etaioinshrdlu 1138 days ago [-]
Important to remember that this type of codec can be used as a backup for higher-bandwidth codecs. You don't necessarily need to hear its artifacts all the time. The higher-level codec also only needs to encode the differences between the prediction and the ground truth. The same thinking applies to video, especially of faces. Neural nets are a huge leap forward for this type of data compression and will likely be used pretty much everywhere in the future with great success.
tyingq 1138 days ago [-]
If the demos are actually representative, it does seem impressive. It could save a lot of bandwidth for VoIP if it replaced 8 kbps G.729.
lxgr 1138 days ago [-]
Isn't VoIP at such low data rates already dominated by the overhead of UDP, IP and whatever lower layer? Multiplexing it with a low-bandwidth video stream would be possible, though.

I was thinking this could be most relevant for something like digital wireless transmissions.

tyingq 1138 days ago [-]
G.729 is 21-30 kbps with transport overhead, depending on a few factors. So shaving off 5 kbps would still be meaningful. Or better quality at the same bandwidth might enable in-band DTMF or fax, neither of which works on G.729 now.
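For the curious, a rough back-of-envelope check of those numbers (Python; the header sizes are my assumptions for plain RTP/UDP/IPv4 with no header compression or VLAN tags):

    # G.729 at 8 kbps with 20 ms packetization: 20 bytes of codec payload per packet
    payload, rtp, udp, ipv4, eth = 20, 12, 8, 20, 18   # eth = Ethernet II header + FCS
    pps = 1000 // 20                                    # 50 packets per second
    ip_kbps  = (payload + rtp + udp + ipv4) * 8 * pps / 1000         # 24 kbps at the IP layer
    eth_kbps = (payload + rtp + udp + ipv4 + eth) * 8 * pps / 1000   # ~31 kbps on Ethernet

which lands in the same ballpark as the 21-30 kbps figure above.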
lxgr 1138 days ago [-]
In-band fax will certainly not work over a lossy voice codec, unless your fax modem is able to mimic human speech patterns.
tyingq 1138 days ago [-]
I thought I remembered getting 2400 baud fax (unreliably) working on G.729. Though I guess, yes, if this codec is trained on voices that probably doesn't bode well for fax.
lxgr 1138 days ago [-]
Ah, I keep forgetting how low bandwidth fax modems are. That might actually work – 2400 baud is basically a person whistling or humming one out of two tones :)
nousermane 1138 days ago [-]
To say the least, yeah. At 3kbps and 20ms framing, it's only 7.5 bytes of payload per frame.

RTP, UDP, IP, and Ethernet overhead are what - 60-ish bytes?

zamadatix 1138 days ago [-]
60ish sounds right, though with Ethernet it's going to be padded to a minimum of 64 bytes regardless. Might not matter depending on what your bottleneck link actually uses, though.
rjsw 1138 days ago [-]
You might have PPPoE on top of that with another 8 bytes.
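Putting the thread's numbers together, here's a small sketch (Python; the header sizes are assumptions for a plain IPv4/Ethernet II path, with preamble and inter-frame gap not counted):

    LYRA_PAYLOAD = 8                 # 3 kbps * 20 ms = 60 bits, ~7.5 bytes, rounded up
    RTP, UDP, IPV4 = 12, 8, 20
    ETH = 14 + 4                     # Ethernet II header + FCS
    PPPOE = 8
    PPS = 1000 // 20                 # one packet per 20 ms frame

    def wire_kbps(payload, extra=0):
        # Ethernet frames below 64 bytes get padded up to the minimum
        frame = max(payload + RTP + UDP + IPV4 + ETH + extra, 64)
        return frame * 8 * PPS / 1000

    print(wire_kbps(LYRA_PAYLOAD))          # ~26 kbps on the wire for a 3 kbps codec
    print(wire_kbps(LYRA_PAYLOAD, PPPOE))   # ~30 kbps with PPPoE on top

So the overhead really does dominate: the codec's own bits are only a small fraction of what actually goes out on the wire.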
vortico 1137 days ago [-]
Why in the world do we need 3 kbps audio for voice? It's so hard to hear people speaking over mobile phones. Why don't we use 32-64 kbps 48 kHz for all voice communication? GSM CSD offers 9-14 kbps down/up, 3G offers 384 kbps, and EDGE offers 473 kbps. Why limit to 3 kbps?
mkl 1137 days ago [-]
There are many reasons.

- Not everyone has such good connectivity.

- So you can handle many streams at once, like in a big meeting, without a server mixing them.

- So you can have decent quality video on the same limited connection.

- So you can archive large amounts of speech.

- To advance the state of the art.

- etc.

kilroy123 1137 days ago [-]
I think the old Iridium satellite constellation only went as high as 3 kbps. It was heavily depended on by the US military. I'm sure there's still a need for this.
Manfred 1137 days ago [-]
For situations with low bandwidth. Communication in space or far outside of civilization.
drudu 1137 days ago [-]
if you can't answer the question yourself, you wouldn't ever be able to understand the answer. maybe i want to have two calls at once / maybe i have a shitty gsm/csd link / maybe we are transmitting over a link that has less than 10kbps bandwidth. asking this question implies you have no imagination or awareness beyond what you already know. i'll get banned from hn for this but goddamn you're a dumb person
dang 1137 days ago [-]
We've banned this account. Please do not create accounts to break HN's rules with.

https://news.ycombinator.com/newsguidelines.html

wrongdonf 1138 days ago [-]
We are getting to the point where compression is so good, you aren’t actually hearing the other person. Wild
moonbug 1138 days ago [-]
you'll never know if you're speaking to the Blight.
kragen 1137 days ago [-]
The LPCNet decoder for Codec2 has enabled high-quality voice calls at 1700 bits per second since 02019: http://www.rowetel.com/?p=6639

Why doesn't this article mention LPCNet or Codec2?

MarkusWandel 1137 days ago [-]
This is subjective, but there is loss of real speech information here. I don't hear the accents of these speech samples daily. I can easily follow the uncompressed versions. But the heavily compressed versions delete cues and I could not initially understand them.

This is an effect even with regular telephony these days. A smartphone, carefully held a little way from your ear because you don't trust it to mask touch events when used as a phone, and using 16 kbps audio, is not as understandable as an old-fashioned landline phone. Ironically, higher-fidelity audio via an app (e.g. WhatsApp calls) scores better, despite the occasional glitches.

walrus01 1138 days ago [-]
The 3 kbps example here with the 'bread with chocolate filling inside' video is frankly amazing in how good it sounds compared to the original.

https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...

It is unfortunate that, for now, this appears to be proprietary and closed source, treated as a competitive advantage for Google over others, unlike Opus which is fully open.

londons_explore 1138 days ago [-]
If they add it to WebRTC as they suggest, it will get auto-included in nearly all videoconferencing applications (most use webrtc under the covers, and a simple git pull will get it included in the next release).
walrus01 1138 days ago [-]
They briefly mention the existence and current role of the webrtc codecs, but I don't see where they suggest they intend to contribute it or open it up as a library others can use.
londons_explore 1138 days ago [-]
If they don't release it as part of WebRTC, they won't be able to use it in browser-based videoconferencing. Google Meet/Hangouts on desktop rely on that.

While it is theoretically possible to process audio with something compiled to WebAssembly, data can't be marshalled into/out of a WebAssembly worker without the main browser thread's help, and that tends to be too janky for realtime audio on most platforms.

That pretty much forces Google's hand: if they want to use it in their web-based products, it must be open source and available to all competitors.

They might say this feature only works with their mobile apps though.

lankalanka 1138 days ago [-]
It would take some time to see how it goes. It's possible to run it in browsers without it being a member of the WebRTC codec family: the codec can be deployed as a WebAssembly module, with WebRTC acting as a data channel for it. Zoom used to deploy a similar approach while staying on its proprietary codec.
jbluepolarbear 1137 days ago [-]
I listened to the first 4 samples of the woman speaking, and Lyra was the clearest and closest to the loudness of the original, but it was also the most different. Lyra sounds like a computer voice of the woman speaking the sentence back. I don't know how to explain it better; I can pick out the slight buildup/double "p", and it's not present in the Lyra version. Almost like it's removing human speech imperfections.
jdkee 1137 days ago [-]
How does this bit rate compare to POTS?

"Restricted to a narrow frequency range of 300–3,300 Hz, called the voiceband, which is much less than the human hearing range of 20–20,000 Hz"

https://en.wikipedia.org/wiki/Plain_old_telephone_service
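For a ballpark comparison (assuming the standard digital encoding of a POTS channel, i.e. a DS0 carrying 8 kHz, 8-bit G.711 PCM):

    pots_kbps = 8000 * 8 / 1000    # G.711 as carried on digital trunks: 64 kbps
    lyra_kbps = 3
    print(pots_kbps / lyra_kbps)   # Lyra uses roughly 1/21 of a POTS channel's bitrate

And since the samples in this thread are 16 kHz wavs, Lyra actually covers a wider band than the 300-3,300 Hz voiceband; the difference is almost entirely in the bits spent.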

Vadoff 1138 days ago [-]
How come the clean reference wav file is 168KB, while the clean Lyra (@3kbps) wav file is significantly larger at 328KB?
bscphil 1138 days ago [-]
When they converted back to wav from Lyra, they used a 32-bit 16 kHz wav instead of a 16-bit 16 kHz wav like the source. The size of the Lyra file is almost exactly 2x as big as the reference.

Note that this isn't cheating in any way, the source is the source, so it's just a quirk from their conversion process. Probably the tooling around Lyra is pretty rudimentary and the decoder could only output a 32 bit file.
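A quick sanity check of that explanation (assuming mono PCM wavs with a 44-byte header; the clip length is inferred from the file sizes, not stated anywhere):

    ref_bytes  = 168_000
    duration_s = (ref_bytes - 44) / (16_000 * 2)    # 16 kHz, 16-bit mono -> ~5.2 s
    lyra_bytes = 44 + duration_s * 16_000 * 4       # same clip stored as 32-bit samples
    print(duration_s, lyra_bytes)                   # ~336 KB, close to the 328 KB observed

The small remaining gap is presumably extra header/metadata or KB-vs-KiB rounding.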

walrus01 1138 days ago [-]
Since a browser can't play Lyra, I think they took the Lyra output and put it inside something lossless like a 44 kHz stereo wav so that people can listen to it.
Saris 1137 days ago [-]
Hmm, I have to say it's impressive to do this at 3 kbps, but even the Lyra sample I found very hard to understand, and misheard most of the words. The original was the only one I could understand fully.
faebi 1138 days ago [-]
Maybe in the future, all we need is a speech example, some AI, and the continuous transmission of text for low-data voice transmission?
dheera 1138 days ago [-]
I think this will be the rough direction, but not exactly text; rather some other efficient, machine-readable embedding of speech that is also able to carry tone and rhythm effectively and pronunciation accurately and unambiguously.
wmf 1138 days ago [-]
Basically yes. "Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission. The features themselves are log mel spectrograms, a list of numbers representing the speech energy in different frequency bands, which have traditionally been used for their perceptual relevance because they are modeled after human auditory response."
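For anyone curious what those features look like in practice, here's a minimal sketch (using librosa; the 40 ms hop, FFT size and 80 mel bands are my own illustrative choices, the post doesn't give Lyra's exact parameters):

    import librosa

    # Hypothetical input file: mono speech, resampled to 16 kHz
    y, sr = librosa.load("speech.wav", sr=16000)

    # One feature vector every 40 ms: hop_length = 0.040 * sr samples
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=int(0.040 * sr), n_mels=80)
    log_mel = librosa.power_to_db(mel)   # log mel spectrogram, shape (80, n_frames)

    print(log_mel.shape)   # per-frame vectors like these are what gets compressed and sent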
cecja 1138 days ago [-]
Why speak then?
viraptor 1138 days ago [-]
Only if you're ready to kill all the intonation nuance. If you're ok with that, why not stick to just reading text? At least we can use emotes in there.
trevorishere 1138 days ago [-]
I'd be curious to know how this compares to Microsoft's Satin codec, now used in Teams, which is ML-driven.
guywhocodes 1137 days ago [-]
Feels like one could get something comparable with RNNoise into a DCT and then just gzipping the frames.
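For anyone who wants to try, a toy version of that pipeline (skipping the RNNoise denoising step, and adding a crude 8-bit quantization so the frames compress at all; the frame size and quantization scale are arbitrary choices of mine):

    import zlib
    import numpy as np
    from scipy.fft import dct

    def naive_encode(samples, sr=16000, frame_ms=20):
        # Slice into fixed frames, DCT each frame, coarsely quantize, then deflate
        frame_len = sr * frame_ms // 1000
        n = len(samples) // frame_len
        frames = np.asarray(samples[:n * frame_len], dtype=float).reshape(n, frame_len)
        coeffs = dct(frames, axis=1, norm="ortho")
        q = np.clip(np.round(coeffs / 64), -128, 127).astype(np.int8)
        return zlib.compress(q.tobytes(), 9)

    # bitrate_kbps = len(naive_encode(samples)) * 8 / (len(samples) / sr) / 1000

My guess is this lands well above 3 kbps at any acceptable quality, which is exactly the gap the learned model is meant to close.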
skyde 1138 days ago [-]
this is super impressive
eznzt 1138 days ago [-]
Is it any good for languages other than English?
wmf 1138 days ago [-]
"As with any ML based system, the model must be trained to make sure that it works for everyone. We’ve trained Lyra with thousands of hours of audio with speakers in over 70 languages using open-source audio libraries and then verifying the audio quality with expert and crowdsourced listeners. ... Lyra trains on a wide dataset, including speakers in a myriad of languages, to make sure the codec is robust to any situation it might encounter."