I have a project idea that might be useful for someone to learn about audio processing (and maybe neural networks?). I like to listen to audio and watch video of podcasts (and lectures and other human speech) at faster speeds. Sometimes, especially if I'm trying to "skim" to see if the media is worth listening to carefully, I'd like to listen at 3x or faster. Very often, the limiting factor is the intelligibility of the actual words rather than my ability to mentally parse them.
Some software already removes complete silences, but this is a 10% effect and I think this could be taken much further. I would love audio software that could manipulate high-speed human speech to improve intelligibility by preferentially compressing parts with low information content (like vowels and "ughs") and uncompressing, or even "repairing", info-dense parts like sequential consonant sounds.
I've looked around and haven't been able to find anything like this. Could make a nice stand-alone app, or a library to sell to a podcast player.
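For illustration (not from the thread): the "remove complete silences" baseline mentioned above can be sketched as a per-frame energy threshold. The frame length and threshold below are arbitrary assumptions, not values from any real product.

```python
import numpy as np

def remove_silence(samples, rate, frame_ms=20, threshold=0.01):
    """Drop frames whose RMS energy falls below `threshold`.

    Toy sketch of silence removal; frame_ms and threshold
    are guesses that a real tool would tune or expose.
    """
    frame = int(rate * frame_ms / 1000)
    kept = []
    for i in range(len(samples) // frame):
        chunk = samples[i * frame:(i + 1) * frame]
        if np.sqrt(np.mean(chunk ** 2)) >= threshold:
            kept.append(chunk)
    return np.concatenate(kept) if kept else samples[:0]
```

A real implementation would also cross-fade at cut points and keep a few milliseconds of each pause, since zero-length gaps sound unnatural.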
Consider hand-producing a sample via manual audio editing, to demonstrate the limits of what ought to be possible. Find some audio you think could be listened to at that rate, see how fast you can listen to it via standard sound stretching techniques (e.g. libsoundtouch and similar), and then demonstrate how much better you can do with hand-editing. Worry about how to automate that after you successfully demonstrate that possibility and make it compelling.
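For context on the "standard sound stretching techniques" mentioned above: a bare-bones overlap-add time compression (which SoundTouch's WSOLA refines with a waveform-similarity search) can be sketched as below. Frame and hop sizes are arbitrary assumptions.

```python
import numpy as np

def stretch(samples, speed, frame=1024, hop=256):
    """Naive overlap-add time compression: read frames at
    `hop * speed`, write them at `hop`, so speed > 1 shortens
    the audio without shifting pitch. No similarity search,
    so quality is below WSOLA/phase-vocoder methods.
    """
    win = np.hanning(frame)
    out_len = int(len(samples) / speed) + frame
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    pos = 0.0
    out_pos = 0
    while int(pos) + frame <= len(samples) and out_pos + frame <= out_len:
        out[out_pos:out_pos + frame] += samples[int(pos):int(pos) + frame] * win
        norm[out_pos:out_pos + frame] += win
        pos += hop * speed      # read faster than we write
        out_pos += hop
    norm[norm == 0] = 1.0       # avoid divide-by-zero at window edges
    return (out / norm)[:out_pos]
```

The proposal in the thread amounts to making `speed` vary per-frame based on estimated information content, instead of being constant.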
Yeah, it's a good point. We can logically distinguish the basic audio problem (which might be really hard) from the automation problem.
On the other hand, suppose we somehow got good training data by getting a bunch of audio samples at the same number of words per minute that were graded by human listeners as easy or hard to understand. Then in principle something like a neural net might figure out what audio features were responsible for intelligibility and then adjust the non-intelligible audio in that direction (a la using convolutional neural nets to make pictures appear in the style of a famous painter without changing the content). This would be done automatically without any humans actually understanding the solution.
Sure, you could try any number of things to produce a solution. But even if you try an approach where you don't know at first what might work, you should likely still put effort into figuring out what features made it work, so that you can improve it further and maintain stability.
Nobody else has mentioned anything along these lines so I'll do so.
If you haven't talked to the blind community about this sort of thing already I would strongly recommend doing so as they'll be able to rapidly point you at the bleeding edge of what currently exists - they routinely use text to speech cranked up to illegible speeds.
(I also heard of one guy who would listen to TTS from his computer with one ear, while using his other ear to hold a phone conversation. I believe I read this in Thunder Dog: The True Story of a Blind Man, His Guide Dog, and the Triumph of Trust (978-1400204724 / 1400204720).)
The second reason I suggest this is that the blind community is a dense populace of users who use systems like this as part of their everyday lives, so if you aimed such a system at them the feedback quality would be extremely high, allowing for ridiculously fast iteration times and a great product. High-speed speech (TTS in particular) is a genuine technology hole/need.
Great comment, thanks. Agreed that blind users would be at cutting edge of this and provide the most useful feedback.
> they routinely use text to speech cranked up to illegible speeds.
Note that this is not quite what I'm talking about. TTS is a slightly different problem since you're constructing the speech, and you can actually choose voice synthesizers that sound weird but remain intelligible at high speeds. On the other hand, trying to modify existing voice audio (with no text) for greater intelligibility at high speeds is a different problem.
>> they routinely use text to speech cranked up to illegible speeds.
> Note that this is not quite what I'm talking about.
Good point. I think I forgot to fully qualify that statement - while initially composing my reply I got completely distracted with TTS and forgot this was about altering speech to go faster. I realized and went back and edited it a couple minutes later, but didn't adjust it sufficiently.
I built a podcast/audiobook player app ('lectr') with gap removal, and it had experimental support for what you describe. Roughly, it removed samples if there was no significant difference in the frequency-domain spectra. This was a simple threshold test (the threshold could be adjusted by the user).
It works for some material. Some speakers (usually professional book readers) have even pacing, so it's more effective on them. A simple 1.5-1.7x speedup is also useful there. Podcasters tend to be 'bursty' in that they speak rapidly and then pause, and gap removal was more useful there.
I stopped working on it as my lifestyle changed such that I wasn't using it. There was almost no commercial interest in the project.
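A rough sketch of the spectral-difference test described above: keep a frame only when its magnitude spectrum differs noticeably from the last kept frame. The distance metric and threshold here are guesses, not lectr's actual values.

```python
import numpy as np

def drop_redundant_frames(samples, frame=512, threshold=0.1):
    """Keep a frame only if its normalized magnitude spectrum
    differs from the previously kept frame by more than
    `threshold` (Euclidean distance). Sustained, unchanging
    sounds collapse to a single frame.
    """
    kept = []
    prev_spec = None
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        spec = np.abs(np.fft.rfft(chunk))
        n = np.linalg.norm(spec)
        if n > 0:
            spec = spec / n
        if prev_spec is None or np.linalg.norm(spec - prev_spec) > threshold:
            kept.append(chunk)
            prev_spec = spec
    return np.concatenate(kept) if kept else samples[:0]
```

Note this tends to compress exactly what the original poster wants compressed: steady vowel sounds change slowly frame-to-frame, while consonant transients change quickly and survive the test.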
I'd love to see a library, especially with MOOCs etc becoming mainstream. Most players provide 1.25/1.5/2.0x options, but between 1.5x and 2.0x is the sweet spot for me, and almost no apps provide that.
Thanks for your insightful comment! Is Lectr still available anywhere? Nothing showed up on a Google search.
> between 1.5x and 2.0x is the sweet spot for me, and almost no apps provide that.
Agreed, this is super frustrating and is some evidence that the market isn't that interested. (Why would they pay for advanced compression if no one bothers with trivial stuff like finer gradation?)
Fwiw, on Android I use Audible, PocketCasts, and Audipo for audiobooks, podcasts, and general audio, respectively. They are all serviceable and have 0.1x or finer granularity.
I still use Swift for finer gradation. It also worked for video files, which was fantastic (mine never did that). Swift is 32-bit and will stop working soon. I haven't found a replacement yet.
(On iOS, at least, the 1.25/1.5/2.0 thing is because that's what the OS provides for very little effort. Finer controls require use of a supported-but-undocumented API or AudioUnits.)
Interesting idea. I used to watch Berkeley webcast videos at 1.5x to shave off ~30 minutes from a 90 minute lecture. Any faster wouldn't be intelligible.
I do the same. I usually watch all lecture videos at 1.5x or 2x speed. Saves me tons of time, which is good because usually the videos are on random interesting topics that are distracting me from work.
It is the case that you can teach yourself to understand faster and faster speech. The blind often have human interface devices which speak at absurdly high vocal acceleration.
This also depends heavily on information density. I can listen to some types of content at 2x without issue; other content I can't accelerate more than 10-20%.
This is doable. If one were, uh, hypothetically, to write a library that did this, how would the HN community recommend monetizing it? I have no experience selling my wares to anyone but a large company, let alone an app store or something like that. Would the tool primarily be applied to nice clean audio, like books on tape, or also required to work with significant background noise?
IANAL but I can't see any harm in releasing some demo samples of what your system can do. To be fair, an example of how the system breaks down when it gets pushed too hard would probably be useful too.
With that in hand, perhaps you might create a new post showing the samples and asking for advice. (Maybe you could also let people make sample requests via comments?)
It's possible that you may receive offers from people interested in being business partners (treating the situation like a startup); have fun with that ;)
At some point it might become useful to feed the audio into speech recognition, then feed the result into a text-to-speech engine. You will lose all of the prosody and speaker characteristics, but blind people run their screen readers at crazy speeds, so it will stay intelligible.
I'd definitely use something that did this :) I usually speed up youtube videos/podcasts to 2x. Many times this is possible because the content that is presented is already known or is easy to understand. However, this breaks down when learning something new.
'Podcast Addict' (for Android) can speed podcasts up like this, up to 5x. Usually, I can comfortably bump the speed up to 1.25x-1.4x without any trouble. 2x for me ends up being too fast, but useful if you're skimming and not trying to absorb every second of the audio.
That'd be nice but not really necessary. Biggest use case would be podcasts and lectures, which are usually downloaded in advance so there's plenty of time for off-line processing.
http://softwarerecs.stackexchange.com/questions/27175/video-...