I have a project idea that might be useful for someone to learn about audio processing (and maybe neural networks?). I like to listen to audio and watch video of podcasts (and lectures and other human speech) at faster speeds. Sometimes, especially if I'm trying to "skim" to see if the media is worth listening to carefully, I'd like to listen at 3x or faster. Very often, the limiting factor is the intelligibility of the actual words rather than my ability to mentally parse them.
Some software already removes complete silences, but this is a 10% effect and I think this could be taken much further. I would love audio software that could manipulate high-speed human speech to improve intelligibility by preferentially compressing parts with low information content (like vowels and "ughs") and uncompressing, or even "repairing", info-dense parts like sequential consonant sounds.
I've looked around and haven't been able to find anything like this. Could make a nice stand-alone app, or a library to sell to a podcast player.
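For illustration (not from the thread): the "remove complete silences" baseline mentioned above can be sketched as a per-frame energy threshold. The frame length and threshold below are arbitrary assumptions, not values from any real product.

```python
import numpy as np

def remove_silence(samples, rate, frame_ms=20, threshold=0.01):
    """Drop frames whose RMS energy falls below `threshold`.

    Toy sketch of silence removal; frame_ms and threshold
    are guesses that a real tool would tune or expose.
    """
    frame = int(rate * frame_ms / 1000)
    kept = []
    for i in range(len(samples) // frame):
        chunk = samples[i * frame:(i + 1) * frame]
        if np.sqrt(np.mean(chunk ** 2)) >= threshold:
            kept.append(chunk)
    return np.concatenate(kept) if kept else samples[:0]
```

A real implementation would also cross-fade at cut points and keep a few milliseconds of each pause, since zero-length gaps sound unnatural.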
Consider hand-producing a sample via manual audio editing, to demonstrate the limits of what ought to be possible. Find some audio you think could be listened to at that rate, see how fast you can listen to it via standard sound stretching techniques (e.g. libsoundtouch and similar), and then demonstrate how much better you can do with hand-editing. Worry about how to automate that after you successfully demonstrate that possibility and make it compelling.
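For context on the "standard sound stretching techniques" mentioned above: a bare-bones overlap-add time compression (which SoundTouch's WSOLA refines with a waveform-similarity search) can be sketched as below. Frame and hop sizes are arbitrary assumptions.

```python
import numpy as np

def stretch(samples, speed, frame=1024, hop=256):
    """Naive overlap-add time compression: read frames at
    `hop * speed`, write them at `hop`, so speed > 1 shortens
    the audio without shifting pitch. No similarity search,
    so quality is below WSOLA/phase-vocoder methods.
    """
    win = np.hanning(frame)
    out_len = int(len(samples) / speed) + frame
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    pos = 0.0
    out_pos = 0
    while int(pos) + frame <= len(samples) and out_pos + frame <= out_len:
        out[out_pos:out_pos + frame] += samples[int(pos):int(pos) + frame] * win
        norm[out_pos:out_pos + frame] += win
        pos += hop * speed      # read faster than we write
        out_pos += hop
    norm[norm == 0] = 1.0       # avoid divide-by-zero at window edges
    return (out / norm)[:out_pos]
```

The proposal in the thread amounts to making `speed` vary per-frame based on estimated information content, instead of being constant.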
Yeah, it's a good point. We can logically distinguish the basic audio problem (which might be really hard) from the automation problem.
On the other hand, suppose we somehow got good training data by getting a bunch of audio samples at the same number of words per minute that were graded by human listeners as easy or hard to understand. Then in principle something like a neural net might figure out what audio features were responsible for intelligibility and then adjust the non-intelligible audio in that direction (a la using convolutional neural nets to make pictures appear in the style of a famous painter without changing the content). This would be done automatically without any humans actually understanding the solution.
Sure, you could try any number of things to produce a solution. But even if you try an approach where you don't know at first what might work, you should likely still put effort into figuring out what features made it work, so that you can improve it further and maintain stability.
Nobody else has mentioned anything along these lines so I'll do so.
If you haven't talked to the blind community about this sort of thing already I would strongly recommend doing so as they'll be able to rapidly point you at the bleeding edge of what currently exists - they routinely use text to speech cranked up to illegible speeds.
(I also heard of one guy who would listen to TTS from his computer with one ear, while using his other ear to hold a phone conversation. I believe I read this in Thunder Dog: The True Story of a Blind Man, His Guide Dog, and the Triumph of Trust (978-1400204724 / 1400204720).)
The second reason I suggest this is that the blind community is a dense populace of users who use systems like this as part of their everyday lives, so if you aimed such a system at them the feedback quality would be extremely high, allowing for ridiculously fast iteration times and a great product. High-speed speech (TTS in particular) is a genuine technology hole/need.
Great comment, thanks. Agreed that blind users would be at cutting edge of this and provide the most useful feedback.
> they routinely use text to speech cranked up to illegible speeds.
Note that this is not quite what I'm talking about. TTS is a slightly different problem since you're constructing the speech, and you can actually choose voice synthesizers that sound weird but remain intelligible at high speeds. On the other hand, trying to modify existing voice audio (with no text) for greater intelligibility at high speeds is a different problem.
>> they routinely use text to speech cranked up to illegible speeds.
> Note that this is not quite what I'm talking about.
Good point. I think I forgot to fully qualify that statement - while initially composing my reply I got completely distracted with TTS and forgot this was about altering speech to go faster. I realized and went back and edited it a couple minutes later, but didn't adjust it sufficiently.
I built a podcast/audiobook player app ('lectr') with gap removal, and it had experimental support for what you describe. Roughly, it removed samples if there was no significant difference in the frequency-domain spectra. This was a simple threshold test (the threshold could be adjusted by the user).
It works for some material. Some speakers (usually professional book readers) have even pacing, so it's more effective on them. A simple 1.5-1.7x speedup is also useful there. Podcasters tend to be 'bursty' in that they speak rapidly and then pause, and gap removal was more useful there.
I stopped working on it as my lifestyle changed such that I wasn't using it. There was almost no commercial interest in the project.
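A rough sketch of the spectral-difference test described above: keep a frame only when its magnitude spectrum differs noticeably from the last kept frame. The distance metric and threshold here are guesses, not lectr's actual values.

```python
import numpy as np

def drop_redundant_frames(samples, frame=512, threshold=0.1):
    """Keep a frame only if its normalized magnitude spectrum
    differs from the previously kept frame by more than
    `threshold` (Euclidean distance). Sustained, unchanging
    sounds collapse to a single frame.
    """
    kept = []
    prev_spec = None
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        spec = np.abs(np.fft.rfft(chunk))
        n = np.linalg.norm(spec)
        if n > 0:
            spec = spec / n
        if prev_spec is None or np.linalg.norm(spec - prev_spec) > threshold:
            kept.append(chunk)
            prev_spec = spec
    return np.concatenate(kept) if kept else samples[:0]
```

Note this tends to compress exactly what the original poster wants compressed: steady vowel sounds change slowly frame-to-frame, while consonant transients change quickly and survive the test.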
I'd love to see a library, especially with MOOCs etc becoming mainstream. Most players provide 1.25/1.5/2.0x options, but between 1.5x and 2.0x is the sweet spot for me, and almost no apps provide that.
Thanks for your insightful comment! Is Lectr still available anywhere? Nothing showed up on a Google search.
> between 1.5x and 2.0x is the sweet spot for me, and almost no apps provide that.
Agreed, this is super frustrating and is some evidence that the market isn't that interested. (Why would they pay for advanced compression if no one bothers with trivial stuff like finer gradation?)
Fwiw, on Android I use Audible, PocketCasts, and Audipo for audiobooks, podcasts, and general audio, respectively. They are all serviceable and have 0.1x or finer granularity.
I still use Swift for finer gradation. It also worked for video files, which was fantastic (mine never did that). Swift is 32-bit and will stop working soon. I haven't found a replacement yet.
(On iOS, at least, the 1.25/1.5/2.0 thing is because that's what the OS provides for very little effort. Finer controls require use of a supported-but-undocumented API or AudioUnits.)
Interesting idea. I used to watch Berkeley webcast videos at 1.5x to shave off ~30 minutes from a 90 minute lecture. Any faster wouldn't be intelligible.
I do the same. I usually watch all lecture videos at 1.5x or 2x speed. Saves me tons of time, which is good because usually the videos are on random interesting topics that are distracting me from work.
It is the case that you can teach yourself to understand faster and faster speech. The blind often have human interface devices which speak at absurdly high vocal acceleration.
This also depends heavily on information density. I can listen to some types of content at 2x without issue; other content I can't accelerate more than 10-20%.
This is doable. If one were, uh, hypothetically, to write a library that did this, how would the HN community recommend monetizing it? I have no experience selling my wares to anyone but a large company, let alone an app store or something like that. Would the tool primarily be applied to nice clean audio, like books on tape, or also required to work with significant background noise?
IANAL but I can't see any harm in releasing some demo samples of what your system can do. To be fair, an example of how the system breaks down when it gets pushed too hard would probably be useful too.
With that in hand, perhaps you might create a new post showing the samples and asking for advice. (Maybe you could also let people make sample requests via comments?)
It's possible that you may receive offers from people interested in being business partners (treating the situation like a startup); have fun with that ;)
At some point it might become useful to feed the audio into speech recognition, then feed the result into a text-to-speech engine. You will lose all of the prosody and speaker characteristics, but blind people run their screen readers at crazy speeds, so it will stay intelligible.
I'd definitely use something that did this :) I usually speed up youtube videos/podcasts to 2x. Many times this is possible because the content that is presented is already known or is easy to understand. However, this breaks down when learning something new.
'Podcast Addict' (for Android) can speed podcasts up like this, up to 5x. Usually, I can comfortably bump the speed up to 1.25x-1.4x without any trouble. 2x for me ends up being too fast, but useful if you're skimming and not trying to absorb every second of the audio.
That'd be nice but not really necessary. Biggest use case would be podcasts and lectures, which are usually downloaded in advance so there's plenty of time for off-line processing.
http://softwarerecs.stackexchange.com/questions/27175/video-...