FS1R Formant Shaping, Part 1: Human speech
Introduction
Over a series of posts, I'll help you get a deeper understanding of Formant Shaping on the FS1R.
I'll start with some background, and then dive deeper into the technical aspects.
Communication via audio
Animals
Many animals can make sounds: "bark", "oink", "tweet", and "ssssss". It usually just means: "I'm here!"
Some can alter their calls. A dog can bark softly or loudly to express the difference between "I don't really appreciate this" and "I'll tear you up!".
Some animals can pack in even more information. Birds produce a whistle-like sound that can be varied quickly in both pitch and loudness.
Human speech
We can do what birds do, to a certain extent: blow air from our lungs and vary the pitch.
Birds outsing us when it comes to producing a single frequency that varies over time.
But we've developed something truly awesome on top of that technique.
We give the outgoing sound an extra treatment: we reshape it, damping and amplifying specific frequencies that correspond to the data we want to encode. So not only do we have the sound that's produced in our throats; we also have a few extra tracks of frequencies to store information in.
And with that, we pull off something remarkable: we've found a way to encode multiple streams of information in parallel into a single audio stream. Eat that, birds!
The resonant frequencies that our vocal tract shapes after the throat are called formants.
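To make that concrete, here's a minimal sketch of the idea in Python (using numpy and scipy; the library choice, the 110 Hz source, and the formant values are my illustrative assumptions, not anything specific to the FS1R): a buzzy source signal stands in for the vocal cords, and two resonant filters stand in for formants.

    import numpy as np
    from scipy.signal import iirpeak, lfilter

    fs = 16_000                          # sample rate in Hz
    t = np.arange(fs) / fs               # one second of audio

    # A glottal-like source: a 110 Hz pulse train, rich in harmonics,
    # standing in for the raw buzz produced in the throat.
    source = (np.sin(2 * np.pi * 110 * t) > 0.99).astype(float)

    # Rough, illustration-only formant centers for an "ah"-like vowel:
    # F1 around 700 Hz, F2 around 1200 Hz (textbook-style values).
    shaped = np.zeros_like(source)
    for freq, gain in ((700.0, 1.0), (1200.0, 0.7)):
        b, a = iirpeak(freq, Q=8.0, fs=fs)      # narrow resonant band-pass
        shaped += gain * lfilter(b, a, source)  # boost this "formant"

    shaped /= np.abs(shaped).max()       # normalize to [-1, 1]

Played back, the filtered buzz already sounds noticeably more vowel-like than the raw pulse train, even with only two fixed resonances.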
Sending data encoded in frequencies has great advantages for us:
- We can send more data in parallel.
  - In the same way a radio has multiple channels, we emit multiple signals on different frequencies (see the sketch after this list).
- It's fault-tolerant.
  - The same information is repeated many times before the next packet is sent, so missing a packet is not a big deal.
  - Using the multiple signals together allows our brains to perform some error correction.
- Our ears, brains, throats, and mouths are all optimized for this way of exchanging information.
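To see how far the radio analogy carries, here's a crude frequency-division sketch (my own toy; the carrier frequencies and bit timing are arbitrary choices): two independent bit streams share a single audio stream by riding on different frequencies, and a resonant filter pulls one of them back out.

    import numpy as np
    from scipy.signal import iirpeak, lfilter

    fs = 16_000
    bit_dur = int(0.05 * fs)            # 50 ms per bit

    def keyed_carrier(bits, carrier_hz):
        """Amplitude-key one bit stream onto its own carrier frequency."""
        env = np.repeat(np.asarray(bits, dtype=float), bit_dur)
        t = np.arange(env.size) / fs
        return env * np.sin(2 * np.pi * carrier_hz * t)

    # Two independent bit streams, one audio stream, two "channels"
    # on different frequencies (600 Hz and 1800 Hz, arbitrary values).
    mix = keyed_carrier([1, 0, 1, 1], 600) + keyed_carrier([0, 1, 1, 0], 1800)

    # A listener recovers one stream with a resonant filter around its
    # carrier, largely ignoring the other one.
    b, a = iirpeak(600, Q=20.0, fs=fs)
    channel_1 = lfilter(b, a, mix)      # mostly the first bit stream again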
The "protocol"
We don't process speech in the time domain, the way a sampler stores it. That's not how our ears are built, and it's not how our brains process audio.
Instead, we process information in the frequency domain. And we don't even give all frequencies the same attention: we mostly track the peak frequencies.
A specific combination of boosted frequencies can be interpreted as a vowel (think "ooh", "aah", or "eee").
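As a toy version of that decoding step (my own sketch, nothing like how ears actually work), we can take a short frame of audio, find its two strongest spectral peaks, and look them up in a small table of approximate (F1, F2) formant pairs:

    import numpy as np
    from scipy.signal import find_peaks

    # Rough (F1, F2) formant centers in Hz per vowel; approximate,
    # illustration-only values.
    VOWELS = {"ooh": (350, 800), "aah": (700, 1200), "eee": (300, 2300)}

    def guess_vowel(frame, fs=16_000):
        """Return the vowel whose (F1, F2) pair lies closest to the two
        strongest spectral peaks of the frame. Assumes the frame really
        does contain two clear peaks."""
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame.size)))
        freqs = np.fft.rfftfreq(frame.size, d=1 / fs)
        peaks, props = find_peaks(spectrum, height=spectrum.max() * 0.1)
        tallest = peaks[np.argsort(props["peak_heights"])[-2:]]
        f1, f2 = sorted(freqs[tallest])
        return min(VOWELS,
                   key=lambda v: abs(VOWELS[v][0] - f1) + abs(VOWELS[v][1] - f2))

Real vowel recognition is far messier, but the principle stands: the identity is carried by which frequencies peak, not by the waveform's shape in time.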
The human speech "protocol" also allows for other sounds to be mixed in: sounds that don't have a very specific frequency and are more noise-like, like the letter S, or sudden noisy bursts, like the P.
To have intelligible speech, all you need is:
- 2 to 3 frequency tracks
- some noise-like portions
This is universal for all human languages. Language and meaning are built on top of these building blocks.
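A sine-wave-speech-style sketch shows how little is really needed (again a toy of mine, with made-up track values): two gliding frequency tracks for a vowel, preceded by a burst of noise standing in for an S.

    import numpy as np

    fs = 16_000  # sample rate in Hz

    def tone_track(f_start, f_end, dur_s):
        """A sine whose frequency glides from f_start to f_end: one
        'frequency track' of the protocol."""
        n = int(fs * dur_s)
        freq = np.linspace(f_start, f_end, n)
        phase = 2 * np.pi * np.cumsum(freq) / fs
        return np.sin(phase)

    # Two gliding tracks, roughly "aah" -> "eee" (illustrative values).
    vowel = tone_track(700, 300, 0.4) + 0.6 * tone_track(1200, 2300, 0.4)

    # A noise-like portion, standing in for an "s".
    rng = np.random.default_rng(0)
    hiss = 0.3 * rng.standard_normal(int(fs * 0.15))

    utterance = np.concatenate([hiss, vowel])
    utterance /= np.abs(utterance).max()   # normalize to [-1, 1]

The result is crude but recognizably speech-like, which is the whole point: a couple of frequency tracks plus some noise already carry the message.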
The FS1R neatly condenses that model into an engine that can produce exactly what's needed (and more!).
Next time, we'll see how humans produce speech, and explain some of the FS1R terminology.

