
Audio: MFCC: Add Voice Activity Detection based on Mel spectrum#10782

Draft
singalsu wants to merge 1 commit into thesofproject:main from singalsu:mfcc_add_vad

Conversation

@singalsu
Collaborator

This patch adds a new mfcc_vad module that implements VAD operating on the Mel log spectrum values produced by the MFCC component. The VAD is very simple and is not very selective for voice vs. other signals. However, the continuously updated background noise estimate prevents stationary noises from triggering the VAD.

The algorithm tracks a per-bin noise floor (instant-down, slow-rise) and computes an A-weighted energy delta between the current frame and that floor. The weighting emphasizes speech frequencies. Speech is declared when the delta exceeds a threshold (0.30 in Q9.23), with a 20-frame hangover to prevent rapid toggling.
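The per-frame update described above could look roughly like the sketch below. It is not the code from this PR: all `sketch_`/`SKETCH_` names, the fixed maximum bin count, the rise coefficient value, the assumption that the Mel log values are in Q9.23, and the exact hangover counting are illustrative assumptions. Only the threshold (0.30 in Q9.23) and the 20-frame hangover come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_BINS        128      /* hypothetical upper bound on Mel bins */
#define SKETCH_HANGOVER_FRAMES 20       /* from the PR description */
#define SKETCH_THRESHOLD_Q23   2516582  /* 0.30 * 2^23, rounded */
#define SKETCH_RISE_Q15        33       /* assumed slow-rise coefficient, ~0.001 in Q1.15 */

struct sketch_vad {
	int32_t noise_floor[SKETCH_MAX_BINS]; /* per-bin floor, assumed Q9.23 */
	int16_t weight_q15[SKETCH_MAX_BINS];  /* per-bin weights in Q1.15, summing to ~1.0 */
	int num_bins;
	int hangover;                         /* frames left with the flag held high */
	bool primed;                          /* floor initialized from the first frame */
};

/* Update the noise floor with one Mel log frame and return the VAD flag. */
bool sketch_vad_update(struct sketch_vad *v, const int32_t *mel_log)
{
	int64_t delta = 0;
	int i;

	assert(v->num_bins > 0 && v->num_bins <= SKETCH_MAX_BINS);

	if (!v->primed) {
		for (i = 0; i < v->num_bins; i++)
			v->noise_floor[i] = mel_log[i];
		v->primed = true;
	}

	for (i = 0; i < v->num_bins; i++) {
		int64_t x = mel_log[i];
		int64_t nf = v->noise_floor[i];

		if (x < nf)
			nf = x;                                  /* instant down */
		else
			nf += ((x - nf) * SKETCH_RISE_Q15) >> 15; /* slow rise */

		v->noise_floor[i] = (int32_t)nf;
		delta += (x - nf) * v->weight_q15[i];     /* weighted excess over floor */
	}
	delta >>= 15; /* Q9.23 * Q1.15 product back to Q9.23 */

	if (delta > SKETCH_THRESHOLD_Q23)
		v->hangover = SKETCH_HANGOVER_FRAMES;
	else if (v->hangover > 0)
		v->hangover--;

	return v->hangover > 0;
}
```

Because the floor follows any drop immediately but rises only slowly, a steady noise level converges to a delta near zero, while a sudden rise in the weighted bins exceeds the threshold for as long as the floor has not caught up.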

The VAD flag is inserted into the output stream as the first value after the magic header word in all format paths (S16, S24, S32).
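For the S32 path, the resulting frame layout can be pictured as magic word, VAD flag, then the Mel coefficients. The following host-side sketch parses one such frame. The magic value, the function name, and the assumption that each field occupies one 32-bit word are placeholders for illustration and are not taken from the PR; the S16 and S24 paths are not covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC 0x6d666363 /* placeholder value, not the real header word */

/*
 * Parse one frame laid out as: [magic][vad flag][num_mel coefficients].
 * Returns false if the buffer does not start with the magic word or is
 * too short to hold a complete frame.
 */
bool sketch_parse_frame(const int32_t *buf, size_t len, size_t num_mel,
			int32_t *vad_flag, const int32_t **mel)
{
	assert(buf && vad_flag && mel);

	if (len < 2 + num_mel || buf[0] != SKETCH_MAGIC)
		return false;

	*vad_flag = buf[1]; /* first value after the magic word */
	*mel = &buf[2];     /* Mel coefficients follow the flag */
	return true;
}
```

The length check matters for captured streams, where the last frame can be cut off at a period boundary.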

A new Kconfig option CONFIG_COMP_MFCC_VAD (depends on COMP_MFCC, default n) gates compilation of the VAD code and the stream format change.

@singalsu
Collaborator Author

This is still WIP. I'd like to add a better audio feature header to the fake PCM stream. Successive PRs should start to use the compress PCM type for MFCC output data. The MFCC config blob could then enable discontinuous data in VAD mode, e.g. background noise Mel spectrum values once per second, and Mel spectrum values at the FFT hop rate (e.g. every 10 ms) while speech is detected.

Comment thread src/audio/mfcc/mfcc_vad.c
/* Find j such that a_weight_hz[j] <= f_hz < a_weight_hz[j+1] */
for (j = 0; j < A_WEIGHT_TABLE_SIZE - 2; j++) {
if (f_hz < a_weight_hz[j + 1])
break;
Collaborator

binary search?

Collaborator Author

@singalsu singalsu May 18, 2026

Can this be implemented with some binary search function? It's a very small table (36 values) and this is initialization time code, not hot.
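For reference, a binary search that returns the same interval index as the quoted linear loop could be sketched as below. The function name and parameter types are hypothetical and no existing SOF helper is implied. The result is clamped to the range 0 to n - 2, matching the loop's behavior for frequencies below the first or above the last table entry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Find j such that tbl[j] <= f < tbl[j + 1] in an ascending table of n
 * entries. Values below tbl[0] map to 0, values at or above tbl[n - 1]
 * map to n - 2, so j and j + 1 are always valid interpolation indices.
 */
int sketch_find_interval(const int32_t *tbl, int n, int32_t f)
{
	int lo = 0;
	int hi = n - 1;

	assert(n >= 2);

	/* Invariant: the returned index j satisfies lo <= j < hi */
	while (hi - lo > 1) {
		int mid = lo + (hi - lo) / 2;

		if (f < tbl[mid])
			hi = mid;
		else
			lo = mid;
	}
	return lo;
}
```

With a 36-entry table this takes at most six comparisons instead of up to 34, which is unlikely to matter for initialization-time code.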

This patch adds a new mfcc_vad module. It operates on the Mel
log spectrum values produced by the MFCC component. The VAD is
very simple and not very selective for voice vs. other signals.
But the continuously updated background noise estimate prevents
stationary noises from triggering the VAD.

The algorithm tracks a per-bin noise floor (instant-down, slow-rise)
and computes an A-weighted energy delta. The weighting emphasizes
speech frequencies. Speech is declared when the delta exceeds a
threshold (0.35 in Q9.23) with a 20-frame hangover to prevent rapid
toggling.

The VAD flag is inserted into the output stream as the first value
after the magic header word in all format paths (S16, S24, S32).

A new Kconfig option CONFIG_COMP_MFCC_VAD (depends on COMP_MFCC,
default y) gates compilation of the VAD code and the stream format
change.

The README.txt file is updated to show how to run the example
Python script sof_mel_to_text_live_dsp_vad.py. The script uses the
MFCC Mel spectrum data and the VAD flag stream as audio features
for the Whisper speech-to-text model. The README is converted to
Markdown.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
@singalsu singalsu marked this pull request as ready for review May 19, 2026 11:11
Copilot AI review requested due to automatic review settings May 19, 2026 11:11
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces an optional MFCC Voice Activity Detection (VAD) feature that runs on the MFCC component’s Mel log spectrum and embeds a VAD flag into the MFCC/Mel output stream, along with updated host-side tuning/decoding tooling and documentation.

Changes:

  • Add a new mfcc_vad module (state, initialization, per-frame update) and wire it into MFCC Mel-log-spectrum processing.
  • Insert a per-frame VAD flag into the MFCC output stream immediately after the magic header word (gated by a new Kconfig option).
  • Update tuning tools/documentation: add a live DSP-VAD-triggered Whisper transcription script, migrate README to Markdown, and extend decode_mel.m to extract VAD.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/include/sof/audio/mfcc/mfcc_vad.h | New public header for VAD state + API and tuning constants |
| src/audio/mfcc/mfcc_vad.c | New VAD implementation (noise floor tracking + weighted energy delta + hangover) |
| src/include/sof/audio/mfcc/mfcc_comp.h | Extend MFCC component state to carry VAD state and output bookkeeping |
| src/audio/mfcc/mfcc_common.c | Run VAD during Mel processing and emit VAD flag in stream output |
| src/audio/mfcc/mfcc_setup.c | Initialize/free VAD resources during MFCC setup/teardown |
| src/audio/mfcc/Kconfig | Add CONFIG_COMP_MFCC_VAD option controlling build + format change |
| src/audio/mfcc/CMakeLists.txt | Conditionally compile mfcc_vad.c |
| src/arch/host/configs/library_defconfig | Enable VAD in host library defconfig |
| src/audio/mfcc/tune/sof_mel_to_text_live_dsp_vad.py | New live capture + Whisper transcription tool using DSP-embedded VAD |
| src/audio/mfcc/tune/README.md | New Markdown documentation (replaces README.txt) |
| src/audio/mfcc/tune/decode_mel.m | Extend Mel decoder to parse VAD flag and plot it |
Comments suppressed due to low confidence (1)

src/audio/mfcc/mfcc_common.c:297

  • vad_pending is only set for state->mel_only. If VAD is meant to be emitted for all MFCC output frames (including cepstral output), this needs to be set for the non-mel_only path too; otherwise, please update docs to state the VAD flag is only present in Mel-log-spectrum output streams.
		if (state->mel_only) {
			state->out_data_ptr = state->mel_spectra->data;
#ifdef CONFIG_COMP_MFCC_VAD
			state->vad_pending = true;
#endif

#define MFCC_VAD_NOISE_INIT_FRAMES 100

/**
* \brief Slow noise floor rise coefficient in Q1.15 (0.0010 * 32768 = 3).
Comment thread src/audio/mfcc/Kconfig
Comment on lines +28 to +32
config COMP_MFCC_VAD
bool "MFCC Voice Activity Detection"
depends on COMP_MFCC
default y
help
Comment on lines +373 to +377
# --- Speech buffering logic ---
if speech:
speech_buffer.append(mel.copy())
silence_counter = 0
was_speaking = True
Comment on lines +151 to +154
#ifdef CONFIG_COMP_MFCC_VAD
/* Run VAD on the mel log spectrum before further processing */
state->vad_flag = mfcc_vad_update(&cd->vad, state->mel_log_32);
#endif
Comment on lines +361 to +365
#ifdef CONFIG_COMP_MFCC_VAD
ret = mfcc_vad_init(&cd->vad, config->num_mel_bins, sample_rate, mod);
if (ret < 0) {
comp_err(dev, "Failed VAD init");
goto free_lifter;
Comment on lines 72 to 76
% Last frame can be incomplete due to span over multiple periods
last = idx(end) + num_mel - 1;
if (last > length(data))
num_frames = num_frames - 1;
end
Comment on lines +78 to 80
% VAD flag is first int32 after magic, followed by num_mel coefficients
payload_len = 1 + num_mel;

print(f"Whisper model: {model_path} (encoder: {encoder_device}, decoder: {decoder_device})")
print()

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
@singalsu
Collaborator Author

I think I'll remove CONFIG_COMP_MFCC_VAD and always build the VAD. Then it's simpler to make the flag a permanent part of the magic header. The configuration blob for Mel mode can enable computing it, while in MFCC mode it will be zero unless it is also enabled there with the blob. Then the parsing scripts can always use the same data format.

@singalsu singalsu marked this pull request as draft May 19, 2026 15:02
@singalsu
Collaborator Author

Adding more features --> draft


3 participants