Audio: MFCC: Add Voice Activity Detection based on Mel spectrum by singalsu · Pull Request #10782 · thesofproject/sof

singalsu · 2026-05-15T16:19:37Z

This patch adds a new mfcc_vad module that implements VAD operating on the Mel log spectrum values produced by the MFCC component. The VAD is very simple and is not very selective for voice vs. other signals, but the continuously updated background noise estimate prevents stationary noises from triggering the VAD.

The algorithm tracks a per-bin noise floor (instant-down, slow-rise) and computes an A-weighted energy delta. The weighting emphasizes speech frequencies. Speech is declared when the delta exceeds a threshold (0.30 in Q9.23), with a 20-frame hangover to prevent rapid toggling.
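The steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in mfcc_vad.c: the `sk_` names, the rise rate (`SK_RISE_SHIFT`), the normalization by the weight sum, and computing the delta after the floor update are all assumptions. The A-weighting table itself is not reproduced; the caller passes per-bin weights.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_MAX_BINS	128
#define SK_THRESHOLD	((int32_t)(0.30 * (1 << 23)))	/* 0.30 in Q9.23 */
#define SK_HANGOVER	20	/* frames, as in the description */
#define SK_RISE_SHIFT	10	/* assumed: floor rises by gap/1024 per frame */

struct sk_vad {
	int32_t floor[SK_MAX_BINS];	/* per-bin noise floor, Q9.23 */
	int num_bins;
	int hangover;			/* remaining hold frames */
	bool primed;			/* false until the first frame seeds the floor */
};

static void sk_vad_init(struct sk_vad *v, int num_bins)
{
	assert(num_bins > 0 && num_bins <= SK_MAX_BINS);
	v->num_bins = num_bins;
	v->hangover = 0;
	v->primed = false;
}

/* mel_log: Q9.23 log-Mel values; weight: non-negative Q1.15 per-bin weights */
static bool sk_vad_update(struct sk_vad *v, const int32_t *mel_log,
			  const int16_t *weight)
{
	int64_t acc = 0;
	int64_t wsum = 0;
	int i;

	for (i = 0; i < v->num_bins; i++) {
		if (!v->primed || mel_log[i] < v->floor[i])
			v->floor[i] = mel_log[i];	/* instant down */
		else	/* slow rise toward the current value */
			v->floor[i] += (int32_t)(((int64_t)mel_log[i] - v->floor[i])
						 >> SK_RISE_SHIFT);

		/* weighted distance of this bin above its noise floor */
		acc += (int64_t)weight[i] * ((int64_t)mel_log[i] - v->floor[i]);
		wsum += weight[i];
	}
	v->primed = true;
	assert(wsum > 0);

	if (acc / wsum > SK_THRESHOLD) {
		v->hangover = SK_HANGOVER;	/* re-arm the hold period */
		return true;
	}
	if (v->hangover > 0) {
		v->hangover--;			/* keep the flag up during hangover */
		return true;
	}
	return false;
}
```

With this structure a stationary noise pulls the floor up toward its own level over time, so its delta shrinks and the flag drops once the hangover expires.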

The VAD flag is inserted into the output stream as the first value after the magic header word in all format paths (S16, S24, S32).

A new Kconfig option CONFIG_COMP_MFCC_VAD (depends on COMP_MFCC, default n) gates compilation of the VAD code and the stream format change.

singalsu · 2026-05-15T16:25:47Z

This is still WIP. I'd like to add a better audio feature header to the fake PCM stream. Successive PRs should start to use the compress PCM type for MFCC output data. In VAD mode, the MFCC config blob could enable discontinuous data, e.g. background noise Mel spectrum values once per second and, while speech is detected, values at the FFT hop rate, e.g. every 10 ms.

lyakh · 2026-05-18T06:25:18Z

+			/* Find j such that a_weight_hz[j] <= f_hz < a_weight_hz[j+1] */
+			for (j = 0; j < A_WEIGHT_TABLE_SIZE - 2; j++) {
+				if (f_hz < a_weight_hz[j + 1])
+					break;


binary search?

Can this be implemented with some binary search function? It's a very small table (36 values) and this is initialization time code, not hot.
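For reference, a binary-search version of the interval lookup could look like the sketch below. It is an illustration only: the function name `sk_find_interval` and the `int32_t` element type are assumptions, since the declaration of `a_weight_hz` is not shown in the excerpt. The clamping matches the quoted linear scan, which yields 0 for values below the first entry and `A_WEIGHT_TABLE_SIZE - 2` when no later entry exceeds `f_hz`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return j in [0, n - 2] such that table[j] <= f < table[j + 1] for an
 * ascending table. Out-of-range values clamp to the first or last interval.
 */
static int sk_find_interval(const int32_t *table, int n, int32_t f)
{
	int lo = 0;
	int hi = n - 2;

	assert(n >= 2);
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (f < table[mid + 1])
			hi = mid;	/* answer is mid or lower */
		else
			lo = mid + 1;	/* answer is above mid */
	}
	return lo;
}
```

For a 36-entry table this takes at most six comparisons per lookup instead of up to 34, which matters little for code that runs once at initialization.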

This patch adds a new mfcc_vad module. It operates on the Mel log spectrum values produced by the MFCC component. The VAD is very simple and not very selective for voice vs. other signals, but the continuously updated background noise estimate prevents stationary noises from triggering the VAD.

The algorithm tracks a per-bin noise floor (instant-down, slow-rise) and computes an A-weighted energy delta. The weighting emphasizes speech frequencies. Speech is declared when the delta exceeds a threshold (0.35 in Q9.23), with a 20-frame hangover to prevent rapid toggling.

The VAD flag is inserted into the output stream as the first value after the magic header word in all format paths (S16, S24, S32).

A new Kconfig option CONFIG_COMP_MFCC_VAD (depends on COMP_MFCC, default y) gates compilation of the VAD code and the stream format change.

The README.txt file is updated to show how to run the example Python script sof_mel_to_text_live_dsp_vad.py, which uses the MFCC Mel spectrum data and the VAD flag stream as audio features for the Whisper speech-to-text model. The README formatting is changed to Markdown.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>

Copilot

Pull request overview

This PR introduces an optional MFCC Voice Activity Detection (VAD) feature that runs on the MFCC component’s Mel log spectrum and embeds a VAD flag into the MFCC/Mel output stream, along with updated host-side tuning/decoding tooling and documentation.

Changes:

- Add a new mfcc_vad module (state, initialization, per-frame update) and wire it into MFCC Mel-log-spectrum processing.
- Insert a per-frame VAD flag into the MFCC output stream immediately after the magic header word (gated by a new Kconfig option).
- Update tuning tools/documentation: add a live DSP-VAD-triggered Whisper transcription script, migrate README to Markdown, and extend decode_mel.m to extract VAD.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/include/sof/audio/mfcc/mfcc_vad.h | New public header for VAD state + API and tuning constants |
| src/audio/mfcc/mfcc_vad.c | New VAD implementation (noise floor tracking + weighted energy delta + hangover) |
| src/include/sof/audio/mfcc/mfcc_comp.h | Extend MFCC component state to carry VAD state and output bookkeeping |
| src/audio/mfcc/mfcc_common.c | Run VAD during Mel processing and emit VAD flag in stream output |
| src/audio/mfcc/mfcc_setup.c | Initialize/free VAD resources during MFCC setup/teardown |
| src/audio/mfcc/Kconfig | Add `CONFIG_COMP_MFCC_VAD` option controlling build + format change |
| src/audio/mfcc/CMakeLists.txt | Conditionally compile `mfcc_vad.c` |
| src/arch/host/configs/library_defconfig | Enable VAD in host library defconfig |
| src/audio/mfcc/tune/sof_mel_to_text_live_dsp_vad.py | New live capture + Whisper transcription tool using DSP-embedded VAD |
| src/audio/mfcc/tune/README.md | New Markdown documentation (replaces README.txt) |
| src/audio/mfcc/tune/decode_mel.m | Extend Mel decoder to parse VAD flag and plot it |

Comments suppressed due to low confidence (1)

src/audio/mfcc/mfcc_common.c:297

vad_pending is only set for state->mel_only. If VAD is meant to be emitted for all MFCC output frames (including cepstral output), this needs to be set for the non-mel_only path too; otherwise, please update docs to state the VAD flag is only present in Mel-log-spectrum output streams.

		if (state->mel_only) {
			state->out_data_ptr = state->mel_spectra->data;
#ifdef CONFIG_COMP_MFCC_VAD
			state->vad_pending = true;
#endif

+#define MFCC_VAD_NOISE_INIT_FRAMES	100
+
+/**
+ * \brief Slow noise floor rise coefficient in Q1.15 (0.0010 * 32768 = 3).
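The arithmetic in the comment above can be checked with a small conversion helper. This is a sketch only, and `sk_to_q` is a hypothetical name that is not part of the patch. With 15 fractional bits, 0.0010 scales to about 32.8 (33 after rounding), while an integer coefficient of 3 corresponds to roughly 0.0001. The excerpt is cut off before the macro value, so which of the two the patch intends is not visible here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a non-negative real value to a fixed-point integer with
 * frac_bits fractional bits, rounding to nearest. The caller keeps x
 * within the range of the target format.
 */
static int32_t sk_to_q(double x, int frac_bits)
{
	assert(x >= 0.0 && frac_bits >= 0 && frac_bits < 31);
	return (int32_t)(x * (double)(1 << frac_bits) + 0.5);
}
```

The same helper gives the Q9.23 thresholds mentioned in this PR: `sk_to_q(0.30, 23)` is 2516582 and `sk_to_q(0.35, 23)` is 2936013.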


+config COMP_MFCC_VAD
+	bool "MFCC Voice Activity Detection"
+	depends on COMP_MFCC
+	default y
+	help


+                # --- Speech buffering logic ---
+                if speech:
+                    speech_buffer.append(mel.copy())
+                    silence_counter = 0
+                    was_speaking = True


+#ifdef CONFIG_COMP_MFCC_VAD
+			/* Run VAD on the mel log spectrum before further processing */
+			state->vad_flag = mfcc_vad_update(&cd->vad, state->mel_log_32);
+#endif


+#ifdef CONFIG_COMP_MFCC_VAD
+	ret = mfcc_vad_init(&cd->vad, config->num_mel_bins, sample_rate, mod);
+	if (ret < 0) {
+		comp_err(dev, "Failed VAD init");
+		goto free_lifter;


 % Last frame can be incomplete due to span over multiple periods
 last = idx(end) + num_mel - 1;
 if (last > length(data))
    num_frames = num_frames - 1;
 end


+% VAD flag is first int32 after magic, followed by num_mel coefficients
+payload_len = 1 + num_mel;
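The frame layout stated in the comment above (magic word, then the VAD flag, then `num_mel` coefficients) could be parsed on the host along these lines. This is a sketch that assumes the S32 path with one 32-bit word per value. The magic value is passed in as a parameter because it is not shown in the excerpt, and the `sk_` names are hypothetical. There is no protection against a coefficient that happens to equal the magic word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sk_mel_frame {
	bool vad;		/* VAD flag, first word after the magic */
	const int32_t *mel;	/* num_mel coefficients following the flag */
};

/*
 * Scan data[start..len) for the next complete frame. Returns the index
 * just past that frame, or 0 if no complete frame remains (an incomplete
 * trailing frame is skipped).
 */
static size_t sk_next_frame(const int32_t *data, size_t len, size_t start,
			    int32_t magic, size_t num_mel,
			    struct sk_mel_frame *out)
{
	size_t payload_len = 1 + num_mel;	/* flag + coefficients */
	size_t i;

	assert(data && out);
	for (i = start; i + payload_len < len; i++) {
		if (data[i] != magic)
			continue;
		out->vad = data[i + 1] != 0;
		out->mel = &data[i + 2];
		return i + 1 + payload_len;
	}
	return 0;
}
```

A caller would loop, feeding the returned index back in as `start`, until the function returns 0.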



+    print(f"Whisper model: {model_path} (encoder: {encoder_device}, decoder: {decoder_device})")
+    print()
+
+    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


singalsu · 2026-05-19T11:38:02Z

I think I'll remove the CONFIG_COMP_MFCC_VAD option and always build the VAD. Then it's simpler to make the VAD flag a permanent part of the magic header. The configuration blob for Mel mode can enable computing it, while in MFCC mode it will be zeros unless it is also enabled there with the blob. Then the parsing scripts can always use the same data format.

singalsu · 2026-05-19T15:03:23Z

Adding more features --> draft

lyakh reviewed May 18, 2026

singalsu force-pushed the mfcc_add_vad branch from e4d8190 to 03c5b4d May 19, 2026 10:31

singalsu marked this pull request as ready for review May 19, 2026 11:11

Copilot AI review requested due to automatic review settings May 19, 2026 11:11

singalsu requested review from dbaluta, kv2019i, lbetlej, lgirdwood, mmaka1 and plbossart as code owners May 19, 2026 11:11

Copilot started reviewing on behalf of singalsu May 19, 2026 11:12

Copilot AI reviewed May 19, 2026

singalsu marked this pull request as draft May 19, 2026 15:02

singalsu wants to merge 1 commit into thesofproject:main from singalsu:mfcc_add_vad

3 participants
