Audio: MFCC: Add Voice Activity Detection based on Mel spectrum #10782
singalsu wants to merge 1 commit into
This is still WIP. I'd like to add a better audio feature header to the fake PCM stream. Successive PRs should start to use the compress PCM type for the MFCC output data. The MFCC config blob could then enable discontinuous data in VAD mode, e.g. background noise Mel spectrum values once per second and, while speech is detected, values at the FFT hop rate, e.g. every 10 ms.
/* Find j such that a_weight_hz[j] <= f_hz < a_weight_hz[j+1] */
for (j = 0; j < A_WEIGHT_TABLE_SIZE - 2; j++) {
	if (f_hz < a_weight_hz[j + 1])
		break;

Can this be implemented with some binary search function? It's a very small table (36 values) and this is initialization time code, not hot.
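For reference, a binary search over the interval table could look roughly like the sketch below. It is not code from the PR: `ex_hz[]` is a short hypothetical stand-in for the 36-entry `a_weight_hz[]` table, and `ex_find_interval()` is an illustrative name. It returns the same index as the linear loop above, including the clamping at both ends of the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened frequency table (ascending order is assumed). */
static const int32_t ex_hz[] = { 10, 20, 40, 80, 160, 315, 630, 1250 };
#define EX_TABLE_SIZE (sizeof(ex_hz) / sizeof(ex_hz[0]))

/*
 * Return j in [0, size - 2] such that table[j] <= f_hz < table[j + 1].
 * Frequencies below the first or above the last entry are clamped to the
 * first or last interval. The table must have at least two entries.
 */
static size_t ex_find_interval(const int32_t *table, size_t size, int32_t f_hz)
{
	size_t lo = 0;
	size_t hi = size - 1;	/* the result j always satisfies lo <= j < hi */

	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (f_hz < table[mid])
			hi = mid;
		else
			lo = mid;
	}

	return lo;
}
```

For a 36-entry table this needs about six comparisons instead of up to 34, which only matters if the lookup ever moves out of initialization code.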
This patch adds a new mfcc_vad module. It operates on the Mel log spectrum values produced by the MFCC component. The VAD is very simple and not very selective for voice vs. other signals. However, the continuously updated background noise estimate prevents stationary noises from triggering the VAD.

The algorithm tracks a per-bin noise floor (instant-down, slow-rise) and computes an A-weighted energy delta. The weighting emphasizes speech frequencies. Speech is declared when the delta exceeds a threshold (0.35 in Q9.23), with a 20-frame hangover to prevent rapid toggling.

The VAD flag is inserted into the output stream as the first value after the magic header word in all format paths (S16, S24, S32).

A new Kconfig option CONFIG_COMP_MFCC_VAD (depends on COMP_MFCC, default y) gates compilation of the VAD code and the stream format change.

The README.txt file is updated to show how to run the example Python script sof_mel_to_text_live_dsp_vad.py. The script uses the MFCC Mel spectrum data and the VAD flag stream as audio features for the Whisper speech-to-text model. The README is converted to Markdown.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
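The decision logic described in the commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in mfcc_vad.c: the `ex_` names, the bin count, the rise coefficient, the per-bin weights, and the order of the floor update versus the delta computation are all hypothetical. Only the threshold (0.35 in Q9.23) and the 20-frame hangover are taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_NUM_BINS		4	/* hypothetical; the real count comes from the MFCC config */
#define EX_THRESHOLD_Q23	2936013	/* 0.35 in Q9.23 */
#define EX_HANGOVER_FRAMES	20
#define EX_RISE_Q15		33	/* hypothetical slow-rise coefficient in Q1.15 */

struct ex_vad {
	int32_t noise_floor[EX_NUM_BINS];	/* per-bin noise estimate, Q9.23 */
	int16_t weight[EX_NUM_BINS];		/* per-bin weights, Q1.15 (assumed) */
	int hangover;				/* frames left before the flag drops */
};

/* Process one frame of Mel log values (Q9.23) and return the VAD flag. */
static bool ex_vad_update(struct ex_vad *v, const int32_t *mel_log)
{
	int64_t delta = 0;
	int i;

	for (i = 0; i < EX_NUM_BINS; i++) {
		int32_t diff = mel_log[i] - v->noise_floor[i];

		if (diff < 0) {
			/* Instant-down: follow a quieter input immediately. */
			v->noise_floor[i] = mel_log[i];
			diff = 0;
		} else {
			/* Slow-rise: move a small fraction toward the input. */
			v->noise_floor[i] += (int32_t)(((int64_t)diff * EX_RISE_Q15) >> 15);
		}

		/* Weighted sum of the excess over the noise floor. */
		delta += ((int64_t)diff * v->weight[i]) >> 15;
	}

	if (delta > EX_THRESHOLD_Q23)
		v->hangover = EX_HANGOVER_FRAMES;	/* speech: re-arm the hangover */
	else if (v->hangover > 0)
		v->hangover--;				/* keep the flag up for a while */

	return v->hangover > 0;
}
```

In this sketch the flag stays set for 19 further frames after a single frame above the threshold; the exact hangover accounting in the real module may differ.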
Pull request overview
This PR introduces an optional MFCC Voice Activity Detection (VAD) feature that runs on the MFCC component’s Mel log spectrum and embeds a VAD flag into the MFCC/Mel output stream, along with updated host-side tuning/decoding tooling and documentation.
Changes:
- Add a new `mfcc_vad` module (state, initialization, per-frame update) and wire it into MFCC Mel-log-spectrum processing.
- Insert a per-frame VAD flag into the MFCC output stream immediately after the magic header word (gated by a new Kconfig option).
- Update tuning tools/documentation: add a live DSP-VAD-triggered Whisper transcription script, migrate README to Markdown, and extend `decode_mel.m` to extract VAD.
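To illustrate the stream layout summarized above (magic word, then the VAD flag, then the Mel values), a host-side parser for 32-bit words could look roughly like this. It is a sketch under assumptions: `EX_MAGIC` is a made-up placeholder for the real magic word, `ex_find_frame()` is a hypothetical helper, and only a stream of 32-bit words is covered.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic word; the real value is defined by the MFCC component. */
#define EX_MAGIC 0x4d454c31

/*
 * Search words[start..n-1] for the next magic word that is followed by a
 * complete payload (1 VAD flag + num_mel values). Returns the index of the
 * magic word, or -1 if no complete frame is found.
 */
static ptrdiff_t ex_find_frame(const int32_t *words, size_t n, size_t start,
			       size_t num_mel)
{
	size_t payload_len = 1 + num_mel;
	size_t i;

	/* i + payload_len < n guarantees the whole frame fits in the buffer. */
	for (i = start; i + payload_len < n; i++) {
		if (words[i] == EX_MAGIC)
			return (ptrdiff_t)i;
	}

	return -1;
}
```

With a returned index `idx`, the VAD flag is `words[idx + 1]` and the Mel values start at `words[idx + 2]`. An incomplete last frame is skipped, similar to the check in decode_mel.m.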
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/include/sof/audio/mfcc/mfcc_vad.h | New public header for VAD state + API and tuning constants |
| src/audio/mfcc/mfcc_vad.c | New VAD implementation (noise floor tracking + weighted energy delta + hangover) |
| src/include/sof/audio/mfcc/mfcc_comp.h | Extend MFCC component state to carry VAD state and output bookkeeping |
| src/audio/mfcc/mfcc_common.c | Run VAD during Mel processing and emit VAD flag in stream output |
| src/audio/mfcc/mfcc_setup.c | Initialize/free VAD resources during MFCC setup/teardown |
| src/audio/mfcc/Kconfig | Add CONFIG_COMP_MFCC_VAD option controlling build + format change |
| src/audio/mfcc/CMakeLists.txt | Conditionally compile mfcc_vad.c |
| src/arch/host/configs/library_defconfig | Enable VAD in host library defconfig |
| src/audio/mfcc/tune/sof_mel_to_text_live_dsp_vad.py | New live capture + Whisper transcription tool using DSP-embedded VAD |
| src/audio/mfcc/tune/README.md | New Markdown documentation (replaces README.txt) |
| src/audio/mfcc/tune/decode_mel.m | Extend Mel decoder to parse VAD flag and plot it |
Comments suppressed due to low confidence (1)
src/audio/mfcc/mfcc_common.c:297
`vad_pending` is only set for `state->mel_only`. If VAD is meant to be emitted for all MFCC output frames (including cepstral output), this needs to be set for the non-mel_only path too; otherwise, please update docs to state the VAD flag is only present in Mel-log-spectrum output streams.
if (state->mel_only) {
	state->out_data_ptr = state->mel_spectra->data;
#ifdef CONFIG_COMP_MFCC_VAD
	state->vad_pending = true;
#endif

#define MFCC_VAD_NOISE_INIT_FRAMES 100

/**
 * \brief Slow noise floor rise coefficient in Q1.15 (0.0010 * 32768 = 3).
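
The Q-format constants quoted in this thread can be sanity-checked with a few lines of C. The helper `ex_to_q()` below is hypothetical and not part of the PR; it only shows round-to-nearest conversion of a real value to a fixed-point integer with a given number of fractional bits. With this rounding, 0.0010 in Q1.15 comes out as 33 and 0.0001 as 3, so the two numbers in the quoted comment may not agree with each other; 0.35 in Q9.23 is 2936013.

```c
#include <stdint.h>

/*
 * Convert a real value to fixed point with frac_bits fractional bits,
 * rounding to nearest. Assumes frac_bits < 31 and that the result fits
 * in 32 bits.
 */
static int32_t ex_to_q(double x, int frac_bits)
{
	double scaled = x * (double)((int64_t)1 << frac_bits);

	return (int32_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}
```

For example, `ex_to_q(0.35, 23)` gives the Q9.23 threshold and `ex_to_q(0.0010, 15)` the Q1.15 value of a 0.001 rise coefficient.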

config COMP_MFCC_VAD
	bool "MFCC Voice Activity Detection"
	depends on COMP_MFCC
	default y
	help

# --- Speech buffering logic ---
if speech:
    speech_buffer.append(mel.copy())
    silence_counter = 0
    was_speaking = True

#ifdef CONFIG_COMP_MFCC_VAD
	/* Run VAD on the mel log spectrum before further processing */
	state->vad_flag = mfcc_vad_update(&cd->vad, state->mel_log_32);
#endif

#ifdef CONFIG_COMP_MFCC_VAD
	ret = mfcc_vad_init(&cd->vad, config->num_mel_bins, sample_rate, mod);
	if (ret < 0) {
		comp_err(dev, "Failed VAD init");
		goto free_lifter;

% Last frame can be incomplete due to span over multiple periods
last = idx(end) + num_mel - 1;
if (last > length(data))
	num_frames = num_frames - 1;
end

% VAD flag is first int32 after magic, followed by num_mel coefficients
payload_len = 1 + num_mel;

print(f"Whisper model: {model_path} (encoder: {encoder_device}, decoder: {decoder_device})")
print()

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

I think I'll remove CONFIG_COMP_MFCC_VAD and always build the VAD. Then it's simpler to make the VAD flag a permanent part of the magic header. The configuration blob for Mel mode can enable computing it, while in MFCC mode it will be zeros unless it is also enabled there with the blob. Then the parsing scripts can always use the same data format.

Adding more features --> draft
This patch adds a new mfcc_vad module that implements VAD operating on the Mel log spectrum values produced by the MFCC component. The VAD is very simple and not very selective for voice vs. other signals. However, the continuously updated background noise estimate prevents stationary noises from triggering the VAD.
The algorithm tracks a per-bin noise floor (instant-down, slow-rise) and computes an A-weighted energy delta. The weighting emphasizes speech frequencies. Speech is declared when the delta exceeds a threshold (0.30 in Q9.23), with a 20-frame hangover to prevent rapid toggling.
The VAD flag is inserted into the output stream as the first value after the magic header word in all format paths (S16, S24, S32).
A new Kconfig option CONFIG_COMP_MFCC_VAD (depends on COMP_MFCC, default n) gates compilation of the VAD code and the stream format change.