transcrypt

Experimental tool for streaming audio transcription over encrypted channels. The server receives compressed audio from clients over a Noise-encrypted (XX pattern), multiplexed transport and optionally transcribes it in real time using antirez/voxtral.c, a pure C inference implementation of the Mistral Voxtral Realtime 4B speech-to-text model.

Getting started

Clone and initialize the submodule:

git clone --recurse-submodules https://github.com/tetsuo/transcrypt
cd transcrypt

If you already cloned without --recurse-submodules:

git submodule update --init

Install system dependencies:

macOS (Homebrew):

brew install ffmpeg
xcode-select --install   # Clang + xxd, skip if already installed

Download the voxtral.c model. Inside the voxtral.c directory, run this script to download the model files (about 8.5 GB):

./download_model.sh

Build libvoxtral.a:

./build_voxtral_lib.sh

Build transcrypt binaries:

go build ./cmd/transcrypt-server
go build ./cmd/transcrypt-client

Usage

Server listens on :9000 by default:

./transcrypt-server [-model <model-dir>] [<addr>]

Client:

./transcrypt-client <addr> <url> [url ...]

Example:

# terminal 1: server with transcription
./transcrypt-server -model ./voxtral.c/voxtral-model

# terminal 2: client streams two audio streams in parallel (English news from BBC and Dutch radio)
./transcrypt-client localhost:9000 \
  "http://stream.live.vc.bbcmedia.co.uk/bbc_world_service" \
  "https://icecast.omroep.nl/radio1-bb-mp3"

The server prints the transcript of one audio stream (the first channel to arrive) to stderr as tokens arrive, and writes both streams to disk.

Without -model the server is pure transport: every channel is written to disk as received, still compressed, and no transcription is attempted.

Server should output something like this:

Loading voxtral model...
Model loaded.
listening on [::]:9000
[97d3e5a50d29067f] connected
[97d3e5a50d29067f] ch=2 → 97d3e5a50d29067f_c747deed144dd11a.mp3
[97d3e5a50d29067f] ch=2 transcribing...
[97d3e5a50d29067f] ch=1 → 97d3e5a50d29067f_d716d1989eb1f395.mp3
 Eerst om op een lijst te staan van UNESCO. Als wereld erfgoed. Zover dat ik weet. Dat brengt geen toeristen mee op Curaçao. Is cultuur alleen maar in geld te meten? Nee, cultuur is niet alleen in geld te meten. En wat levert het op voor die eiland? En als je cultuur alleen maar meet in stenen

And the client:

connected to 6a8aa9e4e19f99f7
[d716d198] ch=1 http://stream.live.vc.bbcmedia.co.uk/bbc_world_service
[c747deed] ch=2 https://icecast.omroep.nl/radio1-bb-mp3

The two streams are saved to disk as 97d3e5a50d29067f_c747deed144dd11a.mp3 and 97d3e5a50d29067f_d716d1989eb1f395.mp3 (filenames encode the remote peer and the URL).

How it works

The client dials the server, completes a Noise XX handshake (ephemeral + static keys, both sides authenticated), and opens one multiplexed channel per audio URL.

Each channel streams compressed audio bytes as they are fetched over HTTP. The server saves every channel to disk.

Output filenames encode the remote peer and the URL:

{rpk}_{sha256(url)[:8]}.{ext}

rpk is the first 8 bytes of the remote static public key, hex-encoded. ext is derived from the Content-Type header (mp3, aac, ogg, opus, flac, wav) and falls back to .raw when the header is absent or unrecognised.
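As an illustration of the naming scheme, here is a small sketch. The function name outputName and the MIME-type table are assumptions made for the example; the repository's actual mapping from Content-Type to extension may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"mime"
)

// extByType maps media types to file extensions.
// The exact MIME strings are an assumption, not taken from the project.
var extByType = map[string]string{
	"audio/mpeg": "mp3",
	"audio/aac":  "aac",
	"audio/ogg":  "ogg",
	"audio/opus": "opus",
	"audio/flac": "flac",
	"audio/wav":  "wav",
}

// outputName builds {rpk}_{sha256(url)[:8]}.{ext} from the remote static
// public key, the stream URL, and the Content-Type header value.
func outputName(remoteStatic []byte, url, contentType string) string {
	rpk := remoteStatic
	if len(rpk) > 8 {
		rpk = rpk[:8] // first 8 bytes of the key
	}
	sum := sha256.Sum256([]byte(url))

	ext := "raw" // fallback when the header is absent or unrecognised
	if mt, _, err := mime.ParseMediaType(contentType); err == nil {
		if e, ok := extByType[mt]; ok {
			ext = e
		}
	}
	return fmt.Sprintf("%s_%s.%s",
		hex.EncodeToString(rpk), hex.EncodeToString(sum[:8]), ext)
}

func main() {
	key := make([]byte, 32) // placeholder key of all zero bytes
	fmt.Println(outputName(key, "https://icecast.omroep.nl/radio1-bb-mp3", "audio/mpeg"))
	fmt.Println(outputName(key, "https://icecast.omroep.nl/radio1-bb-mp3", ""))
}
```

Because both parts are 8 bytes rendered as hex, each is 16 characters long, which matches the filenames in the sample output above.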

If a model is loaded (with -model), the first channel to arrive is picked up for live transcription. The server decodes the compressed audio to 16 kHz mono float32 PCM on the C side using libavformat/libavcodec/libswresample, then feeds the PCM frames to voxtral for transcription. The transcript is printed to stderr as tokens arrive.
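One way to give the single transcriber to the first channel that arrives is an atomic compare-and-swap, sketched below. The type transcriptionSlot and its claim method are hypothetical names for this example. Whether the real server frees the transcriber when that channel closes is not stated here, so the sketch never releases it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// transcriptionSlot hands the single transcriber to whichever channel asks first.
type transcriptionSlot struct {
	taken atomic.Bool
}

// claim returns true for exactly one caller, even under concurrent use.
func (s *transcriptionSlot) claim() bool {
	return s.taken.CompareAndSwap(false, true)
}

func main() {
	var slot transcriptionSlot
	var wg sync.WaitGroup
	var winners atomic.Int32

	// Four channels arrive concurrently; only one is transcribed.
	for ch := 1; ch <= 4; ch++ {
		wg.Add(1)
		go func(ch int) {
			defer wg.Done()
			if slot.claim() {
				winners.Add(1)
				fmt.Printf("ch=%d transcribing...\n", ch)
			}
		}(ch)
	}
	wg.Wait()
	fmt.Println("channels transcribing:", winners.Load())
}
```

Which channel wins depends on scheduling, as in the sample server output above, where ch=2 was picked up before ch=1.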

About

Real time speech-to-text over encrypted channels
