Liquid AI Unveils LFM2-Audio-1.5B for Low-Latency Real-Time Assistants
Liquid AI has unveiled LFM2-Audio-1.5B, a compact audio-language foundation model designed for low-latency, real-time assistants on resource-constrained devices. The model matches or beats Whisper-large-v3-turbo's ASR word error rates (WERs) on several datasets and offers two generation modes: interleaved, where text and audio tokens share one output stream, and sequential, where each modality is produced as a complete block.
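The practical difference between the two modes is when the first audio reaches the listener. The Python sketch below is purely illustrative: the `Step` type, the stream contents, and the helper names are hypothetical stand-ins, not the released inference API.

```python
# Illustrative sketch only: this fakes a tagged autoregressive decode stream
# to contrast interleaved vs. sequential consumption of text/audio tokens.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Step:
    modality: str   # "text" or "audio" (hypothetical tagging)
    payload: str    # a text piece, or a stand-in for discrete audio codes

def fake_stream() -> Iterator[Step]:
    """Stand-in for a decode loop that tags each emitted step."""
    yield Step("text", "Sure,")
    yield Step("audio", "<codes 0..7>")
    yield Step("text", " here you go.")
    yield Step("audio", "<codes 8..15>")

def run_interleaved() -> None:
    # Interleaved mode: text and audio steps share one stream, so the
    # first audio codes can reach a vocoder before the reply is finished.
    for step in fake_stream():
        print(f"[{step.modality}] {step.payload}")

def run_sequential() -> None:
    # Sequential mode: emit the full text turn first, then all audio.
    # Simpler to consume, but the first audio arrives later.
    steps: List[Step] = list(fake_stream())
    for s in steps:
        if s.modality == "text":
            print(f"[text] {s.payload}")
    for s in steps:
        if s.modality == "audio":
            print(f"[audio] {s.payload}")

if __name__ == "__main__":
    run_interleaved()
    run_sequential()
```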
LFM2-Audio extends Liquid AI's 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens, so the model understands and generates speech and text through a single end-to-end stack. The released checkpoint pairs that backbone with a FastConformer audio encoder and an RQ-Transformer audio decoder, and it disentangles the two audio representations: inputs enter as continuous embeddings, while outputs are generated as discrete audio codes. Liquid AI reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response. On VoiceBench, LFM2-Audio-1.5B scores 56.78 overall, with per-task numbers disclosed in the blog's chart.
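A minimal data-flow sketch may help make that split concrete. The module names below mirror the components Liquid AI lists (a FastConformer-style encoder feeding continuous embeddings in, an LFM2-style backbone, RQ-Transformer-style heads emitting discrete codes out), but every dimension and wiring choice here is an assumption for illustration, not the released model.

```python
# Hypothetical data-flow sketch with made-up dimensions; not the real weights.
import torch
import torch.nn as nn

class AudioLMSketch(nn.Module):
    def __init__(self, d_model=512, vocab=32768, codebooks=8, code_vocab=2048):
        super().__init__()
        # Input side: continuous embeddings from an audio encoder
        # (FastConformer in the release) projected into the LM width.
        self.audio_proj = nn.Linear(256, d_model)
        self.text_embed = nn.Embedding(vocab, d_model)
        # Backbone: a small Transformer standing in for the LFM2 LM.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Output side: one head per residual codebook, approximating an
        # RQ-Transformer-style decoder that emits discrete audio codes.
        self.code_heads = nn.ModuleList(
            [nn.Linear(d_model, code_vocab) for _ in range(codebooks)]
        )
        self.text_head = nn.Linear(d_model, vocab)

    def forward(self, audio_feats, text_ids):
        # Audio enters as continuous embeddings; text as token embeddings.
        seq = torch.cat(
            [self.audio_proj(audio_feats), self.text_embed(text_ids)], dim=1
        )
        h = self.backbone(seq)
        # Audio leaves as logits over discrete codes; text as token logits.
        code_logits = [head(h) for head in self.code_heads]
        return self.text_head(h), code_logits

model = AudioLMSketch()
audio = torch.randn(1, 100, 256)           # 100 frames of encoder features
text = torch.randint(0, 32768, (1, 12))    # 12 text tokens
text_logits, code_logits = model(audio, text)
print(text_logits.shape, code_logits[0].shape)
```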
With competitive ASR accuracy, reported sub-100 ms first-audio latency, and a 1.5B-parameter footprint suited to resource-constrained devices, LFM2-Audio-1.5B makes a single end-to-end stack a practical option for real-time voice assistants. Treating audio and text as first-class sequence tokens lets one compact checkpoint both understand and generate speech and text without chaining separate models for each stage.