PhoenixCodec: Taming Neural Speech Coding for Extreme Low-Resource Scenarios



Zixiang Wan★†    Haoran Zhao‡†    Guochang Zhang    Runqiang Han
Jianqiang Wei‡*    Yuexian Zou★*

 Audio Innovation Technology Department, Anker Inc, Beijing, China
 Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology,
Peking University, Shenzhen, China

† Equal contribution.   * Corresponding authors.



Abstract

This paper presents PhoenixCodec, a comprehensive neural speech coding and decoding framework designed for extremely low-resource conditions. The proposed system integrates an optimized asymmetric frequency-time architecture, a Cyclical Calibration and Refinement (CCR) training strategy, and a noise-invariant fine-tuning procedure. Under stringent constraints (computation below 700 MFLOPs, latency below 30 ms, and dual-rate support at 1 kbps and 6 kbps), existing methods face a trade-off between efficiency and quality. PhoenixCodec addresses these challenges by alleviating the resource scattering of conventional decoders, employing CCR to escape local optima, and enhancing robustness through fine-tuning on noisy samples. In the LRAC 2025 Challenge Track 1, the proposed system ranked third overall and achieved the best 1 kbps performance in both the real-world noise and reverberation test and the clean-speech intelligibility test, confirming its effectiveness.

Model Architecture

We propose an end-to-end audio codec that fuses frequency-domain analysis with time-domain synthesis, achieving high-quality speech transmission under strict resource constraints. The overall architecture, illustrated in Figure 1, consists of a frequency-domain encoder, a residual vector quantizer (RVQ), and a time-domain decoder. The input audio is first transformed into an amplitude spectrogram via a short-time Fourier transform (STFT).

The frequency-domain encoder, built upon SpecTokenizer, employs a complex convolution layer followed by four cascaded FdownBlocks and RNNBlocks to extract and compress spectral features. Each FdownBlock combines a 2D convolution with Snake2D activation to enhance harmonic-structure modeling, while each RNNBlock integrates FLNorm, Tanh, GRU, 2D convolution, and Snake2D activation with residual connections to maintain stable gradient flow and preserve feature fidelity.
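As a concrete illustration, the following PyTorch sketch shows a Snake-activated frequency-downsampling block in the spirit of the FdownBlock described above. The Snake formulation x + (1/α)·sin²(αx) is the standard one; the kernel size, stride, and channel arguments are placeholders rather than the exact configuration used in our system.

```python
import torch
import torch.nn as nn

class Snake2D(nn.Module):
    """Snake activation x + (1/a) * sin^2(a * x) with a learnable per-channel
    parameter a, applied to (batch, channels, freq, time) features."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2

class FDownBlock(nn.Module):
    """2D convolution that strides along the frequency axis only, followed by
    Snake2D. Kernel/stride/channel values are illustrative placeholders, not
    the exact challenge configuration."""
    def __init__(self, in_ch: int, out_ch: int, f_kernel: int = 5, f_stride: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch,
                              kernel_size=(f_kernel, 1),
                              stride=(f_stride, 1),
                              padding=(f_kernel // 2, 0))
        self.act = Snake2D(out_ch)

    def forward(self, x):  # x: (batch, channels, freq, time)
        return self.act(self.conv(x))
```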

The latent representation is subsequently quantized by the RVQ module and passed to a BigCodec-based time-domain decoder. This decoder comprises a 1D convolution, a unidirectional LSTM with residual connections, four sequential DecoderBlocks, Snake1D activation, an output 1D convolution, and Tanh activation. Each DecoderBlock contains Snake1D activation, a 1D transposed convolution for upsampling, and several ResidualBlocks. Each ResidualBlock consists of two 1D convolutions with different kernel sizes and Snake1D activations, coupled with a residual connection at the end, thereby improving high-frequency detail restoration and spatial perceptual quality in waveform reconstruction.
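A minimal sketch of such a ResidualBlock is given below: two 1D convolutions with different kernel sizes, each preceded by a Snake1D activation, plus a final skip connection. The kernel sizes shown are illustrative assumptions, and causal versus non-causal padding is omitted for brevity.

```python
import torch
import torch.nn as nn

class Snake1D(nn.Module):
    """Per-channel Snake activation for (batch, channels, time) features."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2

class ResidualBlock(nn.Module):
    """Two 1D convolutions with different kernel sizes, each preceded by a
    Snake1D activation, with a skip connection at the end. Kernel sizes are
    illustrative; padding handling is simplified."""
    def __init__(self, channels: int, k1: int = 7, k2: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            Snake1D(channels),
            nn.Conv1d(channels, channels, k1, padding=k1 // 2),
            Snake1D(channels),
            nn.Conv1d(channels, channels, k2, padding=k2 // 2),
        )

    def forward(self, x):  # x: (batch, channels, time)
        return x + self.net(x)
```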

Model training adopts a multi-objective loss function, including multi-scale mel-spectrogram loss, VQ quantization loss, and GAN-based adversarial loss. During adversarial training, a Multi-Period Discriminator (MPD) and Multi-Resolution Discriminator (MRD) are employed jointly to constrain both time-domain details and spectral textures, significantly enhancing mid-to-high frequency energy reproduction and naturalness. As a result, the proposed system delivers high-fidelity speech reconstruction that combines audio quality and intelligibility under low-latency and low-bitrate conditions.
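For reference, a hedged sketch of the multi-scale mel-spectrogram term is shown below; the STFT resolutions, mel-band count, and equal L1 weighting are assumptions, not the exact loss configuration used in training.

```python
import torch
import torchaudio

def multi_scale_mel_loss(pred, target, sample_rate=24000,
                         n_ffts=(256, 512, 1024, 2048), n_mels=64):
    """L1 distance between log-mel spectrograms computed at several STFT
    resolutions. The resolutions, mel-band count, and equal weighting are
    assumptions; transforms are rebuilt per call only for brevity."""
    loss = 0.0
    for n_fft in n_ffts:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=n_mels).to(pred.device)
        loss = loss + torch.mean(
            torch.abs(torch.log(mel(pred) + 1e-5) - torch.log(mel(target) + 1e-5)))
    return loss / len(n_ffts)
```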

Figure 1. Framework of the proposed model.

Experiments

Datasets

All training data in this study are sourced from the official LRAC 2025 dataset and were rigorously filtered and preprocessed before use. For noise data, labels were predicted with a pre-trained audio understanding model, and any non-pure noise samples containing speech were removed to ensure clean noise content. For reverberation data, room impulse responses (RIRs) were truncated before convolution, retaining only the 1 ms segment following the peak; this reduces the long-tail decay that can impair speech clarity while preserving spatial characteristics.
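A minimal sketch of this RIR truncation step is shown below, under the assumption that the direct path up to the peak is kept together with the short post-peak segment.

```python
import numpy as np

def truncate_rir(rir: np.ndarray, sample_rate: int, keep_ms: float = 1.0) -> np.ndarray:
    """Keep the RIR up to its peak plus `keep_ms` of post-peak decay,
    discarding the long reverberant tail. Keeping the pre-peak direct path
    is an assumption about the exact truncation rule."""
    peak = int(np.argmax(np.abs(rir)))
    keep = int(round(keep_ms * 1e-3 * sample_rate))
    return rir[: peak + keep + 1]
```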

Based on this, we applied a data augmentation strategy by mixing clean, noisy, and reverberant speech in a 1:1:1 ratio. In noise mixing, the signal-to-noise ratio (SNR) was uniformly sampled within the range of 10–30 dB to increase acoustic diversity.
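The SNR-controlled mixing can be sketched as follows; the noise looping and power-based gain computation are standard choices, and the uniform 10–30 dB draw matches the range quoted above.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then
    add it to the clean signal."""
    noise = np.resize(noise, speech.shape)  # loop/trim the noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# SNR drawn uniformly from 10-30 dB for the noisy portion of the mixture
snr_db = np.random.uniform(10.0, 30.0)
```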

Model evaluation was conducted on an open test set from the same source, with inference performed directly on the original official data without additional processing, and performance tested at both 1 kbps and 6 kbps bitrates.

Implementation Details

The proposed model has an overall computational complexity of 698 MFLOPs and 1.48 M parameters, with the encoder and RVQ module accounting for 399 MFLOPs and 1.17 M parameters, and the decoder for 299 MFLOPs and 0.32 M parameters. The system operates at a sampling rate of 24 kHz, with a frame length of 720 samples and a frame shift of 288 samples (approximately an 83 Hz frame rate). In the STFT computation, only frequency bins 0–240 (0–8 kHz) are used, effectively yielding a 24 kHz to 16 kHz downsampling without introducing additional latency.
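The band-limited analysis can be sketched as follows; the Hann window and center=False framing are assumptions, but the frame length (720), hop (288), and retained bins (0–240) match the values above.

```python
import torch

FRAME_LEN, HOP, N_BINS = 720, 288, 241  # 24 kHz input; bins 0-240 span 0-8 kHz

def bandlimited_spectrogram(wav_24k: torch.Tensor) -> torch.Tensor:
    """STFT at 24 kHz, keeping only the 0-8 kHz bins so the model effectively
    sees 16 kHz content without an explicit (latency-adding) resampler. The
    Hann window and center=False framing are assumptions."""
    spec = torch.stft(wav_24k, n_fft=FRAME_LEN, hop_length=HOP,
                      win_length=FRAME_LEN,
                      window=torch.hann_window(FRAME_LEN),
                      center=False, return_complex=True)
    return spec[..., :N_BINS, :].abs()  # amplitude spectrogram, bins 0-240
```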

The encoder uses kernels and strides of 1 along the time axis, so it introduces no additional latency. The decoder primarily uses causal convolutions and causal transposed convolutions, but non-causal convolutions are applied at specific positions to improve reconstruction quality: the first convolution layer of the decoder (kernel = 1, stride = 1), the first convolution layer within the repeated ResidualBlocks (kernel sizes = [7, 9, 9, 11], stride = 1), and the final convolution layer of the decoder (kernel = 7, stride = 1). These choices significantly improve mid-to-high frequency detail within the latency budget. The end-to-end latency is determined jointly by the STFT window length and the non-causal convolutions, and is kept within 30 ms overall.
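A left-padded (causal) 1D convolution of the kind used in most decoder layers can be written as below; the non-causal layers instead use symmetric padding, and each contributes (kernel − 1)/2 samples of lookahead at its operating rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution padded only on the left, so the output at time t depends
    on inputs up to t and adds no lookahead latency."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# A non-causal convolution with symmetric padding looks ahead (kernel - 1) // 2
# samples; e.g. the final decoder convolution with kernel = 7 adds 3 samples.
```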

To convert the 16 kHz audio output of the decoder to 24 kHz without noticeably increasing latency, we use a fractional-rate resampling strategy. First, the signal is upsampled by a factor of three using zero-insertion. Next, the spectral images introduced by zero-insertion are removed with an 11th-order IIR Butterworth low-pass filter with an 8 kHz cutoff frequency. Finally, the signal is downsampled by a factor of two to reach the target sampling rate. Compared to an FIR-based approach, this IIR design exhibits a maximum passband group delay of only 8 samples near 8 kHz, making it well-suited for real-time applications. The latency breakdown is shown in Table 2.
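A hedged SciPy sketch of this fractional resampler is shown below (zero-insertion by 3, an 11th-order Butterworth low-pass at 8 kHz applied at the 48 kHz intermediate rate, then decimation by 2); the gain compensation and filter realization details are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def resample_16k_to_24k(x_16k: np.ndarray) -> np.ndarray:
    """Upsample by 3 via zero-insertion, low-pass with an 11th-order Butterworth
    at 8 kHz (applied at the 48 kHz intermediate rate), then decimate by 2 to
    reach 24 kHz. The gain compensation and filter realization are assumptions."""
    up = np.zeros(len(x_16k) * 3, dtype=np.float64)
    up[::3] = x_16k * 3.0  # compensate for the energy lost to zero-insertion
    sos = butter(11, 8000.0, btype="low", fs=48000.0, output="sos")
    return sosfilt(sos, up)[::2]  # 48 kHz -> 24 kHz
```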

Table 2. Latency breakdown of the proposed system.
Source                    | Samples                  | Notes
STFT hop size             | 192 @ 16 kHz             | Frame shift
Decoder residual units    | 277 @ 16 kHz             | 64 × 3 + 16 × 4 + 4 × 4 + 1 × 5
Final decoder convolution | 3 @ 16 kHz               | Kernel size = 7
Resampling delay          | 8 @ 24 kHz               | Maximum group delay of the IIR filter
Total                     | 716 @ 24 kHz (29.83 ms)  | (192 + 277 + 3) × 3/2 + 8

The RVQ module consists of six codebooks, each containing 4096 entries (12-bit indices) with a vector dimension of 8. During inference, either 1 codebook (for 1 kbps) or all 6 codebooks (for 6 kbps) can be selected, enabling operation at two different bitrates. The encoder channel configuration is [32, 32, 32, 128, 335], with time-axis kernel sizes and strides of [1, 1, 1, 1] and frequency-axis kernel sizes and strides of [5, 4, 4, 3]. The decoder channels are [117, 58, 29, 14, 7], with upsampling rates of [3, 4, 4, 4]. For the discriminators, the MPD uses periods [2, 3, 5, 7, 11], and the MRD operates with window sizes [128, 256, 512, 1024, 2048].
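The stated bitrates follow directly from these numbers, as the quick check below shows.

```python
# Bitrate sanity check from the numbers above: 24 kHz / 288-sample hop gives
# ~83.33 frames per second, and each 4096-entry codebook costs 12 bits.
frame_rate = 24000 / 288            # frames per second
bits_per_codebook = 12              # log2(4096)
for n_codebooks in (1, 6):
    bps = frame_rate * bits_per_codebook * n_codebooks
    print(f"{n_codebooks} codebook(s): {bps:.0f} bps")
# -> 1 codebook: 1000 bps (1 kbps); 6 codebooks: 6000 bps (6 kbps)
```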

For optimization, both the generator and the discriminators use an initial learning rate of 8 × 10⁻⁴ during the Mel stage and 1 × 10⁻⁴ during the GAN stage, gradually reduced to 1 × 10⁻⁵. Adam is used throughout all training stages.
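An illustrative optimizer setup consistent with these rates is shown below; the modules are hypothetical stand-ins, and the exponential decay toward 1 × 10⁻⁵ is an assumed schedule shape.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the actual generator/discriminator modules.
generator, discriminator = nn.Linear(8, 8), nn.Linear(8, 8)

# Mel stage: lr = 8e-4; GAN stage: lr = 1e-4, decayed toward 1e-5. The
# exponential decay is an assumed schedule shape, not the exact one used.
mel_opt = torch.optim.Adam(generator.parameters(), lr=8e-4)
gan_opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
gan_opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
gan_sched_g = torch.optim.lr_scheduler.ExponentialLR(gan_opt_g, gamma=0.999)
```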

Checkpoint Selection Strategy: For the system submission, we conducted subjective listening evaluations on checkpoints from different training stages using the open test set, and selected the one with the best combination of overall audio quality and fine-detail reproduction as the final competition version.

Performance Comparison

Table 3. Final results of the LRAC 2025 Challenge (Track 1).
Test type | Clean speech    | Real-world light noise and reverb | Simultaneous talkers | Intelligibility in clean | Aggregate score
Scale     | MUSHRA [0, 100] | DMOS [1, 5]                       | DMOS [1, 5]          | DRT score [-100, 100]    | Weighted sum of normalized test mean scores [0, 100]
Weight    | 20%             | 20%                               | 5%                   | 10%                      | 100%

System         | Clean ULBR | Clean LBR | Noise+reverb ULBR | Noise+reverb LBR | Sim. talkers ULBR | Sim. talkers LBR | Intelligibility ULBR | Final score | Overall rank
teamwzqaq      | 62.65      | 81.75     | 3.02              | 4.44             | 2.82              | 4.35             | 85.43                | 71.91       | 1
nano-codec     | 59.23      | 81.17     | 3.13              | 4.44             | 2.60              | 4.22             | 78.12                | 70.86       | 2
PhoenixCodec   | 60.90      | 80.69     | 3.40              | 4.16             | 2.08              | 2.98             | 85.57                | 69.22       | 3
nju-aalab      | 65.20      | 89.19     | 2.74              | 4.12             | 1.70              | 2.82             | 82.98                | 67.48       | 4
boya-audio     | 35.22      | 77.24     | 2.21              | 4.30             | 2.03              | 4.26             | 80.29                | 59.42       | 5
pdura7         | 42.75      | 62.56     | 2.30              | 3.29             | 1.68              | 2.05             | 75.34                | 49.94       | 6
lrac-challenge | 17.92      | 74.28     | 1.31              | 3.35             | 1.26              | 2.20             | 75.90                | 42.36       | 7

Audio Examples

Clean Speech

[Spectrograms: ground truth vs. PhoenixCodec at 6 kbps and 1 kbps for clean_subj_english_017_m, 020_m, 037_m, 039_m, and 041_m]

DRT English Clean

[Spectrograms: ground truth vs. PhoenixCodec at 1 kbps for five DRT English clean utterances (back, bad, and bag items)]

Real-world Noise

[Spectrograms: ground truth vs. PhoenixCodec at 6 kbps and 1 kbps for realworld_data_001 through realworld_data_005]

Simultaneous Talkers

[Spectrograms: ground truth vs. PhoenixCodec at 6 kbps and 1 kbps for simultaneous_talkers_001 through simultaneous_talkers_005]