Arnon Turetzky

Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion

ASRU 2025

Arnon Turetzky, Avihu Dekel, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Yossi Adi

SALAD is a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation and extends it to generate variable-length outputs. Our approach uses semantic tokens to provide contextual information and to determine the stopping condition.

arXiv

LAST: Language Model Aware Speech Tokenization

arXiv preprint

Arnon Turetzky and Yossi Adi

Most speech tokenizers used in Speech Language Models (SpeechLMs) are trained independently of the language model (LM) training process. These tokenizers rely on quantization methods applied over the acoustic model representations. In this study, we propose to integrate objectives from pre-trained textual LMs into the tokenizer training process. The goal is to guide the tokenizer towards creating clusters that are better suited for the language model.

Project Page arXiv Code

A Language Modeling Approach to Diacritic-Free Hebrew TTS

INTERSPEECH 2024

Amit Roth, Arnon Turetzky, Yossi Adi

In this work, we propose a diacritic-free language modeling approach to Hebrew TTS. The model operates on discrete speech representations and is conditioned on a word-piece tokenizer. We optimize the proposed method using in-the-wild weakly supervised data and compare it to several diacritic-based TTS systems. Results suggest the proposed method is superior to the evaluated baselines in both content preservation and naturalness of the generated speech.

Project Page arXiv Code

HEBDB: a Weakly Supervised Dataset for Hebrew Speech Processing

INTERSPEECH 2024

Arnon Turetzky, Or Tal, Yael Segal-Feldman, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi

We present HEBDB, a weakly supervised dataset for spoken language processing in the Hebrew language. HEBDB offers roughly 2500 hours of natural and spontaneous speech recordings in the Hebrew language, consisting of a large variety of speakers and topics. We provide raw recordings together with a pre-processed, weakly supervised, and filtered version. The goal of HEBDB is to further enhance research and development of spoken language processing tools for the Hebrew language.

Project Page arXiv Code

Deep Audio Waveform Prior

INTERSPEECH 2022

Arnon Turetzky, Tzvi Michelson, Yossi Adi, Shmuel Peleg

"Deep prior" describes the bias of a neural network (NN) structure toward generating "natural" results regardless of training data. In this work we show that existing SOTA architectures for audio source separation contain deep priors even when working with the raw waveform. Deep priors can be discovered by training a neural network to generate a single corrupted signal when given white noise as input. A network with relevant deep priors is likely to generate a cleaner version of the signal before converging on the corrupted signal. We demonstrate this restoration effect with several corruptions: background noise, reverberations, and a gap in the signal (audio inpainting).

Project Page arXiv Code
