Deep Speaker: an End-to-End Neural Speaker Embedding System - 5 May 2017
1、We experiment with ResCNN and GRU architectures to extract the acoustic features, then mean-pool to produce utterance-level speaker embeddings, and train using a triplet loss based on cosine similarity.
2、Pooling and length-normalization layers generate the utterance-level speaker embeddings (a minimal sketch of this pipeline follows the list).
3、We also investigate stacked gated recurrent unit (GRU) layers as an alternative for frame-level feature extraction, since they have proven effective for speech processing applications [Deep Speech 2] (a GRU trunk sketch also follows the list).
4、We also select hard negative examples at each iteration by checking candidate utterances globally, not just within the same minibatch; this speeds up training convergence (see the hard-negative sketch below).
5、We experiment with two different core architectures: a ResNet-style deep CNN and the Deep Speech 2 style architecture consisting of convolutional layers followed by GRU layers.
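A minimal sketch of points 1-2, assuming a PyTorch setting: frame-level features from the trunk are mean-pooled over time, length-normalized onto the unit sphere, and trained with a triplet loss on cosine similarity. The tensor shapes and the margin value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def utterance_embedding(frame_features: torch.Tensor) -> torch.Tensor:
    """frame_features: (batch, time, dim) frame-level features from the trunk."""
    pooled = frame_features.mean(dim=1)     # temporal mean pooling
    return F.normalize(pooled, p=2, dim=1)  # length normalization to unit norm

def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Inputs are length-normalized embeddings of shape (batch, dim)."""
    sim_ap = (anchor * positive).sum(dim=1)  # cosine similarity (unit vectors)
    sim_an = (anchor * negative).sum(dim=1)
    # require anchor-positive similarity to beat anchor-negative similarity
    # by at least `margin` (the margin value here is an assumption)
    return F.relu(sim_an - sim_ap + margin).mean()
```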
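For point 3, a hedged sketch of a stacked-GRU frame-level extractor; the layer count, hidden size, and feature dimensions are placeholders rather than the paper's published hyperparameters.

```python
import torch.nn as nn

class GRUTrunk(nn.Module):
    """Stacked GRU layers producing frame-level features (placeholder sizes)."""
    def __init__(self, n_feats=64, hidden=1024, layers=3, out_dim=512):
        super().__init__()
        self.gru = nn.GRU(n_feats, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)  # affine projection per frame

    def forward(self, x):                       # x: (batch, time, n_feats)
        frames, _ = self.gru(x)
        return self.proj(frames)                # (batch, time, out_dim)
```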
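And for point 4, a sketch of global hard-negative selection: instead of mining only within the current minibatch, the anchor is scored against a larger pool of candidate embeddings (assumed here to be a pre-computed cache) and the most similar different-speaker utterance is chosen.

```python
import torch

def hardest_negative(anchor_emb, anchor_spk, candidate_embs, candidate_spks):
    """anchor_emb: (dim,) unit vector; candidate_embs: (N, dim) unit vectors;
    candidate_spks: (N,) speaker labels for the cached candidates."""
    sims = candidate_embs @ anchor_emb                  # cosine similarities
    sims[candidate_spks == anchor_spk] = -float("inf")  # mask same-speaker rows
    return candidate_embs[sims.argmax()]                # most confusable negative
```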
VoxCeleb2: Deep Speaker Recognition - 27 Jun 2018
1、The objective of this paper is speaker recognition under noisy and unconstrained conditions.
2、We develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions.
3、In this paper, we present a deep CNN based neural speaker embedding system, named VGGVox, trained to map voice spectrograms to a compact Euclidean space where distances directly correspond to a measure of speaker similarity.
4、Unfortunately, speaker recognition still faces a dearth of large-scale freely available datasets in the wild. To address this issue we curate VoxCeleb2, a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances from over 6k speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real-world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages.
5、We train VGGVox on this dataset in order to learn speaker discriminative embeddings.
6、Our system consists of three main variable parts: an underlying deep CNN trunk architecture, which is used to extract the features, a pooling method which is used to aggregate features to provide a single embedding for a given utterance, and a pairwise loss trained on the features to directly optimise the mapping itself.
7、We experiment with both VGG-M and ResNet based trunk CNN architectures.
8、We propose deep ResNet-based architectures for speaker embedding suitable for spectrogram inputs (section 4).
9、We beat the current state of the art for speaker verification on the VoxCeleb1 test set using our embeddings (section 5).
10、The system is trained on short-term magnitude spectrograms extracted directly from raw audio segments, with no other pre-processing (see the spectrogram sketch after this list).
11、A deep neural network trunk architecture is used to extract frame-level features, which are pooled to obtain utterance-level speaker embeddings.
12、The entire model is then trained using a contrastive loss (a minimal sketch follows the list).
13、Pre-training using a softmax layer and cross-entropy over a fixed list of speakers improves model performance; hence we pre-train the trunk architecture for the task of identification first (see the two-stage sketch below).
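A sketch of the input pipeline in point 10, computing a short-term magnitude spectrogram directly from raw audio with no other pre-processing. The 25 ms Hamming window, 10 ms hop, and 512-point FFT follow common convention for this kind of system, but treat these values as assumptions here.

```python
import torch

def magnitude_spectrogram(waveform: torch.Tensor, sample_rate: int = 16000):
    """waveform: (num_samples,) mono audio at `sample_rate` Hz."""
    win = int(0.025 * sample_rate)   # 25 ms analysis window (assumed)
    hop = int(0.010 * sample_rate)   # 10 ms hop between frames (assumed)
    spec = torch.stft(waveform, n_fft=512, hop_length=hop, win_length=win,
                      window=torch.hamming_window(win), return_complex=True)
    return spec.abs()                # keep magnitude only; phase is discarded
```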
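A minimal sketch of the pairwise contrastive loss in point 12: genuine pairs are pulled together, impostor pairs are pushed apart beyond a margin. The squared-Euclidean formulation and the margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """emb_a, emb_b: (batch, dim); same_speaker: (batch,) float in {0, 1}."""
    dist = F.pairwise_distance(emb_a, emb_b)  # Euclidean distance per pair
    pos = same_speaker * dist.pow(2)          # pull same-speaker pairs together
    neg = (1.0 - same_speaker) * F.relu(margin - dist).pow(2)  # push impostors apart
    return (pos + neg).mean()
```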
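Finally, a sketch of the two-stage recipe in point 13: pre-train the trunk with a softmax identification head over a fixed speaker list, then discard the head and train the embedding with the pairwise loss. `trunk`, the embedding size, and the speaker count are placeholders.

```python
import torch.nn as nn

class IdentificationHead(nn.Module):
    """Wraps the embedding trunk with a classification layer for pre-training."""
    def __init__(self, trunk: nn.Module, emb_dim=512, n_speakers=6000):
        super().__init__()
        self.trunk = trunk                        # shared embedding trunk
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):
        return self.classifier(self.trunk(x))     # logits for cross-entropy

# Stage 1: train with nn.CrossEntropyLoss on speaker identity labels.
# Stage 2: drop `classifier`, keep `trunk`, fine-tune with the contrastive loss.
```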
Deep Speaker Embeddings for Short-Duration Speaker Verification - August 2017
1、We apply deep neural networks directly to time-frequency speech representations.
2、Our best model is based on a deep convolutional architecture wherein recordings are treated as images.
3、From our experimental findings we advocate treating utterances as images or speaker snapshots, much like in face recognition.
4、The trials are audio recordings of arbitrary duration, and their phonetic content is unconstrained.
5、We advocate the view of treating a time-frequency representation of speech like an image (see the sketch after this list).
6、Each such image is 5 seconds long and 40 filter-banks wide.
7、In the next section we analyze the problem of modeling speakers with neural networks.
8、We also provide details of the deep network architectures used in this work. This is followed by a section describing our experiments and results.
9、In this context we argue that recognizing speakers has more in common with recognizing faces than recognizing speech. Indeed, many ideas from face recognition have been successfully ported to speaker recognition.
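A minimal sketch of the "utterance as image" view in points 2, 5 and 6: a fixed 5-second, 40-filter-bank crop is shaped as a single-channel image and passed through a 2-D CNN, just as a face photo would be. The frame rate and the CNN trunk below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# 40 filter banks x 500 frames (5 s at an assumed 10 ms per frame), one channel
utterance_image = torch.randn(1, 1, 40, 500)  # (batch, channel, freq, time)

conv_trunk = nn.Sequential(                   # placeholder CNN, not the paper's
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                  # collapse the time-frequency plane
    nn.Flatten(),
    nn.Linear(64, 256),                       # speaker embedding dimension
)
speaker_snapshot = conv_trunk(utterance_image)  # (1, 256) "speaker snapshot"
```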