Speaker Identification Using a Convolutional Neural Network
Dublin Core
Title
Speaker Identification Using a Convolutional Neural Network
Subject
speaker identification, CNN, spectrogram, feature extraction
Description
Speech, a mode of communication between humans and machines, has various applications, including biometric systems that
identify people who have access to secure systems. Feature extraction is an important factor in achieving high accuracy in
speech recognition. Therefore, we used a spectrogram, which is a pictorial representation of speech in terms of raw features, to
identify speakers. These features were fed into a convolutional neural network (CNN), and a CNN-visual geometry group
(CNN-VGG) architecture was used to recognize the speakers. We used 780 primary data samples from 78 speakers, and each
speaker uttered a number in Bahasa Indonesia. The proposed architecture, CNN-VGG-f, was trained with a learning rate of
0.001, a batch size of 256, and 100 epochs. The results indicate that this architecture can generate a suitable model for
speaker identification. A spectrogram was used to determine the best features for identifying the speakers. The proposed
method exhibited an accuracy of 98.78%, which is significantly higher than the accuracies of the methods using Mel-frequency
cepstral coefficients (MFCCs; 34.62%) and the combination of MFCCs and deltas (26.92%). Overall, CNN-VGG-f with the
spectrogram can identify 77 speakers from the samples, validating the usefulness of the combination of spectrograms and
CNN in speech recognition applications.
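
As an illustration of the spectrogram features mentioned in the description, the following is a minimal sketch of a log-magnitude spectrogram computed with a short-time Fourier transform. It is not the authors' implementation: the function names, the frame length of 64 samples, the hop of 32 samples, the Hann window, and the synthetic 1 kHz test tone are all assumptions chosen to keep the example small and dependency-free.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Return a log-magnitude spectrogram as a list of frames.

    Each frame holds frame_len // 2 + 1 values in dB, one per
    non-negative frequency bin. frame_len and hop are illustrative
    values, not parameters reported in the source.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    # Precompute the DFT basis for the bins we keep (naive DFT, fine for a sketch).
    basis = [
        [cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len)]
        for k in range(n_bins)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = [abs(sum(c * b for c, b in zip(chunk, row))) for row in basis]
        # Small offset avoids log10(0) on silent bins.
        frames.append([20 * math.log10(m + 1e-10) for m in mags])
    return frames


# Hypothetical input: a synthetic 1 kHz tone sampled at 8 kHz (not real speech).
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = spectrogram(signal)
# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one time frame and each column one frequency bin, so the result can be treated as a single-channel image and passed to a CNN. In practice a library routine such as an FFT-based STFT would replace the naive DFT used here.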
Creator
Suci Dwijayanti, Alvio Yunita Putri, Bhakti Yudho Suprapto
Publisher
Universitas Sriwijaya
Date
27-02-2022
Contributor
Fajar Bagus W
Format
PDF
Language
Indonesia
Type
Text
Files
Collection
Citation
Suci Dwijayanti, Alvio Yunita Putri, Bhakti Yudho Suprapto, “Speaker Identification Using a Convolutional Neural Network,” Repository Horizon University Indonesia, accessed May 30, 2025, https://repository.horizon.ac.id/items/show/9116.