Deepfake Speech Detection: Identifying AI-Generated and Real Human Voices Using Hybrid Convolutional Neural Network and Long Short-Term Memory Model
Keywords:
CNN-LSTM model, deepfake audio detection, Mel spectrogram, synthetic speech detection, multilingual speech classification

Abstract
This study explored deepfake audio detection using English and Tagalog datasets to enhance multilingual speech classification. The rise of synthetic media, particularly deepfake audio, raises concerns about misinformation, security, and authenticity. To address this, the researchers developed a web-based detection system using a hybrid Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) model, which captured spatial and temporal features for accurate classification. The approach leveraged Mel spectrograms, convolutional layers for spatial patterns, and LSTM networks for temporal dependencies. Trained on an augmented dataset of over 176,000 samples and fine-tuned using TensorFlow, the model achieved 98.65% accuracy, with a precision of 98.60% and a recall of 98.76%. The system employed class weighting to address class imbalance and used mixed-precision training for efficiency. Its architecture comprised Conv2D layers with Batch Normalization and MaxPooling, followed by TimeDistributed Dense layers and an LSTM for sequential modeling. Regularization and training callbacks optimized performance, and the model was evaluated using accuracy, precision, recall, F1-score, and a confusion matrix. Results confirmed the model's efficacy in distinguishing real and AI-generated voices, mitigating risks posed by synthetic speech. Future work may broaden dataset diversity and improve system responsiveness for wider real-world deployment.
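The abstract summarizes the pipeline at a high level; the sketch below illustrates how such a hybrid CNN-LSTM could be assembled in TensorFlow/Keras, with Conv2D + Batch Normalization + MaxPooling blocks, TimeDistributed Dense layers, an LSTM, mixed-precision training, and class weighting. The spectrogram dimensions, layer sizes, class weights, and callbacks are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Mixed-precision training for efficiency, as described in the abstract.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Assumed input: log-Mel spectrograms with 128 Mel bands and 128 time frames.
N_MELS, N_FRAMES = 128, 128

def build_cnn_lstm(n_mels=N_MELS, n_frames=N_FRAMES):
    """CNN front-end for spatial patterns, LSTM back-end for temporal dependencies."""
    inputs = layers.Input(shape=(n_mels, n_frames, 1))

    # Convolutional blocks: Conv2D + Batch Normalization + MaxPooling.
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)

    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # Make the time axis the sequence dimension:
    # (mel, time, channels) -> (time, mel, channels) -> (time, mel * channels).
    x = layers.Permute((2, 1, 3))(x)
    time_steps = x.shape[1]
    x = layers.Reshape((time_steps, x.shape[2] * x.shape[3]))(x)

    # Per-time-step feature projection, then sequential modeling with an LSTM.
    x = layers.TimeDistributed(layers.Dense(128, activation="relu"))(x)
    x = layers.Dropout(0.3)(x)
    x = layers.LSTM(64)(x)

    # Binary output (real vs. AI-generated); keep the head in float32 under mixed precision.
    outputs = layers.Dense(1, activation="sigmoid", dtype="float32")(x)
    return models.Model(inputs, outputs)

model = build_cnn_lstm()
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

# Class weighting to counter imbalance between real and fake samples (illustrative values),
# with early stopping and learning-rate reduction as example callbacks.
# model.fit(train_ds, validation_data=val_ds, epochs=30,
#           class_weight={0: 1.0, 1: 1.3},
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
#                      tf.keras.callbacks.ReduceLROnPlateau(patience=3)])
```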