Deepfake Speech Detection: Identifying AI-Generated and Real Human Voices Using Hybrid Convolutional Neural Network and Long Short-Term Memory Model

Authors

  • Marc Laureta, College of Informatics and Computing Studies, New Era University, Quezon City, 1107, Philippines
  • John Maynard Atienza, College of Informatics and Computing Studies, New Era University, Quezon City, 1107, Philippines
  • John Lemuel Tapel, College of Informatics and Computing Studies, New Era University, Quezon City, 1107, Philippines

Keywords:

CNN-LSTM model, deepfake audio detection, Mel spectrogram, synthetic speech detection, multilingual speech classification

Abstract


This study explored deepfake audio detection using English and Tagalog datasets to enhance multilingual speech classification. The rise of synthetic media, particularly deepfake audio, raises concerns about misinformation, security, and authenticity. To address this, the researchers developed a web-based detection system built on a hybrid Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) model, which captures both spatial and temporal features for accurate classification. The approach used Mel spectrograms as input, convolutional layers to learn spatial patterns, and LSTM networks to model temporal dependencies. Trained on an augmented dataset of over 176,000 samples and fine-tuned in TensorFlow, the model achieved 98.65% accuracy, 98.60% precision, and 98.76% recall. The system employed class weighting to address class imbalance and mixed-precision training for efficiency. Its architecture comprised Conv2D layers with Batch Normalization and MaxPooling, followed by TimeDistributed Dense layers and an LSTM for sequential modeling. Regularization and training callbacks were used to optimize performance, which was evaluated with accuracy, precision, recall, F1-score, and a confusion matrix. The results confirmed the model's efficacy in distinguishing real from AI-generated voices, helping mitigate the risks posed by synthetic speech. Future work may broaden dataset diversity and improve system responsiveness for wider real-world deployment.
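As a rough illustration of the architecture described in the abstract, the following is a minimal TensorFlow/Keras sketch of a CNN-LSTM classifier over Mel-spectrogram inputs. The input shape, filter counts, layer sizes, and training settings here are illustrative assumptions, not the authors' exact configuration.

    # Minimal sketch of a CNN-LSTM audio classifier in TensorFlow/Keras.
    # Shapes and hyperparameters are assumptions for illustration only.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    # Optional, per the abstract's mention of mixed-precision training:
    # tf.keras.mixed_precision.set_global_policy("mixed_float16")

    def build_cnn_lstm(input_shape=(128, 216, 1), num_classes=2):
        model = models.Sequential([
            layers.Input(shape=input_shape),               # Mel spectrogram: (mels, frames, 1)
            layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D((2, 2)),
            # Reorder so the time axis becomes the sequence dimension for the LSTM.
            layers.Permute((2, 1, 3)),                     # (mels, time, ch) -> (time, mels, ch)
            layers.Reshape((-1, (input_shape[0] // 4) * 64)),
            layers.TimeDistributed(layers.Dense(128, activation="relu")),
            layers.LSTM(64),                               # temporal modeling over spectrogram frames
            layers.Dropout(0.3),                           # regularization
            layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

In this sketch, the pooled spectrogram is reordered so time becomes the sequence axis, TimeDistributed Dense layers compress each frame, and the LSTM models temporal dependencies before the final softmax. Class imbalance of the kind mentioned in the abstract could be handled by passing a class_weight dictionary to model.fit.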

Published

2025-06-30

How to Cite

Laureta, M., Atienza, J. M., & Tapel, J. L. (2025). Deepfake Speech Detection: Identifying AI-Generated and Real Human Voices Using Hybrid Convolutional Neural Network and Long Short-Term Memory Model. Isabela State University Linker: Journal of Engineering, Computing and Technology, 2(1), 32–49. Retrieved from https://www.isujournals.ph/index.php/ject/article/view/208