Published: 2025-09-15

Urbanphony-3-CNN: A Convolutional Neural Network for Identifying the Urban Soundscape Taxonomy in Spectrograms Generated from Audios of Historic Cities


DOI: https://doi.org/10.15446/ing.investig.114942

Keywords: soundscape studies, urban soundscape, convolutional neural networks, Mel spectrograms, supervised learning, cross-validation


Urban soundscapes are characterized by the overlapping of multiple sounds, posing challenges for automatic classification via deep learning. This study applies convolutional neural networks (CNNs) with transfer learning to identify diverse sounds in urban environments using Mel spectrograms (time-frequency images of audio signals). We created the Urbanphony-3 (UP3) dataset from recordings in historic cities with significant sound overlap and compared it against two established datasets with minimal overlap: UrbanSound8K (US8K) and Environmental Sound Classification 50 (ESC50). CNNs were trained on each dataset to develop the UP3-CNN, US8K-CNN, and ESC50-CNN models, enabling the automatic recognition of various urban sounds. Model performance was assessed through five-fold cross-validation, using accuracy and loss metrics, as well as confusion matrix analysis and ROC curves. The UP3-CNN model, which classified sounds from environments with frequent overlap, reached an accuracy of 75.2%. In contrast, ESC50-CNN and US8K-CNN, trained on sounds with less overlap, yielded better results (85.6 and 86.3%, respectively). These findings confirm that CNNs have great potential for classifying urban soundscapes, even under natural overlap. However, the performance gap between UP3-CNN and the other models indicates that CNNs are less effective when sounds overlap significantly. Thus, additional strategies are required to improve the results, including data augmentation, transformers, or optimization techniques. Future research should also extend automatic soundscape classification by considering other variables, such as emotional reactions, cultural preferences, or contextual influences.
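
To make the pipeline described in the abstract concrete, the sketch below shows one plausible way to implement it: Mel-spectrogram extraction, a transfer-learning CNN, and stratified five-fold cross-validation. This is a minimal illustration, not the authors' code; it assumes librosa, TensorFlow/Keras, and scikit-learn, and the EfficientNetB0 backbone, clip length, and all hyperparameters are hypothetical choices rather than the paper's reported configuration.

# Minimal, illustrative sketch only. Assumptions: librosa, TensorFlow/Keras,
# scikit-learn; the backbone, clip length, and hyperparameters are hypothetical.
import numpy as np
import librosa
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

def audio_to_mel_image(path, sr=22050, duration=4.0, n_mels=128):
    """Load a fixed-length clip and return a 3-channel Mel-spectrogram image."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))      # pad or trim
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    db = librosa.power_to_db(mel, ref=np.max)                    # time-frequency image in dB
    img = 255.0 * (db - db.min()) / (db.max() - db.min() + 1e-6) # scale to [0, 255]
    return np.repeat(img[..., np.newaxis], 3, axis=-1)           # tile to RGB-like channels

def build_transfer_cnn(input_shape, n_classes):
    """ImageNet-pretrained backbone, frozen, with a small classification head."""
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_shape=input_shape)
    base.trainable = False                                       # transfer learning: reuse features
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def five_fold_accuracy(X, y, n_classes, epochs=20):
    """Stratified five-fold cross-validation over spectrogram images X and labels y.

    X: array of shape (n_samples, n_mels, time_frames, 3); y: integer labels.
    """
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    accs = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_transfer_cnn(X.shape[1:], n_classes)
        model.fit(X[train_idx], y[train_idx],
                  epochs=epochs, batch_size=32, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        accs.append(acc)
    return float(np.mean(accs)), float(np.std(accs))

From each fold's validation predictions, confusion matrices and ROC curves like those used in the paper's evaluation can be computed with sklearn.metrics.confusion_matrix and sklearn.metrics.roc_curve applied to the model's predicted class probabilities.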



How to Cite

APA

Durán Paredes, C. A., Grijalba Obando, J. A., Cajas Ordoñez, S., & Sánchez-Ferreira, C. (2025). Urbanphony-3-CNN: A Convolutional Neural Network for Identifying the Urban Soundscape Taxonomy in Spectrograms Generated from Audios of Historic Cities. Ingeniería e Investigación, 45(2). https://doi.org/10.15446/ing.investig.114942
