A Single Predominant Instrument Recognition of Polyphonic Music Using CNN-based Timbre Analysis

  • Authors

    • Daeyeol Kim
    • Tegg Taekyong Sung
    • Soo Young Cho
    • Gyunghak Lee
    • Chae Bong Sohn
    Published: 2018-09-01
    DOI: https://doi.org/10.14419/ijet.v7i3.34.19388
  • Keywords: Instrument recognition, Convolutional neural network, Timbre analysis, Hilbert spectrum analysis, Intrinsic mode functions
  • Abstract: Classifying musical instruments in polyphonic music is a challenging but important task in music information retrieval, and it enables automatic tagging of music information, such as genre. Most previous spectrogram-based work has relied on the Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC), and more recently the sparkgram has been studied for audio source analysis. On the deep learning side, many modified convolutional neural network (CNN) architectures have been investigated, but the resulting gains have been modest. Instead of improving the backbone network, we focus on the preprocessing stage.

    In this paper, we combine a CNN with Hilbert Spectral Analysis (HSA) to address predominant instrument recognition in polyphonic music. HSA is applied to fixed-length excerpts of polyphonic audio, and each resulting spectrum is labeled with its predominant instrument. As a result, we achieve a state-of-the-art result on the IRMAS dataset and a 3% performance improvement on individual instruments.
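
    The following is a minimal Python sketch of the HSA preprocessing step described above: a fixed-length excerpt is decomposed into intrinsic mode functions (IMFs) by empirical mode decomposition, each IMF is converted to an analytic signal with the Hilbert transform, and the instantaneous amplitude is accumulated into a time-frequency Hilbert spectrum that a CNN can consume as a single-channel image. It assumes the third-party PyEMD and librosa packages; the file name, excerpt length, and frequency binning are illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from scipy.signal import hilbert
from PyEMD import EMD   # third-party package (assumed) for empirical mode decomposition
import librosa          # assumed here only for loading a fixed-length excerpt

SR = 22050              # sampling rate (illustrative)
EXCERPT_SEC = 3.0       # fixed excerpt length (illustrative)
N_FREQ_BINS = 128       # frequency resolution of the Hilbert spectrum (illustrative)

def hilbert_spectrum(excerpt, sr, n_bins=N_FREQ_BINS):
    """Build a time-frequency Hilbert spectrum from a mono excerpt.

    1. Decompose the signal into IMFs with EMD.
    2. Take the analytic signal of each IMF via the Hilbert transform.
    3. Accumulate instantaneous amplitude into (frequency, time) bins.
    """
    excerpt = np.asarray(excerpt, dtype=np.float64)
    imfs = EMD().emd(excerpt)                            # shape: (n_imfs, n_samples)
    n_samples = excerpt.shape[0]
    spectrum = np.zeros((n_bins, n_samples))
    for imf in imfs:
        analytic = hilbert(imf)
        amplitude = np.abs(analytic)
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) * sr / (2.0 * np.pi)  # Hz, length n_samples - 1
        inst_freq = np.clip(inst_freq, 0.0, sr / 2.0)
        freq_bin = (inst_freq / (sr / 2.0) * (n_bins - 1)).astype(int)
        spectrum[freq_bin, np.arange(n_samples - 1)] += amplitude[:-1]
    return spectrum

# Example: compute the spectrum of one excerpt and shape it as CNN input
# (the training loop and instrument labels are omitted; file name is hypothetical).
audio, _ = librosa.load("example_track.wav", sr=SR, mono=True, duration=EXCERPT_SEC)
hs = hilbert_spectrum(audio, SR)
cnn_input = np.log1p(hs)[np.newaxis, ...]                # shape: (1, n_bins, n_samples)
```
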
  • References

      [1] Downie, J. S. (2003). Music information retrieval. Annual review of information science and technology, 37(1), 295-340.

      [2] Ng, P. C., & Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein function. Nucleic acids research, 31(13), 3812-3814.

      [3] Rakotomamonjy, A., & Gasso, G. (2015). Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 23(1), 142-153.

      [4] Lecun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp.2278-2324. doi: 10.1109/5.726791

      [5] CNN Architectures: LeNet, AlexNet, VGG, GoogLeNet, ResNet and more … (2018). Retrieved from https://medium.com

      [6] Xu, M., Duan, L. Y., Cai, J., Chia, L. T., Xu, C., & Tian, Q. (2004, November). HMM-based audio keyword generation. In Pacific-Rim Conference on Multimedia (pp. 566-574). Springer, Berlin, Heidelberg.

      [7] Joder, C., Essid, S., & Richard, G. (2011). A conditional random field framework for robust and scalable audio-to-score matching. IEEE Transactions on Audio, Speech, and Language Processing, 19(8), 2385-2397.

      [8] Boreczky, J. S., & Wilcox, L. D. (1998, May). A hidden Markov model framework for video segmentation using audio and image features. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on (Vol. 6, pp. 3741-3744). IEEE.

      [9] Allen, J. (1977). Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(3), 235-238.

      [10] Eronen, A., & Klapuri, A. (2000). Musical instrument recognition using cepstral coefficients and temporal features. In Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on (Vol. 2, pp. II753-II756). IEEE.

      [11] Mel-frequency cepstral coefficient analysis in speech recognition. (2006). 2006 International Conference on Computing & Informatics. doi: 10.1109/ICOCI.2006.5276486

      [12] Huang, N. E., Shen, Z., Long, S. R., Wu, M. C., Shih, H. H., Zheng, Q., ... & Liu, H. H. (1998, March). The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. In Proceedings of the Royal Society of London A: mathematical, physical and engineering sciences (Vol. 454, No. 1971, pp. 903-995). The Royal Society.

      [13] Sandoval, S., De Leon, P. and Liss, J. (2015). Hilbert spectral analysis of vowels using intrinsic mode functions. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). doi: 10.1109/ASRU.2015.7404846

      [14] Rato, R. T., Ortigueira, M. D., & Batista, A. G. (2008). On the HHT, its problems, and some solutions. Mechanical Systems and Signal Processing, 22(6), 1374-1394.

      [15] Huang, P. S., Kim, M., Hasegawa-Johnson, M., & Smaragdis, P. (2014, May). Deep learning for monaural speech separation. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 1562-1566). IEEE.

      [16] Han, Y., Kim, J. and Lee, K. (2017). Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1), pp.208-221. doi: 10.1109/TASLP.2016.2632307

      [17] Lee, H., Pham, P., Largman, Y., & Ng, A. Y. (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems (pp. 1096-1104).

      [18] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • How to Cite

    Kim, D., Taekyong Sung, T., Young Cho, S., Lee, G., & Bong Sohn, C. (2018). A Single Predominant Instrument Recognition of Polyphonic Music Using CNN-based Timbre Analysis. International Journal of Engineering & Technology, 7(3.34), 590-593. https://doi.org/10.14419/ijet.v7i3.34.19388