Neural Computation


Sound Retrieval and Ranking Using Sparse Auditory Representations

Richard F. Lyon, Martin Rehn, Samy Bengio, Thomas C. Walters, Gal Chechik

Google, Mountain View, CA 94043, U.S.A.

To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We have adapted a machine-vision method, the passive-aggressive model for image retrieval (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach, we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole–zero filter cascade (PZFC) auditory filter bank and sparse-code feature extraction from stabilized auditory images with multiple vector quantizers. In addition to auditory image models, we compare a family of more conventional mel-frequency cepstral coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. When thousands of sound files with a query vocabulary of thousands of words were ranked, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC front end.
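The passive-aggressive update at the heart of a PAMIR-style ranker can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the dense-array representation, and the default aggressiveness constant `C` are choices made here for clarity. Given a query vector, a relevant sound's feature vector, and an irrelevant one, the model nudges a linear mapping `W` just enough to score the relevant sound above the irrelevant one by a unit margin.

```python
import numpy as np

def pamir_update(W, q, a_pos, a_neg, C=1.0):
    """One passive-aggressive step on the bilinear score q @ W @ a.

    Enforces score(q, a_pos) >= score(q, a_neg) + 1 with the smallest
    change to W, capped by the aggressiveness parameter C.
    """
    loss = max(0.0, 1.0 - q @ W @ a_pos + q @ W @ a_neg)
    if loss == 0.0:
        return W  # margin already satisfied; passive step
    # The minimal-norm correction is along the rank-one matrix
    # outer(q, a_pos - a_neg).
    V = np.outer(q, a_pos - a_neg)
    tau = min(C, loss / (V * V).sum())  # capped step size
    return W + tau * V
```

In the paper's setting `q` would be a sparse bag of query terms and the `a` vectors sparse auditory (or MFCC) codes, so in practice each update touches only the few rows and columns where those vectors are nonzero, which is what makes the method efficient at large scale.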

Cited by

Ha Manh Do, Weihua Sheng, Meiqin Liu (2016). Human-assisted sound event recognition for home service robots. Robotics and Biomimetics 3:1.
Huy Phan, Lars Hertel, Marco Maass, Radoslaw Mazur, Alfred Mertins (2016). Learning Representations for Nonspeech Audio Events Through Their Similarities to Speech Patterns. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24:4, 807-822.
Ian McLoughlin, Haomin Zhang, Zhipeng Xie, Yan Song, Wei Xiao (2015). Robust Sound Event Classification Using Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23:3, 540-552.
Axel Tidemann, Øyvind Brandtsegg (2015). [self.]. Proceedings of the 2015 ACM SIGCHI Conference on Creativity and Cognition - C&C '15, 153-154.
Axel Tidemann, Øyvind Brandtsegg (2015). [self.]. Proceedings of the 2015 ACM SIGCHI Conference on Creativity and Cognition - C&C '15, 181-184.
Yonatan Vaizman, Brian McFee, Gert Lanckriet (2014). Codebook-Based Audio Feature Representation for Music Information Retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22:10, 1483-1493.
Li Su, Chin-Chia Michael Yeh, Jen-Yu Liu, Ju-Chiang Wang, Yi-Hsuan Yang (2014). A Systematic Evaluation of the Bag-of-Frames Representation for Music Information Retrieval. IEEE Transactions on Multimedia 16:5, 1188-1200.
Janvier Maxime, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud (2014). Sound representation and classification benchmark for domestic robots. 2014 IEEE International Conference on Robotics and Automation (ICRA), 6285-6292.
Eric J. Humphrey, Juan P. Bello, Yann LeCun (2013). Feature learning and deep architectures: new directions for music informatics. Journal of Intelligent Information Systems 41:3, 461-481.
M. G. Baydogan, G. Runger, E. Tuv (2013). A Bag-of-Features Framework to Classify Time Series. IEEE Transactions on Pattern Analysis and Machine Intelligence 35:11, 2796-2802.
Kyogu Lee, Ziwon Hyung, Juhan Nam (2013). Acoustic scene classification using sparse feature learning and event-based pooling. 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1-4.
Omid Madani, Manfred Georg, David Ross (2013). On using nearly-independent feature families for high precision and confidence. Machine Learning 92:2-3, 457-477.
Gael Richard, Shiva Sundaram, Shrikanth Narayanan (2013). An Overview on Perceptually Motivated Audio Indexing and Classification. Proceedings of the IEEE 101:9, 1939-1954.
Yoko Sasaki, Kazuyoshi Yoshii, Satoshi Kagami (2013). A nested infinite Gaussian mixture model for identifying known and unknown audio events. 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 1-4.
Yu-Gang Jiang, Subhabrata Bhattacharya, Shih-Fu Chang, Mubarak Shah (2013). High-level event recognition in unconstrained videos. International Journal of Multimedia Information Retrieval 2:2, 73-101.
Courtenay V. Cotton, Daniel P. W. Ellis (2013). Subband autocorrelation features for video soundtrack classification. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 8663-8666.
Eric Nichols, Charles DuHadway, Hrishikesh Aradhye, Richard F. Lyon (2012). Automatically Discovering Talented Musicians with Acoustic Analysis of YouTube Videos. 2012 IEEE 12th International Conference on Data Mining, 559-565.
Maxime Janvier, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud (2012). Sound-event recognition with a companion humanoid. 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012), 104-111.
Florian Muller, Alfred Mertins (2012). On using the auditory image model and invariant-integration for noise robust automatic speech recognition. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4905-4908.
Bob L. Sturm, Laurent Daudet (2011). Recursive nearest neighbor search in a sparse and multiscale domain for comparing audio signals. Signal Processing 91:12, 2836-2851.
Jason Weston, Samy Bengio, Philippe Hamel (2011). Multi-Tasking with Joint Semantic Spaces for Large-Scale Music Annotation and Retrieval. Journal of New Music Research 40:4, 337-348.
Thomas Leung, Yang Song, John Zhang (2011). Handling label noise in video classification via multiple instance learning. 2011 International Conference on Computer Vision, 2056-2063.
Weilong Yang, George Toderici (2011). Discriminative tag learning on YouTube videos with latent sub-tags. CVPR 2011, 3217-3224.
Vijay Chandrasekhar, Mehmet Emre Sargin, David A. Ross (2011). Automatic Language Identification in music videos with low level audio and visual features. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5724-5727.
Richard F. Lyon, Jay Ponte, Gal Chechik (2011). Sparse coding of auditory features for machine hearing in interference. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5876-5879.
Richard F. Lyon (2011). Cascades of two-pole–two-zero asymmetric resonators are good models of peripheral auditory function. The Journal of the Acoustical Society of America 130:6, 3893.
