Challenge Description
Aims and Motivation
The challenge of sound source localization in acoustically complex environments has attracted widespread attention in the AASP community in recent years. This was highlighted by the acceptance of a Special Session at ICASSP 2017 on “Speaker localization in dynamic real-life environments”, organized jointly by the IEEE AASP and Sensor Array and Multichannel (SAM) technical committees. Source localization approaches in the literature range from single-sensor to multi-sensor and distributed arrays, and are based on features including, for example, Time Delays of Arrival, Direction of Arrival, or even audio spectrograms. Nevertheless, despite the significant impact of sound source localization approaches, a comprehensive, objective benchmarking campaign of state-of-the-art algorithms is to date unavailable. The IEEE AASP challenge on acoustic source LOCalization And TrAcking (LOCATA) therefore aims to provide researchers in source localization with a framework to objectively benchmark results against competing algorithms using a common, publicly released data corpus that encompasses a wide range of realistic scenarios in enclosed acoustic environments.
Academic and Commercial Impact
A large number of AASP sub-areas benefit from accurate sound source localization, including, for example, speaker diarization and Blind Source Separation (BSS) for distinction between concurrently active talkers; beamforming for improved focussing on desired sources; and speech enhancement and dereverberation for suppressing ambient noise, late reverberation, and early reflections. Therefore, robust localization algorithms have wide academic and commercial impact towards the following applications:
• Hearing aids for improved focusing on desired sound sources
• Smart homes and home assistants for interaction with distant speakers
• Robots for awareness of and reaction to visually occluded events
• Smart cars for detection and reaction to approaching emergency vehicles
• Virtual reality devices for synthesis of immersive sound fields.
Context within previous IEEE-AASP Challenges
Recent IEEE AASP challenges focussed on detection and recognition of sound events (DCASE I/II and BAD), as well as the characterization of acoustic environments (ACE). The scope of the IEEE-AASP LOCATA challenge on acoustic source localization and tracking is to estimate the positions of fixed or moving sound sources using various fixed or moving microphone arrays in a realistic acoustic environment.
Tasks
The challenge consists of the following 5 tasks:
1. Localization of a single, static loudspeaker using static microphone arrays
2. Multi-source localization of static loudspeakers using static microphone arrays
3. Localization of a single, moving talker using static microphone arrays
4. Localization of a single, moving talker using moving microphone arrays
5. Multi-source localization of moving talkers using moving microphone arrays.
For Tasks 4 and 5, involving moving sensors, the positions and orientations of the moving sensors were made available to the participants. For Tasks 3 to 5, involving moving talkers, participants were encouraged, but not required, to employ target tracking solutions in addition to or in place of sound source localization approaches.
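As an illustration of how the sensor position and orientation provided for the moving-array tasks might be used, the following minimal sketch expresses a source position in the local frame of an array and converts it to azimuth, elevation, and distance. The function names, the rotation-matrix convention (columns are the array axes in world coordinates), and the angle definitions are assumptions made for this example; they are not taken from the Challenge documentation.

```python
import math

def world_to_array_frame(source_pos, array_pos, rotation):
    """Express a world-frame source position in the array's local frame.

    rotation is a 3x3 nested list whose columns are the array's local
    x, y, z axes written in world coordinates (assumed convention).
    """
    offset = [s - a for s, a in zip(source_pos, array_pos)]
    # Multiply by the transpose of the rotation matrix: local = R^T * offset.
    return [sum(rotation[row][col] * offset[row] for row in range(3))
            for col in range(3)]

def direction_of_arrival(local_pos):
    """Return (azimuth_deg, elevation_deg, distance) of a local-frame point."""
    x, y, z = local_pos
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        raise ValueError("source coincides with the array origin")
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp to guard against rounding slightly outside [-1, 1].
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / distance))))
    return azimuth, elevation, distance

# Hypothetical example: array at (1, 1, 1) m, rotated 90 degrees about z.
yaw_90 = [[0.0, -1.0, 0.0],
          [1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0]]
local = world_to_array_frame([1.0, 3.0, 1.0], [1.0, 1.0, 1.0], yaw_90)
azimuth, elevation, distance = direction_of_arrival(local)
```

In this example the source lies 2 m along the world y-axis from the array, which coincides with the rotated array's own x-axis, so the source appears straight ahead of the array.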
Dataset
As part of the Challenge, an extensive data corpus has been released, targeted at sound source localization in general and at the above 5 tasks in particular. The corpus is open access, distributed under the Open Data Commons license and can be downloaded via this link.
The corpus aims to provide a wide range of scenarios encountered in acoustic signal processing, with an emphasis on dynamic scenarios. All recordings contained in the corpus were made in a realistic, reverberant acoustic environment in the presence of ambient noise from a road in front of the building. Ground truth positions, trajectories, and orientations of sources and sensors were obtained by means of an OptiTrack system that uses 10 infrared cameras to localize and track moving objects. Ground truth positional data were made available to the participants. Ground truth positions of the sources are used for evaluation of the Challenge results and are released as part of the data corpus after completion of the Challenge. Due to the installation of the OptiTrack system, recordings are limited to a single room. To ensure different acoustic conditions between recordings, source-sensor distances and angles were changed, thereby enforcing varying Direct-to-Reverberant Ratios (DRRs) between the recordings. The baseline reverberation time of the room is provided by means of a Room Impulse Response (RIR) measurement. A sound level meter was used throughout the measurements to gauge both the level of ambient noise and the speech signal level at a predetermined position in the room. In addition, readings from a room temperature sensor were provided to the participants.
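To make the notion of a Direct-to-Reverberant Ratio concrete, the sketch below estimates a DRR from a sampled room impulse response using a common textbook definition: the energy in a short window around the strongest peak is taken as the direct path, and everything after that window as reverberation. The window length of 2.5 ms, the function name, and the toy impulse response are assumptions for illustration only; this is not necessarily how DRRs were determined for the corpus.

```python
import math

def direct_to_reverberant_ratio(rir, fs, direct_window_ms=2.5):
    """Estimate the DRR in dB from a room impulse response.

    rir: sequence of impulse response samples
    fs: sampling rate in Hz
    direct_window_ms: half-width of the direct-path window (assumed value)
    """
    # Locate the direct path as the sample with the largest magnitude.
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    half = int(round(direct_window_ms * 1e-3 * fs))
    start = max(0, peak - half)
    stop = min(len(rir), peak + half + 1)
    direct_energy = sum(h * h for h in rir[start:stop])
    reverb_energy = sum(h * h for h in rir[stop:])
    if reverb_energy == 0.0:
        return math.inf  # no energy after the direct-path window
    return 10.0 * math.log10(direct_energy / reverb_energy)

# Toy impulse response: unit direct path followed by one weak reflection.
toy_rir = [0.0, 1.0, 0.0, 0.0, 0.1]
drr_db = direct_to_reverberant_ratio(toy_rir, fs=1000, direct_window_ms=1.0)
```

For the toy response, the direct energy is 1 and the reverberant energy is 0.01, so the estimate is approximately 20 dB. Moving a source closer to the array raises the direct energy relative to the reverberant tail and therefore increases the DRR.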
Speech Material
Tasks 1 and 2, involving static loudspeakers, are based on the CSTR VCTK database [1]. The VCTK database provides over 400 newspaper sentences spoken by 109 native English talkers, recorded in a semi-anechoic environment at 96 kHz and down-sampled to 48 kHz.
The database is distributed under the Open Data Commons license, therefore permitting open access for participants. As a result, the Challenge corpus is also distributed under the Open Data Commons license to facilitate open access.
Tasks 3 to 5 use speech recordings of live talkers reading randomly selected VCTK sentences. The talkers are equipped with DPA microphones near their mouths to record the close-talking speech signals. Participants were provided with the close-talking speech signals for the development dataset only. The corresponding signals for the evaluation dataset will be released as part of the corpus after the Challenge is completed.
These recordings are representative of the practical challenges, including natural speech inactivity during sentences, sporadic utterances as well as dialogues between talkers.
Acoustic Sensor Configurations
The following microphone arrays were used for the recordings:
• Tasks 1-5: Distant-talking Interfaces for Control of Interactive TV (DICIT) array, a linear array comprising nested uniform linear arrays (ULAs)
• Tasks 1-5: 32-channel spherical Eigenmike by mh-acoustics
• Tasks 1-5: 12-channel pseudo-spherical microphone array integrated in a NAO robot head
• Tasks 3-5: Binaural recordings from a pair of hearing aid dummies (Siemens Signia) installed on a dummy head (HeadAcoustic).
[1] http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
These recordings are representative of the practical challenges, including variation in orientation, position, and speed of the microphone arrays as well as the talkers.
Evaluation Measures and System
An external positioning system (optical tracker) was used to record the positions and orientations of the talkers, loudspeakers, and microphone arrays. The estimated locations submitted by the participants were compared to these ground truth values using several criteria that evaluate both the accuracy of the estimated locations and their variation across different talkers.
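As a simple illustration of comparing estimates against ground truth, the following sketch computes the mean absolute azimuth error between estimated and reference directions, wrapping differences so that, for example, 350 degrees and 10 degrees are treated as 20 degrees apart. This is only one plausible accuracy criterion with hypothetical function names; it is not a statement of the official Challenge evaluation measures.

```python
def wrapped_angle_error_deg(estimate_deg, reference_deg):
    """Absolute angular difference in degrees, wrapped to the range [0, 180]."""
    return abs((estimate_deg - reference_deg + 180.0) % 360.0 - 180.0)

def mean_azimuth_error_deg(estimates_deg, references_deg):
    """Mean wrapped azimuth error over time-aligned estimate/reference pairs."""
    if len(estimates_deg) != len(references_deg):
        raise ValueError("estimates and references must have equal length")
    if not estimates_deg:
        raise ValueError("at least one estimate is required")
    errors = [wrapped_angle_error_deg(e, r)
              for e, r in zip(estimates_deg, references_deg)]
    return sum(errors) / len(errors)

# Hypothetical estimates and ground truth azimuths for two time stamps.
mean_error = mean_azimuth_error_deg([350.0, 90.0], [10.0, 85.0])
```

Here the two individual errors are 20 and 5 degrees, giving a mean error of 12.5 degrees. Computing the same quantity separately per talker or per recording would give a rough picture of how consistently an algorithm performs across conditions.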
Dissemination of Results
The results of the challenge were published as conference contributions at a satellite workshop of IWAENC 2018 in Tokyo.
In addition, an article will be submitted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing. This paper will review, categorize, and benchmark all participating algorithms. The results will also be published on this website.