Type
Conference Paper

Authors
Alcázar, Juan León
Caba, Fabian
Mai, Long
Perazzi, Federico
Lee, Joon-Young
Arbeláez, Pablo
Ghanem, Bernard

KAUST Department
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Electrical Engineering Program
VCC Analytics Research Group
Date
2020-08-05

Preprint Posting Date
2020-05-20

Online Publication Date
2020-08-05

Print Publication Date
2020-06

Permanent link to this record
http://hdl.handle.net/10754/663635
Abstract
Current methods for active speaker detection focus on modeling audiovisual information from a single speaker. This strategy can be adequate for addressing single-speaker scenarios, but it prevents accurate detection when the task is to identify which of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our new model learns pairwise and temporal relations from a structured ensemble of audiovisual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. We also find that the proposed Active Speaker Context improves the state-of-the-art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%. Moreover, ablation studies verify that this result is a direct consequence of our long-term multi-speaker analysis.

Citation
Alcazar, J. L., Caba, F., Mai, L., Perazzi, F., Lee, J.-Y., Arbelaez, P., & Ghanem, B. (2020). Active Speakers in Context. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr42600.2020.01248

Conference/Event name
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

ISBN
978-1-7281-7169-2

arXiv
2005.09812

Additional Links
https://ieeexplore.ieee.org/document/9157027/
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9157027
DOI
10.1109/CVPR42600.2020.01248