Technology

Audio-based context awareness detects human behaviours such as cooking, eating, working, various kinds of talking, and travelling as they occur around microphone-enabled devices. This is a novel problem with many application areas for mobile devices, AR/VR headsets, and smart homes. However, audio-based context is difficult to annotate at scale, often requiring expert annotations that are hard to obtain. The closest related problem that the research community has tackled is audio event detection. The largest public dataset for this problem, AudioSet, contains approximately 5,800 hours of manually labelled audio, and the state of the art still reaches only approximately 59% classification accuracy when detecting domestic scene events [16]. We instead develop a novel formulation of the problem of determining context, a high-level description of the behaviours in a scene, and a number of innovations that leverage our unique taxonomic understanding of context awareness. HyperSentience’s proprietary solutions require only approximately 373 hours of manually annotated audio to model five context scenarios: cooking, eating, talking (in person), talking (amplified), and working. In this document, we present our approach to audio-based context awareness and compare it with a baseline approach to context modelling that uses a state-of-the-art event detection model released by Google (YAMNet), which achieves a mean average precision (mAP) of 0.26. In comparison, our innovations result in a mAP of 0.693 on context detection and an optimised precision of 0.87 following model selection, while still maintaining notification detection rates of 0.81 across the same five context scenarios. Further, our approach is well tested in real-life conditions across 14 countries and on various phones, and is highly optimised, CPU- and energy-efficient, and capable of running on the edge.
Finally, this document describes how we see our technology scaling to new contexts and application areas.
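
For readers unfamiliar with the mAP metric quoted above, the following is a minimal Python sketch of one common way to compute it: average precision (AP) is calculated per context from a ranked list of clip scores, and the per-context values are then averaged. The function names, the clip labels, and the scores are hypothetical and purely illustrative. They are not HyperSentience's evaluation code or data, and the exact AP variant used for the figures in this document is not specified here.

```python
from typing import Dict, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one context: mean of precision@k over the ranks k that hold a positive clip."""
    # Rank clips from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # A context with no positive clips contributes 0.0 by convention in this sketch.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    labels_by_context: Dict[str, Sequence[int]],
    scores_by_context: Dict[str, Sequence[float]],
) -> float:
    """Unweighted mean of the per-context AP values."""
    aps = [
        average_precision(labels_by_context[c], scores_by_context[c])
        for c in labels_by_context
    ]
    return sum(aps) / len(aps)


# Toy data (illustrative only): four clips scored for two of the five contexts.
toy_labels = {"cooking": [1, 0, 1, 0], "eating": [1, 1, 0, 0]}
toy_scores = {"cooking": [0.9, 0.8, 0.7, 0.1], "eating": [0.9, 0.8, 0.3, 0.1]}
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In this toy case the "eating" ranking is perfect (AP of 1.0), while "cooking" has one negative clip ranked above a positive one (AP of 5/6), so the mean is 11/12. Because AP is computed from the ranking rather than from a fixed threshold, it summarises performance across all operating points, whereas the precision and detection-rate figures above describe a single chosen operating point.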