[Internship] Extraction of speaker-specific information for the implementation of explainable / interpretable systems.
Salary: according to the applicable salary scales.
This internship aims to build a system for extracting speaker characteristics, based on neural-network AI technologies. The goal is to determine a set of vocal characteristics, each of which is specific to a given subgroup of the population.
Voice-based person authentication will follow a simple scheme inspired by DNA identification: for each speaker-specific feature observed in the samples, the reliability of that observation will be combined with its typicality (i.e., the percentage of people in the population who have that feature). All observations will then be accumulated to produce the final decision.
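As a rough illustration of this accumulation scheme, the sketch below combines reliability and typicality into a per-feature likelihood ratio and sums the log-ratios over all observed features. The mixing formula, the independence assumption between features, the names `Observation`, `feature_lr` and `accumulated_log_lr`, and any numeric values are hypothetical choices made for this example; they are not taken from the BA-LR work.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    """One speaker-specific feature observed in both voice samples (hypothetical model)."""
    name: str
    reliability: float  # assumed probability that the observation is correct, in [0, 1]
    typicality: float   # assumed fraction of the population with the feature, in (0, 1]


def feature_lr(obs: Observation) -> float:
    """Likelihood ratio for one feature under a simple, assumed mixture model.

    With probability `reliability` the observation is trusted and behaves like a
    DNA-style match (LR = 1 / typicality); otherwise it is uninformative (LR = 1).
    """
    if not 0.0 <= obs.reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if not 0.0 < obs.typicality <= 1.0:
        raise ValueError("typicality must lie in (0, 1]")
    return obs.reliability / obs.typicality + (1.0 - obs.reliability)


def accumulated_log_lr(observations: list[Observation]) -> float:
    """Sum the log10 likelihood ratios, assuming the features are independent."""
    return sum(math.log10(feature_lr(obs)) for obs in observations)
```

Under these assumptions, a feature shared by 10% of the population and observed with full reliability contributes log10(10) = 1 to the total, while a feature observed with zero reliability contributes nothing. Rare, reliably observed features therefore dominate the final decision, which is the behavior the DNA analogy suggests.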
The proposed method uses deep learning to better handle the different sources of variability in speech: type of speech, noise, transmission channels, etc. Moreover, the method will estimate the presence of the features jointly, so that each estimate benefits from all the knowledge extracted by the neural networks used.
This internship will be linked to an ongoing PhD thesis on a related topic: the development of the BA-LR approach, an intrinsically interpretable/explainable approach to speaker recognition. The candidate will work in collaboration with the PhD student involved.
Context and challenge:
Voice-based recognition of individuals is a rapidly developing field with important societal implications. Applications already exist in banking, voice assistants and the IoT, for example. Voice recognition is also used for security applications, whether private or national. Finally, it also affects the judicial system, through forensic voice comparison in court.
Although speaker recognition systems show a very high level of performance in scientific evaluations, they still suffer from several flaws. First, like any machine learning system, they can be subject to training biases that lead to inappropriate decisions (e.g., if a regional accent is absent from the training data, the system may confuse individual features with regional ones). Moreover, the systems work as black boxes: they return a numerical score in response to a stimulus under all circumstances, even if the sound recording contains little characteristic information about the speaker. Finally, the scores produced by speaker recognition systems have no meaning in themselves. To make a decision, it is still necessary to normalize, or “calibrate”, the score to take into account the context of the application and the local conditions of use. A flaw in this calibration, due for example to conditions not previously encountered, can cause the system to behave erratically.
This internship will focus on the speech feature extraction module within the BA-LR approach. Priority will be given to the “explainability” aspects.
It will take place within the framework of the LIAvignon partnership chair, at the Laboratoire Informatique d’Avignon (LIA), which will provide all the expertise and resources (software, databases and computing facilities) needed to carry out the proposed work.