FINDING YOUR VOICE IN THE REGULATORY AGE

NIGEL CANNINGS, CTO

nigel.cannings@intelligentvoice.com
@intelligentvox

THE YEAR OF VOICE

LIBOR FX Scandal

Banks face Multi-Billion $ fines

Amazon Alexa SIRI(?)

As almost 50% of all corporate data will have a voice component within 5 years, either as audio or video, all companies, but particularly banks and insurance companies, need to get a handle not just on where this data is being held, but what is being said in it, and also who is saying it.

2015 2016? 2017!

AUDIENCE PARTICIPATION

HOW OFTEN DO YOU USE A VOICE ASSISTANT?

Of the people with a smart phone how many use their integrated voice assistant (e.g. Siri, Cortana):

[Bar chart: Weekly and Monthly usage, axis 0% to 90%; bar values not recoverable]

Results taken from a survey on 5th October 2017 of 1500 people across Europe

HOW OFTEN DO YOU USE A VOICE ASSISTANT?

Of the people with an Alexa home assistant how often do they use it:

[Bar chart: Weekly and Monthly usage, axis 0% to 50%; bar values not recoverable]

Results taken from a survey on 5th October 2017 of 1500 people across Europe

IT’S A DOUBLE WHAMMY

Where?

GDPR

MiFID II

What?

Who?

CLOUD SECURITY

Where is your voice stored?

Your voice could be used for any number of the following:

Use (edit) your voice recordings to impersonate you

Learn about you

→ Your identity, gender, nationality (accent), emotional state...

Track you from uploads / communications of voice recordings

ENCRYPTED SPEECH PROCESSING

Privacy preserving encrypted phonetic search of speech data. C Glackin, G Chollet, N Dugan, N Cannings, J Wall, S Tahir, IG Ray. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

A New Secure and Lightweight Searchable Encryption Scheme over Encrypted Cloud Data. S Tahir, S Ruj, Y Rahulamathavan, M Rajarajan, C Glackin. IEEE Transactions on Emerging Topics in Computing, 2017.

AES Encryption (Public key)

Powered by machine learning

Powered by GPU
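
One way to picture searching phonetic data without exposing it to the cloud is a keyed-hash index. The sketch below is a generic illustration, not the scheme from the papers cited above: the function names, the HMAC-SHA256 construction, the key, and the sample phone strings are all assumptions made for the example.

```python
import hashlib
import hmac


def trapdoor(key: bytes, phones: str) -> str:
    """Keyed hash of a phone sequence; it cannot be recomputed without the key."""
    return hmac.new(key, phones.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, recordings: dict) -> dict:
    """Client side: map each hashed phone sequence to the recordings containing it."""
    index = {}
    for recording_id, phone_sequences in recordings.items():
        for phones in phone_sequences:
            index.setdefault(trapdoor(key, phones), set()).add(recording_id)
    return index


def search(index: dict, token: str) -> set:
    """Server side: look up a token without seeing the plain phone sequence."""
    return index.get(token, set())


# Hypothetical data: phone strings per recording (made up for this example).
key = b"client-secret-key"
recordings = {
    "call-001": ["l ay b ao r", "f ih k s"],
    "call-002": ["m ah n iy"],
}
index = build_index(key, recordings)
hits = search(index, trapdoor(key, "l ay b ao r"))
print(hits)
```

An index like this still reveals which tokens repeat across recordings, so it is only a starting point for thinking about the problem.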

DEEP SPIKING NEURAL NETWORKS FOR SPEECH ENHANCEMENT

Recurrent lateral inhibitory spiking networks for speech enhancement. J Wall, C Glackin, N Cannings, G Chollet, N Dugan. International Joint Conference on Neural Networks (IJCNN), pp. 1023-1028, 2016.

TECHNICAL

CONVOLUTIONAL NEURAL NETWORKS FOR ACOUSTIC MODELLING

TIMIT Speech Corpus

1.4M spectrograms for the training set

Sliding window used for timing

4 to 5 phones in each 0.256 second window

61 Phoneme Classes ?

- Beaten the current NTIMIT. State of the art!
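
The sliding window mentioned on this slide can be sketched as follows. TIMIT audio is sampled at 16 kHz, so a 0.256 second window covers 4096 samples. The hop size (`hop_seconds`) and the function name are assumptions, since the slide does not give them.

```python
SAMPLE_RATE_HZ = 16_000   # TIMIT sampling rate
WINDOW_SECONDS = 0.256    # window length from the slide


def sliding_windows(num_samples, hop_seconds=0.064, sample_rate=SAMPLE_RATE_HZ):
    """Return (start, end) sample indices of each full window over a signal.

    hop_seconds is a hypothetical value chosen for illustration.
    """
    window = round(WINDOW_SECONDS * sample_rate)
    hop = round(hop_seconds * sample_rate)
    return [
        (start, start + window)
        for start in range(0, num_samples - window + 1, hop)
    ]


# One second of audio, expressed as a sample count.
windows = sliding_windows(SAMPLE_RATE_HZ)
print(len(windows), windows[0], windows[-1])
```

Each window would then be turned into one spectrogram image for the network; a shorter hop gives more, heavily overlapping training examples.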

TECHNICAL

HOW FAST?

[Chart axis: Times Real Time]

UNDERSTANDING

100x Realtime using P5000
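
A speed of 100x real time means an hour of audio is processed in one hundredth of an hour. As a quick worked example (the function name and the 1000-hour workload are illustrative, not figures from the talk):

```python
def processing_hours(audio_hours, realtime_factor=100.0):
    """Wall-clock hours needed to process audio at a given real-time factor."""
    return audio_hours / realtime_factor


one_hour_in_seconds = processing_hours(1) * 3600   # about 36 seconds
backlog_hours = processing_hours(1000)             # 10 hours for 1000 hours of calls
print(one_hour_in_seconds, backlog_hours)
```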

TELEFONICA/O2

But this is just the beginning: voice data is generated not only inside the organisation but also externally, for example as YouTube content.

One area commonly forgotten is mobile telephony. MiFID II now places a strong requirement on recording not just calls made from a regulated organisation's premises, but mobile calls as well.

Intelligent Voice is working with Telefonica/O2 to capture, index and analyse mobile phone calls, and to introduce them into a compliance and monitoring workflow for MiFID II.

CREDIBILITY

WHAT IS WRONG WITH THESE STATEMENTS?

“Woke up at 7:30. Had a shower. Made breakfast and read the newspaper. At 8:30, drove to work.”

“We should have done a better job.”

“That’s their way of doing things.”

“You’d better ask them.”

Alleged robbery victim: “The man asked for my money.”

“He told me not to look at him. He said he would shoot me if I screamed.”

CREDIBILITY INDICATORS

Pronouns: Omission, improper use, higher rates of third person plural pronouns

Complexity: Parameters such as number of letters/syllables per word, higher word count, higher rate of pauses

Speaking verbs: Strong tone (told, demanded, telling), soft tone (said, asked, stated, saying) – tone changes

Tempo: Slow tempo (indicator of cognitive load), fast tempo (indicator of arousal and negative effects)

Pitch: Higher pitch/lower voice quality at specific times are indications of fraudulent related utterances

Specific Words: Explainers (so, since, therefore, because…)

These are just a few of the indicators of suspicious language
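
A few of the lexical indicators above can be counted directly from a transcript. This is a rough sketch: the verb and explainer lists come from the slide, the list of third person plural pronouns and the function name are assumptions, and the counts are only cues for a reviewer to follow up.

```python
import re
from collections import Counter

STRONG_VERBS = {"told", "demanded", "telling"}
SOFT_VERBS = {"said", "asked", "stated", "saying"}
EXPLAINERS = {"so", "since", "therefore", "because"}
# Assumed list; the slide names the category but not the words.
THIRD_PERSON_PLURAL = {"they", "them", "their", "theirs", "themselves"}


def count_indicators(text):
    """Count simple lexical credibility cues in a transcribed statement."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words=len(words))
    for word in words:
        if word in STRONG_VERBS:
            counts["strong_verb"] += 1
        elif word in SOFT_VERBS:
            counts["soft_verb"] += 1
        elif word in EXPLAINERS:
            counts["explainer"] += 1
        elif word in THIRD_PERSON_PLURAL:
            counts["third_person_plural"] += 1
    return counts


statement = "He told me not to look at him. He said he would shoot me if I screamed."
print(count_indicators(statement))
```

On the robbery statement from the earlier slide this picks up one strong verb ("told") and one soft verb ("said"), the kind of tone change the indicators list describes.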

CREDIBILITY NETWORK

[Diagram labels: Voice Activity Detection, i-vector diarization, GPU-accelerated RNN-based Speech to Text, Embedding, LSTM]

INTERVIEWER: What happened next?

CALLER: He told me not to look at him. He said he would shoot me if…

Strong tone followed by Weak tone

Inspired by recurrent networks for named entity recognition and part of speech tagging

We can use bi-directional recurrent networks to attach credibility tags to the speech transcription

Bi-directionality is important for context

Network can tag explainers, changes in tone, pronouns etc.
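
To see why bi-directionality matters for context, here is a minimal pure-Python sketch of a bi-directional recurrent layer. It swaps the LSTM for a plain tanh (Elman) recurrent cell, uses random untrained weights, and leaves out the tag classifier, so it only illustrates how each token's representation combines left and right context. All names, sizes and the toy vocabulary are assumptions.

```python
import math
import random


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_cell(input_size, hidden_size, rng):
    """Weights for one direction of a plain tanh recurrent cell (not an LSTM)."""
    return {
        "w_in": random_matrix(hidden_size, input_size, rng),
        "w_rec": random_matrix(hidden_size, hidden_size, rng),
    }


def run_cell(cell, inputs):
    """Return the hidden state after each input vector, in input order."""
    hidden = [0.0] * len(cell["w_in"])
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_in_row, x))
                + sum(w * h for w, h in zip(w_rec_row, hidden))
            )
            for w_in_row, w_rec_row in zip(cell["w_in"], cell["w_rec"])
        ]
        states.append(hidden)
    return states


def bidirectional_states(forward_cell, backward_cell, inputs):
    """Concatenate left-to-right and right-to-left states for every token."""
    forward = run_cell(forward_cell, inputs)
    backward = run_cell(backward_cell, inputs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
EMBED, HIDDEN = 4, 3
vocab = ["he", "told", "me", "said", "stop", "go"]
embedding = {word: [rng.uniform(-1, 1) for _ in range(EMBED)] for word in vocab}
forward_cell = make_cell(EMBED, HIDDEN, rng)
backward_cell = make_cell(EMBED, HIDDEN, rng)


def encode(tokens):
    vectors = [embedding[token] for token in tokens]
    return bidirectional_states(forward_cell, backward_cell, vectors)


a = encode(["he", "told", "me", "stop"])
b = encode(["he", "told", "me", "go"])
print(len(a), len(a[0]))
```

The state for "told" has the same forward half in both sentences but a different backward half, because only the right-to-left pass sees the final word. A tagger reading these states can therefore use what comes after a word as well as what came before it.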

SPEAKER IDENTIFICATION

RASTA SOX MATLAB PYTHON RASTA 12

Dialect identification via images and DIGITS

NIST evaluation of 500 hours and 20 dialects

NIST EVALUATION

Preliminary Results

[Bar chart: results per dialect, axis 0 to 100; bar values not recoverable]

Dialect labels shown: English-Portuguese-Brazilian, Spanish-Spanish-European, Chinese-Min_Dong, Arabic, Chinese-Cantonese, Arabic-Egyptian, English-British, Spanish-Caribbean, Slavic-Russian, Arabic-Maghrebi, Chinese-Mandarin, Arabic-Iraqi, English-American, Chinese-Wu, Slavic-Polish, French-Haitian, Arabic-Levantine, French-West_African

CELEBRITY SOUND A LIKE

https://celebsoundalike.com/

Tweet your results to @intelligentvox

CONCLUSION

THANK YOU

nigel.cannings@intelligentvoice.com
@intelligentvox
