The twenty-four-member BUT Speech@FIT team at the Faculty of Information Technology of BUT consists of experts from eleven countries. In their offices, you will hear English as often as Czech. And the very thing that might at first sight divide the researchers is what unites them all: a shared passion for speech and languages in all their forms.
Call centres, psychologists and intelligence services - these are all "customers" of the so-called FIT speakers. "We are in the business of mining data from speech. Some would say we are in the speech recognition business, but that narrows our scope quite a bit. We are simply trying to get the maximum possible information out of speech," says Jan Černocký, the head of the research group. In the office of these world leaders in their field, a clarinet and a pile of documents sit on the table, and a scooter leans against the door. Moments before, Černocký had been walking down the corridor of the department to invite one of his colleagues to be interviewed.
"Speech processing has recently become very close to natural language processing. This is what Santosh here, who has one foot in speech and the other in text, is dealing with," says Černocký as he smoothly hands over to another member of the research group. Santosh Kesiraju came to FIT eight years ago. We all speak English together, but as time goes on, the speakers convince me more and more that the specific language doesn't matter.
It doesn't matter how many languages you speak
"Let me give you an example. Somewhere in the world a disaster happens, and it is, for example, in an area where people speak Somali or Bengali, languages for which modern language technology is not available. You need to find out what's going on there and whether people need help," says Kesiraju, describing one of his projects. The source of the data might be local TV news, which needs to be automatically translated into English - and ideally very quickly. "Now I'm working on translating speech into text. That is, a person speaks in one language, but the transcript is then written in another language. It can be used, for example, for automatic subtitling of movies and other applications," continues Santosh Kesiraju.
He focuses primarily on translating languages that have little or no written record. Kesiraju excitedly explains: "One of them is Tamasheq, which is spoken by about one million people in North Africa. Linguists have managed to translate some of the recordings of the local news into French. So we have spoken words in Tamasheq and a written translation in French, and at the same time we don't know what the written version looks like in the original language." This will not result in a perfect translation, but the general information and the topic of conversation can be obtained without much difficulty.
How we played at being drug dealers
Generally speaking, the Brno researchers can find out whatever they want from the available recordings. "We can identify the language, the particular speaker and, to some extent, stress and intonation. In one of our projects we are working with psychotherapists to develop technologies that will improve the quality of psychotherapy sessions," says Jan Černocký, giving a few examples, and goes into more detail: "A good therapist wants to improve. Sometimes the recording of the session is analysed by a mentor, who finds out who is talking more, whether the session is flowing, whether there are any problems. But most of the time, these tasks fall directly on the therapist, and it is difficult to adequately perform the role of analyst as well." The DeePsy project was created in collaboration with psychotherapists from Masaryk University.
It certainly cannot be said that the work of Brno's speakers will just end up in a drawer somewhere. Thanks to cooperation with universities, intelligence services and air traffic controllers, the algorithms from BUT are actually used and are a great help in many ways. When the work is also fun, you understand what the international success of BUT Speech@FIT is based on. Jan Černocký confirms: "We are in the Roxanne project, a large European security project that is trying to link speech processing with criminal network analysis. Within this we try to uncover the behavioural patterns that underlie communication between these people. We also have real police officers working with us, but because we don't have access to 'hot' cases and data, we had to create the data ourselves. We played drug dealers and called each other in different languages."
The researchers are also currently working on simplifying the handling of calls to the 112 emergency line, which would help emergency responders in situations such as major disasters, when they are overwhelmed with calls. Another project in progress aims to simplify communication between air traffic controllers and pilots. Computer scientists from the Brno University of Technology have also completed a project on mining information from the voices of people calling call centres. And the list could go on.
"Hello, who's calling? And are you human?"
"I'm not afraid that artificial intelligence will enslave us or that robots will start shooting at us, but deepfakes are already very real, and it will get worse and worse," Jan Černocký says when asked about artificial intelligence and synthetic voices. Today, anyone can not only create a bot that speaks with their own voice but, thanks to the vast amount of recordings and data available, can also mimic almost any public figure with ease. So the speakers, in collaboration with IT security experts from a neighbouring department, submitted a proposal for a project to help verify who is actually speaking and whether it is a human or an artificial voice.
"The quality of deepfakes is already very good and will only get better. The tools will be freely available to everyone, so it is to be expected that crime committed this way will increase. Older people in particular will be very vulnerable. Today, we know what spam looks like in an email or in a physical mailbox, but if someone calls you from a number you know - and it can be done now - and speaks in your loved one's voice, they can do a lot of harm."
What other areas are still challenging for scientists in the field of speech processing? According to Santosh Kesiraju, it is the recognition of emotions: "It's very hard to recognize emotions just from the voice. For example, when a person laughs, we cannot say for sure whether he or she is happy or excited. We can say that it is generally a more positive emotion, but sometimes it can be laughter from stress." And Jan Černocký nods: "How do you expect a computer to know how a person is feeling just from their voice when even we humans can't agree?"
Source: Události na VUT 04/2022-2023 (BUT quarterly magazine in Czech)