Markus Hauenstein

Objective Speech-Quality Assessment

Imagine you have several speech codecs and you want to find out which of them sounds best. What do you do? Well, the natural approach would be to ask some people to listen to the codecs and to say which one they prefer. Since people have different listening habits and expectations, the results you get will be ambiguous: some people prefer codec A, others codec B, a few codec C, and so on. One way of handling this problem is to ask the listeners to express their opinions as scores, just as pupils are given marks in school. In the end, you collect all the scores for each codec and calculate a mean opinion score (MOS) for it. The speech codec with the highest MOS would then be the one you choose for your new mobile telephone network, or for whatever else you intend to do with the codec.

Quality	MOS
excellent	5
good	4
fair	3
poor	2
bad	1
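
The averaging step can be sketched in a few lines of Python. This is only an illustration: the codec names and the listener ratings are invented for the example, and the function name `mean_opinion_scores` is hypothetical.

```python
from statistics import fmean

# Rating scale from the table above.
RATING_SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}


def mean_opinion_scores(ratings):
    """Map each codec to the mean of its listeners' numeric scores."""
    return {
        codec: fmean(RATING_SCALE[label] for label in labels)
        for codec, labels in ratings.items()
    }


# Invented example data: four listeners rate two codecs.
ratings = {
    "codec A": ["good", "fair", "good", "excellent"],
    "codec B": ["fair", "poor", "fair", "good"],
}

mos = mean_opinion_scores(ratings)
best_codec = max(mos, key=mos.get)  # codec with the highest MOS
```

A real listening test would of course involve many more listeners and speech samples than this toy data set.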

The approach I described above is often called the subjective assessment of speech quality. It should be clear that this method is full of drawbacks. First of all, you need a representative group of listeners. Then you need balanced speech material to be used for the assessment. Furthermore, you must keep the listeners well motivated during the test, and the tests must not be too long, since they are very tiring.

A lot of strategies have been developed to cope with these problems, so there are subjective test methods that lead to more or less reliable results. Nevertheless, the lack of reproducibility and the high cost of such tests remain difficult problems. If you are developing a new codec, it is nearly impossible to run a full subjective test every time you change the codec a bit.

Wouldn't it be much nicer to have the codecs judged by an instrument? Such an apparatus could be just a computer program, and thus neither too expensive nor too time-consuming. We would feed the program with the processed speech material and (if needed) the original. The program would calculate for a short while and then give us a score indicating the quality of the codec. This approach is called the instrumental or objective assessment of speech quality.

The instrumental way would be cheap, fast and reproducible. This is why a lot of people - for example speech codec developers - want to have such a program. Of course, its results shouldn't differ very much from the results you would obtain from a well-performed subjective test. And that is exactly the problem! The instrumental measures we know up to now - for example the simple but famous SNR (signal-to-noise ratio) - do not reliably predict what subjective tests would tell us. The problem grows as the bit rate decreases: speech codecs that operate at low bit rates, and are therefore especially interesting for mobile applications, introduce many disturbances of different types into the speech signal. Some of them are audible and some are not.
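
To make the SNR mentioned above concrete, here is a minimal sketch in Python. It assumes that the original and the processed signal are time-aligned lists of samples of equal length; the function name `snr_db` is chosen just for this example.

```python
import math


def snr_db(original, processed):
    """Signal-to-noise ratio in dB; the 'noise' is the sample-wise difference."""
    signal_energy = sum(x * x for x in original)
    noise_energy = sum((x - y) ** 2 for x, y in zip(original, processed))
    if noise_energy == 0:
        return math.inf   # processed signal is identical to the original
    if signal_energy == 0:
        return -math.inf  # silent original, only noise present
    return 10.0 * math.log10(signal_energy / noise_energy)


# Toy signals: the processed signal is the original attenuated by 10 %.
original = [1.0, -1.0, 1.0, -1.0]
processed = [0.9, -0.9, 0.9, -0.9]
snr = snr_db(original, processed)
```

The measure treats every sample difference alike, whether a listener could hear it or not, which is one reason why it says little about perceived quality.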

My goal as a researcher was to find better instrumental measures for the assessment of speech codecs. These measures should correlate highly with subjective results. The most promising approaches still seem to be measures that model human auditory processing.

Here you find a diagram showing a rather simple instrumental measure (BSD, Bark Spectral Distortion) based on psychoacoustic results.
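
The general idea behind such a measure can be sketched in Python. Please note that this is a strongly simplified illustration and not the BSD measure itself: it only groups the power spectrum into Bark bands and compares the band energies, whereas published BSD formulations typically also apply an equal-loudness weighting and a conversion to loudness. The Hz-to-Bark conversion uses the Zwicker-Terhardt approximation; all function names, the frame length and the normalisation are assumptions made for this example.

```python
import cmath
import math


def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark (critical-band) scale."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def bark_band_energies(frame, sample_rate):
    """Sum the spectral power of a frame into bands that are 1 Bark wide."""
    n = len(frame)
    bands = [0.0] * (int(hz_to_bark(sample_rate / 2.0)) + 1)
    for k, power in enumerate(power_spectrum(frame)):
        bands[int(hz_to_bark(k * sample_rate / n))] += power
    return bands


def bark_spectral_distance(reference, processed, sample_rate, frame_len=64):
    """Squared Bark-band energy difference, normalised by the reference energy."""
    numerator = 0.0
    denominator = 0.0
    usable = min(len(reference), len(processed))
    for start in range(0, usable - frame_len + 1, frame_len):
        ref = bark_band_energies(reference[start:start + frame_len], sample_rate)
        proc = bark_band_energies(processed[start:start + frame_len], sample_rate)
        numerator += sum((r - p) ** 2 for r, p in zip(ref, proc))
        denominator += sum(r * r for r in ref)
    return numerator / denominator if denominator > 0 else 0.0


# Toy example: a 500 Hz tone, disturbed by a weak or a strong 3000 Hz tone.
RATE = 8000
tone = [math.sin(2 * math.pi * 500 * t / RATE) for t in range(256)]
weak = [x + 0.1 * math.sin(2 * math.pi * 3000 * t / RATE) for t, x in enumerate(tone)]
strong = [x + 0.5 * math.sin(2 * math.pi * 3000 * t / RATE) for t, x in enumerate(tone)]

d_same = bark_spectral_distance(tone, tone, RATE)
d_weak = bark_spectral_distance(tone, weak, RATE)
d_strong = bark_spectral_distance(tone, strong, RATE)
```

Identical signals give a distance of zero, and in this toy example the stronger disturbance gives the larger distance. Whether such a distance agrees with listeners' opinions is exactly what has to be checked against subjective test results.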