# I. Introduction

Understanding speech in adverse listening conditions with auditory prostheses such as hearing aids or cochlear implants (CIs) is very challenging. Noise reduction strategies address this problem: they try to remove as much noise as possible from the mixture of the target speech and the interfering sound, with the objective of increasing speech intelligibility (SI) and/or improving the speech quality of the processed signal. A typical constraint of such strategies is that distortions of the target signal should be avoided.

Usually, noise reduction algorithms operate on a time-frequency representation of the noisy input signal and apply a gain to each time-frequency point to suppress the noise. The pattern of the gain function over all time-frequency points is often called a mask. Most of these time-frequency domain approaches derive their gains as a function of the short-term signal-to-noise ratio (SNR) in the respective time-frequency point.

There is an ongoing discussion about the choice of the gain function that best improves SI and speech quality in normal-hearing (NH) listeners. A very popular choice is the so-called binary mask (BM). The mask is motivated by the auditory masking phenomenon: its binary values preserve the time-frequency points where the target is dominant (i.e., where the short-term SNR is above a threshold). The BM exploits the sparsity and disjointness of the target and interferer spectra. When a priori knowledge of the signal and noise spectra is used in the derivation of the mask, the mask is often called the ideal binary mask (IBM). Under certain listening conditions, approaches based on BMs, with and without a priori knowledge for the mask computation, can increase SI in NH and hearing-impaired listeners.

In contrast to the hard-decision approach of the BM, state-of-the-art noise reduction algorithms derive a soft mask with continuous gain values. Such algorithms demonstrate improved speech quality compared to BM-processed output. A popular representative of this class of algorithms is the Wiener filter (WF), which has been shown to be very promising in terms of quality improvement.
When a priori knowledge is used to calculate the gain function of the WF, the approach is referred to as the ideal Wiener filter (IWF). It has been shown that the IWF restores perfect intelligibility with a Bark-scale frequency resolution even at very low SNRs, in both multi-talker babble noise and interfering-talker scenarios. This is in stark contrast to the performance of the IBM, which yielded intelligibility scores of around 60% at the low SNRs.

In this study, the potential of the IWF and the IBM approaches in terms of SI and speech quality is investigated with regard to their application in CIs. The tests are carried out on two groups of participants: a group of NH subjects listening to noise-vocoded versions of the processed signals, which serve as a model of CI processing, and a group of CI users. Because NH volunteers are relatively easy to recruit, tests on NH listeners presented with noise-vocoded versions of the processed signals, comparable to CI processing, are often used as a first step to evaluate speech enhancement strategies for application in CIs. In our case, the inclusion of such tests with noise-vocoded sounds (also referred to as CI simulations) allows us to investigate whether the intelligibility scores obtained with NH listeners translate to the scores of CI users as well.

The aims of this study are the following. We wish to investigate, in terms of SI, which mask pattern is more beneficial for NH subjects listening to noise vocoder CI simulations and for CI users. This is of particular interest for CI users, because noise reduction approaches based on time-frequency masks can be added to the signal processing chain of existing clinical coding strategies without significant effect on other stages. Furthermore, we study the influence of estimation errors on SI for both groups of listeners, as it has been shown that CI users are less sensitive to speech distortions.
The design of the study allows us to investigate whether the SI results obtained with NH listeners using CI simulations can be translated to those of CI users. Additionally, we want to study the potential for speech quality improvement of both mask patterns in CI users.

# II. Signal Processing

In CIs of Cochlear, Ltd., up to $N = 22$ envelopes are extracted in the frequency range up to 8 kHz. Therefore, such CIs usually operate with a frequency resolution that is close to the Bark-scale spectrum used in previous work. The signal model and processing used in this study are very similar to those of that work; for completeness, we present them briefly below.

# a) Signal Model

Denote the time-discrete signal recorded by the microphone as $y(t)$, where $t$ is the sample index. The signal $y(t)$ consists of the target signal $s(t)$ and the additive interference $v(t)$. This additive signal model for the recorded signal can be written as

$$y(t) = s(t) + v(t). \tag{1}$$

Because the IBM and IWF speech enhancement approaches operate in the frequency domain, the short-time frequency representation of the signal in (1) is written, with frame index $n$ and frequency index $k$, as

$$Y(n,k) = S(n,k) + V(n,k). \tag{2}$$

$Y(n,k)$ is the microphone signal in the time-frequency domain; $S(n,k)$ and $V(n,k)$ represent the target signal and the interferer, respectively. The estimate $\hat{S}(n,k)$ of the target signal is obtained by applying the time-frequency mask $G(n,k) \in [0,1]$, yielded by the IBM or the IWF approach, to $Y(n,k)$. Thus, the output of the speech enhancement step can be written as

$$\hat{S}(n,k) = G(n,k)\,Y(n,k). \tag{3}$$

Both the IBM and the IWF approaches derive their respective masks as a function of the short-term SNR $\xi(n,k)$, which is defined as the ratio between the power spectral density (PSD) of the target signal, $\Phi_{SS}(n,k)$, and the PSD of the interferer, $\Phi_{VV}(n,k)$:

$$\xi(n,k) = \frac{\Phi_{SS}(n,k)}{\Phi_{VV}(n,k)}. \tag{4}$$

Usually, the PSDs of the target signal and the interfering sound are computed with the Welch method, one implementation of which is a first-order recursive smoothing of the respective periodograms. Since we deal with ideal estimates of the parameters of the IBM and the IWF approach, we can approximate the PSDs with

$$\Phi_{SS}(n,k) = |S(n,k)|^2 \tag{5}$$

$$\Phi_{VV}(n,k) = |V(n,k)|^2. \tag{6}$$

# i. Ideal Binary Mask

The IBM $G_{\mathrm{IBM}}$ consists of binary weights. $G_{\mathrm{IBM}}$ is equal to 1 when the short-term SNR reaches or exceeds a threshold value, and 0 when the SNR is lower than this threshold. In this study, the threshold used was the global input SNR $\xi_{\mathrm{in}}$. The mask can be written as

$$G_{\mathrm{IBM}}(n,k) = \begin{cases} 1, & \text{if } \xi(n,k) \ge \xi_{\mathrm{in}} \\ 0, & \text{else.} \end{cases} \tag{7}$$

Note that for a given combination of $s(t)$ and $v(t)$, the mask pattern is constant and independent of the input SNR: scaling the interferer changes the short-term SNR and the threshold by the same factor. This choice is termed the local threshold. The binary gain function $G_{\mathrm{IBM}}$ is applied to the input signal $Y$ to obtain the enhanced output $\hat{S}_{\mathrm{IBM}}$:

$$\hat{S}_{\mathrm{IBM}}(n,k) = G_{\mathrm{IBM}}(n,k)\,Y(n,k). \tag{8}$$

# ii. Ideal Wiener Filter

The gain function $G_{\mathrm{IWF}}$ of the WF approach takes continuous values between 0 and 1. It is obtained as the minimum mean-squared error estimate of the complex spectral amplitude,

$$\min E\left\{ \left| S(n,k) - \hat{S}(n,k) \right|^2 \right\}, \tag{9}$$

and can be written as

$$G_{\mathrm{IWF}}(n,k) = \frac{\xi(n,k)}{1 + \xi(n,k)}. \tag{10}$$

The corresponding estimate $\hat{S}_{\mathrm{IWF}}$ may then be written as

$$\hat{S}_{\mathrm{IWF}}(n,k) = G_{\mathrm{IWF}}(n,k)\,Y(n,k). \tag{11}$$

# iii. Simulation of Estimation Errors

To investigate the influence of estimation errors in the mask patterns of the WF and the BM on SI, such errors were simulated. Because over- and underestimation errors influence SI differently, we use a previously described approach that generates a balanced pattern of estimation errors. For the mask derivation, the spectra of the target and the noise signal were corrupted with an additional noise term, which can be written as

$$\tilde{S}(n,k) = S(n,k) + \Delta_S(k) \tag{12}$$

$$\tilde{V}(n,k) = V(n,k) + \Delta_V(k) \tag{13}$$

where $\Delta_S(k)$ and $\Delta_V(k)$ are complex random variables with zero mean and power equal to that of the respective clean signal in the frequency band $k$. The corrupted spectra influence the short-term SNR estimation in (4) and, thereby, the mask computation. This results in corrupted mask patterns $G_{\mathrm{BM}}$ and $G_{\mathrm{WF}}$. When referring to results and patterns obtained in the condition with perturbed estimates, the masks are called BM and WF. The output signals for the BM and the WF mask are

$$\hat{S}_{\mathrm{BM}}(n,k) = G_{\mathrm{BM}}(n,k)\,Y(n,k) \tag{14}$$

$$\hat{S}_{\mathrm{WF}}(n,k) = G_{\mathrm{WF}}(n,k)\,Y(n,k). \tag{15}$$

The corrupted parameter estimates are used only for the mask pattern estimates. The corrupted masks are applied to the original, unperturbed mixture in (14) and (15). This manner of simulating estimation errors allows for both under- and overestimation of the instantaneous PSD estimate. Additionally, such a perturbation of the underlying spectrum has the advantage that it does not preserve the silence periods of the speech and/or interference. Thus, the musical noise phenomenon will be present in such speech/interference pauses [26]. This lends realism to the simulation.

# b) General Processing Steps

The processing steps to generate the stimuli that are presented to the listener are shown in Fig. 1. The first six processing steps are the same for both groups of listeners. The processing steps that differ between NH listeners and CI users are represented by the black trace for the noise band vocoder CI simulation with NH listeners and by the dashed gray trace for the electrical stimulation with CI users. In the first step, the target signal $s(t)$ and the interfering signal $v(t)$ (sampled at 16 kHz) are filtered with a pre-emphasis filter whose frequency response is that of the SP12 microphone of the Freedom speech processor of Cochlear, Ltd. The result of this pre-emphasis is a boost of the higher frequencies.
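Before the remaining processing steps, the mask definitions in (7) and (10) and the error model in (12) to (15) can be illustrated with a minimal NumPy sketch. It is not the implementation used in the study. All function names are hypothetical, and several choices are assumptions that the text does not specify: the global input SNR is approximated by the ratio of total spectral energies, the perturbation is drawn as complex Gaussian noise independently for every frame, and the binary threshold in the perturbed condition is still taken from the unperturbed spectra.

```python
import numpy as np


def short_term_snr(S, V, floor=1e-12):
    """Short-term SNR per time-frequency point, following (4)-(6).

    The floor on the interferer power only guards against division by
    zero; it is an implementation detail of this sketch.
    """
    return np.abs(S) ** 2 / np.maximum(np.abs(V) ** 2, floor)


def global_snr(S, V):
    """Assumed stand-in for the global input SNR: ratio of total energies."""
    return np.sum(np.abs(S) ** 2) / np.sum(np.abs(V) ** 2)


def binary_mask(S, V, threshold):
    """Binary gain as in (7): 1 where the short-term SNR reaches the threshold."""
    return (short_term_snr(S, V) >= threshold).astype(float)


def wiener_mask(S, V):
    """Wiener gain as in (10): xi / (1 + xi), always within [0, 1]."""
    xi = short_term_snr(S, V)
    return xi / (1.0 + xi)


def perturb(X, rng):
    """Add zero-mean complex noise with per-band power equal to that of X.

    X has shape (frames, bands). Gaussian noise drawn independently per
    frame is an assumption of this sketch.
    """
    band_power = np.mean(np.abs(X) ** 2, axis=0, keepdims=True)
    noise = rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape)
    return X + np.sqrt(band_power / 2.0) * noise


def enhance(S, V, rng=None):
    """Apply both masks to the unperturbed mixture Y = S + V.

    With rng given, the masks are derived from perturbed spectra as in
    (12)-(13), but still applied to the clean mixture as in (14)-(15).
    """
    Y = S + V
    threshold = global_snr(S, V)  # assumption: taken from unperturbed spectra
    if rng is not None:
        S, V = perturb(S, rng), perturb(V, rng)
    return {"bm": binary_mask(S, V, threshold) * Y,
            "wf": wiener_mask(S, V) * Y}
```

Calling `enhance(S, V)` corresponds to the ideal conditions (IBM, IWF), while `enhance(S, V, rng=np.random.default_rng(0))` corresponds to the perturbed conditions (BM, WF). Because the threshold is the global input SNR, scaling `V` leaves the binary mask pattern unchanged in this sketch.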
A square-root Hann window was used as the analysis window. The envelope extraction is done by grouping the magnitude-squared DFT coefficients into $N$ frequency bands. This process is applied to the target and the interfering signal to calculate the power spectral density (PSD) estimates used for the gain computation in (7) and (10).

# c) Noise Band Vocoder as CI Simulation

The number of channels used for the noise band vocoder CI simulation was set to $N = 8$, because asymptotic SI performance for most CI users is reached with the current clinical speech processing strategies with eight effective channels [27]. The cutoff frequencies used to obtain the band-pass filtered envelopes are 187.5, 437.5, ..., 7937.5 Hz. These cutoff values for the band-pass filters correspond to bandwidths of 250, 250, 375, 500, 750, 1125, 1750, and 2750 Hz for the eight channels. Signal components below 187.5 Hz are not considered in the signal processing. Finally, all noise-vocoded channels are added to obtain the final audio stimulus that can be presented acoustically to the NH listener.

# d) Cochlear Implants

The current clinical CI device of Cochlear, Ltd., can stimulate 22 channels. Therefore, for most patients, $N = 22$ frequency bands are processed in the envelope extraction stage. In this study, all six patients used a frequency resolution of 22 channels. The advanced combination encoder (ACE) strategy, which is the default speech processing strategy in CIs of Cochlear, Ltd., does not stimulate all available channels in each time frame. The ACE strategy consists of a maxima selection stage in each frame, where the M