Infant-directed speech (IDS) has the important functions of capturing the infants’ attention and maintaining communication between the mother and the infant. It is known that three acoustic components (F0, F0 range, and tempo) differ between IDS and adult-directed speech (ADS). However, it is not easy to discriminate between IDS and ADS using procedural approaches because of the wide range of individual differences. In this paper, we propose a novel approach to discriminate between IDS and ADS that uses mel-frequency cepstral coefficients and a hidden Markov model-based speech discrimination algorithm; this approach is not based on the prosodic features of F0, F0 range, and tempo. The average discrimination accuracy of the proposed algorithm is 84.34%. The objective accuracy of the discrimination models has been confirmed using the head-turn preference procedure, which measures infants’ listening duration to auditory stimuli of IDS and ADS. These results suggest that the proposed algorithm may enable a robust and reliable classification of mothers’ speech and that infant attention to the mothers’ speech may depend on IDS clarity.
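The discrimination scheme described above can be illustrated with a minimal sketch, assuming one Gaussian HMM per speech class and a maximum-likelihood decision. The abstract does not give the model topology, feature dimensions, or training procedure, so everything below (the two-state toy models, the two-dimensional stand-in features, and the names `log_likelihood`, `classify`, and `toy_model`) is hypothetical and only shows the general technique.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian, one value per state."""
    # x: (D,), mean and var: (S, D) -> result: (S,)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def log_likelihood(frames, hmm):
    """Forward algorithm in the log domain: log p(frames | hmm)."""
    log_pi, log_A = np.log(hmm["pi"]), np.log(hmm["A"])
    alpha = log_pi + log_gauss(frames[0], hmm["mean"], hmm["var"])
    for x in frames[1:]:
        # Sum over previous states (axis 0) for each next state, in the log domain.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
        alpha = alpha + log_gauss(x, hmm["mean"], hmm["var"])
    return np.logaddexp.reduce(alpha)

def classify(frames, models):
    """Return the label of the model that assigns the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(frames, models[label]))

def toy_model(center):
    """Hypothetical two-state HMM; real models would be trained on MFCC frames."""
    return {
        "pi": np.array([0.6, 0.4]),
        "A": np.array([[0.7, 0.3], [0.4, 0.6]]),
        "mean": np.array([[center, center], [center + 1.0, center - 1.0]]),
        "var": np.ones((2, 2)),
    }

# Toy stand-ins for trained IDS and ADS models (assumption, not from the paper).
models = {"IDS": toy_model(2.0), "ADS": toy_model(-2.0)}
```

In practice each utterance would be converted to a sequence of MFCC vectors and passed to `classify`; the toy models here merely separate two clusters of synthetic frames.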
Effect of Stimulation Rate on Cochlear Implant Users’ Phoneme, Word and Sentence Recognition in Quiet and in Noise
High stimulation rates in cochlear implants (CIs) offer better temporal sampling, can induce stochastic-like firing of auditory neurons, and can increase the electric dynamic range, all of which could improve CI speech performance. While commercial CIs have employed increasingly high stimulation rates, no clear or consistent advantage has been shown for high rates. In this study, speech recognition was acutely measured with experimental processors in 7 CI subjects (Clarion CII users). The stimulation rate varied between approximately 600 and 4800 pulses per second per electrode (ppse), and the number of active electrodes varied between 4 and 16. Vowel, consonant, consonant-nucleus-consonant word, and IEEE sentence recognition were acutely measured in quiet and in steady noise (+10 dB signal-to-noise ratio). Subjective quality ratings were obtained for each of the experimental processors in quiet and in noise. Except for a small difference for vowel recognition in quiet, there were no significant differences in performance among the experimental stimulation rates for any of the speech measures. There was also a small but significant increase in subjective quality rating as stimulation rates increased from 1200 to 2400 ppse in noise. Consistent with previous studies, performance significantly improved as the number of electrodes was increased from 4 to 8, but there was no significant difference among 8, 12, and 16 electrodes. Altogether, there was little-to-no advantage of high stimulation rates in quiet or in noise, at least for the present speech tests and conditions.
In order to acquire their native language, infants must learn to identify and segment word forms in continuous speech. This word segmentation ability is thus crucial for language acquisition. Previous behavioral studies have shown that it emerges during the first year of life, and that early segmentation differs according to the language in acquisition. In particular, linguistic rhythm, which differs across classes of languages, has been found to have an early impact on segmentation abilities. For French, behavioral evidence showed that infants could use the rhythmic unit appropriate to their native language (the syllable) to segment fluent speech by 12 months of age, but failed to show whole word segmentation at that age, a surprising delay compared to the emergence of segmentation abilities in other languages. Given the implications of such findings, the present study reevaluates the issue of whole word and syllabic segmentation, using an electrophysiological method, high-density ERPs (event-related potentials), rather than a behavioral technique, and by testing French-learning 12-month-olds on bisyllabic word segmentation. The ERP data show evidence of whole word segmentation while also confirming that French-learning infants rely on syllables to segment fluent speech. They establish that segmentation and recognition of words/syllables happen within 500 milliseconds of their onset, and raise questions regarding the interaction between syllabic segmentation and multisyllabic word recognition.
from Brain Research
Background: There is evidence that, unlike in typical populations, initial lexical activation upon hearing spoken words in aphasic patients is not a direct reflection of the goodness of fit between the presented stimulus and the intended target. Earlier studies have mainly used short monosyllabic target words. Short words are relatively difficult to recognise because they are not highly redundant: changing one phoneme will often result in a (similar-sounding) different word.
Aims: The present study aimed to investigate sensitivity of the lexical recognition system in aphasia. The focus was on longer words that contain more redundancy, to investigate whether aphasic adults might be impaired in deactivation of strongly activated lexical candidates. This was done by studying lexical activation upon presentation of spoken polysyllabic pseudowords (such as procodile) to see to what extent mismatching phonemic information leads to deactivation in the face of overwhelming support for one specific lexical candidate.
Methods & Procedures: Speeded auditory lexical decision was used to investigate response time and accuracy to pseudowords with a word-initial or word-final phonemic mismatch in 21 aphasic patients and in an age-matched control group.
Outcomes & Results: Results of an auditory lexical decision task showed that aphasic participants were less sensitive to phonemic mismatch if there was strong evidence for one particular lexical candidate, compared to the control group. Classifications of patients as Broca’s vs Wernicke’s or as fluent vs non-fluent did not reveal differences in sensitivity to mismatch between aphasia types. There was no reliable relationship between measures of auditory verbal short-term memory and lexical decision performance.
Conclusions: It is argued that the aphasic results can best be viewed as lexical “overactivation” and that a verbal short-term memory account is less appropriate.
* Esther Janse is now also at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands. This study was part of a larger collaborative research programme Auditory processing in speakers with acquired or developmental language disorders. The Dutch Organisation for Scientific Research (NWO) is gratefully acknowledged for funding this research project.
A dual-task interference paradigm was used to investigate the effect of perceptual effort on recall of spoken word lists by young and older adults with good hearing and with mild-to-moderate hearing loss. In addition to poorer recall accuracy, listeners with hearing loss, especially older adults, showed larger secondary task costs while recalling the word lists, even though the stimuli were presented at a sound intensity that allowed correct word identification. Findings support the hypothesis that the extra effort at the sensory–perceptual level that accompanies hearing loss has negative consequences for downstream recall, an effect that may be further magnified with increased age.
from Psychology and Aging
This paper presents a quantitative and comprehensive study of the lip movements of a given speaker in different speech/nonspeech contexts, with a particular focus on silences (i.e., when no sound is produced by the speaker). The aim is to characterize the relationship between “lip activity” and “speech activity” and then to use visual speech information as a voice activity detector (VAD). To this end, an original audiovisual corpus was recorded with two speakers involved in a face-to-face spontaneous dialog, although they were in separate rooms. Each speaker communicated with the other using a microphone, a camera, a screen, and headphones. This system was used to capture separate audio stimuli for each speaker and to synchronously monitor the speaker’s lip movements. A comprehensive analysis was carried out on the lip shapes and lip movements in either silence or nonsilence (i.e., speech+nonspeech audible events). A single visual parameter, defined to characterize the lip movements, was shown to be efficient for the detection of silence sections. This results in a visual VAD that can be used in any kind of environmental noise, including intricate and highly nonstationary noises, e.g., multiple and/or moving noise sources or competing speech signals.
©2009 Acoustical Society of America
Identifying isolated, multispeaker Mandarin tones from brief acoustic input: A perceptual and acoustic study
Lexical tone identification relies primarily on the processing of F0. Since F0 range differs across individuals, the interpretation of F0 usually requires reference to specific speakers. This study examined whether multispeaker Mandarin tone stimuli could be identified without cues commonly considered necessary for speaker normalization. The sa syllables, produced by 16 speakers of each gender, were digitally processed such that only the fricative and the first six glottal periods remained in the stimuli, neutralizing the dynamic F0 contrasts among the tones. Each stimulus was presented once, in isolation, to 40 native listeners who had no prior exposure to the speakers’ voices. Chi-square analyses showed that tone identification accuracy exceeded chance as did tone classification based on F0 height. Acoustic analyses showed contrasts between the high- and low-onset tones in F0, duration, and two voice quality measures (F1 bandwidth and spectral tilt). Correlation analyses showed that F0 covaried with the voice quality measures and that tone classification based on F0 height also correlated with these acoustic measures. Since the same acoustic measures consistently distinguished the female from the male stimuli, gender detection may be implicated in F0 height estimation when no context, dynamic F0, or familiarity with speaker voices is available.
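The comparison of identification accuracy against chance mentioned above can be sketched as a chi-square goodness-of-fit computation. The abstract reports no counts, so the trial numbers below are invented for illustration, and the chance level of 0.25 assumes a four-alternative tone response, which the abstract does not state. The function name `chi_square_vs_chance` is likewise hypothetical.

```python
def chi_square_vs_chance(n_correct, n_total, p_chance):
    """Chi-square goodness-of-fit statistic (df = 1) for correct/incorrect counts."""
    observed = [n_correct, n_total - n_correct]
    expected = [n_total * p_chance, n_total * (1 - p_chance)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value of the chi-square distribution for df = 1 at alpha = 0.05.
CRITICAL_DF1_ALPHA_05 = 3.841

# Hypothetical counts (not from the study): 140 correct out of 400 trials.
n_correct, n_total, p_chance = 140, 400, 0.25
stat = chi_square_vs_chance(n_correct, n_total, p_chance)

# The statistic alone is direction-blind, so also require accuracy above chance.
above_chance = stat > CRITICAL_DF1_ALPHA_05 and n_correct / n_total > p_chance
```

Because the statistic does not encode direction, the sketch checks separately that observed accuracy lies above the chance level before calling the result above chance.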
©2009 Acoustical Society of America
Characteristics of phonation onset were investigated in a two-layer body-cover continuum model of the vocal folds as a function of the biomechanical and geometric properties of the vocal folds. The analysis showed that an increase in either the body or cover stiffness generally increased the phonation threshold pressure and phonation onset frequency, although the effectiveness of varying body or cover stiffness as a pitch control mechanism varied depending on the body-cover stiffness ratio. Increasing body-cover stiffness ratio reduced the vibration amplitude of the body layer, and the vocal fold motion was gradually restricted to the medial surface, resulting in more effective flow modulation and higher sound production efficiency. The fluid-structure interaction induced synchronization of more than one group of eigenmodes so that two or more eigenmodes may be simultaneously destabilized toward phonation onset. At certain conditions, a slight change in vocal fold stiffness or geometry may cause phonation onset to occur as eigenmode synchronization due to a different pair of eigenmodes, leading to sudden changes in phonation onset frequency, vocal fold vibration pattern, and sound production efficiency. Although observed in a linear stability analysis, a similar mechanism may also play a role in register changes at finite-amplitude oscillations.
©2009 Acoustical Society of America
Masking release for low- and high-pass-filtered speech in the presence of noise and single-talker interference
Speech intelligibility was measured for sentences presented in spectrally matched steady noise, single-talker interference, or speech-modulated noise. The stimuli were unfiltered or were low-pass (LP) (1200 Hz cutoff) or high-pass (HP) (1500 Hz cutoff) filtered. The cutoff frequencies were selected to produce equal performance in both LP and HP conditions in steady noise and to limit access to the temporal fine structure of resolved harmonics in the HP conditions. Masking release, or the improvement in performance between the steady noise and single-talker interference, was substantial with no filtering. Under LP and HP filtering, masking release was roughly equal but was much less than in unfiltered conditions. When the average F0 of the interferer was shifted lower than that of the target, similar increases in masking release were observed under LP and HP filtering. Similar LP and HP results were also obtained for the speech-modulated-noise masker. The findings are not consistent with the idea that pitch conveyed by the temporal fine structure of low-order harmonics plays a crucial role in masking release. Instead, any reduction in speech redundancy, or manipulation that increases the target-to-masker ratio necessary for intelligibility to beyond around 0 dB, may result in reduced masking release.
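Masking release as defined above is a simple difference score, which a short sketch can make concrete. The percent-correct values below are invented and chosen only to mirror the qualitative pattern reported (equal steady-noise performance under LP and HP filtering, and smaller release under filtering than without); they are not data from the study. The function name `masking_release` is an assumption.

```python
def masking_release(score_steady, score_fluctuating):
    """Improvement in performance from steady noise to a fluctuating masker."""
    return score_fluctuating - score_steady

# Hypothetical percent-correct scores: (steady noise, single-talker interference).
hypothetical_scores = {
    "unfiltered": (40.0, 75.0),
    "LP": (40.0, 50.0),
    "HP": (40.0, 49.0),
}

release = {
    condition: masking_release(steady, fluctuating)
    for condition, (steady, fluctuating) in hypothetical_scores.items()
}
```

The same difference could be expressed in terms of speech reception thresholds rather than percent correct, in which case the sign convention would be reversed.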
This paper reports the development of a quantitative target approximation (qTA) model for generating F0 contours of speech. The qTA model simulates the production of tone and intonation as a process of syllable-synchronized sequential target approximation [Xu, Y. (2005). “Speech melody as articulatorily implemented communicative functions,” Speech Commun. 46, 220–251]. It adopts a set of biomechanical and linguistic assumptions about the mechanisms of speech production. The communicative functions directly modeled are lexical tone in Mandarin, lexical stress in English, and focus in both languages. The qTA model is evaluated by extracting function-specific model parameters from natural speech via supervised learning (automatic analysis by synthesis) and comparing the F0 contours generated with the extracted parameters to those of natural utterances through numerical evaluation and perceptual testing. The F0 contours generated by the qTA model with the learned parameters were very close to the natural contours in terms of root mean square error, rate of human identification of tone and focus, and judgment of naturalness by human listeners. The results demonstrate that the qTA model is both an effective tool for research on tone and intonation and a potentially effective system for automatic synthesis of tone and intonation.
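Target approximation within a single syllable can be sketched as follows, assuming the third-order critically damped linear system form in which qTA is commonly presented; the abstract itself does not state the equation, so the formula, the parameter names (`m`, `b`, `lam`), and the example values are assumptions made for illustration. The target is a line `m*t + b`, and the F0 contour approaches it at a rate set by `lam`, starting from a given initial F0, velocity, and acceleration.

```python
import numpy as np

def qta_syllable(t, m, b, lam, f0_0, df0_0=0.0, ddf0_0=0.0):
    """F0 at time t (s) within one syllable, approaching the target m*t + b.

    Assumed form: f0(t) = (m*t + b) + (c1 + c2*t + c3*t**2) * exp(-lam*t),
    with c1..c3 fixed by the initial F0, velocity, and acceleration at t = 0.
    """
    c1 = f0_0 - b
    c2 = df0_0 + c1 * lam - m
    c3 = (ddf0_0 + 2 * c2 * lam - c1 * lam ** 2) / 2
    transient = (c1 + c2 * t + c3 * t ** 2) * np.exp(-lam * t)
    return (m * t + b) + transient

# Hypothetical example: static target at 120, starting from 100, rate 40 per second.
times = np.linspace(0.0, 0.25, 6)
contour = qta_syllable(times, m=0.0, b=120.0, lam=40.0, f0_0=100.0)
```

In a multi-syllable sequence, the final F0, velocity, and acceleration of one syllable would typically be passed on as the initial state of the next; that transfer is omitted here to keep the sketch short.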
Talkers alter vowel production in response to real-time formant perturbation even when instructed not to compensate
Talkers show sensitivity to a range of perturbations of auditory feedback (e.g., manipulation of vocal amplitude, fundamental frequency and formant frequency). Here, 50 subjects spoke a monosyllable (“head”), and the formants in their speech were shifted in real time using a custom signal processing system that provided feedback over headphones. First and second formants were altered so that the auditory feedback matched subjects’ production of “had.” Three different instructions were tested: (1) control, in which subjects were naïve about the feedback manipulation, (2) ignore headphones, in which subjects were told that their voice might sound different and to ignore what they heard in the headphones, and (3) avoid compensation, in which subjects were informed in detail about the manipulation and were told not to compensate. Despite explicit instruction to ignore the feedback changes, subjects produced a robust compensation in all conditions. There were no differences in the magnitudes of the first or second formant changes between groups. In general, subjects altered their vowel formant values in a direction opposite to the perturbation, as if to cancel its effects. These results suggest that compensation in the face of formant perturbation is relatively automatic, and the response is not easily modified by conscious strategy.
This investigation compares vocal tract dimensions and the classification of singer voices by examining x-ray material assembled between 1959 and 1991 of students admitted to the solo singing education at the University of Music, Dresden, Germany. A total of 132 images were available for analysis. For the different voice classifications, the lengths of the total vocal tract, the pharynx, and the mouth cavity, as well as the relative position of the larynx, the height of the palatal arch, and the estimated vocal fold length, were analyzed statistically, and some significant differences were found. The length of the pharynx cavity seemed particularly influential on the total vocal tract length, which varied systematically with classification. Also studied were the relationships between voice classification and body height, weight, and body mass index. The data support the hypothesis that there are consistent morphological vocal tract differences between singers of different voice classifications.
The interference of different background noises on speech processing in elderly hearing impaired subjects
The objective of the investigation is to study the interference of different background noises on speech processing. For this purpose, speech recognition with the Hagerman test and a test battery with speech comprehension tasks (SVIPS) were performed in speech-weighted background noises varying in temporal structure, signal-to-noise ratio (SNR), and meaningfulness. With different test criteria and a score of perceived effort, the aim was to get a more complete picture of speech comprehension under adverse listening situations. Twenty-four subjects, aged 56–83 years, with a bilateral sensorineural hearing impairment, participated in the study. Differences in performance between the different background noises varied depending on the speech processing task, the SNR, and on quantitative versus qualitative outcome measures. Age effects were seen in the Hagerman test and especially in background conditions of modulated noises (speech and reversed speech). Findings are discussed in relation to a hypothesis suggesting that masking and distraction interference from background noises on speech processing at peripheral, central auditory, and cognitive levels depends on the SNR used, the noise type, and the listening task.
from the International Journal of Audiology
Scientists at the University of Rochester have shown for the first time that our brains automatically consider many possible words and their meanings before we’ve even heard the final sound of the word.
from Ear and Hearing
Speech-evoked auditory event-related potentials (ERPs) provide insight into the neural mechanisms underlying speech processing. For this reason, ERPs are of great value to hearing scientists and audiologists. This article will provide an overview of ERPs frequently used to examine the processing of speech and other sound stimuli. These ERPs include the P1-N1-P2 complex, acoustic change complex, mismatch negativity, and P3 responses. In addition, we focus on the application of these speech-evoked potentials for the assessment of (1) the effects of hearing loss on the neural encoding of speech allowing for behavioral detection and discrimination; (2) improvements in the neural processing of speech with amplification (hearing aids, cochlear implants); and (3) the impact of auditory training on the neural processing of speech. Studies in these three areas are reviewed and implications for audiologists are discussed.