"The 22 Language corpus consists of telephone speech from 22 languages: Eastern Arabic, Cantonese, Czech, Farsi, French, German, Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and English. Unfortunately French is not available. The corpus contains fixed vocabulary utterances (e.g. days of the week) as well as fluent continuous speech. We were expecting at least 300 callers in each language. Each utterance is verified by a native speaker to determine if the caller followed instructions when answering the prompts. Some of the calls in each language are transcribed orthographically."
"The Alphadigit Corpus is a collection of 78,044 examples from 3,025 speakers saying six digit strings of letters and digits over the telephone. A total of about 82 hours of speech are included in Release 1.3. Each file has an orthographic transcription and time align transcription as well."
This software is a database of spoken letters and digits collected and recorded over the telephone.
ISOLET is a database of letters of the English alphabet spoken in isolation. The database consists of 7,800 spoken letters: two productions of each letter by 150 speakers. It contains approximately 1.25 hours of speech. The recordings were made under quiet laboratory conditions with a noise-canceling microphone.
"The Center for Spoken Language Understanding is available for help with purchasing, product planning or investment decisions. We also work collaboratively with corporations in joint language software development projects."
"The Center for Spoken Language Understanding offers custom evaluations of speech synthesizers and speech enhancement systems that focus on the specific needs of the corporate customer."
"The Apple Words and Phrases corpus was developed with support from Apple Computer, Inc., who also supplied the list of words and phrases to be collected. This telephone speech corpus contains about 69.5 hours of speech. 998 calls were collected on an analog system and 2010 calls were collected on a digital system. Each caller repeated a list of phrases as they were prompted. The phrases were command and control type phrases, e.g. "help."
"This software is a continuation of the development, validation, and commercialization efforts of Audiology Incorporated to develop hearing testing products that will increase access, efficiency, and accuracy of hearing tests. The methods are based on behavioral science and statistical validation to provide automated tests of hearing that can be performed without the requirement for highly trained professional operators. The software combines several existing products with a new hardware platform that will provide a complete automated testing solution that can be marketed directly to end users (such as audiologists, physicians, hearing aid specialists, nurses, educational institutions, and retail centers.)"
This software is meant to provide quantitative data on a subject's hearing ability.
"This software combines the power of machine-based sensing and computation to improve the study of speech patterns in individuals with autism. Current manual methods for measuring narrative coherence are not only difficult to obtain and extremely time consuming but it is unclear whether the human coder can even detect the statistical degree of semantic similarity as the machine can. This software analyzes recordings being collected from two narrative recall tests that have the potential to uncover a wider range of speech differences between ASD and others. The hope is that this will clinically define children with ASD relative to typically developing children and differentiate ASD from other groups who also have communication impairments, i.e., children with developmental language delay (DLD), as well as differentiate speech characteristics or markers that might better discriminate subtypes within the ASD umbrella (e.g., HFA vs. Asperger's)."
This software analyzes human voice inputs.
"The Cellular Words and Phrases Corpus consists of utterances gathered from callers who were using cellular telephones. Each caller listened and responded to a series of pre-recorded prompts from a fixed protocol. There are 346 callers represented in this corpus."
"The clearspeechjph corpus contains microphone speech from a single speaker (JPH), who spoke 140 sentences (70 sentences each from Material A and Material B) in both "clear" and "conversational" speaking styles.
This sentence material was comprised of 70 IEEE Harvard Psychoacoustic Sentences (Rothauser et al., 1969), which are syntactically and semantically normal sentences, with each sentence containing five keywords (e. g. His shirt was clean but one button was gone). The sentences are phonetically balanced, i. e. they are crafted so that the average frequency of occurrence of each phoneme is representative of the language as a whole.
This material was comprised of 70 syntactically correct, but semantically anomalous sentences (e. g. They slide far across the tiny clock), created by randomizing and exchanging words and grammar structures from Material A. Using identical words and sentence lengths in both materials allowed for direct comparisons between experimental results.
"This software consists of computer-guided lessons for a wide variety of phoneme listening targets, allowing hearing-impaired individuals to attain listening skills without the extensive guidance of a specialist."
"This software is for computer-assisted remediation of expressive and receptive prosody in children with autism spectrum disorders (ASD). The computer program consists of an interactive drama book containing a collection of videotaped social scenarios. These scenarios represent different social situations requiring prosody, both receptive and communicative, highlighting prosody's role in affecting others and driving the selection of events. Interpersonal dramas in the game can unfold in different ways, controlled by the ASD subject's responses."
This software is an educational game for children with autism meant to improve their prosody.
"We propose a new algorithm that uses a unit-dependent trainable parameterized cross-fading weight function to generate more natural-looking formant trajectories and, it is hoped, better-sounding output speech. The proposed algorithm:
- uses a perceptually-based objective function to capture
differences between cross-faded and natural trajectories
across the whole region of the phoneme, and
- uses phoneme identity, prosodic contexts, and acoustic
features of the units to predict optimal cross-fading parameters to generate more natural formant trajectories."
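A toy sketch of what such a parameterized cross-fading weight function might look like. The sigmoid form and the `midpoint`/`steepness` parameters are illustrative assumptions standing in for the paper's trainable, unit-dependent parameterization:

```python
import math

def crossfade_weight(t, midpoint=0.5, steepness=8.0):
    """Sigmoid cross-fading weight in [0, 1]; t is normalized time in [0, 1].

    midpoint and steepness play the role of the trainable, unit-dependent
    parameters (illustrative parameterization only).
    """
    return 1.0 / (1.0 + math.exp(-steepness * (t - midpoint)))

def crossfade(left_traj, right_traj, midpoint=0.5, steepness=8.0):
    """Blend two formant trajectories sampled on the same time grid."""
    n = len(left_traj)
    out = []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 0.0
        w = crossfade_weight(t, midpoint, steepness)
        out.append((1.0 - w) * left_traj[i] + w * right_traj[i])
    return out
```

A perceptually based objective function would then score the mismatch between such cross-faded trajectories and natural ones over the whole phoneme region.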
The algorithm "renders the lexical content of an utterance unintelligible, while preserving important acoustic prosodic cues, as well as naturalness and speaker identity."
An algorithm to recover the original airflow signal from a frequency-modulated signal.
We use this algorithm as implemented by the function demod in Matlab’s signal processing toolbox.
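A minimal NumPy sketch of this kind of FM demodulation, as an illustrative stand-in for Matlab's demod(x, fc, fs, 'fm'): form the analytic signal, take the instantaneous frequency from the unwrapped phase, and subtract the carrier.

```python
import numpy as np

def fm_demodulate(x, fc, fs):
    """Recover the message from an FM signal (illustrative analogue of
    MATLAB's demod(..., 'fm')).

    The derivative of the instantaneous phase of the analytic signal,
    minus the carrier frequency fc, is proportional to the message.
    """
    n = len(x)
    # Analytic signal via FFT (same construction as scipy.signal.hilbert)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)  # Hz, length n - 1
    return inst_freq - fc  # deviation from carrier ~ message
```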
"This software is an automated, objective system for detecting early warning signs of autism in infants. The approach is non-invasive and uses an in-home system comprising low-cost, off-the shelf equipment in the form of microphones, video cameras, and accelerometers. Data generated by the system is transmitted via internet protocol to a central processing facility where innovative algorithms -- which are the core contribution of the proposed study -- extract diagnostic profiles. Unlike other diagnosis and detection methods for autism, which rely on behavioral assessment and subjective clinical judgment along with parent questionnaires, these diagnostic profiles are objective and based on sophisticated computer analysis of voice and movement patterns and hence are expected to be more reliable, accurate, and information-rich."
"This software is meant as an improvement on existing parser-derived and tagger-derived features within discriminative approaches to language modeling for automatic speech recognition. Discriminative language modeling approaches provide a tremendous amount of flexibility in defining features, but the size of the potential parser-derived feature space requires efficient feature annotation and selection algorithms. The project had four specific aims. The first aim was to develop a set of efficient, general, and scalable syntactic feature selection algorithms for use with various kinds of annotation and several parameter estimation techniques. The second aim was to develop general tree and grammar transformation algorithms designed to preserve selected feature annotations yet lead to faster parsing or even tagging approximations to parsing. The third aim was to evaluate a broad range of feature selection and grammar transformation approaches on a large vocabulary continuous speech recognition (LVCSR) task, namely Switchboard. The final aim was to design and package the algorithms to straightforwardly support future research into other applications, such as machine translation (MT); and into other languages, such as Chinese and Arabic. The algorithms developed as a part of this project are expected to contribute to improvements in LVCSR accuracy and applications that rely upon this technology. The algorithms are being packaged into a publicly available software library, enabling researchers working in many application areas -- including LVCSR and MT -- and various languages to investigate best practices in syntactic language modeling for their specific task, without having to hand-select and evaluate feature sets."
This software models language for automatic speech recognition.
"This software is meant as a development in finite-state syntactic processing models for natural language that use features encoding global structural constraints derived through multiple sequence alignment (MSA) techniques, to significantly improve accuracy without expensive context-free inference. MSAs are widely used in computational biology for building finite-state models that capture long-distance dependencies in sequences (e.g., in RNA secondary structure). Given a large set of functionally aligned sequences in MSA format, finite-state models can be constructed that allow for the efficient alignment of new sequences with the given MSA. In natural language processing (NLP), only very rarely have MSA techniques been used, and then to characterize phonetic or semantic similarity. This software explores the definition of a purely syntactic functional alignment between semantically unrelated strings from the same language, to define a structural MSA for constructing finite-state syntactic models. The software had two specific aims. The first aim was to develop natural language sequence processing algorithms and models that could: a) define sequence alignments with respect to syntactic function; b) build structural MSAs based on defined functional alignments; c) derive finite-state models to efficiently align new sequences with the built MSA; and d) extract features from an alignment with the MSA for improved sequence modeling. The second aim was to empirically validate this approach within a number of large-scale text processing applications in multiple domains and languages. These algorithms are expected to provide improved finite-state natural language models that will contribute to the state-of-the-art in critical text processing applications."
This software processes natural language.
Children with Autism Spectrum Disorders (ASD) exhibit varying levels of communication abilities. This software aims to develop a communication system for such children. Resulting technology could also benefit other children and adults with adequate cognition but limited communication options. The software is meant as an assistive communication facilitation device referred to as the RSVP Keyboard. It unites three technologies: 1) Rapid serial visual presentation (RSVP, with individually adjustable presentation rates) of letters/words/phrases; 2) a yes/no intent detection mechanism based on detecting evoked-response potentials (ERP) in the brain to determine which target letter or letters the child wants to convey; 3) a statistical language model based dynamic sequencing optimization procedure that computes which letter needs to be presented next to take advantage of regularities in language. The system operates by showing the sequence of candidate letters on the screen as well as previously typed text, such that words and phrases are formed naturally by adding selected letters.
"The organizational algorithm is examined as a computational approach to representing interpersonal learning."
Software generated as a result of this project, led by Jan van Santen and Lois Black, and joint with Rhea Paul and Fred Volkmar at Yale's Child Study Center and Larry Shriberg at the University of Wisconsin's Waisman Center, focuses on automated technologies for assessment of prosodic ability in autism. Autistic Spectrum Disorders (ASD) form a group of neuropsychiatric conditions whose core behavioral features include impairments in reciprocal social interaction and in communication, and repetitive, stereotyped, or restricted interests and behaviors. The importance of prosodic deficits in the adaptive communicative competence of speakers with ASD, as well as for a fuller understanding of the social disabilities central to these disorders, is generally recognized; yet current studies are few in number and have significant methodological limitations. The objective of the project was to detail prosodic deficits in young speakers with ASD through a series of experiments that address these disabilities and related areas of function. Key features of the project include: 1) the application of innovative technology: the study applied computer-based speech and language technologies for quantifying expressive prosody, for computing dialogue structure, and for generating acoustically controlled speech stimuli for measuring receptive prosody; moreover, all experiments were delivered via computer to ensure consistency of stimuli and accuracy of recording responses; 2) broad coverage of the dimensions of prosody: all three functions of prosody (grammatical, pragmatic, and affective) are addressed, expressive and receptive tasks are included, and both contextualized tasks (dialogue, story comprehension and memory) and decontextualized tasks (e.g., vocal affect recognition) were used; 3) inclusion of neuropsychological assessment and classification methodologies to address within-group heterogeneity and obtain a detailed characterization of the groups; 4) inclusion of two comparison groups: children with typical development and those with Developmental Language Disorder; and 5) inclusion of an experimental treatment program to enhance the prosodic abilities of speakers with ASD.
Children with autism spectrum disorder (ASD) have often been observed to express affect either weakly, in only one modality at a time (e.g., choice of words), or in multiple modalities but not in a coordinated fashion. These difficulties in crossmodal integration of affect expression may have roots in certain global characteristics of brain structure in autism, specifically atypical interconnectivity between brain areas. Poor crossmodal integration of affect expression may also play a critical role in the communication difficulties that are well documented in ASD. Not understanding how, e.g., facial expression can be used to modify the interpretation of words undermines social reciprocity. Impairment in crossmodal integration of affect is thus a potentially powerful explanatory concept in ASD. Software related to this project addresses the need for data on expressive crossmodal integration impairment in ASD and its association with receptive crossmodal integration impairment by using innovative technologies to create stimuli for a judgmental procedure that makes possible independent assessment of the individual modalities; these technologies are critical because human observers are not able to selectively filter out modalities. In addition, the vocal measures and the audiovisual database lay the essential groundwork for the next step: creation of audiovisual analysis methods for automated assessment of expressive crossmodal integration. These methods are applied to audio-visual recordings of a structured play situation; the child participates in this play situation twice, once with a caregiver and once with an examiner. This procedure for measuring expressive crossmodal integration is complemented by a procedure for measuring crossmodal integration of affect processing using dynamic talking-face stimuli in which the audio and video streams are recombined (preserving perfect synchrony of the facial and vocal channels) to create stimuli with congruent vs. incongruent affect expression.
"The Foreign Accented English (FAE) corpus consists of American English utterances by non-native speakers. The corpus contains 4925 telephone quality utterances from native speakers of 23 languages. Three independent judgments of accent were made on each utterance by native American English speakers."
This software implements new theoretical models and technology to automatically convert descriptive text into 3D scenes representing the text's meaning. It does this via the Scenario-Based Lexical Knowledge Resource (SBLR), a resource created from existing sources (PropBank, WordNet, FrameNet) and from automated mining of Wikipedia and other un-annotated text. In addition to predicate-argument structure and semantic roles, the SBLR includes necessary roles, typical role fillers, contextual elements, and activity poses, which enable analysis of input sentences at a deep level and assembly of appropriate elements from libraries of 3D objects to depict the fuller scene implied by a sentence. For example, "Terry ate breakfast" does not tell us where he ate (kitchen, dining room, restaurant) or what he ate (cereal, doughnut, or rice, umeboshi, and natto). These elements must be supplied from knowledge about typical role fillers appropriate for the information that is specified in the input. Note that the SBLR has a component that varies by cultural context.
A real-coded genetic algorithm that efficiently estimates parameters of a formant trajectory model. The genetic algorithm uses roulette-wheel selection and elitism to minimize the root mean square error between the observed formant trajectory and the model trajectory. Parameters, including vowel and consonant target values and coarticulation parameters, are estimated for a corpus of English clear and conversational CVC words. Results show consistent consonant formant targets, even when those consonants do not themselves have formant structure. We also present findings of a relationship between a coarticulation parameter and the consonant identity.
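A minimal pure-Python sketch of such a real-coded genetic algorithm. The trajectory model is passed in as a function; the population size, arithmetic crossover, Gaussian mutation, and other settings below are illustrative choices, not the published configuration:

```python
import math
import random

def rmse(a, b):
    """Root mean square error between two equal-length trajectories."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def roulette_select(population, fitnesses):
    """Roulette-wheel selection: pick an individual with probability
    proportional to its fitness."""
    r = random.uniform(0.0, sum(fitnesses))
    acc = 0.0
    for ind, f in zip(population, fitnesses):
        acc += f
        if acc >= r:
            return ind
    return population[-1]

def estimate(model, observed, bounds, pop_size=60, generations=200,
             mutation_rate=0.1, elite=2):
    """Real-coded GA with roulette-wheel selection and elitism,
    minimizing RMSE between observed and model trajectories."""
    pop = [[random.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(pop_size)]
    for _ in range(generations):
        errors = [rmse(model(p), observed) for p in pop]
        fits = [1.0 / (1e-9 + e) for e in errors]  # low error -> high fitness
        ranked = sorted(zip(errors, pop))
        nxt = [list(p) for _, p in ranked[:elite]]  # elitism: keep the best
        while len(nxt) < pop_size:
            a, b = roulette_select(pop, fits), roulette_select(pop, fits)
            child = [(x + y) / 2.0 for x, y in zip(a, b)]  # arithmetic crossover
            for i, (lo, hi) in enumerate(bounds):
                if random.random() < mutation_rate:
                    child[i] = min(hi, max(lo, child[i] +
                                   random.gauss(0, 0.1 * (hi - lo))))
            nxt.append(child)
        pop = nxt
    return min(pop, key=lambda p: rmse(model(p), observed))
```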
"The vast variability of the human speech signal remains a central challenge for Text-to-Speech (TTS) systems. The objective of this research is to develop TTS technologies that focus on elimination of concatenation errors, and accurate speech modifications in the areas of coarticulation, degree of articulation, prosodic effects, and speaker characteristics. The investigators are exploring an asynchronous interpolation model (AIM), which promises to provide for high-quality and flexible TTS. The core idea of AIM is to represent a short region of speech as a composition of several types of features called streams. Each stream is computed by asynchronous interpolation of basis vectors.
Each basis vector is associated with a particular phoneme, allophone, or more specialized unit. Thus, the speech region is described by the varying degrees of influence of several types of preceding and following acoustic features. Using AIM, the investigators are also developing methods to optimally compress the acoustic inventories of TTS systems, given a size or a quality constraint, and to adapt the system to a new voice, given a few training samples. The system being researched forms a hybrid between traditional concatenative and formant-based synthesis, having advantages of both, resulting in a high-quality, optimized TTS system with voice adaptation capabilities. TTS has generally recognized societal benefits for universal access, education, and information access by voice. Our research will make it possible, for example, to build personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds."
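The core AIM idea can be sketched as follows: each frame of a stream is a weighted sum of basis vectors, where each basis vector's weight trajectory may be shifted in time independently, hence "asynchronous" interpolation. The flat-list representation and function names below are illustrative assumptions:

```python
def interpolate_stream(basis_vectors, weight_trajectories):
    """Compute one AIM 'stream' as a per-frame weighted sum of basis vectors.

    basis_vectors: list of feature vectors (e.g., one per phoneme, allophone,
        or more specialized unit).
    weight_trajectories: weight_trajectories[i][t] is the influence of basis
        vector i at frame t; each trajectory may be time-shifted independently,
        which is what makes the interpolation asynchronous.
    """
    n_frames = len(weight_trajectories[0])
    dim = len(basis_vectors[0])
    frames = []
    for t in range(n_frames):
        frame = [0.0] * dim
        for b, w in zip(basis_vectors, weight_trajectories):
            for d in range(dim):
                frame[d] += w[t] * b[d]
        frames.append(frame)
    return frames
```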
Uses the following algorithms:
- ESPS get_formant algorithm
- Dynamic time warping (DTW) algorithm
- Pitch-synchronous sinusoidal synthesis algorithm
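The DTW algorithm listed above is a standard technique; a textbook implementation (not the project's actual code) looks like this:

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping distance between two sequences.

    Fills a cumulative-cost table where each cell extends the cheapest of
    the three allowed moves (match, insertion, deletion)."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # a advances
                              D[i][j - 1],      # b advances
                              D[i - 1][j - 1])  # both advance
    return D[n][m]
```

In speech applications the sequences are frames of acoustic features and `dist` is a frame-level distance rather than a scalar difference.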
The "hybridization" algorithm (1) extracts conversational (CNV) and clear (CLR) features from the same sentences spoken in both CNV and CLR styles, then (2) constitutes a "hybrid" (HYB) feature set from a particular subset of CLR features and from the complementary subset of CNV features, and finally (3) synthesizes HYB sentences from the HYB features.
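Step (2), constituting the hybrid feature set, amounts to a selection over named features; the feature names in this sketch are made up for illustration:

```python
def hybridize(clr_features, cnv_features, take_from_clr):
    """Build a hybrid (HYB) feature set: features named in take_from_clr
    come from the clear-speech rendition, all others from the
    conversational rendition of the same sentence."""
    return {name: (clr_features[name] if name in take_from_clr
                   else cnv_features[name])
            for name in cnv_features}
```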
Many children with autism who have limited verbal abilities use Augmentative and Alternative Communication (AAC) devices to help them communicate with others. Often, these devices produce speech output. Necessarily, the voice of such a system does not resemble in any way the voice of the child who uses the system. The software from this project is for children who have at least some speech capability, such as saying a few isolated words. The technology performs a voice transplant of the child's natural voice onto the AAC device, so that the device's voice will sound like the child.
The Kids' Speech Corpus was developed to facilitate research on the characteristics of kids' speech at different ages and to train and evaluate recognizers for use in language training and other interactive tasks involving children. For instance, this corpus was used to train recognizers used in language development with deaf children. In cooperation with the Forest Grove School District, speech was gathered from children in grades K through 10. Approximately 100 children at each grade level have been recorded.
"The software from this research project enables next generation dialogue systems to be able to collaborate with a user without the limitations of system-initiative interaction in order to solve complex tasks in an optimal manner. The research develops reinforcement learning (RL) strategies to learn dialogue policies that are mixed-initiative. The specific aims of this are to (a) extend RL to mixed-initiative dialogue interaction; (b) allow the system policy to adapt to different user types, such as people with poor memory, or poor problem-solving skills; and (c) simultaneously learn the policy for the simulated user.
This approach will allow more advanced dialogue systems to be deployed, such as assisting the elderly so they can live independently longer, and helping provide health care information to rural areas. The proposed research project will result in a toolkit that will allow a wide range of users to easily develop dialogue policies. The toolkit will (a) allow students to be effectively trained in this area, (b) lower the barrier for other researchers to contribute to the field, and (c) help transfer this new technology to industry."
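The RL setup described above can be sketched with generic tabular Q-learning trained against a simulated user. The state/action encoding, the `simulated_user` protocol, and all hyperparameters below are illustrative assumptions, not the project's actual algorithm:

```python
import random
from collections import defaultdict

def q_learn(simulated_user, start_state, actions, episodes=500,
            alpha=0.2, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning of a dialogue policy against a simulated user.

    simulated_user(state, action) -> (next_state, reward, done)
    Returns the learned Q-table mapping (state, action) -> value.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            if random.random() < epsilon:          # explore
                a = random.choice(actions)
            else:                                  # exploit
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = simulated_user(s, a)
            best_next = max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

In the project's setting the simulated user itself is also learned, and the state would encode dialogue context and user type.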
Software resulting from this project includes new algorithms that enable dysarthric individuals to be more easily understood. Currently available devices are essentially spectral filters and amplifiers that enhance certain parts of the spectrum. While these can help certain types of dysarthria, many dysarthric persons suffer from speech problems that require forms of speech modification that are much more profound and complex, such as: irregular sub-glottal pressure, resulting in loudness bursts that can be difficult to adjust to; absence, or poor control, of voicing; systematic mispronunciation of certain phoneme groups, resulting in certain sounds becoming indistinguishable or unrecognizable; variable mispronunciation; and poor prosody (pitch control, timing, and loudness). For these difficult problems, new approaches are needed that do not merely filter the speech signal but analyze it at acoustic, articulatory, phonetic, and linguistic levels.
The objective of software from this project is to develop techniques to objectively (automatically) measure spoken language variability and change in aging. Many of the most effective methods for cognitive assessment are mediated by observed behavior, particularly spoken language production. These include clinical instruments, e.g., the Mini Mental Status Examination (MMSE), but also less formal assessments involving interviews or dialogs with physicians or even friends and family. Behavioral changes noted through these spoken language interactions could indicate pathological changes associated with a disorder; or the changes may be transient, due to missing medication or depression at the time of assessment. Alternatively, the observed behavior may be simply due to normal change in spoken language due to aging, or even within the range of natural behavioral variation. Understanding normal versus pathological language change with age requires the collection and annotation of repeated samples from both healthy and impaired individuals. This project has three specific aims: 1) to collect and transcribe longitudinal spoken language sample data elicited in multiple ways from diverse elderly adults; 2) to develop algorithms for automatically extracting features from these spoken language samples; and 3) to characterize the variability of feature values across samples from the same individual, as well as the utility of feature values and even feature variances for discriminating between subject groups. A particular challenge being addressed by this research is to achieve high-quality, efficient automatic annotation of discourse structure for the spoken language samples. The resulting methods are expected to directly contribute to important behavioral assessment applications.
This project focuses on applying a model used in text-to-speech synthesis (TTS) to the task of automatic speech recognition (ASR). The standard method in ASR for addressing variability due to phonemic context, or coarticulation, requires a large amount of training data and is sensitive to differences between training and testing conditions. Despite the effective use of stochastic models, current ASR systems are often unable to sufficiently account for the large degree of variability observed in speech. In many cases, this variability is not due to random factors, but is due to predictable changes in the speech signal. These factors are currently modeled in order to generate speech via TTS, but they are not yet modeled in order to recognize speech, largely because of non-local dependencies. This software applies the Asynchronous Interpolation Model (AIM) used in TTS to the task of speech recognition, by decomposing the speech signal into target vectors and weight trajectories, and then searching weight-trajectory and stochastic target-vector models for the highest-probability match to the input signal. The goal of this research is to improve the robustness of ASR to variability that is due to phonemic and lexical context. This improvement will increase the use of ASR technology in automated information access by telephone, educational software, and universal access for individuals with visual, auditory, or speech-production challenges. More effective models of coarticulation may increase our understanding of both human speech perception and speech production.
A portion of the Numbers corpus played through loudspeakers, re-recorded on a 12-channel table-top microphone array in a meeting room.
Note: Currently not available for commercial use.
- slow to 300%
- speed-up to 50%
- lower pitch to 50%
- raise pitch to 200%
- scale formants to 80%
- scale formants to 120%
- mimic child
- mimic man
Software resulting from this NSF-funded project is meant to create a speech interface that supports a user in interacting with multiple real-time devices at the same time, where the interaction with each device is a separate dialogue thread. The first aim is to show, using a human-computer study, that the simple way to implement a speech interface for managing multiple threads is not effective. The second aim is to run a human-human study to show that people can inherently manage multiple dialogue threads, and to determine what conventions they use. The third aim is to build a speech interface that implements the conventions that were found.
The OGI Multi-language Telephone Speech Corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2052 speakers, for a total of about 38.5 hours of speech.
The Names Corpus is a collection of 24,245 first and last name utterances from 20,184 speakers. The utterances were taken from many other telephone speech data collections completed at the CSLU, during which callers were asked to say their first and last names, or asked to leave their name and address to receive an award coupon (addresses are not included in the corpus). Each file in the Names corpus has an orthographic transcription following the CSLU Labeling Guide. Also, to take advantage of the phonemic variability, 24,245 of the utterances have been phonetically transcribed. The selection of files to phonetically transcribe was constrained by a process that selected files suspected to contain phonetic contexts that had not yet been transcribed.
The Cellular Corpus consists of cellular telephone speech from 2336 callers from locations throughout the United States. The data collection protocol contains requests for fixed vocabulary and continuous speech utterances. A total of about one minute of speech from each caller is collected.
This software is a computerized assessment system. It is meant to offer: a clear understanding of which neuropsychological functions are measured; interactivity (the computer adapts its behavior instantly to the subject's responses, thereby being able to operate at a level of optimal sensitivity); instantaneous and timed measurement of a range of behavioral responses, including the force dynamics of button pushing and eye movements; mathematical modeling of the underlying cognitive processes in order to derive "purer" measures of the neuropsychological functions; and a more motivating and shorter assessment process.
The Numbers Corpus is a collection of naturally produced numbers. The utterances were taken from other CSLU telephone speech data collections, and include isolated digit strings, continuous digit strings, and ordinal/cardinal numbers. A total of 23902 files are included in this corpus.
Software from this project addresses how humans perceive acoustic discontinuities in speech. Current text-to-speech synthesis ("TTS") technology operates by retrieving intervals of stored digitized speech ("units") from a database and splicing ("concatenating") them to form the output utterance. Unavoidably, there are acoustic discontinuities at the time points where the successive speech intervals meet. An unsolved problem is how to predict, from the quantitative acoustic properties of two to-be-concatenated units, whether humans will hear a discontinuity. This is of immediate relevance for TTS systems that select units at run time from a large speech corpus. During selection, the system searches through the space of all possible sequences of units that can be used for the utterance and selects the sequence that has the lowest overall objective cost measure, such as the Euclidean distance between the final frame and initial frame of two units. However, research has already shown that this method and related methods do not predict well whether humans will hear a discontinuity. The current research, by being explicitly focused on perceptually optimized objective cost measures, will directly contribute to the perceptual accuracy of cost measures and hence to synthesis quality.
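The kind of frame-distance join cost the passage critiques can be sketched as follows. A greedy selector stands in for the full dynamic-programming search over all unit sequences, and all names here are illustrative (a real system would also include target costs):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def join_cost(unit_a, unit_b):
    """Naive spectral join cost: distance between the last frame of the
    left unit and the first frame of the right unit. The research above
    argues this correlates poorly with perceived discontinuity."""
    return euclidean(unit_a[-1], unit_b[0])

def best_sequence(candidates_per_slot):
    """Greedily pick one unit per slot to minimize summed join cost.

    Each unit is a list of feature frames; candidates_per_slot[i] is the
    candidate list for slot i."""
    seq = [candidates_per_slot[0][0]]
    total = 0.0
    for slot in candidates_per_slot[1:]:
        best = min(slot, key=lambda u: join_cost(seq[-1], u))
        total += join_cost(seq[-1], best)
        seq.append(best)
    return seq, total
```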
"Software resulting from this study aids existing work by (i) performing a series of perception tests that cover a wide range of prosodic contrasts; (ii) using state-of-the-art speech modification techniques to "surgically" manipulate prosodic dimensions; and (iii) relating these results to results from parallel speech production tasks."
The Portland Cellular Corpus consists of utterances gathered from callers who were using cellular telephones. Each participant called a toll-free cellular number, then listened and responded to the protocol. A total of 515 different callers participated.
"Currently, segmentation of QCT scans is done using manual drawing software: the radiation oncologist outlines the prostate region on each of the cross-sectional x-ray slices that contain it. This procedure is tedious and time-consuming. (An experienced radiation oncologist would need to work full-time for about 8 months to segment the 4,000 CT images in the MrOS cohort, assuming 15 minutes per scan; this estimate is unvalidated.) This is a major drawback to using CT images to efficiently diagnose and research prostate diseases. The software resulting from this study is a much-needed computer-based algorithm for segmenting the prostate."
"This project [joint with Alan Black at Carnegie Mellon University and Richard Sproat at AT&T Research] focuses on innovative algorithms for generating highly expressive synthetic speech. Generating expressive speech involves three hard research problems: (i) computing abstract tags that specify, e.g., which words need emphasis and how the utterance is phrased (e.g., where to pause); (ii) computing a fundamental frequency contour from these tags; and (iii) heavily modifying the stored speech fragments ("acoustic units") to realize these contours. The central goal of the project is to address these research problems and create a TTS system that will make the next generation of TTS-based language remediation systems viable."
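Problem (ii), turning abstract tags into a fundamental frequency contour, can be illustrated with a toy target-based model: a linear declination line plus Gaussian-shaped excursions at accented positions. All names and parameter values here are invented for illustration; this is not the project's algorithm:

```python
import math

def f0_contour(n_frames, start=220.0, end=180.0, accents=()):
    """Toy F0 contour generator: linear declination from `start` to
    `end` Hz across the utterance, plus Gaussian-shaped pitch-accent
    excursions. `accents` is a sequence of (frame, peak_hz, width)
    triples, one per emphasized word."""
    contour = []
    for i in range(n_frames):
        f0 = start + (end - start) * i / max(n_frames - 1, 1)
        for pos, peak, width in accents:
            f0 += peak * math.exp(-((i - pos) / width) ** 2)
        contour.append(f0)
    return contour
```

A real system would derive the accent positions and heights from the tags of step (i) and then modify the stored units to follow the resulting contour.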
"The software associated with this project aims to improve upon recent advances in state estimation techniques for nonlinear dynamical systems, including the unscented Kalman filter, the particle filter, and sigma-point filters. Previous methods relied heavily on accurate model knowledge for the specific state estimation domain, whereas many problems involve uncertainty about the dynamic equations and the noise distributions. This software provides robust state estimation algorithms, based on information-theoretic estimation principles, that can handle such uncertainties as well as outliers and sensor failures. These techniques will be used for unobtrusive monitoring of elderly people in their homes using motion sensors and RFID transmitters."
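To make the outlier problem concrete: even a scalar Kalman filter can be hardened with a crude innovation gate that ignores wildly implausible measurements. This is a minimal sketch of the general idea, not the project's information-theoretic algorithm; all parameter values are illustrative:

```python
def kalman_1d(zs, q=1e-3, r=1.0, gate=3.0):
    """Scalar Kalman filter for a random-walk state, with a simple
    outlier guard: a measurement is skipped when its innovation
    exceeds `gate` standard deviations of the innovation variance.
    q = process noise, r = measurement noise; returns the estimates."""
    x, p = 0.0, 1.0
    estimates = []
    for z in zs:
        p += q                              # predict step
        s = p + r                           # innovation variance
        if (z - x) ** 2 <= gate ** 2 * s:   # gate the measurement
            k = p / s                       # Kalman gain
            x += k * (z - x)                # update step
            p *= (1 - k)
        estimates.append(x)
    return estimates
```

With this guard, a single faulty sensor reading (say, a spurious spike from a motion sensor) barely perturbs the estimate, whereas a plain filter would be dragged toward it.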
"This software includes a new algorithm for text-to-speech synthesis (TTS) that will lead to (i) dramatic decreases in disk and memory requirements at a given speech quality level and (ii) minimization of the amount of voice recordings needed to create a new synthetic voice."
"This software provides an automated method for assessment of childhood apraxia of speech, a highly controversial disorder due to a lack of consensus on the features that define it and the etiologic conditions that explain its origin. The term Suspected Apraxia of Speech (sAOS) has been proposed as an interim term for this putative clinical entity. Future iterations of this software will attempt to develop a valid, reliable, and efficient means to classify children as positive for sAOS. This initial iteration of the software includes automated diagnostic markers for sAOS with clinically adequate sensitivity and specificity (> 90% positive and negative likelihood ratios). The four specific aims for future iterations are: (a) to automate and improve the sensitivity and specificity of two existing (manually derived) prosodic markers, (b) to develop four additional automatic, prosody-based diagnostic markers, (c) to derive a single diagnostic index based on a statistical derivative from the six individual markers, and (d) to validate the composite diagnostic marker using classification data obtained from expert clinical researchers."
"This project is conducting fundamental research in statistical language modeling to improve human language technologies, including automatic speech recognition (ASR) and machine translation (MT).
A language model (LM) is conventionally optimized, using text in the target language, to assign high probability to well-formed sentences. This method has a fundamental shortcoming: the optimization does not explicitly target the kinds of distinctions necessary to accomplish the task at hand, such as discriminating (for ASR) between different words that are acoustically confusable or (for MT) between different target-language words that express the multiple meanings of a polysemous source-language word.
Discriminative optimization of the LM, which would overcome this shortcoming, requires large quantities of paired input-output sequences: speech and its reference transcription for ASR or source-language (e.g. Chinese) sentences and their translations into the target language (say, English) for MT. Such resources are expensive, and limit the efficacy of discriminative training methods.
In a radical departure from convention, this project is investigating discriminative training using easily available, *unpaired* input and output sequences: un-transcribed speech or monolingual source-language text, and unpaired target-language text. Two key ideas are being pursued: (i) unlabeled input sequences (e.g. speech or Chinese text) are processed to learn likely confusions encountered by the ASR or MT system; (ii) unpaired output sequences (English text) are leveraged to discriminate these well-formed sentences from the (presumably) ill-formed sentences the system could potentially confuse them with.
This self-supervised discriminative training, if successful, will advance machine intelligence in fundamental ways that impact many other applications."
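The conventional maximum-likelihood language model described above can be illustrated with a toy add-one-smoothed bigram model that scores a well-formed transcription higher than an acoustically confusable alternative. The corpus and word choices here are invented for illustration:

```python
from collections import Counter

def train_bigram(corpus):
    """Train an add-one-smoothed bigram LM on tokenized sentences;
    returns a function that scores a sentence's probability."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks[:-1])              # history counts
        bigrams.update(zip(toks, toks[1:]))
    vocab = len({w for s in corpus for w in s} | {"<s>"})
    def prob(sent):
        p = 1.0
        toks = ["<s>"] + sent
        for a, b in zip(toks, toks[1:]):
            # add-one (Laplace) smoothed conditional probability
            p *= (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
        return p
    return prob
```

Such a model prefers fluent word sequences, but, as the project abstract notes, nothing in its training objective targets the specific confusions an ASR or MT system actually makes; that gap is what discriminative training addresses.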
The Speaker Recognition corpus (formerly known as Speaker Verification) consists of telephone speech from 91 participants. Each participant recorded speech in twelve sessions over a two-year period, answering questions like "what is your eye color" or responding to prompts like "describe a typical day in your life." Most of the utterances in this release of the corpus have corresponding non-time-aligned word-level transcriptions.
"An algorithm for adjusting the magnitude spectrum when the fundamental frequency (F0) of a speech signal is altered. The algorithm exploits the correlation between F0 and the magnitude spectrum of speech as represented by line spectral frequencies (LSFs). This correlation is class-dependent, and thus a broad classification of the input is achieved by a Gaussian mixture model (GMM). The within-class dependencies of LSFs on F0 values are captured by constructing their joint probability densities using a series of GMMs, one for each speech class. The proposed system is used for post-processing the pitch-modified signal. Perceptual tests showed that the addition of this post-processing system improves the naturalness of the pitch-modified signal for large pitch modification factors."
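Predicting a spectral parameter from F0 via a joint probability density is, at its core, GMM regression: each component contributes its Gaussian conditional mean, weighted by the component's posterior given the observed F0. A minimal one-dimensional sketch (the function name and parameterization are illustrative, not the published system):

```python
import math

def gmm_predict_lsf(f0, comps):
    """GMM regression: posterior-weighted sum of per-component
    conditional means E[LSF | F0 = f0]. Each entry of `comps` is
    (weight, mu_f0, mu_lsf, var_f0, var_lsf, cov_f0_lsf) describing
    one joint Gaussian over (F0, LSF)."""
    posts, cond_means = [], []
    for w, mu_f, mu_l, var_f, _var_l, cov_fl in comps:
        # component likelihood of the observed f0 (1-D Gaussian)
        lik = w * math.exp(-(f0 - mu_f) ** 2 / (2 * var_f)) \
                / math.sqrt(2 * math.pi * var_f)
        posts.append(lik)
        # conditional mean of LSF given f0 for this component
        cond_means.append(mu_l + cov_fl / var_f * (f0 - mu_f))
    total = sum(posts)
    return sum(p * m for p, m in zip(posts, cond_means)) / total
```

The real system applies this idea per speech class and per LSF dimension to adjust the whole magnitude spectrum after pitch modification.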
The Spelled and Spoken Words corpus consists of spelled and spoken words. 3647 callers were prompted to say and spell their first and last names, to say what city they grew up in and what city they were calling from, and to answer two yes/no questions. In order to collect sufficient instances of each letter, about 1000 callers also recited the English alphabet with pauses between the letters. Each call was transcribed by two people, and all differences were resolved. In addition, a subset of the calls has been phonetically labeled.
"The Spoltech Brazilian Portuguese corpus contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 480 speakers and 8119 separate utterances. A total of 2579 utterances have been transcribed at the word level (without time alignments), and 5505 utterances have been transcribed at the phoneme level (with time alignments). Protocol design, recording and transcription were performed by the Universidade Federal do Rio Grande do Sul and the Universidade de Caxias do Sul."
The Stories Corpus consists of extemporaneous speech collected from English speakers in the CSLU Multi-language Telephone Speech data collection. Each speaker was asked to speak on a topic of their choice for one minute.
"This software demonstrates methods for synthesis of speaker identity from a relatively small, realistic amount of training recordings. This constraint serves not only to force the work to create solutions relevant for real-world applications, but also to develop and apply tools for exploring a key perceptual question: What speech features are critical for speaker identification by humans?"
"This package of software programs is meant for the automatic analysis of spontaneous language samples from children with neurodevelopmental disorders. The program will be usable directly by clinicians in their assessment of patients."
"This software provides transformation algorithms that operate either in the formant frequency domain (for didactic purposes) or the line spectral frequency domain (for full-featured transformation); their performance was evaluated using objective measurements and perceptual tests."
"The populations of patients with locked-in syndrome are increasing as medical technologies advance and successfully support life. These individuals with limited to no movement could potentially contribute to their medical decision making, informed consent, and daily care giving if they had faster, more reliable means to interface with communication systems. These software-based language models are innovative technological discoveries that are being applied to clinical augmentative communication tools so that patients and their families can participate in daily activities and advocate for improvements in standard clinical care."
"The goal of this algorithm is to produce a synthetic voice for an AAC system that sounds like the individual using the system (before they lost the ability to speak), without requiring much recorded data from the original talker. The system works by first creating a synthetic "base" voice (or set of base voices) using professional actors, who must provide a fairly large inventory of speech data. Using the base voice and a small sample from the target talker (i.e., containing at least one instance of each phoneme), a new synthetic voice is created by modulating parameters in the base voice so that it takes on characteristics of the target talker. The ability to create a voice that sounds like the original talker without much data from that talker would be a significant advantage."
"The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose a new algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/different pairs to measure the accuracy with which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech."
These algorithms are for state-of-the-art voice transformation to create new synthetic voices. VTmax uses an LSF vocoder speech model to train the transformation and generate new speech files. VTmin is useful for cases in which only low-quality existing voice recordings are available.
The VOICES Corpus was created by Alexander Kain for his Ph.D. dissertation work on high resolution voice transformation. The corpus contains 12 speakers reading 50 phonetically rich sentences. The recording procedure involved a "mimicking" approach which resulted in a high degree of natural time-alignment between different speakers. The acoustic wave and the concurrent laryngograph signal were recorded for 1 "free" and 2 "mimicked" renditions of each sentence. Pitch marks, calculated from the laryngograph signal, and time marks, the output of a forced-alignment algorithm, have been added to the corpus.
The Yes/No Corpus is a collection of answers to yes/no questions from other CSLU corpora.
The data in this corpus were collected over both analog and digital telephone lines.
The analog data were recorded using a Gradient Technologies analog-to-digital conversion box; these files were recorded as 16-bit, 8 kHz samples and stored in a linear format.
The digital data were recorded with the CSLU T1 digital data collection system; these files were sampled at 8 kHz, 8-bit, and stored as u-law files.
All of the data are distributed in the standard RIFF file format, 16-bit linearly encoded.
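The u-law-to-linear conversion implied by the last two paragraphs follows directly from the ITU-T G.711 definition. A minimal decoder for one sample (a sketch of the standard algorithm, not the CSLU tooling):

```python
def ulaw_to_linear(u):
    """Decode one 8-bit u-law sample to 16-bit linear PCM per
    ITU-T G.711: un-invert the stored byte, unpack sign/exponent/
    mantissa, expand the magnitude, and remove the encoder bias."""
    u = ~u & 0xFF                    # u-law bytes are stored bit-inverted
    sign = u & 0x80                  # bit 7: sign
    exponent = (u >> 4) & 0x07       # bits 6-4: segment (exponent)
    mantissa = u & 0x0F              # bits 3-0: step within segment
    magnitude = ((mantissa << 3) + 0x84) << exponent   # 0x84 = bias of 132
    magnitude -= 0x84
    return -magnitude if sign else magnitude
```

Applying this to every byte of an 8 kHz u-law file yields the 16-bit linear samples that the distributed RIFF files contain.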