\input texinfo @c -*-texinfo-*-
@c %**start of header
@setfilename festival.info
@settitle Festival Speech Synthesis System
@finalout
@setchapternewpage odd
@c %**end of header

@c This document was modelled on the numerous examples of texinfo
@c documentation available with GNU software, primarily the hello
@c world example, but many others too. I happily acknowledge their
@c aid in producing this document -- awb

@set EDITION 1.4
@set VERSION 1.4.3
@set UPDATED 27th December 2002

@ifinfo
This file documents the @code{Festival} Speech Synthesis System, a general
text-to-speech system for making your computer talk and for developing
new synthesis techniques.

Copyright (C) 1996-2004 University of Edinburgh

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through TeX, or otherwise, and
print the results, provided the printed document carries a copying
permission notice identical to this one except for the removal of this
paragraph (this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the authors.
@end ifinfo

@titlepage
@title The Festival Speech Synthesis System
@subtitle System documentation
@subtitle Edition @value{EDITION}, for Festival Version @value{VERSION}
@subtitle @value{UPDATED}
@author by Alan W Black, Paul Taylor and Richard Caley.

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1996-2004 University of Edinburgh, all rights
reserved.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the University of Edinburgh.
@end titlepage

@node Top, , , (dir)

@ifinfo
This file documents the @emph{Festival Speech Synthesis System}
@value{VERSION}. This document contains many gaps and is still in the
process of being written.
@end ifinfo

@menu
* Abstract:: initial comments
* Copying:: How you can copy and share the code
* Acknowledgements:: List of contributors
* What is new:: Enhancements since last public release

* Overview:: Generalities and Philosophy
* Installation:: Compilation and Installation
* Quick start:: Just tell me what to type
* Scheme:: A quick introduction to Festival's scripting language

Text methods for interfacing to Festival
* TTS:: Text to speech modes
* XML/SGML mark-up:: XML/SGML mark-up Language
* Emacs interface:: Using Festival within Emacs

Internal functions
* Phonesets:: Defining and using phonesets
* Lexicons:: Building and compiling Lexicons
* Utterances:: Existing and defining new utterance types

Modules
* Text analysis:: Tokenizing text
* POS tagging:: Part of speech tagging
* Phrase breaks:: Finding phrase breaks
* Intonation:: Intonation modules
* Duration:: Duration modules
* UniSyn synthesizer:: The UniSyn waveform synthesizer
* Diphone synthesizer:: Building and using diphone synthesizers
* Other synthesis methods:: other waveform synthesis methods
* Audio output:: Getting sound from Festival

* Voices:: Adding new voices (and languages)

* Tools:: CART, Ngrams etc

* Building models from databases::

Adding new modules and writing C++ code
* Programming:: Programming in Festival (Lisp/C/C++)
* API:: Using Festival in other programs

* Examples:: Some simple (and not so simple) examples

* Problems:: Reporting bugs.
* References:: Other sources of information
* Feature functions:: List of builtin feature functions.
* Variable list:: Short descriptions of all variables
* Function list:: Short descriptions of all functions
* Index:: Index of concepts.
@end menu

@node Abstract, Copying, , Top
@chapter Abstract

This document provides a user manual for the Festival
Speech Synthesis System, version @value{VERSION}.

Festival offers a general framework for building speech synthesis
systems as well as including examples of various modules. As a whole it
offers full text to speech through a number of APIs: from shell level,
through a Scheme command interpreter, as a C++ library, and via an Emacs
interface. Festival is multi-lingual; we have developed voices in many
languages including English (UK and US), Spanish and Welsh, though
English is the most advanced.

The system is written in C++ and uses the Edinburgh Speech Tools
for low level architecture, and has a Scheme (SIOD) based command
interpreter for control. Documentation is given in the FSF texinfo
format, which can generate a printed manual, info files and HTML.

The latest details and a full software distribution of the Festival Speech
Synthesis System are available through its home page, which may be found
at
@example
@url{http://www.cstr.ed.ac.uk/projects/festival.html}
@end example

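These interfaces can be exercised directly from the command line. The invocations below are illustrative sketches only: they assume a built @code{festival} binary on your @code{PATH} and working audio output (see the Installation chapter).

```shell
# Illustrative only -- assumes a built `festival` binary on PATH
# and working audio output.
echo "Hello world." | festival --tts   # shell-level text to speech

festival                               # starts the Scheme command
                                       # interpreter; at the prompt type:
                                       #   (SayText "Hello world.")
```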
@node Copying, Acknowledgements, Abstract, Top
@chapter Copying

@cindex restrictions
@cindex redistribution
As we feel the core system has reached an acceptable level of maturity,
from 1.4.0 the basic system is released under a free licence, without the
commercial restrictions we imposed on earlier versions. The basic system
has been placed under an X11 type licence which, as free licences go, is
pretty free. No GPL code is included in Festival or the speech tools
themselves (though some auxiliary files are GPL'd, e.g. the Emacs mode
for Festival). We have deliberately chosen a licence that should be
compatible with our commercial partners and our free software users.

However, although the code is free, we still offer no warranties and no
maintenance. We will continue to endeavor to fix bugs and answer
queries when we can, but are not in a position to guarantee it. We will
consider maintenance contracts and consultancy if desired; please
contact us for details.

Also note that not all the voices and lexicons we distribute with
Festival are free. In particular, the British English lexicon derived
from the Oxford Advanced Learners' Dictionary is free only for
non-commercial use (we will release an alternative soon). Also the
Spanish diphone voice we release is only free for non-commercial use.

If you are using Festival or the speech tools in a commercial environment,
even though no licence is required, we would be grateful if you let us
know, as it helps justify ourselves to our various sponsors.

The current copyright on the core system is
@example
The Festival Speech Synthesis System: version 1.4.3
Centre for Speech Technology Research
University of Edinburgh, UK
Copyright (c) 1996-2004
All Rights Reserved.

Permission is hereby granted, free of charge, to use and distribute
this software and its documentation without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of this work, and to
permit persons to whom this work is furnished to do so, subject to
the following conditions:
 1. The code must retain the above copyright notice, this list of
    conditions and the following disclaimer.
 2. Any modifications must be clearly marked as such.
 3. Original authors' names are not deleted.
 4. The authors' names are not used to endorse or promote products
    derived from this software without specific prior written
    permission.

THE UNIVERSITY OF EDINBURGH AND THE CONTRIBUTORS TO THIS WORK
DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT
SHALL THE UNIVERSITY OF EDINBURGH NOR THE CONTRIBUTORS BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.
@end example

@node Acknowledgements, What is new, Copying, Top
@chapter Acknowledgements
@cindex acknowledgements
@cindex thanks

The code in this system was primarily written by Alan W Black, Paul
Taylor and Richard Caley. Festival sits on top of the Edinburgh Speech
Tools Library, and uses much of its functionality.

Amy Isard wrote a synthesizer for her MSc project in 1995, which first
used the Edinburgh Speech Tools Library. Although Festival doesn't
contain any code from that system, her system was used as a basic model.

Much of the design and philosophy of Festival has been built on the
experience both Paul and Alan gained from the development of various
previous synthesizers and software systems, especially CSTR's Osprey and
Polyglot systems @cite{taylor91} and ATR's CHATR system @cite{black94}.

However, it should be stated that Festival is fully developed at CSTR
and contains neither proprietary code nor ideas.

Festival contains a number of subsystems integrated from other sources,
and we acknowledge those systems here.

@section SIOD
@cindex SIOD
@cindex Scheme
@cindex Paradigm Associates

The Scheme interpreter (SIOD -- Scheme In One Defun 3.0) was
written by George Carrett (gjc@@mitech.com, gjc@@paradigm.com)
and offers a basic small Scheme (Lisp) interpreter suitable
for embedding in applications such as Festival as a scripting
language. A number of changes and improvements have been added
in our development, but it remains essentially that basic system.
We are grateful to George and Paradigm Associates Incorporated
for providing such a useful and well-written sub-system.
@example
Scheme In One Defun (SIOD)
COPYRIGHT (c) 1988-1994 BY
PARADIGM ASSOCIATES INCORPORATED, CAMBRIDGE, MASSACHUSETTS.
ALL RIGHTS RESERVED

Permission to use, copy, modify, distribute and sell this software
and its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all copies
and that both that copyright notice and this permission notice appear
in supporting documentation, and that the name of Paradigm Associates
Inc not be used in advertising or publicity pertaining to distribution
of the software without specific, written prior permission.

PARADIGM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
PARADIGM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.
@end example

@section editline

Because of conflicts between the copyright for GNU readline (for which
an optional interface was included in earlier versions) and Festival's
licence, we have replaced that interface with a complete command line
editing system based on
@file{editline}. @file{Editline} was posted to the USENET newsgroup
@file{comp.sources.misc} in 1992. A number of modifications have been
made to make it more useful to us, but the original code (contained
within the standard speech tools distribution) and our modifications
fall under the following licence.
@example
Copyright 1992 Simmule Turner and Rich Salz. All rights reserved.

This software is not subject to any license of the American Telephone
and Telegraph Company or of the Regents of the University of California.

Permission is granted to anyone to use this software for any purpose on
any computer system, and to alter it and redistribute it freely, subject
to the following restrictions:
 1. The authors are not responsible for the consequences of use of this
    software, no matter how awful, even if they arise from flaws in it.
 2. The origin of this software must not be misrepresented, either by
    explicit claim or by omission. Since few users ever read sources,
    credits must appear in the documentation.
 3. Altered versions must be plainly marked as such, and must not be
    misrepresented as being the original software. Since few users
    ever read sources, credits must appear in the documentation.
 4. This notice may not be removed or altered.
@end example

@section Edinburgh Speech Tools Library

@cindex Edinburgh Speech Tools Library
The Edinburgh Speech Tools Library lies at the core of Festival. Although
developed separately, much of the development of certain parts of the
Edinburgh Speech Tools has been directed by Festival's needs. In turn
those who have contributed to the Speech Tools make Festival
a more usable system.

@xref{Acknowledgements, , Acknowledgements, speechtools,
Edinburgh Speech Tools Library Manual}.

Online information about the Edinburgh Speech Tools library
is available through
@example
@url{http://www.cstr.ed.ac.uk/projects/speech_tools.html}
@end example

@section Others

Many others have provided actual code and support for Festival,
for which we are grateful. Specifically,

@itemize @bullet
@item Alistair Conkie:
various low level code points and some design work,
Spanish synthesis, the old diphone synthesis code.
@item Steve Isard:
directorship and LPC diphone code, design of diphone schema.
@item EPSRC:
who fund Alan Black and Paul Taylor.
@item Sun Microsystems Laboratories:
for supporting the project and funding Richard.
@item AT&T Labs - Research:
for supporting the project.
@item Paradigm Associates and George Carrett:
for Scheme in one defun.
@item Mike Macon:
improving the quality of the diphone synthesizer and LPC analysis.
@item Kurt Dusterhoff:
Tilt intonation training and modelling.
@item Amy Isard:
for her SSML project and related synthesizer.
@item Richard Tobin:
for answering all those difficult questions, the socket code,
and the XML parser.
@item Simmule Turner and Rich Salz:
command line editor (editline).
@item Borja Etxebarria:
help with the Spanish synthesis.
@item Briony Williams:
Welsh synthesis.
@item Jacques H. de Villiers: @file{jacques@@cse.ogi.edu} from CSLU
at OGI, for the TCL interface, and other usability issues.
@item Kevin Lenzo: @file{lenzo@@cs.cmu.edu} from CMU for the PERL
interface.
@item Rob Clarke:
for support under Linux.
@item Samuel Audet @file{guardia@@cam.org}:
OS/2 support.
@item Mari Ostendorf:
for providing access to the BU FM Radio corpus, from which some
modules were trained.
@item Melvin Hunt:
on whose work we based our residual LPC synthesis model.
@item Oxford Text Archive:
for the computer users' version of the Oxford Advanced
Learners' Dictionary (redistributed with permission).
@item Reading University:
for access to MARSEC, from which the phrase break
model was trained.
@item LDC & Penn Tree Bank:
from which the POS tagger was trained; redistribution
of the models is with permission from the LDC.
@item Roger Burroughes and Kurt Dusterhoff:
for letting us capture their voices.
@item ATR and Nick Campbell:
for first getting Paul and Alan to work together and for the
experience we gained.
@item FSF:
for G++, make, ....
@item Center for Spoken Language Understanding:
CSLU at OGI, particularly Ron Cole and Mike Macon, have acted as
significant users of the system, giving important feedback and
allowing us to teach courses on Festival, which offered valuable
real-use experience.
@item Our beta testers:
thanks to all the people who put up with previous versions of the system
and reported bugs, both big and small. These comments are very important
to the constant improvement of the system. And thanks for your quick
responses when we had specific requests.
@item And our users ...
Many people have downloaded earlier versions of the system. Many have
found problems with installation and use and have reported them to us.
Many of you have put up with multiple compilations trying to fix bugs
remotely. We thank you for putting up with us and are pleased you've
taken the time to help us improve our system. Many of you have come up
with uses we hadn't thought of, which is always rewarding.

Even if you haven't actively responded, the fact that you use the system
at all makes it worthwhile.
@end itemize

@node What is new, Overview, Acknowledgements , Top
@chapter What is new

Compared to the previous major release (1.3.0, released Aug 1998),
1.4.0 is not functionally very different from its predecessors.
This release is primarily a consolidation release, fixing and tidying
up some of the lower level aspects of the system to allow better
modularity for some of our future planned modules.

@itemize @bullet
@item Copyright change:
the system is now free and has no commercial restriction. Note that
currently only the US voices (ked and kal) are also unrestricted. The
UK English voices depend on the Oxford Advanced Learners' Dictionary of
Current English, which cannot be used commercially without
permission from Oxford University Press.

@item Architecture tidy up:
the interfaces to lower level parts of the system have been tidied
up, deleting some of the older code that was kept for
compatibility reasons. There is now a much greater dependence on features,
and easier (and safer) ways to register new objects as feature values
and Scheme objects. Scheme has been tidied up. It is no longer
"in one defun" but "in one directory".

@item New documentation system for speech tools:
a new docbook based documentation system has been added to the
speech tools. Festival's documentation will move
over to this sometime soon too.

@item Initial JSAPI support: both JSAPI and JSML (somewhat
similar to Sable) now have initial implementations. They of course
depend on Java support, which so far we have only (successfully)
investigated under Solaris and Linux.

@item Generalization of statistical models: CART, ngrams,
and WFSTs are now fully supported from Lisp and can be used with a
generalized Viterbi function. This makes adding quite complex statistical
models easy without adding new C++.

@item Tilt intonation modelling:
full support is now included for the Tilt intonation models,
both training and use.

@item Documentation on Building New Voices in Festival:
documentation, scripts etc. for building new voices and languages in
the system; see
@example
@url{http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/}
@end example

@end itemize

@node Overview, Installation , What is new, Top
@chapter Overview

Festival is designed as a speech synthesis system for at least three
levels of user. First, those who simply want high quality speech from
arbitrary text with the minimum of effort. Second, those who are
developing language systems and wish to include synthesis output. In
this case, a certain amount of customization is desired, such as
different voices, specific phrasing, dialog types etc. The third level
is in developing and testing new synthesis methods.

This manual is not designed as a tutorial on converting text to speech,
but as documentation of the processes and use of our system. We do not
discuss the detailed algorithms involved in converting text to speech or
the relative merits of multiple methods, though we will often give
references to relevant papers when describing the use of each module.

For more general information about text to speech we recommend Dutoit's
@file{An introduction to Text-to-Speech Synthesis} @cite{dutoit97}. For
more detailed research issues in TTS see @cite{sproat98} or
@cite{vansanten96}.

@menu
* Philosophy:: Why we did it like it is
* Future:: How much better it's going to get
@end menu

@node Philosophy, Future, , Overview
@section Philosophy

One of the biggest problems in the development of speech synthesis, and
other areas of speech and language processing systems, is that there are
a lot of simple well-known techniques lying around which can help you
realise your goal. But in order to improve some part of the whole
system it is necessary to have a whole system in which you can test and
improve your part. Festival is intended as that whole system in which
you may simply work on your small part to improve the whole. Without a
system like Festival, before you could even start to test your new
module you would need to spend significant effort building a whole
system, or adapting an existing one.

Festival is specifically designed to allow the addition of new
modules, easily and efficiently, so that development need not
get bogged down in reinventing the wheel.

But there is another aspect of Festival which makes it more useful than
simply an environment for researching new synthesis techniques.
It is a fully usable text-to-speech system suitable for embedding in
other projects that require speech output. The provision of a fully
working, easy-to-use speech synthesizer in addition to just a testing
environment is good for two specific reasons. First, it offers a conduit
for our research, in that our experiments can quickly and directly
benefit users of our synthesis system. And secondly, in ensuring we have
a fully working usable system we can immediately see what problems exist
and where our research should be directed, rather than where our whims
take us.

These concepts are not unique to Festival. ATR's CHATR system
(@cite{black94}) follows very much the same philosophy, and Festival
benefits from the experience gained in the development of that system.
Festival benefits from various pieces of previous work. As well as
CHATR, CSTR's previous synthesizers, Osprey and the Polyglot projects,
influenced many design decisions. Also we are influenced by more
general programs in considering software engineering issues, especially
GNU Octave and Emacs, on which the basic script model was based.

Unlike some other speech and language systems, software engineering is
considered very important to the development of Festival. Too often
research systems consist of random collections of hacky little scripts
and code. No one person can confidently describe the algorithms they
perform, as parameters are scattered throughout the system, with tricks
and hacks making it impossible to really evaluate why the system is good
(or bad). Such systems do not help the advancement of speech
technology, except perhaps in pointing at ideas that should be further
investigated. If the algorithms and techniques cannot be described
externally from the program @emph{such that} they can be reimplemented by
others, what is the point of doing the work?

Festival offers a common framework where multiple techniques may be
implemented (by the same or different researchers) so that they may
be tested more fairly in the same environment.

As a final word, we'd like to make two short statements which both
achieve the same end but unfortunately perhaps not for the same reasons:
@quotation
Good software engineering makes good research easier
@end quotation
But the following seems to be true also
@quotation
If you spend enough effort on something it can be shown to be better
than its competitors.
@end quotation

@node Future, , Philosophy , Overview
@section Future

Festival is still very much in development. Hopefully this state will
continue for a long time. It is never possible to complete software;
there are always new things that can make it better. However, as time
goes on Festival's core architecture will stabilise and little or
no change will be made. Other aspects of the system will gain
greater attention, such as waveform synthesis modules, intonation
techniques, text type dependent analysers etc.

Festival will improve, so don't expect it to be the same six months
from now.

A number of new modules and enhancements are already under consideration
at various stages of implementation. The following is a non-exhaustive
list of what we may (or may not) add to Festival over the
next six months or so.
@itemize @bullet
@item Selection-based synthesis:
moving away from diphone technology to more generalized selection
of units from a speech database.
@item New structure for linguistic content of utterances:
using techniques from Metrical Phonology, we are building more structured
representations of utterances that better reflect their linguistic
significance. This will allow improvements in prosody and unit selection.
@item Non-prosodic prosodic control:
for language generation systems and custom tasks where the speech
to be synthesized is being generated by some program, more information
about text structure will probably exist, such as phrasing, contrast,
key items etc. We are investigating the relationship of high-level
tags to prosodic information through the Sole project
@url{http://www.cstr.ed.ac.uk/projects/sole.html}
@item Dialect independent lexicons:
currently for each new dialect we need a new lexicon; we are
investigating a form of lexical specification that is dialect independent
and allows the core form to be mapped to different dialects. This
will make the generation of voices in different dialects much easier.
@end itemize

@node Installation, Quick start, Overview, Top
@chapter Installation

This section describes how to install Festival from source in a new
location and customize that installation.

@menu
* Requirements:: Software/Hardware requirements for Festival
* Configuration:: Setting up compilation
* Site initialization:: Settings for your particular site
* Checking an installation:: But does it work ...
@end menu

@node Requirements, Configuration, , Installation
@section Requirements

@cindex requirements
In order to compile Festival you first need the following
source packages

@table @code
@item festival-1.4.3-release.tar.gz
Festival Speech Synthesis System source
@item speech_tools-1.2.3-release.tar.gz
The Edinburgh Speech Tools Library
@item festlex_NAME.tar.gz
@cindex lexicon
The lexicon distribution, where possible, includes the lexicon input
file as well as the compiled form, for your convenience. The lexicons
have varying distribution policies, but are all free except OALD, which
is only free for non-commercial use (we are working on a free
replacement). In some cases only a pointer to an ftp'able file, plus a
program to convert that file to the Festival format, is included.
@item festvox_NAME.tar.gz
You'll need a speech database. A number are available (with varying
distribution policies). Each voice may have other dependencies, such as
requiring particular lexicons.
@item festdoc_1.4.3.tar.gz
Full postscript, info and html documentation for Festival and the
Speech Tools. The source of the documentation is available
in the standard distributions, but for your convenience it has
been pre-generated.
@end table

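The packages above are normally unpacked side by side in a single directory, so that the Festival build can find the speech tools at its default relative location. The following sketch is illustrative only; substitute the actual @code{NAME} and version strings for your release.

```shell
# Illustrative only -- filenames follow the table above; substitute
# the actual NAME and version strings for your release.
mkdir build && cd build
tar xzf speech_tools-1.2.3-release.tar.gz   # creates speech_tools/
tar xzf festival-1.4.3-release.tar.gz       # creates festival/
tar xzf festlex_NAME.tar.gz                 # unpacks into the festival/ tree
tar xzf festvox_NAME.tar.gz                 # unpacks into the festival/ tree
```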
In addition to the Festival specific sources you will also need

@table @emph
@item A UNIX machine
Currently we have compiled and tested the system under Solaris (2.5(.1),
2.6, 2.7 and 2.8), SunOS (4.1.3), FreeBSD (3.x, 4.x), Linux (Redhat 4.1,
5.0, 5.1, 5.2, 6.[012], 7.[01], 8.0 and other Linux distributions), and it
should work under OSF (Dec Alphas), SGI (Irix), HPs (HPUX). But any
standard UNIX machine should be acceptable. We have now successfully
ported this version to Windows NT and Windows 95 (using the Cygnus GNU
win32 environment). This is still a young port but seems to work.
@item A C++ compiler
@cindex GNU g++
@cindex g++
@cindex C++
Note that C++ is not very portable, even between different versions
of the compiler from the same vendor. Although we've tried very
hard to make the system portable, we know it is very unlikely to
compile without change except with compilers that have already been tested.
The currently tested systems are
@itemize @bullet
@item Sun Sparc Solaris 2.5, 2.5.1, 2.6, 2.7, 2.9:
GCC 2.95.1, GCC 3.2
@item FreeBSD for Intel 3.x and 4.x:
GCC 2.95.1, GCC 3.0
@item Linux for Intel (RedHat 4.1/5.0/5.1/5.2/6.0/7.x/8.0):
GCC 2.7.2, GCC 2.7.2/egcs-1.0.2, egcs 1.1.1, egcs-1.1.2, GCC 2.95.[123],
GCC "2.96", GCC 3.0, GCC 3.0.1, GCC 3.2, GCC 3.2.1
@item Windows NT 4.0:
GCC 2.7.2 plus egcs (from Cygnus GNU win32 b19), Visual C++ PRO v5.0,
Visual C++ v6.0
@end itemize
Note that if GCC works on one version of Unix it usually works on
others.

@cindex Windows NT/95
We have compiled both the speech tools and Festival under Windows NT 4.0
and Windows 95 using the GNU tools available from Cygnus.
@example
@url{ftp://ftp.cygnus.com/pub/gnu-win32/}.
@end example

@item GNU make
Because there are too many different @code{make} programs out there,
we have tested the system using GNU make on all the systems we use.
Others may work but we know GNU make does.
@item Audio hardware
@cindex audio hardware
You can use Festival without audio output hardware, but it doesn't sound
very good (though admittedly you will hear fewer problems with it). A
number of audio systems are supported (directly inherited from the
audio support in the Edinburgh Speech Tools Library): NCD's NAS
(formerly called netaudio), a network transparent audio system (which can
be found at @url{ftp://ftp.x.org/contrib/audio/nas/});
@file{/dev/audio} (at 8k ulaw and 8/16bit linear), found on Suns, Linux
machines and FreeBSD; and a method allowing arbitrary UNIX
commands. @xref{Audio output}.
@end table

@cindex readline
|
|
@cindex editline
|
|
@cindex GNU readline
|
|
Earlier versions of Festival mistakenly offered a command line editor
|
|
interface to the GNU package readline, but due to conflicts with the GNU
|
|
Public Licence and Festival's licence this interface was removed in
|
|
version 1.3.1. Even Festival's new free licence would cause problems as
|
|
readline support would restrict Festival linking with non-free code. A
|
|
new command line interface based on editline was provided that offers
|
|
similar functionality. Editline remains a compilation option as it is
|
|
probably not yet as portable as we would like it to be.
|
|
|
|
@cindex @file{texi2html}
In addition to the above, in order to process the documentation you will
need @file{TeX}, @file{dvips} (or similar), GNU's @file{makeinfo} (part
of the texinfo package) and @file{texi2html}, which is available from
@url{http://wwwcn.cern.ch/dci/texi2html/}.

@cindex documentation
However, the documentation files are also available pre-processed into
postscript, DVI, info and html as part of the distribution in
@file{festdoc-1.4.X.tar.gz}.

Ensure you have a fully installed and working version of your C++
compiler. Most of the problems people have had in installing Festival
have been due to incomplete or bad compiler installation. If you don't
know whether anyone has used your C++ installation before, it is worth
checking that the following program compiles and runs.
@example
#include <iostream.h>

int main (int argc, char **argv)
@{
    cout << "Hello world\n";
    return 0;
@}
@end example

Unpack all the source files in a new directory. The directory
will then contain two subdirectories
@example
speech_tools/
festival/
@end example

@node Configuration, Site initialization, Requirements , Installation
@section Configuration

First ensure you have a compiled version of the Edinburgh
Speech Tools Library. See @file{speech_tools/INSTALL} for
instructions.

@cindex configuration
The system now supports the standard GNU @file{configure} method
for set up. In most cases this will automatically configure festival
for your particular system. In most cases you need only
type
@example
gmake
@end example
and the system will configure itself and compile (note that you
need to have compiled the Edinburgh Speech Tools
@file{speech_tools-1.2.2} first).

@cindex @file{config/config}
In some cases hand configuration is required. All of the configuration
choices are kept in the file @file{config/config}.

@cindex OTHER_DIRS
For the most part Festival configuration inherits the configuration from
your speech tools config file (@file{../speech_tools/config/config}).
Additional optional modules may be added by adding them to the end of
your config file e.g.
@example
ALSO_INCLUDE += clunits
@end example
Adding a new module here will treat it as a new directory in
@file{src/modules/} and compile it into the system in the
same way the @code{OTHER_DIRS} feature was used in
previous versions.

@cindex NFS
@cindex automounter
If the compilation directory is being accessed over NFS or if you use an
automounter (e.g. amd) it is recommended that you explicitly set the
variable @code{FESTIVAL_HOME} in @file{config/config}. The command
@code{pwd} is not reliable when a directory may have multiple names.

There is a simple test suite with Festival but it requires the three
basic voices and their respective lexicons to be installed before it
will work. Thus you need to install
@example
festlex_CMU.tar.gz
festlex_OALD.tar.gz
festlex_POSLEX.tar.gz
festvox_don.tar.gz
festvox_kedlpc16k.tar.gz
festvox_rablpc16k.tar.gz
@end example
If these are installed you can test the installation with
@example
gmake test
@end example

To simply make it run with a male US English voice it is
sufficient to install just
@example
festlex_CMU.tar.gz
festlex_POSLEX.tar.gz
festvox_kallpc16k.tar.gz
@end example
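Assuming the packages have been downloaded into the same top-level
directory that contains @file{festival/} (the lexicon and voice tarballs
unpack into subdirectories of @file{festival/lib/}), installing them is
simply a matter of unpacking, for example (the directory name here is
purely illustrative)
@example
$ cd /home/me/synthesis   # the directory containing festival/
$ gunzip -c festlex_CMU.tar.gz | tar xvf -
$ gunzip -c festlex_POSLEX.tar.gz | tar xvf -
$ gunzip -c festvox_kallpc16k.tar.gz | tar xvf -
@end example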

Note that the single most common reason for problems in compilation and
linking found amongst the beta testers was a bad installation of GNU
C++. If you get many strange errors in G++ library header files or link
errors it is worth checking that your system has the compiler, header
files and runtime libraries properly installed. This may be checked by
compiling a simple program under C++ and also finding out if anyone at
your site has ever used the installation. Most of these installation
problems are caused by upgrading to a newer version of libg++ without
removing the older version, so that mixed versions of the @file{.h}
files exist.

Although we have tried very hard to ensure that Festival compiles with
no warnings, this is not possible under some systems.

@cindex SunOS
Under SunOS the system include files do not declare a number of
system-provided functions. This is a bug in Sun's include files. It
will cause warnings like "implicit definition of fprintf". These
are harmless.

@cindex Linux
Under Linux a warning at link time about reducing the size of some
symbols is often produced. This is harmless. There are also
occasional warnings about some socket system function having an
incorrect argument type; these too are harmless.

@cindex Visual C++
The speech tools and festival compile under Windows95 or Windows NT
with Visual C++ v5.0 using the Microsoft @file{nmake} make program. We've
only done this with the Professional edition, but have no reason to
believe that it relies on anything not in the standard edition.

In accordance with VC++ conventions, object files are created with extension
.obj, executables with extension .exe and libraries with extension
.lib. This may mean that both unix and Win32 versions can be built in
the same directory tree, but I wouldn't rely on it.

To do this you require nmake Makefiles for the system. These can be
generated from the gnumake Makefiles, using the command
@example
gnumake VCMakefile
@end example
in the speech_tools and festival directories. I have only done this
under unix; it is possible it would work under the cygnus gnuwin32
system.

If @file{make.depend} files exist (i.e. if you have done @file{gnumake
depend} in unix) equivalent @file{vc_make.depend} files will be created;
if not, the VCMakefiles will not contain dependency information for the
@file{.cc} files. The result will be that you can compile the system
once, but changes will not cause the correct things to be rebuilt.

In order to compile from the DOS command line using Visual C++ you
need to have a collection of environment variables set. In Windows NT
there is an installation option for Visual C++ which sets these
globally. Under Windows95, or if you don't ask for them to be set
globally under NT, you need to run
@example
vcvars32.bat
@end example
See the VC++ documentation for more details.

Once you have the source trees with VCMakefiles somewhere visible from
Windows, you need to copy
@file{speech_tools\config\vc_config-dist} to
@file{speech_tools\config\vc_config} and edit it to suit your
local situation. Then do the same with
@file{festival\config\vc_config-dist}.

The thing most likely to need changing is the definition of
@code{FESTIVAL_HOME} in @file{festival\config\vc_config_make_rules}
which needs to point to where you have put festival.

Now you can compile. cd to the speech_tools directory and do
@example
nmake /nologo /fVCMakefile
@end example
@exdent and the library, the programs in main and the test programs should be
compiled.

The tests can't be run automatically under Windows. A simple test to
check that things are probably OK is:
@example
main\na_play testsuite\data\ch_wave.wav
@end example
@exdent which reads and plays a waveform.

Next go into the festival directory and do
@example
nmake /nologo /fVCMakefile
@end example
@exdent to build festival. When it's finished, and assuming you have the
voices and lexicons unpacked in the right place, festival should run
just as under unix.

We should remind you that the NT/95 ports are still young and there may
be problems that we've not yet found. We only recommend the use of the
speech tools and Festival under Windows if you have significant
experience in C++ under those platforms.

@cindex smaller system
@cindex minimal system
Most of the modules in @file{src/modules} are actually optional and the
system could be compiled without them. The basic set could be reduced
further if certain facilities are not desired. Particularly:
@file{donovan}, which is only required if the donovan voice is used;
@file{rxp}, if no XML parsing is required (e.g. Sable); and
@file{parser}, if no stochastic parsing is required (this parser isn't
used for any of our currently released voices). Actually even
@file{UniSyn} and @file{UniSyn_diphone} could be removed if some
external waveform synthesizer is being used (e.g. MBROLA) or some
alternative one like @file{OGIresLPC}. Removing unused modules will make
the festival binary smaller and (potentially) start up faster, but don't
expect too much. You can delete these by changing the @code{BASE_DIRS}
variable in @file{src/modules/Makefile}.

@node Site initialization, Checking an installation, Configuration, Installation
@section Site initialization

@cindex run-time configuration
@cindex initialization
@cindex installation initialization
@cindex @file{init.scm}
@cindex @file{siteinit.scm}
Once compiled, Festival may be further customized for particular sites.
At start up time Festival loads the file @file{init.scm} from its
library directory. This file further loads other necessary files such
as phoneset descriptions, duration parameters, intonation parameters,
definitions of voices etc. It will also load the files
@file{sitevars.scm} and @file{siteinit.scm} if they exist.
@file{sitevars.scm} is loaded after the basic Scheme library functions
are loaded but before any of the festival related functions are
loaded. This file is intended to set various path names before
the various subsystems are loaded. Typically, variables such
as @code{lexdir} (the directory where the lexicons are held) and
@code{voices_dir} (pointing to voice directories) should
be reset here if necessary.
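For example, a @file{sitevars.scm} pointing Festival at lexicons and
voices kept outside the library directory might contain something like
the following (a minimal sketch; the pathnames are purely illustrative
and should be adapted to your site)
@lisp
;; sitevars.scm -- illustrative pathnames only
(set! lexdir "/usr/local/share/festival/dicts/")
(set! voices_dir "/usr/local/share/festival/voices/")
@end lisp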

@cindex change libdir at run-time
@cindex run-time configuration
@cindex @code{load-path}
The default installation will try to find its lexicons and voices
automatically based on the value of @code{load-path} (this is derived
from @code{FESTIVAL_HOME} at compilation time or by using the
@code{--libdir} option at run-time). If the voices and lexicons have
been unpacked into subdirectories of the library directory (the default)
then no site-specific initialization of the above pathnames will be
necessary.

The second site-specific file is @file{siteinit.scm}. Typical examples
of local initialization are as follows. The default audio output method
is NCD's NAS system if that is supported, as that's what we normally use
in CSTR. If it is not supported, any hardware-specific mode is the
default (e.g. sun16audio, freebsd16audio, linux16audio or mplayeraudio).
But that default is just a setting in @file{init.scm}. If, for example,
you wish the default audio output method in your environment to be
8k mulaw through @file{/dev/audio}, you should add the following line to
your @file{siteinit.scm} file
@lisp
(Parameter.set 'Audio_Method 'sunaudio)
@end lisp
Note the use of @code{Parameter.set} rather than @code{Parameter.def};
the second function will not reset the value if it is already set.
Remember that you may use the audio methods @code{sun16audio},
@code{linux16audio} or @code{freebsd16audio} only if @code{NATIVE_AUDIO}
was selected in @file{speech_tools/config/config} and you are
on such a machine. The Festival variable @code{*modules*} contains
a list of all supported functions/modules in a particular installation
including audio support. Check the value of that variable if things
aren't what you expect.

If you are installing on a machine whose audio is not directly supported
by the speech tools library, an external command may be executed to play
a waveform. The following example is for an imaginary machine that can
play audio files through a program called @file{adplay} with arguments
for sample rate and file type. When playing waveforms, Festival, by
default, outputs an unheadered waveform in native byte order. In this
example you would set up the default audio playing mechanism in
@file{siteinit.scm} as follows
@lisp
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "adplay -raw -r $SR $FILE")
@end lisp
@cindex output sample rate
@cindex output file type
@cindex audio command output
@cindex audio output rate
@cindex audio output filetype
For the @code{Audio_Command} method of playing waveforms Festival
supports two additional audio parameters. @code{Audio_Required_Rate}
allows you to use Festival's internal sample rate conversion function to
resample to any desired rate. Note this may not be as good as playing
the waveform at the sample rate it is originally created in, but as some
hardware devices are restrictive in what sample rates they support, or
have naive resample functions, this could be the best option. The second
additional audio parameter is @code{Audio_Required_Format}, which can be
used to specify the desired output format of the file. The default
is unheadered raw, but this may be any of the values supported by
the speech tools (including nist, esps, snd, riff, aiff, audlab, raw
and, if you really want it, ascii).
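Combining these, a @file{siteinit.scm} for the imaginary @file{adplay}
program above, supposing it preferred headered 16kHz RIFF files, might
read as follows (again purely illustrative, including the @code{-wav}
flag)
@lisp
;; Illustrative only: resample to 16kHz and write headered RIFF files
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Rate 16000)
(Parameter.set 'Audio_Required_Format 'riff)
(Parameter.set 'Audio_Command "adplay -wav $FILE")
@end lisp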

For example, suppose you run Festival on a remote machine and are not
running any network audio system and want Festival to copy files back to
your local machine and simply cat them to @file{/dev/audio}. The
following would do that (assuming permissions for rsh are allowed).
@lisp
(Parameter.set 'Audio_Method 'Audio_Command)
;; Make output file ulaw 8k (format ulaw implies 8k)
(Parameter.set 'Audio_Required_Format 'ulaw)
(Parameter.set 'Audio_Command
"userhost=`echo $DISPLAY | sed 's/:.*$//'`; rcp $FILE $userhost:$FILE; \
rsh $userhost \"cat $FILE >/dev/audio\" ; rsh $userhost \"rm $FILE\"")
@end lisp
Note there are limits on how complex a command you want to put in the
@code{Audio_Command} string directly. It can get very confusing with
respect to quoting. It is therefore recommended that once you get past
a certain complexity you write a simple shell script and call it from
the @code{Audio_Command} string.
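For instance, the remote-playing command above could be moved into a
small script, here called @file{play_remote} (a name invented for this
example, which assumes the same rcp/rsh access as before)
@example
#!/bin/sh
# play_remote: copy a waveform to the machine named in $DISPLAY,
# play it there, then tidy up.  Illustrative only.
FILE=$1
userhost=`echo $DISPLAY | sed 's/:.*$//'`
rcp $FILE $userhost:$FILE
rsh $userhost "cat $FILE >/dev/audio"
rsh $userhost "rm $FILE"
@end example
@exdent so that @file{siteinit.scm} need only contain
@lisp
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'ulaw)
(Parameter.set 'Audio_Command "play_remote $FILE")
@end lisp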

@cindex default voice
A second typical customization is setting the default speaker. Speakers
depend on many things, but due to various licence (and resource)
restrictions you may only have some diphone/nphone databases available
in your installation. The function name that is the value of
@code{voice_default} is called immediately after @file{siteinit.scm} is
loaded, offering the opportunity for you to change it. In
the standard distribution no change should be required. If you
download all the distributed voices, @code{voice_rab_diphone} is
the default voice. You may change this for a site by adding
the following to @file{siteinit.scm}, or per person by changing
your @file{.festivalrc}. For example, if you wish to
change the default voice to the American one, @code{voice_ked_diphone}
@lisp
(set! voice_default 'voice_ked_diphone)
@end lisp
Note the single quote, and note that unlike in early versions
@code{voice_default} is not a function you can call directly.

@cindex @file{.festivalrc}
@cindex user initialization
A second level of customization is on a per-user basis. After loading
@file{init.scm}, which includes @file{sitevars.scm} and
@file{siteinit.scm} for local installation, Festival loads the file
@file{.festivalrc} from the user's home directory (if it exists). This
file may contain arbitrary Festival commands.

@node Checking an installation, , Site initialization, Installation
@section Checking an installation

Once the system is compiled and site initialization is set up, you
should test whether Festival can speak or not.

Start the system
@example
$ bin/festival
Festival Speech Synthesis System 1.4.3:release Jan 2003
Copyright (C) University of Edinburgh, 1996-2003. All rights reserved.
For details type `(festival_warranty)'
festival> ^D
@end example
If errors occur at this stage they are most likely to do
with pathname problems. If any error messages are printed
about non-existent files, check that those pathnames
point to where you intended them to be. Most of the (default)
pathnames are dependent on the basic library path. Ensure that
is correct. To find out what it has been set to, start the
system without loading the init files.
@example
$ bin/festival -q
Festival Speech Synthesis System 1.4.3:release Jan 2003
Copyright (C) University of Edinburgh, 1996-2003. All rights reserved.
For details type `(festival_warranty)'
festival> libdir
"/projects/festival/lib/"
festival> ^D
@end example
This should show the pathname you set in your @file{config/config}.

If the system starts with no errors, try to synthesize something
@example
festival> (SayText "hello world")
@end example
Some files are only accessed at synthesis time so this may
show up other problem pathnames. If it talks, you're in business;
if it doesn't, here are some possible problems.

@cindex audio problems
If you get the error message
@example
Can't access NAS server
@end example
you have selected NAS as the audio output but have no server running on
that machine, or your @code{DISPLAY} or @code{AUDIOSERVER} environment
variable is not set properly for your output device. Either set these
properly or change the audio output device in @file{lib/siteinit.scm} as
described above.

Ensure your audio device actually works the way you think it does. On
Suns, the audio output device can be switched into a number of different
output modes: speaker, jack, headphones. If this is set to the wrong
one you may not hear the output. Use one of Sun's tools to change this
(try @file{/usr/demo/SOUND/bin/soundtool}). Try to find an audio
file independent of Festival and get it to play on your audio device.
Once you have done that, ensure that the audio output method set in
Festival matches it.

Once you have got it talking, test the audio spooling device.
@example
festival> (intro)
@end example
This plays a short introduction of two sentences, spooling the audio
output.

Finally exit from Festival (by end of file or @code{(quit)}) and test
the script mode with
@example
$ examples/saytime
@end example

A test suite is included with Festival but it makes certain assumptions
about which voices are installed. It assumes that
@code{voice_rab_diphone} (@file{festvox_rabxxxx.tar.gz}) is the default
voice and that @code{voice_ked_diphone} and @code{voice_don_diphone}
(@file{festvox_kedxxxx.tar.gz} and @file{festvox_don.tar.gz}) are
installed. Also local settings in your @file{festival/lib/siteinit.scm}
may affect these tests. However, after installation it may
be worth trying
@example
gnumake test
@end example
from the @file{festival/} directory. This will do various tests
including basic utterance tests and tokenization tests. It also checks
that voices are installed and that they don't interfere with each other.
These tests are primarily regression tests for the developers of
Festival, to ensure new enhancements don't mess up existing supported
features. They are not designed to test whether an installation is
successful, though if they run correctly it is most probable the
installation has worked.

@node Quick start, Scheme, Installation, Top
@chapter Quick start

This section is for those who just want to know the absolute basics
to run the system.

@cindex command mode
@cindex text-to-speech mode
@cindex tts mode
Festival works in two fundamental modes, @emph{command mode} and
@emph{text-to-speech mode} (tts-mode). In command mode, information (in
files or through standard input) is treated as commands and is
interpreted by a Scheme interpreter. In tts-mode, information (in files
or through standard input) is treated as text to be rendered as speech.
The default mode is command mode, though this may change in later
versions.

@menu
* Basic command line options::
* Simple command driven session::
* Getting some help::
@end menu

@node Basic command line options, Simple command driven session, , Quick start
@section Basic command line options

@cindex command line options
Festival's basic calling method is as follows

@lisp
festival [options] file1 file2 ...
@end lisp

Options may be any of the following

@table @code
@item -q
start Festival without loading @file{init.scm} or user's
@file{.festivalrc}
@item -b
@itemx --batch
@cindex batch mode
After processing any file arguments do not become interactive
@item -i
@itemx --interactive
@cindex interactive mode
After processing file arguments become interactive. This option overrides
any batch argument.
@item --tts
@cindex tts mode
Treat file arguments in text-to-speech mode, causing them to be
rendered as speech rather than interpreted as commands. When selected
in interactive mode the command line edit functions are not available
@item --command
@cindex command mode
Treat file arguments in command mode. This is the default.
@item --language LANG
@cindex language specification
Set the default language to @var{LANG}. Currently @var{LANG} may be
one of @code{english}, @code{spanish} or @code{welsh} (depending on
what voices are actually available in your installation).
@item --server
After loading any specified files go into server mode. This is
a mode where Festival waits for clients on a known port (the
value of @code{server_port}, default is 1314). Connected
clients may send commands (or text) to the server and expect
waveforms back. @xref{Server/client API}. Note server mode
may be unsafe and allow unauthorised access to your
machine; be sure to read the security recommendations in
@ref{Server/client API}
@item --script scriptfile
@cindex script files
@cindex Festival script files
Run scriptfile as a Festival script file. This is similar
to @code{--batch} but it encapsulates the command line arguments into
the Scheme variables @code{argv} and @code{argc}, so that Festival
scripts may process their command line arguments just like
any other program. It also does not load the basic initialisation
files, as sometimes you may not want to do this. If you want them,
you should copy the loading sequence from an example Festival
script like @file{festival/examples/saytext}.
@item --heap NUMBER
@cindex heap size
@cindex Scheme heap size
The Scheme heap (basic number of Lisp cells) is of a fixed size and
cannot be dynamically increased at run time (this would complicate
garbage collection). The default size is 210000, which seems to be more
than adequate for most work. In some of our training experiments where
very large list structures are required it is necessary to increase
this. Note there is a trade-off between the size of the heap and the
time it takes to garbage collect, so making this unnecessarily big is
not a good idea. If you don't understand the above explanation you
almost certainly don't need to use the option.
@end table
In command mode, if the file name starts with a left parenthesis, the
name itself is read and evaluated as a Lisp command. This is often
convenient when running in batch mode and a simple command is necessary
to start the whole thing off after loading in some other specific files.
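
As a concrete illustration of @code{--script}, a file along the
following lines (here called @file{sayargs}, an invented name) would
speak each of its command line arguments in turn; as noted above, the
exact initialisation sequence is best copied from
@file{festival/examples/saytext}
@example
#!/bin/sh
"true" ; exec festival --script $0 $*
;; sayargs -- hypothetical example Festival script
(load (path-append libdir "init.scm"))  ; the basic initialisation files
(mapcar (lambda (a) (SayText a)) argv)  ; speak each argument
@end example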

@node Simple command driven session, Getting some help, Basic command line options, Quick start
@section Sample command driven session

Here is a short session using Festival's command interpreter.

Start Festival with no arguments
@lisp
$ festival
Festival Speech Synthesis System 1.4.3:release Dec 2002
Copyright (C) University of Edinburgh, 1996-2002. All rights reserved.
For details type `(festival_warranty)'
festival>
@end lisp

Festival uses a command line editor based on editline for terminal
input, so command line editing may be done with Emacs commands. Festival
also supports history as well as function, variable name, and file name
completion via the @key{TAB} key.

Typing @code{help} will give you more information; that is, @code{help}
without any parentheses. (It is actually a variable name whose value is
a string containing help.)

@cindex Scheme
@cindex read-eval-print loop
Festival offers what is called a read-eval-print loop, because
it reads an s-expression (atom or list), evaluates it and prints
the result. As Festival includes the SIOD Scheme interpreter, most
standard Scheme commands work
@lisp
festival> (car '(a d))
a
festival> (+ 34 52)
86
@end lisp
In addition to standard Scheme commands a number of commands specific to
speech synthesis are included. Although, as we will see, there are
simpler methods for getting Festival to speak, here are the basic
underlying explicit functions used in synthesizing an utterance.

@cindex utterance
@cindex hello world
Utterances can consist of various types (@pxref{Utterance types}),
but the simplest form is plain text. We can create an utterance
and save it in a variable
@lisp
festival> (set! utt1 (Utterance Text "Hello world"))
#<Utterance 1d08a0>
festival>
@end lisp
The (hex) number in the return value may be different for your
installation. That is the print form for utterances. Their internal
structure can be very large so only a token form is printed.

@cindex synthesizing an utterance
Although this creates an utterance it doesn't do anything else.
To get a waveform you must synthesize it.
@lisp
festival> (utt.synth utt1)
#<Utterance 1d08a0>
festival>
@end lisp
@cindex playing an utterance
This calls various modules, including tokenizing, duration, intonation
etc. Which modules are called is defined with respect to the type
of the utterance, in this case @code{Text}. It is possible to
call the modules individually by hand, but you just wanted it to talk,
didn't you? So
@lisp
festival> (utt.play utt1)
#<Utterance 1d08a0>
festival>
@end lisp
@exdent will send the synthesized waveform to your audio device. You should
hear "Hello world" from your machine.

@cindex @code{SayText}
To make this all easier a small function doing these three steps exists.
@code{SayText} simply takes a string of text, synthesizes it and sends it
to the audio device.
@lisp
festival> (SayText "Good morning, welcome to Festival")
#<Utterance 1d8fd0>
festival>
@end lisp
Of course, as history and command line editing are supported, @key{c-p}
or up-arrow will allow you to edit the above to whatever you wish.

Festival may also synthesize from files rather than simply text.
@lisp
festival> (tts "myfile" nil)
nil
festival>
@end lisp
@cindex exiting Festival
@cindex @code{quit}
The end of file character @key{c-d} will exit from Festival and
return you to the shell; alternatively, the command @code{quit} may
be called (don't forget the parentheses).

@cindex TTS
@cindex text to speech
Rather than starting the command interpreter, Festival may synthesize
files specified on the command line
@lisp
unix$ festival --tts myfile
unix$
@end lisp

@cindex text to wave
@cindex offline TTS
Sometimes a simple waveform is required from text that is to be kept and
played at some later time. The simplest way to do this with festival is
by using the @file{text2wave} program. This is a festival script that
will take a file (or text from standard input) and produce a single
waveform.

@cindex text2wave
An example use is
@example
text2wave myfile.txt -o myfile.wav
@end example
Options exist to specify the waveform file type, for example if
Sun audio format is required
@example
text2wave myfile.txt -otype snd -o myfile.snd
@end example
Use @file{-h} on @file{text2wave} to see all options.

@node Getting some help, , Simple command driven session, Quick start
@section Getting some help

@cindex help
If no audio is generated then you must check whether audio is
properly initialized on your machine. @xref{Audio output}.

In the command interpreter @key{m-h} (meta-h) will give you help
on the current symbol before the cursor. This will be a short
description of the function or variable, how to use it and what
its arguments are. A listing of all such help strings appears
at the end of this document. @key{m-s} will synthesize and say
the same information, but this extra function is really just for show.

@cindex @code{manual}
The lisp function @code{manual} will send the appropriate command to an
already running Netscape browser process. If @code{nil} is given as an
argument the browser will be directed to the table of contents of the
manual. If a non-nil value is given it is assumed to be a section title
and that section is searched for and, if found, displayed. For example
@example
festival> (manual "Accessing an utterance")
@end example
Another related function is @code{manual-sym} which, given a symbol, will
check its documentation string for a cross reference to a manual
section and request Netscape to display it. This function is
bound to @key{m-m} and will display the appropriate section for
the given symbol.

Note also that the @key{TAB} key can be used to find out the names
of available commands, as can the function @code{Help} (remember the
parentheses).

For more up-to-date information on Festival, regularly check
the Festival Home Page at
@example
@url{http://www.cstr.ed.ac.uk/projects/festival.html}
@end example

Further help is available by mailing questions to
@example
festival-help@@cstr.ed.ac.uk
@end example
Although we cannot guarantee the time required to answer you, we
will do our best to offer help.

@cindex bug reports
Bug reports should be submitted to
@example
festival-bug@@cstr.ed.ac.uk
@end example

If there is enough user traffic a general mailing list will be
created so all users may share comments and receive announcements.
In the meantime, watch the Festival Home Page for news.

@node Scheme, TTS, Quick start, Top
@chapter Scheme

@cindex Scheme introduction
Many people seem daunted by the fact that Festival uses Scheme as its
scripting language and feel they can't use Festival because they don't
know Scheme.  However most of those same people use Emacs every day,
which also has (a much more complex) Lisp system underneath.  The number
of Scheme commands you actually need to know in Festival is really very
small, and you can easily pick them up as you go along.  Similarly, people
often use the Unix shell but know only a small fraction of the commands
available in it (or are even aware that there is a distinction
between shell builtin commands and user definable ones).  So take it
easy, you'll learn the commands you need fairly quickly.

@menu
* Scheme references::           Places to learn more about Scheme
* Scheme fundamentals::         Syntax and semantics
* Scheme Festival specifics::
* Scheme I/O::
@end menu

@node Scheme references, Scheme fundamentals, , Scheme
@section Scheme references

If you wish to learn about Scheme in more detail I recommend
the book @cite{abelson85}.

The Emacs Lisp documentation is reasonable as it is comprehensive and
many of the underlying uses of Scheme in Festival were influenced
by Emacs.  Emacs Lisp, however, is not Scheme, so there are some
differences.

@cindex Scheme references
Other Scheme tutorials and resources available on the Web are
@itemize @bullet
@item
The Revised Revised Revised Revised Scheme Report (R4RS), the document
defining the language, is available from
@example
@url{http://tinuviel.cs.wcu.edu/res/ldp/r4rs-html/r4rs_toc.html}
@end example
@item
Scheme tutorials from the net:
@itemize @bullet
@item @url{http://www.cs.uoregon.edu/classes/cis425/schemeTutorial.html}
@end itemize
@item the Scheme FAQ
@itemize @bullet
@item @url{http://www.landfield.com/faqs/scheme-faq/part1/}
@end itemize
@end itemize

@node Scheme fundamentals, Scheme Festival specifics, Scheme references, Scheme
@section Scheme fundamentals

But you want more now, don't you, not just to be referred to some
other book.  OK, here goes.

@emph{Syntax}: an expression is an @emph{atom} or a @emph{list}.  A
list consists of a left paren, a number of expressions and a right
paren.  Atoms can be symbols, numbers, strings or other special
types like functions, hash tables, arrays, etc.

@emph{Semantics}: All expressions can be evaluated.  Lists are
evaluated as function calls.  When evaluating a list all the
members of the list are evaluated first, then the first item (a
function) is called with the remaining items in the list as arguments.
Atoms are evaluated depending on their type: symbols are
evaluated as variables, returning their values.  Numbers, strings,
functions, etc. evaluate to themselves.

Comments are started by a semicolon and run until end of line.

And that's it.  There is nothing more to the language than that.  But
just in case you can't follow the consequences of that, here are
some key examples.

@lisp
festival> (+ 2 3)
5
festival> (set! a 4)
4
festival> (* 3 a)
12
festival> (define (add a b) (+ a b))
#<CLOSURE (a b) (+ a b)>
festival> (add 3 4)
7
festival> (set! alist '(apples pears bananas))
(apples pears bananas)
festival> (car alist)
apples
festival> (cdr alist)
(pears bananas)
festival> (set! blist (cons 'oranges alist))
(oranges apples pears bananas)
festival> (append alist blist)
(apples pears bananas oranges apples pears bananas)
festival> (cons alist blist)
((apples pears bananas) oranges apples pears bananas)
festival> (length alist)
3
festival> (length (append alist blist))
7
@end lisp

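SIOD also provides the usual Lisp higher-order functions, such as
@code{mapcar}, which applies a function (here an anonymous @code{lambda})
to each member of a list.  A small sketch:
@lisp
festival> (mapcar (lambda (x) (* x x)) '(1 2 3 4))
(1 4 9 16)
@end lisp
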
@node Scheme Festival specifics, Scheme I/O, Scheme fundamentals, Scheme
@section Scheme Festival specifics

There are a number of additions to SIOD that are Festival specific, though
still part of the Lisp system rather than the synthesis functions per se.

By convention, if the first statement of a function is a string,
it is treated as a documentation string.  The string will be
printed when help is requested for that function symbol.

@cindex debugging Scheme errors
@cindex debugging scripts
@cindex backtrace
In interactive mode if the function @code{:backtrace} is called (within
parentheses) the previous stack trace is displayed.  Calling
@code{:backtrace} with a numeric argument will display that particular
stack frame in full.  Note that any command other than @code{:backtrace}
will reset the trace.  You may optionally call
@lisp
(set_backtrace t)
@end lisp
which will cause a backtrace to be displayed whenever a Scheme error
occurs.  This can be put in your @file{.festivalrc} if you wish.  This
is especially useful when running Festival in non-interactive mode
(batch or script mode) so that more information is printed when an error
occurs.

@cindex hooks
A @emph{hook} in Lisp terms is a position within some piece of code
where a user may specify their own customization.  The notion is used
heavily in Emacs.  In Festival there are a number of places where hooks
are used.  A hook variable contains either a function or a list of
functions that are to be applied at some point in the processing.  For
example, the @code{after_synth_hooks} are run after synthesis to
allow specific customization such as resampling or modification of the
gain of the synthesized waveform.  The Scheme function
@code{apply_hooks} takes a hook variable and an object as arguments and
applies the function/list of functions in turn to the object.

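As an illustrative sketch, an after-synthesis hook that halves the gain
might be set up as follows (the function name @code{halve_gain} is
invented for this example; @code{utt.wave.rescale} is used here on the
assumption that it is available for gain modification):
@lisp
(define (halve_gain utt)
 "Hypothetical hook function: scale the waveform to half gain."
 (utt.wave.rescale utt 0.5))

(set! after_synth_hooks (list halve_gain))
@end lisp
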
@cindex catching errors in Scheme
@cindex @code{unwind-protect}
@cindex errors in Scheme
When an error occurs in either Scheme or within the C++ part of Festival,
by default the system jumps to the top level, resets itself and
continues.  Note that errors are usually serious things, pointing to
bugs in parameters or code.  Every effort has been made to ensure
that the processing of text never causes errors in Festival.
However, when using Festival as a development system, it is common
for errors to occur in code.

Sometimes in writing Scheme code you know there is a potential for
an error but you wish to ignore it and continue on to the next
thing without exiting or stopping and returning to the top level.  For
example, you are processing a number of utterances from a database and
some files containing the descriptions have errors in them, but you
want your processing to continue through every utterance that can
be processed rather than stopping 5 minutes after you have gone home,
having set off a big batch job for overnight.

@cindex @code{unwind-protect}
@cindex catching errors
Festival's Scheme provides the function @code{unwind-protect} which
allows the catching of errors and then continuing normally.  For example,
suppose you have the function @code{process_utt} which takes a filename
and does things which you know might cause an error.  You can write the
following to ensure you continue processing even if an error
occurs.
@lisp
(unwind-protect
 (process_utt filename)
 (begin
  (format t "Error found in processing %s\n" filename)
  (format t "continuing\n")))
@end lisp
The @code{unwind-protect} function takes two arguments.  The first is
evaluated, and if no error occurs the value returned from that expression
is returned.  If an error does occur while evaluating the first
expression, the second expression is evaluated.  @code{unwind-protect}
may be used recursively.  Note that all files opened while evaluating
the first expression are closed if an error occurs.  All global
variables outside the scope of the @code{unwind-protect} will be left as
they were set up until the error.  Care should be taken in using this
function, but its power is necessary for writing robust Scheme
code.

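Combining @code{unwind-protect} with a loop gives the overnight batch
pattern described above; in this sketch @code{process_utt} is the same
hypothetical function and the file list is invented:
@lisp
(mapcar
 (lambda (filename)
  (unwind-protect
   (process_utt filename)
   (format t "Error found in processing %s, continuing\n" filename)))
 (list "utt001.est" "utt002.est" "utt003.est"))
@end lisp
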
@node Scheme I/O, , Scheme Festival specifics, Scheme
@section Scheme I/O

@cindex file i/o in Scheme
@cindex i/o in Scheme
Different Schemes may have quite different implementations of
file i/o functions, so in this section we will describe the
basic functions in Festival SIOD regarding i/o.

Simple printing to the screen may be achieved with the function
@code{print} which prints the given s-expression to the screen.
The printed form is preceded by a new line.  This is often useful
for debugging but isn't really powerful enough for much else.

@cindex @code{fopen}
@cindex @code{fclose}
Files may be opened and closed and referred to by file descriptors
in a direct analogy to C's stdio library.  The SIOD functions
@code{fopen} and @code{fclose} work in exactly the same
way as their equivalently named partners in C.

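For example, opening a file, writing a line to it with @code{format}
(described next) and closing it again might look like this sketch (the
file name is arbitrary):
@lisp
(let ((fd (fopen "/tmp/example.txt" "w")))
 (format fd "hello world\n")
 (fclose fd))
@end lisp
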
@cindex @code{format}
@cindex formatted output
The @code{format} command follows the command of the same name in Emacs
and a number of other Lisps.  C programmers can think of it as
@code{fprintf}.  @code{format} takes a file descriptor, a format string
and arguments to print.  The destination may be a file descriptor
as returned by the Scheme function @code{fopen}; it may also be @code{t},
which means the output will be directed to standard out
(cf. @code{printf}).  A third possibility is @code{nil}, which will cause
the output to be printed to a string which is returned (cf. @code{sprintf}).

The format string closely follows the format strings
in ANSI C, but it is not the same.  Specifically, the directives
currently supported are @code{%%}, @code{%d}, @code{%x},
@code{%s}, @code{%f}, @code{%g} and @code{%c}.  All modifiers
for these are also supported.  In addition @code{%l} is provided
for printing of Scheme objects as objects.

For example
@lisp
(format t "%03d %3.4f %s %l %l %l\n" 23 23 "abc" "abc" '(a b d) utt1)
@end lisp
will produce
@lisp
023 23.0000 abc "abc" (a b d) #<Utterance 32f228>
@end lisp
on standard output.

@cindex pretty printing
When large lisp expressions are printed they are difficult to read
because of the parentheses.  The function @code{pprintf} prints an
expression to a file descriptor (or @code{t} for standard out).  It
prints so that the s-expression is nicely lined up and indented.  This
is often called pretty printing in Lisps.

@cindex reading from files
@cindex loading data from files
For reading input from a terminal or file, there is currently no
equivalent to @code{scanf}.  Items may only be read as Scheme
expressions.  The command
@lisp
(load FILENAME t)
@end lisp
@exdent
will load all s-expressions in @code{FILENAME} and return them,
unevaluated, as a list.  Without the second argument the @code{load}
function will load and evaluate each s-expression in the file.

To read individual s-expressions use @code{readfp}.  For
example
@lisp
(let ((fd (fopen trainfile "r"))
      (entry)
      (count 0))
  (while (not (equal? (set! entry (readfp fd)) (eof-val)))
    (if (string-equal (car entry) "home")
        (set! count (+ 1 count))))
  (fclose fd))
@end lisp

@cindex @code{parse-number}
@cindex @code{atof}
@cindex string to number
@cindex convert string to number
To convert a symbol whose print name is a number to a number
use @code{parse-number}.  This is the equivalent of @code{atof}
in C.

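For example, if @code{entry} holds a symbol read from a file with
@code{readfp}, its numeric value can be obtained and used in arithmetic
(here @code{total} and @code{entry} are hypothetical variables for the
sketch):
@lisp
(set! total (+ total (parse-number entry)))
@end lisp
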
Note that all i/o from Scheme input files is assumed to be
basically some form of Scheme data (though it can be just numbers or
tokens).  For more elaborate analysis of incoming data it is
possible to use the text tokenization functions, which offer
a fully programmable method of reading data.

@node TTS, XML/SGML mark-up, Scheme, Top
@chapter TTS

Festival supports text to speech for raw text files.  If you
are not interested in using Festival in any other way except as a
black box for rendering text as speech, the following method
is probably what you want.
@example
festival --tts myfile
@end example
This will say the contents of @file{myfile}.  Alternatively text
may be submitted on standard input
@example
echo hello world | festival --tts
cat myfile | festival --tts
@end example

@cindex text modes
Festival supports the notion of @emph{text modes} where the text file
type may be identified, allowing Festival to process the file in an
appropriate way.  Currently only two types are considered stable:
@code{STML} and @code{raw}, but other types such as @code{email},
@code{HTML}, @code{Latex}, etc. are being developed and are discussed
below.  This follows the idea of buffer modes in Emacs where a file's
type can be utilized to best display the text.  Text mode may also be
selected based on a filename's extension.

Within the command interpreter the function @code{tts} is used
to render files as speech; it takes a filename and the text mode
as arguments.

@menu
* Utterance chunking::          From text to utterances
* Text modes::                  Mode specific text analysis
* Example text mode::           An example mode for reading email
@end menu

@node Utterance chunking, Text modes, , TTS
@section Utterance chunking

@cindex utterance chunking
@cindex @code{eou_tree}
Text to speech works by first tokenizing the file and chunking the
tokens into utterances.  The definition of utterance breaks is
determined by the utterance tree in the variable @code{eou_tree}.  A
default version is given in @file{lib/tts.scm}.  This uses a decision
tree to determine what signifies an utterance break.  Obviously blank
lines are probably the most reliable indicator, followed by certain
punctuation.  The confusion between periods used for sentence breaks
and for abbreviations requires some more heuristics to best guess
their different uses.  The following tree, which works better than
simply using punctuation, is currently used.
@lisp
(defvar eou_tree
 '((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; 2 or more newlines
   ((1))
   ((punc in ("?" ":" "!"))
    ((1))
    ((punc is ".")
     ;; This is to distinguish abbreviations vs periods
     ;; These are heuristics
     ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
      ((n.whitespace is " ")
       ((0))                  ;; if abbrev single space isn't enough for break
       ((n.name matches "[A-Z].*")
        ((1))
        ((0))))
      ((n.whitespace is " ")  ;; if it doesn't look like an abbreviation
       ((n.name matches "[A-Z].*") ;; single space and non-cap is no break
        ((1))
        ((0)))
       ((1))))
     ((0)))))
@end lisp
The token items this is applied to will always (except in the
end of file case) include one following token, so look-ahead is
possible.  The "n." and "p." and "p.p." prefixes allow access to the
surrounding token context.  The features @code{name}, @code{whitespace}
and @code{punc} allow access to the contents of the token itself.  At
present there is no way to access the lexicon from this tree, which is
unfortunate, as it might be useful if certain abbreviations were
identified as such there.

Note these heuristics are written by hand, not trained from data,
though problems have been fixed as they have been observed in data.  The
above rules may make mistakes where abbreviations appear at the ends of
lines, and when improper spacing and capitalization is used.  This is
probably worth changing for modes where more casual text appears, such
as email messages and USENET news messages.  A possible improvement
could be made by analysing a text to find out its basic threshold of
utterance break (i.e. if no full stop, two spaces, followed by a
capitalized word sequence appears and the text is of a reasonable length,
then look for other criteria for utterance breaks).

Ultimately what we are trying to do is to chunk the text into utterances
that can be synthesized quickly and start to play them quickly to
minimise the time someone has to wait for the first sound when starting
synthesis.  Thus it would be better if this chunking were done on
@emph{prosodic phrases} rather than chunks more similar to linguistic
sentences.  Prosodic phrases are bounded in size, while sentences are
not.

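Since @code{eou_tree} is an ordinary Scheme variable, the chunking
policy can be replaced wholesale.  For instance, this untested sketch
breaks utterances only at blank lines (two or more newlines), which may
suit pre-formatted text:
@lisp
(set! eou_tree
      '((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; 2 or more newlines
        ((1))
        ((0))))
@end lisp
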
@node Text modes, Example text mode, Utterance chunking, TTS
@section Text modes

@cindex text modes
We do not believe that all texts are of the same type.  Often information
about the general contents of a file will aid synthesis greatly.  For
example, in Latex files we do not want to hear "left brace, backslash e
m" before each emphasized word, nor do we necessarily want to hear
formatting commands.  Festival offers a basic method for specifying
customization rules depending on the @emph{mode} of the text.  In this
we are following the notion of modes in Emacs, and eventually we will
allow customization at a similar level.

Modes are specified as the second argument to the function @code{tts}.
When using the Emacs interface to Festival the buffer mode is
automatically passed as the text mode.  If the mode is not supported a
warning message is printed and the raw text mode is used.

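For example, a mode may be named explicitly when calling @code{tts} from
the interpreter (the filename here is arbitrary, and the @code{email}
mode is the one developed later in this chapter):
@example
festival> (tts "message.txt" 'email)
@end example
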
Our initial text mode implementation allows configuration both in C++
and in Scheme.  Obviously in C++ almost anything can be done, but it is
not as easy to reconfigure without recompilation.  Here
we will discuss those modes which can be fully configured at
run time.

A text mode may contain the following
@table @emph
@item filter
A Unix shell program filter that processes the text file in some
appropriate way.  For example, for email it might remove uninteresting
headers and just output the subject, the from line and the message body.
If not specified, an identity filter is used.
@item init_function
This (Scheme) function will be called before any processing
is done.  It allows further setup of tokenization rules,
voices, etc.
@item exit_function
This (Scheme) function will be called at the end of any processing,
allowing resetting of tokenization rules, etc.
@item analysis_mode
If analysis mode is @code{xml} the file is read through the built-in XML
parser @code{rxp}.  Alternatively, if analysis mode is @code{xxml}, the
filter should be an SGML normalising parser and the output is processed in
a way suitable for it.  Any other value is ignored.
@end table
These mode specific parameters are specified in the a-list
held in @code{tts_text_modes}.

When using Festival in Emacs the Emacs buffer mode is passed to
Festival as the text mode.

Note that the above mechanism is not really designed to be re-entrant;
this should be addressed in later versions.

@cindex @code{auto-text-mode-alist}
@cindex automatic selection of text mode
Following the use of auto-selection of mode in Emacs, Festival can
auto-select the text mode based on the filename given when no explicit
mode is given.  The Lisp variable @code{auto-text-mode-alist} is a list
of dotted pairs of regular expression and mode name.  For example,
to specify that the @code{email} mode is to be used for files ending
in @file{.email} we would add to the current @code{auto-text-mode-alist}
as follows
@lisp
(set! auto-text-mode-alist
      (cons (cons "\\.email$" 'email)
            auto-text-mode-alist))
@end lisp
If the function @code{tts} is called with a mode other than @code{nil},
that mode overrides any specified by the @code{auto-text-mode-alist}.
The mode @code{fundamental} is the explicit "null" mode; it is used
when no mode is specified in the function @code{tts} and no match
is found in @code{auto-text-mode-alist}, or when the specified mode
is not found.

By convention, if a requested text mode is not found in
@code{tts_text_modes} the file @file{MODENAME-mode} will be
@code{required}.  Therefore if you have the file
@file{MODENAME-mode.scm} in your library then it will be automatically
loaded on reference.  Modes may be quite large and it is not necessary
to have Festival load them all at start-up time.

Because of the @code{auto-text-mode-alist} and the auto-loading
of currently undefined text modes you can use Festival like
@example
festival --tts example.email
@end example
Festival will automatically synthesize @file{example.email} in text
mode @code{email}.

@cindex personal text modes
If you add your own personal text modes you should do the following.
Suppose you've written an HTML mode.  You have named it
@file{html-mode.scm} and put it in @file{/home/awb/lib/festival/}.  In
your @file{.festivalrc} first identify your personal Festival library
directory by adding it to @code{lib-path}.
@example
(set! lib-path (cons "/home/awb/lib/festival/" lib-path))
@end example
Then add the definition to the @code{auto-text-mode-alist}
that file names ending @file{.html} or @file{.htm} should
be read in HTML mode.
@example
(set! auto-text-mode-alist
      (cons (cons "\\.html?$" 'html)
            auto-text-mode-alist))
@end example
Then you may synthesize an HTML file either from Scheme
@example
(tts "example.html" nil)
@end example
@exdent Or from the shell command line
@example
festival --tts example.html
@end example
Anyone familiar with modes in Emacs should recognise that the process of
adding a new text mode to Festival is very similar to adding a new
buffer mode to Emacs.

@node Example text mode, , Text modes, TTS
@section Example text mode

@cindex email mode
Here is a short example of a tts mode for reading email messages.  It
is by no means complete but is a start at showing how you can customize
tts modes without writing new C++ code.

The first task is to define a filter that will take a saved mail
message and remove extraneous headers, leaving just the from
line, subject and body of the message.  The filter program
is given a file name as its first argument and should output the
result on standard out.  For our purposes we will do this as
a shell script.
@example
#!/bin/sh
# Email filter for Festival tts mode
# usage: email_filter mail_message >tidied_mail_message
grep "^From: " $1
echo
grep "^Subject: " $1
echo
# delete up to first blank line (i.e. the header)
sed '1,/^$/ d' $1
@end example
Next we define the email init function, which will be called
when we start this mode.  What we will do is save the current
token to words function and slot in our own new one.  We can
then restore the previous one when we exit.
@lisp
(define (email_init_func)
 "Called on starting email text mode."
 (set! email_previous_t2w_func token_to_words)
 (set! english_token_to_words email_token_to_words)
 (set! token_to_words email_token_to_words))
@end lisp
Note that @emph{both} @code{english_token_to_words} and
@code{token_to_words} should be set to ensure that our new
token to word function is still used when we change voices.

The corresponding end function puts the token to words function
back.
@lisp
(define (email_exit_func)
 "Called on exiting email text mode."
 (set! english_token_to_words email_previous_t2w_func)
 (set! token_to_words email_previous_t2w_func))
@end lisp
Now we can define the email specific token to words function.  In this
example we deal with two specific cases.  First we deal with the common
form of email addresses so that the angle brackets are not pronounced.
The second is to recognise quoted text and immediately change
the speaker to the alternative speaker.
@lisp
(define (email_token_to_words token name)
 "Email specific token to word rules."
 (cond
@end lisp
This first condition identifies the token as a bracketed email address,
removes the brackets and splits the token into name
and domain.  Note that we recursively call the function
@code{email_previous_t2w_func} on the name and domain
so that they will be pronounced properly.  Note that because that
function returns a @emph{list} of words we need to append them together.
@lisp
  ((string-matches name "<.*@@.*>")
   (append
    (email_previous_t2w_func token
     (string-after (string-before name "@@") "<"))
    (cons
     "at"
     (email_previous_t2w_func token
      (string-before (string-after name "@@") ">")))))
@end lisp
Our next condition deals with identifying a greater than sign being used
as a quote marker.  When we detect this we select the alternative
speaker, even though it may already be selected.  We then return no
words so the quote marker is not spoken.  The following condition finds
greater than signs which are the first token on a line.
@lisp
  ((and (string-matches name ">")
        (string-matches (item.feat token "whitespace")
                        "[ \t\n]*\n *"))
   (voice_don_diphone)
   nil   ;; return nothing to say
   )
@end lisp
If it doesn't match any of these we can go ahead and use the builtin
token to words function.  Actually, we call the function that was set
before we entered this mode to ensure any other specific rules
still remain.  But before that we need to check if we've had a newline
which doesn't start with a greater than sign.  In that case we
switch back to the primary speaker.
@lisp
  (t  ;; for all other cases
   (if (string-matches (item.feat token "whitespace")
                       ".*\n[ \t\n]*")
       (voice_rab_diphone))
   (email_previous_t2w_func token name))))
@end lisp
@cindex declaring text modes
In addition to these we have to actually declare the text mode.
This we do by adding to any existing modes as follows.
@lisp
(set! tts_text_modes
 (cons
  (list
   'email   ;; mode name
   (list    ;; email mode params
    (list 'init_func email_init_func)
    (list 'exit_func email_exit_func)
    '(filter "email_filter")))
  tts_text_modes))
@end lisp
This will now allow simple email messages to be dealt with in a mode
specific way.

An example mail message is included in @file{examples/ex1.email}.  To
hear the result of the above text mode start Festival, load
in the email mode descriptions, and call TTS on the example file.
@example
(tts ".../examples/ex1.email" 'email)
@end example

The above falls well short of a real email mode but does illustrate
how one might go about building one.  It should be reiterated
that text modes are new in Festival and their most effective form
has not been discovered yet.  This will improve with time
and experience.

@node XML/SGML mark-up, Emacs interface, TTS, Top
@chapter XML/SGML mark-up

@cindex STML
@cindex SGML
@cindex SSML
@cindex Sable
@cindex XML
@cindex Spoken Text Mark-up Language
The idea of a general, synthesizer-nonspecific mark-up language
for labelling text has been under discussion for some time.  Festival
has supported an SGML based markup language through multiple versions,
most recently STML (@cite{sproat97}).  This is based on the earlier SSML
(Speech Synthesis Markup Language) which was supported by previous
versions of Festival (@cite{taylor96}).  With this version of Festival
we support @emph{Sable}, a similar mark-up language devised by a
consortium from Bell Labs, Sun Microsystems, AT&T and Edinburgh,
@cite{sable98}.  Unlike the previous versions, which were SGML based, the
implementation of Sable in Festival is now XML based.  To the user the
difference is negligible, but using XML makes processing of files easier
and more standardized.  Also, Festival now includes an XML parser, thus
reducing the dependencies in processing Sable text.

Raw text has the problem that it cannot always easily be rendered as
speech in the way the author wishes.  Sable offers a well-defined way of
marking up text so that the synthesizer may render it appropriately.

@cindex CSS
@cindex Cascading style sheets
@cindex DSSSL
The definition of Sable is by no means settled and is still in
development.  In this release Festival offers people working on Sable
and other XML (and SGML) based markup languages a chance to quickly
experiment with prototypes by providing a DTD (document type
description) and the mapping of the elements in the DTD to Festival
functions.  Although we have not yet (personally) investigated facilities
like cascading style sheets and generalized SGML specification languages
like DSSSL, we believe the facilities offered by Festival allow rapid
prototyping of speech output markup languages.

Primarily we see Sable marked-up text as a language that will be generated
by other programs, e.g. text generation systems, dialog managers, etc.;
therefore a standard, easy-to-parse format is required, even if
it seems overly verbose for human writers.

For more information on Sable and access to the mailing list see
@example
@url{http://www.cstr.ed.ac.uk/projects/sable.html}
@end example

@menu
* Sable example::               an example of Sable with descriptions
* Supported Sable tags::        Currently supported Sable tags
* Adding Sable tags::           Adding new Sable tags
* XML/SGML requirements::       Software environment requirements for use
* Using Sable::                 Rendering Sable files as speech
@end menu

@node Sable example, Supported Sable tags, , XML/SGML mark-up
@section Sable example

Here is a simple example of Sable marked up text

@example
<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
      "Sable.v0_2.dtd"
[]>
<SABLE>
<SPEAKER NAME="male1">

The boy saw the girl in the park <BREAK/> with the telescope.
The boy saw the girl <BREAK/> in the park with the telescope.

Good morning <BREAK /> My name is Stuart, which is spelled
<RATE SPEED="-40%">
<SAYAS MODE="literal">stuart</SAYAS> </RATE>
though some people pronounce it
<PRON SUB="stoo art">stuart</PRON>.  My telephone number
is <SAYAS MODE="literal">2787</SAYAS>.

I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place,
but no one can pronounce that.

By the way, my telephone number is actually
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
</SPEAKER>
</SABLE>
@end example
@cindex SABLE DTD
@cindex @file{Sable.v0_2.dtd}
After the initial definition of the SABLE tags, through the file
@file{Sable.v0_2.dtd}, which is distributed as part of Festival, the
body is given.  There are tags for identifying the language and the
voice.  Explicit boundary markers may be given in the text.  Also,
duration and intonation control can be explicitly specified, as can new
pronunciations of words.  The last sentence specifies some external
filenames to play at that point.

@node Supported Sable tags, Adding Sable tags, Sable example, XML/SGML mark-up
@section Supported Sable tags

@cindex Sable tags
There is not yet a definitive set of tags but hopefully such a list
will form over the next few months.  As adding support for new tags is
often trivial, the problem lies much more in defining what tags there
should be than in actually implementing them.  The following
are based on version 0.2 of Sable as described in
@url{http://www.cstr.ed.ac.uk/projects/sable_spec2.html}, though
some aspects are not currently supported in this implementation.
Further updates will be announced through the Sable mailing list.

@table @code
@item LANGUAGE
Allows the specification of the language through the @code{ID}
attribute.  Valid values in Festival are @code{english},
@code{en1}, @code{spanish}, @code{en}, and others depending
on your particular installation.
For example
@example
<LANGUAGE id="english"> ... </LANGUAGE>
@end example
If the language isn't supported by the particular installation of
Festival, "Some text in .." is said instead and the section is
omitted.
@item SPEAKER
Select a voice.  Accepts a parameter @code{NAME} which takes values
@code{male1}, @code{male2}, @code{female1}, etc.  There
is currently no definition of what happens when a voice is selected
that the synthesizer doesn't support.  An example is
@example
<SPEAKER name="male1"> ... </SPEAKER>
@end example
@item AUDIO
This allows the specification of an external waveform that is to
be included.  There are attributes for specifying volume and whether
the waveform is to be played in the background of the following
text or not.  Festival as yet only supports insertion.
@example
My telephone number is
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
@end example
@item MARKER
This allows Festival to mark when a particular part of the text has
been reached.  At present simply the value of the @code{MARK}
attribute is printed.  This is done when that piece
of text is analyzed, not when it is played.  To use
this in any real application would require changes to this tag's
implementation.
@example
Move the <MARKER MARK="mouse" /> mouse to the top.
@end example
@item BREAK
Specifies a boundary at some @code{LEVEL}.  The level may take the values
@code{Large}, @code{Medium}, @code{Small} or a number.  Note that
this tag is an empty tag and must include the closing slash
within its own specification.
@example
<BREAK LEVEL="LARGE"/>
@end example
@item DIV
This signals a division.  In Festival this causes an utterance
break.  A @code{TYPE} attribute may be specified but it is ignored
by Festival.
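
For example (the @code{TYPE} value here is purely illustrative, as
Festival ignores it):
@example
<DIV TYPE="paragraph">The first paragraph.</DIV>
<DIV TYPE="paragraph">The second, spoken as a separate utterance.</DIV>
@end example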
@item PRON
Allows pronunciation of enclosed text to be explicitly given.  It
supports the attributes @code{IPA} for an IPA specification (not
currently supported by Festival); @code{SUB} for text to be substituted,
which can be in some form of phonetic spelling; and @code{ORIGIN} where
the linguistic origin of the enclosed text may be identified to assist
in etymologically sensitive letter to sound rules.
@example
<PRON SUB="toe maa toe">tomato</PRON>
@end example
@item SAYAS
Allows identification of the enclosed tokens/text.  The attribute
@code{MODE} can take any of the following values: @code{literal},
@code{date}, @code{time}, @code{phone}, @code{net}, @code{postal},
@code{currency}, @code{math}, @code{fraction}, @code{measure},
@code{ordinal}, @code{cardinal}, or @code{name}.  Further specification
of type for dates (MDY, DMY etc) may be specified through the
@code{MODETYPE} attribute.
@example
As a test of marked-up numbers.  Here we have
a year <SAYAS MODE="date">1998</SAYAS>,
an ordinal <SAYAS MODE="ordinal">1998</SAYAS>,
a cardinal <SAYAS MODE="cardinal">1998</SAYAS>,
a literal <SAYAS MODE="literal">1998</SAYAS>,
and phone number <SAYAS MODE="phone">1998</SAYAS>.
@end example
@item EMPH
Specifies that the enclosed text should be emphasized.  A @code{LEVEL}
attribute may be specified but its value is currently
ignored by Festival (besides, the emphasis Festival generates
isn't very good anyway).
@example
The leaders of <EMPH>Denmark</EMPH> and <EMPH>India</EMPH> meet on
Friday.
@end example
@item PITCH
Allows the specification of pitch range, mid and base points.
@example
Without his penguin, <PITCH BASE="-20%"> which he left at home, </PITCH>
he could not enter the restaurant.
@end example
@item RATE
Allows the specification of speaking rate.
@example
The address is <RATE SPEED="-40%"> 10 Main Street </RATE>.
@end example
@item VOLUME
Allows the specification of volume.  Note that in Festival this
causes an utterance break before and after this tag.
@example
Please speak more <VOLUME LEVEL="loud">loudly</VOLUME>, except
when I ask you to speak <VOLUME LEVEL="quiet">in a quiet voice</VOLUME>.
@end example
@item ENGINE
This allows specification of engine-specific commands.
@example
An example is <ENGINE ID="festival" DATA="our own festival speech
synthesizer"> the festival speech synthesizer</ENGINE> or
the Bell Labs speech synthesizer.
@end example
@end table

These tags may change in name but they cover the aspects of speech
mark up that we wish to express.  Later additions and changes to these
are expected.

See the files @file{festival/examples/example.sable} and
@file{festival/examples/example2.sable} for working examples.

Note the definition of Sable is ongoing and there are likely to be
later, more complete implementations of Sable for Festival as independent
releases; consult @url{http://www.cstr.ed.ac.uk/projects/sable.html} for
the most recent updates.

@node Adding Sable tags, XML/SGML requirements, Supported Sable tags, XML/SGML mark-up
@section Adding Sable tags

We do not yet claim that there is a fixed standard for Sable tags but
we wish to move towards such a standard.  In the mean time we have
made it easy in Festival to add support for new tags without, in
general, having to change any of the core functions.

Two changes are necessary to add a new tag.  First, change the
definition in @file{lib/Sable.v0_2.dtd}, so that Sable files may use it.
The second stage is to make Festival sensitive to that new tag.  The
example in @code{festival/lib/sable-mode.scm} shows how a new text mode
may be implemented for an XML/SGML-based markup language.  The basic
point is that an identified function will be called on finding a start
or end tag in the document.  It is the tag-function's job to
synthesize the given utterance if the tag signals an utterance boundary.
The return value from the tag-function is the new status of the current
utterance, which may remain unchanged; or, if the current utterance has
been synthesized, @code{nil} should be returned, signalling a new
utterance.

Note the hierarchical structure of the document is not available in this
method of tag-functions.  Any hierarchical state that must be preserved
has to be maintained using explicit stacks in Scheme.  This is an artifact
of the cross relationship between utterances and tags (utterances may end
within start and end tags), and the desire to have all specification in
Scheme rather than C++.

The tag-functions are defined in an elements list.  They are identified
with names such as "(SABLE" and ")SABLE" denoting start and end tags
respectively.  Two arguments are passed to these tag functions:
an assoc list of attributes and values as specified in the document,
and the current utterance.  If the tag denotes an utterance
break, call @code{xxml_synth} on @code{UTT} and return @code{nil}.
If a tag (start or end) is found in the document and there is no
corresponding tag-function it is ignored.

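A minimal sketch of such an elements entry, in the style of
@file{festival/lib/sable-mode.scm} (the tag name @code{MYTAG} and its
behaviour are invented for illustration):
@lisp
;; Start tag: note the attributes, leave the utterance unchanged
("(MYTAG"
 (ATTLIST UTT)
 (set! mytag_attributes ATTLIST)
 UTT)
;; End tag: treat it as an utterance break
(")MYTAG"
 (ATTLIST UTT)
 (xxml_synth UTT)  ;; synthesize what has been collected so far
 nil)              ;; nil signals a new utterance
@end lisp
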
New features may be added to words with a start and end tag by
adding features to the global @code{xxml_word_features}.  Any
features in that variable will be added to each word.

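For example, a pair of tag-functions might add a feature while their tag
is open (the feature name @code{in_mytag} is invented for this sketch):
@lisp
;; on "(MYTAG": save the old value and add a feature
(set! mytag_old_features xxml_word_features)
(set! xxml_word_features
      (cons (list 'in_mytag "1") xxml_word_features))
;; on ")MYTAG": restore the previous feature set
(set! xxml_word_features mytag_old_features)
@end lisp
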
Note that this method may be used for both XML-based languages and SGML-based
markup languages (though an external normalizing SGML parser is
required in the SGML case).  The type (XML vs SGML) is identified
by the @code{analysis_type} parameter in the tts text mode specification.

@node XML/SGML requirements, Using Sable, Adding Sable tags, XML/SGML mark-up
@section XML/SGML requirements

@cindex XML
@cindex rxp
Festival is distributed with @code{rxp}, an XML parser developed
by Richard Tobin of the Language Technology Group, University of
Edinburgh.  Sable is set up as an XML text mode so no
further requirements or external programs are needed to synthesize
from Sable marked up text (unlike previous releases).  Note that @code{rxp}
is not a full validation parser and hence doesn't check some aspects
of the file (tags within tags).

@cindex @file{nsgmls}
@cindex SGML parser
Festival still supports SGML-based markup but in such cases requires an
external SGML normalizing parser.  We have tested @file{nsgmls-1.0},
which is available as part of the SGML tools set @file{sp-1.1.tar.gz}
from @url{http://www.jclark.com/sp/index.html}.
This seems portable between many platforms.

@node Using Sable, ,XML/SGML requirements, XML/SGML mark-up
@section Using Sable

@cindex Sable using
@cindex using Sable
Support in Festival for Sable is as a text mode.  In the command
mode use the following to process a Sable file
@example
(tts "file.sable" 'sable)
@end example

Also the automatic selection of mode based on file type has been set up
such that files ending @file{.sable} will be automatically synthesized in
this mode.  Thus
@example
festival --tts fred.sable
@end example
will render @file{fred.sable} as speech in Sable mode.

Another way of using Sable is through the Emacs interface.  The
say-buffer command will send the Emacs buffer mode to Festival as
its tts-mode.  If the Emacs mode is stml or sgml the file is treated
as a Sable file.  @xref{Emacs interface}.

@cindex saving Sable waveforms
@cindex saving TTS waveforms
Many people experimenting with Sable (and TTS in general) often want all
the waveform output to be saved to be played at a later date.  The
simplest way to do this is using the @file{text2wave} script.  It
respects the automatic mode selection, so
@example
text2wave fred.sable -o fred.wav
@end example
Note this renders the file as a single waveform (done by concatenating
the waveforms for each utterance in the Sable file).

If you wish the waveform for each utterance in a file to be saved, you can
cause the tts process to save the waveforms during synthesis.  After
a call to
@example
festival> (save_waves_during_tts)
@end example
any future call to @code{tts} will cause the waveforms to be saved in a
file @file{tts_file_xxx.wav} where @file{xxx} is a number.  A call to
@code{(save_waves_during_tts_STOP)} will stop saving the waves.  A
message is printed when each waveform is saved, otherwise people forget
about this and wonder why their disk has filled up.

This is done by inserting a function in @code{tts_hooks}
which saves the wave.  To do other things to each utterance during
TTS (such as saving the utterance structure), try redefining
the function @code{save_tts_output} (see @code{festival/lib/tts.scm}).

@node Emacs interface, Phonesets, XML/SGML mark-up, Top
@chapter Emacs interface

@cindex Emacs interface
One easy method of using Festival is via an Emacs interface
that allows selection of text regions to be sent to Festival for
rendering as speech.

@cindex @file{festival.el}
@cindex @code{say-minor-mode}
@file{festival.el} offers a new minor mode which offers
an extra menu (in emacs-19 and 20) with options for saying a selected
region, or a whole buffer, as well as various general control
functions.  To use this you must install @file{festival.el} in
a directory where Emacs can find it, then add to your
@file{.emacs} in your home directory the following lines.
@lisp
(autoload 'say-minor-mode "festival" "Menu for using Festival." t)
(say-minor-mode t)
@end lisp
Successive calls to @code{say-minor-mode} will toggle the minor
mode, switching the @samp{say} menu on and off.

Note that the optional voice selection offered by the language sub-menu
is not sensitive to the actual voices supported by your Festival
installation.  Hand customization is required in the @file{festival.el}
file.  Thus some voices may appear in your menu that your Festival
doesn't support and some voices may be supported by your Festival
that do not appear in the menu.

When the Emacs Lisp function @code{festival-say-buffer} or the
menu equivalent is used, the Emacs major mode is passed to Festival
as the text mode.

@node Phonesets, Lexicons, Emacs interface, Top
@chapter Phonesets

@cindex phonesets
@cindex phoneme definitions
The notion of phonesets is important to a number of different
subsystems within Festival.  Festival supports multiple phonesets
simultaneously and allows mapping between sets when necessary.  The
lexicons, letter to sound rules, waveform synthesizers, etc. all require
the definition of a phoneset before they will operate.

A phoneset is a set of symbols which may be further defined in terms
of features, such as vowel/consonant, place of articulation
for consonants, type of vowel etc.  The set of features and
their values must be defined with the phoneset.  The definition
is used to ensure compatibility between sub-systems as well as
allowing groups of phones to be identified in various prediction
systems (e.g. duration).

A phoneset definition has the form
@lisp
(defPhoneSet
    NAME
    FEATUREDEFS
    PHONEDEFS )
@end lisp
The @var{NAME} is any unique symbol used e.g. @code{mrpa}, @code{darpa},
etc.  @var{FEATUREDEFS} is a list of definitions each consisting of
a feature name and its possible values.  For example
@lisp
(
  (vc + -)                               ;; vowel consonant
  (vlength short long diphthong schwa 0) ;; vowel length
  ...
)
@end lisp
The third section is a list of phone definitions themselves.  Each phone
definition consists of a phone name and the values for each feature in
the order the features were defined in the above section.

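Putting the three parts together, a cut-down definition might look like
the following (the feature set and phone list are heavily abbreviated;
see @file{lib/mrpa_phones.scm} for a real definition):
@lisp
(defPhoneSet
  tiny
  ;; feature definitions
  (
   (vc + -)                               ;; vowel consonant
   (vlength short long diphthong schwa 0) ;; vowel length
  )
  ;; phone definitions: a name plus one value per feature, in order
  (
   (a  +  short)
   (ii +  long)
   (t  -  0)
   (#  -  0)     ;; silence
  ))
@end lisp
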
A typical example of a phoneset definition can be found in
@file{lib/mrpa_phones.scm}.

@cindex silences
Note the phoneset should also include a definition for any silence
phones.  In addition to the definition of the set, the silence phone(s)
themselves must also be identified to the system.  This is done through
the command @code{PhoneSet.silences}.  In the mrpa set this is done by
the command
@lisp
(PhoneSet.silences '(#))
@end lisp
There may be more than one silence phone (e.g. breath, start silence etc.)
in any phoneset definition.  However the first phone in this set is
treated specially and should be the canonical silence.  Among other things,
it is this phone that is inserted by the pause prediction module.

@cindex selecting a phoneset
In addition to declaring phonesets, alternate sets may be selected
by the command @code{PhoneSet.select}.

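For example, assuming the mrpa set has been declared:
@lisp
(PhoneSet.select 'mrpa)
@end lisp
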
@cindex mapping between phones
@cindex phone maps
Phones in different sets may be automatically mapped between using
their features.  This mapping is not yet as general as it could be,
but is useful when mapping between various phonesets of the same
language.  When a phone needs to be mapped from one set to another,
the phone with matching features is selected.  This allows, at least
to some extent, lexicons, waveform synthesizers, duration modules etc.
to use different phonesets (though in general this is not advised).

A list of currently defined phonesets is returned by the
function
@example
(PhoneSet.list)
@end example
Note phonesets are often not defined until a voice is actually
loaded, so this list is not the list of sets that are distributed but
the list of sets that are used by currently loaded voices.

The name, phones, features and silences of the current phoneset
may be accessed with the function
@example
(PhoneSet.description nil)
@end example
If the argument to this function is a list, only those parts of
the phoneset description named are returned.  For example
@example
(PhoneSet.description '(silences))
(PhoneSet.description '(silences phones))
@end example

@node Lexicons, Utterances , Phonesets, Top
@chapter Lexicons

@cindex lexicon
@cindex dictionary
A @emph{Lexicon} in Festival is a subsystem that provides
pronunciations for words.  It can consist of three distinct parts:
an addenda, typically short, consisting of hand-added words; a
compiled lexicon, typically large (10,000s of words), which sits on
disk somewhere; and a method for dealing with words not in either list.

@menu
* Lexical entries:: Format of lexical entries
* Defining lexicons:: Building new lexicons
* Lookup process:: Order of significance
* Letter to sound rules:: Dealing with unknown words
* Building letter to sound rules:: Building rules from data
* Lexicon requirements:: What should be in the lexicon
* Available lexicons:: Current available lexicons
* Post-lexical rules:: Modification of words in context
@end menu

@node Lexical entries, Defining lexicons, , Lexicons
@section Lexical entries

@cindex lexical entries
Lexical entries consist of three basic parts: a head word, a part of
speech and a pronunciation.  The headword is what you might normally
think of as a word e.g. @samp{walk}, @samp{chairs} etc. but it might be
any token.

@cindex part of speech tag
@cindex POS
@cindex part of speech map
The part-of-speech field currently consists of a simple atom (or nil if
none is specified).  Of course there are many part of speech tag sets
and whatever you mark in your lexicon must be compatible with the
subsystems that use that information.  You can optionally set a part of
speech tag mapping for each lexicon.  The value should be a reverse
assoc-list of the following form
@lisp
(lex.set.pos.map
  '((( punc fpunc) punc)
    (( nn nnp nns nnps ) n)))
@end lisp
All part of speech tags not appearing in the left hand side of a pos map
are left unchanged.

@cindex pronunciation
The third field contains the actual pronunciation of the word.  This
is an arbitrary Lisp S-expression.  In many of the lexicons distributed
with Festival this entry has an internal format, identifying syllable
structure, stress markings and of course the phones themselves.  In
some of our other lexicons we simply list the phones with stress marking
on each vowel.

Some typical example entries are

@lisp
( "walkers" n ((( w oo ) 1) (( k @@ z ) 0)) )
( "present" v ((( p r e ) 0) (( z @@ n t ) 1)) )
( "monument" n ((( m o ) 1) (( n y u ) 0) (( m @@ n t ) 0)) )
@end lisp

@cindex homographs
Note you may have two entries with the same headword; the different
part of speech fields allow differentiation.  For example

@lisp
( "lives" n ((( l ai v z ) 1)) )
( "lives" v ((( l i v z ) 1)) )
@end lisp

@xref{Lookup process}, for a description of how multiple entries with the
same headword are used during lookup.

@cindex lexical stress
By current conventions, single syllable function words should have no
stress marking, while single syllable content words should be stressed.

@emph{NOTE:} the POS field may change in future to contain more complex
formats.  The same lexicon mechanism (but a different lexicon) is
used for holding part of speech tag distributions for the POS prediction
module.

@node Defining lexicons, Lookup process, Lexical entries, Lexicons
@section Defining lexicons

@cindex creating a lexicon
As stated above, lexicons consist of three basic parts (compiled
form, addenda and unknown word method) plus some other declarations.

Each lexicon in the system has a name which allows different lexicons to
be selected from efficiently when switching between voices during
synthesis.  The basic steps involved in a lexicon definition are as
follows.

First a new lexicon must be created with a new name
@lisp
(lex.create "cstrlex")
@end lisp
A phone set must be declared for the lexicon, to allow both
checks on the entries themselves and to allow phone mapping between
different phone sets used in the system
@lisp
(lex.set.phoneset "mrpa")
@end lisp
The phone set must already be declared in the system.

@cindex compiling a lexicon
A compiled lexicon, the construction of which is described below,
may optionally be specified
@lisp
(lex.set.compile.file "/projects/festival/lib/dicts/cstrlex.out")
@end lisp
@cindex letter to sound rules
The method for dealing with unknown words, @xref{Letter to sound rules}, may
be set
@lisp
(lex.set.lts.method 'lts_rules)
(lex.set.lts.ruleset 'nrl)
@end lisp
In this case we are specifying the use of a set of letter to sound rules
originally developed by the U.S. Naval Research Laboratories.  The
default method is to give an error if a word is not found in the addenda
or compiled lexicon.  (This and other options are discussed more fully
below.)

@cindex addenda
@cindex lexicon addenda
Finally, addenda items may be added for words that are known to
be common, but are not in the lexicon and cannot reasonably be analysed by
the letter to sound rules.
@lisp
(lex.add.entry
  '( "awb" n ((( ei ) 1) ((d uh) 1) ((b @@ l) 0) ((y uu) 0) ((b ii) 1))))
(lex.add.entry
  '( "cstr" n ((( s ii ) 1) (( e s ) 1) (( t ii ) 1) (( aa ) 1)) ))
(lex.add.entry
  '( "Edinburgh" n ((( e m ) 1) (( b r @@ ) 0)) ))
@end lisp
Using @code{lex.add.entry} again for the same word and part of speech
will redefine the current pronunciation.  Note these add entries to the
@emph{current} lexicon so it's a good idea to explicitly select the
lexicon before you add addenda entries, particularly if you are doing
this in your own @file{.festivalrc} file.

For large lists, compiled lexicons are best.  The function
@code{lex.compile} takes two filename arguments: a file name containing
a list of lexical entries, and an output file where the compiled lexicon
will be saved.

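For example (the file names are purely illustrative):
@lisp
(lex.compile "/projects/festival/lib/dicts/cstrlex.scm"
             "/projects/festival/lib/dicts/cstrlex.out")
@end lisp
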
Compilation can take some time and may require lots of memory, as all
entries are loaded in, checked and then sorted before being written out
again.  During compilation, if some entry is malformed the reading
process halts with a not so useful message.  Note that if any of your entries
include quotes or double quotes the entries will probably be misparsed
and cause such an error.  In such cases try setting
@lisp
(debug_output t)
@end lisp
@exdent before compilation.  This will print out each entry as it is read in,
which should help to narrow down where the error is.

@node Lookup process, Letter to sound rules, Defining lexicons, Lexicons
@section Lookup process

When looking up a word, either through the C++ interface or the
Lisp interface, a word is identified by its headword and part of
speech.  If no part of speech is specified, @code{nil} is assumed,
which matches any part of speech tag.

The lexicon look up process first checks the addenda; if there is
a full match (head word plus part of speech) it is returned.  If
there is an addenda entry whose head word matches and whose part
of speech is @code{nil}, that entry is returned.

If no match is found in the addenda, the compiled lexicon, if present,
is checked.  Again a match is when both head word and part of speech tag
match, or when either the word being searched for has a part of speech
of @code{nil} or an entry has its tag as @code{nil}.  Unlike the addenda,
if no full head word and part of speech tag match is found, the first
word in the lexicon whose head word matches is returned.  The rationale
is that the letter to sound rules (the next defence) are unlikely to be
better than a given alternate pronunciation of the word with a
different part of speech.  Even more so given that, as there is an entry
with the head word but a different part of speech, this word may have an
unusual pronunciation that the letter to sound rules will have no chance
of producing.

Finally, if the word is not found in the compiled lexicon it is
passed to whatever method is defined for unknown words.  This
is most likely a letter to sound module.  @xref{Letter to sound rules}.

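From the interpreter the lookup process can be exercised directly; for
example, with the entries given earlier (the exact printed form of the
result may differ between versions):
@example
festival> (lex.lookup "walkers" 'n)
("walkers" n (((w oo) 1) ((k @@ z) 0)))
@end example
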
@cindex lexicon hooks
@cindex lookup hooks
Optional pre- and post-lookup hooks can be specified for a lexicon,
as a single Lisp function or list of functions.  The pre-hooks will
be called with two arguments (word and features) and should return
a pair (word and features).  The post-hooks will be given a
lexical entry and should return a lexical entry.  The pre- and
post-hooks do nothing by default.

@cindex compiled lexicons
Compiled lexicons may be created from lists of lexical entries.
A compiled lexicon is @emph{much} more efficient for look up than the
addenda.  Compiled lexicons use a binary search method while the
addenda is searched linearly.  Also it would take a prohibitively
long time to load in a typical full lexicon as an addenda.  If you
have more than a few hundred entries in your addenda you should
seriously consider adding them to your compiled lexicon.

@cindex BEEP lexicon
@cindex CMU lexicon
Because many publicly available lexicons do not have syllable markings
for entries, the compilation method supports automatic syllabification.
Thus for lexicon entries for compilation, two forms for the
pronunciation field are supported: the standard fully syllabified and
stressed form, and a simpler linear form found in at least the BEEP and
CMU lexicons.  If the pronunciation field is a flat atomic list it is
assumed syllabification is required.

@cindex syllabification
Syllabification is done by finding the minimum sonorant position between
vowels.  It is not guaranteed to be accurate but does give a solution
that is sufficient for many purposes.  A little work would probably
improve this significantly.  Of course syllabification requires the
entry's phones to be in the current phone set.  The sonorant values are
calculated from the @emph{vc}, @emph{ctype}, and @emph{cvox} features
for the current phoneset.  See
@file{src/arch/festival/Phone.cc:ph_sonority()} for the actual definition.

Additionally, in this flat structure vowels (atoms starting with a, e, i,
o or u) may have 1, 2 or 0 appended, marking stress.  This is again
following the form found in the BEEP and CMU lexicons.

Some example entries in the flat form (taken from BEEP) are
@lisp
("table" nil (t ei1 b l))
("suspicious" nil (s @@ s p i1 sh @@ s))
@end lisp

Also, if syllabification is required, there is an opportunity to run a set
of "letter-to-sound" rules on the input (actually an arbitrary re-write
rule system).  If the variable @code{lex_lts_set} is set, the lts
ruleset of that name is applied to the flat input before
syllabification.  This allows simple predictable changes, such as
conversion of final r into a longer vowel for English RP from
American-labelled lexicons.

@cindex multiple lexical entries
A list of all matching entries in the addenda and the compiled lexicon
may be found by the function @code{lex.lookup_all}.  This function takes
a word and returns all matching entries irrespective of part of speech.

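For example, with the homograph entries given earlier (again, the exact
printed form may vary):
@example
festival> (lex.lookup_all "lives")
(("lives" n (((l ai v z) 1))) ("lives" v (((l i v z) 1))))
@end example
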
@cindex pre_hooks
@cindex post_hooks
You can optionally intercept the words as they are looked up, and after
they have been found, through @code{pre_hooks} and @code{post_hooks} for
each lexicon.  This allows a function or list of functions to be applied
to a word and features before lookup, or to the resulting entry after
lookup.  The following example shows how to add voice-specific entries
to a general lexicon without affecting other voices that use that
lexicon.

For example, suppose we were trying to use a Scottish English voice with
the US English (cmu) lexicon.  A number of entries will be
inappropriate but we can redefine some entries thus
@lisp
(set! cmu_us_awb::lexicon_addenda
      '(
        ("edinburgh" n (((eh d) 1) ((ax n) 0) ((b r ax) 0)))
        ("poem" n (((p ow) 1) ((y ax m) 0)))
        ("usual" n (((y uw) 1) ((zh ax l) 0)))
        ("air" n (((ey r) 1)))
        ("hair" n (((hh ey r) 1)))
        ("fair" n (((f ey r) 1)))
        ("chair" n (((ch ey r) 1)))))
@end lisp
We can then define a function that checks to see if the word looked
up is in the speaker-specific exception list and uses that entry
instead.
@lisp
(define (cmu_us_awb::cmu_lookup_post entry)
  "(cmu_us_awb::cmu_lookup_post entry)
Speaker specific lexicon addenda."
  (let ((ne
         (assoc_string (car entry) cmu_us_awb::lexicon_addenda)))
    (if ne
        ne
        entry)))
@end lisp
And then for the particular voice set up we need to
add both a selection part @emph{and} a reset part, thus following
the FestVox conventions for voice set up.
@lisp
(define (cmu_us_awb::select_lexicon)

  ...
  (lex.select "cmu")
  ;; Get old var for reset and to append our function to it
  (set! cmu_us_awb::old_cmu_post_hooks
        (lex.set.post_hooks nil))
  (lex.set.post_hooks
   (append cmu_us_awb::old_cmu_post_hooks
           (list cmu_us_awb::cmu_lookup_post)))
  ...
)

...

(define (cmu_us_awb::reset_lexicon)

  ...
  ;; reset CMU's post_hooks back to original
  (lex.set.post_hooks cmu_us_awb::old_cmu_post_hooks)
  ...

)
@end lisp
The above isn't the most efficient way, as the word is looked up first
and then checked against the speaker specific list.

The @code{pre_hooks} functions are called with two arguments, the word
and features; they should return a pair of word and features.

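For example, a @code{pre_hooks} function that forces all words to lower
case before lookup might be written as follows (a sketch; the function
name is invented for illustration):
@lisp
(define (my_lookup_pre word feats)
  "(my_lookup_pre WORD FEATS)
Downcase the word before lookup."
  (cons (downcase word) feats))

(lex.set.pre_hooks (list my_lookup_pre))
@end lisp
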
@node Letter to sound rules, Building letter to sound rules, Lookup process, Lexicons
@section Letter to sound rules

@cindex letter to sound rules
Each lexicon may define what action to take when a word cannot
be found in the addenda or the compiled lexicon. There are
a number of options, which will hopefully be added to as more
general letter to sound rule systems are added.

@cindex unknown words
The method is set by the command
@lisp
(lex.set.lts.method METHOD)
@end lisp
Where @var{METHOD} can be any of the following
@table @samp
@item Error
Throw an error when an unknown word is found (default).
@item lts_rules
Use an externally specified set of letter to sound rules (described
below). The name of the rule set to use is defined with the
@code{lex.lts.ruleset} function. This method runs one
set of rules on an exploded form of the word and assumes the rules
return a list of phonemes (in the appropriate set). If multiple
instances of rules are required use the @code{function} method
described next.
@item none
This returns an entry with a @code{nil} pronunciation field. This will
only be valid in very special circumstances.
@item FUNCTIONNAME
Call the LISP function with this name. This function
is given two arguments: the word and the part of speech. It should
return a valid lexical entry.
@end table

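For example, a minimal function method might hand unknown words to a
rule set and syllabify the result, much as the Welsh example below does
(a sketch; the function and ruleset names here are invented):
@lisp
(define (my_lts_function word features)
  "(my_lts_function WORD FEATURES)
Return a lexical entry for an unknown WORD via the myrules ruleset."
  (list word
        nil
        (lex.syllabify.phstress (lts.apply (downcase word) 'myrules))))

(lex.set.lts.method 'my_lts_function)
@end lisp
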
The basic letter to sound rule system is very simple but is
powerful enough to build reasonably complex letter to sound rules.
Although we've found trained LTS rules to be better than hand written
ones (for complex languages), where no data is available and rules
must be hand written, the following rule formalism is much easier to
use than that generated by the LTS training system (described
in the next section).

@cindex letter to sound rules
@cindex LTS
The basic form of a rule is as follows
@lisp
( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
@end lisp
The interpretation is that if @var{ITEMS} appear in the specified right
and left context then the output string is to contain @var{NEWITEMS}.
Any of @var{LEFTCONTEXT}, @var{RIGHTCONTEXT} or @var{NEWITEMS} may be
empty. Note that @var{NEWITEMS} is written to a different "tape" and hence
cannot feed further rules (within this ruleset). An example is
@lisp
( # [ c h ] C = k )
@end lisp
The special character @code{#} denotes a word boundary, and the symbol
@code{C} denotes the set of all consonants; sets are declared before
rules. This rule states that a @code{ch} at the start of a word
followed by a consonant is to be rendered as the @code{k} phoneme.
Symbols in contexts may be followed by the symbol @code{*} for zero or
more occurrences, or @code{+} for one or more occurrences.

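For example, the following (purely illustrative) rules use these
operators: the first deletes a word-final @code{e} that follows one or
more consonants (the empty right hand side outputs nothing); the second
renders word-initial @code{k n} as a single @code{n} phone:
@lisp
( C+ [ e ] # = )
( # [ k n ] = n )
@end lisp
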
The symbols in the rules are treated as set names if they are declared
as such or as symbols in the input/output alphabets. The symbols
may be more than one character long and the names are case sensitive.

The rules are tried in order until one matches the first (or more)
symbol of the tape. The rule is applied adding the right hand side to
the output tape. The rules are again applied from the start of the list
of rules.

The function used to apply a set of rules, if given an atom, will explode
it into a list of single characters, while if given a list it will use it
as is. This reflects the common usage of wishing to re-write the
individual letters in a word to phonemes but without excluding the
possibility of using the system for more complex manipulations,
such as multi-pass LTS systems and phoneme conversion.

From lisp there are three basic access functions; there
are corresponding functions in the C/C++ domain.

@table @code
@item (lts.ruleset NAME SETS RULES)
Define a new set of lts rules. Where @code{NAME} is the name for this
rule set, @code{SETS} is a list of set definitions of the form
@code{(SETNAME e0 e1 ...)} and @code{RULES} are a list of rules as
described above.
@item (lts.apply WORD RULESETNAME)
Apply the set of rules named @code{RULESETNAME} to @code{WORD}. If
@code{WORD} is a symbol it is exploded into a list of the individual
characters in its print name. If @code{WORD} is a list it is used as
is. If the rules cannot be successfully applied an error is given. The
result of (successful) application is returned in a list.
@item (lts.check_alpha WORD RULESETNAME)
The symbols in @code{WORD} are checked against the input alphabet of the
rules named @code{RULESETNAME}. If they are all contained in that
alphabet @code{t} is returned, else @code{nil}. Note this does not
necessarily mean the rules will successfully apply (contexts may restrict
the application of the rules), but it allows general checking like
numerals, punctuation etc, allowing application of appropriate rule
sets.
@end table

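As a small worked example, the following (purely illustrative) ruleset
covers a toy fragment of English, declaring a set of vowel letters and
rewriting letters to phones; it can then be applied with
@code{lts.apply}:
@lisp
(lts.ruleset
 toy
 ;; Set of vowel letters
 ((V a e i o u))
 ;; Rules: word-initial "ch" before a vowel, then single letters
 (
  ( # [ c h ] V = ch )
  ( [ c ] = k )
  ( [ h ] = h )
  ( [ a ] = ae )
  ( [ t ] = t )
 ))

(lts.apply "chat" 'toy)
@end lisp
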
The letter to sound rule system may be used directly from Lisp
and can easily be used to do relatively complex operations for
analyzing words without requiring modification of the C/C++
system. For example the Welsh letter to sound rule system consists
of three rule sets, first to explicitly identify epenthesis, then to
identify stressed vowels, and finally to rewrite this augmented
letter string to phonemes. This is achieved by
the following function
@lisp
(define (welsh_lts word features)
  (let (epen str wel)
    (set! epen (lts.apply (downcase word) 'newepen))
    (set! str (lts.apply epen 'newwelstr))
    (set! wel (lts.apply str 'newwel))
    (list word
          nil
          (lex.syllabify.phstress wel))))
@end lisp
The LTS method for the Welsh lexicon is set to @code{welsh_lts}, so this
function is called when a word is not found in the lexicon. The
above function first downcases the word and then applies the rulesets in
turn, finally calling the syllabification process and returning a
constructed lexical entry.

@node Building letter to sound rules, Lexicon requirements, Letter to sound rules, Lexicons
@section Building letter to sound rules

As writing letter to sound rules by hand is hard and very time
consuming, an alternative method is also available where a letter to
sound system may be built from a lexicon of the language. This
technique has successfully been used for English (British and American),
French and German. The difficulty and appropriateness of using
letter to sound rules is very language dependent.

The following outlines the processes involved in building a letter to
sound model for a language given a large lexicon of pronunciations.
This technique is likely to work for most European languages (including
Russian) but doesn't seem particularly suitable for languages with very
large alphabets, like Japanese and Chinese. The process described
here is not (yet) fully automatic but the hand intervention required is
small and may easily be done even by people with only very little
knowledge of the language being dealt with.

The process involves the following steps
@itemize @bullet
@item
Pre-processing the lexicon into a suitable training set
@item
Defining the set of allowable pairings of letters to phones. (We intend
to do this fully automatically in future versions.)
@item
Constructing the probabilities of each letter/phone pair.
@item
Aligning letters to an equal set of phones/_epsilons_.
@item
Extracting the data by letter suitable for training.
@item
Building CART models for predicting phone from letters (and context).
@item
Building an additional lexical stress assignment model (if necessary).
@end itemize
All except the first two stages of this are fully automatic.

Before building a model it's wise to think a little about what you want
it to do. Ideally the model is an auxiliary to the lexicon, so only
words not found in the lexicon will require use of the letter to sound
rules. Thus only unusual forms are likely to require the rules. More
precisely the most common words, often having the most non-standard
pronunciations, should probably always be explicitly listed. It is
possible to reduce the size of the lexicon (sometimes drastically) by
removing all entries that the trained LTS model correctly predicts.

Before starting it is wise to consider removing some entries from the
lexicon before training. I typically remove words under 4 letters
and, if part of speech information is available, I remove all function
words, ideally training only from nouns, verbs and adjectives as these
are the most likely forms to be unknown in text. It is useful to have
morphologically inflected and derived forms in the training set as it is
often such variant forms that are not found in the lexicon even though
their root morpheme is. Note that in many forms of text, proper names
are the most common form of unknown word and even the technique
presented here may not adequately cater for that form of unknown words
(especially if the unknown words are non-native names). All this is to
say that this may or may not be appropriate for your task, but the rules
generated by this learning process have, in the examples we've done,
been much better than what we could produce by hand writing rules of the
form described in the previous section.

First preprocess the lexicon into a file of lexical entries to be used
for training, removing function words and changing the head words to
all lower case (this may be language dependent). The entries should be
of the form used as input for Festival's lexicon compilation.
Specifically the pronunciations should be simple lists of phones (no
syllabification). Depending on the language, you may wish to remove the
stressing---for the examples here we have, though later tests suggest
that we should keep it in, even for English. Thus the training set
should look something like
@lisp
("table" nil (t ei b l))
("suspicious" nil (s @@ s p i sh @@ s))
@end lisp
It is best to split the data into a training set and a test set
if you wish to know how well your training has worked. In our
tests we remove every tenth entry and put it in a test set. Note this
will mean our test results are probably better than if we removed,
say, the last ten in every hundred.

The second stage is to define the set of allowable letter to phone
mappings irrespective of context. This can sometimes be initially done
by hand then checked against the training set. Initially construct a
file of the form
@lisp
(require 'lts_build)
(set! allowables
      '((a _epsilon_)
        (b _epsilon_)
        (c _epsilon_)
        ...
        (y _epsilon_)
        (z _epsilon_)
        (# #)))
@end lisp
All letters that appear in the alphabet should (at least) map to
@code{_epsilon_}, including any accented characters that appear in that
language. Note the last two hashes. These are used to denote the
beginning and end of a word and are automatically added during training;
they must appear in the list and should only map to themselves.

To incrementally add to this allowable list run festival as
@lisp
festival allowables.scm
@end lisp
and at the prompt type
@lisp
festival> (cummulate-pairs "oald.train")
@end lisp
with your training file. This will print out each lexical entry
that couldn't be aligned with the current set of allowables. At the
start this will be every entry. Looking at these entries, add
to the allowables to make alignment work. For example if the
following word fails
@lisp
("abate" nil (ah b ey t))
@end lisp
Add @code{ah} to the allowables for letter @code{a}, @code{b} to
@code{b}, @code{ey} to @code{a} and @code{t} to letter @code{t}. After
doing that restart festival and call @code{cummulate-pairs} again.
Incrementally add to the allowable pairs until the number of failures
becomes acceptable. Often there are entries for which there is no real
relationship between the letters and the pronunciation, such as in
abbreviations and foreign words (e.g. "aaa" as "t r ih p ax l ey"). For
the lexicons I've used this technique on, fewer than 10 entries per
thousand fail in this way.

It is worthwhile being consistent in defining your set of allowables.
(At least) two mappings are possible for the letter sequence
@code{ch}---having letter @code{c} go to phone @code{ch} and letter
@code{h} go to @code{_epsilon_}, and also letter @code{c} go to phone
@code{_epsilon_} and letter @code{h} go to @code{ch}. However only
one should be allowed; we preferred @code{c} to go to @code{ch}.

It may also be the case that some letters give rise to more than one
phone. For example the letter @code{x} in English is often pronounced as
the phone combination @code{k} and @code{s}. To allow this, use the
multiphone @code{k-s}. Thus the multiphone @code{k-s} will be predicted
for @code{x} in some contexts and the model will separate it into two
phones, while also ignoring any predicted @code{_epsilon_}s. Note that
multiphone units are relatively rare but do occur. In English, letter
@code{x} gives rise to a few, @code{k-s} in @code{taxi}, @code{g-s} in
@code{example}, and sometimes @code{g-zh} and @code{k-sh} in
@code{luxury}. Others are @code{w-ah} in @code{one}, @code{t-s} in
@code{pizza}, @code{y-uw} in @code{new} (British), @code{ah-m} in
@code{-ism} etc. Three phone multiphones are much rarer but may exist;
they are not supported by this code as is, and such entries should
probably be ignored. Note the @code{-} sign in the multiphone examples
is significant and is used to identify multiphones.

The allowables for OALD end up being
@lisp
(set! allowables
      '
      ((a _epsilon_ ei aa a e@@ @@ oo au o i ou ai uh e)
       (b _epsilon_ b )
       (c _epsilon_ k s ch sh @@-k s t-s)
       (d _epsilon_ d dh t jh)
       (e _epsilon_ @@ ii e e@@ i @@@@ i@@ uu y-uu ou ei aa oi y y-u@@ o)
       (f _epsilon_ f v )
       (g _epsilon_ g jh zh th f ng k t)
       (h _epsilon_ h @@ )
       (i _epsilon_ i@@ i @@ ii ai @@@@ y ai-@@ aa a)
       (j _epsilon_ h zh jh i y )
       (k _epsilon_ k ch )
       (l _epsilon_ l @@-l l-l)
       (m _epsilon_ m @@-m n)
       (n _epsilon_ n ng n-y )
       (o _epsilon_ @@ ou o oo uu u au oi i @@@@ e uh w u@@ w-uh y-@@)
       (p _epsilon_ f p v )
       (q _epsilon_ k )
       (r _epsilon_ r @@@@ @@-r)
       (s _epsilon_ z s sh zh )
       (t _epsilon_ t th sh dh ch d )
       (u _epsilon_ uu @@ w @@@@ u uh y-uu u@@ y-u@@ y-u i y-uh y-@@ e)
       (v _epsilon_ v f )
       (w _epsilon_ w uu v f u)
       (x _epsilon_ k-s g-z sh z k-sh z g-zh )
       (y _epsilon_ i ii i@@ ai uh y @@ ai-@@)
       (z _epsilon_ z t-s s zh )
       (# #)
       ))
@end lisp
Note this is an exhaustive list and (deliberately) says nothing
about the contexts or frequency with which these letter to phone pairs
appear. That information will be generated automatically from the
training set.

Once the number of failed matches is sufficiently low,
let @code{cummulate-pairs} run to completion. This counts the number
of times each letter/phone pair occurs in allowable alignments.

Next call
@lisp
festival> (save-table "oald-")
@end lisp
with the name of your lexicon. This changes the cumulation
table into probabilities and saves it.

Restart festival loading this new table
@lisp
festival allowables.scm oald-pl-table.scm
@end lisp
Now each word can be aligned to an equal-length string of phones,
epsilons and multiphones.
@lisp
festival> (aligndata "oald.train" "oald.train.align")
@end lisp
Do this also for your test set.

This will produce entries like
@lisp
aaronson _epsilon_ aa r ah n s ah n
abandon ah b ae n d ah n
abate ah b ey t _epsilon_
abbe ae b _epsilon_ iy
@end lisp

The next stage is to build features suitable for @file{wagon} to
build models. This is done by
@lisp
festival> (build-feat-file "oald.train.align" "oald.train.feats")
@end lisp
Again do the same for the test set.

Now you need to construct a description file for @file{wagon} for
the given data. This can be done using the script @file{make_wgn_desc}
provided with the speech tools.

Here is an example script for building the models. You will need
to modify it for your particular database but it shows the basic
processes
@example
for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
   # Stop value for wagon
   STOP=2
   echo letter $i STOP $STOP
   # Find training set for letter $i
   cat oald.train.feats |
    awk '@{if ($6 == "'$i'") print $0@}' >ltsdataTRAIN.$i.feats
   # split training set to get heldout data for stepwise testing
   traintest ltsdataTRAIN.$i.feats
   # Extract test data for letter $i
   cat oald.test.feats |
    awk '@{if ($6 == "'$i'") print $0@}' >ltsdataTEST.$i.feats
   # run wagon to predict model
   wagon -data ltsdataTRAIN.$i.feats.train -test ltsdataTRAIN.$i.feats.test \
         -stepwise -desc ltsOALD.desc -stop $STOP -output lts.$i.tree
   # Test the resulting tree against the test data
   wagon_test -heap 2000000 -data ltsdataTEST.$i.feats -desc ltsOALD.desc \
         -tree lts.$i.tree
done
@end example
The script @file{traintest} splits the given file @file{X} into
@file{X.train} and @file{X.test}, with every tenth line in @file{X.test}
and the rest in @file{X.train}.

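If the @file{traintest} script is not to hand, the same split can be
sketched with @file{awk} (a hypothetical stand-in, not the distributed
script):
@example
# Every tenth line of file $1 goes to $1.test, the rest to $1.train
awk -v f="$1" 'NR % 10 == 0 @{print > (f ".test"); next@}
               @{print > (f ".train")@}' "$1"
@end example
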
This script can take a significant amount of time to run, about 6 hours
on a Sun Ultra 140.

Once the models are created they must be collected together into
a single list structure. The trees generated by @file{wagon}
contain full probability distributions at each leaf; at this time
this information can be removed, as only the most probable phone will
actually be predicted. This substantially reduces the size of the
trees.
@lisp
(merge_models 'oald_lts_rules "oald_lts_rules.scm")
@end lisp
(@code{merge_models} is defined within @file{lts_build.scm}.)
The given file will contain a @code{set!} for the given variable
name to an assoc list of letter to trained tree. Note the above
function naively assumes that the letters in the alphabet are
the 26 lower case letters of the English alphabet; you will need
to edit this, adding accented letters if required. Note that
adding "'" (single quote) as a letter is a little tricky in scheme
but can be done---the command @code{(intern "'")} will give you
the symbol for single quote.

To test a set of lts models load the saved model and call
the following function with the test align file
@lisp
festival oald-table.scm oald_lts_rules.scm
festival> (lts_testset "oald.test.align" oald_lts_rules)
@end lisp
The result (after showing all the failed ones) will be a table showing
the results for each letter, for all letters and for complete words.
The failed entries may give some notion of how good or bad the result
is; sometimes it will be simple vowel differences, long versus short,
schwa versus full vowel, other times it may be whole consonants missing.
Remember the ultimate quality of the letter to sound rules is how
adequate they are at providing @emph{acceptable} pronunciations rather
than how good the numeric score is.

@cindex stress assignment
@cindex predicting stress
For some languages (e.g. English) it is necessary to also find a
stress pattern for unknown words. Ultimately for this to work well
you need to know the morphological decomposition of the word.
At present we provide a CART trained system to predict stress
patterns for English. It does get 94.6% correct for an unseen test
set, but that isn't really very good. Later tests suggest that
predicting stressed and unstressed phones directly is actually
better for getting whole words correct, even though the models
do slightly worse on a per phone basis @cite{black98}.

@cindex compressing the lexicon
@cindex reducing the lexicon
@cindex lexicon compression
As the lexicon may be a large part of the system we have also
experimented with removing entries from the lexicon if the letter to
sound rule system (and stress assignment system) can correctly predict
them. For OALD this allows us to halve the size of the lexicon; it
could possibly allow more if a certain amount of fuzzy acceptance was
allowed (e.g. with schwa). For other languages the gain here can be
very significant; for German and French we can reduce the lexicon by
over 90%. The function @code{reduce_lexicon} in
@file{festival/lib/lts_build.scm} was used to do this. The use of the
above technique as a dictionary compression method is discussed in
@cite{pagel98}. A morphological decomposition algorithm, like that
described in @cite{black91}, may even help more.

The technique described in this section and its relative merits with
respect to a number of languages/lexicons and tasks is discussed more
fully in @cite{black98}.

@node Lexicon requirements, Available lexicons, Building letter to sound rules, Lexicons
@section Lexicon requirements

@cindex lexicon requirements
For English there are a number of assumptions made about the lexicon
which are worthy of explicit mention. If you are basically going to use
the existing token rules you should try to include at least the
following in any lexicon that is to work with them.

@itemize @bullet
@item
The letters of the alphabet: when a token is identified as an acronym it
is spelled out. The tokenization assumes that the individual letters of
the alphabet are in the lexicon with their pronunciations. They
should be identified as nouns. (This is to distinguish @code{a} as
a determiner, which can be schwa'd, from @code{a} as a letter, which
cannot.) The part of speech should be @code{nn} by default, but the
value of the variable @code{token.letter_pos} is used and may be
changed if this is not what is required.
@item
One character symbols such as dollar, at-sign, percent etc. It's
difficult to get a complete list and to know what the pronunciations of
some of these are (e.g. hash or pound sign). But the letter to sound
rules cannot deal with them so they need to be explicitly listed. See
the list in the function @code{mrpa_addend} in
@file{festival/lib/dicts/oald/oaldlex.scm}. This list should
also contain the control characters and eight bit characters.
@item
@cindex possessives
The possessive @code{'s} should be in your lexicon as schwa plus voiced
fricative (@code{z}). It should be in twice, once as part of speech type
@code{pos} and once as @code{n} (used in plurals of numbers, acronyms
etc., e.g. 1950's). @code{'s} is treated as a word and is separated from
the tokens it appears with. The post-lexical rule (the function
@code{postlex_apos_s_check}) will delete the schwa and devoice the @code{z}
in appropriate contexts. Note this post-lexical rule brazenly assumes
that the unvoiced fricative in the phoneset is @code{s}. If it
is not in your phoneset, copy the function (it is in
@file{festival/lib/postlex.scm}), change it for your phoneset
and use your version as a post-lexical rule.
@item
@cindex NO digits
Numbers as digits (e.g. "1", "2", "34", etc.) should normally
@emph{not} be in the lexicon. The number conversion routines
convert numbers to words (i.e. "one", "two", "thirty four", etc.).
@item
@cindex unknown words
The word "unknown" or whatever is in the variable
@code{token.unknown_word_name}. This is used in a few obscure cases
when there just isn't anything that can be said (e.g. single characters
which aren't in the lexicon). Some people have suggested it should be
possible to make this a sound rather than a word. I agree, but Festival
doesn't support that yet.
@end itemize

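For example, single letters and symbol characters can be added to a
lexicon's addenda with @code{lex.add.entry}; the entries below are
illustrative only, assuming an mrpa-like phone set:
@lisp
(lex.add.entry '("y" nn (((w ai) 1))))
(lex.add.entry '("%" nn (((p @@) 0) ((s e n t) 1))))
(set! token.letter_pos 'nn)
@end lisp
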
@node Available lexicons, Post-lexical rules , Lexicon requirements, Lexicons
@section Available lexicons

@cindex lexicon
Currently Festival supports a number of different lexicons. They are
all defined in the file @file{lib/lexicons.scm}, each with a number of
common extra words added to their addenda. They are
@table @samp
@item CUVOALD
@cindex CUVOALD lexicon
@cindex Oxford Advanced Learners' Dictionary
The Computer Users Version of the Oxford Advanced Learner's Dictionary
is available from the Oxford Text Archive
@url{ftp://ota.ox.ac.uk/pub/ota/public/dicts/710}. It contains about
70,000 entries and is a part of the BEEP lexicon. It is more consistent
in its marking of stress, though its syllable marking is not what works
best for our synthesis methods. Many syllabic @samp{l}'s, @samp{n}'s,
and @samp{m}'s mess up the syllabification algorithm, making results
sometimes appear over reduced. It is however our current default
lexicon. It is also the only lexicon with part of speech tags that
can be distributed (for non-commercial use).
@item CMU
@cindex CMU lexicon
This is automatically constructed from @file{cmu_dict-0.4}, available
from many places on the net (see @code{comp.speech} archives). It is
not in the mrpa phone set because it is American English pronunciation.
Although mappings exist between its phoneset (@samp{darpa}) and
@samp{mrpa}, the results for British English speakers are not very good.
However this is probably the biggest, most carefully specified lexicon
available. It contains just under 100,000 entries. Our distribution
has been modified to include part of speech tags on words we know to be
homographs.
@item mrpa
@cindex mrpa lexicon
A version of the CSTR lexicon which has been floating about for years.
It contains about 25,000 entries. A new updated free version of
this is due to be released soon.
@item BEEP
@cindex BEEP lexicon
A British English rival to @file{cmu_lex}. BEEP has been made
available by Tony Robinson at Cambridge and is available in many
archives. It contains 163,000 entries and has been converted to the
@samp{mrpa} phoneset (which was a trivial mapping). Although large, it
suffers from a certain randomness in its stress markings, making use of
it for synthesis dubious.
@end table

All of the above lexicons have some distribution restrictions (though
mostly pretty light), but as they are mostly freely available we provide
programs that can convert the originals into Festival's format.

@cindex MOBY lexicon
The MOBY lexicon has recently been released into the public domain and
will be converted into our format soon.

@node Post-lexical rules, , Available lexicons, Lexicons
@section Post-lexical rules

@cindex post-lexical rules
It is the lexicon's job to produce a pronunciation of a given word.
However in most languages the most natural pronunciation of a word
cannot be found in isolation from the context in which it is to be
spoken. This includes such phenomena as reduction, phrase final
devoicing and r-insertion. In Festival this is done by post-lexical
rules.

@code{PostLex} is a module which is run after accent assignment
but before duration and F0 generation. This is because knowledge
of accent position is necessary for vowel reduction and other
post lexical phenomena, and changing the segmental items will
affect durations.

The @code{PostLex} module first applies a set of built in rules (which
could be done in Scheme but for historical reasons are still in C++).
It then applies the functions set in the hook
@code{postlex_rules_hook}. These should be a set of functions that
take an utterance and apply appropriate rules. This should be set up
on a per voice basis.

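For example, a voice set up function might install its rules thus (a
sketch; the function name is invented, while
@code{postlex_apos_s_check} is the distributed rule described below):
@lisp
(define (my_voice_postlex utt)
  "(my_voice_postlex UTT)
Apply this voice's post-lexical rules to UTT."
  (postlex_apos_s_check utt)
  utt)

(set! postlex_rules_hook (list my_voice_postlex))
@end lisp
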
Although a rule system could be devised for post-lexical sound rules it
is unclear what the scope of them should be, so we have left it
completely open. Our vowel reduction model uses a CART decision tree to
predict which syllables should be reduced, while the "'s" rule is very
simple (shown in @file{festival/lib/postlex.scm}).

@cindex apostrophe s
@cindex possessives
The @code{'s} in English may be pronounced in a number of different
ways depending on the preceding context. If the preceding consonant
is a fricative or affricative, and not a palatal, labio-dental or
dental, a schwa is required (e.g. @code{bench's}); otherwise
no schwa is required (e.g. @code{John's}). Also if the previous
phoneme is unvoiced the "s" is rendered as an "s", while in all
other cases it is rendered as a "z".

For our English voices we have a lexical entry for "'s" as a
schwa followed by a "z". We use a post lexical rule function called
@code{postlex_apos_s_check} to modify the basic given form when
required. After lexical lookup the segment relation contains the
concatenation of segments directly from lookup in the lexicon.
Post lexical rules are applied after that.

In the following rule we check each segment to see if it is part of a
word labelled "'s"; if so we check whether we are currently looking at
the schwa or the z part, and test if modification is required
@example
(define (postlex_apos_s_check utt)
  "(postlex_apos_s_check UTT)
Deal with possessive s for English (American and British).  Delete
schwa of 's if previous is not a fricative or affricative, and
change voiced to unvoiced s if previous is not voiced."
  (mapcar
   (lambda (seg)
     (if (string-equal "'s" (item.feat
                             seg "R:SylStructure.parent.parent.name"))
         (if (string-equal "a" (item.feat seg 'ph_vlng))
             (if (and (member_string (item.feat seg 'p.ph_ctype)
                                     '(f a))
                      (not (member_string
                            (item.feat seg "p.ph_cplace")
                            '(d b g))))
                 t ;; don't delete schwa
                 (item.delete seg))
             (if (string-equal "-" (item.feat seg "p.ph_cvox"))
                 (item.set_name seg "s"))))) ;; from "z"
   (utt.relation.items utt 'Segment))
  utt)
@end example

@node Utterances, Text analysis, Lexicons, Top
|
|
@chapter Utterances
|
|
|
|
@cindex utterance
|
|
The utterance structure lies at the heart of Festival. This chapter
|
|
describes its basic form and the functions available
|
|
to manipulate it.
|
|
|
|
@menu
|
|
* Utterance structure:: internal structure of utterances
|
|
* Utterance types:: Type defined synthesis actions
|
|
* Example utterance types:: Some example utterances
|
|
* Utterance modules::
|
|
* Accessing an utterance:: getting the data from the structure
|
|
* Features:: Features and features names
|
|
* Utterance I/O:: Saving and loading utterances
|
|
@end menu
|
|
|
|
@node Utterance structure, Utterance types, , Utterances
|
|
@section Utterance structure
|
|
|
|
@cindex utterance
|
|
@cindex TTS processes
|
|
Festival's basic object for synthesis is the @emph{utterance}. An
utterance represents some chunk of text that is to be rendered as
speech. In general you may think of it as a sentence but in many cases
it won't actually conform to the standard linguistic syntactic form of
a sentence. In general the process of text to speech is to take an
utterance which contains a simple string of characters and convert it
step by step, filling out the utterance structure with more information
until a waveform is built that says what the text contains.

The processes involved in conversion are, in general, as follows.
@table @emph
@item Tokenization
Converting the string of characters into a list of tokens. Typically
this means whitespace separated tokens of the original text string.
@item Token identification
Identification of general types for the tokens. Usually this is trivial
but it requires some work to identify tokens of digits as years, dates,
numbers etc.
@item Token to word
Convert each token to zero or more words, expanding numbers,
abbreviations etc.
@item Part of speech
Identify the syntactic part of speech for the words.
@item Prosodic phrasing
Chunk the utterance into prosodic phrases.
@item Lexical lookup
Find the pronunciation of each word from a lexicon/letter to sound
rule system, including phonetic and syllable structure.
@item Intonational accents
Assign intonation accents to appropriate syllables.
@item Assign duration
Assign a duration to each phone in the utterance.
@item Generate F0 contour (tune)
Generate the tune based on accents etc.
@item Render waveform
Render the waveform from phones, durations and F0 target values. This
itself may take several steps including unit selection (be they
diphones or other sized units), imposition of the desired prosody
(duration and F0) and waveform reconstruction.
@end table
The number of steps and what actually happens may vary and is dependent
on the particular voice selected and the utterance's @emph{type},
see below.
Each of these steps in Festival is achieved by a @emph{module} which
will typically add new information to the utterance structure.

@cindex Utterance structure
@cindex Items
@cindex Relations
An utterance structure consists of a set of @emph{items} which may be
part of one or more @emph{relations}. Items represent things like words
and phones, though they may also be used to represent less concrete
objects like noun phrases, and nodes in metrical trees. An item contains
a set of features (name and value pairs). Relations are typically simple
lists of items or trees of items. For example the @code{Word} relation
is a simple list of items each of which represents a word in the
utterance. Those words will also be in other relations, such as the
@code{SylStructure} relation where the word will be the top of a tree
structure containing its syllables and segments.

Unlike previous versions of the system, items (then called stream items)
are not tied to any particular relation (or stream); they are merely
part of the relations they are within. Importantly this allows much more
general relations to be made over items than was allowed in the previous
system. This new architecture is the continuation of our goal
of providing a general efficient structure for representing complex
interrelated utterance objects.

@cindex Festival relations
The architecture is fully general and new items and relations may
be defined at run time, such that new modules may use any relations
they wish. However within our standard English (and other) voices
we have used a specific set of relations, as follows.
@table @emph
@item Token
a list of trees. This is first formed as a list of tokens found
in a character text string. Each root's daughters are the @code{Word}s
that the token is related to.
@item Word
a list of words. These items will also appear as daughters (leaf nodes)
of the @code{Token} relation. They may also appear in the @code{Syntax}
relation (as leaves) if the parser is used. They will also be leaves
of the @code{Phrase} relation.
@item Phrase
a list of trees. This is a list of phrase roots whose daughters are
the @code{Word}s within those phrases.
@item Syntax
a single tree. This, if the probabilistic parser is called, is a syntactic
binary branching tree over the members of the @code{Word} relation.
@item SylStructure
a list of trees. This links the @code{Word}, @code{Syllable} and
@code{Segment} relations. Each @code{Word} is the root of a tree
whose immediate daughters are its syllables, and their daughters in
turn are its segments.
@item Syllable
a list of syllables. Each member will also be in the
@code{SylStructure} relation. In that relation its parent will be the
word it is in and its daughters will be the segments that are in it.
Syllables are also in the @code{Intonation} relation giving links to
their related intonation events.
@item Segment
a list of segments (phones). Each member (except silences) will be a leaf
node in the @code{SylStructure} relation. These may also be in the
@code{Target} relation linking them to F0 target points.
@item IntEvent
a list of intonation events (accents and boundaries). These are related
to syllables through the @code{Intonation} relation as leaves of that
relation. Thus their parent in the @code{Intonation} relation is the
syllable these events are attached to.
@item Intonation
a list of trees relating syllables to intonation events. Roots of
the trees in @code{Intonation} are @code{Syllable}s and their daughters
are @code{IntEvent}s.
@item Wave
a single item with a feature called @code{wave} whose value
is the generated waveform.
@end table
This is a non-exhaustive list; some modules may add other relations,
and not all utterances will have all these relations, but the above
is the general case.
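
These relations can be inspected from the Scheme interpreter. For
example, after synthesizing a @code{Text} utterance the relation names
may be listed as follows (the exact set returned depends on the voice
and modules used, so treat this as illustrative):
@lisp
festival> (set! utt1 (SynthText "hello world"))
#<Utterance 0x40a13f98>
festival> (utt.relationnames utt1)
(Token Word Phrase Syntax SylStructure Syllable Segment
 IntEvent Intonation Target Wave)
@end lisp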

@node Utterance types, Example utterance types, Utterance structure, Utterances
@section Utterance types

@cindex utterance types
@cindex @code{defUttType}
@cindex @code{utt.synth}
The primary purpose of types is to define which modules are to be
applied to an utterance. @code{UttTypes} are defined in
@file{lib/synthesis.scm}. The function @code{defUttType} defines which
modules are to be applied to an utterance of that type. When the function
@code{utt.synth} is called it applies this list of modules to an utterance
before waveform synthesis is called.

For example when a @code{Segments} type utterance is synthesized it need
only have its values loaded into a @code{Segment} relation and a
@code{Target} relation, then the low level waveform synthesis module
@code{Wave_Synth} is called. This is defined as follows
@lisp
(defUttType Segments
  (Initialize utt)
  (Wave_Synth utt))
@end lisp
A more complex type is the @code{Text} type utterance which requires many
more modules to be called before a waveform can be synthesized
@lisp
(defUttType Text
  (Initialize utt)
  (Text utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Intonation utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt))
@end lisp
@cindex @code{Initialize}
The @code{Initialize} module should normally be called for all
types. It loads the necessary relations from the input form
and deletes all other relations (if any exist) ready for synthesis.

Modules may be directly defined as C/C++ functions and declared with a
Lisp name, or as simple functions in Lisp that check some global parameter
before calling a specific module (e.g. choosing between different
intonation modules).

These types are used when calling the function
@code{utt.synth}, and individual modules may be called explicitly by
hand if required.
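
For example, the modules named in the @code{Text} type above may be
applied one at a time, which is useful for inspecting intermediate
results. The following sketch (stopping after part of speech tagging)
assumes a voice has already been selected:
@lisp
(set! utt1 (Utterance Text "Two plus two."))
(Initialize utt1)
(Text utt1)     ;; tokenize the text
(Token utt1)    ;; tokens to words
(POS utt1)      ;; part of speech tags
(utt.relation.items utt1 'Word)  ;; inspect the Word items so far
@end lisp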

@cindex @code{defSynthType}
@cindex SynthTypes
Because we expect waveform synthesis methods to themselves become
complex, with a defined set of functions to select, join, and modify
units, we now support an additional notion of @code{SynthTypes}. Like
@code{UttTypes} these define a set of functions to apply
to an utterance. These may be defined using the @code{defSynthType}
function. For example
@lisp
(defSynthType Festival
  (print "synth method Festival")

  (print "select")
  (simple_diphone_select utt)

  (print "join")
  (cut_unit_join utt)

  (print "impose")
  (simple_impose utt)
  (simple_power utt)

  (print "synthesis")
  (frames_lpc_synthesis utt))
@end lisp
A @code{SynthType} is selected by naming it as the value of the
parameter @code{Synth_Method}.
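
Selecting the synthesis method defined above follows the same pattern
as the other parameter settings in this chapter:
@lisp
(Parameter.set 'Synth_Method 'Festival)
@end lisp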

@cindex synthesis hooks
@cindex @code{after_analysis_hooks}
@cindex @code{after_synth_hooks}
@cindex @code{before_synth_hooks}
@cindex talking head
During the application of the function @code{utt.synth} three hooks
are applied. These allow additional control of the synthesis
process. @code{before_synth_hooks} is applied before any modules are
applied. @code{after_analysis_hooks} is applied at the start of
@code{Wave_Synth} when all text, linguistic and prosodic processing have
been done. @code{after_synth_hooks} is applied after all modules have
been applied. These are useful for things such as altering the volume
of a voice that happens to be quieter than others, or for example
outputting information for a talking head before waveform synthesis
occurs so that preparation of the facial frames and synthesis of the
waveform may be done in parallel. (See @file{festival/examples/th-mode.scm}
for an example use of these hooks for a talking head text mode.)
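
For instance, a hook that boosts the gain of a quiet voice might be
added as follows (a sketch; the factor of 1.9 is arbitrary, and
@code{utt.wave.rescale} is described under utterance I/O below):
@lisp
(define (boost_gain utt)
  "Rescale the synthesized waveform by a fixed gain factor."
  (utt.wave.rescale utt 1.9)
  utt)

(set! after_synth_hooks (list boost_gain))
@end lisp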

@node Example utterance types, Utterance modules, Utterance types, Utterances
@section Example utterance types

@cindex utterance examples
A number of utterance types are currently supported. It is easy to add
new ones but the standard distribution includes the following.

@table @code
@item Text
@cindex @code{Text} utterance
Raw text as a string.
@lisp
(Utterance Text "This is an example")
@end lisp
@item Words
@cindex @code{Words} utterance
A list of words.
@lisp
(Utterance Words (this is an example))
@end lisp
Words may be atomic or lists if further features need to be specified.
For example to specify a word and its part of speech you
can use
@lisp
(Utterance Words (I (live (pos v)) in (Reading (pos n) (tone H-H%))))
@end lisp
Note: the use of the tone feature requires an intonation mode that
supports it.

Any feature and value named in the input will be added to the Word
item.
@item Phrase
This allows explicit phrasing and features on Tokens to be specified.
The input consists of a list of phrases, each containing a list of tokens.
@lisp
(Utterance
 Phrase
 ((Phrase ((name B))
    I saw the man
    (in ((EMPH 1)))
    the park)
  (Phrase ((name BB))
    with the telescope)))
@end lisp
ToBI tones and accents may also be specified on Tokens but these will
only take effect if the selected intonation method uses them.
@item Segments
@cindex @code{Segments} utterance
This allows specification of segments, durations and F0 target
values.
@lisp
(Utterance
 Segments
 ((# 0.19)
  (h 0.055 (0 115))
  (@@ 0.037 (0.018 136))
  (l 0.064)
  (ou 0.208 (0.0 134) (0.100 135) (0.208 123))
  (# 0.19)))
@end lisp
Note the times are in @emph{seconds} NOT milliseconds. The format of
each segment entry is segment name, duration in seconds, and a list of
target values. Each target value consists of a pair of a position into
the segment (in seconds) and an F0 value in Hz.
@item Phones
@cindex @code{Phones} utterance
This allows a simple specification of a list of phones. Synthesis
specifies fixed durations (specified in @code{FP_duration}, default 100
ms) and monotone intonation (specified in @code{FP_F0}, default 120 Hz).
This may be used for simple checks for waveform synthesizers etc.
@lisp
(Utterance Phones (# h @@ l ou #))
@end lisp
@cindex @code{SayPhones}
Note the function @code{SayPhones} allows synthesis and playing of
lists of phones through this utterance type.
@item Wave
@cindex @code{Wave} utterance
A waveform file. Synthesis here simply involves loading
the file.
@lisp
(Utterance Wave fred.wav)
@end lisp
@end table
@cindex @code{Tokens} utterance
@cindex @code{SegF0} utterance
Others are supported, as defined in @file{lib/synthesis.scm}, but are
used internally by various parts of the system. These include
@code{Tokens} used in TTS, and @code{SegF0} used by @code{utt.resynth}.

@node Utterance modules, Accessing an utterance, Example utterance types, Utterances
@section Utterance modules

@cindex modules
The module is the basic unit that does the work of synthesis. Within
Festival there are duration modules, intonation modules, wave synthesis
modules etc. As stated above, the utterance type defines the set of
modules which are to be applied to the utterance. These modules in turn
will create relations and items so that ultimately a waveform is
generated, if required.

@cindex Parameters
Many of the chapters in this manual are solely concerned with particular
modules in the system. Note that many modules have internal choices,
such as which duration method to use or which intonation method to
use. Such general choices are often made through the @code{Parameter}
system. Parameters may be set for different features like
@code{Duration_Method}, @code{Synth_Method} etc. Formerly the values
for these parameters were atomic values but now they may be the
functions themselves. For example, to select the Klatt duration rules
@lisp
(Parameter.set 'Duration_Method Duration_Klatt)
@end lisp
This allows new modules to be added without requiring changes to
the central Lisp functions such as @code{Duration}, @code{Intonation},
and @code{Wave_Synth}.

@node Accessing an utterance, Features, Utterance modules, Utterances
@section Accessing an utterance

There are a number of standard functions that allow one to access parts
of an utterance and traverse through it.

@cindex @code{utt.relation} functions
@cindex @code{item} functions
Functions exist in Lisp (and of course C++) for accessing an utterance.
The Lisp access functions are
@table @samp
@item (utt.relationnames UTT)
returns a list of the names of the relations currently created in @code{UTT}.
@item (utt.relation.items UTT RELATIONNAME)
returns a list of all items in @code{RELATIONNAME} in @code{UTT}. This
is nil if no relation of that name exists. Note that for tree relations
this will give the items in pre-order.
@item (utt.relation_tree UTT RELATIONNAME)
A Lisp tree representation of the items in @code{RELATIONNAME} in
@code{UTT}. The Lisp bracketing reflects the tree structure in the
relation.
@item (utt.relation.leafs UTT RELATIONNAME)
A list of all the leafs of the items in @code{RELATIONNAME} in
@code{UTT}. Leafs are defined as those items with no daughters within
that relation. For simple list relations @code{utt.relation.leafs} and
@code{utt.relation.items} will return the same thing.
@item (utt.relation.first UTT RELATIONNAME)
returns the first item in @code{RELATIONNAME}. Returns @code{nil}
if this relation contains no items.
@item (utt.relation.last UTT RELATIONNAME)
returns the last (the most next) item in @code{RELATIONNAME}. Returns
@code{nil} if this relation contains no items.
@item (item.feat ITEM FEATNAME)
returns the value of feature @code{FEATNAME} in @code{ITEM}. @code{FEATNAME}
may be a feature name, feature function name, or pathname (see below),
allowing reference to other parts of the utterance this item is in.
@item (item.features ITEM)
Returns an assoc list of feature-value pairs of all local features on
this item.
@item (item.name ITEM)
Returns the name of this @code{ITEM}. This could also be accessed
as @code{(item.feat ITEM 'name)}.
@item (item.set_name ITEM NEWNAME)
Sets the name on @code{ITEM} to be @code{NEWNAME}. This is equivalent to
@code{(item.set_feat ITEM 'name NEWNAME)}.
@item (item.set_feat ITEM FEATNAME FEATVALUE)
sets the value of @code{FEATNAME} to @code{FEATVALUE} in @code{ITEM}.
@code{FEATNAME} should be a simple name and not refer to next,
previous or other relations via links.
@item (item.relation ITEM RELATIONNAME)
Return the item as viewed from @code{RELATIONNAME}, or @code{nil} if
@code{ITEM} is not in that relation.
@item (item.relationnames ITEM)
Return a list of relation names that this item is in.
@item (item.relationname ITEM)
Return the relation name that this item is currently being viewed as.
@item (item.next ITEM)
Return the next item in @code{ITEM}'s current relation, or @code{nil}
if there is no next.
@item (item.prev ITEM)
Return the previous item in @code{ITEM}'s current relation, or @code{nil}
if there is no previous.
@item (item.parent ITEM)
Return the parent of @code{ITEM} in @code{ITEM}'s current relation, or
@code{nil} if there is no parent.
@item (item.daughter1 ITEM)
Return the first daughter of @code{ITEM} in @code{ITEM}'s current relation, or
@code{nil} if there are no daughters.
@item (item.daughter2 ITEM)
Return the second daughter of @code{ITEM} in @code{ITEM}'s current relation, or
@code{nil} if there is no second daughter.
@item (item.daughtern ITEM)
Return the last daughter of @code{ITEM} in @code{ITEM}'s current relation, or
@code{nil} if there are no daughters.
@item (item.leafs ITEM)
Return a list of all leaf items (those with no daughters) dominated
by this item.
@item (item.next_leaf ITEM)
Find the next item in this relation that has no daughters. Note this
may traverse up the tree from this point to search for such an item.

@end table

As from 1.2 the utterance structure may be fully manipulated from
Scheme. Relations and items may be created and deleted, as easily
as they can in C++.
@table @samp
@item (utt.relation.present UTT RELATIONNAME)
returns @code{t} if a relation named @code{RELATIONNAME} is present,
@code{nil} otherwise.
@item (utt.relation.create UTT RELATIONNAME)
Creates a new relation called @code{RELATIONNAME}. If this relation
already exists it is deleted first and items in the relation are
dereferenced from it (deleting the items if they are no longer referenced
by any relation). Thus @code{utt.relation.create} guarantees an empty
relation.
@item (utt.relation.delete UTT RELATIONNAME)
Deletes the relation called @code{RELATIONNAME} in @code{UTT}. All items in
that relation are dereferenced from the relation and if they are no
longer in any relation the items themselves are deleted.
@item (utt.relation.append UTT RELATIONNAME ITEM)
Append @code{ITEM} to the end of the relation named @code{RELATIONNAME} in
@code{UTT}. Returns @code{nil} if there is no relation named
@code{RELATIONNAME} in @code{UTT}, otherwise returns the item
appended. This new item becomes the last in the top list.
@code{ITEM} may be an item itself (in this or another relation)
or a LISP description of an item, which consists of a list containing
a name and a set of feature value pairs. If @code{ITEM} is @code{nil}
or unspecified a new empty item is added. If @code{ITEM} is already
in this relation it is dereferenced from its current position (and
an empty item re-inserted).
@item (item.insert ITEM1 ITEM2 DIRECTION)
Insert @code{ITEM2} into @code{ITEM1}'s relation in the direction
specified by @code{DIRECTION}. @code{DIRECTION} may take the
values @code{before}, @code{after}, @code{above} and @code{below}.
If unspecified, @code{after} is assumed. Note it is not recommended
to insert above and below; the functions @code{item.insert_parent}
and @code{item.append_daughter} should normally be used for tree building.
Inserting using @code{before} and @code{after} within daughters is
perfectly safe.
@item (item.append_daughter PARENT DAUGHTER)
Append @code{DAUGHTER}, an item or a description of an item, to
the item @code{PARENT} in @code{PARENT}'s relation.
@item (item.insert_parent DAUGHTER NEWPARENT)
Insert a new parent above @code{DAUGHTER}. @code{NEWPARENT} may
be an item or the description of an item.
@item (item.delete ITEM)
Delete this item from all relations it is in. All daughters of this
item in each relation are also removed from the relation (which may in
turn cause them to be deleted if they cease to be referenced by any
other relation).
@item (item.relation.remove ITEM)
Remove this item from this relation, and any of its daughters. Other
relations this item is in remain untouched.
@item (item.move_tree FROM TO)
Move the item @code{FROM} to the position of @code{TO} in @code{TO}'s
relation. @code{FROM} will often be in the same relation as @code{TO}
but that isn't necessary. The contents of @code{TO} are dereferenced:
its daughters are saved, then the descendants of @code{FROM} are
recreated under the new @code{TO}, then @code{TO}'s previous
daughters are dereferenced. The order of this is important as @code{FROM}
may be part of @code{TO}'s descendants. Note that if @code{TO}
is part of @code{FROM}'s descendants no moving occurs and @code{nil}
is returned. For example to remove all punctuation terminal nodes in
the Syntax relation the call would be something like
@lisp
(define (syntax_remove_punc p)
  (if (string-equal "punc" (item.feat (item.daughter2 p) "pos"))
      (item.move_tree (item.daughter1 p) p)
      (mapcar syntax_remove_punc (item.daughters p))))
@end lisp
@item (item.exchange_trees ITEM1 ITEM2)
Exchange @code{ITEM1} and @code{ITEM2} and their descendants in
@code{ITEM2}'s relation. If @code{ITEM1} is within @code{ITEM2}'s
descendants or vice versa @code{nil} is returned and no exchange takes
place. If @code{ITEM1} is not in @code{ITEM2}'s relation, no
exchange takes place.
@end table
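
The creation functions above may be combined to build relations from
scratch. The following sketch (the words and their features are purely
illustrative) creates a @code{Word} relation in an empty utterance and
appends two items described as name/feature-value lists:
@lisp
(set! utt1 (Utterance Text ""))
(utt.relation.create utt1 'Word)
(utt.relation.append utt1 'Word '(hello ((pos uh))))
(utt.relation.append utt1 'Word '(world ((pos nn))))
(mapcar item.name (utt.relation.items utt1 'Word))
@end lisp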

Daughters of a node are actually represented as a list whose first
daughter is doubly linked to the parent. Although being aware of
this structure may be useful, it is recommended that all access go through
the tree specific functions @code{*.parent} and @code{*.daughter*}
which properly deal with the structure; thus if the internal structure
ever changes in the future only these tree access functions need be
updated.

With the above functions quite elaborate utterance manipulations can
be performed, for example in post-lexical rules where modifications
to the segments are required based on the words and their context.
@xref{Post-lexical rules}, for an example of using various
utterance access functions.

@node Features, Utterance I/O, Accessing an utterance, Utterances
@section Features

@cindex features
In previous versions items had a number of predefined features. This is
no longer the case and all features are optional. In particular the
@code{start} and @code{end} features are no longer fixed, though those
names are still used in the relations where they are appropriate.
Specific functions are provided for the @code{name} feature but they are
just shorthand for normal feature access. Simple features directly access
the features in the underlying @code{EST_Feature} class in an item.

In addition to simple features there is a mechanism for relating
functions to names; thus accessing a feature may actually call a
function. For example the feature @code{num_syls} is defined as a
feature function which will count the number of syllables in the
given word, rather than simply accessing a pre-existing feature. Feature
functions are usually dependent on the particular relation the
item is in, e.g. some feature functions are only appropriate for
items in the @code{Word} relation, or only appropriate for those in the
@code{IntEvent} relation.

The third aspect of feature names is a path component. These are
parts of the name (each ending in a @code{.}) that indicate some
traversal of the utterance structure. For example the feature
@code{name} will access the name feature on the given item. The
feature @code{n.name} will return the name feature on the next item
(in that item's relation). A number of basic direction
operators are defined.
@table @code
@item n.
next
@item p.
previous
@item nn.
next next
@item pp.
previous previous
@item parent.
parent
@item daughter1.
first daughter
@item daughter2.
second daughter
@item daughtern.
last daughter
@item first.
most previous item
@item last.
most next item
@end table
You may also specify traversal to another relation, through
the @code{R:<relationname>.} operator. For example given an item
in the @code{Syllable} relation, @code{R:SylStructure.parent.name} would
give the name of the word the syllable is in.

Some more complex examples are as follows, assuming we are starting
from an item in the @code{Syllable} relation.
@table @samp
@item stress
This item's lexical stress
@item n.stress
The next syllable's lexical stress
@item p.stress
The previous syllable's lexical stress
@item R:SylStructure.parent.name
The word this syllable is in
@item R:SylStructure.parent.R:Word.n.name
The word next to the word this syllable is in
@item n.R:SylStructure.parent.name
The word the next syllable is in
@item R:SylStructure.daughtern.ph_vc
The phonetic feature @code{vc} of the final segment in this syllable.
@end table
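Such pathnames are given directly to @code{item.feat}. For example,
starting from the first syllable of an already synthesized utterance
@code{utt1}, one might write (a sketch; the values returned depend on
the utterance):
@lisp
(set! syl (utt.relation.first utt1 'Syllable))
(item.feat syl 'stress)
(item.feat syl "R:SylStructure.parent.name")
(item.feat syl "n.R:SylStructure.parent.name")
@end lisp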
A list of all feature functions is given in an appendix of this
document. @xref{Feature functions}. New functions may also be added
in Lisp.

In C++ feature values are of class @emph{EST_Val} which may be a string,
int, or a float (or any arbitrary object). In Scheme this distinction
cannot always be made and sometimes when you expect an int you
actually get a string. Care should be taken to ensure the right matching
functions are used in Scheme. It is recommended you use
@code{string-append} or @code{string-match} as they will always work.

If a pathname does not identify a valid path for the particular
item (e.g. there is no next) @code{"0"} is returned.

@cindex CART trees
@cindex linear regression
When collecting data from speech databases it is often useful to collect
a whole set of features from all utterances in a database. These
features can then be used for building various models (both CART tree
models and linear regression models use these feature names).

A number of functions exist to help in this task. For example
@lisp
(utt.features utt1 'Word '(name pos p.pos n.pos))
@end lisp
will return a list of the name and part of speech context for each
word in the utterance.

@xref{Extracting features}, for an example of extracting sets
of features from a database for use in building stochastic models.

@node Utterance I/O, , Features, Utterances
@section Utterance I/O

A number of functions are available to allow an utterance's
structure to be made available to other programs.

@cindex @code{utt.save}
@cindex @code{utt.load}
The whole structure, all relations, items and features, may be
saved in an ASCII format using the function @code{utt.save}. This
file may be reloaded using the @code{utt.load} function. Note the
waveform is not saved in this form.
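
A typical save and reload cycle might look like the following (the
filename is illustrative); passing @code{nil} as the first argument to
@code{utt.load} creates a new utterance from the file:
@lisp
(utt.save utt1 "utt1.utt")
(set! utt2 (utt.load nil "utt1.utt"))
@end lisp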

@cindex saving the waveform
@cindex resampling
@cindex rescaling a waveform
@cindex @code{utt.save.wave}
@cindex @code{utt.import.wave}
@cindex loading a waveform
Individual aspects of an utterance may be selectively saved. The
waveform itself may be saved using the function @code{utt.save.wave}.
This will save the waveform in the named file in the format specified
in the @code{Parameter} @code{Wavefiletype}. All formats supported by
the Edinburgh Speech Tools are valid, including @code{nist}, @code{esps},
@code{sun}, @code{riff}, @code{aiff}, @code{raw} and @code{ulaw}. Note
the functions @code{utt.wave.rescale} and @code{utt.wave.resample} may
be used to change the gain and sample frequency of the waveform before
saving it. A waveform may be imported into an existing utterance with
the function @code{utt.import.wave}. This is specifically designed to
allow external methods of waveform synthesis. However, if you just wish
to play an external wave or make it into an utterance you should
consider the utterance @code{Wave} type.

@cindex saving segments
@cindex saving relations
@cindex @code{utt.save.segs}
The segments of an utterance may be saved in a file using the function
@code{utt.save.segs} which saves the segments of the named utterance in
xlabel format. Any other relation may also be saved using the more
general @code{utt.save.relation} which takes the additional argument of
a relation name. The names of each item and the end feature of each
item are saved in the named file, again in xlabel format; other features
are saved in extra fields. For more elaborate saving methods you can
easily write a Scheme function to save data in an utterance in whatever
format is required. See the file @file{lib/mbrola.scm} for an example.
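
As a minimal sketch of such a function (the output format here is
invented for illustration), the following writes each segment's name
and end time, one per line, to the named file:
@lisp
(define (save_seg_ends utt filename)
  "Write 'name end' for each segment in UTT to FILENAME."
  (let ((fd (fopen filename "w")))
    (mapcar
     (lambda (seg)
       (format fd "%s %s\n" (item.name seg) (item.feat seg 'end)))
     (utt.relation.items utt 'Segment))
    (fclose fd)
    utt))
@end lisp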

@cindex display
@cindex Xwaves
A simple function to allow the displaying of an utterance in
Entropic's Xwaves tool is provided by the function @code{display}.
It simply saves the waveform and the segments and sends appropriate
commands to the (already running) Xwaves and xlabel programs.

@cindex resynthesis
@cindex synthesis of natural utterances
A function to synthesize an externally specified utterance is provided
by @code{utt.resynth} which takes two filename arguments, an xlabel
segment file and an F0 file. This function loads, synthesizes and plays
an utterance synthesized from these files. The loading is provided by
the underlying function @code{utt.load.segf0}.

@node Text analysis, POS tagging, Utterances, Top
@chapter Text analysis

@menu
* Tokenizing:: Splitting text into tokens
* Token to word rules::
* Homograph disambiguation:: "Wed 5 may wind US Sen up"
@end menu

@node Tokenizing, Token to word rules, , Text analysis
@section Tokenizing

@cindex tokenizing
@cindex whitespace
@cindex punctuation
A crucial stage in text processing is the initial tokenization of text.
A @emph{token} in Festival is an atom separated by whitespace from the
rest of a text file (or string). If punctuation for the current language
is defined, characters matching that punctuation are removed from the
beginning and end of a token and held as features of the token. The
default list of characters to be treated as whitespace is defined as
@lisp
(defvar token.whitespace " \t\n\r")
@end lisp
While the default set of punctuation characters is
@lisp
(defvar token.punctuation "\"'`.,:;!?()@{@}[]")
(defvar token.prepunctuation "\"'`(@{[")
@end lisp
These are declared in @file{lib/token.scm} but may be changed
for different languages, text modes etc.
|
|
|
|
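For instance, a text mode that should not treat parentheses as
punctuation could simply reset these variables.  The values here are
illustrative only.
@lisp
;; illustrative: drop () from the punctuation sets for this text mode
(set! token.punctuation "\"'`.,:;!?@{@}[]")
(set! token.prepunctuation "\"'`@{[")
@end lisp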
@node Token to word rules, Homograph disambiguation, Tokenizing , Text analysis
@section Token to word rules

@cindex tokens to words
Tokens are further analysed into lists of words.  A word
is an atom that can be given a pronunciation by the lexicon (or
letter to sound rules).  A token may give rise to a number
of words or none at all.

For example the basic tokens
@example
This pocket-watch was made in 1983.
@end example
would give a word relation of
@example
this pocket watch was made in nineteen eighty three
@end example

Because the relationship between tokens and words is in some cases
complex, a user function may be specified for translating tokens into
words.  This is designed to deal with things like numbers, email
addresses, and other non-obvious pronunciations of tokens as zero or
more words.  Currently a builtin function
@code{builtin_english_token_to_words} offers much of the necessary
functionality for English but a user may further customize this.

If the user defines a function @code{token_to_words}, which takes two
arguments, a token item and a token name, it will be called by the
@code{Token_English} and @code{Token_Any} modules.  A substantial
example is given as @code{english_token_to_words} in
@file{festival/lib/token.scm}.

That example function is quite elaborate and covers most of the
common multi-word tokens in English including numbers, money symbols,
Roman numerals, dates, times, plurals of symbols, number ranges,
telephone numbers and various other symbols.

Let us look at the treatment of one particular phenomenon which shows
the use of these rules.  Consider the expression "$12 million" which
should be rendered as the words "twelve million dollars".  Note the word
"dollars", which is introduced by the "$" sign, ends up after the end of
the expression.  There are two cases we need to deal with as there are
two tokens.  The first condition in the @code{cond} checks if the
current token name is a money symbol, while the second condition checks
that the following word is a magnitude (million, billion, trillion,
zillion etc.)  If that is the case the "$" is removed and the remaining
numbers are pronounced, by calling the builtin token to word function.
The second condition deals with the second token.  It confirms the
previous token is a money value (the same regular expression as before)
and then returns the word followed by the word "dollars".  If it is
neither of these forms then the builtin function is called.
@lisp
(define (token_to_words token name)
  "(token_to_words TOKEN NAME)
Returns a list of words for NAME from TOKEN."
  (cond
   ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches (item.feat token "n.name") ".*illion.?"))
    (builtin_english_token_to_words token (string-after name "$")))
   ((and (string-matches (item.feat token "p.name")
                         "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches name ".*illion.?"))
    (list
     name
     "dollars"))
   (t
    (builtin_english_token_to_words token name))))
@end lisp
It is valid to make some conditions return no words, though some care
should be taken with that, as punctuation information may no longer be
available to later processing if there are no words related to
a token.
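As a sketch of returning no words, the following hypothetical rule
silently drops tokens consisting only of asterisks (for example,
decorative markup) while deferring everything else to the builtin
function.  The asterisk convention here is purely illustrative.
@lisp
(define (token_to_words token name)
  (cond
   ((string-matches name "\\*+")
    nil)   ; produce no words at all for this token
   (t
    (builtin_english_token_to_words token name))))
@end lisp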
@node Homograph disambiguation, , Token to word rules, Text analysis
@section Homograph disambiguation

@cindex homographs
Not all tokens can be rendered as words easily.  Their context may affect
the way they are to be pronounced.  For example in the
utterance
@example
On May 5 1985, 1985 people moved to Livingston.
@end example
@exdent the tokens "1985" should be pronounced differently: the first as
a year, "nineteen eighty five", the second as a quantity, "one
thousand nine hundred and eighty five".  Numbers may also be pronounced
as ordinals, as in the "5" above; it should be "fifth" rather than
"five".

Also, the pronunciation of certain words cannot simply be found from
their orthographic form alone.  Linguistic part of speech tags help to
disambiguate a large class of homographs, e.g. "lives".  A part of
speech tagger is included in Festival and discussed in @ref{POS
tagging}.  But even part of speech isn't sufficient in a number of
cases.  Words such as "bass", "wind", "bow" etc cannot be distinguished
by part of speech alone; some semantic information is also required.  As
full semantic analysis of text is outwith the realms of Festival's
capabilities some other method for disambiguation is required.

Following the work of @cite{yarowsky96} we have included a method
by which identified tokens may be further labelled with extra tags to
help identify their type.  Yarowsky uses @emph{decision lists} to
identify different types for homographs.  Decision lists are
a restricted form of decision trees which have some advantages
over full trees; they are easier to build and Yarowsky has shown
them to be adequate for typical homograph resolution.
@subsection Using disambiguators

Festival offers a method for assigning a @code{token_pos} feature to
each token.  It does so using Yarowsky-type disambiguation techniques.
A list of disambiguators can be provided in the variable
@code{token_pos_cart_trees}.  Each disambiguator consists of a regular
expression and a CART tree (which may be a decision list as they have the
same format).  If a token matches the regular expression the CART tree
is applied to the token and the resulting class is assigned
to the token via the feature @code{token_pos}.  This is done
by the @code{Token_POS} module.

For example, the following disambiguator distinguishes "St" (street and
saint) and "Dr" (doctor and drive).
@lisp
("\\([dD][Rr]\\|[Ss][tT]\\)"
 ((n.name is 0)
  ((p.cap is 1)
   ((street))
   ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
    ((street))
    ((title))))
  ((punc matches ".*,.*")
   ((street))
   ((p.punc matches ".*,.*")
    ((title))
    ((n.cap is 0)
     ((street))
     ((p.cap is 0)
      ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
       ((street))
       ((title)))
      ((pp.name matches "[1-9][0-9]+")
       ((street))
       ((title)))))))))
@end lisp
Note that these only assign values for the feature @code{token_pos} and
do nothing more.  You must have a related token to word rule that
interprets this feature value and does the required translation.  For
example the corresponding token to word rule for the above disambiguator
is
@lisp
((string-matches name "\\([dD][Rr]\\|[Ss][tT]\\)")
 (if (string-equal (item.feat token "token_pos") "street")
     (if (string-matches name "[dD][rR]")
         (list "drive")
         (list "street"))
     (if (string-matches name "[dD][rR]")
         (list "doctor")
         (list "saint"))))
@end lisp
@subsection Building disambiguators

Festival offers some support for building disambiguation trees.  The
basic method is to find all occurrences of a homographic token in a large
text database, label each occurrence into classes, extract appropriate
context features for these tokens and finally build a classification tree
or decision list based on the extracted features.

The extraction and building of trees is not yet a fully automated
process in Festival but the file @file{festival/examples/toksearch.scm}
shows some basic Scheme code we use for extracting tokens from very
large collections of text.

The function @code{extract_tokens} does the real work.  It reads the
given file, token by token, into a token stream.  Each token is tested
against the desired tokens and if there is a match the named features
are extracted.  The token stream will be extended to provide the
necessary context.  Note that only some features will make any sense in
this situation.  There is only a token relation so referring to words,
syllables etc. is not productive.

In this example databases are identified by a file that lists all
the files in the text databases.  Its name is expected to be
@file{bin/DBNAME.files} where @code{DBNAME} is the name of
the database.  The file should contain a list
of filenames in the database, e.g. for the Gutenberg texts the
file @file{bin/Gutenberg.files} contains
@example
gutenberg/etext90/bill11.txt
gutenberg/etext90/const11.txt
gutenberg/etext90/getty11.txt
gutenberg/etext90/jfk11.txt
...
@end example

Extracting the tokens is typically done in two passes.  The first pass
extracts the context (I've used 5 tokens either side).  It extracts
the file and position, so the token is identified, and the word
in context.

Next those examples should be labelled with a small set of classes
which identify the type of the token, for example for a token
like "Dr" whether it is a person's title or a street identifier.
Note that hand-labelling can be laborious, though it is surprising
how few tokens of particular types actually exist in 62 million
words.

The next task is to extract the tokens with the features that will best
distinguish the particular token.  In our "Dr" case this will involve
punctuation around the token, capitalisation of surrounding tokens etc.
After extracting the distinguishing features you must line up the labels
with these extracted features.  It would be easier to extract both the
context and the desired features at the same time, but experience shows
that during labelling more appropriate features come to mind that will
distinguish classes better, and you don't want to have to label twice.

Once a set of feature vectors, each consisting of the label and the
extracted features, is created
it is easy to use @file{wagon} to create the corresponding decision tree
or decision list.  @file{wagon} supports both decision trees and decision
lists; it may be worth experimenting to find out which gives the best
results on some held out test data.  It appears that decision trees are
typically better, but are often much larger, and the size does not
always justify the sometimes only slightly better results.
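As a sketch, a @file{wagon} invocation along the following lines builds
a tree from such data; the description, data and output filenames and
the stop value here are all illustrative.
@example
wagon -desc dr.desc -data dr.train -test dr.test -stop 10 -o dr.tree
@end example
The description file names each feature and its possible values, and the
resulting tree can then be placed in @code{token_pos_cart_trees} paired
with the appropriate regular expression.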
@node POS tagging, Phrase breaks, Text analysis, Top
@chapter POS tagging

@cindex part of speech tagging
@cindex tagging
@cindex POS tagging
Part of speech tagging is a fairly well-defined process.  Festival
includes a part of speech tagger following the HMM-type taggers as found
in the Xerox tagger and others (e.g. @cite{DeRose88}).  Part of speech
tags are assigned based on the probability distribution of tags given a
word, and on ngrams of tags.  These models are externally specified
and a Viterbi decoder is used to assign part of speech tags at run time.

So far this tagger has only been used for English but there
is nothing language specific about it.  The module @code{POS}
assigns the tags.  It accesses the following variables for
parameterization.
@table @code
@item pos_lex_name
The name of a "lexicon" holding reverse probabilities of words
given a tag (indexed by word).  If this is unset or has the
value @code{NIL} no part of speech tagging takes place.
@item pos_ngram_name
The name of a loaded ngram model of part of speech tags (loaded
by @code{ngram.load}).
@item pos_p_start_tag
The name of the most likely tag before the start of an utterance.
This is typically the tag for sentence final punctuation marks.
@item pos_pp_start_tag
The name of the most likely tag two before the start of an utterance.
For English this is typically a simple noun, but for other languages
it might be a verb.  If the ngram model is bigger than three
this tag is effectively repeated for the previous left contexts.
@item pos_map
We have found that it is often better to use a rich tagset for
prediction of part of speech tags but that in later use (phrase breaks
and dictionary lookup) a much more constrained tagset is better.  Thus
mapping of the predicted tagset to a different tagset is supported.
@code{pos_map} should be a list of pairs consisting of a list of tags
to be mapped and the new tag they are to be mapped to.
@end table

Note it is important to have the part of speech tagger match
the tags used in later parts of the system, particularly the
lexicon.  Only two of our lexicons used so far have
(mappable) part of speech labels.

An example of the part of speech tagger for English can be found in
@file{lib/pos.scm}.
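The variables above are typically set together when a voice is defined.
A hedged sketch in the style of @file{lib/pos.scm} follows; the model
names and the tags in the map are illustrative, not the actual values
shipped with Festival.
@lisp
(set! pos_lex_name "english_poslex")
(set! pos_ngram_name 'english_pos_ngram)
(set! pos_p_start_tag "punc")
(set! pos_pp_start_tag "nn")
;; map the rich predicted tagset down to a smaller one:
;; each pair is (list-of-tags new-tag)
(set! pos_map
      '(((vb vbd vbg vbn vbp vbz) v)
        ((nn nnp nnps nns fw sym) n)))
@end lisp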
@node Phrase breaks, Intonation, POS tagging, Top
@chapter Phrase breaks

@cindex phrase breaks
There are two methods for predicting phrase breaks in Festival, one
simple and one sophisticated.  These two methods are selected through
the parameter @code{Phrase_Method} and phrasing is achieved by the
module @code{Phrasify}.

The first method is by CART tree.  If parameter @code{Phrase_Method} is
@code{cart_tree}, the CART tree in the variable @code{phrase_cart_tree}
is applied to each word to see if a break should be inserted or not.
The tree should predict categories @code{BB} (for big break), @code{B}
(for break) or @code{NB} (for no break).  A simple example of a tree to
predict phrase breaks is given in the file @file{lib/phrase.scm}.
@lisp
(set! simple_phrase_cart_tree
 '
 ((R:Token.parent.punc in ("?" "." ":"))
  ((BB))
  ((R:Token.parent.punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)
    ((BB))
    ((NB))))))
@end lisp

The second and more elaborate method of phrase break prediction is used
when the parameter @code{Phrase_Method} is @code{prob_models}.  In this
case a probabilistic model is used, giving the probability of a break
after a word based on the part of speech of the neighbouring words and
the previous word.  This is combined with an ngram model of the
distribution of breaks and non-breaks, using a Viterbi decoder to find
the optimal phrasing of the utterance.  The results using this technique
are good and even show good results on unseen data from other
researchers' phrase break tests (see @cite{black97b}).  However
sometimes it does sound wrong, suggesting there is still further work
required.

Parameters for this module are set through the feature list held
in the variable @code{phr_break_params}, an example of which
for English is set in @code{english_phr_break_params} in
the file @file{lib/phrase.scm}.  The feature names and meanings are

@table @code
@item pos_ngram_name
The name of a loaded ngram that gives probability distributions of B/NB
given previous, current and next part of speech.
@item pos_ngram_filename
The filename containing @code{pos_ngram_name}.
@item break_ngram_name
The name of a loaded ngram of B/NB distributions.  This is typically
a 6 or 7-gram.
@item break_ngram_filename
The filename containing @code{break_ngram_name}.
@item gram_scale_s
A weighting factor for breaks in the break/non-break ngram.  Increasing
the value inserts more breaks; reducing it causes fewer breaks to be
inserted.
@item phrase_type_tree
A CART tree that is used to predict the type of break given the
predicted break position.  This (rather crude) technique is currently
used to distinguish major and minor breaks.
@item break_tags
A list of the break tags (typically @code{(B NB)}).
@item pos_map
A part of speech map used to map the @code{pos} feature of words
into a smaller tagset used by the phrase predictor.
@end table
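A hedged sketch of such a parameter list, in the general shape of
@code{english_phr_break_params}; the model names, filenames and scale
value are illustrative only.
@lisp
(set! phr_break_params
      (list
       (list 'pos_ngram_name 'english_break_pos_ngram)
       (list 'pos_ngram_filename "break_pos.ngrambin")
       (list 'break_ngram_name 'english_break_ngram)
       (list 'break_ngram_filename "break.ngrambin")
       (list 'gram_scale_s 0.6)
       (list 'break_tags '(B NB))))
@end lisp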
@node Intonation, Duration, Phrase breaks , Top
@chapter Intonation

@cindex intonation
A number of different intonation modules are available with
varying levels of control.  In general intonation is generated
in two steps.
@enumerate
@item Prediction of accents (and/or end tones) on a per
syllable basis.
@item Prediction of F0 target values; this must be done after
durations are predicted.
@end enumerate

Reflecting this split there are two main intonation modules that call
sub-modules depending on the desired intonation methods.  The
@code{Intonation} and @code{Int_Targets} modules are defined in Lisp
(@file{lib/intonation.scm}) and call sub-modules which are (so far) in
C++.

@menu
* Default intonation:: Effectively none at all.
* Simple intonation:: Accents and hats.
* Tree intonation:: Accents and Tones, and F0 prediction by LR
* Tilt intonation:: Using the Tilt intonation model
* General intonation:: A programmable intonation module
* Using ToBI:: A ToBI by rule example
@end menu

@node Default intonation, Simple intonation, ,Intonation
@section Default intonation

@cindex duff intonation
@cindex monotone
This is the simplest form of intonation and offers the modules
@code{Intonation_Default} and @code{Intonation_Targets_Default}, the
first of which actually does nothing at all.
@code{Intonation_Targets_Default} simply creates a target at the start
of the utterance and one at the end, whose values by default
are 130 Hz and 110 Hz.  These values may be set through the
parameter @code{duffint_params}; for example the following will
generate a monotone at 150Hz.
@lisp
(set! duffint_params '((start 150) (end 150)))
(Parameter.set 'Int_Method 'DuffInt)
(Parameter.set 'Int_Target_Method Int_Targets_Default)
@end lisp
@node Simple intonation, Tree intonation, Default intonation,Intonation
@section Simple intonation

@cindex @code{int_accent_cart_tree}
This module uses the CART tree in @code{int_accent_cart_tree} to predict
if each syllable is accented or not.  A predicted value of @code{NONE}
means no accent is generated by the corresponding @code{Int_Targets_Simple}
function.  Any other predicted value will cause a `hat' accent to be
put on that syllable.

A default @code{int_accent_cart_tree} is available in the value
@code{simple_accent_cart_tree} in @file{lib/intonation.scm}.  It simply
predicts accents on the stressed syllables of content words in
poly-syllabic words, and on the only syllable in single syllable content
words.  Its form is
@lisp
(set! simple_accent_cart_tree
 '
 ((R:SylStructure.parent.gpos is content)
  ((stress is 1)
   ((Accented))
   ((position_type is single)
    ((Accented))
    ((NONE))))
  ((NONE))))
@end lisp

The function @code{Int_Targets_Simple} uses parameters in the a-list
in variable @code{int_simple_params}.  There are two interesting
parameters: @code{f0_mean}, which gives the mean F0 for this speaker
(default 110 Hz), and @code{f0_std}, the standard deviation of
F0 for this speaker (default 25 Hz).  This second value is used
to determine the amount of variation to be put in the generated
targets.

@cindex F0 generation
For each Phrase in the given utterance an F0 is generated starting at
@code{f0_mean+(f0_std*0.6)} and declining @code{f0_std} Hz over the
length of the phrase until the last syllable, whose end is set to
@code{f0_mean-f0_std}.  An imaginary line called @code{baseline} is
drawn from the start to the end (minus the final extra fall).  For each
syllable that is accented (i.e. has an IntEvent related to it) three
targets are added: one at the start, one in mid vowel, and one at the
end.  The start and end are at position @code{baseline} Hz (as declined
for that syllable) and the mid vowel is set to @code{baseline+f0_std}.

Note this model is not supposed to be complex or comprehensive but it
offers a very quick and easy way to generate something other than a
fixed line F0.  Something similar to this has been used for Spanish and
Welsh without (too many) people complaining.  However it is not designed
as a serious intonation module.
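To select this method and adjust the speaker parameters, something along
the following lines can be used; the pitch values are illustrative.
@lisp
(Parameter.set 'Int_Method 'Simple)
(Parameter.set 'Int_Target_Method Int_Targets_Simple)
;; a higher, more variable pitch range than the defaults
(set! int_simple_params '((f0_mean 120) (f0_std 30)))
@end lisp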
@node Tree intonation, Tilt intonation, Simple intonation, Intonation
@section Tree intonation

This module is more flexible.  Two different CART trees can be used to
predict `accents' and `endtones'.  Although at present this module is
used for an implementation of the ToBI intonation labelling system it
could be used for many different types of intonation system.

The target module for this method uses a Linear Regression model to
predict start, mid-vowel and end targets for each syllable using
arbitrarily specified features.  This follows the work described in
@cite{black96}.  The LR models are held as described below
(@pxref{Linear regression}).  Three models are used in the variables
@code{f0_lr_start}, @code{f0_lr_mid} and @code{f0_lr_end}.
@node Tilt intonation, General intonation, Tree intonation, Intonation
@section Tilt intonation

Tilt description to be inserted.
@node General intonation, Using ToBI, Tilt intonation, Intonation
@section General intonation

As there seem to be a number of intonation theories that predict
F0 contours by rule (possibly using trained parameters) this
module aids the external specification of such rules for a wide
class of intonation theories (though primarily those that might
be referred to as the ToBI group).  This is designed to be multi-lingual
and offer a quick way to port often pre-existing rules into Festival
without writing new C++ code.

The accent prediction part uses the same mechanism as the Simple
intonation method described above, a decision tree for
accent prediction; thus the tree in the variable
@code{int_accent_cart_tree} is used on each syllable to predict
an @code{IntEvent}.

The target part calls a specified Scheme function which returns
a list of target points for a syllable.  In this way any arbitrary
tests may be done to produce the target points.  For example
here is a function which returns three target points
for each syllable with an @code{IntEvent} related to it (i.e.
accented syllables).
@lisp
(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        (list
         (list start 110)
         (list (/ (+ start end) 2.0) 140)
         (list end 100)))))
@end lisp
This function may be identified as the function to call by
the following setup parameters.
@lisp
(Parameter.set 'Int_Method 'General)
(Parameter.set 'Int_Target_Method Int_Targets_General)

(set! int_general_params
      (list
       (list 'targ_func targ_func1)))
@end lisp
@node Using ToBI, , General intonation, Intonation
@section Using ToBI

An example implementation of a ToBI to F0 target module is included in
@file{lib/tobi_rules.scm} based on the rules described in @cite{jilka96}.
This uses the general intonation method discussed in the previous
section.  This is designed to be useful to people who are experimenting
with ToBI (@cite{silverman92}), rather than general text to speech.

To use this method you need to load @file{lib/tobi_rules.scm} and
call @code{setup_tobi_f0_method}.  The default is in a male's
pitch range, i.e. for @code{voice_rab_diphone}.  You can change
it for other pitch ranges by changing the following variables.
@lisp
(Parameter.set 'Default_Topline 110)
(Parameter.set 'Default_Start_Baseline 87)
(Parameter.set 'Default_End_Baseline 83)
(Parameter.set 'Current_Topline (Parameter.get 'Default_Topline))
(Parameter.set 'Valley_Dip 75)
@end lisp

An example using this from STML is given in @file{examples/tobi.stml}.
But it can also be used from Scheme.  For example before
defining an utterance you should execute the following, either
from the command line or in some setup file
@example
(voice_rab_diphone)
(require 'tobi_rules)
(setup_tobi_f0_method)
@end example
In order to allow specification of accents, tones, and break levels
you must use an utterance type that allows such specification.  For
example
@lisp
(Utterance
 Words
 (boy
  (saw ((accent H*)))
  the
  (girl ((accent H*)))
  in the
  (park ((accent H*) (tone H-)))
  with the
  (telescope ((accent H*) (tone H-H%)))))

(Utterance Words
 (The
  (boy ((accent L*)))
  saw
  the
  (girl ((accent H*) (tone L-)))
  with
  the
  (telescope ((accent H*) (tone H-H%)))))
@end lisp
You can display the synthesized form of these utterances in
Xwaves.  Start an Xwaves and an Xlabeller and call the function
@code{display} on the synthesized utterance.
@node Duration, UniSyn synthesizer, Intonation , Top
@chapter Duration

@cindex duration
A number of different duration prediction modules are available with
varying levels of sophistication.

Segmental duration prediction is done by the module @code{Duration}
which calls different actual methods depending on the parameter
@code{Duration_Method}.

@cindex duration stretch
All of the following duration methods may be further affected by both a
global duration stretch and a per word one.

If the parameter @code{Duration_Stretch} is set, all absolute durations
predicted by any of the duration methods described here are multiplied by
the parameter's value.  For example
@lisp
(Parameter.set 'Duration_Stretch 1.2)
@end lisp
@exdent will make everything speak more slowly.

@cindex local duration stretch
In addition to the global stretch method, if the feature
@code{dur_stretch} on the related @code{Token} is set it will also be
used as a multiplicative factor on the duration produced by the selected
method.  That is @code{R:Syllable.parent.parent.R:Token.parent.dur_stretch}.
There is a lisp function @code{duration_find_stretch} which will return
the combined global and local duration stretch factor for a given
segment item.

Note these global and local methods of affecting the duration produced
by models are crude and should be considered hacks.  Uniform
modification of durations is not what happens in real speech.  These
parameters are typically used when the underlying duration method is
lacking in some way.  However these can be useful.

Note it is quite easy to implement new duration methods in Scheme
directly.

@menu
* Default durations:: Fixed length durations
* Average durations::
* Klatt durations:: Klatt rules from book.
* CART durations:: Tree based durations
@end menu
@node Default durations, Average durations, , Duration
@section Default durations

@cindex fixed durations
If parameter @code{Duration_Method} is set to @code{Default}, the
simplest duration model is used.  All segments are 100 milliseconds
(this can be modified by @code{Duration_Stretch}, and/or the localised
Token related @code{dur_stretch} feature).
@node Average durations, Klatt durations, Default durations, Duration
@section Average durations

If parameter @code{Duration_Method} is set to @code{Averages}
then segmental durations are set to their averages.  The variable
@code{phoneme_durations} should be an a-list of phones and averages
in seconds.  The file @file{lib/mrpa_durs.scm} has an example for
the mrpa phoneset.

If a segment is found that does not appear in the list a default
duration of 0.1 seconds is assigned, and a warning message generated.
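The shape of such an a-list can be sketched as follows; the phones and
values here are illustrative, not taken from @file{lib/mrpa_durs.scm}.
@lisp
(Parameter.set 'Duration_Method 'Averages)
;; illustrative entries: (phone average-duration-in-seconds)
(set! phoneme_durations
      '((# 0.100)
        (a 0.120)
        (b 0.070)))
@end lisp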
@node Klatt durations, CART durations, Average durations, Duration
@section Klatt durations

@cindex Klatt duration rules
If parameter @code{Duration_Method} is set to @code{Klatt}, the duration
rules from the Klatt book (@cite{allen87}, chapter 9) are used.  This
method requires minimum and inherent durations for each phoneme in the
phoneset.  This information is held in the variable
@code{duration_klatt_params}.  Each member of this list is a
three-tuple, of phone name, inherent duration and minimum duration.  An
example for the mrpa phoneset is in @file{lib/klatt_durs.scm}.
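The parameter list takes the following shape; the phones and durations
here are illustrative, not taken from @file{lib/klatt_durs.scm}.
@lisp
(Parameter.set 'Duration_Method 'Klatt)
;; illustrative three-tuples: (phone inherent-duration minimum-duration)
(set! duration_klatt_params
      '((a 0.230 0.080)
        (b 0.075 0.060)))
@end lisp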
@node CART durations, , Klatt durations, Duration
@section CART durations

Two very similar methods of duration prediction by CART tree
are supported.  The first, used when parameter @code{Duration_Method}
is @code{Tree}, simply predicts durations directly for each segment.
The tree is set in the variable @code{duration_cart_tree}.

The second, which seems to give better results, is used when parameter
@code{Duration_Method} is @code{Tree_ZScores}.  In this second model the
tree predicts zscores (number of standard deviations from the mean)
rather than duration directly.  (This follows @cite{campbell91}, but we
don't deal in syllable durations here.)  This method requires means and
standard deviations for each phone.  The variable
@code{duration_cart_tree} should contain the zscore prediction tree and
the variable @code{duration_ph_info} should contain a list of phone,
mean duration, and standard deviation for each phone in the phoneset.

An example tree trained from 460 sentences spoken by Gordon is
in @file{lib/gswdurtreeZ}.  Phone means and standard deviations
are in @file{lib/gsw_durs.scm}.

After prediction the segmental duration is calculated by
the simple formula
@example
duration = mean + (zscore * standard deviation)
@end example

For some other duration models that affect an inherent duration by
some factor this method has been used: if the tree predicts factors
rather than zscores, and the @code{duration_ph_info} entries are
phone, 0.0, inherent duration, the above formula will generate the
desired result.  Klatt and Klatt-like rules can be implemented in
this way without adding a new method.
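The zscore setup can be sketched as follows; the phone names and
statistics are illustrative, not taken from @file{lib/gsw_durs.scm}.
Note how each entry supplies the mean and standard deviation used by
the formula above.
@lisp
(Parameter.set 'Duration_Method 'Tree_ZScores)
;; illustrative entries: (phone mean-duration standard-deviation), seconds
(set! duration_ph_info
      '((a 0.138 0.046)
        (b 0.072 0.019)))
;; e.g. a predicted zscore of 1.0 for "a" gives 0.138 + 1.0*0.046 seconds
@end lisp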
@node UniSyn synthesizer, Diphone synthesizer, Duration , Top
@chapter UniSyn synthesizer

@cindex UniSyn
@cindex diphone synthesis
@cindex waveform synthesis
Since 1.3 a new general synthesizer module has been included.  This is
designed to replace the older diphone synthesizer described in the
next chapter.  A redesign was made in order to have a generalized
waveform synthesis and signal processing module that could be used
even when the units being concatenated are not diphones.  Also at
this stage the full diphone (or other) database pre-processing
functions were added to the Speech Tools library.

@section UniSyn database format

@cindex grouped diphones
@cindex ungrouped diphones
@cindex separate diphones
The UniSyn synthesis modules can use databases in two basic
formats, @emph{separate} and @emph{grouped}.  Separate is when
all files (signal, pitchmark and coefficient files) are accessed
individually during synthesis.  This is the standard use during
database development.  Group format is when a database is collected
together into a single special file containing all information
necessary for waveform synthesis.  This format is designed to
be used for distribution and general use of the database.

A database should consist of a set of waveforms (which may be
translated into a set of coefficients if the desired signal
processing method requires it), a set of pitchmarks and an index.  The
pitchmarks are necessary as most of our current signal processing
methods are pitch synchronous.

@subsection Generating pitchmarks

@cindex pitchmarking
@cindex laryngograph
Pitchmarks may be derived from laryngograph files using the
@file{pitchmark} program distributed with the speech
tools.  The actual parameters to this program are still a bit of
an art form.  The first major issue is which way up the lar
files are.  We have seen both, though it does seem that CSTR's
are most often upside down while others (e.g. OGI's) are the right way
up.  The @code{-inv} argument to @file{pitchmark} is specifically
provided to cater for this.  There are other issues in getting the
pitchmarks aligned.  The basic command for generating pitchmarks
is
@example
pitchmark -inv lar/file001.lar -o pm/file001.pm -otype est \
   -min 0.005 -max 0.012 -fill -def 0.01 -wave_end
@end example
The @code{-min}, @code{-max} and @code{-def} (fill values for unvoiced
regions) options may need to be changed depending on the speaker's pitch
range.  The above is suitable for a male speaker.  The @code{-fill}
option states that unvoiced sections should be filled with equally
spaced pitchmarks.

@subsection Generating LPC coefficients

@cindex LPC analysis
@cindex residual
@cindex LPC residual
LPC coefficients are generated using the @file{sig2fv} command.  Two
stages are required, generating the LPC coefficients and generating
the residual.  The prototypical commands for these are
@example
sig2fv wav/file001.wav -o lpc/file001.lpc -otype est -lpc_order 16 \
    -coefs "lpc" -pm pm/file001.pm -preemph 0.95 -factor 3 \
    -window_type hamming
sigfilter wav/file001.wav -o lpc/file001.res -otype nist \
    -lpcfilter lpc/file001.lpc -inv_filter
@end example
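The @code{-inv_filter} step computes the residual by inverse filtering the waveform with the predictor coefficients.  A minimal Python illustration of that idea on toy data (not the speech tools implementation):

```python
# Toy illustration of LPC inverse filtering (residual extraction).
# For predictor coefficients a[1..p], the residual is
#   e[n] = x[n] - sum_k a[k] * x[n-k]
# so inverse filtering a signal generated by those coefficients
# recovers the original excitation.

def lpc_residual(x, a):
    p = len(a)
    e = []
    for n in range(len(x)):
        pred = sum(a[k] * x[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        e.append(x[n] - pred)
    return e

# Build a signal from a known AR(1) filter driven by a pulse train...
a = [0.9]
excitation = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]
x = []
for n, u in enumerate(excitation):
    x.append(u + (0.9 * x[n - 1] if n > 0 else 0.0))

# ...then inverse filtering recovers that excitation exactly.
residual = lpc_residual(x, a)
```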
@cindex power normalisation
For some databases you may need to normalize the power.  Properly
normalizing power is difficult but we provide a simple function which may
do the job acceptably.  You should do this on the waveform before
LPC analysis (and ensure you also do the residual extraction on the
normalized waveform rather than the original).
@example
ch_wave -scaleN 0.5 wav/file001.wav -o file001.Nwav
@end example
This normalizes the power by maximizing the signal first then multiplying
it by the given factor.  If the database waveforms are clean (i.e.
no clicks) this can give reasonable results.

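The @code{-scaleN} behaviour can be pictured with a small Python sketch (illustrative only; @file{ch_wave} itself operates on real waveform files):

```python
# Illustrative sketch of "-scaleN"-style normalization: scale so the
# peak magnitude becomes 1.0, then multiply by the given factor.

def scale_n(samples, factor):
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [s / peak * factor for s in samples]

wave = [0.1, -0.4, 0.2, 0.05]
normalized = scale_n(wave, 0.5)   # peak magnitude becomes 0.5
```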
@section Generating a diphone index

@cindex diphone index
The diphone index consists of a short header followed by an
ascii list of each diphone, the file it comes from, followed by its
start, middle and end times in seconds.  For most databases this
file needs to be generated by some database specific script.

An example header is
@example
EST_File index
DataType ascii
NumEntries 2005
IndexName rab_diphone
EST_Header_End
@end example
The most notable part is the number of entries, which you should note
can get out of sync with the actual number of entries if you hand
edit entries.  I.e. if you add an entry and the system still
can't find it check that the number of entries is right.

The entries themselves may take on one of two forms, full
entries or index entries.  Full entries consist of a diphone
name, where the phones are separated by "-"; a file name
which is used to index into the pitchmark, LPC and waveform files;
and the start, middle (change over point between phones) and end
of the diphone in the file in seconds.  For example
@example
r-uh edx_1001 0.225 0.261 0.320
r-e edx_1002 0.224 0.273 0.326
r-i edx_1003 0.240 0.280 0.321
r-o edx_1004 0.212 0.253 0.320
@end example
The second form of entry is an index entry which
simply states that reference to that diphone should actually be made
to another.  For example
@example
aa-ll &aa-l
@end example
This states that the diphone @code{aa-ll} should actually use the
diphone @code{aa-l}.  Note there are a number of ways to specify
alternates for missing diphones and this method is best used for fixing
single or small classes of missing or broken diphones.  Index
entries may appear anywhere in the file but can't be nested.

Some checks are made on reading this index to ensure times etc.
are reasonable, but multiple entries for the same diphone are not
checked; in that case the later one will be selected.
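A reader for this format can be sketched in a few lines of Python (illustrative, not the speech tools parser; the header handling and "&" redirection follow the description above):

```python
# Illustrative reader for the diphone index format described above.
def read_diphone_index(lines):
    it = iter(lines)
    header = {}
    assert next(it).split() == ["EST_File", "index"]
    for line in it:
        if line.strip() == "EST_Header_End":
            break
        key, value = line.split(None, 1)
        header[key] = value.strip()
    full, aliases = {}, {}
    for line in it:
        fields = line.split()
        if not fields:
            continue
        if fields[1].startswith("&"):      # index entry: alias to another diphone
            aliases[fields[0]] = fields[1][1:]
        else:                              # full entry: file, start, mid, end
            fname, start, mid, end = fields[1:5]
            full[fields[0]] = (fname, float(start), float(mid), float(end))
    # Resolve aliases after reading, since index entries may appear
    # anywhere in the file (but, as noted above, they can't be nested).
    for name, target in aliases.items():
        full[name] = full[target]
    return header, full

raw = """EST_File index
DataType ascii
NumEntries 3
IndexName toy
EST_Header_End
r-uh edx_1001 0.225 0.261 0.320
r-e edx_1002 0.224 0.273 0.326
aa-ll &r-uh""".splitlines()

header, index = read_diphone_index(raw)
```

Since entries are stored in a dictionary keyed by diphone name, a duplicate entry silently overwrites the earlier one, matching the "later one will be selected" behaviour described above.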

@section Database declaration

There are two major types of database, @emph{grouped} and @emph{ungrouped}.
Grouped databases come as a single file containing the diphone index,
coefficients and residuals for the diphones.  This is the standard way
databases are distributed as voices in Festival.  Ungrouped databases
access diphones from individual files and are designed as a method
for debugging and testing databases before distribution.  Using an
ungrouped database is slower but allows quicker changes to the index,
and associated coefficient files and residuals, without rebuilding the
group file.

@cindex @code{us_diphone_init}
A database is declared to the system through the command
@code{us_diphone_init}.  This function takes a parameter list of
various features used for setting up a database.  The features are
@table @code
@item name
An atomic name for this database, used in selecting it from the current
set of loaded databases.
@item index_file
A filename containing either a diphone index, as described above,
or a group file.  The feature @code{grouped} defines the distinction
between this being a group file or a simple index file.
@item grouped
Takes the value @code{"true"} or @code{"false"}.  This defines whether
the index file is a simple index or a grouped file.
@item coef_dir
The directory containing the coefficients, (LPC or just pitchmarks in
the PSOLA case).
@item sig_dir
The directory containing the signal files (residual for LPC, full waveforms
for PSOLA).
@item coef_ext
The extension for coefficient files, typically @code{".lpc"} for LPC
files and @code{".pm"} for pitchmark files.
@item sig_ext
The extension for signal files, typically @code{".res"} for LPC residual
files and @code{".wav"} for waveform files.
@item default_diphone
@cindex default diphone
The diphone to be used when the requested one doesn't exist.  No matter
how careful you are you should always include a default diphone for
a distributed diphone database.  Synthesis will throw an error if
no diphone is found and there is no default.  Although it is usually
an error when this is required it is better to fill in something than to
stop synthesizing.  Typical values for this are silence to silence
or schwa to schwa.
@item alternates_left
@cindex diphone alternates
@cindex alternate diphones
A list of pairs showing the alternate phone names for the left phone in
a diphone pair.  This list is used to rewrite the diphone name when
the directly requested one doesn't exist.  This is the recommended
method for dealing with systematic holes in a diphone database.
@item alternates_right
A list of pairs showing the alternate phone names for the right phone in
a diphone pair.  This list is used to rewrite the diphone name when
the directly requested one doesn't exist.  This is the recommended
method for dealing with systematic holes in a diphone database.
@end table

An example database definition is
@example
(set! rab_diphone_dir "/projects/festival/lib/voices/english/rab_diphone")
(set! rab_lpc_group
      (list
       '(name "rab_lpc_group")
       (list 'index_file
             (path-append rab_diphone_dir "group/rablpc16k.group"))
       '(alternates_left ((i ii) (ll l) (u uu) (i@@ ii) (uh @@) (a aa)
                          (u@@ uu) (w @@) (o oo) (e@@ ei) (e ei)
                          (r @@)))
       '(alternates_right ((i ii) (ll l) (u uu) (i@@ ii)
                           (y i) (uh @@) (r @@) (w @@)))
       '(default_diphone @@-@@@@)
       '(grouped "true")))
(us_diphone_init rab_lpc_group)
@end example
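The lookup order implied by these features (direct name, then rewrites via the alternates, then the default diphone) can be sketched as follows.  This is illustrative Python with hypothetical alternate tables, not the UniSyn implementation:

```python
# Illustrative diphone lookup: direct name, then alternate rewrites,
# then the default diphone (hypothetical alternate tables).
ALTERNATES_LEFT = {"ll": "l", "uh": "@"}
ALTERNATES_RIGHT = {"ll": "l", "y": "i"}
DEFAULT_DIPHONE = "@-@@"

def select_diphone(name, database):
    if name in database:
        return name
    left, right = name.split("-")
    rewritten = (ALTERNATES_LEFT.get(left, left) + "-" +
                 ALTERNATES_RIGHT.get(right, right))
    if rewritten in database:
        return rewritten           # a warning would normally be printed here
    if DEFAULT_DIPHONE in database:
        return DEFAULT_DIPHONE
    raise LookupError("no diphone and no default for " + name)

db = {"l-i", "aa-l", "@-@@"}
```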

@section Making groupfiles

@cindex group files
@cindex diphone group files
The function @code{us_make_group_file} will make a group file
of the currently selected US diphone database.  It loads in all
diphones in the database and saves them in the named file.  An optional
second argument allows specification of how the group file will
be saved.  These options are given as a feature list.  There
are three possible options
@table @code
@item track_file_format
The format for the coefficient files.  By default this is
@code{est_binary}; currently the only other alternative is @code{est_ascii}.
@item sig_file_format
The format for the signal parts of the database.  By default
this is @code{snd} (Sun's audio format).  This was chosen as it has
the smallest header and supports various sample formats.  Any format
supported by the Edinburgh Speech Tools is allowed.
@item sig_sample_format
The format for the samples in the signal files.  By default this
is @code{mulaw}.  This is suitable when the signal files are LPC
residuals.  LPC residuals have a much smaller dynamic range than
plain PCM files.  Because @code{mulaw} representation is half the size
(8 bits) of standard PCM files (16 bits) this significantly reduces
the size of the group file while only marginally altering the quality of
synthesis (and from experiments the effect is not perceptible).  However
when saving group files where the signals are not LPC residuals (e.g.
in PSOLA) using this default @code{mulaw} is not recommended and
@code{short} should probably be used.
@end table
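The size saving is easy to see: each 16-bit sample becomes one 8-bit code.  A compact Python illustration of G.711-style mu-law companding follows (illustrative only; the speech tools use their own implementation):

```python
import math

# Illustrative mu-law companding (G.711-style continuous formula).
MU = 255.0

def mulaw_compress(x):
    """Map a sample in [-1, 1] to [-1, 1] with mu-law companding."""
    sign = -1.0 if x < 0 else 1.0
    return sign * math.log(1.0 + MU * abs(x)) / math.log(1.0 + MU)

def mulaw_expand(y):
    sign = -1.0 if y < 0 else 1.0
    return sign * ((1.0 + MU) ** abs(y) - 1.0) / MU

def encode8(x):
    """Quantize the companded value to a signed 8-bit code."""
    return round(mulaw_compress(x) * 127)

def decode8(code):
    return mulaw_expand(code / 127.0)

# Small-amplitude samples (like LPC residuals) survive the 8-bit
# round trip with small error, halving storage versus 16-bit PCM.
x = 0.02
err = abs(decode8(encode8(x)) - x)
```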

@section UniSyn module selection

In a voice selection a UniSyn database may be selected as follows
@example
(set! UniSyn_module_hooks (list rab_diphone_const_clusters ))
(set! us_abs_offset 0.0)
(set! window_factor 1.0)
(set! us_rel_offset 0.0)
(set! us_gain 0.9)

(Parameter.set 'Synth_Method 'UniSyn)
(Parameter.set 'us_sigpr 'lpc)
(us_db_select rab_db_name)
@end example
The @code{UniSyn_module_hooks} are run before synthesis, see the next
section about diphone name selection.  At present only @code{lpc}
is supported by the UniSyn module, though potentially there may be
others.

@cindex TD-PSOLA
@cindex PSOLA
An optional implementation of TD-PSOLA @cite{moulines90} has been
written but fear of legal problems unfortunately prevents it being in
the public distribution, but this policy should not be taken as
acknowledging or not acknowledging any alleged patent violation.

@section Diphone selection

@cindex selection of diphones
@cindex diphone selection
Diphone names are constructed for each phone-phone pair in the Segment
relation in an utterance.  In forming a
diphone name UniSyn first checks for the feature @code{us_diphone_left}
(or @code{us_diphone_right} for the right hand part of the diphone), then
if that doesn't exist the feature @code{us_diphone}, then if that doesn't
exist the feature @code{name}.  Thus it is possible to specify diphone
names which are not simply the concatenation of two segment names.

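That precedence order can be sketched as follows (illustrative Python; the feature names come from the text, and a segment is modelled here as a plain dict):

```python
# Illustrative feature precedence for diphone naming.
def diphone_half_name(segment_feats, side):
    """side is 'left' or 'right'; first matching feature wins."""
    for key in ("us_diphone_" + side, "us_diphone", "name"):
        if key in segment_feats:
            return segment_feats[key]
    raise KeyError("segment has no name feature")

seg = {"name": "l", "us_diphone_left": "ll"}
```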
This feature is used to specify consonant cluster diphone names
for our English voices.  The hook @code{UniSyn_module_hooks} is run
before selection and we specify a function to add @code{us_diphone_*}
features as appropriate.  See the function @code{rab_diphone_fix_phone_name}
in @file{lib/voices/english/rab_diphone/festvox/rab_diphone.scm} for
an example.

Once the diphone name is created it is used to select the diphone from
the database.  If it is not found the name is converted using the list
of @code{alternates_left} and @code{alternates_right} as specified in
the database declaration.  If that doesn't specify a diphone in the
database, the @code{default_diphone} is selected, and a warning is
printed.  If no default diphone is specified or the default diphone
doesn't exist in the database an error is thrown.

@node Diphone synthesizer, Other synthesis methods, UniSyn synthesizer , Top
@chapter Diphone synthesizer

@emph{NOTE:} use of this diphone synthesizer is deprecated and it
will probably be removed from future versions; all of its functionality
has been replaced by the UniSyn synthesizer.  It is not
compiled by default; if required add @code{ALSO_INCLUDE += diphone}
to your @file{festival/config/config} file.

@cindex diphone synthesis
A basic diphone synthesizer offers a method
for making speech from segments, durations and intonation
targets.  This module was mostly written by Alistair Conkie
but the base diphone format is compatible with previous CSTR
diphone synthesizers.

The synthesizer offers residual excited LPC based synthesis (@cite{hunt89})
and PSOLA (TM) (@cite{moulines90}) (PSOLA is not available for
distribution).

@menu
* Diphone database format::     Format of basic dbs
* LPC databases::               Building and using LPC files.
* Group files::                 Efficient binary formats
* Diphone_Init::                Loading diphone databases
* Access strategies::           Various access methods
* Diphone selection::           Mapping phones to special diphone names
@end menu

@node Diphone database format, LPC databases, , Diphone synthesizer
@section Diphone database format

A diphone database consists of a @emph{dictionary file}, a set
of @emph{waveform files}, and a set of @emph{pitch mark files}.  These
files are the same format as the previous CSTR (Osprey) synthesizer.

The dictionary file consists of one entry per line.  Each entry
consists of five fields: a diphone name of the form @var{P1-P2}, a
filename (without extension), a floating point start position in the
file in milliseconds, a mid position in milliseconds (change in phone),
and an end position in milliseconds.  Lines starting with a semi-colon
and blank lines are ignored.  The list may be in any order.

For example a partial list of phones may look like
@example
ch-l r021 412.035 463.009 518.23
jh-l d747 305.841 382.301 446.018
h-l d748 356.814 403.54 437.522
#-@@ d404 233.628 297.345 331.327
@@-# d001 836.814 938.761 1002.48
@end example

Waveform files may be in any form, as long as every file is the same
type, headered or unheadered, as long as the format is supported by the
speech tools wave reading functions.  These may be standard linear PCM
waveform files in the case of PSOLA, or LPC coefficients and residual
when using the residual LPC synthesizer.  @ref{LPC databases}.

Pitch mark files consist of a simple list of positions in milliseconds
(plus places after the point) in order, one per line, of each pitch mark
in the file.  For high quality diphone synthesis these should be derived
from laryngograph data.  During unvoiced sections pitch marks should be
artificially created at reasonable intervals (e.g. 10 ms).  In the
current format there is no way to determine the "real" pitch marks from
the "unvoiced" pitch marks.
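Filling unvoiced stretches with equally spaced artificial pitch marks (as @file{pitchmark}'s @code{-fill -def} options do for UniSyn databases) can be sketched as follows.  This is an illustrative Python fragment; the "gap larger than twice the default period" threshold is an assumption, not the speech tools rule:

```python
# Illustrative gap filling: insert equally spaced artificial pitch marks
# wherever consecutive real marks are further apart than an assumed
# threshold of twice the default period def_ms.
def fill_pitchmarks(marks_ms, def_ms=10.0):
    out = [marks_ms[0]]
    for t in marks_ms[1:]:
        prev = out[-1]
        gap = t - prev
        if gap > 2 * def_ms:                 # treat as an unvoiced stretch
            n = int(gap // def_ms)
            step = gap / n
            for i in range(1, n):
                out.append(prev + i * step)
        out.append(t)
    return out

filled = fill_pitchmarks([0.0, 8.0, 50.0])
```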

It is normal to hold a diphone database in a directory with
a number of sub-directories, namely @file{dic/} containing
the dictionary file, @file{wave/} for the waveform files, typically
of whole nonsense words (sometimes this directory is called
@file{vox/} for historical reasons) and @file{pm/} for
the pitch mark files.  The filename in the dictionary entry should
be the same for the waveform file and the pitch mark file (with different
extensions).

@node LPC databases, Group files, Diphone database format, Diphone synthesizer
@section LPC databases

The standard method for diphone resynthesis in the released system is
residual excited LPC (@cite{hunt89}).  The actual method of resynthesis
isn't important to the database format, but if residual LPC synthesis
is to be used then it is necessary to make the LPC coefficient
files and their corresponding residuals.

Previous versions of the system used a "host of hacky little scripts"
to do this, but now that the Edinburgh Speech Tools supports LPC analysis
we can provide a walk through for generating these.

We assume that the waveform files of nonsense words are in a directory
called @file{wave/}.  The LPC coefficients and residuals will be, in
this example, stored in @file{lpc16k/} with extensions @file{.lpc} and
@file{.res} respectively.

Before starting it is worth considering power normalization.  We have
found this important on all of the databases we have collected so far.
The @code{ch_wave} program, part of the speech tools, with the option
@code{-scaleN 0.4} may be used if a more complex method is not
available.

The following shell command generates the files
@example
for i in wave/*.wav
do
   fname=`basename $i .wav`
   echo $i
   lpc_analysis -reflection -shift 0.01 -order 18 -o lpc16k/$fname.lpc \
      -r lpc16k/$fname.res -otype htk -rtype nist $i
done
@end example
It is said that the LPC order should be the sample rate divided by one
thousand, plus 2.  This may or may not be appropriate and if you are
particularly worried about the database size it is worth experimenting.
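That rule of thumb in code form (illustrative; as the text says, the rule is only a starting point):

```python
# Rule of thumb from the text: LPC order = sample_rate / 1000 + 2.
def suggested_lpc_order(sample_rate_hz):
    return sample_rate_hz // 1000 + 2
```

For the 16 kHz database above this gives 18, matching the @code{-order 18} argument in the shell loop.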

The program @file{lpc_analysis}, found in @file{speech_tools/bin},
can be used to generate the LPC coefficients and residual.  Note
these should be reflection coefficients so they may be quantised
(as they are in group files).

The coefficients and residual files produced by different LPC analysis
programs may start at different offsets.  For example Entropic's ESPS
functions generate LPC coefficients that are offset by one frame shift
(e.g. 0.01 seconds).  Our own @file{lpc_analysis} routine has no offset.
The @code{Diphone_Init} parameter list allows these offsets to be
specified.  Using the above function to generate the LPC files the
description parameters should include
@lisp
   (lpc_frame_offset 0)
   (lpc_res_offset 0.0)
@end lisp
While when generating using ESPS routines the description should be
@lisp
   (lpc_frame_offset 1)
   (lpc_res_offset 0.01)
@end lisp
The defaults actually follow the ESPS form, that is @code{lpc_frame_offset}
is 1 and @code{lpc_res_offset} is equal to the frame shift, if they are
not explicitly mentioned.

Note the biggest problem we had in implementing the residual excited
LPC resynthesizer was getting the right part of the residual to line up
with the right LPC coefficients describing the pitch mark.  Making
errors in this degrades the synthesized waveform notably, but not
seriously, making it difficult to determine if it is an offset problem or
some other bug.

Although we have started investigating whether extracting pitch synchronous
LPC parameters rather than fixed shift parameters gives better
performance, we haven't finished this work.  @file{lpc_analysis}
supports pitch synchronous analysis but the raw "ungrouped"
access method does not yet.  At present the LPC parameters are
extracted at a particular pitch mark by interpolating over the
closest LPC parameters.  The "group" files hold these interpolated
parameters pitch synchronously.

The American English voice @file{kd} was created using the speech
tools @file{lpc_analysis} program and its set up should
be looked at if you are going to copy it.  The British English voice
@file{rb} was constructed using ESPS routines.

@node Group files, Diphone_Init, LPC databases, Diphone synthesizer
@section Group files

@cindex group files
Databases may be accessed directly but this is usually too inefficient
for any purpose except debugging.  It is expected that @emph{group
files} will be built which contain a binary representation of the
database.  A group file is a compact efficient representation of the
diphone database.  Group files are byte order independent, so may be
shared between machines of different byte orders and word sizes.
Certain information in a group file may be changed at load time so a
database name, access strategy etc. may be changed from what was set
originally in the group file.

A group file contains the basic parameters, the diphone index, the
signal (original waveform or LPC residual), LPC coefficients, and the
pitch marks.  It is all you need for a run-time synthesizer.
Various compression mechanisms are supported to allow smaller databases
if desired.  A full English LPC plus residual database at 8k ulaw
is about 3 megabytes, while a full 16 bit version at 16k is about
8 megabytes.

Group files are created with the @code{Diphone.group} command which
takes a database name and an output filename as arguments.  Making
group files can take some time especially if they are large.  The
@code{group_type} parameter specifies @code{raw} or @code{ulaw}
for encoding signal files.  This can significantly reduce the size
of databases.

Group files may be partially loaded (see access strategies) at
run time for quicker start up and to minimise run-time
memory requirements.

@node Diphone_Init, Access strategies, Group files, Diphone synthesizer
@section Diphone_Init

The basic method for describing a database is through the @code{Diphone_Init}
command.  This function takes a single argument, a list of
pairs of parameter name and value.  The parameters are
@table @code
@item name
An atomic name for this database.
@item group_file
The filename of a group file, which may itself contain parameters
describing itself.
@item type
The default value is @code{pcm}, but for distributed voices
this is always @code{lpc}.
@item index_file
A filename containing the diphone dictionary.
@item signal_dir
A directory (slash terminated) containing the pcm waveform files.
@item signal_ext
A dot prefixed extension for the pcm waveform files.
@item pitch_dir
A directory (slash terminated) containing the pitch mark files.
@item pitch_ext
A dot prefixed extension for the pitch files.
@item lpc_dir
A directory (slash terminated) containing the LPC coefficient files
and residual files.
@item lpc_ext
A dot prefixed extension for the LPC coefficient files.
@item lpc_type
The type of LPC file (as supported by the speech tools).
@item lpc_frame_offset
The number of frames "missing" from the beginning of the file.
Often LPC parameters are offset by one frame.
@item lpc_res_ext
A dot prefixed extension for the residual files.
@item lpc_res_type
The type of the residual files; this is a standard waveform type
as supported by the speech tools.
@item lpc_res_offset
Number of seconds "missing" from the beginning of the residual file.
Some LPC analysis techniques do not generate a residual until after one
frame.
@item samp_freq
Sample frequency of the signal files.
@item phoneset
Phoneset used, must already be declared.
@item num_diphones
Total number of diphones in the database.  If specified this must be
equal to or bigger than the number of entries in the index file.
If it is not specified the square of the number of phones in the
phoneset is used.
@item sig_band
Number of sample points around the actual diphone to take from the file.
This should be larger than any windowing used on the signal,
and/or up to the pitch marks outside the diphone signal.
@item alternates_after
List of pairs of phones stating replacements for the second
part of the diphone when the basic diphone is not found in the
diphone database.
@item alternates_before
List of pairs of phones stating replacements for the first
part of the diphone when the basic diphone is not found in the
diphone database.
@item default_diphone
When unexpected combinations occur and no appropriate diphone can be
found this diphone should be used.  This should be specified for all
diphone databases that are to be robust.  We usually use the silence to
silence diphone.  No matter how carefully you designed your diphone set,
conditions where an unknown diphone occurs seem to @emph{always} happen.
If this is not set and a diphone is requested that is not in the
database an error occurs and synthesis will stop.
@end table

Examples of both general set up, making group files and general
use are in
@example
@file{lib/voices/english/rab_diphone/festvox/rab_diphone.scm}
@end example

@node Access strategies, Diphone selection , Diphone_Init, Diphone synthesizer
@section Access strategies

@cindex access strategies
Three basic accessing strategies are available when using
diphone databases.  They are designed to optimise access time, start up
time and space requirements.

@table @code
@item direct
Load all signals at database init time.  This is the slowest startup but
the fastest to access.  This is ideal for servers.  It is also useful
for small databases that can be loaded quickly.  It is reasonable for
many group files.
@item dynamic
Load signals as they are required.  This has much faster
start up and will only gradually use up memory as the diphones
are actually used.  Useful for larger databases, and for non-group
file access.
@item ondemand
Load the signals as they are requested but free them if they are not
required again immediately.  This is slower access but requires low
memory usage.  In group files the re-reads are quite cheap as the
database is well cached and a file descriptor is already open for the
file.
@end table
Note that in group files pitch marks (and LPC coefficients) are
always fully loaded (cf. @code{direct}), as they are typically
smaller.  Only signals (waveform files or residuals) are potentially
dynamically loaded.
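The difference between @code{dynamic} and @code{ondemand} is essentially a caching policy; a toy Python sketch of the two lazy strategies (hypothetical loader, not Festival code):

```python
# Toy sketch of the two lazy access strategies as caching policies.
def make_loader(strategy, read_signal):
    cache = {}
    def get(name):
        if name in cache:
            return cache[name]
        sig = read_signal(name)
        if strategy == "dynamic":
            cache[name] = sig   # keep forever: memory grows with use
        # "ondemand": drop it; re-read (cheaply, for group files) next time
        return sig
    return get

reads = []
def fake_read(name):
    reads.append(name)
    return "signal:" + name

dyn = make_loader("dynamic", fake_read)
dyn("aa-b"); dyn("aa-b")        # second call served from cache
n_dynamic = len(reads)

reads.clear()
ond = make_loader("ondemand", fake_read)
ond("aa-b"); ond("aa-b")        # re-read on each request
n_ondemand = len(reads)
```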

@node Diphone selection, , Access strategies, Diphone synthesizer
@section Diphone selection

@cindex dark l's
@cindex consonant clusters
@cindex overriding diphone names
@cindex diphone names
The appropriate diphone is selected based on the name of the phone
identified in the segment stream.  However for better diphone synthesis
it is useful to augment the diphone database with other diphones in
addition to the ones directly from the phoneme set.  For example dark
and light l's, or distinguishing consonants from their consonant cluster
form and their isolated form.  There are however two methods to identify
this modification from the basic name.

@cindex @code{diphone_module_hooks}
@cindex @code{diphone_phone_name}
When the diphone module is called the hook @code{diphone_module_hooks}
is applied.  That is, a function or list of functions which will be
applied to the utterance.  Its main purpose is to allow the conversion
of the basic name into an augmented one.  For example converting a basic
@code{l} into a dark l, denoted by @code{ll}.  The functions given in
@code{diphone_module_hooks} may set the feature
@code{diphone_phone_name} which if set will be used rather than the
@code{name} of the segment.

For example suppose we wish to use a dark l (@code{ll}) rather than
a normal l for all l's that appear in the coda of a syllable.
First we would define a function which identifies this condition
and adds the additional feature @code{diphone_phone_name} to identify
the name change.  The following function would
achieve this
@lisp
(define (fix_dark_ls utt)
"(fix_dark_ls UTT)
Identify ls in coda position and relabel them as ll."
  (mapcar
   (lambda (seg)
     (if (and (string-equal "l" (item.name seg))
              (string-equal "+" (item.feat seg "p.ph_vc"))
              (item.relation.prev seg "SylStructure"))
         (item.set_feat seg "diphone_phone_name" "ll")))
   (utt.relation.items utt 'Segment))
  utt)
@end lisp
Then when we wish to use this for a particular voice we need to
add
@lisp
(set! diphone_module_hooks (list fix_dark_ls))
@end lisp
@exdent in the voice selection function.

For a more complex example including consonant cluster identification
see the American English voice @file{ked} in
@file{festival/lib/voices/english/ked/festvox/kd_diphone.scm}.  The
function @code{ked_diphone_fix_phone_name} carries out a number of
mappings.

The second method for changing a name is during actual look up of a
diphone in the database.  The list of alternates is given by the
@code{Diphone_Init} function.  These are used when the specified diphone
can't be found.  For example we often allow mappings of dark l,
@code{ll}, to @code{l} as sometimes the dark l diphone doesn't actually
exist in the database.

@node Other synthesis methods, Audio output, Diphone synthesizer, Top
@chapter Other synthesis methods

Festival supports a number of other synthesis systems

@menu
* LPC diphone synthesizer::     A small LPC synthesizer (Donovan diphones)
* MBROLA::                      Interface to MBROLA
* Synthesizers in development::
@end menu

@node LPC diphone synthesizer, MBROLA , , Other synthesis methods
@section LPC diphone synthesizer

A very simple, and very efficient, LPC diphone synthesizer using
the "donovan" diphones is also supported.  This synthesis method
is primarily the work of Steve Isard and later Alistair Conkie.
The synthesis quality is not as good as the residual excited LPC
diphone synthesizer but has the advantage of being much smaller.
The donovan diphone database is under 800k.

The diphones are loaded through the @code{Donovan_Init} function
which takes the name of the dictionary file and the diphone file
as arguments; see the following for details
@example
lib/voices/english/don_diphone/festvox/don_diphone.scm
@end example

@node MBROLA, Synthesizers in development, LPC diphone synthesizer, Other synthesis methods
@section MBROLA

@cindex MBROLA
As an example of how Festival may use a completely external synthesis
method we support the free system MBROLA.  MBROLA is both a diphone
synthesis technique and an actual system that constructs waveforms from
segment, duration and F0 target information.  For details see the MBROLA
home page at @url{http://tcts.fpms.ac.be/synthesis/mbrola.html}.  MBROLA
already supports a number of diphone sets including French, Spanish,
German and Romanian.

Festival support for MBROLA is in the file @file{lib/mbrola.scm}.
It is all in Scheme.  The function @code{MBROLA_Synth} is called
when parameter @code{Synth_Method} is @code{MBROLA}.  The
function simply saves the segment, duration and target information
from the utterance, calls the external @file{mbrola} program with the
selected diphone database, and reloads the generated waveform
back into the utterance.
|
|
An MBROLA-ized version of the Roger diphoneset is available from the
|
|
MBROLA site. The simple Festival end is distributed as part of
|
|
the system in @file{festvox_en1.tar.gz}.
|
|
The following variables are used by the process
|
|
@table @code
|
|
@item mbrola_progname
|
|
the pathname of the mbrola executable.
|
|
@item mbrola_database
|
|
the name of the database to use. This variable is switched between
|
|
different speakers.
|
|
@end table
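
For example, a voice definition might point these variables at a
locally installed copy of MBROLA before selecting the synthesis method.
The paths and database name below are purely illustrative and will
differ on your installation.
@lisp
(set! mbrola_progname "/usr/local/bin/mbrola")      ;; illustrative path
(set! mbrola_database "/usr/local/mbrola/en1/en1")  ;; illustrative database
(Parameter.set 'Synth_Method 'MBROLA)
@end lisp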

@node Synthesizers in development, , MBROLA, Other synthesis methods
@section Synthesizers in development

In addition to the above synthesizers, Festival also
supports CSTR's older PSOLA synthesizer written by Paul Taylor.
But as the newer diphone synthesizer produces similar quality
output and is a newer (and hence cleaner) implementation,
further development of the older module is unlikely.

@cindex selection-based synthesis
An experimental unit selection synthesis module is included in
@file{modules/clunits/}; it is an implementation of @cite{black97c}. It
is included for people wishing to continue research in the area rather
than as a fully usable waveform synthesis engine. Although it sometimes
gives excellent results, it also sometimes gives amazingly bad ones.
We included this as an example of one possible framework for
selection-based synthesis.

As one of our funded projects is specifically to develop new selection
based synthesis algorithms, we expect to include more models within later
versions of the system.

Also, now that Festival has been released, other groups are working
on new synthesis techniques in the system. Many of these will
become available, and where possible we will give pointers to them from
the Festival home page. In particular, there is an alternative
residual excited LPC module implemented at the Center for Spoken
Language Understanding (CSLU) at the Oregon Graduate Institute (OGI).

@node Audio output, Voices, Other synthesis methods , Top
@chapter Audio output

@cindex audio output
If you have never heard any audio on your machine then you must
first work out if you have the appropriate hardware. If you do, you
also need the appropriate software to drive it. Festival can directly
interface with a number of audio systems or use external
methods for playing audio.

The currently supported audio methods are
@table @samp
@cindex NAS
@cindex netaudio
@item NAS
NCD's NAS is a network transparent audio system (formerly called
netaudio). If you already run servers on your machines you
simply need to ensure your @code{AUDIOSERVER} environment variable
is set (or your @code{DISPLAY} variable if your audio output device is the
same as your X Windows display).
You may set NAS as your audio output method by the command
@lisp
(Parameter.set 'Audio_Method 'netaudio)
@end lisp
@cindex @file{/dev/audio}
@cindex sunaudio
@item /dev/audio
On many systems @file{/dev/audio} offers a simple low level method for
audio output. It is limited to mu-law encoding at 8kHz. Some
implementations of @file{/dev/audio} allow other sample rates and sample
types, but as that is non-standard this method only uses the common
format. Typical systems that offer this are Sun, Linux and FreeBSD
machines. You may set direct @file{/dev/audio} access as your audio
method by the command
@lisp
(Parameter.set 'Audio_Method 'sunaudio)
@end lisp

@cindex @file{/dev/audio}
@cindex sun16
@item /dev/audio (16bit)
Later Sun Microsystems workstations support 16 bit
linear audio at various sample rates. This form of audio
output is supported, as a compile time option (as
it requires include files that only exist on Sun machines). If
your installation supports it (check the members of the list
@code{*modules*}) you can select 16 bit audio output on
Suns by the command
@lisp
(Parameter.set 'Audio_Method 'sun16audio)
@end lisp
Note this will send the audio to the local machine where the festival binary
is running, which might not be the one you are sitting next to---that's
why we recommend netaudio. A hacky solution to playing audio on a local
machine from a remote machine without using netaudio is described
in @ref{Installation}.
@item /dev/dsp (voxware)
@cindex @file{/dev/dsp}
@cindex Linux
@cindex FreeBSD
@cindex voxware
Both FreeBSD and Linux have a very similar audio interface through
@file{/dev/dsp}. There is compile time support for this in the speech
tools, and when compiled with that option Festival may utilise it.
Check the value of the variable @code{*modules*} to see which audio
devices are directly supported. On FreeBSD, if supported, you
may select local 16 bit linear audio by the command
@lisp
(Parameter.set 'Audio_Method 'freebsd16audio)
@end lisp
While under Linux, if supported, you may use the command
@lisp
(Parameter.set 'Audio_Method 'linux16audio)
@end lisp
Some earlier (and smaller) machines only have 8 bit audio even though
they include a @file{/dev/dsp} (Soundblaster PRO for example). This was
not dealt with properly in earlier versions of the system, but the
support now automatically checks the sample width supported and uses
it accordingly. 8 bit audio at frequencies higher than 8kHz sounds
better than straight 8kHz mu-law, so this feature is useful.

@cindex Windows NT audio
@cindex Windows 95 audio
@item mplayer
Under Windows NT or 95 you can use the @file{mplayer} command, which
we have found requires special treatment to get its parameters right.
Rather than using @code{Audio_Command} you can select this on
Windows machines with the following command
@lisp
(Parameter.set 'Audio_Method 'mplayeraudio)
@end lisp
Alternatively built-in audio output is available with
@lisp
(Parameter.set 'Audio_Method 'win32audio)
@end lisp
@cindex IRIX
@cindex SGI
@item SGI IRIX
Built-in audio output is now available for SGI's IRIX 6.2 using
the command
@lisp
(Parameter.set 'Audio_Method 'irixaudio)
@end lisp
@cindex audio command
@item Audio Command
Alternatively the user can provide a command that can play an audio
file. Festival will execute that command in an environment where the
shell variable @code{SR} is set to the sample rate (in Hz) and
@code{FILE} to the name of a file which, by default, is an unheadered
raw, 16 bit file containing the synthesized waveform in the byte order
of the machine Festival is running on. You can specify your audio play
command, and that you wish Festival to execute that command, through the
following commands
@lisp
(Parameter.set 'Audio_Command "sun16play -f $SR $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
@end lisp
On SGI machines under IRIX the equivalent would be
@lisp
(Parameter.set 'Audio_Command
        "sfplay -i integer 16 2scomp rate $SR end $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
@end lisp
@end table

With the @code{Audio_Command} method of playing waveforms, Festival
supports two additional audio parameters. @code{Audio_Required_Rate}
allows you to use Festival's internal sample rate conversion function to
convert to any desired rate. Note this may not be as good as playing the
waveform at the sample rate it was originally created in, but as some
hardware devices are restrictive in what sample rates they support, or
have naive resampling functions, this could be optimal. The second
additional audio parameter is @code{Audio_Required_Format}, which can be
used to specify the desired output format of the file. The default
is unheadered raw, but this may be any of the values supported by
the speech tools (including nist, esps, snd, riff, aiff, audlab, raw
and, if you really want it, ascii). For example, suppose you
have a program that only plays Sun headered files at 16 kHz; you can
set up audio output as
@lisp
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Rate 16000)
(Parameter.set 'Audio_Required_Format 'snd)
(Parameter.set 'Audio_Command "sunplay $FILE")
@end lisp

@cindex audio devices
Where the audio method supports it, you can specify an alternative audio
device for machines that have more than one audio device.
@lisp
(Parameter.set 'Audio_Device "/dev/dsp2")
@end lisp

@cindex remote audio
@cindex snack
If Netaudio is not available and you need to play audio on a
machine different from the one Festival is running on, we have
had reports that @file{snack} (@url{http://www.speech.kth.se/snack/})
is a possible solution. It allows remote play but importantly
also supports Windows 95/NT based clients.

@cindex audio spooler
@cindex asynchronous synthesis
Because you do not want to wait for a whole file to be synthesized
before you can play it, Festival also offers an @emph{audio spooler}
that allows the playing of audio files while continuing to synthesize
the following utterances. On reasonable workstations this allows the
breaks between utterances to be as short as your hardware allows them
to be.

The audio spooler may be started by selecting asynchronous
mode
@lisp
(audio_mode 'async)
@end lisp
This is switched on by default by the function @code{tts}.
You may put Festival back into synchronous mode (i.e. the @code{utt.play}
command will wait until the audio has finished playing before returning)
by the command
@lisp
(audio_mode 'sync)
@end lisp
Additional related commands are
@table @code
@item (audio_mode 'close)
Close the audio server down, but wait until the queue is cleared. This
is useful in scripts etc. when you wish to exit only when all audio is
complete.
@item (audio_mode 'shutup)
Close the audio down now, stopping the current file being played and
any in the queue. Note that this may take some time to take effect
depending on which audio method you use. Sometimes there can be
100s of milliseconds of audio in the device itself which cannot
be stopped.
@item (audio_mode 'query)
Lists the size of each waveform currently in the queue.
@end table
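
For example, a batch script that synthesizes several utterances
asynchronously and exits only once everything has been played might
combine these modes as follows
@lisp
(audio_mode 'async)       ;; play while synthesizing the next utterance
(SayText "First sentence.")
(SayText "Second sentence.")
(audio_mode 'close)       ;; wait for the queue to drain before exiting
@end lisp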

@node Voices, Tools, Audio output, Top
@chapter Voices

This chapter gives some general suggestions about adding new voices to
Festival. Festival attempts to offer an environment where new voices
and languages can easily be slotted in to the system.

@menu
* Current voices:: Currently available voices
* Building a new voice:: Building a new voice
* Defining a new voice:: Defining a new voice
@end menu

@node Current voices, Building a new voice, , Voices
@section Current voices

@cindex voices
Currently there are a number of voices available in Festival and we
expect that number to increase. Each is selected via a function of the
name @samp{voice_*} which sets up the waveform synthesizer, phone set,
lexicon, duration and intonation models (and anything else necessary)
for that speaker. These voice setup functions are defined in
@file{lib/voices.scm}.

The current voice functions are
@table @code

@item voice_rab_diphone
A British English male RP speaker, Roger. This uses the UniSyn residual
excited LPC diphone synthesizer. The lexicon is the computer users
version of the Oxford Advanced Learners' Dictionary, with letter to sound
rules trained from that lexicon. Intonation is provided by a ToBI-like
system using a decision tree to predict accent and end tone position.
The F0 itself is predicted as three points on each syllable, using
linear regression trained from the Boston University FM database (f2b)
and mapped to Roger's pitch range. Duration is predicted by decision
tree, predicting zscore durations for segments trained from the 460
TIMIT sentences spoken by another British male speaker.
@item voice_ked_diphone
An American English male speaker, Kurt. Again this uses the UniSyn
residual excited LPC diphone synthesizer. This uses the CMU lexicon,
and letter to sound rules trained from it. Intonation as with Roger is
trained from the Boston University FM Radio corpus. Duration for this
voice also comes from that database.
@item voice_kal_diphone
An American English male speaker. Again this uses the UniSyn residual
excited LPC diphone synthesizer. And like ked, it uses the CMU lexicon,
and letter to sound rules trained from it. Intonation as with Roger is
trained from the Boston University FM Radio corpus. Duration for this
voice also comes from that database. This voice was built in two days'
work and is at least as good as ked, due to us understanding the process
better. The diphone labels were autoaligned with hand correction.
@item voice_don_diphone
Steve Isard's LPC based diphone synthesizer, Donovan diphones. The
other parts of this voice, lexicon, intonation, and duration are the
same as @code{voice_rab_diphone} described above. The
quality of the diphones is not as good as the other voices because it
uses spike excited LPC. Although the quality is not as good, it
is much faster and the database is much smaller than the others.
@item voice_el_diphone
A male Castilian Spanish speaker, using the Eduardo Lopez diphones.
Alistair Conkie and Borja Etxebarria did much to make this. It has
improved recently but is not as comprehensive as our English voices.
@item voice_gsw_diphone
This offers a male RP speaker, Gordon, famed for many previous CSTR
synthesizers, using the standard diphone module. Its higher
levels are very similar to the Roger voice above. This voice
is not in the standard distribution, and is unlikely to be added
for commercial reasons, even though it sounds better than Roger.
@item voice_en1_mbrola
The Roger diphone set using the same front end as @code{voice_rab_diphone}
but using the MBROLA diphone synthesizer for waveform synthesis. The
MBROLA synthesizer and Roger diphone database (called @code{en1})
are not distributed by CSTR but are available for non-commercial use
for free from @url{http://tcts.fpms.ac.be/synthesis/mbrola.html}.
We do however provide the Festival part of the voice in
@file{festvox_en1.tar.gz}.
@item voice_us1_mbrola
A female American English voice using our standard US English front end
and the @code{us1} database for the MBROLA diphone synthesizer for
waveform synthesis. The MBROLA synthesizer and the @code{us1} diphone
database are not distributed by CSTR but are available for
non-commercial use for free from
@url{http://tcts.fpms.ac.be/synthesis/mbrola.html}. We
provide the Festival part of the voice in @file{festvox_us1.tar.gz}.
@item voice_us2_mbrola
A male American English voice using our standard US English front end
and the @code{us2} database for the MBROLA diphone synthesizer for
waveform synthesis. The MBROLA synthesizer and the @code{us2} diphone
database are not distributed by CSTR but are available for
non-commercial use for free from
@url{http://tcts.fpms.ac.be/synthesis/mbrola.html}. We
provide the Festival part of the voice in @file{festvox_us2.tar.gz}.
@item voice_us3_mbrola
Another male American English voice using our standard US English front
end and the @code{us3} database for the MBROLA diphone synthesizer for
waveform synthesis. The MBROLA synthesizer and the @code{us3} diphone
database are not distributed by CSTR but are available for non-commercial
use for free from @url{http://tcts.fpms.ac.be/synthesis/mbrola.html}.
We provide the Festival part of the voice in @file{festvox_us3.tar.gz}.
@end table
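
Any of the above voices can be selected interactively or in a script,
assuming the corresponding voice files and diphone database are
installed, for example
@lisp
(voice_ked_diphone)
(SayText "Hello from the ked voice.")
@end lisp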

@cindex CSLU
Other voices will become available through time. Groups other than CSTR
are working on new voices. In particular OGI's CSLU have released a
number of American English voices, two Mexican Spanish voices and two
German voices. All use OGI's own residual excited LPC
synthesizer, which is distributed as a plug-in for Festival
(see @url{http://www.cse.ogi.edu/CSLU/research/TTS} for
details).

Other languages are being worked on: German, Basque, Welsh,
Greek and Polish voices have already been developed and could be
released soon. CSTR has a set of Klingon diphones, though the text
analysis for Klingon still requires some work. (If anyone has access to
a good Klingon continuous speech corpus please let us know.)

Pointers and examples of voices developed at CSTR and elsewhere will
be posted on the Festival home page.

@node Building a new voice, Defining a new voice , Current voices, Voices
@section Building a new voice

@cindex spanish voice
This section runs through the definition of a new voice in Festival.
Although this voice is simple (it is a simplified version of the
distributed Spanish voice) it shows all the major parts that must be
defined to get Festival to speak in a new voice. Thanks go to Alistair
Conkie for helping me define this, but as I don't speak Spanish there are
probably many mistakes. Hopefully its pedagogical use is better than
its ability to be understood in Castile.

A much more detailed document on building voices in Festival has been
written and is recommended reading for anyone attempting to add a new
voice to Festival @cite{black99}. The information here is a little
sparse, though it gives the basic requirements.

The general method for defining a new voice is to define the
parameters for all the various sub-parts, e.g. phoneset, duration
parameters, intonation parameters etc., then define a function
of the form @code{voice_NAME} which when called will actually
select the voice.

@subsection Phoneset

@cindex phoneset definitions
For most new languages, and often for new dialects, a new
phoneset is required. It is really the basic building
block of a voice and most other parts are defined in terms
of this set, so defining it first is a good start.
@example
(defPhoneSet
  spanish
  ;;; Phone Features
  (;; vowel or consonant
   (vc + -)
   ;; vowel length: short long diphthong schwa
   (vlng s l d a 0)
   ;; vowel height: high mid low
   (vheight 1 2 3 -)
   ;; vowel frontness: front mid back
   (vfront 1 2 3 -)
   ;; lip rounding
   (vrnd + -)
   ;; consonant type: stop fricative affricative nasal liquid
   (ctype s f a n l 0)
   ;; place of articulation: labial alveolar palatal labio-dental
   ;;                        dental velar
   (cplace l a p b d v 0)
   ;; consonant voicing
   (cvox + -)
   )
  ;; Phone set members (features are not! set properly)
  (
   (#  - 0 - - - 0 0 -)
   (a  + l 3 1 - 0 0 -)
   (e  + l 2 1 - 0 0 -)
   (i  + l 1 1 - 0 0 -)
   (o  + l 3 3 - 0 0 -)
   (u  + l 1 3 + 0 0 -)
   (b  - 0 - - + s l +)
   (ch - 0 - - + a a -)
   (d  - 0 - - + s a +)
   (f  - 0 - - + f b -)
   (g  - 0 - - + s p +)
   (j  - 0 - - + l a +)
   (k  - 0 - - + s p -)
   (l  - 0 - - + l d +)
   (ll - 0 - - + l d +)
   (m  - 0 - - + n l +)
   (n  - 0 - - + n d +)
   (ny - 0 - - + n v +)
   (p  - 0 - - + s l -)
   (r  - 0 - - + l p +)
   (rr - 0 - - + l p +)
   (s  - 0 - - + f a +)
   (t  - 0 - - + s t +)
   (th - 0 - - + f d +)
   (x  - 0 - - + a a -)
   )
  )
(PhoneSet.silences '(#))
@end example
@cindex silences
Note some phonetic features may be wrong.

@subsection Lexicon and LTS

Spanish is a language whose pronunciation can almost completely be
predicted from its orthography, so in this case we do not need a
list of words and their pronunciations and can do most of the work
with letter to sound rules.

@cindex lexicon creation
Let us first make a lexicon structure as follows
@example
(lex.create "spanish")
(lex.set.phoneset "spanish")
@end example
However if we just want a few entries to test our system without
building any letter to sound rules we can add entries directly to
the addenda. For example
@example
(lex.add.entry
 '("amigos" nil (((a) 0) ((m i) 1) ((g o s) 0))))
@end example

A letter to sound rule system for Spanish is quite simple
in the format supported by Festival. The following is a good
start to a full set.
@cindex letter to sound rules
@example
(lts.ruleset
;  Name of rule set
 spanish
;  Sets used in the rules
(
  (LNS l n s )
  (AEOU a e o u )
  (AEO a e o )
  (EI e i )
  (BDGLMN b d g l m n )
)
;  Rules
(
 ( [ a ] = a )
 ( [ e ] = e )
 ( [ i ] = i )
 ( [ o ] = o )
 ( [ u ] = u )
 ( [ "'" a ] = a1 )  ;; stressed vowels
 ( [ "'" e ] = e1 )
 ( [ "'" i ] = i1 )
 ( [ "'" o ] = o1 )
 ( [ "'" u ] = u1 )
 ( [ b ] = b )
 ( [ v ] = b )
 ( [ c ] "'" EI = th )
 ( [ c ] EI = th )
 ( [ c h ] = ch )
 ( [ c ] = k )
 ( [ d ] = d )
 ( [ f ] = f )
 ( [ g ] "'" EI = x )
 ( [ g ] EI = x )
 ( [ g u ] "'" EI = g )
 ( [ g u ] EI = g )
 ( [ g ] = g )
 ( [ h u e ] = u e )
 ( [ h i e ] = i e )
 ( [ h ] =  )
 ( [ j ] = x )
 ( [ k ] = k )
 ( [ l l ] # = l )
 ( [ l l ] = ll )
 ( [ l ] = l )
 ( [ m ] = m )
 ( [ ~ n ] = ny )
 ( [ n ] = n )
 ( [ p ] = p )
 ( [ q u ] = k )
 ( [ r r ] = rr )
 ( # [ r ] = rr )
 ( LNS [ r ] = rr )
 ( [ r ] = r )
 ( [ s ] BDGLMN = th )
 ( [ s ] = s )
 ( # [ s ] C = e s )
 ( [ t ] = t )
 ( [ w ] = u )
 ( [ x ] = k s )
 ( AEO [ y ] = i )
 ( # [ y ] # = i )
 ( [ y ] = ll )
 ( [ z ] = th )
))
@end example
We could simply set our lexicon to use the above letter to sound
system with the following command
@example
(lex.set.lts.ruleset 'spanish)
@end example

But this would not deal with upper case letters. Instead of
writing new rules for upper case letters we can define that
a Lisp function be called when looking up a word, and intercept
the lookup with our own function. First we state that unknown
words should call a function, and then define the function we wish
called. The actual link to ensure our function will be called is done
below at lexicon selection time
@example
(define (spanish_lts word features)
  "(spanish_lts WORD FEATURES)
Using letter to sound rules build a spanish pronunciation of WORD."
  (list word
        nil
        (lex.syllabify.phstress (lts.apply (downcase word) 'spanish))))
(lex.set.lts.method spanish_lts)
@end example
In the function we downcase the word and apply the LTS rules to it.
Next we syllabify the result and return the created lexical entry.
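
We can check the rules themselves interactively before worrying about
syllabification. Tracing the rule set above by hand, each letter of
@emph{amigos} maps to the phone of the same name (the @code{g} falls
through to the final @code{( [ g ] = g )} rule since it is not followed
by @code{e} or @code{i}), so we expect
@lisp
festival> (lts.apply "amigos" 'spanish)
(a m i g o s)
@end lisp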

@subsection Phrasing

Without detailed labelled databases we cannot build statistical models
of phrase breaks, but we can simply build a phrase break model based
on punctuation. The following is a CART tree to predict simple breaks
from punctuation.
@example
(set! spanish_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((NB))))))
@end example

@subsection Intonation

For intonation there are a number of simple options that do not require
training data. For this example we will simply use a hat pattern on all
stressed syllables in content words and on single syllable content
words (i.e. @code{Simple}). Thus we need an accent prediction CART
tree.
@example
(set! spanish_accent_cart_tree
 '
  ((R:SylStructure.parent.gpos is content)
   ((stress is 1)
    ((Accented))
    ((position_type is single)
     ((Accented))
     ((NONE))))
   ((NONE))))
@end example
We also need to specify the pitch range of our speaker. We will
be using a male Spanish diphone database with the following range
@example
(set! spanish_el_int_simple_params
    '((f0_mean 120) (f0_std 30)))
@end example

@subsection Duration

We will use the trick mentioned above for duration prediction.
Using the zscore CART tree method, we will actually use it to
predict factors rather than zscores.

The tree predicts longer durations in stressed syllables and in
clause initial and clause final syllables.
@example
(set! spanish_dur_tree
 '
   ((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
    ((R:SylStructure.parent.stress is 1)
     ((1.5))
     ((1.2)))
    ((R:SylStructure.parent.syl_break > 1)   ;; clause final
     ((R:SylStructure.parent.stress is 1)
      ((2.0))
      ((1.5)))
     ((R:SylStructure.parent.stress is 1)
      ((1.2))
      ((1.0))))))
@end example
In addition to the tree we need durations for each phone in the
set
@example
(set! spanish_el_phone_data
'(
   (#  0.0 0.250)
   (a  0.0 0.090)
   (e  0.0 0.090)
   (i  0.0 0.080)
   (o  0.0 0.090)
   (u  0.0 0.080)
   (b  0.0 0.065)
   (ch 0.0 0.135)
   (d  0.0 0.060)
   (f  0.0 0.100)
   (g  0.0 0.080)
   (j  0.0 0.100)
   (k  0.0 0.100)
   (l  0.0 0.080)
   (ll 0.0 0.105)
   (m  0.0 0.070)
   (n  0.0 0.080)
   (ny 0.0 0.110)
   (p  0.0 0.100)
   (r  0.0 0.030)
   (rr 0.0 0.080)
   (s  0.0 0.110)
   (t  0.0 0.085)
   (th 0.0 0.100)
   (x  0.0 0.130)
))
@end example

@subsection Waveform synthesis

There are a number of choices for waveform synthesis currently
supported. MBROLA supports Spanish, so we could use that. But their
Spanish diphones in fact use a slightly different phoneset, so we would
need to change the above definitions to use it effectively. Here we will
use a diphone database for Spanish recorded by Eduardo Lopez when he was
a Masters student some years ago.

Here we simply load our pre-built diphone database
@example
(us_diphone_init
 (list
  '(name "el_lpc_group")
  (list 'index_file
        (path-append spanish_el_dir "group/ellpc11k.group"))
  '(grouped "true")
  '(default_diphone "#-#")))
@end example

@subsection Voice selection function

The standard way to define a voice in Festival is to define
a function of the form @code{voice_NAME} which selects
all the appropriate parameters. Because the definition
below follows the above definitions, we know that everything
appropriate has been loaded into Festival, and hence we
just need to select the appropriate parameters.

@example
(define (voice_spanish_el)
"(voice_spanish_el)
Set up synthesis for Male Spanish speaker: Eduardo Lopez"
 (voice_reset)
 (Parameter.set 'Language 'spanish)
 ;; Phone set
 (Parameter.set 'PhoneSet 'spanish)
 (PhoneSet.select 'spanish)
 (set! pos_lex_name nil)
 ;; Phrase break prediction by punctuation
 (set! pos_supported nil)
 ;; Phrasing
 (set! phrase_cart_tree spanish_phrase_cart_tree)
 (Parameter.set 'Phrase_Method 'cart_tree)
 ;; Lexicon selection
 (lex.select "spanish")
 ;; Accent prediction
 (set! int_accent_cart_tree spanish_accent_cart_tree)
 (set! int_simple_params spanish_el_int_simple_params)
 (Parameter.set 'Int_Method 'Simple)
 ;; Duration prediction
 (set! duration_cart_tree spanish_dur_tree)
 (set! duration_ph_info spanish_el_phone_data)
 (Parameter.set 'Duration_Method 'Tree_ZScores)
 ;; Waveform synthesizer: diphones
 (Parameter.set 'Synth_Method 'UniSyn)
 (Parameter.set 'us_sigpr 'lpc)
 (us_db_select 'el_lpc_group)

 (set! current-voice 'spanish_el)
)

(provide 'spanish_el)
@end example

@subsection Last remarks

We save the above definitions in a file @file{spanish_el.scm}. Now we
can declare the new voice to Festival. @xref{Defining a new voice},
for a description of methods for adding new voices. For testing
purposes we can explicitly load the file @file{spanish_el.scm}.

The voice is now available for use in festival.
@example
festival> (voice_spanish_el)
spanish_el
festival> (SayText "hola amigos")
<Utterance 0x04666>
@end example

As you can see, adding a new voice is not very difficult. Of course
there is quite a lot more to do than the above to add a high quality
robust voice to Festival. But as we can see many of the basic tools
that we wish to use already exist. The main difference between the
above voice and the English voices already in Festival is that their
models are better trained from databases. This produces, in general,
better results, but the concepts behind them are basically the same.
All of those trainable methods may be parameterized with data for
new voices.

As Festival develops, more modules will be added with better support for
training new voices, so in the end we hope that adding in high quality
new voices is actually as simple as (or indeed simpler than) the above
description.

@subsection Resetting globals

@cindex resetting globals
@cindex @code{voice_reset}
Because the version of Scheme used in Festival only has a single flat
name space, it is unfortunately too easy for voices to set some global
which accidentally affects all other voices selected after it. Because
of this problem we have introduced a convention to try to minimise the
possibility of this becoming a problem. Each voice function
defined should always call @code{voice_reset} at the start. This
will reset any globals and also call a tidy up function provided by
the previous voice function.

Likewise in your new voice function you should provide a tidy up
function to reset any non-standard global variables you set. The
function @code{current_voice_reset} will be called by
@code{voice_reset}. If the value of @code{current_voice_reset} is
@code{nil} then it is not called. @code{voice_reset} sets
@code{current_voice_reset} to @code{nil} after calling it.

For example suppose some new voice requires the audio device to
be directed to a different machine. In this example we make
the giant's voice go through the netaudio machine @code{big_speakers}
while the standard voices go through @code{small_speakers}.

Although we can easily select the machine @code{big_speakers} as output
when our @code{voice_giant} is called, we also need to set it back when
the next voice is selected, and don't want to have to modify every other
voice defined in the system. Let us first define two functions to
select the audio output.
@lisp
(define (select_big)
  (set! giant_previous_audio (getenv "AUDIOSERVER"))
  (setenv "AUDIOSERVER" "big_speakers"))

(define (select_normal)
  (setenv "AUDIOSERVER" giant_previous_audio))
@end lisp
Note we save the previous value of @code{AUDIOSERVER} rather than simply
assuming it was @code{small_speakers}.
|
|
|
|
Our definition of @code{voice_giant} definition of @code{voice_giant}
|
|
will look something like
|
|
@lisp
|
|
(define (voice_giant)
|
|
"comment comment ..."
|
|
(voice_reset) ;; get into a known state
|
|
(select_big)
|
|
;;; other giant voice parameters
|
|
...
|
|
|
|
(set! current_voice_rest select_normal)
|
|
(set! current-voice 'giant))
|
|
@end lisp
The obvious question is which variables should a voice reset.
Unfortunately there is no definitive answer to that.  To a certain
extent we do not want to define such a list, as many variables will be
added by various users of Festival which are not in the original
distribution, and we do not wish to restrict them.  The longer term
answer is some form of partitioning of the Scheme name space, perhaps
having voice local variables (cf. Emacs buffer local variables).  But
ultimately a voice may set global variables which could redefine the
operation of later selected voices and there seems no real way to stop
that and keep the generality of the system.

@cindex current voice
Note the convention of setting the global @code{current-voice} at the
end of any voice definition file.  We do not enforce this but probably
should.  The variable @code{current-voice} at any time should identify
the current voice; the voice description information (described below)
will relate this name to properties identifying it.

@node Defining a new voice, , Building a new voice, Voices
@section Defining a new voice

@cindex adding new voices
As there are a number of voices available for Festival, which may or
may not exist in different installations, we have tried to make it as
simple as possible to add new voices to the system without having to
change any of the basic distribution.  In fact if the voices use the
following standard method for describing themselves it is merely a
matter of unpacking them in order for them to be used by the system.

@cindex @code{voice-path}
The variable @code{voice-path} contains a list of directories where
voices will be automatically searched for.  If this is not set it is
set automatically by appending @file{/voices/} to all paths in
Festival's @code{load-path}.  You may add new directories explicitly to
this variable in your @file{sitevars.scm} file or your own
@file{.festivalrc} as you wish.

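For example, to make Festival search a private directory of voices as
well, you might add something like the following to
@file{siteinit.scm} or @file{.festivalrc} (the directory name here is
purely illustrative):
@lisp
;; the path below is hypothetical; use your own voice directory
(set! voice-path (cons "/home/me/myvoices/" voice-path))
@end lisp
Directories added this way are searched using the same layout as the
standard voice directories described below.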
Each voice directory is assumed to be of the form
@example
LANGUAGE/VOICENAME/
@end example
Within the @code{VOICENAME/} directory itself it is assumed there is a
file @file{festvox/VOICENAME.scm} which when loaded will define the
voice itself.  The actual voice function should be called
@code{voice_VOICENAME}.

For example the voices distributed with the standard Festival
distribution all unpack in @file{festival/lib/voices}.  The American
voice @file{ked_diphone} unpacks into
@example
festival/lib/voices/english/ked_diphone/
@end example
Its actual definition file is in
@example
festival/lib/voices/english/ked_diphone/festvox/ked_diphone.scm
@end example
Note the name of the directory and the name of the Scheme definition
file must be the same.

Alternative voices, using perhaps a different encoding of the database
but the same front end, may be defined in the same way by using
symbolic links in the language directory to the main directory.  For
example a PSOLA version of the ked voice may be defined in
@example
festival/lib/voices/english/ked_diphone/festvox/ked_psola.scm
@end example
Adding a symbolic link in @file{festival/lib/voices/english/} to
@file{ked_diphone} called @file{ked_psola} will allow that voice to be
automatically registered when Festival starts up.

Note that this method doesn't actually load the voices it finds; that
could be prohibitively time consuming at start up.  It blindly assumes
that there is a file @file{VOICENAME/festvox/VOICENAME.scm} to load.
An autoload definition is given for @code{voice_VOICENAME} which when
called will load that file and call the real definition if it exists
in the file.

This is only a recommended method to make adding new voices easier; it
may be ignored if you wish.  However, even if you use your own
conventions for adding new voices, we still recommend you consider
using the autoload function to define them in, for example, the
@file{siteinit.scm} file or @file{.festivalrc}.  The autoload function
takes three arguments: a function name, a file containing the actual
definition and a comment.  For example a voice can be defined
explicitly by
@example
(autoload voice_f2b "/home/awb/data/f2b/ducs/f2b_ducs"
          "American English female f2b")
@end example
Of course you can also load the definition file explicitly if you
wish.

@cindex @code{proclaim_voice}
In order to allow the system to start making intelligent use of voices
we recommend that all voice definitions include a call to the function
@code{proclaim_voice}.  This allows the system to know some properties
of the voice such as language, gender and dialect.  The
@code{proclaim_voice} function takes two arguments: a name (e.g.
@code{rab_diphone}) and an assoc list of features and names.  Currently
we require @code{language}, @code{gender}, @code{dialect} and
@code{description}, the last being a textual description of the voice
itself.  An example proclamation is
@lisp
(proclaim_voice
 'rab_diphone
 '((language english)
   (gender male)
   (dialect british)
   (description
    "This voice provides a British RP English male voice using a
     residual excited LPC diphone synthesis method.  It uses a
     modified Oxford Advanced Learners' Dictionary for pronunciations.
     Prosodic phrasing is provided by a statistically trained model
     using part of speech and local distribution of breaks.  Intonation
     is provided by a CART tree predicting ToBI accents and an F0
     contour generated from a model trained from natural speech.  The
     duration model is also trained from data using a CART tree.")))
@end lisp

There are functions to access a description.  @code{voice.description}
will return the description for a given voice, loading that voice if
it is not already loaded.  @code{voice.describe} will describe the
given voice by synthesizing its textual description using the current
voice.  It would be nice to use the voice itself to give a self
introduction but unfortunately that introduces the problem of deciding
which language the description should be in; we are not all as fluent
in Welsh as we'd like to be.

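For example, the following has the currently selected voice speak the
stored description of the British English diphone voice:
@lisp
(voice.describe 'rab_diphone)
@end lisp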
The function @code{voice.list} will list the @emph{potential} voices
in the system.  These are the names of voices which have been found in
the @code{voice-path}.  As they have not actually been loaded they
can't be confirmed as usable voices.  One solution to this would be to
load all voices at start up time, which would allow confirmation that
they exist and retrieval of their full descriptions through
@code{proclaim_voice}.  But start up is already too slow in Festival
so we have to accept this state for the time being.  Splitting the
description of the voice from the actual definition is a possible
solution to this problem but we have not yet looked into it.

@node Tools, Building models from databases, Voices, Top
@chapter Tools

@cindex tools
A number of basic data manipulation tools are supported by Festival.
These often make building new modules very easy and are already used
in many of the existing modules.  They typically offer a Scheme method
for entering data, and Scheme and C++ functions for evaluating it.

@menu
* Regular expressions::
* CART trees::          Building and using CART
* Ngrams::              Building and using Ngrams
* Viterbi decoder::     Using the Viterbi decoder
* Linear regression::   Building and using linear regression models
@end menu

@node Regular expressions, CART trees, , Tools
@section Regular expressions

@cindex regular expressions
@cindex regex
@cindex wild card matching
Regular expressions are a formal method for describing a certain class
of mathematical languages.  They may be viewed as patterns which match
some set of strings.  They are very common in many software tools such
as scripting languages like the UNIX shell, Perl, awk, Emacs etc.
Unfortunately the exact form of regular expressions often differs
slightly between different applications, making their use a little
tricky.

Festival supports regular expressions based mainly on the form used in
the GNU libg++ @code{Regex} class, though we have our own
implementation of it.  Our implementation (@code{EST_Regex}) is
actually based on Henry Spencer's @file{regex.c} as distributed with
BSD 4.4.

Regular expressions are represented as character strings which are
interpreted as regular expressions by certain Scheme and C++
functions.  Most characters in a regular expression are treated as
literals and match only that character, but a number of others have
special meaning.  Some characters may be escaped with preceding
backslashes to change them from operators to literals (or sometimes
literals to operators).

@table @code
@item .
Matches any character.
@item $
Matches end of string.
@item ^
Matches beginning of string.
@item X*
Matches zero or more occurrences of X; X may be a character, range or
parenthesized expression.
@item X+
Matches one or more occurrences of X; X may be a character, range or
parenthesized expression.
@item X?
Matches zero or one occurrence of X; X may be a character, range or
parenthesized expression.
@item [...]
A range matches any of the characters within the brackets.  The range
operator "-" allows specification of ranges, e.g. @code{a-z} for all
lower case characters.  If the first character of the range is
@code{^} then it matches any character except those specified in the
range.  If you wish @code{-} to be in the range you must put it first.
@item \\(...\\)
Treats the contents of the parentheses as a single object, allowing
the operators @code{*}, @code{+}, @code{?} etc. to operate on more
than single characters.
@item X\\|Y
Matches either X or Y.  X or Y may be single characters, ranges or
parenthesized expressions.
@end table

Note that actually only one backslash is needed before a character to
escape it, but because these expressions are most often contained
within Scheme or C++ strings, the escape mechanism for those strings
requires that the backslash itself be escaped, hence you will most
often be required to type two backslashes.

Some examples may help in understanding the use of regular
expressions.

@table @code
@item a.b
matches any three letter string starting with an @code{a} and ending
with a @code{b}.
@item .*a
matches any string ending in an @code{a}
@item .*a.*
matches any string containing an @code{a}
@item [A-Z].*
matches any string starting with a capital letter
@item [0-9]+
matches any string of digits
@item -?[0-9]+\\(\\.[0-9]+\\)?
matches any positive or negative real number.  Note the optional
preceding minus sign and the optional part containing the point and
following digits.  The point itself must be escaped as a dot on its
own matches any character.
@item [^aeiouAEIOU]+
matches any non-empty string which doesn't contain a vowel
@item \\([Ss]at\\(urday\\)?\\)\\|\\([Ss]un\\(day\\)?\\)
matches Saturday and Sunday in various ways
@end table

The Scheme function @code{string-matches} takes a string and a regular
expression and returns @code{t} if the regular expression matches the
string and @code{nil} otherwise.

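A few illustrative calls (note that, as the examples above imply, the
pattern must match the whole string, not just a substring of it):
@lisp
(string-matches "fred" "[A-Za-z]+")                  ;; t
(string-matches "fred3" "[A-Za-z]+")                 ;; nil
(string-matches "-3.14" "-?[0-9]+\\(\\.[0-9]+\\)?")  ;; t
@end lisp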
@node CART trees, Ngrams, Regular expressions , Tools
@section CART trees

@cindex CART trees
One of the basic tools available with Festival is a system for
building and using Classification and Regression Trees
(@cite{breiman84}).  This standard statistical method can be used to
predict both categorical and continuous data from a set of feature
vectors.

@cindex wagon
The tree itself contains yes/no questions about features and
ultimately provides either a probability distribution, when predicting
categorical values (classification tree), or a mean and standard
deviation when predicting continuous values (regression tree).  Well
defined techniques can be used to construct an optimal tree from a set
of training data.  The program @file{wagon}, developed in conjunction
with Festival and distributed with the speech tools, provides a basic
but ever increasingly powerful method for constructing trees.

A tree need not be automatically constructed.  CART trees have the
advantage over some other automatic training methods, such as neural
networks and linear regression, that their output is more readable and
often understandable by humans, and importantly this makes it possible
to modify them.  CART trees may also be fully hand constructed.  This
is used, for example, in generating some duration models for languages
for which we do not yet have full databases to train from.

A CART tree has the following syntax
@lisp
CART ::= QUESTION-NODE || ANSWER-NODE
QUESTION-NODE ::= ( QUESTION YES-NODE NO-NODE )
YES-NODE ::= CART
NO-NODE ::= CART
QUESTION ::= ( FEATURE in LIST )
QUESTION ::= ( FEATURE is STRVALUE )
QUESTION ::= ( FEATURE = NUMVALUE )
QUESTION ::= ( FEATURE > NUMVALUE )
QUESTION ::= ( FEATURE < NUMVALUE )
QUESTION ::= ( FEATURE matches REGEX )
ANSWER-NODE ::= CLASS-ANSWER || REGRESS-ANSWER
CLASS-ANSWER ::= ( (VALUE0 PROB) (VALUE1 PROB) ... MOST-PROB-VALUE )
REGRESS-ANSWER ::= ( ( STANDARD-DEVIATION MEAN ) )
@end lisp
Note that answer nodes are distinguished by their car not being
atomic.

@cindex features
A tree is interpreted with respect to a particular item.  The
@var{FEATURE} in a tree is a standard feature (@pxref{Features}).

The following example tree is used in one of the Spanish voices to
predict variations from average durations.
@lisp
(set! spanish_dur_tree
 '((R:SylStructure.parent.R:Syllable.p.syl_break > 1) ;; clause initial
   ((R:SylStructure.parent.stress is 1)
    ((1.5))
    ((1.2)))
   ((R:SylStructure.parent.syl_break > 1) ;; clause final
    ((R:SylStructure.parent.stress is 1)
     ((2.0))
     ((1.5)))
    ((R:SylStructure.parent.stress is 1)
     ((1.2))
     ((1.0))))))
@end lisp
It is applied to the segment stream to give a factor to multiply the
average duration by.

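Such a tree can also be applied to a single item directly from Scheme.
A sketch (assuming an utterance @code{utt1} has already been
synthesized with the Spanish voice, and that your build provides the
@code{wagon_predict} function):
@lisp
;; predict the duration factor for the first segment of utt1
(wagon_predict (utt.relation.first utt1 'Segment) spanish_dur_tree)
@end lisp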
@code{wagon} is constantly improving and with version 1.2 of the
speech tools may now be considered fairly stable for its basic
operations.  Experimental features are described in the help it gives.
See the Speech Tools manual for a more comprehensive discussion of
using @file{wagon}.

However the above format of trees is similar to those produced by many
other systems and hence it is reasonable to translate their formats
into one which Festival can use.

@node Ngrams, Viterbi decoder, CART trees, Tools
@section Ngrams

@cindex ngrams
Bigrams, trigrams, and general ngrams are used in the part of speech
tagger and the phrase break predictor.  An Ngram C++ class is defined
in the speech tools library and some simple facilities are added
within Festival itself.

Ngrams may be built from files of tokens using the program
@code{ngram_build} which is part of the speech tools.  See the speech
tools documentation for details.

Within Festival ngrams may be named and loaded from files and used
when required.  The LISP function @code{load_ngram} takes a name and a
filename as arguments and loads the ngram from that file.  For an
example of its use once loaded see @file{src/modules/base/pos.cc} or
@file{src/modules/base/phrasify.cc}.

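For example, an ngram may be loaded and named from Scheme like this
(the filename is purely illustrative):
@lisp
(load_ngram 'pos-tri-gram "/home/me/models/pos-tri-gram.ngram")
@end lisp
The given name may then be referred to by modules, for example through
the @code{ngramname} parameter of the Viterbi decoder described in the
next section.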
@node Viterbi decoder, Linear regression, Ngrams, Tools
@section Viterbi decoder

@cindex Viterbi decoder
Another common tool is a Viterbi decoder.  This C++ class is defined
in the speech tools library @file{speech_tools/include/EST_viterbi.h}
and @file{speech_tools/stats/EST_viterbi.cc}.  A Viterbi decoder
requires two functions at declaration time.  The first constructs
candidates at each stage, while the second combines paths.  A number
of options are available (which may change).

The prototypical example of use is in the part of speech tagger which
uses standard ngram models to predict probabilities of tags.  See
@file{src/modules/base/pos.cc} for an example.

The Viterbi decoder can also be used through the Scheme function
@code{Gen_Viterbi}.  This function respects the parameters defined in
the variable @code{get_vit_params}.  Like other modules this parameter
list is an assoc list of feature name and value.  The parameters
supported are:
@table @code
@item Relation
The name of the relation the decoder is to be applied to.
@item cand_function
A function that is to be called for each item and that will return a
list of candidates (with probabilities).
@item return_feat
The name of a feature that the best candidate is to be returned in for
each item in the named relation.
@item p_word
The previous word to the first item in the named relation (only used
when ngrams are the "language model").
@item pp_word
The previous previous word to the first item in the named relation
(only used when ngrams are the "language model").
@item ngramname
The name of an ngram (loaded by @code{ngram.load}) to be used as a
"language model".
@item wfstmname
The name of a WFST (loaded by @code{wfst.load}) to be used as a
"language model"; this is ignored if an @code{ngramname} is also
specified.
@item debug
If specified, more debug features are added to the items in the
relation.
@item gscale_p
Grammar scaling factor.
@end table

Here is a short example to help make the use of this facility clearer.

There are two parts required for the Viterbi decoder: a set of
candidate observations and some "language model".  For the math to
work properly the candidate observations must be reverse probabilities
(for each candidate, the probability of the observation given the
candidate, rather than the probability of the candidate given the
observation).  These can be calculated as the probability of the
candidate given the observation divided by the probability of the
candidate in isolation.

For the sake of simplicity let us assume we have a lexicon mapping
words to distributions of part of speech tags with reverse
probabilities, and a trigram called @code{pos-tri-gram} over sequences
of part of speech tags.  First we must define the candidate function
@lisp
(define (pos_cand_function w)
  ;; select the appropriate lexicon
  (lex.select 'pos_lex)
  ;; return the list of cands with rprobs
  (cadr
   (lex.lookup (item.name w) nil)))
@end lisp
The returned candidate list would look something like
@lisp
( (jj -9.872) (vbd -6.284) (vbn -5.565) )
@end lisp
Our part of speech tagger function would look something like this
@lisp
(define (pos_tagger utt)
  (set! get_vit_params
        (list
         (list 'Relation "Word")
         (list 'return_feat 'pos_tag)
         (list 'p_word "punc")
         (list 'pp_word "nn")
         (list 'ngramname "pos-tri-gram")
         (list 'cand_function 'pos_cand_function)))
  (Gen_Viterbi utt)
  utt)
@end lisp
This will assign the optimal part of speech tags to each word in utt.

@node Linear regression, , Viterbi decoder, Tools
@section Linear regression

@cindex linear regression
The linear regression model takes models built by some external
package and finds coefficients based on the features and weights.  A
model consists of a list of features.  The first should be the atom
@code{Intercept} plus a value.  The following entries in the list
should consist of a feature (@pxref{Features}) followed by a weight.
An optional third element may be a list of atomic values.  If the
result of the feature is a member of this list the feature's value is
treated as 1, else it is 0.  This third argument allows an efficient
way to map categorical values into numeric values.  For example, from
the F0 prediction model in @file{lib/f2bf0lr.scm}, the first few
parameters are
@lisp
(set! f2b_f0_lr_start
      '(
        ( Intercept 160.584956 )
        ( Word.Token.EMPH 36.0 )
        ( pp.tobi_accent 10.081770 (H*) )
        ( pp.tobi_accent 3.358613 (!H*) )
        ( pp.tobi_accent 4.144342 (*? X*? H*!H* * L+H* L+!H*) )
        ( pp.tobi_accent -1.111794 (L*) )
        ...
        ))
@end lisp
Note the feature @code{pp.tobi_accent} returns an atom, and is hence
tested against the map groups specified as third arguments.

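A model is applied by summing the intercept and each weight multiplied
by its (possibly mapped) feature value.  For example, for a syllable
whose @code{pp.tobi_accent} is @code{H*} on an unemphasized word, the
prediction starts as
@example
  160.584956          ;; Intercept
+  36.0 * 0           ;; Word.Token.EMPH is 0
+  10.081770 * 1      ;; pp.tobi_accent is in (H*), so counts as 1
+   3.358613 * 0      ;; pp.tobi_accent is not in (!H*)
+ ...
@end example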
Models may be built from feature data (in the same format as used by
@file{wagon}) using the @file{ols} program distributed with the speech
tools library.

@node Building models from databases, Programming, Tools , Top
@chapter Building models from databases

@cindex databases
Because our research interests tend towards creating statistical
models trained from real speech data, Festival offers various support
for extracting information from speech databases, in a way suitable
for building models.

Models for accent prediction, F0 generation, duration, vowel
reduction, homograph disambiguation, phrase break assignment and unit
selection have been built using Festival to extract and process
various databases.

@menu
* Labelling databases::    Phones, syllables, words etc.
* Extracting features::    Extraction of model parameters.
* Building models::        Building stochastic models from features
@end menu

@node Labelling databases, Extracting features, , Building models from databases
@section Labelling databases

@cindex labelling
@cindex database labelling
In order for Festival to use a database it is most useful to build
utterance structures for each utterance in the database.  As discussed
earlier, utterance structures contain relations of items.  Given such
a structure for each utterance in a database we can easily read in the
utterance representation and access it, dumping information in a
normalised way allowing for easy building and testing of models.

Of course the level of labelling that exists, or that you are willing
to do by hand or using some automatic tool, for a particular database
will vary.  For many purposes you will at least need phonetic
labelling.  Hand labelled data is still better than auto-labelled
data, but that could change.  The size and consistency of the data is
important too.

For this discussion we will assume labels for: segments, syllables,
words, phrases, intonation events and pitch targets.  Some of these
can be derived, some need to be labelled.  The process would not fail
with less labelling but of course you wouldn't be able to extract as
much information from the result.

In our databases these labels are in Entropic's Xlabel format, though
it is fairly easy to convert from any reasonable format.

@table @emph
@item Segment
@cindex Segment labelling
These give phoneme labels for files.  Note that these labels
@emph{must} be members of the phoneset that you will be using for this
database.  Often phone label files may contain extra labels
(e.g. beginning and end silence) which are not really part of the
phoneset.  You should remove (or re-label) these phones accordingly.
@item Word
@cindex Word labelling
Again these will need to be provided.  The end of the word should come
at the last phone in the word (or just after).  Pauses/silences should
not be part of the word.
@item Syllable
@cindex Syllable labelling
There is a chance these can be automatically generated from Word and
Segment files given a lexicon.  Ideally these should include lexical
stress.
@item IntEvent
@cindex IntEvent labelling
These should ideally mark accent/boundary tone type for each syllable,
but this almost definitely requires hand-labelling.  Also given that
hand-labelling of accent type is harder and not as accurate, it is
arguable that anything other than accented vs. non-accented can be
used reliably.
@item Phrase
@cindex Phrase labelling
This could just mark the last non-silence phone in each utterance, or
before any silence phones in the whole utterance.
@item Target
@cindex Target labelling
This can be automatically derived from an F0 file and the Segment
files.  A marking of the mean F0 in each voiced phone seems to give
adequate results.
@end table

Once these files are created an utterance file can be automatically
created from the above data.  Note it is pretty easy to get the
streams right but getting the relations between the streams is much
harder.  Firstly, labelling is rarely accurate and small windows of
error must be allowed to ensure things line up properly.  The second
problem is that some label files identify point type information
(IntEvent and Target) while others identify segments (e.g. Segment,
Word etc.).  Relations have to know this in order to get it right.
For example it is not right for all syllables between two IntEvents to
be linked to the IntEvent; only the Syllable the IntEvent is within
should be.

@cindex creating utterances
The script @file{festival/examples/make_utts} is an example Festival
script which automatically builds the utterance files from the above
labelled files.

The script by default assumes a hierarchy in a database directory of
the following form.  Under a directory @file{festival/}, where all
Festival specific database information can be kept, a directory
@file{relations/} contains a subdirectory for each basic relation
(e.g. @file{Segment/}, @file{Syllable/}, etc.), each of which contains
the basic label files for that relation.

The following command will build a set of utterance structures
(including building the relations that link between these basic
relations).
@example
make_utts -phoneset radio festival/relations/Segment/*.Segment
@end example
This will create utterances in @file{festival/utts/}.  There are a
number of options to @file{make_utts}; use @file{-h} to find them.
The @file{-eval} option allows extra Scheme code to be loaded which
may be called by the utterance building process.  The function
@code{make_utts_user_function} will be called on each utterance
created.  Redefining that in database specific loaded code will allow
database specific fixes to the utterance.

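A sketch of such a hook follows; the relabelling it performs is purely
illustrative of the kind of database specific fix you might make:
@lisp
(define (make_utts_user_function utt)
"Database specific fixes, called on each newly built utterance."
  ;; e.g. relabel stray "H#" segment labels as standard silence
  (mapcar
   (lambda (seg)
     (if (string-equal (item.name seg) "H#")
         (item.set_name seg "pau")))
   (utt.relation.items utt 'Segment))
  utt)
@end lisp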
@node Extracting features, Building models, Labelling databases, Building models from databases
@section Extracting features

@cindex extracting features
@cindex feature extraction
The easiest way to extract features from a labelled database of the
form described in the previous section is by loading in each of the
utterance structures and dumping the desired features.

Using the same mechanism to extract the features as will eventually be
used by models built from the features has the important advantage of
avoiding spurious errors easily introduced when collecting data.  For
example a feature such as @code{n.accent} in a Festival utterance will
be defined as 0 when there is no next accent.  Extracting all the
accents and using an external program to calculate the next accent may
make a different decision, so that when the generated model is used a
different value for this feature will be produced.  Such mismatches in
training models and actual use are unfortunately common, so using the
same mechanism to extract data for training and for actual use is
worthwhile.

The recommended method for extracting features is the Festival script
@file{dumpfeats}.  It basically takes a list of feature names and a
list of utterance files and dumps the desired features.

Features may be dumped into a single file or into separate files, one
for each utterance.  Feature names may be specified on the command
line or in a separate file.  Extra code to define new features may be
loaded too.

For example suppose we wanted to save, for all segments in each
utterance, the duration, phone name, and previous and next phone
names.
@example
dumpfeats -feats "(segment_duration name p.name n.name)" \
          -output feats/%s.dur -relation Segment \
          festival/utts/*.utt
@end example
This will save these features in files named for the utterances they
come from, in the directory @file{feats/}.  The argument to
@file{-feats} is treated as a literal list only if it starts with a
left parenthesis; otherwise it is treated as the name of a file
containing the named features (unbracketed).

Extra code (for new feature definitions) may be loaded through the
@file{-eval} option.  If the argument to @file{-eval} starts with a
left parenthesis it is treated as an s-expression rather than a
filename and is evaluated.  If the argument to @file{-output} contains
"%s" it will be filled in with the utterance's filename; if it is a
simple filename the features from all utterances will be saved in that
same file.  The features for each item in the named relation are saved
on a single line.

@node Building models, , Extracting features, Building models from databases
@section Building models

@cindex wagon
This section describes how to build models from data extracted from
databases as described in the previous section.  It uses the CART
building program @file{wagon}, which is available in the speech tools
distribution, but the data is suitable for many other types of model
building techniques, such as linear regression or neural networks.

Wagon is described in the speech tools manual, though we will cover
simple use here.  To use Wagon you need a datafile and a data
description file.

A datafile consists of a number of vectors one per line each containing
|
|
the same number of fields. This, not coincidentally, is exactly the
|
|
format produced by @file{dumpfeats} described in the previous
|
|
section. The data description file describes the fields in the datafile
|
|
and their range. Fields may be of any of the following types: class (a
|
|
list of symbols), floats, or ignored. Wagon will build a classification
|
|
tree if the first field (the predictee) is of type class, or a
|
|
regression tree if the first field is a float. An example
|
|
data description file would be
|
|
@lisp
|
|
(
|
|
( duration float )
|
|
( name # @@ @@@@ a aa ai au b ch d dh e e@@ ei f g h i i@@ ii jh k l m n
|
|
ng o oi oo ou p r s sh t th u u@@ uh uu v w y z zh )
|
|
( n.name # @@ @@@@ a aa ai au b ch d dh e e@@ ei f g h i i@@ ii jh k l m n
|
|
ng o oi oo ou p r s sh t th u u@@ uh uu v w y z zh )
|
|
( p.name # @@ @@@@ a aa ai au b ch d dh e e@@ ei f g h i i@@ ii jh k l m n
|
|
ng o oi oo ou p r s sh t th u u@@ uh uu v w y z zh )
|
|
( R:SylStructure.parent.position_type 0 final initial mid single )
|
|
( pos_in_syl float )
|
|
( syl_initial 0 1 )
|
|
( syl_final 0 1)
|
|
( R:SylStructure.parent.R:Syllable.p.syl_break 0 1 3 )
|
|
( R:SylStructure.parent.syl_break 0 1 3 4 )
|
|
( R:SylStructure.parent.R:Syllable.n.syl_break 0 1 3 4 )
|
|
( R:SylStructure.parent.R:Syllable.p.stress 0 1 )
|
|
( R:SylStructure.parent.stress 0 1 )
|
|
( R:SylStructure.parent.R:Syllable.n.stress 0 1 )
|
|
)
|
|
@end lisp
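
A datafile can be sanity-checked against its description with a sketch like the following. This is a simplified stand-in, not Wagon's own reader, and all names are invented: each vector must have one field per description entry, and each class-typed field must take one of its declared symbols (float fields are not checked numerically here).

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// One entry per field: an empty set means "float", a non-empty set
// lists the legal symbols for a class-typed field.
typedef std::vector< std::set<std::string> > Description;

// Check one datafile line: the right number of fields, with every
// class field's value among its declared symbols.
bool vector_ok(const std::string &line, const Description &desc)
{
    std::istringstream in(line);
    std::string field;
    std::size_t i = 0;
    while (in >> field)
    {
        if (i >= desc.size())
            return false;           // too many fields
        if (!desc[i].empty() && desc[i].count(field) == 0)
            return false;           // illegal class value
        ++i;
    }
    return i == desc.size();        // catches too few fields
}
```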

The script @file{speech_tools/bin/make_wagon_desc} goes some way
to helping. Given a datafile and a file containing the field names, it
will construct an approximation of the description file. This file
should still be edited, as all fields are treated as of type class by
@file{make_wagon_desc}, and you may want to change some of them to
float.

The data file must be a single file, although we created a number of
feature files by the process described in the previous section. From a
list of file ids select, say, 80% of them as training data and cat them
into a single datafile. The remaining 20% may be catted together as
test data.

To build a tree use a command like
@example
wagon -desc DESCFILE -data TRAINFILE -test TESTFILE
@end example
The minimum cluster size (default 50) may be reduced using the
command line option @code{-stop} plus a number.

Varying the features and stop size may improve the results.

Building the models and getting good figures is only one part
of the process. You must integrate this model into Festival
if it is going to be of any use. In the case of CART trees generated
by Wagon, Festival supports these directly. In the case of
CART trees predicting zscores, or factors to modify duration averages,
such trees can be used as is.
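
A Wagon-built CART tree is just a nested set of questions over exactly these features, and Festival walks it in the way this much-simplified sketch suggests. The tiny hand-written tree, its feature names and its predicted values are purely illustrative:

```cpp
#include <cassert>
#include <map>
#include <string>

// A toy tree written directly as code: ask questions about features
// until a leaf yields a predicted duration (in seconds).  Real Wagon
// trees are data read in at run time, but the walk is the same idea.
float predict_dur(const std::map<std::string, std::string> &feats)
{
    if (feats.at("name") == "#")          // silence gets a long leaf
        return 0.250f;
    if (feats.at("R:SylStructure.parent.stress") == "1")
        return 0.090f;                    // stressed-syllable leaf
    return 0.060f;                        // default leaf
}
```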

Note there are other options to Wagon which may help build
better CART models. Consult the chapter in the speech tools manual
on Wagon for more information.

Other parts of the distributed system use CART trees and linear
regression models that were trained using the processes described in
this chapter. Some other parts of the distributed system use CART trees
which were written by hand and may be improved by properly applying
these processes.

@node Programming, API, Building models from databases, Top
@chapter Programming

@cindex programming
This chapter covers aspects of programming within the Festival
environment, creating new modules, and modifying existing ones.
It describes the basic classes available and gives some particular
examples of things you may wish to add.

@menu
* The source code:: A walkthrough of the source code
* Writing a new module:: Example access of an utterance
@end menu

@node The source code, Writing a new module, , Programming
@section The source code

@cindex source
The ultimate authority on what happens in the system lies
in the source code itself. No matter how hard we try, and how
automatic we make it, the source code will always be ahead
of the documentation. Thus if you are going to be using Festival in
a serious way, familiarity with the source is essential.

The lowest level functions are catered for in the Edinburgh
Speech Tools, a separate library distributed with Festival. The
Edinburgh Speech Tools Library offers the basic utterance structure,
waveform file access, and various other useful low-level functions
which we share between different speech systems in our work.
@xref{Top, , Overview, speechtools,
Edinburgh Speech Tools Library Manual}.

@cindex directory structure
The directory structure for the Festival distribution reflects
the conceptual split in the code.

@table @file
@item ./bin/
The user-level executable binaries and scripts that are part of the
festival system. These are simple symbolic links to the binaries
or, if the system is compiled with shared libraries, small wrap-around
shell scripts that set @code{LD_LIBRARY_PATH} appropriately.
@item ./doc/
This contains the texinfo documentation for the whole system. The
@file{Makefile} constructs the info and/or html version as desired.
Note that the @code{festival} binary itself is used to generate the lists
of functions and variables used within the system, so must be compiled
and in place to generate a new version of the documentation.
@item ./examples/
This contains various examples. Some are explained within this manual,
others are there just as examples.
@item ./lib/
The basic Scheme parts of the system, including @file{init.scm}, the
first file loaded by @code{festival} at start-up time. Depending on
your installation, this directory may also contain subdirectories
containing lexicons, voices and databases. This directory and its
sub-directories are used by Festival at run-time.
@item ./lib/etc/
Executables for Festival's internal use. A subdirectory containing
at least the audio spooler will be automatically created (one for each
different architecture the system is compiled on). Scripts are
added to this top level directory itself.
@item ./lib/voices/
By default this contains the voices used by Festival, including their
basic Scheme set up functions as well as the diphone databases.
@item ./lib/dicts/
This contains various lexicon files distributed as part of the
system.
@item ./config/
This contains the basic @file{Makefile} configuration files for
compiling the system (run-time configuration is handled by Scheme
in the @file{lib/} directory). The file @file{config/config}, created
as a copy of the standard @file{config/config-dist}, is the installation
specific configuration. In most cases a simple copy of the
distribution file will be sufficient.
@item ./src/
The main C++/C source for the system.
@item ./src/lib/
Where @file{libFestival.a} is built.
@item ./src/include/
Where include files shared between various parts of the system
live. The file @file{festival.h} provides access to most of
the parts of the system.
@item ./src/main/
Contains the top level C++ files for the actual executables.
This is the directory where the executable binary @file{festival}
is created.
@item ./src/arch/
The main core of the Festival system. At present everything is held in
a single sub-directory @file{./src/arch/festival/}. This contains the
basic core of the synthesis system itself. This directory contains lisp
front ends to access the core utterance architecture and phonesets,
basic tools like client/server support, ngram support, etc., and an
audio spooler.
@item ./src/modules/
In contrast to the @file{arch/} directory this contains the non-core
parts of the system. A set of basic example modules are included with
the standard distribution. These are the parts that do the synthesis;
the other parts are just there to make module writing easier.
@item ./src/modules/base/
This contains some basic simple modules that weren't quite big enough
to deserve their own directory. Most importantly it includes the
@code{Initialize} module called by many synthesis methods which sets
up an utterance structure and loads in initial values. This directory
also contains phrasing, part of speech, and word (syllable and phone
construction from words) modules.
@item ./src/modules/Lexicon/
This is not really a module in the true sense (the @code{Word} module
is the main user of this). This contains functions to construct, compile,
and access lexicons (entries of words, part of speech and
pronunciations). This also contains a letter-to-sound rule system.
@item ./src/modules/Intonation/
This contains various intonation systems, from the very simple
to quite complex parameter driven intonation systems.
@item ./src/modules/Duration/
This contains various duration prediction systems, from the very simple
(fixed duration) to quite complex parameter driven duration systems.
@item ./src/modules/UniSyn/
A basic diphone synthesizer system, supporting a simple database format
(which can be grouped into a more efficient binary representation). It
is multi-lingual, and allows multiple databases to be loaded at once.
It offers a choice of concatenation methods for diphones: residual
excited LPC or PSOLA (TM) (which is not distributed).
@item ./src/modules/Text/
Various text analysis functions, particularly the tokenizer and
utterance segmenter (from arbitrary files). This directory
also contains the support for text modes and SGML.
@item ./src/modules/donovan/
An LPC based diphone synthesizer. Very small and neat.
@item ./src/modules/rxp/
The Festival/Scheme front end to rxp, an XML parser written by
Richard Tobin from University of Edinburgh's Language Technology
Group. rxp is now part of the speech tools rather than just Festival.
@item ./src/modules/parser
A simple interface to the Stochastic Context Free Grammar parser in
the speech tools library.
@item ./src/modules/diphone
An optional module containing the previously used diphone synthesizer.
@item ./src/modules/clunits
A partial implementation of a cluster unit selection algorithm
as described in @cite{black97c}.
@item ./src/modules/rjc_synthesis
This consists of a new set of modules for doing waveform synthesis. They
are intended to be unit size independent (e.g. diphone, phone, non-uniform
unit). Also selection, prosodic modification, joining and signal
processing are separately defined. Unfortunately this code has
not really been exercised enough to be considered stable enough to be used
in the default synthesis method, but those working on new synthesis
techniques may be interested in integrating with these new modules.
They may be updated before the next full release of Festival.
@item ./src/modules/*
Other optional directories may be contained here, containing
various research modules not yet part of the standard distribution.
See below for descriptions of how to add modules to the basic
system.
@end table

One intended use of Festival is to offer a software system where
new modules may be easily tested in a stable environment. We
have tried to make the addition of new modules easy, without requiring
complex modifications to the rest of the system.

All of the basic modules should really be considered merely as example
modules. Without much effort all of them could be improved.

@node Writing a new module, , The source code, Programming
@section Writing a new module

This section gives a simple example of writing a new module, showing
the basic steps that must be done to create and add a new module that is
available for the rest of the system to use. Note that many things can
now be done solely in Scheme, and really only low-level, very intensive
things (like waveform synthesizers) need be coded in C++.

@subsection Example 1: adding new modules

@cindex new modules
@cindex adding new modules
The example here is a duration module which sets durations of phones from
a given list of averages. To make this example more interesting, all
durations in accented syllables are increased by 1.5. Note that this is
just an example for the sake of one; this (and much better techniques)
could easily be done within the system as it is at present using a
hand-crafted CART tree.

Our new module, called @code{Duration_Simple}, can most easily
be added to the @file{./src/Duration/} directory in a file
@file{simdur.cc}. You can worry about the copyright notice, but
after that you'll probably need the following includes
@lisp
#include <festival.h>
@end lisp
The module itself must be declared in a fixed form. That is,
it receives a single LISP form (an utterance) as an argument
and returns that LISP form at the end. Thus our definition
will start
@lisp
LISP FT_Duration_Simple(LISP utt)
@{
@end lisp
Next we need to declare an utterance structure and extract it
from the LISP form. We also make a few other variable declarations
@lisp
    EST_Utterance *u = get_c_utt(utt);
    EST_Item *s;
    float end=0.0, dur;
    LISP ph_avgs,ldur;
@end lisp

@cindex accessing Lisp variables
We cannot list the average durations for each phone in the source
code as we cannot tell which phoneset we are using (or what
modifications we want to make to durations between speakers). Therefore
the phone and average duration information is held in a Scheme variable
for easy setting at run time. To use the information in our C++
domain we must get that value from the Scheme domain. This is
done with the following statement.
@lisp
    ph_avgs = siod_get_lval("phoneme_averages","no phoneme durations");
@end lisp
The first argument to @code{siod_get_lval} is the Scheme name of
a variable which has been set to an assoc list of phone and average
duration before this module is called. See the
variable @code{phone_durations} in @file{lib/mrpa_durs.scm} for
the format. The second argument to @code{siod_get_lval} is an
error message to be printed if the variable @code{phoneme_averages}
is not set. If the second argument to @code{siod_get_lval} is
@code{NULL} then no error is given, and if the variable is unset
this function simply returns the Scheme value @code{nil}.

Now that we have the duration data we can go through each segment
in the utterance and add the duration. The loop looks like
@lisp
    for (s=u->relation("Segment")->head(); s != 0; s = next(s))
    @{
@end lisp
We can look up the average duration of the current segment name
using the function @code{siod_assoc_str}. As arguments, it
takes the segment name @code{s->name()} and the assoc list of
phones and durations.
@lisp
        ldur = siod_assoc_str(s->name(),ph_avgs);
@end lisp
Note the return value is actually a LISP pair (phone name and duration),
or @code{nil} if the phone isn't in the list. Here we check if
the segment is in the list. If it is not we print an error and set
the duration to 100 ms; if it is in the list the floating point number
is extracted from the LISP pair.
@lisp
        if (ldur == NIL)
        @{
            cerr << "Phoneme: " << s->name() << " no duration "
                 << endl;
            dur = 0.100;
        @}
        else
            dur = get_c_float(car(cdr(ldur)));
@end lisp

If this phone is in an accented syllable we wish to increase its
duration by a factor of 1.5. To find out if it is accented
we use the feature system to find the syllable this phone is
part of and find out if that syllable is accented.
@lisp
        if (ffeature(s,"R:SylStructure.parent.accented") == 1)
            dur *= 1.5;
@end lisp
Now that we have the desired duration we increment @code{end}
by our predicted duration for this segment and set
the end of the current segment.
@lisp
        end += dur;
        s->fset("end",end);
    @}
@end lisp
Finally we return the utterance from the function.
@lisp
    return utt;
@}
@end lisp
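
The complete logic of the module above can be mirrored in a standalone sketch, using plain STL containers in place of Festival's utterance and LISP types (all names here are invented for the illustration):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Seg { std::string name; bool accented; float end; };

// Same shape as Duration_Simple: look up each phone's average
// duration, default to 100ms for unknown phones, lengthen accented
// ones by 1.5, and accumulate running end times.
void simple_durations(std::vector<Seg> &segs,
                      const std::map<std::string, float> &avgs)
{
    float end = 0.0f;
    for (std::size_t i = 0; i < segs.size(); ++i)
    {
        std::map<std::string, float>::const_iterator d =
            avgs.find(segs[i].name);
        float dur = (d == avgs.end()) ? 0.100f : d->second;
        if (segs[i].accented)
            dur *= 1.5f;
        end += dur;
        segs[i].end = end;
    }
}
```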

Once a module is defined it must be declared to the system so it may be
called. To do this one must call the function
@code{festival_def_utt_module}, which takes a LISP name, the C++ function
name and a documentation string describing what the module does. This
will automatically be available at run-time and added to the manual.
The call to this function should be added to the initialization function
in the directory you are adding the module to. The function is called
@code{festival_DIRNAME_init()}. If one doesn't exist you'll need to
create it.
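
Conceptually, @code{festival_def_utt_module} fills in a table from LISP names to C++ functions together with their documentation strings. The following standalone sketch illustrates that idea only and is not Festival's actual registration code:

```cpp
#include <cassert>
#include <map>
#include <string>

typedef int Utt;                    // stand-in for the utterance type
typedef Utt (*UttModule)(Utt);

struct ModuleEntry { UttModule fn; std::string doc; };

// The registry the registration call conceptually fills in: a
// name->function table consulted when Scheme invokes the module.
std::map<std::string, ModuleEntry> &registry()
{
    static std::map<std::string, ModuleEntry> r;
    return r;
}

void def_utt_module(const std::string &name, UttModule fn,
                    const std::string &doc)
{
    ModuleEntry e; e.fn = fn; e.doc = doc;
    registry()[name] = e;
}

Utt call_module(const std::string &name, Utt u)
{
    std::map<std::string, ModuleEntry>::const_iterator it =
        registry().find(name);
    if (it == registry().end())
        return u;                   // unknown module: pass through
    return it->second.fn(u);
}

// An example "module" for the sketch.
Utt bump(Utt u) { return u + 1; }
```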

In @file{./src/Duration/} the function @code{festival_Duration_init()}
is at the end of the file @file{dur_aux.cc}. Thus we can add our
new module's declaration at the end of that function. But first
we must declare the C++ function in that file. Thus above
that function we would add
@lisp
LISP FT_Duration_Simple(LISP args);
@end lisp
While at the end of the function @code{festival_Duration_init()}
we would add
@lisp
    festival_def_utt_module("Duration_Simple",FT_Duration_Simple,
    "(Duration_Simple UTT)\n\
  Label all segments with average duration ... ");
@end lisp

In order for our new file to be compiled we must add it
to the @file{Makefile} in that directory, to the @code{SRCS} variable.
Then when we type @code{make} in @file{./src/} our new module
will be properly linked in and available for use.

Of course we are not quite finished. We still have to say when our
new duration module should be called. When we set
@lisp
(Parameter.set 'Duration_Method Duration_Simple)
@end lisp
for a voice, calls to the function
@code{utt.synth} will use our new duration module.

Note that in earlier versions of Festival it was necessary to modify
the duration calling function in @file{lib/duration.scm}, but
that is no longer necessary.

@subsection Example 2: accessing the utterance

@cindex Relations
In this example we will make more direct use of the utterance structure,
showing the gory details of following relations in an utterance. This
time we will create a module that will name all syllables with a
concatenation of the names of the segments they are related to.

As before we need the same standard includes
@lisp
#include "festival.h"
@end lisp

Now the definition of the function
@lisp
LISP FT_Name_Syls(LISP utt)
@{
@end lisp
As with the previous example we are called with an utterance LISP object
and will return the same. The first task is to extract the
utterance object from the LISP object.
@lisp
    EST_Utterance *u = get_c_utt(utt);
    EST_Item *syl,*seg;
@end lisp
Now for each syllable in the utterance we want to find which segments
are related to it.
@lisp
    for (syl=u->relation("Syllable")->head(); syl != 0; syl = next(syl))
    @{
@end lisp
Here we declare a variable to accumulate the names of the segments.
@lisp
        EST_String sylname = "";
@end lisp
Now we iterate through the @code{SylStructure} daughters of the
syllable. These will be the segments in that syllable.
@lisp
        for (seg=daughter1(syl,"SylStructure"); seg; seg=next(seg))
            sylname += seg->name();
@end lisp
We then set the syllable's name to the concatenated name, and
loop to the next syllable.
@lisp
        syl->set_name(sylname);
    @}
@end lisp
Finally we return the LISP form of the utterance.
@lisp
    return utt;
@}
@end lisp
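
Stripped of the utterance machinery, the traversal above reduces to the following standalone sketch, with plain STL containers standing in for the Syllable and SylStructure relations (names invented for the illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Syl { std::string name; std::vector<std::string> segs; };

// Name every syllable with the concatenation of its daughter segment
// names, as FT_Name_Syls does over the SylStructure relation.
void name_syls(std::vector<Syl> &syls)
{
    for (std::size_t i = 0; i < syls.size(); ++i)
    {
        std::string sylname = "";
        for (std::size_t j = 0; j < syls[i].segs.size(); ++j)
            sylname += syls[i].segs[j];
        syls[i].name = sylname;
    }
}
```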

@subsection Example 3: adding new directories

@cindex adding new directories
In this example we will add a whole new subsystem. This will often be a
common way for people to use Festival. For example let us assume we
wish to add a formant waveform synthesizer (e.g. like that in the free
@file{rsynth} program). In this case we will add a whole new
sub-directory to the modules directory. Let us call it @file{rsynth/}.

In the directory we need a @file{Makefile} of the standard form, so we
should copy one from one of the other directories,
e.g. @file{Intonation/}. Standard methods are used to identify the
source code files in a @file{Makefile} so that the @file{.o} files are
properly added to the library. Following the other examples will ensure
your code is integrated properly.

We'll just skip over the bit where you extract the information
from the utterance structure and synthesize the waveform
(see @file{donovan/donovan.cc} or @file{diphone/diphone.cc}
for examples).

To get Festival to use your new module you must tell it to compile the
directory's contents. This is done in @file{festival/config/config}.
Add the line
@lisp
ALSO_INCLUDE += rsynth
@end lisp
to the end of that file (there are similar ones mentioned there). Simply
adding the name of the directory here will add it as a new module
and the directory will be compiled.

What you must provide in your code is a function
@code{festival_DIRNAME_init()} which will be called at initialization
time. In this function you should call any further initialization
required and define any new Lisp functions you wish to make available
to the rest of the system. For example in the @file{rsynth}
case we would define in some file in @file{rsynth/}
@lisp
#include "festival.h"

static LISP utt_rsynth(LISP utt)
@{
    EST_Utterance *u = get_c_utt(utt);
    // Do formant synthesis
    return utt;
@}

void festival_rsynth_init()
@{
    proclaim_module("rsynth");

    festival_def_utt_module("Rsynth_Synth",utt_rsynth,
    "(Rsynth_Synth UTT)\n\
  A simple formant synthesizer");

    ...
@}
@end lisp
Integration of the code in optional (and standard) directories is done
by automatically creating @file{src/modules/init_modules.cc} from the
list of standard directories plus those defined as
@code{ALSO_INCLUDE}. A call to a function called
@code{festival_DIRNAME_init()} will be made for each directory.

This mechanism is specifically designed so you can add modules to the
system without changing anything in the standard distribution.

@subsection Example 4: adding new LISP objects

@cindex adding new LISP objects
This example shows you how to add a new object to Scheme
and add wraparounds to allow manipulation within the Scheme
(and C++) domain.

Like example 3, we are assuming this is done in a new directory.
Suppose you have a new object called @code{Widget} that can
transduce a string into some other string (with some optional
continuous parameter). Thus, here we create a new file @file{widget.cc}
like this

@lisp
#include "festival.h"
#include "widget.h"  // definitions for the widget class
@end lisp
In order to register the widgets as Lisp objects we actually
need to register them as @code{EST_Val}s as well. Thus we now need
@lisp
VAL_REGISTER_CLASS(widget,Widget)
SIOD_REGISTER_CLASS(widget,Widget)
@end lisp
The first name given to these macros should be a short mnemonic name
for the object, which will be used in defining a set
of access and construction functions. It of course must be unique
within the whole system. The second name is the name of the object
class itself.

To understand its usage we can add a few simple widget manipulation
functions
@lisp
LISP widget_load(LISP filename)
@{
    EST_String fname = get_c_string(filename);
    Widget *w = new Widget;  // build a new widget

    if (w->load(fname) == 0) // successful load
        return siod(w);
    else
    @{
        cerr << "widget load: failed to load \"" << fname << "\"" << endl;
        festival_error();
    @}
    return NIL;  // for compilers that get confused
@}
@end lisp
Note that the function @code{siod} constructs a LISP object from
a @code{widget}; the class register macro defines that for you.
Also note that when you give an object to a @code{LISP} object, the
@code{LISP} object then owns it and is responsible for deleting it when
garbage collection occurs on that @code{LISP} object. Care should be
taken that you don't put the same object within different @code{LISP}
objects. The macro @code{VAL_REGISTER_CLASS_NODEL} should be
used instead if you do not want your given object to be deleted by the
LISP system (though this may cause leaks).

If you want to refer to these functions in other files within your
module you can use
@lisp
VAL_REGISTER_CLASS_DCLS(widget,Widget)
SIOD_REGISTER_CLASS_DCLS(widget,Widget)
@end lisp
in a common @file{.h} file.
The following defines a function that takes a LISP object containing
a widget, applies some method and returns a string.
@lisp
LISP widget_apply(LISP lwidget, LISP string, LISP param)
@{
    Widget *w = widget(lwidget);
    EST_String s = get_c_string(string);
    float p = get_c_float(param);
    EST_String answer;

    answer = w->apply(s,p);

    return strintern(answer);
@}
@end lisp
The function @code{widget}, defined by the registration macros, takes
a @code{LISP} object and returns a pointer to the @code{widget} inside
it. If the @code{LISP} object does not contain a @code{widget} an
error will be thrown.

Finally you will wish to add these functions to the Lisp
system
@lisp
void festival_widget_init()
@{
    init_subr_1("widget.load",widget_load,
    "(widget.load FILENAME)\n\
  Load in widget from FILENAME.");
    init_subr_3("widget.apply",widget_apply,
    "(widget.apply WIDGET INPUT VAL)\n\
  Returns widget applied to string INPUT with float VAL.");
@}
@end lisp

In your @file{Makefile} for this directory you'll need to add
the include directory where @file{widget.h} is, if it is not
contained within the directory itself. This is done through
the make variable @code{LOCAL_INCLUDES} as
@lisp
LOCAL_INCLUDES = -I/usr/local/widget/include
@end lisp
And for the linker you'll need to identify where your widget library
is. In your @file{festival/config/config} file at the end add
@lisp
COMPILERLIBS += -L/usr/local/widget/lib -lwidget
@end lisp

@node API, Examples, Programming, Top
@chapter API

If you wish to use Festival within some other application there are
a number of possible interfaces.

@menu
* Scheme API:: Programming in Scheme
* Shell API:: From Unix shell
* Server/client API:: Festival as a speech synthesis server
* C/C++ API:: Through function calls from C++
* C only API:: Small independent C client access
* Java and JSAPI:: Synthesizing from Java
@end menu

@node Scheme API, Shell API, , API
@section Scheme API

@cindex Scheme programming
Festival includes a full programming language, Scheme (a variant of
Lisp), as a powerful interface to its speech synthesis functions.
Often this will be the easiest method of controlling Festival's
functionality. Even when using other APIs they will ultimately
depend on the Scheme interpreter.

Scheme commands (as s-expressions) may simply be written in files and
interpreted by Festival, either by specification as arguments on
the command line, in the interactive interpreter, or through standard
input as a pipe. Suppose we have a file @file{hello.scm} containing

@lisp
;; A short example file with Festival Scheme commands
(voice_rab_diphone)  ;; select Gordon
(SayText "Hello there")
(voice_don_diphone)  ;; select Donovan
(SayText "and hello from me")
@end lisp

From the command interpreter we can execute the commands in this file
by loading them
@lisp
festival> (load "hello.scm")
nil
@end lisp
Or we can execute the commands in the file directly from the
shell command line
@lisp
unix$ festival -b hello.scm
@end lisp
The @samp{-b} option denotes batch operation, meaning the file is loaded
and then Festival will exit, without starting the command interpreter.
Without the @samp{-b} option Festival will load
@file{hello.scm} and then accept commands on standard input. This can
be convenient when some initial set up is required for a session.

Note that one disadvantage of the batch method is that time is required
for Festival's initialisation every time it starts up. Although this will
typically only be a few seconds, for saying short individual expressions
that lead-in time may be unacceptable. Thus simply executing the
commands within an already running system, or using the server/client
mode, is more desirable.

Of course it's not just about strings of commands; because Scheme is a
fully functional language, functions, loops, variables, file access,
and arithmetic operations may all be carried out in your Scheme programs.
Also, access to Unix is available through the @code{system}
function. For many applications directly programming them in Scheme is
both the easiest and the most efficient method.

@cindex scripts
A number of example Festival scripts are included in @file{examples/},
including a program for saying the time, and one for telling you the latest
news (by accessing a page from the web). Also see the
detailed discussion of a script example (@pxref{POS Example}).

@node Shell API, Server/client API, Scheme API, API
@section Shell API

@cindex shell programming
The simplest use of Festival (though not the most powerful) is
simply using it to directly render text files as speech. Suppose
we have a file @file{hello.txt} containing
@lisp
Hello world. Isn't it excellent weather
this morning.
@end lisp
We can simply call Festival as
@lisp
unix$ festival --tts hello.txt
@end lisp
Or for even simpler one-off phrases
@lisp
unix$ echo "hello " | festival --tts
@end lisp

This is easy to use, but you will need to wait for Festival to start up
and initialise its databases before it starts to render the text as
speech. This may take several seconds on some machines. A socket-based
server mechanism is provided in Festival which will allow a single
server process to start up once and be used efficiently by multiple
client programs.

Note also the use of Sable for marked-up text, @pxref{XML/SGML mark-up}.
Sable allows various forms of additional information in text, such as
phrasing, emphasis, pronunciation, as well as changing voices, and
inclusion of external waveform files (i.e. random noises). For many
applications this will be the preferred interface method. Other text
modes too are available through the command line by using
@code{auto-text-mode-alist}.

@node Server/client API, C/C++ API, Shell API, API
@section Server/client API

@cindex server mode
Festival offers a BSD socket-based interface. This allows
Festival to run as a server and allows client programs to access
it. Basically the server offers a new command interpreter for
each client that attaches to it. The server is forked for each
client, but this is much faster than having to wait for a
Festival process to start from scratch. Also the server can
run on a bigger machine, offering much faster synthesis.

@emph{Note: the Festival server is inherently insecure and may allow
arbitrary users access to your machine.}

Every effort has been made to minimise the risk of unauthorised access
through Festival, and a number of levels of security are provided.
However, as with any program offering socket access, like @code{httpd},
@code{sendmail} or @code{ftpd}, there is a risk that unauthorised access
is possible. I trust Festival's security enough to often run it on my
own machine and departmental servers, restricting access to within our
department. Please read the information below before using
the Festival server so you understand the risks.
|
|
@subsection Server access control

@cindex security
@cindex server security
The following access control is available for Festival when
running as a server. When the server starts it will usually
begin by loading in various commands specific to the task
it is to be used for. The following variables are used
to control access.
@table @code
@item server_port
A number identifying the inet socket port. By default this
is 1314. It may be changed as required.
@item server_log_file
If nil, no logging takes place; if t, logging is printed to standard out;
and if a file name, log messages are appended to that file. All
connections and attempted connections are logged with a time stamp
and the name of the client. All commands sent from the client
are also logged (output and data input is not logged).
@item server_deny_list
If non-nil it is used to identify which machines are not allowed
access to the server. This is a list of regular expressions.
If the host name of the client matches any of the regexs in this
list the client is denied access. This overrides all other
access methods. Remember that sometimes hosts are identified as
numbers not as names.
@item server_access_list
If this is non-nil only machines whose names match at least one of the
regexs in this list may connect as clients. Remember that sometimes
hosts are identified as numbers not as names, so you should probably
exclude the IP number of a machine as well as its name to be properly
secure.
@item server_passwd
If this is non-nil, the client must send this passwd to the server
followed by a newline before access is given. This is required
even if the machine is included in the access list. This is designed
so servers for specific tasks may be set up with reasonable security.
@item (set_server_safe_functions FUNCNAMELIST)
If called, this restricts which functions the client may call. This
is the most restrictive form of access, and is thoroughly recommended. In
this mode it would be normal to include only the specific functions the
client can execute (i.e. the function to set up output, and a tts
function). For example a server could call the following at
set-up time, thus restricting calls to only those that
@file{festival_client} @code{--ttw} uses.
@lisp
(set_server_safe_functions
   '(tts_return_to_client tts_text tts_textall Parameter.set))
@end lisp

@end table
It is strongly recommended that you run Festival in server mode as userid
@code{nobody} to limit the access the process will have; running it
in a chroot environment is more secure still.

For example, suppose we wish to allow access to all machines in the CSTR
domain except for @code{holmes.cstr.ed.ac.uk} and
@code{adam.cstr.ed.ac.uk}. This may be done by adding the following two
commands to a file, e.g. @code{server.scm}
@lisp
(set! server_deny_list '("holmes\\.cstr\\.ed\\.ac\\.uk"
                         "adam\\.cstr\\.ed\\.ac\\.uk"))
(set! server_access_list '("[^\\.]*\\.cstr\\.ed\\.ac\\.uk"))
@end lisp
and then running the command
@lisp
festival PATH_TO/server.scm --server
@end lisp

This is not complete though, as when DNS is not working @code{holmes} and
@code{adam} will still be able to access the server (but if our DNS
isn't working we probably have more serious problems). However the
above is secure in that only machines in the domain @code{cstr.ed.ac.uk}
can access the server, though there may be ways to make machines
identify themselves as being in that domain even when they are not.

By default Festival in server mode will only accept client connections
from @code{localhost}.

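The combined semantics of @code{server_deny_list} and
@code{server_access_list} described above can be sketched with ordinary
regular expressions. This is an illustration only, not Festival's
implementation: the @code{host_allowed} helper and the full-match
anchoring are assumptions made for the sketch (note that Scheme's
@code{"\\."} denotes the regex @code{\.}).

```python
import re

def host_allowed(host, deny_list, access_list):
    """Sketch of the manual's semantics: the deny list overrides
    everything else; if an access list is given, the host must match
    at least one of its regexes.  Full-match anchoring is an
    assumption for this sketch, not a claim about Festival's matcher."""
    if any(re.fullmatch(rx, host) for rx in deny_list):
        return False
    if access_list:
        return any(re.fullmatch(rx, host) for rx in access_list)
    return True

# The example lists from above, written as Python regex strings.
deny = [r"holmes\.cstr\.ed\.ac\.uk", r"adam\.cstr\.ed\.ac\.uk"]
allow = [r"[^.]*\.cstr\.ed\.ac\.uk"]

print(host_allowed("wren.cstr.ed.ac.uk", deny, allow))    # True
print(host_allowed("holmes.cstr.ed.ac.uk", deny, allow))  # False: denied
print(host_allowed("evil.example.com", deny, allow))      # False: not in access list
```

Note how the access list denies anything that fails to match, which is
why the manual warns about hosts that are identified by IP number
rather than by name.
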
@subsection Client control

@cindex festival_client
@cindex client
An example client program called @file{festival_client} is
included with the system that provides a wide range of access methods
to the server. A number of options for the client are offered.

@table @code
@item --server
The name (or IP number) of the server host. By default this
is @file{localhost} (i.e. the same machine you run the client on).
@item --port
The port number the Festival server is running on. By default this
is 1314.
@item --output FILENAME
If a waveform is to be synchronously returned, it will be saved in
@var{FILENAME}. The @code{--ttw} option uses this, as does the
use of the Festival command @code{utt.send.wave.client}. If
an output waveform file is received by @file{festival_client}
and no output file has been given, the waveform is discarded with
an error message.
@item --passwd PASSWD
If a passwd is required by the server, this should be stated
on the client call. @var{PASSWD} is sent plus a newline
before any other communication takes place. If this isn't
specified and a passwd is required, you must enter it first.
If the @code{--ttw} option is used, a passwd is required and
none is specified, access will be denied.
@item --prolog FILE
@var{FILE} is assumed to contain Festival commands and its contents
are sent to the server after the passwd but before anything else. This
is convenient to use in conjunction with @code{--ttw}, which otherwise
does not offer any way to send commands as well as the text to the
server.
@item --otype OUTPUTTYPE
If an output waveform file is to be used, this specifies the output type
of the file. The default is @code{nist}, but @code{ulaw},
@code{riff} and others supported by the Edinburgh
Speech Tools Library are valid. You may use raw too, but note that
Festival may return waveforms of various sampling rates depending on the
sample rates of the databases it is using. You can of course make
Festival return only one particular sample rate, by using
@code{after_synth_hooks}. Note that byte order will be that of the
@emph{client} machine if the output format allows it.
@item --ttw
Text to wave is an attempt to make @code{festival_client} useful
in many simple applications. Although you can connect to the server
and send arbitrary Festival Scheme commands, this option automatically
does what you probably want most often. When specified,
this option takes text from the specified file (or stdin),
synthesizes it (in one go) and saves it in the specified output
file. It basically does the following
@lisp
(Parameter.set 'Wavefiletype '<output type>)
(tts_textall "
<file/stdin contents>
")
@end lisp
Note that this is best used for small, single-utterance texts, as you
have to wait for the whole text to be synthesized before it is returned.
@item --aucommand COMMAND
Execute @var{COMMAND} on each waveform returned by the server. The
variable @code{FILE} will be set when @var{COMMAND} is executed.
@item --async
So that the delay between the text being sent and the first sound
being available to play is reduced, this option in conjunction with
@code{--ttw} causes the text to be synthesized utterance by utterance
and sent back as separate waveforms. Using @code{--aucommand}, each
waveform may be played locally, and when @file{festival_client} is
interrupted the sound will stop. Getting the client to connect to an
audio server elsewhere means the sound will not necessarily stop when
the @file{festival_client} process is stopped.
@item --withlisp
For each command sent to Festival a Lisp return value is
sent back; Lisp expressions may also be sent from the server to the
client through the command @code{send_client}. If this option
is specified the Lisp expressions are printed to standard out,
otherwise this information is discarded.
@end table

A typical example use of @file{festival_client} is
@example
festival_client --async --ttw --aucommand 'na_play $FILE' fred.txt
@end example
This will use @file{na_play} to play each waveform generated for the
utterances in @file{fred.txt}. Note the @emph{single} quotes, so that
the @code{$} in @code{$FILE} isn't expanded locally.

Note the server must be running before you can talk to it. At present
Festival is not set up for automatic invocation through @file{inetd}
and @file{/etc/services}. If you do that yourself, note
that it is a different type of interface, as @file{inetd} assumes all
communication goes through standard in/out.

Also note that each connection to the server starts a new session.
Variables are not persistent over multiple calls to the server, so if any
initialization is required (e.g. loading of voices) it must be done
each time the client starts or, more reasonably, in the server
when it is started.

@cindex perl
A Perl Festival client is also available in
@file{festival/examples/festival_client.pl}.

@subsection Server/client protocol

@cindex client/server protocol
@cindex server/client protocol
The client talks to the server using s-expressions (Lisp). The server
will reply with a number of different chunks until either OK is
returned or ER (on error). The communication is synchronous: each
client request can generate a number of waveform (WV) replies and/or
Lisp replies (LP) and will be terminated with an OK (or ER). Lisp is
used as it has its own inherent syntax that Festival can already
parse.

The following pseudo-code will help define the protocol
as well as show typical use
@lisp
fprintf(serverfd,"%s\n",s-expression);
do
    ack = read three character acknowledgement
    if (ack == "WV\n")
        read a waveform
    else if (ack == "LP\n")
        read an s-expression
    else if (ack == "ER\n")
        an error occurred, break;
while ack != "OK\n"
@end lisp
The server can send a waveform in an utterance to the client through the
function @code{utt.send.wave.client}. The server can send a Lisp
expression to the client through the function @emph{TO BE DONE}.
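
The acknowledgement loop above can be sketched as a small Python
routine run against an in-memory stream. This illustrates the ack
protocol only: the @code{read_replies} helper is hypothetical, and the
payload framing used here (one newline-terminated line per WV or LP
reply) is a toy simplification rather than Festival's real wire
format; see @file{festival/examples/festival_client.c} for how real
waveform and Lisp payloads are delimited.

```python
import io

def read_replies(stream):
    """Follow the manual's pseudo-code: read 3-character
    acknowledgements until OK (success) or ER (error), collecting
    each WV (waveform) or LP (Lisp) payload on the way.  The payload
    framing is a toy simplification for this sketch only."""
    replies = []
    while True:
        ack = stream.read(3).decode("ascii")
        if ack == "OK\n":
            return replies, True
        if ack == "ER\n":
            return replies, False
        # WV or LP: read the (toy, newline-terminated) payload.
        payload = stream.readline().rstrip(b"\n")
        replies.append((ack.strip(), payload))

# A fake session: one Lisp reply, one waveform chunk, then OK.
session = io.BytesIO(b"LP\nnil\nWV\n<wave bytes>\nOK\n")
replies, ok = read_replies(session)
print(ok)       # True
print(replies)  # [('LP', b'nil'), ('WV', b'<wave bytes>')]
```

A real client would replace @code{io.BytesIO} with a socket file
object and hand WV payloads to an audio player as they arrive.
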

@node C/C++ API, C only API, Server/client API, API
@section C/C++ API

As well as offering an interface through Scheme and the shell, some
users may also wish to embed Festival within their own C++ programs.
A number of simple-to-use high-level functions are available for such
uses.

In order to use Festival you must include
@file{festival/src/include/festival.h}, which in turn will include the
other necessary include files in @file{festival/src/include} and
@file{speech_tools/include}; you should ensure these are included in the
include path for your program. Also you will need to link your
program with @file{festival/src/lib/libFestival.a},
@file{speech_tools/lib/libestools.a},
@file{speech_tools/lib/libestbase.a} and
@file{speech_tools/lib/libeststring.a}, as well as any other optional
libraries such as net audio.

The main external functions available for C++ users of Festival
are:

@table @code
@item void festival_initialize(int load_init_files,int heapsize);
This must be called before any other festival functions may be called.
It sets up the synthesizer system. The first argument, if true,
causes the system set-up files to be loaded (which is normally
what is necessary); the second argument is the initial size of the
Scheme heap, which should normally be 210000 unless you envisage
processing very large Lisp structures.
@item int festival_say_file(const EST_String &filename);
Say the contents of the given file. Returns @code{TRUE} or @code{FALSE}
depending on whether this was successful.
@item int festival_say_text(const EST_String &text);
Say the contents of the given string. Returns @code{TRUE} or @code{FALSE}
depending on whether this was successful.
@item int festival_load_file(const EST_String &filename);
Load the given file and evaluate its contents as
Lisp commands. Returns @code{TRUE} or @code{FALSE}
depending on whether this was successful.
@item int festival_eval_command(const EST_String &expr);
Read the given string as a Lisp command and evaluate it. Returns
@code{TRUE} or @code{FALSE} depending on whether this was successful.
@item int festival_text_to_wave(const EST_String &text,EST_Wave &wave);
Synthesize the given string into the given wave. Returns @code{TRUE} or
@code{FALSE} depending on whether this was successful.
@end table
Many other commands are also available, but often the above will be
sufficient.

Below is a simple top-level program that uses the Festival
functions

@lisp
int main(int argc, char **argv)
@{
    EST_Wave wave;
    int heap_size = 210000;  // default scheme heap size
    int load_init_files = 1; // we want the festival init files loaded

    festival_initialize(load_init_files,heap_size);

    // Say simple file
    festival_say_file("/etc/motd");

    festival_eval_command("(voice_ked_diphone)");
    // Say some text
    festival_say_text("hello world");

    // Convert to a waveform
    festival_text_to_wave("hello world",wave);
    wave.save("/tmp/wave.wav","riff");

    // festival_say_file puts the system in async mode so we better
    // wait for the spooler to reach the last waveform before exiting
    // This isn't necessary if only festival_say_text is being used (and
    // your own wave playing stuff)
    festival_wait_for_spooler();

    return 0;
@}
@end lisp

@node C only API, Java and JSAPI, C/C++ API, API
@section C only API

@cindex C interface
@cindex festival_client.c
A simpler C-only interface example is given in
@file{festival/examples/festival_client.c}. That interface talks to a
Festival server. The code does not require linking with any other EST
or Festival code, so it is much smaller and easier to include in other
programs. The code is missing some functionality, but not much
considering how much smaller it is.

@node Java and JSAPI, , C only API, API
@section Java and JSAPI

@cindex Java
@cindex JSAPI
Initial support for talking to a Festival server from Java is included
from version 1.3.0, and initial JSAPI support is included from 1.4.0.
At present the JSAPI talks to a Festival server elsewhere rather than
as part of the Java process itself.

A simple (pure) Java Festival client is given in
@file{festival/src/modules/java/cstr/festival/Client.java}, with a
wraparound script in @file{festival/bin/festival_client_java}.

See the file @file{festival/src/modules/java/cstr/festival/jsapi/ReadMe}
for requirements and a small example of using the JSAPI interface.

@node Examples, Problems , API , Top
@chapter Examples

This chapter contains some simple walkthrough examples of using
Festival in various ways, not just as a speech synthesizer.

@menu
* POS Example::         Using Festival as a part of speech tagger
* Singing Synthesis::   Using Festival for singing
@end menu

@node POS Example,Singing Synthesis, , Examples
@section POS Example

@cindex POS example
@cindex script programming
This example shows how we can use part of the standard synthesis process
to tokenize and tag a file of text. This section does not cover
training and setting up a part of speech tag set (@pxref{POS tagging}),
only how to go about using the standard POS tagger on text.

This example also shows how to use Festival as a simple scripting
language, and how to modify various methods used during text-to-speech.

The file @file{examples/text2pos} contains an executable shell script
which will read arbitrary ASCII text from standard input and produce
words and their part of speech (one per line) on standard output.

A Festival script, like any other UNIX script, must start with the
characters @code{#!} followed by the name of the @file{festival}
executable. For scripts the option @code{-script} is also
required. Thus our first line looks like
@lisp
#!/usr/local/bin/festival -script
@end lisp
Note that the pathname may need to be different on your system.

Following this we have copious comments, to keep our lawyers happy,
before we get into the real script.

The basic idea we use is that the tts process segments text into
utterances; those utterances are then passed to a list of functions, as
defined by the Scheme variable @code{tts_hooks}. Normally this variable
contains a list of two functions, @code{utt.synth} and @code{utt.play},
which will synthesize and play the resulting waveform. In this case,
instead, we wish to predict the part of speech value, and then print it
out.

The first function we define basically replaces the normal synthesis
function @code{utt.synth}. It runs the standard Festival utterance
modules used in the synthesis process, up to the point where POS is
predicted. This function looks like
@lisp
(define (find-pos utt)
  "Main function for processing TTS utterances. Predicts POS and
prints words with their POS"
  (Token utt)
  (POS utt)
)
@end lisp
The normal text-to-speech process first tokenizes the text, splitting it
into ``sentences''. The utterance type of these is @code{Token}. Then
we call the @code{Token} utterance module, which converts the tokens to
a stream of words. Then we call the @code{POS} module to predict part
of speech tags for each word. Normally we would call other modules,
ultimately generating a waveform, but in this case we need no further
processing.

The second function we define is one that will print out the words and
parts of speech
@lisp
(define (output-pos utt)
  "Output the word/pos for each word in utt"
  (mapcar
   (lambda (pair)
     (format t "%l/%l\n" (car pair) (car (cdr pair))))
   (utt.features utt 'Word '(name pos))))
@end lisp
This uses the @code{utt.features} function to extract features from the
items in a named stream of an utterance. In this case we want the
@code{name} and @code{pos} features for each item in the @code{Word}
stream. Then for each pair we print out the word's name, a slash and its
part of speech, followed by a newline.

Our next job is to redefine the functions to be called
during text-to-speech. The variable @code{tts_hooks} is defined
in @file{lib/tts.scm}. Here we set it to our two newly-defined
functions
@lisp
(set! tts_hooks (list find-pos output-pos))
@end lisp
@cindex garbage collection
@cindex GC
So that garbage collection messages do not appear on the screen,
we stop them from being output with the following
command
@lisp
(gc-status nil)
@end lisp
The final stage is to start the tts process running on standard
input. Because we have redefined what functions are to be run on
the utterances, it will no longer generate speech but just predict
part of speech and print it to standard output.
@lisp
(tts_file "-")
@end lisp

@node Singing Synthesis, ,POS Example, Examples
@section Singing Synthesis

@cindex singing
As an interesting example, a @file{singing-mode} is included. This
offers an XML-based mode for specifying songs, both notes and durations.
This work was done as a student project by Dominic Mazzoni. A
number of examples are provided in @file{examples/songs}. This
may be run as
@lisp
festival> (tts "doremi.xml" 'singing)
@end lisp
Each note can be given a pitch and a beat value
@example
<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
"Singing.v0_1.dtd"
[]>
<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
<PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
<PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
</SINGING>
@end example
You can construct multi-part songs by synthesizing each part
and generating waveforms, then combining them. For example
@example
text2wave -mode singing america1.xml -o america1.wav
text2wave -mode singing america2.xml -o america2.wav
text2wave -mode singing america3.xml -o america3.wav
text2wave -mode singing america4.xml -o america4.wav
ch_wave -o america.wav -pc longest america?.wav
@end example
The voice used to sing is the current voice. Note that the number of
syllables in the words must match the markup at run time, which means
this doesn't always work across dialects (UK voices sometimes won't
work without tweaking).

@cindex flinger
This technique is basically simple, though it is definitely effective.
However, for a more serious singing synthesizer we recommend you look
at Flinger @url{http://cslu.cse.ogi.edu/tts/flinger/}, which addresses
the issues of synthesizing the human singing voice in more detail.

@node Problems, References , Examples , Top
@chapter Problems

There will be many problems with Festival, both in installing
and running it. It is a young system and there is a lot to it.
We believe the basic design is sound and problems will be features
that are missing or incomplete rather than fundamental ones.

We are always open to suggestions on how to improve it and fix problems;
we don't guarantee we'll have the time to fix problems, but we are
interested in hearing what problems you have.

Before you smother us with mail, here is an incomplete
list of general problems we have already identified
@itemize @bullet
@item
The more documentation we write, the more we realize how much more
documentation is required. Most of the Festival documentation was
written by someone who knows the system very well, and makes many
English mistakes. A good re-write by someone else would be a good
start.
@item
The system is far too slow. Although machines are getting faster, it
still takes too long to start the system and get it to speak some
given text. Even so, on reasonable machines, Festival can generate
the speech several times faster than it takes to say it. But even if
it is five times faster, it will take 2 seconds to generate a 10 second
utterance. A 2 second wait is too long. Faster machines would improve
this, but a change in design is a better solution.
@item
The system is too big. It takes a long time to compile even on quite
large machines, and its footprint is still in the tens of megabytes, as
is the run-time requirement. Although we have spent some time trying
to fix this (optional modules make it possible to build a much smaller
binary), we haven't done enough yet.
@item
The signal quality of the voices isn't very good by today's standards of
synthesizers, even given the improvement in quality since the last
release. This is partly our fault in not spending the time (or perhaps
also not having enough expertise) on the low-level waveform synthesis
parts of the system. This will improve in the future with better signal
processing (under development) and better synthesis techniques (also
under development).
@end itemize

@node References, Feature functions , Problems , Top
@chapter References

@table @emph
@item allen87
Allen J., Hunnicut S. and Klatt, D. @emph{Text-to-speech: the
MITalk system}, Cambridge University Press, 1987.
@item abelson85
Abelson H. and Sussman G. @emph{Structure and Interpretation of Computer
Programs}, MIT Press, 1985.
@item black94
Black A. and Taylor, P. "CHATR: a generic speech synthesis system",
@emph{Proceedings of COLING-94}, Kyoto, Japan, 1994.
@item black96
Black, A. and Hunt, A. "Generating F0 contours from ToBI labels
using linear regression", @emph{ICSLP96}, vol. 3, pp 1385-1388,
Philadelphia, PA. 1996.
@item black97b
Black, A. and Taylor, P. "Assigning Phrase Breaks from Part-of-Speech
Sequences", @emph{Eurospeech97}, Rhodes, Greece, 1997.
@item black97c
Black, A. and Taylor, P. "Automatically clustering similar units for
unit selection in speech synthesis", @emph{Eurospeech97}, Rhodes, Greece,
1997.
@item black98
Black, A., Lenzo, K. and Pagel, V., "Issues in building general
letter to sound rules", 3rd ESCA Workshop on Speech Synthesis,
Jenolan Caves, Australia, 1998.
@item black99
Black, A. and Lenzo, K., "Building Voices in the Festival Speech
Synthesis System," unpublished document, Carnegie Mellon University,
available at
@url{http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/}
@item breiman84
Breiman, L., Friedman, J., Olshen, R. and Stone, C. @emph{Classification
and regression trees}, Wadsworth and Brooks, Pacific Grove, CA. 1984.
@item campbell91
Campbell, N. and Isard, S. "Segment durations in a syllable frame",
@emph{Journal of Phonetics}, 19:1 37-47, 1991.
@item DeRose88
DeRose, S. "Grammatical category disambiguation by statistical
optimization", @emph{Computational Linguistics}, 14:31-39, 1988.
@item dusterhoff97
Dusterhoff, K. and Black, A. "Generating F0 contours for speech
synthesis using the Tilt intonation theory", @emph{Proceedings of ESCA
Workshop of Intonation}, September, Athens, Greece. 1997.
@item dutoit97
Dutoit, T. @emph{An introduction to Text-to-Speech Synthesis}, Kluwer
Academic Publishers, 1997.
@item hunt89
Hunt, M., Zwierynski, D. and Carr, R. "Issues in high quality LPC
analysis and synthesis", @emph{Eurospeech89}, vol. 2, pp 348-351,
Paris, France. 1989.
@item jilka96
Jilka M. @emph{Regelbasierte Generierung natuerlich klingender
Intonation des Amerikanischen Englisch}, Magisterarbeit, Institute of
Natural Language Processing, University of Stuttgart. 1996.
@item moulines90
Moulines, E. and Charpentier, F. "Pitch-synchronous waveform processing
techniques for text-to-speech synthesis using diphones",
@emph{Speech Communication}, 9(5/6) pp 453-467. 1990.
@item pagel98
Pagel, V., Lenzo, K. and Black, A.
"Letter to Sound Rules for Accented Lexicon Compression", ICSLP98, Sydney,
Australia, 1998.
@item ritchie92
Ritchie G., Russell G., Black A. and Pulman S. @emph{Computational
Morphology: practical mechanisms for the English Lexicon}, MIT Press,
Cambridge, Mass. 1992.
@item vansanten96
van Santen, J., Sproat, R., Olive, J. and Hirschberg, J. eds,
"Progress in Speech Synthesis," Springer Verlag, 1996.
@item silverman92
Silverman K., Beckman M., Pitrelli, J., Ostendorf, M., Wightman, C.,
Price, P., Pierrehumbert, J. and Hirschberg, J. "ToBI: a standard for
labelling English prosody." @emph{Proceedings of ICSLP92}, vol 2. pp
867-870, 1992.
@item sproat97
Sproat, R., Taylor, P., Tanenblatt, M. and Isard, A. "A Markup Language
for Text-to-Speech Synthesis", @emph{Eurospeech97}, Rhodes, Greece, 1997.
@item sproat98
Sproat, R. ed., "Multilingual Text-to-Speech Synthesis: The Bell Labs
approach", Kluwer, 1998.
@item sable98
Sproat, R., Hunt, A., Ostendorf, M., Taylor, P., Black, A., Lenzo, K.
and Edgington, M. "SABLE: A standard for TTS markup", ICSLP98, Sydney,
Australia, 1998.
@item taylor91
Taylor P., Nairn I., Sutherland A. and Jack M. "A real time speech
synthesis system", @emph{Eurospeech91}, vol. 1, pp 341-344, Genoa,
Italy. 1991.
@item taylor96
Taylor P. and Isard, A. "SSML: A speech synthesis markup language",
to appear in @emph{Speech Communication}.
@item wwwxml97
World Wide Web Consortium Working Draft "Extensible Markup Language
(XML) Version 1.0 Part 1: Syntax",
@url{http://www.w3.org/pub/WWW/TR/WD-xml-lang-970630.html}
@item yarowsky96
Yarowsky, D., "Homograph disambiguation in text-to-speech synthesis",
in "Progress in Speech Synthesis," eds. van Santen, J., Sproat, R.,
Olive, J. and Hirschberg, J. pp 157-172. Springer Verlag, 1996.
@end table

@node Feature functions, Variable list , References , Top
@chapter Feature functions

@cindex features
This chapter contains a list of the basic feature functions available for
stream items in utterances. @xref{Features}. These are the basic
features, which can be combined with relative features (such as
@code{n.} for next, and relations to follow links). Some of these
features are implemented as short C++ functions (e.g. @code{asyl_in})
while others are simple features on an item (e.g. @code{pos}). Note
that functional features take precedence over simple features, so
accessing a feature called "X" will always use the function called "X"
even if a simple feature called "X" exists on the item.

Unlike previous versions, there are no features that are built in on all
items except @code{addr} (reintroduced in 1.3.1), which returns a unique
string for that item (it is the hex address of the item within the
machine). Features may also be defined through Scheme; these all have
the prefix @code{lisp_}.

The feature functions are listed in the form @var{Relation.name}, where
@var{Relation} is the name of the stream that the function is
appropriate to and @var{name} is its name. Note that you will not
require the @var{Relation} part of the name if the stream item you are
applying the function to is of that type.

@include festfeat.texi

@node Variable list, Function list , Feature functions , Top
@chapter Variable list

This chapter contains a list of variables currently defined within
Festival available for general use. This list is automatically
generated from the documentation strings of the variables as they are
defined within the system, so has some chance of being up-to-date.

Cross references to sections elsewhere in the manual are given where
appropriate.

@include festvars.texi

@node Function list, , Variable list , Top
@chapter Function list

This chapter contains a list of functions currently defined within
Festival available for general use. This list is automatically
generated from the documentation strings of the functions as they are
defined within the system, so has some chance of being up-to-date.

Note that some of the functions which have their origins in the SIOD
system itself are little used in Festival and may not work fully,
particularly the arrays.

Cross references to sections elsewhere in the manual are given where
appropriate.

@include festfunc.texi

@node Index, , , Top
@unnumbered Index

@printindex cp

@contents

@bye