Advanced Usage
This section covers how to extend TTS by implementing your own models and experiments. It also elaborates on implementation guidelines.
A typical deep learning experiment has several parts to deal with:
Preprocess the data according to the needs of the model, and iterate the dataset by batch.
Define the model, optimizer, and other components.
Write out the training process (generally including forward / backward calculation, parameter update, log recording, visualization, periodic evaluation, etc.).
Configure and run the experiment.
PaddleSpeech TTS's Model Components
To balance the reusability and function of models, we divide models into several types according to their characteristics.
For commonly used modules that can serve as parts of larger models, we try to implement them as simply and universally as possible, because they will be reused. Modules with trainable parameters are generally implemented as subclasses of paddle.nn.Layer. Modules without trainable parameters can be implemented directly as functions whose inputs and outputs are paddle.Tensor.
Models for a specific task are implemented as subclasses of paddle.nn.Layer. Models can be simple, like a single-layer RNN. For complicated models, it is recommended to split the model into different components.
For a seq-to-seq model, it's natural to split it into encoder and decoder. For a model composed of several similar layers, it's natural to extract the sublayer as a separate layer.
There are two common ways to define a model which consists of several modules.
Define a module given the specifications. Here is an example with a multilayer perceptron.
import paddle
from paddle import nn


class MLP(nn.Layer):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()  # initialize the base Layer before adding sublayers
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        return self.linear2(paddle.tanh(self.linear1(x)))


module = MLP(16, 32, 4)  # initialize a module
When the module is intended to be a generic and reusable layer that can be integrated into a larger model, we prefer to define it in this way.
For readability and usability, we strongly recommend NOT packing specifications into a single object. The following example shows the style to avoid.
class MLP(nn.Layer):
    def __init__(self, hparams):
        super().__init__()  # initialize the base Layer before adding sublayers
        self.linear1 = nn.Linear(hparams.input_size, hparams.hidden_size)
        self.linear2 = nn.Linear(hparams.hidden_size, hparams.output_size)

    def forward(self, x):
        return self.linear2(paddle.tanh(self.linear1(x)))
For a module defined in this way, it’s harder for the user to initialize an instance. Users have to read the code to check what attributes are used.
Also, code in this style tends to be abused by passing a huge config object to initialize every module used in an experiment, though each module may not need the whole configuration.
We prefer to be explicit.
Define a module as a combination given its components. Here is an example of a sequence-to-sequence model.
class Seq2Seq(nn.Layer):
    def __init__(self, encoder, decoder):
        super().__init__()  # initialize the base Layer before adding sublayers
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x):
        encoder_output = self.encoder(x)
        output = self.decoder(encoder_output)
        return output


encoder = Encoder(...)
decoder = Decoder(...)
# compose two components
model = Seq2Seq(encoder, decoder)
When a model is complicated and made up of several components, each of which has a separate functionality, and can be replaced by other components with the same functionality, we prefer to define it in this way.
In the directory structure of PaddleSpeech TTS, modules with high reusability are placed in paddlespeech.t2s.modules, while models for specific tasks are placed in paddlespeech.t2s.models. When developing a new model, developers need to consider whether the model can be split into modules and how general those modules are, and then place them in the appropriate directories.
PaddleSpeech TTS's Data Components
Another critical component of a deep learning project is data. PaddleSpeech TTS handles training data in two steps:
Preprocess the data.
Load the preprocessed data for training.
Previously, we wrote the preprocessing in the __getitem__ of the Dataset, so that each sample was processed when it was accessed, but this caused some problems:
Efficiency problem. Even though Paddle is designed to load data asynchronously, when the batch size is large, each sample still has to be preprocessed and collated into a batch, which takes a lot of time and may even seriously slow down training.
Data filtering problem. Some filtering conditions depend on features of the processed sample, for example, filtering out samples that are too short according to text length. If the text length is only known after __getitem__, every filtering pass has to load the entire dataset once. In addition, if you do not pre-filter, a small exception (such as text that is too short) in __getitem__ will raise an exception in the entire data flow, which is not feasible, because collate_fn presupposes that every sample can be fetched normally. Even if a special flag such as None is used to mark a failed sample so that collate_fn skips it, the batch size will change.
Therefore, it is not realistic to put preprocessing entirely in __getitem__. We use the method mentioned above instead.
During preprocessing, we can do the filtering. We can also save more intermediate features, such as text length and audio length, which can be used for subsequent filtering. Following the convention of the TTS field, data is stored in multiple files, and the processed results are stored in npy format.
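As a minimal sketch of filtering on saved lengths, assume each metadata record already stores a text_length and a num_frames field. The field names, the thresholds, and the filter_records helper are illustrative assumptions, not part of PaddleSpeech.

```python
from typing import Any, Dict, List


def filter_records(records: List[Dict[str, Any]],
                   min_text_length: int = 2,
                   max_num_frames: int = 1000) -> List[Dict[str, Any]]:
    """Keep records whose stored lengths fall inside the given limits.

    Only the metadata is inspected, so no audio or feature file is opened.
    """
    return [
        r for r in records
        if r["text_length"] >= min_text_length
        and r["num_frames"] <= max_num_frames
    ]


# Hypothetical metadata, as a preprocessing script might produce it.
records = [
    {"utt_id": "a", "text_length": 1, "num_frames": 40},
    {"utt_id": "b", "text_length": 12, "num_frames": 310},
    {"utt_id": "c", "text_length": 30, "num_frames": 2400},
]
kept = filter_records(records)
```

Because the lengths live in the metadata, the filter runs once before training and never touches __getitem__.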
We store metadata in a list-like structure and keep the file paths in it, so that we are not restricted by the specific storage location of the files. Besides file paths, other metadata can also be stored in it, for example the path of the text, the path of the audio, the path of the spectrum, the number of frames, the number of sampling points, and so on.
A path can be opened in multiple ways, such as sf.read or np.load, so the reading method is best passed in as a parameter. We do not even want to determine the reading method from the file extension. Letting users supply it means they can define their own method to parse the data.
So we borrowed from the design of DataFrame, but our construction is simpler: it only needs a list of dicts, where each dict represents a record, which makes it convenient to interact with formats such as json and yaml. For each selected field, we need to give a parser (called converter in the interface), and that's it.
Then we need to select a format for saving metadata to disk. Storing the list of records in json wraps it in a pair of square brackets, which is inconvenient for stream reading and writing, so we use jsonlines. We don't use yaml because it occupies too many lines when storing a list of records.
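The streaming property can be illustrated with the standard json module alone. This is a sketch under the assumption that each record is a flat dict. The helper names write_jsonl and read_jsonl are made up for this example, and an in-memory buffer stands in for a file on disk.

```python
import io
import json
from typing import Any, Dict, Iterator, List


def write_jsonl(records: List[Dict[str, Any]], stream) -> None:
    """Write one JSON object per line, so records can be appended one by one."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


def read_jsonl(stream) -> Iterator[Dict[str, Any]]:
    """Yield records lazily, one line at a time."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory buffer stands in for a metadata file on disk.
buffer = io.StringIO()
write_jsonl([{"utt_id": "a", "num_frames": 40},
             {"utt_id": "b", "num_frames": 310}], buffer)
buffer.seek(0)
loaded = list(read_jsonl(buffer))
```

No enclosing brackets are needed, so a reader can stop after any line and a writer can append without rewriting the file.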
Meanwhile, a cache is added here, and a multi-process Manager is used to share memory between processes. When num_workers is used, this guarantees that each subprocess does not cache its own copy.
The implementation of DataTable can be found in paddlespeech/t2s/datasets/data_table.py.
from typing import Any, Callable, Dict, List

from paddle.io import Dataset


class DataTable(Dataset):
    """Dataset to load and convert data for general purpose.

    Parameters
    ----------
    data : List[Dict[str, Any]]
        Metadata, a list of meta datum, each of which is composed of
        several fields
    fields : List[str], optional
        Fields to use, if not specified, all the fields in the data are
        used, by default None
    converters : Dict[str, Callable], optional
        Converters used to process each field, by default None
    use_cache : bool, optional
        Whether to use a cache, by default False

    Raises
    ------
    ValueError
        If there is some field that does not exist in data.
    ValueError
        If there is some field in converters that does not exist in fields.
    """

    def __init__(self,
                 data: List[Dict[str, Any]],
                 fields: List[str]=None,
                 converters: Dict[str, Callable]=None,
                 use_cache: bool=False):
        ...  # body omitted in this excerpt
Its __getitem__ method parses each field with its converter and then composes a dictionary to return.
    def _convert(self, meta_datum: Dict[str, Any]) -> Dict[str, Any]:
        """Convert a meta datum to an example by applying the corresponding
        converters to each field requested.

        Parameters
        ----------
        meta_datum : Dict[str, Any]
            Meta datum

        Returns
        -------
        Dict[str, Any]
            Converted example
        """
        example = {}
        for field in self.fields:
            converter = self.converters.get(field, None)
            meta_datum_field = meta_datum[field]
            if converter is not None:
                converted_field = converter(meta_datum_field)
            else:
                converted_field = meta_datum_field
            example[field] = converted_field
        return example
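To show how fields and converters interact, here is a pure-Python stand-in. SimpleTable is a hypothetical class written for this illustration: it does not inherit from the Paddle Dataset, has no cache, and skips the ValueError checks, so treat it as a sketch of the idea rather than the DataTable implementation.

```python
from typing import Any, Callable, Dict, List, Optional


class SimpleTable:
    """Hypothetical stand-in for DataTable: list of dicts plus converters."""

    def __init__(self,
                 data: List[Dict[str, Any]],
                 fields: Optional[List[str]] = None,
                 converters: Optional[Dict[str, Callable]] = None):
        self.data = data
        # Default to every field of the first record (assumption of this sketch).
        if fields is None:
            fields = list(data[0].keys()) if data else []
        self.fields = fields
        self.converters = converters or {}

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        record = self.data[idx]
        example = {}
        for field in self.fields:
            converter = self.converters.get(field)
            value = record[field]
            example[field] = converter(value) if converter else value
        return example


# The "text" converter is a toy parser; in practice it might be np.load.
metadata = [{"utt_id": "a", "text": "1 2 3", "num_frames": 40}]
table = SimpleTable(
    metadata,
    fields=["utt_id", "text"],
    converters={"text": lambda s: [int(t) for t in s.split()]})
example = table[0]
```

Only the selected fields appear in the example, and a field without a converter is passed through unchanged.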
PaddleSpeech TTS's Training Components
A typical training process includes the following steps:
Iterate the dataset.
Process batch data.
Neural network forward/backward calculation.
Parameter update.
Evaluate the model on the validation dataset when certain conditions are reached.
Write logs, visualize, and in some cases save necessary intermediate results.
Save the state of the model and optimizer.
Here, we mainly introduce the training-related components of PaddleSpeech TTS and why we designed them this way.
Global Reporter
When training and modifying deep learning models, logging is often needed, and it has even become the key to debugging and improving models. We usually use various visualization tools, such as visualdl in paddle, tensorboard in tensorflow, visdom, wandb, etc. Besides, logging and print are usually used for different purposes.
Among these tools, print is the simplest. It has no concept of logger and handler as in logging, or of summarywriter and logdir as in tensorboard, and printing does not need a global_step. It is light enough to appear anywhere in the code, and it prints to a common stdout. Of course, its customizability is limited: for example, it is no longer intuitive when printing dictionaries or more complex objects. It is also fleeting, so people need to use redirection to save the information.
For TTS model development, we hope to have a more universal multimedia stdout, that is, a tool similar to tensorboard that allows many multimedia forms. However, tensorboard needs a summary writer when it is used and a step when information is written. If the data are images or audio, some format control parameters are needed as well.
This destroys the modular design to a certain extent. For example, suppose my model is composed of multiple sublayers, and I want to record some important information in the forward method of some sublayers. For this, I may need to pass the summary writer to these sublayers. But a sublayer's job is computation, and it should not carry extra concerns. It is also hard to tolerate that the initialization of an nn.Linear takes an optional visualizer. And how can a computation module know the global step? These things belong to the training process!
Therefore, a more common approach is not to put log-writing code in the definition of a layer, but to return the information, collect it during training, and write it to the summary writer. However, this requires modifying the return values. The summary writer acts as a broadcaster at the training level, and each module passes information to it by modifying its return values.
We think this method is a little ugly. We prefer to return only the necessary information rather than change the return values to accommodate visualization and recording. When you need to report some information, you should be able to report it without difficulty. So we imitate the design of chainer and use a global reporter.
It takes advantage of the globality of Python's module-level variables and the effect of context managers.
There is a module-level variable OBSERVATIONS in paddlespeech/t2s/training/reporter.py, which is a Dict that stores key-value pairs.
# paddlespeech/t2s/training/reporter.py
import contextlib

OBSERVATIONS = None  # no report target until a scope is opened


@contextlib.contextmanager
def scope(observations):
    # make `observations` the target to report to.
    # it is basically a dictionary that stores temporary observations
    global OBSERVATIONS
    old = OBSERVATIONS
    OBSERVATIONS = observations
    try:
        yield
    finally:
        OBSERVATIONS = old
The context manager scope above switches the object bound to the name OBSERVATIONS. A getter function is then defined to return the dictionary currently bound to OBSERVATIONS.
def get_observations():
    global OBSERVATIONS
    return OBSERVATIONS
Then we define a function that gets the current OBSERVATIONS and writes a key-value pair into it.
def report(name, value):
    # a simple function to report a named value
    # you can use it everywhere, it will get the default target and write to it
    # you can think of it as stdout
    observations = get_observations()
    if observations is None:
        return
    else:
        observations[name] = value
The following test code shows the usage.
Use first as the current OBSERVATION and write first_begin=1.
Then open the second OBSERVATION and write second_begin=2.
Then open the third OBSERVATION and write third_begin=3.
Exit the third OBSERVATION, and we are back in the second OBSERVATION automatically.
Write some content in the second OBSERVATION, then exit it, and we are back in the first OBSERVATION automatically.
def test_reporter_scope():
    first = {}
    second = {}
    third = {}
    with scope(first):
        report("first_begin", 1)
        with scope(second):
            report("second_begin", 2)
            with scope(third):
                report("third_begin", 3)
                report("third_end", 4)
            report("second_end", 5)
        report("first_end", 6)
    assert first == {'first_begin': 1, 'first_end': 6}
    assert second == {'second_begin': 2, 'second_end': 5}
    assert third == {'third_begin': 3, 'third_end': 4}
In this way, when we write modular components, we can call report directly. The caller decides where to report: once its OBSERVATION is ready, it opens a scope and calls the component within that scope.
The Trainer in PaddleSpeech TTS reports information in this way.
while True:
    self.observation = {}
    # set observation as the report target
    # you can use report freely in Updater.update()

    # updating parameters and state
    with scope(self.observation):
        update()  # training for a step is defined here
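The following self-contained sketch shows the same pattern from the side of a component. The scope and report functions follow the snippets above, while Scaler is a hypothetical component invented for this example.

```python
import contextlib

OBSERVATIONS = None


@contextlib.contextmanager
def scope(observations):
    # Same pattern as the snippet above: swap the report target temporarily.
    global OBSERVATIONS
    old = OBSERVATIONS
    OBSERVATIONS = observations
    try:
        yield
    finally:
        OBSERVATIONS = old


def report(name, value):
    # Silently drop the value when no scope is open.
    if OBSERVATIONS is not None:
        OBSERVATIONS[name] = value


class Scaler:
    """Hypothetical component: it computes, and reports a statistic on the side."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, xs):
        ys = [self.factor * x for x in xs]
        report("scaler/max_output", max(ys))  # no writer or step is needed here
        return ys


layer = Scaler(2.0)
outside = layer([1.0, 2.0])  # no scope is open, so the report is dropped

observation = {}
with scope(observation):
    inside = layer([1.0, 3.0])
```

The component's signature and return value stay the same whether or not anyone is listening, which is the point of the design.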
Updater: Model Training Process
To maintain the purity of functions and the reusability of code, we abstract the model code into a subclass of paddle.nn.Layer and write the core computing functions in it.
We tend to write the forward process of training in forward(), but only up to the prediction result, not the loss. This way, the module can be called by a larger module.
However, when we compose an experiment, we need to add some other things, such as the training process, evaluation process, checkpoint saving, visualization, and the like. In this process, we will encounter some things that only exist in the training process, such as the optimizer, learning rate scheduler, visualizer, etc. These things are not part of the model, and they should NOT be written in the model code.
We made an abstraction for these intermediate processes, namely Updater, which takes the model, optimizer, and data stream as input, and whose function is training. Since training methods may differ between models, we tend to write a corresponding Updater for each model. But this is different from the final training script: there is still a certain degree of encapsulation, which strips away the details of regular saving, visualization, evaluation, etc., and retains only the most basic function, that is, training the model.
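The division of labor can be sketched without any framework. Everything below (ToyModel, ToySGD, ToyUpdater, and the scalar data) is a hypothetical toy written for this illustration. It only assumes what the text states: the updater receives a model, an optimizer, and a data stream, and trains one step per call.

```python
class ToyModel:
    """Hypothetical one-parameter model: y = w * x."""

    def __init__(self, w=0.0):
        self.w = w

    def __call__(self, x):
        return self.w * x


class ToySGD:
    """Hypothetical optimizer that applies one gradient step to model.w."""

    def __init__(self, model, lr):
        self.model = model
        self.lr = lr

    def step(self, grad):
        self.model.w -= self.lr * grad


class ToyUpdater:
    """Owns the model, optimizer and data stream; update() trains one step."""

    def __init__(self, model, optimizer, batches):
        self.model = model
        self.optimizer = optimizer
        self.batches = iter(batches)
        self.iteration = 0

    def update(self):
        x, target = next(self.batches)
        prediction = self.model(x)               # the model only predicts
        loss = (prediction - target) ** 2        # the loss lives in the updater
        grad = 2.0 * (prediction - target) * x   # d(loss)/dw for this toy model
        self.optimizer.step(grad)
        self.iteration += 1
        return loss


model = ToyModel()
updater = ToyUpdater(model, ToySGD(model, lr=0.1), [(1.0, 2.0)] * 50)
losses = [updater.update() for _ in range(50)]
```

Saving, evaluation, and visualization are deliberately absent here. In this design they belong to whatever drives the updater, not to the updater itself.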
Visualizer
Because we choose observation as the communication mode, we can simply write the contents of observation into the visualizer.
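A minimal sketch of that last step, assuming the visualizer exposes an add_scalar(tag, value, step) method, as writers such as VisualDL's LogWriter are understood to do. RecordingWriter and write_observation are stand-ins invented for this example so that it runs without any visualization library.

```python
class RecordingWriter:
    """Hypothetical stand-in for a visualizer; it records what it is given."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append((tag, value, step))


def write_observation(writer, observation, step):
    """Forward every numeric entry of an observation dict to the writer."""
    for tag, value in observation.items():
        if isinstance(value, (int, float)):
            writer.add_scalar(tag, value, step)


writer = RecordingWriter()
write_observation(writer, {"train/loss": 0.5, "note": "skipped"}, step=10)
```

The step is supplied here, at the training level, so the modules that called report never need to know it.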
PaddleSpeech TTS's Configuration Components
Deep learning experiments often have many options to configure. These configurations can be roughly divided into several categories.
Data source and data processing mode configuration.
Save path configuration of experimental results.
Data preprocessing mode configuration.
Model structure and hyperparameter configuration.
Training process configuration.
It’s common to change the running configuration to compare results. To keep track of running configurations, we use yaml configuration files.
Also, we want to interact with command-line options. Some options that usually change according to running environments are provided by command line arguments. In addition, we want to override an option in the config file without editing it.
Taking these requirements into consideration, we use yacs as a config management tool. Other tools like omegaconf are also powerful and have similar functions.
Each provided example has a config.py, and the default config is defined in conf/default.yaml. To get the default config, import config.py and call get_cfg_defaults(). It can then be updated with a yaml config file or command-line arguments if needed.
For details about how to use yacs in experiments, see yacs.
The following are the basic ArgumentParser options:
--config is used to support configuration file parsing, and the configuration file itself handles the options unique to each experiment.
--train-metadata is the path to the training data.
--output-dir is the directory in which to save the training results. (If there are checkpoints in checkpoints/ of --output-dir, the newest checkpoint is reloaded by default to continue training.)
--ngpu determines the operation mode and refers to the number of training processes. If ngpu > 0, GPU is used; otherwise CPU is used.
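The four options described above can be sketched with the standard argparse module. The help strings, the default of 1 for --ngpu, and the example paths are assumptions of this sketch, and real training scripts may define further options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Train a TTS model.")
    parser.add_argument("--config", type=str, help="path to the yaml config file")
    parser.add_argument("--train-metadata", type=str, help="path to the training data")
    parser.add_argument("--output-dir", type=str, help="directory for training results")
    # Default value chosen for this sketch only.
    parser.add_argument("--ngpu", type=int, default=1, help="if 0, use CPU")
    return parser


# Placeholder paths; pass a list so the example does not read sys.argv.
args = build_parser().parse_args([
    "--config", "conf/default.yaml",
    "--train-metadata", "dump/train/metadata.jsonl",
    "--output-dir", "exp/default",
    "--ngpu", "0",
])
device = "gpu" if args.ngpu > 0 else "cpu"
```

argparse maps the dashes in option names to underscores, so --train-metadata is read back as args.train_metadata.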
Developers can refer to the examples in examples to write the default configuration file when adding new experiments.
PaddleSpeech TTS's Experiment Template
The experimental codes in PaddleSpeech TTS are generally organized as follows:
.
├── README.md            (help information)
├── conf
│   └── default.yaml     (default config)
├── local
│   ├── preprocess.sh    (script to call data preprocessing.py)
│   ├── synthesize.sh    (script to call synthesis.py)
│   ├── synthesize_e2e.sh (script to call synthesis_e2e.py)
│   └── train.sh         (script to call train.py)
├── path.sh              (script include paths to be sourced)
└── run.sh               (script to call scripts in local)
The *.py files called by the above *.sh scripts are located in ${BIN_DIR}/.
We add a named argument --output-dir to each training script to specify the output directory. The directory structure is as follows, and developers should follow this specification:
exp/default/
├── checkpoints/
│ ├── records.jsonl (record file)
│ └── snapshot_iter_*.pdz (checkpoint files)
├── config.yaml (config file of this experiment)
├── vdlrecords.*.log (visualdl record file)
├── worker_*.log (text logging, one file per process)
├── validation/ (output dir during training, information_iter_*/ is the output of each step, if necessary)
├── inference/ (output dir of exported static graph model, which is only used in the final stage of training, if implemented)
└── test/ (output dir of synthesis results)
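As an illustration of how this layout can be used, the sketch below picks the snapshot with the largest iteration number from checkpoints/. It is an assumption-laden example: newest_snapshot is a hypothetical helper, and it relies only on the snapshot_iter_*.pdz naming shown above. The actual resume logic may instead consult records.jsonl.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def newest_snapshot(output_dir: Path) -> Optional[Path]:
    """Return the snapshot_iter_*.pdz file with the largest iteration, if any."""
    pattern = re.compile(r"snapshot_iter_(\d+)\.pdz$")
    candidates = []
    for path in (output_dir / "checkpoints").glob("snapshot_iter_*.pdz"):
        match = pattern.search(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by the numeric iteration, not by the file name string.
    return max(candidates, key=lambda item: item[0])[1]


# Build a throwaway output directory that follows the layout above.
with tempfile.TemporaryDirectory() as tmp:
    checkpoints = Path(tmp) / "checkpoints"
    checkpoints.mkdir()
    for iteration in (500, 1000, 20000):
        (checkpoints / f"snapshot_iter_{iteration}.pdz").touch()
    newest = newest_snapshot(Path(tmp))
    newest_name = newest.name if newest else None
```

Parsing the iteration as an integer matters: a plain string sort would rank snapshot_iter_500 after snapshot_iter_20000.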
You can view the examples we provide in examples. These experiments are provided to users as examples that can be run directly. Users are welcome to add new models and experiments and contribute code to PaddleSpeech.