ID: 5d878f98586f124c232d6024

Multi-Speaker Text-to-Speech

by Raj Kumar

Convolutional sequence-to-sequence model with attention for text-to-speech synthesis


License: MIT

Tags: PyTorch, TorchAudio, Text-to-Speech, Audio Synthesis, Sequence-to-Sequence, NLP, NLTK

Model stats and performance

Dataset used: LJSpeech
Framework: PyTorch
OS used: Linux
Performance metric: inference time in seconds per sample

TEXT TO SPEECH

WHAT IS IT?

This is a text-to-speech model based on an implementation of the Deep Voice 3 paper. The Deep Voice 3 architecture consists of three components (a schematic sketch follows the list below):

  • Encoder: A fully-convolutional encoder, which converts textual features to an internal learned representation.
  • Decoder: A fully-convolutional causal decoder, which decodes the learned representation with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.
  • Converter: A fully-convolutional post-processing network, which predicts final vocoder parameters (depending on the vocoder choice) from the decoder hidden states. Unlike the decoder, the converter is non-causal and can thus depend on future context information.
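A minimal PyTorch sketch of these three components is given below. It is illustrative only: module names, layer counts, kernel sizes, and feature dimensions are assumptions rather than the repository's actual implementation, and the multi-speaker variant additionally conditions each component on a learned speaker embedding.

# Illustrative Deep Voice 3-style components; sizes and layer counts are assumptions.
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Fully-convolutional text encoder: character IDs -> learned key/value representation."""
    def __init__(self, vocab_size=256, channels=128, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, channels)
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=5, padding=2) for _ in range(layers)]
        )

    def forward(self, text_ids):                         # text_ids: (B, T_text)
        x = self.embed(text_ids).transpose(1, 2)         # (B, C, T_text)
        for conv in self.convs:
            x = x + F.relu(conv(x))                      # residual convolution blocks
        return x.transpose(1, 2)                         # (B, T_text, C)


class Decoder(nn.Module):
    """Causal convolutional decoder with attention; predicts mel frames autoregressively."""
    def __init__(self, mel_dim=80, channels=128, layers=3):
        super().__init__()
        self.prenet = nn.Linear(mel_dim, channels)
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=5, padding=4) for _ in range(layers)]
        )
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)
        self.mel_out = nn.Linear(channels, mel_dim)

    def forward(self, prev_mels, encoder_out):           # prev_mels: (B, T_dec, mel_dim)
        x = self.prenet(prev_mels).transpose(1, 2)       # (B, C, T_dec)
        for conv in self.convs:
            h = conv(x)[:, :, : x.size(2)]               # trim padded output so each step sees only the past
            x = x + F.relu(h)
        q = x.transpose(1, 2)                            # queries: (B, T_dec, C)
        ctx, _ = self.attn(q, encoder_out, encoder_out)  # attention over the text representation
        hidden = q + ctx
        return self.mel_out(hidden), hidden              # predicted mel frames, decoder hidden states


class Converter(nn.Module):
    """Non-causal post-net: decoder hidden states -> vocoder parameters (e.g. linear spectrogram)."""
    def __init__(self, channels=128, out_dim=513, layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=5, padding=2) for _ in range(layers)]
        )
        self.proj = nn.Linear(channels, out_dim)

    def forward(self, decoder_hidden):                   # (B, T_dec, C)
        x = decoder_hidden.transpose(1, 2)
        for conv in self.convs:
            x = x + F.relu(conv(x))                      # non-causal: may use future context
        return self.proj(x.transpose(1, 2))              # (B, T_dec, out_dim)

At inference time the decoder is unrolled step by step, feeding previously predicted mel frames back in, and the converter then maps the resulting decoder states to the final vocoder parameters.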

HOW TO USE?

To run the script, use the sample command:

python run.py 20180505_deepvoice3_checkpoint_step000640000.pth nikl_preprocess/example.txt ./

For help and other options, run python run.py -h.

Optional arguments:

--hparams=<params>                 Hyperparameters [default: ].
--preset=<json>                    Path of preset parameters (json).
--checkpoint-seq2seq=<path>        Load seq2seq model from checkpoint path.
--checkpoint-postnet=<path>        Load postnet model from checkpoint path.
--file-name-suffix=<s>             File name suffix [default: ].
--max-decoder-steps=<N>            Max decoder steps [default: 500].
--replace_pronunciation_prob=<N>   Pronunciation replacement probability [default: 0.0].
--speaker_id=<id>                  Speaker ID (for multi-speaker model).
--output-html                      Output html for blog post.
-h, --help                         Show help message.
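For example, a multi-speaker checkpoint can be driven with the documented flags like this (the speaker ID and checkpoint name here are placeholders, not files shipped with the model):

python run.py --speaker_id=0 --max-decoder-steps=1000 <multi_speaker_checkpoint.pth> nikl_preprocess/example.txt ./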

WHAT ARE THE REQUIREMENTS?

To install all requirements and dependencies, run one of the following:

For GPU: pip install -r gpu_requirements.txt
For CPU: pip install -r cpu_requirements.txt
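After installing, an optional quick check like the one below (not part of the repository) confirms whether the GPU build of PyTorch can see a CUDA device:

# Optional sanity check after installing gpu_requirements.txt
import torch
print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())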

Stats

Inference time (seconds per sample):
CPU: 0.000558
GPU: 0.000312
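A rough sketch of how such a per-sample figure can be measured is shown below; synthesize here is a placeholder for the model's actual inference call, not a function from this repository:

import time

def seconds_per_sample(synthesize, texts):
    # Average wall-clock time of one synthesis call over a list of input texts.
    start = time.perf_counter()
    for text in texts:
        synthesize(text)
    return (time.perf_counter() - start) / len(texts)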
