The present invention relates to the technical fields of speech synthesis, speech recognition, and voice cloning. It combines speech synthesis, speech recognition, and transfer learning to provide a voice cloning scheme based on Bottleneck features (linguistic features of audio), comprising a training system and a training method. Using only a small number of samples, the scheme provides a TTS service with high naturalness and similarity that carries the characteristics of the target user, thereby solving the problems of large sample requirements, long production cycles, and high labor costs in speech synthesis services. The training system comprises a data acquisition module, an acoustic feature extraction module, a speech recognition module, a prosody module, a multi-speaker speech acoustic module, and a speech synthesis module. The present invention also provides a training method based on this system, comprising preparing a training corpus, extracting acoustic features, training and fine-tuning each module, and performing speech synthesis.
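
As a purely illustrative aid, the module list and the order of the training steps named above can be sketched in Python. Only the module and step names are taken from the abstract; the identifiers (`MODULES`, `TRAINING_STEPS`, `run_pipeline`) and the placeholder step functions are hypothetical and do not represent the claimed implementation.

```python
from typing import Callable, Dict, List

# The six modules of the training system, as named in the abstract.
# The identifier spellings are assumptions made for this sketch.
MODULES: List[str] = [
    "data_acquisition",
    "acoustic_feature_extraction",
    "speech_recognition",       # source of the Bottleneck (linguistic) features
    "prosody",
    "multi_speaker_acoustic",   # the multi-person speech acoustic module
    "speech_synthesis",
]

# The four steps of the training method, in the order the abstract lists them.
TRAINING_STEPS: List[str] = [
    "prepare_training_corpus",
    "extract_acoustic_features",
    "train_and_fine_tune_modules",
    "synthesize_speech",
]

Step = Callable[[dict], dict]


def run_pipeline(steps: Dict[str, Step], state: dict) -> dict:
    """Run each training step in order, recording which steps were executed.

    `steps` maps a step name to a placeholder callable; real implementations
    of each step are outside the scope of this sketch.
    """
    log = list(state.get("log", []))
    for name in TRAINING_STEPS:
        state = steps[name](state)
        log.append(name)
    return {**state, "log": log}


if __name__ == "__main__":
    # Hypothetical no-op steps, used only to show the calling convention.
    noop_steps = {name: (lambda s: s) for name in TRAINING_STEPS}
    result = run_pipeline(noop_steps, {"target_samples": 10})
    print(result["log"])
```

The dictionary-of-callables layout is chosen only so that each step can be swapped for a real implementation independently; the abstract does not prescribe any particular software structure.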