The invention discloses a personalized speech video generation system based on phoneme posterior probabilities, and relates to the technical field of speech synthesis and voice conversion. The system operates through the following steps: S1, extracting phoneme posterior probabilities from audio by means of an automatic speech recognition system; S2, training a recurrent neural network to learn the mapping between phoneme posterior probabilities and lip features, so that, given input audio of any target speaker, the network outputs the corresponding lip features; S3, synthesizing the lip features into corresponding face images through face alignment, image fusion, optical flow and other techniques; and S4, generating the final video of the speaker talking from the generated face sequence through dynamic programming and other techniques. Because the lip shape is generated from phoneme posterior probabilities, the amount of video data required from the target speaker is greatly reduced. In addition, the video of the target speaker can be generated directly from text content, without additionally recording the speaker's audio.
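For illustration only, the mapping in step S2 can be sketched as a forward pass through a small recurrent network. This is a minimal sketch under stated assumptions, not the disclosed implementation: the class name `LipRNN`, the Elman-style cell, the dimensions, and the random, untrained weights are all hypothetical, and randomly generated probability vectors stand in for the phoneme posterior probabilities that an automatic speech recognition system would produce in step S1.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution over phonemes."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LipRNN:
    """Hypothetical Elman-style RNN: one posterior frame in, one lip feature vector out."""

    def __init__(self, n_phonemes, n_hidden, n_lip, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.w_in = random_matrix(n_hidden, n_phonemes, rng)
        self.w_rec = random_matrix(n_hidden, n_hidden, rng)
        self.w_out = random_matrix(n_lip, n_hidden, rng)

    def forward(self, posterior_frames):
        """Map a sequence of posterior frames to a sequence of lip feature vectors."""
        hidden = [0.0] * self.n_hidden
        lip_features = []
        for frame in posterior_frames:
            from_input = matvec(self.w_in, frame)
            from_state = matvec(self.w_rec, hidden)
            # The hidden state carries context from earlier frames.
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
            lip_features.append(matvec(self.w_out, hidden))
        return lip_features


# Assumed sizes, chosen only to keep the example small.
N_PHONEMES, N_HIDDEN, N_LIP, N_FRAMES = 5, 8, 4, 6

# Stand-in for step S1: random per-frame phoneme posterior probabilities.
rng = random.Random(42)
ppg = [softmax([rng.gauss(0, 1) for _ in range(N_PHONEMES)]) for _ in range(N_FRAMES)]

model = LipRNN(N_PHONEMES, N_HIDDEN, N_LIP)
lips = model.forward(ppg)
print(len(lips), len(lips[0]))  # prints: 6 4
```

The network emits one lip feature vector per audio frame, which is the form of input that step S3 would need for face image synthesis. The weights here are untrained, so the output values carry no meaning; only the data flow is shown.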