The invention provides a system and a method for generating multimedia voice captions. The caption generating system comprises a control module, a caption processing module, a voice processing module, a checking and sectioning module and a caption output module. The caption processing module, the voice processing module, the checking and sectioning module and the caption output module are each connected to the control module, and the other end of the control module is connected to a cloud server. The method generates the multimedia voice caption automatically through the steps of obtaining, analyzing, identifying and sectioning the video and audio, generating the caption, checking the caption and subsequently processing it. This overcomes the limitations of producing video captions manually: whether or not the video and audio files are accompanied by a standard script (i.e., the text of the speech), captions can be generated automatically, efficiently and continuously. A user-friendly man-machine interaction system allows a caption mode to be selected according to actual conditions, including the number of words in each row, the number of rows, the font and the like. Through repeated accurate checking, the matching rate between the generated caption and the video reaches 100%.
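The selectable caption mode (number of words in each row and number of rows) can be illustrated with a minimal sketch. It assumes the recognized speech is already available as a list of words. The function name `layout_captions` and its parameters are hypothetical and are not taken from the invention, which does not specify how the sectioning is implemented.

```python
from typing import List


def layout_captions(
    words: List[str], words_per_row: int, rows_per_caption: int
) -> List[List[str]]:
    """Group recognized words into captions of fixed row and word counts.

    Hypothetical helper: each caption is a list of rows, and each row is a
    string holding at most `words_per_row` words.
    """
    if words_per_row < 1 or rows_per_caption < 1:
        raise ValueError("words_per_row and rows_per_caption must be >= 1")

    # Cut the word stream into rows of at most `words_per_row` words.
    rows = [
        " ".join(words[i:i + words_per_row])
        for i in range(0, len(words), words_per_row)
    ]

    # Group consecutive rows into captions of at most `rows_per_caption` rows.
    return [
        rows[i:i + rows_per_caption]
        for i in range(0, len(rows), rows_per_caption)
    ]


if __name__ == "__main__":
    text = "captions can be generated automatically efficiently and continuously"
    for caption in layout_captions(text.split(), words_per_row=3, rows_per_caption=2):
        print("\n".join(caption))
        print("---")
```

A real system would also attach start and end times to each caption and choose break points from the audio sectioning step. This sketch covers only the row and word limits chosen by the user.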