The invention claims a Chinese speech recognition method that combines a Transformer with CNN-DFSMN-CTC. The method comprises the following steps: S1, preprocessing a speech signal and extracting 80-dimensional log-Mel filterbank (Fbank) features; S2, convolving the extracted 80-dimensional Fbank features with a convolutional neural network (CNN); S3, feeding the resulting features into a DFSMN network structure; S4, using CTC loss as the loss function of the acoustic model, decoding with a beam search algorithm, and optimizing with an Adam optimizer; S5, introducing a Transformer as a strong language model and training it iteratively until an optimal model structure is reached; and S6, combining the Transformer with the CNN-DFSMN-CTC acoustic model for adaptation, and verifying the combined model on multiple data sets to obtain the final optimal recognition result. The method achieves higher recognition accuracy and faster decoding: after verification on multiple data sets the character error rate reaches 11.8%, and the best character error rate, 7.8%, is obtained on the Aidatang data set.
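The beam search decoding over CTC outputs named in step S4 can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes the acoustic model emits per-frame log-probabilities over a small label set with the blank at index 0, it omits the Transformer language model of steps S5 and S6, and the function names `log_add` and `ctc_prefix_beam_search` and the parameter `beam_width` are hypothetical.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_prefix_beam_search(log_probs, beam_width=4, blank=0):
    """Decode CTC outputs (hypothetical helper, no language model).

    log_probs: list of T frames, each a list of V log-probabilities.
    Returns (best label sequence as a tuple, its log-probability).
    """
    # Each prefix maps to (log P of paths ending in blank,
    #                      log P of paths ending in a non-blank label).
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            p_total = log_add(p_b, p_nb)
            for s, lp in enumerate(frame):
                if s == blank:
                    # A blank keeps the prefix unchanged.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (log_add(nb_b, p_total + lp), nb_nb)
                    continue
                new_prefix = prefix + (s,)
                nb_b, nb_nb = next_beams[new_prefix]
                if prefix and prefix[-1] == s:
                    # A repeated label extends the prefix only after a blank...
                    next_beams[new_prefix] = (nb_b, log_add(nb_nb, p_b + lp))
                    # ...otherwise it collapses into the same prefix.
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, log_add(sb_nb, p_nb + lp))
                else:
                    next_beams[new_prefix] = (nb_b, log_add(nb_nb, p_total + lp))
        # Keep only the beam_width most probable prefixes.
        ranked = sorted(next_beams.items(),
                        key=lambda kv: log_add(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best_prefix, (p_b, p_nb) = max(beams.items(),
                                   key=lambda kv: log_add(*kv[1]))
    return best_prefix, log_add(p_b, p_nb)


# Two frames over the labels {0: blank, 1: one character}.
frames = [[math.log(0.6), math.log(0.4)],
          [math.log(0.6), math.log(0.4)]]
best, best_logp = ctc_prefix_beam_search(frames, beam_width=3)
```

In this toy input, picking the most probable label per frame yields blank twice, that is, an empty transcript with probability 0.36, whereas the search sums the three alignments of the single character (0.24 + 0.24 + 0.16 = 0.64) and returns it instead. A language model score, such as one produced by the Transformer, would typically be added when a prefix is extended; that step is left out here.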