The invention discloses a feature fusion method for a multi-modal deep neural network. In a multi-modal deep three-dimensional CNN, a squeeze-and-excitation (SE) module applied in the deep-feature domain yields a cross-modal channel attention mask: across all modalities, the channels that contribute most to the task objective receive greater attention, explicitly weighting the multi-modal 3D deep feature maps along the channel dimension. Then, a four-dimensional convolution followed by a sigmoid activation yields a cross-modal spatial attention mask, indicating which spatial positions in each modality's three-dimensional feature map deserve greater attention; this explicitly establishes the spatial correlation of the multi-modal 3D deep feature maps. By emphasizing the positions that carry important information across modality, channel, and space, the method improves the diagnostic efficacy of a multi-modal intelligent diagnosis system.
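The two attention stages described above can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the SE block is the standard squeeze (global average pooling), excite (two fully connected layers with ReLU and sigmoid), and rescale sequence, while the patent's four-dimensional convolution for the spatial mask is approximated here by a per-voxel weighted sum over channels (a 1×1×1 convolution) followed by a sigmoid. All function names, weight shapes, and the reduction ratio `r` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feats, w1, w2):
    """SE-style channel attention over the concatenated multi-modal channels.

    feats: (C, D, H, W) -- 3D feature maps stacked from all modalities.
    w1: (C//r, C), w2: (C, C//r) -- the two FC layers of the SE block.
    """
    squeeze = feats.mean(axis=(1, 2, 3))                   # global average pool -> (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))   # FC-ReLU-FC-sigmoid -> (C,)
    return feats * excite[:, None, None, None]             # rescale each channel

def spatial_attention(feats, w):
    """Spatial attention: collapse the channel axis with a per-voxel weighted
    sum (a 1x1x1 convolution), then apply sigmoid to get a mask in (0, 1)."""
    mask = sigmoid(np.tensordot(w, feats, axes=([0], [0])))  # -> (D, H, W)
    return feats * mask[None, :, :, :]                       # reweight every voxel

# Toy example: 2 modalities x 4 channels each, on a 4x4x4 volume.
rng = np.random.default_rng(0)
C, D, H, W, r = 8, 4, 4, 4, 2
feats = rng.standard_normal((C, D, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
w = rng.standard_normal(C) * 0.1

out = spatial_attention(channel_attention(feats, w1, w2), w)
print(out.shape)  # (8, 4, 4, 4)
```

Because both masks are sigmoid outputs in (0, 1), each stage can only attenuate a channel or voxel, never amplify it; the network learns to attenuate uninformative channels and positions less than informative ones, which is the relative reweighting the method relies on.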