The invention belongs to the technical field of neural networks and particularly relates to a crowd counting method based on multi-scale feature fusion. The method mainly comprises the following steps: extracting feature maps at three scales from a backbone network, sending the feature maps into a feature fusion sub-network, and calculating a density map from the fused feature maps so as to predict the number of people in the image. The feature fusion sub-network is designed as three convolutional network branches of identical structure. Each branch adopts an attention fusion network divided into two paths, each path consisting of a convolution layer, a normalization layer and an activation function. The two paths receive the same input but differ in the number of output channels: one outputs a single channel and the other outputs N channels. The single-channel path learns a feature weight for the N-channel path, and the feature weight is multiplied by the N-channel output feature map. Finally, the feature maps of the three branches are superposed and sent together to a decoding module, which outputs a density map of the image; the integral of the density map is the number of people in the image. The invention thereby improves the precision of people counting.
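The data flow described above can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation, and several details are assumptions that the abstract does not specify: 1x1 convolution kernels, a per-channel zero-mean/unit-variance normalization, ReLU on the N-channel path and sigmoid on the single-channel path, one branch per backbone scale, backbone feature maps already brought to a common spatial size, and a hypothetical one-layer decoder. All function names, the channel counts and the random weights are illustrative only.

```python
import math
import random


def conv1x1(x, weights):
    """1x1 convolution. x: [C_in][H][W]; weights: [C_out][C_in]."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(w * x[c][i][j] for c, w in enumerate(kernel)) for j in range(width)]
         for i in range(height)]
        for kernel in weights
    ]


def normalize(x, eps=1e-5):
    """Per-channel zero-mean, unit-variance scaling (assumed normalization layer)."""
    out = []
    for channel in x:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.append([[(v - mean) * scale for v in row] for row in channel])
    return out


def activate(x, fn):
    """Apply an activation function element-wise to a [C][H][W] map."""
    return [[[fn(v) for v in row] for row in channel] for channel in x]


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def attention_branch(x, w_feat, w_attn):
    """One fusion branch: the single-channel path weights the N-channel path."""
    feat = activate(normalize(conv1x1(x, w_feat)), relu)        # N channels (ReLU assumed)
    attn = activate(normalize(conv1x1(x, w_attn)), sigmoid)[0]  # H x W weights (sigmoid assumed)
    return [
        [[v * a for v, a in zip(f_row, a_row)] for f_row, a_row in zip(channel, attn)]
        for channel in feat
    ]


def superpose(maps):
    """Element-wise sum of equally shaped [C][H][W] feature maps."""
    return [
        [[sum(vals) for vals in zip(*rows)] for rows in zip(*channels)]
        for channels in zip(*maps)
    ]


def random_weights(rng, c_out, c_in):
    """Illustrative untrained weights; a real model would learn these."""
    return [[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(c_out)]


def predict_count(features, branch_weights, w_decoder):
    """Fuse three branches, decode to a density map, and sum it to get the count."""
    branches = [attention_branch(x, wf, wa)
                for x, (wf, wa) in zip(features, branch_weights)]
    fused = superpose(branches)
    # Hypothetical decoder: a single 1x1 convolution followed by ReLU.
    density = activate(conv1x1(fused, w_decoder), relu)[0]
    count = sum(sum(row) for row in density)  # discrete integral of the density map
    return fused, density, count


rng = random.Random(0)
N, H, W = 4, 6, 8
in_channels = [3, 5, 7]  # illustrative backbone channel counts, one per scale
features = [[[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(c)]
            for c in in_channels]
branch_weights = [(random_weights(rng, N, c), random_weights(rng, 1, c))
                  for c in in_channels]
w_decoder = random_weights(rng, 1, N)

fused, density, count = predict_count(features, branch_weights, w_decoder)
print(f"estimated count: {count:.3f}")
```

With untrained random weights the printed count is meaningless; the sketch only shows how the single-channel weight map scales every channel of the N-channel output, how the three branch outputs are summed, and how the count is obtained by summing the density map.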