The invention discloses a visual positioning method based on a diverse identification candidate box generation network (DDPN). The method comprises the following steps: (1) training the diverse identification candidate box generation network; (2) extracting image features using the trained DDPN; (3) extracting text data features; (4) constructing the target vector and the regression box target values; (5) constructing a deep neural network; (6) setting a loss function; (7) training the model; and (8) calculating the network prediction value. The algorithm provided by the invention, in particular the DDPN-based image feature extraction, achieves a significant improvement on the image visual positioning task and substantially outperforms the current mainstream methods on this task. In addition, the feature extraction algorithm has important application value and considerable potential in other cross-modal fields such as image content question answering and image description.
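As an illustration of step (4), the following is a minimal sketch of how a target vector over candidate boxes and the regression box target values might be built. The abstract does not give the construction, so everything here is an assumption: the function names (`iou`, `target_vector`, `regression_target`), the IoU threshold of 0.5, the normalization of the kept overlaps, and the center/log-size offset parameterization commonly used in box regression.

```python
import math

# Boxes are (x1, y1, x2, y2) tuples. All names and constants below are
# hypothetical and chosen for illustration only.

def iou(a, b):
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def target_vector(candidates, gt, threshold=0.5):
    """Soft target over candidates: overlaps below the (assumed)
    threshold are zeroed, the rest are normalized to sum to 1."""
    kept = []
    for c in candidates:
        s = iou(c, gt)
        kept.append(s if s >= threshold else 0.0)
    total = sum(kept)
    return [s / total for s in kept] if total > 0 else kept

def regression_target(candidate, gt):
    """Offsets that map a candidate box onto the ground-truth box:
    center shifts scaled by candidate size, and log size ratios."""
    cw, ch = candidate[2] - candidate[0], candidate[3] - candidate[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    ccx, ccy = candidate[0] + cw / 2.0, candidate[1] + ch / 2.0
    gcx, gcy = gt[0] + gw / 2.0, gt[1] + gh / 2.0
    return ((gcx - ccx) / cw, (gcy - ccy) / ch,
            math.log(gw / cw), math.log(gh / ch))

if __name__ == "__main__":
    gt_box = (0, 0, 10, 10)
    boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (0, 0, 10, 12)]
    print(target_vector(boxes, gt_box))
    print([regression_target(b, gt_box) for b in boxes])
```

Under these assumptions, the target vector could serve as the supervision for ranking candidate boxes and the offsets as the supervision for box refinement in the loss function of step (6); the actual forms used by the invention may differ.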