The invention provides a visual question-answering method and system based on semantic alignment, and a storage medium, and relates to the technical field of visual question-answering. According to an embodiment of the invention, the method comprises: obtaining and preprocessing a data set; extracting original image features and target position features from an original image; generating an image description statement according to the target position features; obtaining image description words, question features and image description statement features; semantically aligning the original image features with the image description words to obtain a first image feature; obtaining a second image feature according to the original image features and the image description statement features; obtaining a third image feature according to the original image features and the question features; fusing the three image features, the image description statement features and the question features to obtain a comprehensive feature; and predicting a final answer from the comprehensive feature. The method thereby highlights the importance of the image information, makes the information involved in the feature fusion process more complete, and yields a more accurate final answer.
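The fusion step described above can be illustrated with a minimal sketch. The abstract does not specify how the three image features are computed or how the features are fused, so everything below is an assumption made for illustration only: dot-product attention over image region features stands in for the alignment and guidance steps, concatenation stands in for fusion, and a linear scorer stands in for answer prediction. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(regions: List[Vector], query: Vector) -> Vector:
    """Weight each region feature by its similarity to the query and sum.

    Assumption: dot-product attention; the source does not name the mechanism.
    """
    weights = softmax([dot(r, query) for r in regions])
    dim = len(regions[0])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]


def comprehensive_feature(
    regions: List[Vector],      # original image features, one vector per region
    word_feats: List[Vector],   # features of the image description words
    caption_feat: Vector,       # image description statement feature
    question_feat: Vector,      # question feature
) -> Vector:
    dim = len(regions[0])
    # First image feature: align the image with each description word,
    # then average the word-guided results (assumed aggregation).
    per_word = [attend(regions, w) for w in word_feats]
    first = [sum(v[i] for v in per_word) / len(per_word) for i in range(dim)]
    # Second image feature: image guided by the description statement.
    second = attend(regions, caption_feat)
    # Third image feature: image guided by the question.
    third = attend(regions, question_feat)
    # Fusion by concatenation (assumed; the source only says "fusing").
    return first + second + third + caption_feat + question_feat


def predict_answer(fused: Vector, answer_weights: List[Vector]) -> int:
    """Return the index of the highest-scoring candidate answer (linear scorer)."""
    scores = [dot(w, fused) for w in answer_weights]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch all features share one dimension so that the dot products are defined; a real system would typically project image and text features into a common space with learned parameters before attending and fusing.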