The invention belongs to the technical field of artificial intelligence for smart community applications, and relates to an unsupervised cross-modal retrieval method based on attention mechanism enhancement. The method comprises the following steps: enhancing the visual semantic features of an image with an attention mechanism; aggregating the feature information of the different modalities and mapping the fused multi-modal features into a common semantic feature space; performing, on the basis of a generative adversarial network, adversarial learning between the image-modality features, the text-modality features, and the fused multi-modal semantic features, so as to align the semantic features of the different modalities; generating hash codes from the modality features aligned by the generative adversarial network; and performing intra-modal and inter-modal similarity measurement learning on both the features and the hash codes. In this way, the heterogeneous semantic gap between different modalities is reduced, the dependency relationships between the features of different modalities are strengthened, and the semantic characteristics common to the different modalities can be represented more robustly.
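The pipeline described above can be illustrated with a minimal sketch in PyTorch-style Python. All module names (AttentionEnhance, FusionHash, Discriminator), dimensions, and the specific loss forms are illustrative assumptions for exposition only, not the invention's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionEnhance(nn.Module):
    """Self-attention over image region features to enhance visual semantics (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, regions):                     # regions: (B, R, D) image region features
        enhanced, _ = self.attn(regions, regions, regions)
        return (regions + enhanced).mean(dim=1)     # pooled, attention-enhanced image feature (B, D)


class FusionHash(nn.Module):
    """Maps image/text features to a common space, fuses them, and emits relaxed hash codes."""
    def __init__(self, img_dim, txt_dim, common_dim, hash_bits):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.txt_proj = nn.Linear(txt_dim, common_dim)
        self.fuse = nn.Linear(2 * common_dim, common_dim)   # multi-modal feature aggregation
        self.hash = nn.Linear(common_dim, hash_bits)

    def forward(self, img_feat, txt_feat):
        zi = F.relu(self.img_proj(img_feat))                 # image feature in the common space
        zt = F.relu(self.txt_proj(txt_feat))                 # text feature in the common space
        zf = F.relu(self.fuse(torch.cat([zi, zt], dim=1)))   # fused multi-modal semantic feature
        # tanh is a continuous relaxation of sign(); binarize with torch.sign at retrieval time
        return zi, zt, zf, torch.tanh(self.hash(zi)), torch.tanh(self.hash(zt))


class Discriminator(nn.Module):
    """Tries to distinguish single-modality features from the fused semantic feature."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, z):
        return self.net(z)


def adversarial_losses(disc, zi, zt, zf):
    """GAN-style alignment: the fused feature plays 'real', modality features play 'fake'."""
    real = disc(zf.detach())
    fake = torch.cat([disc(zi.detach()), disc(zt.detach())])
    d_loss = F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) + \
             F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
    g_logits = torch.cat([disc(zi), disc(zt)])               # encoders try to fool the discriminator
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss


def similarity_losses(zi, zt, bi, bt):
    """Intra- and inter-modal similarity preservation between features and hash codes."""
    def pairwise(x):                                         # cosine similarity matrix (B, B)
        x = F.normalize(x, dim=1)
        return x @ x.t()
    intra = F.mse_loss(pairwise(bi), pairwise(zi).detach()) + \
            F.mse_loss(pairwise(bt), pairwise(zt).detach())
    inter = F.mse_loss(F.normalize(bi, dim=1) @ F.normalize(bt, dim=1).t(),
                       ((pairwise(zi) + pairwise(zt)) / 2).detach())
    return intra + inter
```

In a training loop following this sketch, one would alternate discriminator updates (minimizing d_loss) with encoder/hash-network updates (minimizing g_loss plus the similarity losses); the exact weighting of the loss terms is not specified here and would be a tunable hyperparameter.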