The invention relates to a text-to-image generation method based on cross-modal similarity and a generative adversarial network. The method comprises the following steps: S1, training a global consistency model, a local consistency model and a relation consistency model using matched and unmatched data, wherein the three models are used to obtain the global representation, local representation and relation representation of a text and an image, respectively; S2, obtaining the global representation, local representation and relation representation of the to-be-processed text using the trained global consistency model, local consistency model and relation consistency model; S3, concatenating the global representation, the local representation and the relation representation of the to-be-processed text to obtain a text representation of the to-be-processed text; S4, converting the text representation of the to-be-processed text into a condition vector using an Fca condition enhancement module; and S5, inputting the condition vector into a generator to obtain a generated image. Compared with the prior art, the method has the advantage, among others, of taking local and relation information into account.
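Steps S3 to S5 can be illustrated with a minimal sketch. The abstract does not define the internals of the Fca module, so the sketch assumes the re-parameterized Gaussian sampling commonly used for conditioning augmentation in text-to-image GANs (a mean and a log-variance predicted from the text representation, then c = mu + sigma * eps). All function names, dimensions and weights are hypothetical, the three input vectors stand in for the outputs of the trained consistency models, and the generator itself is not modeled.

```python
import math
import random


def concat_representations(global_rep, local_rep, relation_rep):
    """S3: connect the three representations in series (concatenation)."""
    return list(global_rep) + list(local_rep) + list(relation_rep)


def linear(x, weights, bias):
    """Plain dense layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def condition_enhancement(text_rep, w_mu, b_mu, w_logvar, b_logvar, rng):
    """S4 (assumed form): sample c = mu + sigma * eps with eps ~ N(0, 1)."""
    mu = linear(text_rep, w_mu, b_mu)
    logvar = linear(text_rep, w_logvar, b_logvar)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Placeholder outputs of the three trained consistency models (S2).
global_rep = [0.2, -0.1]
local_rep = [0.5, 0.3, -0.4]
relation_rep = [0.1]

text_rep = concat_representations(global_rep, local_rep, relation_rep)

cond_dim = 4  # assumed size of the condition vector
w_mu = random_matrix(cond_dim, len(text_rep), rng)
w_logvar = random_matrix(cond_dim, len(text_rep), rng)
zeros = [0.0] * cond_dim
condition = condition_enhancement(text_rep, w_mu, zeros, w_logvar, zeros, rng)

# S5 (input only): the condition vector is what the generator receives;
# appending a noise vector is a common choice and an assumption here.
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
generator_input = condition + noise
```

Sampling around a predicted mean, rather than using the text representation directly, is usually motivated by smoothing the conditioning space when paired text-image data is limited; whether the patented module follows this design is not stated in the abstract.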