The invention discloses a method, system, and device for converting voice into a lip shape, and a storage medium. The method comprises the steps of: acquiring a voice sequence; receiving and processing the voice sequence using a trained generative adversarial network (GAN) model; and obtaining a lip shape image output by the trained GAN model. Because a GAN model is trained and the trained model is used to convert voice into a lip shape, a high-quality, high-resolution lip shape image can be obtained. The GAN model is trained in an unsupervised learning mode, so that voice quality can be markedly improved, voice distortion reduced, and the robustness of the system enhanced. When changing voice is continuously input, a dynamic lip shape image is ultimately output, providing a smooth visual effect. In addition, the generated lip shape image can be combined with the voice to synthesize a high-quality talking-face video. The method, system, and device are widely applicable in the technical field of voice data.
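The three inference steps (acquire a voice sequence, process it with a trained generator, obtain lip shape images) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `make_toy_generator` is a hypothetical stand-in that uses one random linear layer with a sigmoid instead of trained GAN weights, and the window length, hop size, and image dimensions are arbitrary choices not taken from the disclosure.

```python
import math
import random
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale image: rows of pixel intensities in (0, 1)


def make_toy_generator(window: int, height: int, width: int,
                       seed: int = 0) -> Callable[[Sequence[float]], Image]:
    """Return a hypothetical stand-in for a trained GAN generator.

    Assumption: a real system would load trained weights; here a single
    random linear layer followed by a sigmoid is used for illustration.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(window)]
               for _ in range(height * width)]

    def generate(audio_window: Sequence[float]) -> Image:
        if len(audio_window) != window:
            raise ValueError("audio window has unexpected length")
        # One pixel per weight row: sigmoid of the weighted sum of samples.
        pixels = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, audio_window))))
                  for row in weights]
        # Reshape the flat pixel list into `height` rows of `width` pixels.
        return [pixels[r * width:(r + 1) * width] for r in range(height)]

    return generate


def voice_to_lip_frames(voice: Sequence[float],
                        generator: Callable[[Sequence[float]], Image],
                        window: int, hop: int) -> List[Image]:
    """Slide a window over the voice sequence and generate one image per window."""
    return [generator(voice[start:start + window])
            for start in range(0, len(voice) - window + 1, hop)]


# Step 1: acquire a voice sequence (a synthetic sine wave stands in for speech).
voice = [math.sin(0.1 * n) for n in range(160)]
# Step 2: process it with the (toy) generator.
generator = make_toy_generator(window=40, height=4, width=6)
# Step 3: obtain the sequence of lip shape images.
frames = voice_to_lip_frames(voice, generator, window=40, hop=20)
```

Overlapping windows (a hop smaller than the window) are one simple way to obtain a frame sequence that changes gradually as the input voice changes; whether the disclosed method segments the voice this way is not stated in the abstract.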