A word vector representation method based on joint statistics of Chinese morphemes and pinyin comprises the following steps: (1) collecting internet text information to construct a corpus, and conducting text cleaning and word segmentation
processing on the constructed corpus; (2) performing word segmentation on the Chinese corpus, converting the segmented corpus into pinyin without tone information, and then computing the term frequency and inverse document frequency weights TFc, IDFc, TFp and IDFp of the morpheme and pinyin features over the training set corpus and the whole document set; (3) constructing a Chinese single
morpheme representation vector based on a Chinese word representation model that uses joint statistics of contextual morphemes and pinyin; and (4) training a three-layer neural
network on the basis of step (3) to predict the central target word. The method adapts to the offline dictionary and to the scale of the corpus data, learns directly from large-scale unlabeled internet text data, improves on the way conventional word embedding models account for the distinctive characteristics of the Chinese language, and improves the representation and recognition accuracy of wrongly written words.
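
The statistics in step (2) can be illustrated with a minimal Python sketch. It is not the claimed implementation: the toy corpus, the hand-written TONELESS_PINYIN table, the helper names, and the plain log(N/df) form of the inverse document frequency are all assumptions made for illustration. A real system would presumably use a pinyin conversion library and the weighting defined in the full specification.

```python
import math
from collections import Counter

# Hypothetical toy table mapping each morpheme (character) to toneless pinyin.
# Assumption: stands in for a real pinyin conversion tool.
TONELESS_PINYIN = {
    "我": "wo", "喜": "xi", "欢": "huan",
    "西": "xi", "瓜": "gua", "吃": "chi",
}

def to_morphemes(words):
    """Split segmented words into single-character morphemes."""
    return [ch for word in words for ch in word]

def to_pinyin(morphemes):
    """Convert morphemes to pinyin with tone information dropped."""
    return [TONELESS_PINYIN[ch] for ch in morphemes]

def term_frequency(docs):
    """Relative frequency of each unit over the whole corpus."""
    counts = Counter(unit for doc in docs for unit in doc)
    total = sum(counts.values())
    return {unit: c / total for unit, c in counts.items()}

def inverse_document_frequency(docs):
    """log(N / df), one common IDF variant (assumed here, no smoothing)."""
    n_docs = len(docs)
    doc_freq = Counter(unit for doc in docs for unit in set(doc))
    return {unit: math.log(n_docs / df) for unit, df in doc_freq.items()}

# Toy corpus, already cleaned and word-segmented (the output of step (1)).
corpus = [["我", "喜欢", "西瓜"], ["我", "吃", "西瓜"]]

morpheme_docs = [to_morphemes(doc) for doc in corpus]
pinyin_docs = [to_pinyin(doc) for doc in morpheme_docs]

tf_c = term_frequency(morpheme_docs)               # TFc
idf_c = inverse_document_frequency(morpheme_docs)  # IDFc
tf_p = term_frequency(pinyin_docs)                 # TFp
idf_p = inverse_document_frequency(pinyin_docs)    # IDFp
```

In this toy corpus the morphemes 喜 and 西 are distinct entries in tf_c but share the single entry "xi" in tf_p, so the pinyin statistics pool evidence across homophones. That pooling is one plausible reason the joint statistics help with wrongly written words, which are often homophone substitutions.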