Chinese word segmentation method and system based on word vector representation learning

A Chinese word segmentation and word vector technology, applied in the field of word segmentation, that addresses the difficulty and time cost of building a domain dictionary and the loss of a word's semantic integrity.

Status: Inactive; Publication Date: 2020-12-29
TIBET UNIVERSITY FOR NATIONALITIES

AI Technical Summary

Problems solved by technology

Secondly, the word vector splits the semantics of the word itself, destroying the word's semantic integrity.
In addition, word vector representation learning requires a lot of manual effort to sequence-label the corpus, and domain word segmentation requires constructing a domain dictionary, which is difficult and time-consuming.



Examples


Embodiment

[0057] The present invention mainly solves the problem that existing word segmentation technology segments unregistered (out-of-vocabulary) words incorrectly; secondly, it solves the problem that word segmentation in a specific field requires a great deal of manual domain-dictionary construction and corpus labeling; thirdly, it solves the problem of destroying the semantic integrity of words.

[0058] The first purpose of the present invention is to segment unregistered words correctly by having the machine learn the contextual semantic features of words, without manually constructing a domain dictionary, thereby improving the word segmentation performance of deep learning technology in a specific field; the second purpose is to use word vector representation learning to capture semantic features at the word level, so that word segmentation results do not destroy the semantic completeness of words, and finally to realize domain-specific Chinese word segmentation f...



Abstract

The invention discloses a Chinese word segmentation method and system based on word vector representation learning. The method comprises the steps of: performing preliminary word segmentation on the text to be segmented to obtain a preliminarily segmented text; inputting the preliminarily segmented text into a BERT model for training to obtain corpus word vectors; inputting the corpus word vectors into a Bi-GRU model for training to obtain a plurality of feature word vectors; calculating the cosine similarity of two adjacent feature word vectors to obtain a cosine similarity value; judging whether the cosine similarity value is greater than or equal to a preset threshold; and, if it is, combining the preliminary segmentation results of the words corresponding to the two adjacent feature word vectors. With the method and system provided by the invention, the problem of segmenting unregistered words in a specific field is solved, word segmentation performance is improved, the segmentation result does not damage the semantic integrity of words, and a large amount of manual corpus annotation can be avoided.
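The final merging step described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the patented implementation: the function names `cosine_similarity` and `merge_adjacent`, the toy 2-D vectors, the example tokens, and the threshold of 0.9 are all assumptions made for this example. In the method itself the feature word vectors would come from the BERT and Bi-GRU stages, which are not reproduced here. The sketch also assumes that similarity is always measured between originally adjacent vectors and that a run of consecutive matches is merged into one word; the abstract does not say how such chains are handled.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_adjacent(tokens: List[str],
                   vectors: List[Sequence[float]],
                   threshold: float) -> List[str]:
    """Merge neighbouring preliminary tokens whose vectors are similar enough.

    tokens[i] is a preliminary segment and vectors[i] its feature vector.
    Assumption: consecutive matches are chained into a single word.
    """
    if len(tokens) != len(vectors):
        raise ValueError("tokens and vectors must have the same length")
    if not tokens:
        return []
    merged = [tokens[0]]
    for i in range(1, len(tokens)):
        if cosine_similarity(vectors[i - 1], vectors[i]) >= threshold:
            merged[-1] += tokens[i]   # join with the previous segment
        else:
            merged.append(tokens[i])  # start a new segment
    return merged


# Hypothetical preliminary segmentation and toy 2-D feature vectors.
tokens = ["藏", "羊", "生活", "在", "高原"]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [-1.0, 0.0], [0.1, -1.0]]

print(merge_adjacent(tokens, vectors, threshold=0.9))
# -> ['藏羊', '生活', '在', '高原']
```

Only the first pair of vectors points in nearly the same direction (cosine about 0.99), so only those two preliminary segments are joined. The threshold controls the trade-off: a lower value merges more aggressively and risks gluing unrelated words together.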

Description

Technical field

[0001] The invention relates to the technical field of word segmentation, in particular to a Chinese word segmentation method and system based on word vector representation learning.

Background

[0002] The accuracy of general-purpose word segmentation technology drops significantly when it is applied to a specific field. Words that are not in the general-domain dictionary are called unregistered words, and a specific-domain corpus contains a large number of them. For example, there are a large number of phrases such as "prefix + noun", "place name + noun" and "person name + noun" in the Tibetan animal husbandry corpus. Among them, the entity names of "prefix + noun" include Tibetan sheep, Tibetan pig, Tibetan snow chicken, Tibetan fennel, saffron, etc.; the entity names of "place name + noun" include Zhongba Grassland, Plateau Rabbit, Sanjiang River Basin, Alpine Vulture, Naqu Cordyceps, etc. The entity names of "person name + ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/289, G06F40/242, G06K9/62
CPC: G06F40/289, G06F40/242, G06F18/22
Inventor: 赵尔平
Owner: TIBET UNIVERSITY FOR NATIONALITIES