A Word Segmentation Method for Network Text Based on Domain Adaptability

A domain-adaptive word segmentation method in the field of social network text word segmentation. Applied to social network text, it addresses the problem of poor segmentation results.

Active Publication Date: 2020-04-03
PEKING UNIV
9 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0004] To overcome the deficiencies of the prior art described above, the present invention provides a domain-adaptive word segmentation method for social network text. It builds an integrated neural network model and trains it by self-training on a news-domain corpus, a small amount of annotated social network data, and a large amount of unlabeled social network data. This improves word segmentation on social network text and addresses the poor results caused by having too little data in that domain.
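The self-training idea mentioned above can be illustrated with a minimal sketch. It is not the patented procedure: the excerpt does not say how pseudo-labels are selected, so the confidence `threshold`, the number of `rounds`, and the callables `train_fn` and `predict_fn` are all hypothetical placeholders chosen for illustration.

```python
def self_train(labeled, unlabeled, train_fn, predict_fn,
               threshold=0.9, rounds=3):
    """Generic self-training loop (illustrative assumption, not the patent's).

    labeled:    list of (x, y) pairs
    unlabeled:  list of x
    train_fn:   callable(pairs) -> model            (hypothetical)
    predict_fn: callable(model, x) -> (y, conf)     (hypothetical)
    """
    labeled = list(labeled)   # copy so the caller's list is not mutated
    pool = list(unlabeled)
    model = train_fn(labeled)
    for _ in range(rounds):
        remaining, added = [], 0
        for x in pool:
            y, conf = predict_fn(model, x)
            if conf >= threshold:
                # Confident prediction: promote it to a pseudo-labeled example.
                labeled.append((x, y))
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:
            break             # nothing new was learned in this round
        model = train_fn(labeled)
    return model, labeled, pool
```

In practice `train_fn` would fit a segmentation model and `predict_fn` would return a predicted tag sequence with some confidence score; both are left abstract here because the excerpt does not define them.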


Examples


Embodiment Construction

[0050] The present invention is further described below through embodiments, in conjunction with the accompanying drawings, without limiting the scope of the invention in any way.

[0051] The present invention provides a cross-domain word segmentation method for social network text. It builds an integrated neural network model and trains it by self-training on cross-domain labeled data together with a large amount of unlabeled social network data, thereby improving word segmentation on social network text. Figure 1 is a flow chart of the social network text word segmentation method provided by the present invention. The specific process is as follows:

[0052] 1) The input of the algorithm, T = {T_l, T_u}, consists of two parts, where T_l is the labeled data set (for example, the labeled sample: he / where / parachute team / disbandment / helpless / farewell / flying, in which "/" is a manually annotated word separator), ...
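To make the labeled-sample format concrete, here is a small sketch that turns a "/"-separated sample into per-character tags and back. The B/M/E/S tagging scheme (begin, middle, end, single) is a common convention for character-based segmentation and is an assumption here; the excerpt does not state which tag set the method uses. The function names and the example sentence are hypothetical.

```python
def to_char_tags(labeled_sample, sep="/"):
    """Convert a separator-delimited sample into (characters, BMES tags)."""
    chars, tags = [], []
    for word in (w.strip() for w in labeled_sample.split(sep)):
        if not word:
            continue  # skip empty pieces from doubled or trailing separators
        if len(word) == 1:
            word_tags = ["S"]                                  # single-character word
        else:
            word_tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        chars.extend(word)
        tags.extend(word_tags)
    return chars, tags


def to_words(chars, tags):
    """Inverse of to_char_tags: rebuild words from characters and BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("S", "E"):   # a word ends on S or E
            words.append(current)
            current = ""
    if current:                 # tolerate a sequence that ends mid-word
        words.append(current)
    return words
```

For the hypothetical sample `"我/喜欢/北京"`, `to_char_tags` yields the characters 我 喜 欢 北 京 with tags S B E B E, and `to_words` recovers the three words.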


Abstract

The invention discloses a domain-adaptive word segmentation method for social network text. An integrated neural network model is built and trained by self-training on a cross-domain news corpus together with labeled and unlabeled social network data. The method specifically comprises: dividing the social network text into a labeled data set and an unlabeled data set, and using both as input; using the news-domain corpus as the source corpus and pre-training source classifiers on it; integrating the source classifiers by assigning a weight to each of them; training the integrated neural network model on the social network corpus; and using the trained integrated neural network model for prediction. The method addresses the poor results caused by the severe shortage of data in social networks and can effectively improve word segmentation of social network text.
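The step of integrating source classifiers by assigning weights can be sketched as a weighted mixture of their per-character tag distributions. This is only an illustration under stated assumptions: the excerpt does not say how the weights are computed or normalised, so the softmax normalisation, the function names, and the dictionary representation of tag probabilities are all hypothetical.

```python
import math


def softmax(scores):
    """Normalise raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def integrate(classifier_probs, raw_weights):
    """Combine tag distributions from several source classifiers.

    classifier_probs: list of dicts {tag: probability}, one per classifier,
                      all describing the same character
    raw_weights:      one unnormalised score per classifier (assumed to be
                      learned on the social network data)
    """
    weights = softmax(raw_weights)
    combined = {}
    for w, probs in zip(weights, classifier_probs):
        for tag, p in probs.items():
            combined[tag] = combined.get(tag, 0.0) + w * p
    return combined


def predict_tag(classifier_probs, raw_weights):
    """Pick the tag with the highest weighted probability."""
    combined = integrate(classifier_probs, raw_weights)
    return max(combined, key=combined.get)
```

With equal raw weights the classifiers are simply averaged; raising one classifier's raw weight shifts the combined prediction toward that classifier's output.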

Description

Technical field

[0001] The invention belongs to the field of natural language processing and relates to word segmentation of social network text, in particular to a domain-adaptive word segmentation method for social network text.

Background technique

[0002] For word segmentation in the traditional news domain, statistical methods, mainly conditional random fields and perceptron models, initially achieved good results. However, these models need to extract a large number of features, which limits their generalization ability.

[0003] In recent years, more and more neural network-based methods have been used for automatic feature extraction, and many word segmentation models have been built on them, mainly including the convolutional neural network (CNN) and the long short-term memory network (LSTM). Although these neural network-based methods are very effective, training these ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F40/289, G06N3/08
CPC: G06F16/355, G06F40/289, G06N3/08
Inventor: 孙栩, 许晶晶, 马树铭
Owner: PEKING UNIV