Text segmentation method and device

A text and word segmentation technology, applied in fields such as instruments, computing, and electrical digital data processing, that addresses incorrect word segmentation, difficulty in discovering unregistered words and named entities, and unsatisfactory results, thereby improving the quality of text segmentation.

Pending Publication Date: 2022-06-07
TSINGHUA UNIV

AI Technical Summary

Problems solved by technology

[0002] For a long time, most word segmentation methods have been based on dictionaries. With the help of large-scale dictionaries, dictionary-based methods have achieved good results, but they perform poorly at identifying unregistered words. Recognizing unregistered words, however, is an unavoidable problem in word segmentation applications.
[0003] Since Chinese has no explicit word boundaries, Chinese NLP faces some unique challenges. These become even more severe for open-domain Chinese corpora containing many unregistered words and named entities, because the two problems are entangled: on the one hand, words cannot be segmented correctly when the true vocabulary is unknown; on the other hand, it is often difficult to accurately discover unregistered words and named entities in open-domain corpora without word segmentation guidance.
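
To make the limitation described in [0002] concrete, the following is a minimal sketch of forward maximum matching, a common dictionary-based segmentation technique. The source does not name a specific dictionary-based method, so the algorithm choice, the function name `forward_max_match`, and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary-based segmentation (illustrative, not from the source).

    At each position, take the longest substring found in the dictionary;
    if nothing matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; "奥利给" is deliberately missing from it.
toy_dictionary = {"我们", "喜欢", "自然", "语言", "处理"}

known = forward_max_match("我们喜欢自然语言处理", toy_dictionary)
unknown = forward_max_match("我们喜欢奥利给", toy_dictionary)
print(known)
print(unknown)
```

When every word is in the dictionary, the input is split as expected. The unregistered word in the second input is broken into single characters, which is the failure mode the paragraph above describes.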


Embodiment Construction

[0111] The described embodiments of the present invention will be described below with reference to the accompanying drawings. As those of ordinary skill in the art would realize, the described embodiments may be modified in various different ways or combinations thereof, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and description are illustrative in nature and are not intended to limit the scope of protection of the claims. Furthermore, in this specification, the drawings are not drawn to scale, and the same reference numerals refer to the same parts.

[0112] The text segmentation method of this embodiment includes the following steps:

[0113] Step S1: construct a Bayesian model framework of the following form:

[0114] P(θ, B | T, D) ∝ P(T | D, θ, B) π(θ, B)

[0115] Where π(θ, B) is the prior distribution, P(T|D, θ, B) is the text segmentation prediction model established for Chinese text, P(θ, B|T, D) is ...
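
The proportionality in [0114] can be sketched in code as "log posterior = log likelihood + log prior, up to a constant". The excerpt does not give the exact form of P(T | D, θ, B) or π(θ, B), so this sketch assumes a unigram likelihood (the product of θ over the words selected by the boundary vector) and takes the prior as a caller-supplied function. The names `apply_boundaries`, `log_likelihood`, `log_posterior_unnormalized`, and the toy values of θ are hypothetical.

```python
import math


def apply_boundaries(text, boundaries):
    """Split text with a 0/1 boundary vector; boundaries[i] == 1 ends a word after character i."""
    words, start = [], 0
    for i, flag in enumerate(boundaries):
        if flag == 1:
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])  # trailing characters with no closing boundary
    return words


def log_likelihood(text, boundaries, theta):
    """log P(T_j | D, theta, B_j) under an ASSUMED unigram model; D is the key set of theta."""
    total = 0.0
    for word in apply_boundaries(text, boundaries):
        if word not in theta:
            return float("-inf")  # word is outside the dictionary
        total += math.log(theta[word])
    return total


def log_posterior_unnormalized(text, boundaries, theta, log_prior):
    """log P(T | D, theta, B) + log pi(theta, B), i.e. the right-hand side of [0114] in log space."""
    return log_likelihood(text, boundaries, theta) + log_prior(theta, boundaries)


# Hypothetical toy parameters (they sum to 1).
theta = {"清华": 0.2, "大学": 0.3, "清": 0.1, "华": 0.1, "大": 0.15, "学": 0.15}


def flat_prior(theta, boundaries):
    """Placeholder flat prior (an assumption); the source's actual prior is not shown in this excerpt."""
    return 0.0


as_words = log_posterior_unnormalized("清华大学", [0, 1, 0, 1], theta, flat_prior)
as_chars = log_posterior_unnormalized("清华大学", [1, 1, 1, 1], theta, flat_prior)
print(as_words > as_chars)
```

With these toy values, the two-word segmentation scores higher than the character-by-character one, because 0.2 × 0.3 exceeds 0.1 × 0.1 × 0.15 × 0.15. A non-flat prior would shift this comparison, which is how prior preferences about segmentation could enter the model.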

Abstract

The invention discloses a text segmentation method and device. The method comprises: building a Bayesian model framework in which a text segmentation prediction model comprises an initial dictionary D, a set of word boundary vectors B, and a model parameter θ, the prediction model giving, for each unsegmented text segment Tj in a Chinese text sequence T, the probability of the segmented version of Tj according to the initial dictionary D, a given word boundary vector Bj, and the model parameter θ; determining a joint prior distribution π(θ, B) that integrates prior preferences about word usage and text segmentation into the prediction model; estimating the posterior mode of the model parameter θ with the EM algorithm, using it to remove low-significance words from the initial dictionary, and thereby reducing D to a final dictionary Df; and obtaining the set B of word boundary vectors from the posterior mode and the final dictionary Df, thereby segmenting the text T. The granularity of word segmentation is controlled through prior information and the choice of the κ parameter, which improves the text segmentation result.
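
The last two steps of the abstract (reducing D to Df, then recovering word boundaries) can be sketched as follows. This is not the patented procedure: the EM estimation is skipped and θ is given directly, pruning uses a simple probability threshold `min_theta` as a stand-in for the significance criterion, and boundaries are recovered with a most-probable-path dynamic program under an assumed unigram model. The role of κ is not specified in this excerpt, so it is not modelled. All function names and numbers are hypothetical.

```python
import math


def prune_dictionary(theta, min_theta, keep_single_chars=True):
    """Drop low-probability words and renormalise (a stand-in for removing low-significance words)."""
    kept = {w: p for w, p in theta.items()
            if p >= min_theta or (keep_single_chars and len(w) == 1)}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}


def best_segmentation(text, theta, max_word_len=4):
    """Most probable split of text into dictionary words under an ASSUMED unigram model."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in theta and best[start] > float("-inf"):
                score = best[start] + math.log(theta[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == float("-inf"):
        raise ValueError("text cannot be covered by the dictionary")
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical parameter values standing in for an EM estimate of theta.
theta_hat = {"清": 0.05, "华": 0.05, "大": 0.05, "学": 0.05,
             "清华": 0.35, "大学": 0.35, "华大": 0.01, "清华大学": 0.09}

final_theta = prune_dictionary(theta_hat, min_theta=0.02)  # plays the role of Df
segmented = best_segmentation("清华大学", final_theta)
print(segmented)
```

Keeping single characters during pruning ensures that every string over the known characters remains segmentable. Whether the source's final dictionary has the same property is not stated in this excerpt.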

Description

Technical field

[0001] The present invention relates to the technical field of natural language processing, and in particular to a text segmentation method and device.

Background technique

[0002] For a long time, most word segmentation methods have been based on dictionaries. With the help of large-scale dictionaries, dictionary-based word segmentation methods have achieved good results, but they perform poorly at identifying unregistered words. Recognizing unregistered words, however, is an unavoidable problem in word segmentation applications.

[0003] Since Chinese has no explicit word boundaries, Chinese natural language processing faces some unique challenges. These become even more severe for open-domain Chinese corpora containing many unregistered words and named entities, because the two problems are entangled: on the one hand, correct word segmentation is impossible without knowledge of the true vocabulary; on the other hand, it is often difficult to precisely discover undocumen...

Application Information

IPC(8): G06F40/216, G06F40/284, G06F40/242
CPC: G06F40/216, G06F40/284, G06F40/242
Inventors: 邓柯, 潘长在
Owner TSINGHUA UNIV