Network text segmenting method based on genetic algorithm

A genetic-algorithm-based text segmentation technique, applied in the field of network text segmentation, addresses problems such as insufficient word frequency information, which reduces the accuracy of similarity calculations and, in turn, of text segmentation results.

CN101710333A; Active; Publication Date: 2010-05-19; NANTONG LONGXIANG ELECTRICAL APPLIANCE EQUIP +1


Embodiment Construction

[0026] With reference to the accompanying drawings, this embodiment targets texts on the theme "Beijing Olympics" in which the language use is standard and the text length is relatively short. The specific steps of text segmentation are as follows:

[0027] The first step is to set the search theme of the web spider to vocabulary related to the Olympic Games and to use the spider to collect web pages from the Internet. The Olympic theme vocabulary is determined in three steps: 1) manually select a number of texts that represent the search theme, usually 10 to 20; 2) count the word frequency of the nouns and verbs in these texts and select the high-frequency words as the candidate theme vocabulary set, with the word frequency threshold set to 30; 3) from the candidate theme vocabulary set, manually select 10 to 15 words as the theme vocabulary.
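The word-frequency step 2) can be sketched as follows. This is only an illustration under stated assumptions: the seed texts are assumed to be already tokenized and part-of-speech tagged, the tag names `n` and `v`, the function name `candidate_topic_words`, and the inclusive reading of the threshold of 30 are all hypothetical, and the toy data is invented for the example.

```python
from collections import Counter

# Hypothetical POS tags; the text only says "nouns and verbs".
CONTENT_TAGS = {"n", "v"}

def candidate_topic_words(tagged_texts, min_freq=30):
    """Return (word, count) pairs for nouns/verbs whose total frequency
    over all seed texts is at least min_freq, most frequent first."""
    counts = Counter(
        word
        for text in tagged_texts
        for word, tag in text
        if tag in CONTENT_TAGS
    )
    return [(w, c) for w, c in counts.most_common() if c >= min_freq]

# Toy seed corpus: each text is a list of (word, tag) pairs.
seed_texts = [
    [("Olympics", "n"), ("open", "v"), ("the", "x")] * 20,
    [("Olympics", "n"), ("athlete", "n"), ("the", "x")] * 15,
]
print(candidate_topic_words(seed_texts))  # [('Olympics', 35)]
```

The final manual choice of 10 to 15 words in step 3) is left to a human reviewer and is not modeled here.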

[0028] Web pages are all HTML documents, and it is necessary to perform text prepro...

Abstract

The invention discloses a network text segmentation method based on a genetic algorithm, used for segmenting short network texts. The method comprises the following steps: estimating the Latent Dirichlet Allocation (LDA) model corresponding to a corpus by Gibbs sampling, inferring latent topic information with the model, and representing texts by that latent topic information; then transforming the text segmentation process into a multi-objective optimization process by using a parallel genetic algorithm, and calculating the coherence within segmentation units, the divergence among segmentation units, and the fitness function by using deeper semantic information; and carrying out genetic iterations of the segmentation process, determining whether the process terminates based on the similarity among the results of multiple iterations or on the upper limit of iterations, to obtain the globally optimal solution for segmenting the texts. The invention thereby improves the accuracy of segmenting short network texts.
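The genetic iteration described in the abstract can be illustrated with a small sketch. It is not the patented procedure: it assumes each sentence is already represented by a topic distribution (for example from an LDA model), encodes a segmentation as one bit per gap between sentences, uses a single sequential population rather than a parallel genetic algorithm, and scores candidates with a hypothetical fitness (mean cosine similarity of sentences to their segment centroid plus mean cosine distance between adjacent segment centroids). Stopping when the best candidate has not improved for `patience` generations stands in for the "similarity among multi-iteration results" criterion. All function names and parameters are illustrative.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def segments(boundaries, n):
    """Split sentence indices 0..n-1; boundaries[i] == 1 cuts after sentence i."""
    segs, start = [], 0
    for i, b in enumerate(boundaries):
        if b:
            segs.append(list(range(start, i + 1)))
            start = i + 1
    segs.append(list(range(start, n)))
    return segs

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def fitness(boundaries, topics):
    """Assumed fitness: within-segment coherence plus adjacent-segment divergence."""
    segs = segments(boundaries, len(topics))
    cents = [centroid([topics[i] for i in s]) for s in segs]
    coherence = sum(cosine(topics[i], c)
                    for s, c in zip(segs, cents) for i in s) / len(topics)
    gaps = [1.0 - cosine(a, b) for a, b in zip(cents, cents[1:])]
    divergence = sum(gaps) / len(gaps) if gaps else 0.0
    return coherence + divergence

def evolve(topics, pop_size=20, max_iters=100, patience=10, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    n_bits = len(topics) - 1
    # Start from the unsegmented text plus random candidates.
    pop = [[0] * n_bits] + [[rng.randint(0, 1) for _ in range(n_bits)]
                            for _ in range(pop_size - 1)]
    best, stale = max(pop, key=lambda b: fitness(b, topics)), 0
    for _ in range(max_iters):  # upper limit of iterations
        ranked = sorted(pop, key=lambda b: fitness(b, topics), reverse=True)
        parents = ranked[: pop_size // 2]
        children = [ranked[0][:]]  # elitism: carry over the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, max(1, n_bits - 1))  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability p_mut.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
        top = max(pop, key=lambda b: fitness(b, topics))
        if fitness(top, topics) > fitness(best, topics):
            best, stale = top, 0
        else:
            stale += 1
            if stale >= patience:  # results stopped changing
                break
    return best

# Toy input: three sentences on one topic, then three on another.
topics = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3
best = evolve(topics)
print(best, segments(best, len(topics)))
```

On this toy input, the assumed fitness is highest for a single boundary after the third sentence, because that is the only segmentation in which every segment is pure and every adjacent pair of segments differs in topic. A genetic search is not guaranteed to reach that optimum on every run.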

Description

Technical field

[0001] The invention relates to a network text segmentation method, in particular to a network text segmentation method based on a genetic algorithm, which is suitable for segmenting short network texts.

Background art

[0002] Network text segmentation is an important technical means for network public opinion monitoring and network text sentiment analysis, and helps to discover deep semantic information in network texts.

[0003] The paper "Text Segmentation Model Based on Multivariate Discriminant Analysis" (Journal of Software, 2007, 18(3), pp. 555-564) discloses a method for text segmentation using word frequency information. The method adopts multivariate discriminant analysis, uses word frequency information to represent text with a vector space model, and defines four global evaluation functions considering three factors such as the internal distance of segmentation units, the distance between segmentation units, and th...

Claims


Application Information

Patent Timeline
19 May 2010: Publication of CN101710333A
IPC: G06F17/30; G06F17/27; G06N3/12
Inventors: Cai Wandong (蔡皖东); Zhao Yu (赵煜)