
Network text segmenting method based on genetic algorithm

Genetic-algorithm-based text segmentation technology, applied to network text segmentation, addresses the problems that short texts cannot provide sufficient word frequency information and that the resulting inaccurate similarity calculation degrades the accuracy of text segmentation results.

Active Publication Date: 2010-05-19
NANTONG LONGXIANG ELECTRICAL APPLIANCE EQUIP +1
0 Cites, 61 Cited by

AI Technical Summary

Problems solved by technology

However, for short texts among network texts, data sparsity means that sufficient word frequency information cannot be provided. At the same time, because word frequency information is shallow semantic information, calculating the similarity between segmentation units from word frequency alone reduces the accuracy of the similarity calculation, which in turn reduces the accuracy of the text segmentation results.
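The sparsity problem can be illustrated with a minimal sketch. The function name `cosine_tf` and the example sentences are assumptions made for illustration and do not come from the patent; the point is only that two short units on the same topic can share no words, so a similarity computed purely from word frequency is zero.

```python
from collections import Counter
from math import sqrt

def cosine_tf(a_tokens, b_tokens):
    """Cosine similarity between raw term-frequency vectors of two token lists."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    # Only words present in both units contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical short units: same topic, no shared vocabulary.
s1 = "sprinter wins gold medal".split()
s2 = "athlete takes first place".split()

print(cosine_tf(s1, s2))  # no overlapping words, so the numerator is zero
```

A representation based on latent topics, as the method proposes, is meant to give such pairs a non-zero similarity.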



Examples


Embodiment Construction

[0026] With reference to the accompanying drawings, this embodiment targets texts on the theme of the "Beijing Olympics", in which language usage is standardized and text length is relatively short. The specific steps of text segmentation are as follows:

[0027] In the first step, the search theme of the web spider is set to vocabulary related to the Olympic Games, and the web spider is used to collect web pages from the Internet. The Olympic theme vocabulary is determined in three steps: 1) manually select a number of texts that represent the search theme, usually 10 to 20; 2) count the word frequency of nouns and verbs in these texts and select high-frequency words as the candidate theme vocabulary set, with the word frequency threshold set to 30; 3) manually select 10 to 15 words from the candidate theme vocabulary set as the theme vocabulary.
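Step 2) of the vocabulary selection can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the input is assumed to be already tokenized and part-of-speech tagged as `(word, tag)` pairs, the tag labels `"n"` and `"v"` are hypothetical, and the threshold is assumed to be inclusive (a word with frequency exactly 30 is kept), which the text does not specify.

```python
from collections import Counter

def candidate_topic_words(tagged_texts, threshold=30, keep_tags=("n", "v")):
    """Return nouns and verbs whose total frequency reaches the threshold.

    tagged_texts: list of texts, each a list of (word, tag) pairs (assumed format).
    """
    counts = Counter(
        word
        for text in tagged_texts
        for word, tag in text
        if tag in keep_tags
    )
    frequent = [w for w, c in counts.items() if c >= threshold]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(frequent, key=lambda w: (-counts[w], w))

# Tiny hypothetical seed corpus; a low threshold is used only for the demo.
seed_texts = [
    [("olympics", "n"), ("opens", "v"), ("today", "t")],
    [("olympics", "n"), ("athletes", "n")],
    [("olympics", "n"), ("opens", "v")],
]
print(candidate_topic_words(seed_texts, threshold=2))
```

The returned list corresponds to the undetermined (candidate) vocabulary set; the final choice of 10 to 15 words remains a manual step, as in step 3).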

[0028] Web pages are all HTML documents, and it is necessary to perform text prepro...



Abstract

The invention discloses a network text segmentation method based on a genetic algorithm, used for segmenting short network texts. The method comprises the following steps: estimating the Latent Dirichlet Allocation (LDA) model corresponding to a corpus using Gibbs sampling, inferring latent topic information with the model, and representing texts by that latent topic information; then transforming the text segmentation process into a multi-objective optimization process using a parallel genetic algorithm, and calculating the coherence of segmentation units, the divergence among segmentation units, and the fitness function using this deeper semantic information; and carrying out genetic iteration of the segmentation process, determining whether to terminate based on the similarity among the results of multiple iterations or on the upper limit of iterations, to obtain the globally optimal segmentation of the text. The invention thereby improves the accuracy of segmenting short network texts.
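The overall search can be sketched in simplified form. Everything below is an assumption made for illustration, since the patent's actual formulas are not given in this excerpt: the per-sentence topic vectors are hand-written stand-ins for LDA output (no Gibbs sampling is performed), a single-population genetic algorithm replaces the parallel one, coherence and divergence are defined with cosine similarity to segment centroids, the two objectives are combined with equal weights, and "similarity among multi-iteration results" is approximated by stopping when the best individual is unchanged for `patience` generations.

```python
import random
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segments(topic_vecs, boundaries):
    """Split the sentence vectors; boundaries[i] == 1 means a cut after sentence i."""
    segs, current = [], [topic_vecs[0]]
    for vec, cut in zip(topic_vecs[1:], boundaries):
        if cut:
            segs.append(current)
            current = []
        current.append(vec)
    segs.append(current)
    return segs

def centroid(seg):
    return [sum(col) / len(seg) for col in zip(*seg)]

def fitness(topic_vecs, boundaries):
    segs = segments(topic_vecs, boundaries)
    cents = [centroid(s) for s in segs]
    # Coherence: mean similarity of each sentence to its own segment centroid.
    coherence = sum(cosine(v, c) for s, c in zip(segs, cents) for v in s) / len(topic_vecs)
    # Divergence: mean dissimilarity between centroids of adjacent segments.
    pairs = list(zip(cents, cents[1:]))
    divergence = sum(1 - cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return coherence + divergence  # assumed equal-weight combination

def evolve(topic_vecs, pop_size=20, max_generations=100, patience=10,
           mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    n = len(topic_vecs) - 1  # one gene per possible cut position

    def score(chromosome):
        return fitness(topic_vecs, chromosome)

    # Start from the unsegmented text plus random cut patterns.
    pop = [[0] * n] + [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    best, stale = max(pop, key=score), 0
    for _ in range(max_generations):  # upper limit of iterations
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]
        children = [parents[0]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randint(1, n - 1) if n > 1 else 0
            child = a[:point] + b[point:]  # one-point crossover
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = children
        new_best = max(pop, key=score)
        stale = stale + 1 if new_best == best else 0
        best = new_best
        if stale >= patience:  # results of successive iterations agree
            break
    return best

# Hypothetical topic vectors: three sentences on one topic, then three on another.
topic_vecs = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
              [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
best = evolve(topic_vecs)
print(best, round(fitness(topic_vecs, best), 3))
```

Under these assumed definitions, a single cut after the third sentence scores higher than either no cuts or a cut after every sentence, which is the behavior the coherence and divergence terms are intended to reward.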

Description

Technical field

[0001] The invention relates to a network text segmentation method, in particular to a network text segmentation method based on a genetic algorithm, which is suitable for segmenting network short-length texts.

Background technique

[0002] Network text segmentation technology is an important technical means for network public opinion monitoring and network text sentiment analysis, which helps to discover deep semantic information in network texts.

[0003] The document "Text Segmentation Model Based on Multivariate Discriminant Analysis, Journal of Software, 2007, 18(3), P 555-564" discloses a method for text segmentation using word frequency information. This method adopts multivariate discriminant analysis method, uses word frequency information to represent text with vector space model, and defines four global evaluation functions considering three factors such as the internal distance of segmentation units, the distance between segmentation units, and th...


Application Information

IPC(8): G06F17/30, G06F17/27, G06N3/12
Inventor: 蔡皖东, 赵煜
Owner: NANTONG LONGXIANG ELECTRICAL APPLIANCE EQUIP