A text segmentation method based on a hierarchical Dirichlet model

A text segmentation technique based on a hierarchical Dirichlet model, applicable to special-purpose data processing, instruments, and electrical digital data processing. It addresses the overfitting and the incomplete description of the text that result from a manually set number of topics.

Active Publication Date: 2019-05-31
STATE GRID ZHEJIANG ELECTRIC POWER

AI Technical Summary

Problems solved by technology

In practical applications, the number of topics chosen strongly affects segmentation quality. If it is set too high, training overfits; if it is set too low, the topics do not describe the text comprehensively.
[0004] Traditional text segmentation algorithms generally rely on a manually set number of topics. That number is hard to estimate for a large corpus, so overfitting or an incomplete description of the text easily results.
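
To illustrate why a Dirichlet process prior removes the need to fix the number of topics in advance, the sketch below simulates the Chinese restaurant process, a standard way to describe how a Dirichlet process clusters items. It is not taken from the patent: the function name, the concentration parameter `alpha`, and the sample sizes are assumptions chosen for illustration.

```python
import random

def chinese_restaurant_process(num_customers, alpha, seed=0):
    """Seat customers one by one and return the list of table sizes.

    A customer joins an existing table with probability proportional to its
    size, or opens a new table with probability proportional to alpha.
    """
    rng = random.Random(seed)
    table_sizes = []
    for n in range(num_customers):
        # Total weight: n customers already seated, plus alpha for a new table.
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for i, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                table_sizes[i] += 1
                break
        else:
            # No existing table was chosen: open a new one (a new "topic").
            table_sizes.append(1)
    return table_sizes

# Hypothetical settings: 1000 items, concentration parameter 2.0.
sizes = chinese_restaurant_process(num_customers=1000, alpha=2.0)
print(len(sizes), "clusters emerged for", sum(sizes), "items")
```

The number of clusters is an outcome of the process rather than an input; a larger `alpha` tends to produce more clusters.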



Examples


Embodiment Construction

[0042] The technical solutions of the present invention will be further described below in conjunction with the accompanying drawings, but the present invention is not limited to these embodiments.

[0043] The main idea of the present invention is to preprocess the text to be segmented, obtain its word segmentation set, and count word frequencies. The word frequency counts are fed into the hierarchical Dirichlet process model, which assigns a topic ID to each word during iterative inference and thereby produces topic vectors. Text segmentation thus no longer depends on a manually set number of topics: the hierarchical Dirichlet process model generates the topic vectors automatically, which improves the efficiency of text segmentation.
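
The preprocessing and word frequency steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: a regular expression on English text stands in for a dedicated word segmenter, and `STOP_WORDS`, `tokenize`, and `word_frequencies` are hypothetical names.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def word_frequencies(documents):
    """Return one Counter per document plus the sorted shared vocabulary."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    return counts, vocabulary

docs = [
    "The grid supplies power to the city.",
    "Power outage in the city grid.",
]
counts, vocabulary = word_frequencies(docs)
print(vocabulary)
print(counts[0])
```

The per-document counts are the bag-of-words input that a topic model such as a hierarchical Dirichlet process would typically consume.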

[0044] As shown in Figure 1, the embodiment of the present invention proposes a text segmentation method based on the hierarc...


Abstract

The invention belongs to the technical field of text segmentation, and in particular relates to a text segmentation method based on a hierarchical Dirichlet model. The method comprises the following steps: S1, acquiring a news corpus, preprocessing it to obtain the word segmentation set of the whole corpus, and performing word frequency statistics on that set; S2, feeding the word frequency statistics into a hierarchical Dirichlet process model for training, and storing the trained model; and S3, obtaining a topic vector for each word in the to-be-segmented text through the trained hierarchical Dirichlet process model, and realizing text segmentation according to the topic vectors. The method has the following effects: text segmentation no longer depends on a manually set number of topics, the topic vectors are generated automatically by the hierarchical Dirichlet process model, and text segmentation efficiency is improved.
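
The abstract does not say how boundaries are derived from the topic vectors in step S3. One plausible reading, sketched below purely as an assumption, is to build a topic histogram per sentence from the per-word topic IDs and to place a boundary wherever the cosine similarity between adjacent sentences falls below a threshold. The function names, the threshold of 0.5, and the topic IDs are all hypothetical.

```python
import math

def topic_vector(topic_ids, num_topics):
    """Turn the topic IDs of a sentence's words into a normalised histogram."""
    vec = [0.0] * num_topics
    for t in topic_ids:
        vec[t] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment(sentence_topic_ids, num_topics, threshold=0.5):
    """Return each index i where a boundary falls between sentence i and i+1."""
    vectors = [topic_vector(ids, num_topics) for ids in sentence_topic_ids]
    return [i for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]

# Hypothetical topic IDs, as if assigned per word by a trained model.
sentences = [[0, 0, 1], [0, 0, 0], [2, 2, 1], [2, 2, 2]]
boundaries = segment(sentences, num_topics=3)
print(boundaries)
```

With these made-up IDs the first two sentences share topic 0 and the last two share topic 2, so the only low-similarity gap is the one between the second and third sentences.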

Description

Technical field

[0001] The invention belongs to the technical field of text segmentation, and in particular relates to a text segmentation method based on a hierarchical Dirichlet model.

Background

[0002] With the rapid development of the Internet, people have gradually entered a new network era, and electronic text of all kinds is growing explosively. While this mass of information brings convenience to society, it also poses major challenges for text processing and analysis, such as how to extract useful information from it quickly and accurately. Text segmentation divides a text according to topic relatedness, so that similarity between different semantic paragraphs is minimal and similarity within each semantic paragraph is maximal, thereby locating the boundaries between topics.

[0003] Commonly used methods for text segmentation include methods based on vocabulary aggregation, methods based on...
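
The segmentation principle in the background (high similarity inside a segment, low similarity across a boundary) can be illustrated with a simple word-overlap score between adjacent sentences. This is a generic sketch, not one of the methods the description goes on to list: the Jaccard measure, the function names, and the sample sentences are assumptions.

```python
def jaccard(a, b):
    """Word-set overlap between two tokenised sentences, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def gap_scores(sentences):
    """Similarity across each gap between adjacent tokenised sentences."""
    return [jaccard(sentences[i], sentences[i + 1])
            for i in range(len(sentences) - 1)]

# Hypothetical tokenised sentences covering two different topics.
sentences = [
    ["grid", "power", "outage"],
    ["power", "outage", "repair"],
    ["football", "match", "score"],
    ["match", "score", "final"],
]
scores = gap_scores(sentences)
# The gap with the lowest similarity is the most likely topic boundary.
weakest_gap = min(range(len(scores)), key=scores.__getitem__)
print(scores, weakest_gap)
```

Here the gap between the second and third sentences has no shared words, so it is the natural candidate for a topic boundary.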


Application Information

IPC(8): G06F17/27
CPC: Y02D10/00
Inventor: 陈建王红凯叶卫龚小刚王以良唐锦江郭亚琼陈超孙嘉赛许敏喻谦吴哲翔姜维
Owner: STATE GRID ZHEJIANG ELECTRIC POWER