
Unstructured text data enhanced distributed large-scale data dimension extracting method

A technique for large-scale text data, applied in fields such as electrical digital data processing, special data processing applications, and instruments. It addresses the difficulty of building dimensions when analyzing unstructured text data.

Active Publication Date: 2017-05-10
ZHEJIANG GONGSHANG UNIVERSITY
3 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

[0004] In order to solve technical problems such as difficulty in building dimensions when analyzing massive unstructured text data, the present invention proposes an enhanced distributed large-scale data dimension extraction method for unstructured text data to achieve



Examples


Embodiment 1

[0059] An enhanced distributed large-scale data dimension extraction method for unstructured text data, including:

[0060] Step 1: Text word segmentation: Segment the input text and compute the mutual information value between the smallest semantic units. Set a first threshold through training and compare it with each mutual information value; when the mutual information value is greater than or equal to the first threshold, a word segmentation result is obtained;
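A minimal sketch of how Step 1 might look, under stated assumptions. The source does not give the mutual information formula, so this sketch assumes pointwise mutual information (PMI) between adjacent units, estimated from the input text itself, and assumes that adjacent units are merged into one word when their PMI reaches the first threshold. The function names `pmi_table` and `segment` and the threshold value are hypothetical; in the patent the threshold is obtained through training.

```python
import math
from collections import Counter


def pmi_table(units):
    """Estimate PMI for every adjacent pair of units (assumed formula)."""
    unit_counts = Counter(units)
    pair_counts = Counter(zip(units, units[1:]))
    n_units = len(units)
    n_pairs = max(len(units) - 1, 1)
    table = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = unit_counts[a] / n_units
        p_b = unit_counts[b] / n_units
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def segment(units, first_threshold):
    """Merge adjacent units whose PMI is >= first_threshold into one word."""
    if not units:
        return []
    table = pmi_table(units)
    words = [units[0]]
    for prev, cur in zip(units, units[1:]):
        if table[(prev, cur)] >= first_threshold:
            words[-1] += cur   # strongly associated: extend the current word
        else:
            words.append(cur)  # weakly associated: start a new word
    return words


# Toy input: single characters stand in for the smallest semantic units.
units = list("abxyabyxabxx")
print(segment(units, first_threshold=2.0))
```

In this toy input only the pair `a`, `b` is cohesive enough to pass the hypothetical threshold of 2.0, so every `ab` is merged while the other characters stay separate.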

[0061] Step 2: Word frequency statistics: According to the word segmentation results, perform word frequency statistics on the input text, and establish a corresponding word frequency relationship table;
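The word frequency step can be sketched as follows. The layout of the "word frequency relationship table" is not specified in the source; this sketch assumes one row per word holding its count and relative frequency, and the function name `word_frequency_table` is hypothetical.

```python
from collections import Counter


def word_frequency_table(words):
    """Return rows of (word, count, relative frequency), most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())
    rows = [(word, count, count / total) for word, count in counts.items()]
    # Sort by descending count, then alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Assumed input: the word list produced by the segmentation step.
for row in word_frequency_table(["data", "text", "data", "topic"]):
    print(row)
```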

[0062] Step 3: Input text topic extraction: Determine the set of topic words for the target field that the extraction focuses on, and determine the stability of the topic words in the input text when they appear together with all the words...


Abstract

The invention provides an enhanced distributed large-scale data dimension extraction method for unstructured text data. The method includes the steps of text word segmentation, word frequency statistics, input text topic extraction, and topic word filtering. In the topic extraction step, a set of topic words for the target field is determined according to the field the extraction focuses on; the stability with which each topic word in the input text co-occurs with all of the remaining words is calculated; a second threshold is set through training; and the stability is compared with the second threshold. When the stability is not smaller than the second threshold, the set of remaining topic words relevant to the topic of the input text is obtained. The method effectively converts massive unstructured text data into structured or semi-structured data and provides good support for data analysis and mining. It performs well on the complex problem of structured analysis of massive unstructured text.
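The threshold comparison in the topic extraction step can be illustrated with a short sketch. The source text available here does not define how stability is calculated, so the measure below is a stand-in chosen only for illustration: the fraction of the other distinct words in the text that appear in at least one sentence together with the topic word. The names `stability` and `filter_topic_words`, the sentence-level co-occurrence window, and the threshold value are all assumptions.

```python
def stability(term, sentences):
    """Stand-in stability: share of other distinct words co-occurring with term."""
    vocabulary = {word for sentence in sentences for word in sentence}
    others = vocabulary - {term}
    if not others:
        return 0.0
    co_occurring = set()
    for sentence in sentences:
        if term in sentence:
            co_occurring.update(word for word in sentence if word != term)
    return len(co_occurring) / len(others)


def filter_topic_words(topic_words, sentences, second_threshold):
    """Keep topic words whose stability is not smaller than the threshold."""
    return {t for t in topic_words if stability(t, sentences) >= second_threshold}


# Toy input: each sentence is a list of already segmented words.
sentences = [["bank", "loan", "rate"], ["bank", "credit"], ["weather", "rain"]]
print(filter_topic_words({"bank", "weather"}, sentences, second_threshold=0.5))
```

Here "bank" co-occurs with three of the five other words and is kept, while "weather" co-occurs with only one and is dropped. A real implementation would use the patent's own stability definition and a trained threshold.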

Description

Technical field

[0001] The invention relates to the field of big data dimension extraction, in particular to an enhanced distributed large-scale data dimension extraction method for unstructured text data.

Background technique

[0002] With the explosive growth of information in the era of big data, data is becoming a key asset that provides an important basis for decision-making as enterprises change how they are managed, and it plays an increasingly important role in the field of public utilities. Data that seemed irrelevant under traditional thinking becomes understandable through large-scale parallel distributed computing and can yield significant value. However, due to the large volume, high velocity, and variety of data, big data has brought a large number of heterogeneous and unstructured problems, making many excellent algorithms and tools in the field of traditional data analysis and mining unable to deal w...


Application Information

IPC(8): G06F17/30
Inventors: 刘东升, 许翀寰
Owner: ZHEJIANG GONGSHANG UNIVERSITY