Extraction method and device for Internet-oriented meaningful strings

An extraction method and device for Internet-oriented meaningful strings. It addresses problems in existing approaches, such as meaningless frequency counts for single characters and extracted strings that are similar in content and redundant, with the effect of reducing similarity between extracted strings, improving their semantic independence, and improving extraction accuracy.

Inactive Publication Date: 2012-02-01
HARBIN ENG UNIV

AI Technical Summary

Problems solved by technology

The disadvantage of using words as features is that it only considers whether a word appears in a document and how often, and treats features as independent of one another, completely ignoring the semantic relationships in the surrounding context and the order in which features occur.
[0005] In short, existing meaningful string extraction algorithms have the following disadvantages: 1) Using mutual information as the feature in internal (intra-string) analysis cannot filter two-character strings well. The components of a two-character string are single characters, and calculating the frequency of occurrence of a single character is meaningless; 2) Neither the internal analysis nor the external analysis considers the differences between strings, so many of the extracted meaningful strings have similar representations, which makes many of them semantically similar and redundant.
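To make the first disadvantage concrete, the following is a minimal sketch of mutual information computed for two-character strings. It uses the standard pointwise mutual information formula, which is an assumption here: the source does not give the exact formula used by the existing algorithms, and the function name `pmi_of_bigrams` is hypothetical.

```python
import math
from collections import Counter


def pmi_of_bigrams(text):
    """Return {two-character string: pointwise mutual information}.

    Assumption: PMI(xy) = log2(P(xy) / (P(x) * P(y))), with probabilities
    estimated from raw counts in `text`. This is the textbook definition,
    not necessarily the one used by the algorithms the patent criticizes.
    """
    char_counts = Counter(text)
    pair_counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[pair[0]] / n_chars
        p_y = char_counts[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


scores = pmi_of_bigrams("abababcd")
# "cd" occurs once and "ab" three times, yet "cd" scores higher,
# because the score depends entirely on single-character frequencies.
print(scores["cd"] > scores["ab"])
```

In this toy input the pair that occurs only once outranks the pair that repeats, which illustrates why a score built on single-character frequencies is a weak filter for two-character strings.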

Examples

Embodiment Construction

[0056] To make the purpose, technical solution, and advantages of the present invention clearer, the Internet-oriented method and system for extracting meaningful strings are described in further detail below with reference to the accompanying drawings and embodiments.

[0057] The invention extracts meaningful strings from the massive number of web pages on the Internet. A meaningful string is a complete language unit that is semantically independent, tightly coupled internally, and widely used. The meaningful strings extracted by the present invention can serve as features in a text representation model and be applied to the clustering and classification of massive Internet data.

[0058] The present invention divides the meaningful string mining process into four stages: repeated string discovery, internal analysis, external analysis, and inter-string analysis. The whole process is shown in Figure 1, including ...
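The four stages can be sketched as a chain of filters applied in series. Only the stage order comes from the text; every filtering criterion below is a hypothetical stand-in chosen to keep the example short, since the source excerpt does not state the actual criteria. All function names and thresholds are likewise assumptions.

```python
from collections import Counter


def discover_repeats(text, max_len=4, min_count=2):
    """Stage 1: substrings of length 2..max_len seen at least min_count times."""
    counts = Counter(
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def internal_analysis(cands, text):
    """Stage 2 (stand-in criterion): drop candidates that contain a space."""
    return {s: c for s, c in cands.items() if " " not in s}


def external_analysis(cands, text, min_neighbours=2):
    """Stage 3 (stand-in criterion): keep strings followed by at least
    min_neighbours distinct characters; end of text counts as one."""
    kept = {}
    for s, c in cands.items():
        right = set()
        start = text.find(s)
        while start != -1:
            end = start + len(s)
            right.add(text[end] if end < len(text) else "")
            start = text.find(s, start + 1)
        if len(right) >= min_neighbours:
            kept[s] = c
    return kept


def inter_string_analysis(cands, text):
    """Stage 4 (stand-in criterion): drop a string when a longer surviving
    string contains it and occurs equally often, since it adds nothing."""
    return {
        s: c for s, c in cands.items()
        if not any(s != t and s in t and c == cands[t] for t in cands)
    }


def extract(text):
    """Run the four stages in series, each filtering the previous output."""
    cands = discover_repeats(text)
    for stage in (internal_analysis, external_analysis, inter_string_analysis):
        cands = stage(cands, text)
    return cands


print(extract("abcx abcy abcz"))  # {'abc': 3}
```

On this toy input, "ab" is dropped by the external stage because it is always followed by "c", and "bc" is dropped by the inter-string stage because "abc" contains it and occurs equally often.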

Abstract

The invention provides an extraction method and a device for Internet-oriented meaningful strings. The extraction method comprises the following steps: extracting repeated character strings, then filtering them sequentially by internal (in-string) analysis, external (out-string) analysis, and inter-string analysis. The extraction device comprises a repeated string discovery module, an internal analysis module, an external analysis module, and an inter-string analysis module connected in series. The invention can effectively extract meaningful strings from news pages and forums, and can be widely used in fields such as network public opinion management and intelligent Internet information processing.

Description

Technical field

[0001] The present invention relates to technology that uses computers to assist the intelligent analysis of network information and the management of public opinion, specifically a method and system for quickly, accurately, and efficiently extracting meaningful strings from massive Internet web pages and forum information.

Background

[0002] Text representation is the first step in content-based text processing. Feature items in text representation are important factors affecting text classification and clustering results. Commonly used text feature items currently include characters, words, phrases, and semantic concepts. In theory, a semantic concept (semantic set) ranks higher than a phrase (syntax set), a phrase ranks higher than a word (word set), and a word ranks higher than a character (character set). Semantic concepts can usually be obtained by means of semantic dictionaries (thesauri, synonym dictionaries, etc.) or latent semantic indexing. Howev...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventors: 王巍, 杨武, 苘大鹏, 董红臣
Owner: HARBIN ENG UNIV