Network data mining method

A network data mining technology for web page information, applied in the field of information processing. It addresses the low retrieval efficiency and wasted storage resources caused by duplicate web content, improving retrieval efficiency and avoiding wasted storage.

Inactive Publication Date: 2015-01-14
BEIJING ZHITOUJIA INTPROP OPERATION CO LTD
2 Cites, 5 Cited by

AI Technical Summary

Problems solved by technology

For users, retrieving a single copy of an article is enough. For search engines, storing multiple web pages with the same content wastes storage resources and lowers retrieval efficiency.




Embodiment Construction

[0030] Referring to Figure 1, the network data mining method proposed by the present invention performs text classification and text clustering on the acquired web page information in order to extract topics. It comprises the following steps:

[0031] S1. The preset network probe captures web page information according to the industry ontology.

[0032] The industry ontology is preset in the network probe, and web pages are probed against it, which narrows the detection range and improves detection efficiency. A web page is crawled only when the detected network data meets the requirements, so important data is not missed and no time is spent on useless work. This strategy saves bandwidth and reduces the volume of data retrieved without reducing the amount of industry data collected, and it improves the data storage cycle and real-time performance.
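The gating described in step S1 can be illustrated with a minimal sketch. The source does not say how the ontology is represented or what "meets the requirements" means, so everything here is an assumption: the ontology is modeled as a mapping from concepts to terms, and the requirement is modeled as a minimum number of matched terms. The names `INDUSTRY_ONTOLOGY`, `ontology_hits`, `should_crawl`, and `min_terms` are hypothetical.

```python
import re

# Hypothetical industry ontology: concept -> related terms (illustrative only).
INDUSTRY_ONTOLOGY = {
    "solar": {"photovoltaic", "inverter", "panel"},
    "storage": {"battery", "lithium", "cell"},
}

def ontology_hits(text, ontology):
    """Return the ontology terms found in the text, grouped by concept."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {
        concept: sorted(tokens & terms)
        for concept, terms in ontology.items()
        if tokens & terms
    }

def should_crawl(preview_text, ontology, min_terms=2):
    """Assumed requirement: crawl only if enough ontology terms are matched."""
    hits = ontology_hits(preview_text, ontology)
    return sum(len(terms) for terms in hits.values()) >= min_terms
```

In this sketch the probe would call `should_crawl` on a cheap preview of a page (for example its title or a snippet) and fetch the full page only when the call returns `True`, which is one way the detection range could be narrowed before any bandwidth is spent.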

[0033] S2. Perform text extraction on the obtained webpa...



Abstract

The invention provides a network data mining method that performs text classification and text clustering on acquired web page information in order to extract topics. The method comprises the following steps: S1, a preset network probe captures web page information according to an industry ontology; S2, text is extracted from the acquired web page information; S3, a preset classifier classifies the extracted texts to generate a plurality of text type systems; S4, the texts under each text type system are clustered to generate a plurality of text subtypes, each text subtype corresponding to one topic; S5, the web page links are stored, and an index is constructed according to the text type systems and the text subtypes. The method provided by the invention can merge repeated information.
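Steps S3 to S5 can be sketched roughly as follows. The abstract names neither the classifier nor the clustering algorithm, so this sketch substitutes simple stand-ins: keyword scoring for the preset classifier and greedy Jaccard-similarity grouping for the clustering. `CLASS_KEYWORDS`, `classify`, `build_index`, and `threshold` are hypothetical names chosen for illustration, not the patented method.

```python
import re
from collections import defaultdict

# Hypothetical keyword sets standing in for the preset classifier of S3.
CLASS_KEYWORDS = {
    "energy": {"solar", "wind", "battery", "grid"},
    "finance": {"stock", "bank", "loan", "bond"},
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(text):
    """S3 stand-in: pick the class with the most keyword matches."""
    tokens = tokenize(text)
    scores = {label: len(tokens & kws) for label, kws in CLASS_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_index(pages, threshold=0.5):
    """S4 and S5 stand-in: group similar texts per class and store their links.

    pages is an iterable of (url, text) pairs. The result maps each class to a
    list of clusters; each cluster keeps a representative token set and the
    URLs of all pages assigned to it.
    """
    index = defaultdict(list)
    for url, text in pages:
        label = classify(text)
        tokens = tokenize(text)
        for cluster in index[label]:
            if jaccard(tokens, cluster["tokens"]) >= threshold:
                cluster["urls"].append(url)  # near-duplicate joins the topic
                break
        else:
            index[label].append({"tokens": tokens, "urls": [url]})
    return index
```

Under these assumptions, two pages carrying nearly the same article land in the same cluster, so a lookup by class and subtype returns one topic with several links instead of several separate entries. That is one plausible reading of how the index could merge repeated information.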

Description

Technical field

[0001] The invention relates to the technical field of information processing, and in particular to a network data mining method.

Background

[0002] As informatization continues to deepen, demand for intelligence and information integration has grown steadily. The Internet's ever-growing information resources contain a huge amount of valuable information and have become an important source of intelligence.

[0003] Different websites carry a great deal of repeated information, and search engines index that information repeatedly. As a result, users who search for information find that many identical results come from different websites. For users, retrieving a single copy of an article is enough. For search engines, storing multiple web pages with the same content wastes storage resources and lowers retrieval efficiency.

Contents...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F16/24556, G06F16/254, G06F16/35
Inventor: 贾岩
Owner: BEIJING ZHITOUJIA INTPROP OPERATION CO LTD