A Web-based unstructured text acquisition method supporting user policy configuration

An unstructured text acquisition technology, applied to unstructured text data retrieval, network data indexing, network data retrieval, etc. It addresses the inflexibility of web crawler policy configuration, which cannot be adjusted to the characteristics of the collected data, and achieves a shorter collection cycle and higher crawler efficiency.

Active Publication Date: 2019-04-09
云南电网有限责任公司信息中心

AI Technical Summary

Problems solved by technology

[0006] A web crawler's policy configuration is not flexible enough: it cannot be adjusted according to the characteristics of the collected data.

Examples

Embodiment 1

[0050] 1. Text collector storage initialization

[0051] Create a Redis storage server and complete its initialization. Set up a hierarchical clustering algorithm and configure it so that, whenever the number of newly added unclassified pages reaches 1000, it classifies all unclassified pages and assigns them to the existing cluster categories.
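The batch trigger described in [0051] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: an in-memory list stands in for the Redis store, and a simple keyword-overlap (Jaccard) assignment stands in for the hierarchical clustering algorithm. The names `PendingPages`, `assign_to_nearest`, and the cluster keyword sets are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Page = Tuple[str, str]  # (url, extracted text)
BATCH_SIZE = 1000       # threshold stated in [0051]


def assign_to_nearest(batch: List[Page], clusters: Dict[str, Set[str]]) -> Dict[str, str]:
    """Assign each page to the existing cluster with the highest Jaccard overlap.

    Stand-in for the hierarchical clustering step; `clusters` maps a
    cluster name to a set of representative keywords (an assumption).
    """
    labels = {}
    for url, text in batch:
        tokens = set(text.lower().split())

        def jaccard(keywords: Set[str]) -> float:
            union = tokens | keywords
            return len(tokens & keywords) / len(union) if union else 0.0

        labels[url] = max(clusters, key=lambda name: jaccard(clusters[name]))
    return labels


class PendingPages:
    """In-memory stand-in for the Redis-backed store of unclassified pages."""

    def __init__(self, classify: Callable[[List[Page]], Dict[str, str]],
                 batch_size: int = BATCH_SIZE):
        self._pending: List[Page] = []
        self._classify = classify
        self._batch_size = batch_size
        self.labels: Dict[str, str] = {}  # url -> cluster name

    @property
    def pending_count(self) -> int:
        return len(self._pending)

    def add(self, url: str, text: str) -> None:
        """Store a new page; classify the whole backlog once the threshold is hit."""
        self._pending.append((url, text))
        if len(self._pending) >= self._batch_size:
            batch, self._pending = self._pending, []
            self.labels.update(self._classify(batch))


# Usage with a small threshold so the trigger is easy to observe.
clusters = {"power": {"grid", "voltage"}, "hr": {"hiring", "salary"}}
store = PendingPages(lambda batch: assign_to_nearest(batch, clusters), batch_size=3)
store.add("u1", "grid voltage report")
store.add("u2", "salary review")
store.add("u3", "hiring plan")  # third page reaches the threshold
```

Deferring classification until a batch accumulates means the clustering cost is paid once per batch rather than once per page.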

[0052] 2. Start page settings

[0053] Establish a Queuelib structure as the Frontier URL Queue, and input the initial URL addresses, such as www.yn.csg.cn, www.csg.cn, www.sgcc.com.cn, etc., into the Frontier URL Queue. The pages acquired from the above three addresses have not been clustered, so their weight values are set to

[0054] 3. Text collector page resource acquisition

[0055] From the frontier page library, take out page addresses on the principle that the address with the largest weight leaves the queue first, then obtain the page resource, extract the URL addresses in the page, and ...
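The largest-weight-first dequeue and the URL extraction in [0055] might look like the sketch below. It is an illustration only: the standard-library `heapq` and `html.parser` modules stand in for the Queuelib structure and the collector's parser, page fetching is left out, and the class names, weights, and `example.org` addresses are hypothetical.

```python
import heapq
import itertools
from html.parser import HTMLParser
from urllib.parse import urljoin


class Frontier:
    """URL queue that pops the largest weight first.

    heapq is a min-heap, so weights are stored negated.
    """

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # FIFO tie-breaker for equal weights

    def push(self, url, weight):
        if url in self._seen:  # skip addresses that were already queued
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-weight, next(self._order), url))

    def pop(self):
        neg_weight, _, url = heapq.heappop(self._heap)
        return url, -neg_weight

    def __len__(self):
        return len(self._heap)


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Usage: illustrative weights; the HTML is supplied inline instead of fetched.
frontier = Frontier()
frontier.push("http://example.org/news/", 0.2)
frontier.push("http://example.org/docs/", 0.9)
url, weight = frontier.pop()  # the 0.9-weighted address leaves first
for link in extract_links(url, '<a href="a.html">A</a> <a href="/b">B</a>'):
    frontier.push(link, weight)  # assumption: links inherit the parent's weight
```

Whether extracted links inherit the parent page's weight is an assumption here; the source text is cut off before it says how new addresses are weighted.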

Abstract

The invention discloses an unstructured text acquisition method supporting user policy configuration. The method comprises text collector storage initialization, text collector seed address initialization, text collector page resource acquisition, page analysis and storage, hierarchical clustering of page text content, feedback on the clustering of text data, real-time / quasi-real-time user strategy configuration, and the text collector's response to user feedback. According to the invention, the selection strategy of the Web text collection system, namely the web crawler, can be dynamically adjusted by evaluating crawled resources. The method enables better and more efficient text data collection and the construction of a high-quality text data resource pool within a specific organization: an information resource pool can be established for text data with rich characteristics within a very short period of time, crawler efficiency is improved, and the information collection period is shortened.
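The feedback loop named in the abstract (clustering feedback, user strategy configuration, collector response) could be sketched roughly as below. The abstract does not specify the data structures, so everything here is an assumption for illustration: `cluster_feedback`, `CrawlPolicy`, the per-category weights, and the placeholder default weight are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


def cluster_feedback(labels: Dict[str, str]) -> Dict[str, int]:
    """Summarize how many crawled pages fell into each cluster category."""
    return dict(Counter(labels.values()))


@dataclass
class CrawlPolicy:
    """User-configurable selection strategy: one weight per cluster category."""

    default_weight: float  # weight for pages that are not yet clustered
    category_weights: Dict[str, float] = field(default_factory=dict)

    def update(self, category: str, weight: float) -> None:
        """Apply a user's (quasi-)real-time configuration change."""
        self.category_weights[category] = weight

    def weight_for(self, category: Optional[str]) -> float:
        """Weight the collector would give URLs found on a page of this category."""
        if category is None:
            return self.default_weight
        return self.category_weights.get(category, self.default_weight)


# Usage: the user inspects the feedback, then raises the weight of one category.
labels = {"u1": "power", "u2": "power", "u3": "news"}
report = cluster_feedback(labels)
policy = CrawlPolicy(default_weight=0.0)  # placeholder; the source value is not given
policy.update("power", 0.8)               # illustrative user choice
next_weight = policy.weight_for(labels["u1"])
```

Keeping the policy as plain data that the collector reads on every enqueue is one way to let configuration changes take effect without restarting the crawl.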

Description

Technical field

[0001] The present application relates to an information collection and acquisition method, in particular to a web-based unstructured text acquisition method that supports real-time / quasi-real-time policy configuration by users. The method can be used to acquire and aggregate unstructured text data in the power industry, laying the foundation for unified management of unstructured text data, and can be applied to scenarios such as unified management of information resources and knowledge management within an organization.

Background

[0002] Unstructured text data is an extremely important information resource within an organization. Managing it effectively enables rapid retrieval, analysis, and mining of information resources, provides data and information support for daily office work, management, coordination, supervision, decision-making, and other activities, reduces daily operating costs, and accumulates and forms...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/35
Inventor: 张新阳李辉保富
Owner: 云南电网有限责任公司信息中心