A Web network-based unstructured text acquisition method supporting user policy configuration

An unstructured text acquisition technology, applied to unstructured text data retrieval, network data indexing, and network data retrieval. It addresses inflexible web crawler policy configuration, in particular the inability to adjust the policy to the collected data, and it shortens the collection cycle and improves crawler efficiency.

Active Publication Date: 2019-04-09

云南电网有限责任公司信息中心


AI Technical Summary

Problems solved by technology

[0006] Existing web crawlers have the defect that their policy configuration is not flexible enough: the policy cannot be adjusted according to the characteristics of the collected data.

Examples

Embodiment 1

[0050] 1. Text collector storage initialization

[0051] Create a Redis storage server and complete its initialization. Configure a hierarchical clustering algorithm so that, whenever the number of newly added unclassified pages reaches 1000, it classifies all unclassified pages and assigns them to the existing cluster categories.
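The batch trigger described in [0051] can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class name `UnclassifiedPageBuffer`, the `classify_batch` callback, and the in-memory list (standing in for the Redis store) are all assumptions made for the example. Only the threshold of 1000 comes from the text.

```python
from typing import Callable, List


class UnclassifiedPageBuffer:
    """Holds unclassified pages and hands them to a classifier in batches.

    Hypothetical helper: the source stores pages in Redis; a plain list is
    used here so the sketch stays self-contained.
    """

    def __init__(self, classify_batch: Callable[[List[str]], None],
                 threshold: int = 1000) -> None:
        self.classify_batch = classify_batch  # assumed clustering entry point
        self.threshold = threshold            # 1000 in the embodiment
        self.pending: List[str] = []

    def add(self, page_text: str) -> bool:
        """Queue one page; return True if this call triggered classification."""
        self.pending.append(page_text)
        if len(self.pending) >= self.threshold:
            # Swap out the pending list before classifying so the buffer
            # is empty again once the batch has been handed over.
            batch, self.pending = self.pending, []
            self.classify_batch(batch)
            return True
        return False
```

Passing the classifier in as a callback keeps the trigger logic independent of whichever hierarchical clustering routine is actually used.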

[0052] 2. Start page settings

[0053] Establish a Queuelib structure as the Frontier URL Queue, and input the initial URL addresses, such as www.yn.csg.cn, www.csg.cn and www.sgcc.com.cn, into the Frontier URL Queue. The pages acquired from these three addresses have not been clustered, so their weight values are set to

[0054] 3. Text collector page resource acquisition

[0055] From the frontier page library, dequeue page addresses on the principle that the address with the largest weight leaves the queue first, then fetch the page resource, extract the URL addresses in the page, and ...
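The largest-weight-first dequeue rule in [0055] can be sketched as a priority queue. The source names Queuelib for the Frontier URL Queue; the sketch below swaps in Python's standard `heapq` module instead so that it is self-contained. The class name `WeightedFrontier` and the duplicate filter are assumptions, and the seed weight itself is not given in the text, so any weight passed to `push` is a placeholder.

```python
import heapq
import itertools
from typing import List, Set, Tuple


class WeightedFrontier:
    """Frontier URL queue that always pops the highest-weight URL first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._order = itertools.count()  # tie-breaker: earlier pushes pop first
        self._seen: Set[str] = set()     # assumed de-duplication of URLs

    def push(self, url: str, weight: float) -> None:
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the weight is negated to pop the largest.
        heapq.heappush(self._heap, (-weight, next(self._order), url))

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)
```

URLs extracted from a fetched page would be pushed back with whatever weight the current policy assigns, so the next `pop` always reflects the latest priorities.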

Abstract

The invention discloses an unstructured text acquisition method supporting user policy configuration. The method comprises the following steps: text collector storage initialization, text collector seed address initialization, text collector page resource acquisition, page analysis and storage, hierarchical clustering of page text content, feedback on the clustering of text data, real-time / quasi-real-time user policy configuration, and text collector response to user feedback. With this method, the selection strategy of the Web text collection system, namely the web crawler, can be dynamically adjusted by evaluating the crawled resources. The method enables more efficient text data collection and the construction of a high-quality text data resource pool within a specific organization. An information resource pool can be established for feature-rich text data within a very short period of time, crawler efficiency is improved, and the information collection period is shortened.
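One way to picture the feedback loop in the abstract is a policy object whose per-cluster weights the user can change while the crawler runs. The sketch below is an assumption-laden illustration, not the method claimed in the patent: `CrawlPolicy`, `apply_feedback`, `weight_for`, and the cluster label used in the usage note are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CrawlPolicy:
    """User-configurable selection policy (hypothetical structure)."""

    default_weight: float = 1.0  # assumed weight for unclustered pages
    cluster_weights: Dict[str, float] = field(default_factory=dict)

    def apply_feedback(self, cluster: str, weight: float) -> None:
        """Record a user's real-time / quasi-real-time weight for a cluster."""
        self.cluster_weights[cluster] = weight

    def weight_for(self, cluster: Optional[str]) -> float:
        """Weight for URLs found on a page assigned to the given cluster."""
        if cluster is None:
            return self.default_weight
        return self.cluster_weights.get(cluster, self.default_weight)
```

For instance, after a user reviews the clustering feedback and calls `policy.apply_feedback("tender-notices", 3.0)`, URLs discovered on pages in that (made-up) cluster would enter the frontier with weight 3.0 and be fetched ahead of default-weight pages.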

Description

Technical field

[0001] The present application relates to an information collection and acquisition method, and in particular to a web-based unstructured text acquisition method that supports real-time / quasi-real-time policy configuration by users. The method can be used to acquire and aggregate unstructured text data in the power industry, laying the foundation for the unified management of unstructured text data, and can be applied to scenarios such as unified management of information resources and knowledge management within an organization.

Background technique

[0002] Unstructured text data is an extremely important information resource within an organization. Managing it effectively enables rapid retrieval, analysis and mining of information resources, provides data and information support for daily office work, management, coordination, supervision, decision-making and other activities, reduces daily operating costs, and accumulates and forms...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/951, G06F16/35

Inventor: 张新阳, 李辉, 保富

Owner: 云南电网有限责任公司信息中心
