A Web network-based unstructured text acquisition method supporting user policy configuration
An unstructured text acquisition technology, applied in unstructured text data retrieval, network data indexing, network data retrieval, and related fields. It addresses the problems of inflexible web crawler policy configuration and inflexible data adjustment within that configuration, with the effect of saving cycles and improving crawler efficiency.
Examples
Embodiment 1
[0050] 1. Text collector storage initialization
[0051] Create a Redis storage server and complete its initialization. Set up a hierarchical clustering algorithm, configured so that whenever the number of newly added unclassified pages reaches 1000, it classifies all unclassified pages and assigns them to the existing cluster categories.
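The batch trigger in this step can be sketched as follows. This is only an illustration under stated assumptions: the class name `UnclassifiedBuffer`, the `classify` callback, and the in-memory list are hypothetical stand-ins. The embodiment itself stores pages in Redis and runs a hierarchical clustering algorithm, neither of which is reproduced here. Only the threshold of 1000 comes from the text.

```python
from typing import Callable, List

BATCH_SIZE = 1000  # threshold stated in the embodiment


class UnclassifiedBuffer:
    """Collects unclassified pages and hands them to a classifier in batches.

    Hypothetical helper: the in-memory list stands in for the Redis store,
    and `classify` stands in for the hierarchical clustering step.
    """

    def __init__(self, classify: Callable[[List[str]], None],
                 batch_size: int = BATCH_SIZE) -> None:
        self.classify = classify
        self.batch_size = batch_size
        self.pending: List[str] = []

    def add(self, page_id: str) -> bool:
        """Record a page; return True if this call triggered classification."""
        self.pending.append(page_id)
        if len(self.pending) >= self.batch_size:
            # Hand over the whole batch and start a fresh buffer.
            batch, self.pending = self.pending, []
            self.classify(batch)
            return True
        return False
```

With `batch_size` left at its default, the callback would fire on every 1000th added page and receive all pages accumulated since the previous run.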
[0052] 2. Start page settings
[0053] Establish a Queuelib structure as the Frontier URL Queue and enqueue the initial URL addresses, such as www.yn.csg.cn, www.csg.cn, and www.sgcc.com.cn. The pages retrieved from these three addresses have not been clustered, so their weight values are set to
[0054] 3. Text collector page resource acquisition
[0055] From the Frontier URL Queue, dequeue page addresses in order of largest weight first, obtain the page resource for each address, extract the URL addresses in the page, and ...
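A minimal sketch of the largest-weight-first dequeue and the URL extraction described above, under stated assumptions. The names `Frontier`, `LinkExtractor`, and `extract_links` are hypothetical, and the standard-library `heapq` is used here as a stand-in for the Queuelib structure named in the embodiment. How weights are computed is not shown, and fetching the page over the network is omitted; the HTML is passed in as a string.

```python
import heapq
import itertools
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class Frontier:
    """URL queue that releases the largest weight first (illustrative only)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._order = itertools.count()  # FIFO tie-breaker for equal weights

    def push(self, url: str, weight: float) -> None:
        # heapq is a min-heap, so the weight is negated to pop the maximum.
        heapq.heappush(self._heap, (-weight, next(self._order), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page address.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> List[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

In a crawl loop, each address returned by `pop()` would be fetched, and the addresses returned by `extract_links` would be pushed back with whatever weight the policy assigns them.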

