Theme webpage data capturing method and device, equipment and storage medium
A technique for capturing topic-related webpage data, applied in the computer field. It addresses slow crawling of webpage data, neglect of webpage text content, and coarse retrieval results, and improves search accuracy and search efficiency.
Examples
Embodiment 1
[0018] Figure 1 is a schematic flowchart of a method for capturing theme webpage data provided by an embodiment of the present invention. The method can be executed by a device for capturing theme webpage data, wherein the device can be implemented in software and/or hardware and can generally be integrated into computer equipment such as a server. As shown in Figure 1, the method includes:
[0019] S110. Determine a target topic according to the search content input by the user, and select a link to be captured from a queue of links to be captured corresponding to the target topic based on a preset search strategy.
[0020] Determining the target topic according to the search content input by the user can be understood as follows: the search content is the text information the user enters when searching on a search engine, and the target topic is determined from that text information. The text information can be directly determined as the target topic, or the corr...
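As a non-limiting illustration of step S110, the following minimal sketch keeps one queue of links to be captured per topic and selects the next link according to a configurable strategy. The class `TopicFrontier`, the function `determine_topic`, and the two strategy names (`breadth_first`, `best_first`) are hypothetical and do not come from the embodiment, which only states that a preset search strategy is used. The sketch covers only the simplest case described above, in which the entered text is used directly as the target topic.

```python
import heapq
from collections import deque


def determine_topic(search_content):
    """Simplest case from the text: use the entered text directly as the topic."""
    return search_content.strip()


class TopicFrontier:
    """Hypothetical per-topic queues of links waiting to be captured."""

    def __init__(self, strategy="breadth_first"):
        if strategy not in ("breadth_first", "best_first"):
            raise ValueError("unknown strategy: " + strategy)
        self.strategy = strategy
        self._queues = {}
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def add(self, topic, url, score=0.0):
        """Enqueue a link for a topic; score is only used by best_first."""
        if self.strategy == "breadth_first":
            self._queues.setdefault(topic, deque()).append(url)
        else:
            heap = self._queues.setdefault(topic, [])
            # heapq is a min-heap, so the score is negated to pop the highest first
            heapq.heappush(heap, (-score, self._counter, url))
            self._counter += 1

    def select(self, topic):
        """Return the next link to capture for the topic, or None if none remain."""
        queue = self._queues.get(topic)
        if not queue:
            return None
        if self.strategy == "breadth_first":
            return queue.popleft()
        return heapq.heappop(queue)[2]
```

With `breadth_first`, links are selected in the order they were discovered; with `best_first`, the link with the highest assumed score is selected first.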
Embodiment 2
[0035] This embodiment is optimized on the basis of the above embodiments. The step of obtaining the webpage content corresponding to the link to be captured is refined to include: simulating a client to send an access request corresponding to the link to be captured to the corresponding server, and downloading the webpage file corresponding to the link to be captured according to the received access response; and parsing the webpage file to extract the webpage content in the webpage file, wherein the webpage content includes link information and text information. The advantage of this arrangement is that, by downloading the webpage file corresponding to the link to be captured, the corresponding webpage content can be parsed accurately.
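The download-and-parse step described above can be sketched as follows, using only the Python standard library. This is an illustrative assumption rather than the implementation of the embodiment: the function names `download_page` and `parse_page`, the `User-Agent` string used to simulate a client, and the choice to skip `script` and `style` content are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class PageContentParser(HTMLParser):
    """Collects link information and text information from a webpage file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the address of the page
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Parse a downloaded webpage file into link and text information."""
    parser = PageContentParser(base_url)
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "text": " ".join(parser.text_parts)}


def download_page(url, timeout=10.0):
    """Simulate a client request and return the decoded webpage file."""
    # The User-Agent value is an assumed placeholder, not taken from the source.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; ExampleCrawler/0.1)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In use, the output of `download_page(url)` would be passed to `parse_page(html, url)`; the two are kept separate so that parsing can be exercised without network access.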
[0036] Further, the step of screening target links from the links to be captured according to the content relevance and link relevance includes: for all the links to be captured, determining the content relevance according...
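The passage above is truncated and does not show how the content relevance and link relevance are computed. Purely as a hypothetical stand-in, the following sketch scores both with a simple term-overlap measure and keeps the links whose weighted combination reaches a threshold. The functions `term_overlap` and `screen_target_links`, the weight, the threshold, and the candidate fields (`url`, `anchor`, `text`) are all assumptions made for illustration and should not be read as the formulas of the embodiment.

```python
import re


def _terms(text):
    """Lower-cased alphanumeric terms of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_overlap(topic, text):
    """Fraction of topic terms that also occur in the text (assumed measure)."""
    topic_terms = _terms(topic)
    if not topic_terms:
        return 0.0
    return len(topic_terms & _terms(text)) / len(topic_terms)


def screen_target_links(topic, candidates, content_weight=0.5, threshold=0.5):
    """Keep candidates whose combined relevance reaches the threshold.

    Each candidate is a dict with 'url', optional 'anchor', and 'text' keys.
    Returns (url, score) pairs sorted from most to least relevant.
    """
    targets = []
    for candidate in candidates:
        # Content relevance: topic terms found in the page text
        content_rel = term_overlap(topic, candidate["text"])
        # Link relevance: topic terms found in the URL and anchor text
        link_text = candidate["url"] + " " + candidate.get("anchor", "")
        link_rel = term_overlap(topic, link_text)
        score = content_weight * content_rel + (1 - content_weight) * link_rel
        if score >= threshold:
            targets.append((candidate["url"], score))
    targets.sort(key=lambda pair: pair[1], reverse=True)
    return targets
```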
Embodiment 3
[0074] Figure 3 is a structural block diagram of a theme webpage data capturing device provided in an embodiment of the present invention. The device can be implemented in software and/or hardware, can generally be integrated into computer equipment such as a server, and can capture theme webpage data by executing the theme webpage data capturing method. As shown in Figure 3, the device includes: a to-be-captured link selection module 31, a webpage content acquisition module 32 and a target link screening module 33, wherein:
[0075] The to-be-captured link selection module 31 is configured to determine the target topic according to the search content input by the user, and to select the link to be captured from the queue of links to be captured corresponding to the target topic based on a preset search strategy;
[0076] A webpage content acquisition module 32, configured to acquire the webpage content corresponding to the link to be captured;
[0077] A target li...


