Theme webpage data capturing method and device, equipment and storage medium

The technology concerns capturing webpage data for a given topic and is applied in the computer field. It addresses slow crawling of webpage data, search strategies that ignore webpage text content, and coarse retrieval results, with the stated effect of improving search accuracy and search efficiency.

Pending Publication Date: 2021-09-28
RUN TECH CO LTD BEIJING

AI Technical Summary

Problems solved by technology

[0003] Traditional search engines return only coarse search results. Search strategies based on web page content evaluation often ignore the link relationships between web pages, while strategies based on link analysis ignore the text content of web pages, which easily leads to "topic drift".
[0004] Traditional search strategies also suffer from inaccurate automatic search and slow crawling of web page data.

Examples

Embodiment 1

[0018] Figure 1 is a schematic flowchart of a method for capturing theme webpage data provided by an embodiment of the present invention. The method can be executed by an apparatus for capturing theme webpage data; the apparatus can be implemented in software and/or hardware and can generally be integrated into computer equipment such as a server. As shown in Figure 1, the method includes:

[0019] S110. Determine a target topic according to the search content input by the user, and select a link to be captured from a queue of links to be captured corresponding to the target topic based on a preset search strategy.

[0020] Determining the target topic according to the search content input by the user can be understood as follows: the search content is the text information the user enters when searching on a search engine, and the target topic is determined from that text information. The text information can be used directly as the target topic, or the corr...
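Step S110 can be illustrated with a minimal sketch of a per-topic queue of links to be captured. The excerpt does not say which preset search strategy is used, so the sketch assumes a best-first strategy in which each link carries a priority score. The class name `TopicFrontier`, its methods and the `priority` parameter are hypothetical and do not come from the patent.

```python
import heapq
from collections import defaultdict
from itertools import count


class TopicFrontier:
    """Per-topic queues of links waiting to be captured (hypothetical helper)."""

    def __init__(self):
        self._queues = defaultdict(list)  # topic -> heap of (-priority, seq, url)
        self._seen = defaultdict(set)     # topic -> urls already queued
        self._seq = count()               # tie-breaker that preserves insertion order

    def add(self, topic, url, priority=0.0):
        """Queue url under topic; return False if it was already queued."""
        if url in self._seen[topic]:
            return False
        self._seen[topic].add(url)
        heapq.heappush(self._queues[topic], (-priority, next(self._seq), url))
        return True

    def next_link(self, topic):
        """Assumed best-first strategy: pop the highest-priority link, or None."""
        queue = self._queues.get(topic)
        if not queue:
            return None
        return heapq.heappop(queue)[2]
```

Swapping the heap for a `collections.deque` would give a breadth-first strategy instead; which of these (if either) the patent intends is not stated in the text shown here.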

Embodiment 2

[0035] This embodiment refines the preceding embodiments by optimizing the step of obtaining the webpage content corresponding to the link to be captured. The step includes: simulating a client to send an access request for the link to be captured to the corresponding server, and downloading the webpage file for that link according to the received access response; then parsing the webpage file to extract its webpage content, where the webpage content includes link information and text information. The advantage of this arrangement is that downloading the webpage file for the link to be captured allows the corresponding webpage content to be parsed accurately.
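The download-and-parse step described above might look like the following sketch, which uses only the Python standard library. It assumes that "simulating the client" means sending an HTTP GET with a browser-like User-Agent header and that the webpage file is HTML; the function names, the header value and the timeout are illustrative assumptions, not details from the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def download_page(url, timeout=10.0):
    """Send a browser-like GET request and return the decoded page body."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (sketch)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class _PageParser(HTMLParser):
    """Collects hyperlinks and visible text from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (link information, text information) extracted from html."""
    parser = _PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

Keeping the download and the parsing in separate functions lets the parser be exercised on stored HTML without any network access.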

[0036] Further, the step of screening target links from the links to be crawled according to the content relevance and link relevance includes: for all the links to be crawled, determining the content relevance according...
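Because the paragraph above is cut off, the way the patent computes content relevance and link relevance is not visible here. The following is therefore only an illustrative sketch: it assumes that both scores are simple term-overlap ratios, that content relevance is measured on the page text and link relevance on the anchor text, and that a link is kept only when both scores reach a threshold. All names, thresholds and the scoring function are hypothetical.

```python
import re


def term_overlap(topic, text):
    """Fraction of distinct topic terms that occur in text (placeholder score)."""
    topic_terms = set(re.findall(r"\w+", topic.lower()))
    if not topic_terms:
        return 0.0
    text_terms = set(re.findall(r"\w+", text.lower()))
    return len(topic_terms & text_terms) / len(topic_terms)


def screen_links(topic, candidates, content_min=0.5, link_min=0.5):
    """Keep links whose content and link relevance both reach their thresholds.

    candidates: iterable of (url, anchor_text, page_text) tuples.
    The thresholds and the AND combination are assumptions for illustration.
    """
    targets = []
    for url, anchor_text, page_text in candidates:
        content_rel = term_overlap(topic, page_text)
        link_rel = term_overlap(topic, anchor_text)
        if content_rel >= content_min and link_rel >= link_min:
            targets.append((url, content_rel, link_rel))
    # Most relevant first, ranked by the sum of the two scores.
    targets.sort(key=lambda item: item[1] + item[2], reverse=True)
    return targets
```

Requiring both scores to pass reflects the stated goal of combining webpage content with webpage links: a page that matches on text alone, or on links alone, is filtered out.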

Embodiment 3

[0074] Figure 3 is a structural block diagram of a theme webpage data capturing apparatus provided by an embodiment of the present invention. The apparatus can be implemented in software and/or hardware, can generally be integrated into computer equipment such as a server, and can capture theme webpage data by executing the theme webpage data capturing method. As shown in Figure 3, the apparatus includes a to-be-captured link selection module 31, a webpage content acquisition module 32 and a target link screening module 33, wherein:

[0075] The to-be-captured link selection module 31 is configured to determine the target topic according to the search content input by the user, and to select the link to be captured from the queue of links to be captured corresponding to the target topic based on a preset search strategy;

[0076] The webpage content acquisition module 32 is configured to acquire the webpage content corresponding to the link to be captured;

[0077] A target li...
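One way to picture how the three modules of Figure 3 cooperate is the sketch below, in which each module is an injected callable. The class name `TopicCaptureDevice`, the `run` method, the `max_pages` limit and the choice to use the trimmed query text directly as the target topic are assumptions made for illustration; the patent text shown here does not specify the wiring.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class TopicCaptureDevice:
    """Wires the three modules together; each one is an injected callable."""

    select_link: Callable[[str], Optional[str]]            # role of module 31
    fetch_content: Callable[[str], str]                    # role of module 32
    screen_targets: Callable[[str, Iterable], List[str]]   # role of module 33

    def run(self, search_content: str, max_pages: int = 10) -> List[str]:
        # Simplest case: the query text itself is taken as the target topic.
        topic = search_content.strip()
        fetched = []
        for _ in range(max_pages):
            url = self.select_link(topic)
            if url is None:  # queue for this topic is exhausted
                break
            fetched.append((url, self.fetch_content(url)))
        # The screened target links are what gets fed back as the search result.
        return self.screen_targets(topic, fetched)
```

Injecting the modules as callables keeps the software-or-hardware split open: any component that honours the same call signature can stand in for a module.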

Abstract

The embodiments of the invention disclose a theme webpage data capturing method, apparatus, device and storage medium. The method comprises: determining a target topic according to search content input by a user, and selecting a link to be captured from a queue of links to be captured corresponding to the target topic based on a preset search strategy; obtaining the webpage content corresponding to the link to be captured; and screening target links from the links to be captured according to content relevance and link relevance, and feeding back the target links as the search result. Because this scheme combines webpage content with webpage links and evaluates both content relevance and link relevance before screening target links from the links to be captured, it is stated to improve both search accuracy and search efficiency.

Description

Technical field

[0001] The embodiments of the present invention relate to the field of computer technology, and in particular to a method, apparatus, device and storage medium for capturing theme webpage data.

Background

[0002] The Internet is a huge collection of data, and the volume of network information resources is growing exponentially. How to effectively divide this data into relevant and irrelevant data according to the user's search query, and display the relevant data, is a current research direction.

[0003] Traditional search engines return only coarse search results. Search strategies based on web page content evaluation often ignore the link relationships between web pages, while strategies based on link analysis ignore the text content of web pages, which easily leads to "topic drift".

[0004] The traditional search strategy has the problems of inaccurat...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/9532
CPC: G06F16/951, G06F16/9532, Y02D10/00
Inventors: 史延涛, 谢永恒, 火一莽
Owner: RUN TECH CO LTD BEIJING