Topic web crawler method and system based on text classification

A web crawler and text classification technology, applied in the fields of artificial intelligence and deep learning, that addresses problems such as high memory usage, information coverage below 50%, and poor support for personalized needs, and improves the degree of topic matching.

Pending Publication Date: 2022-01-11
BEIJING INSTITUTE OF TECHNOLOGY +1

AI Technical Summary

Problems solved by technology

[0002] The rapid growth of the network has brought people extremely rich information, but it has also made retrieving that information far more challenging. Even very large-scale search engines such as Baidu and Google are also ve...

Examples

Embodiment Construction

[0038] To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

[0039] The topic web crawler method based on text classification proposed by the present invention, as shown in Figure 1, includes the following steps:

[0040] S1. Receive the topic and initialize the URL task queue;

[0041] S2. Take out a crawler task from the URL task queue to obtain the content of the network ...

Abstract

The invention provides a topic web crawler method based on text classification. The method comprises the following steps: S1, receiving a topic and initializing a URL task queue; S2, taking a crawler task out of the URL task queue and obtaining the content of the network document; S3, obtaining the classification of the network document content; S4, judging whether the classification is the same as the topic; if it is, extracting the URLs in the network document and adding them to the URL task queue; if it is different and a task remains in the URL task queue, executing step S2; and S5, repeating steps S2-S4 until no task remains in the URL task queue. According to the invention, the classification model can learn the features of web text, which can effectively improve the accuracy of the classification task.
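The loop in steps S1-S5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `crawl_topic` and the injected callables `fetch`, `classify`, and `extract_urls` are hypothetical stand-ins for the HTTP download, the trained text classification model, and the link extractor. The `seen` set that prevents a URL from being queued twice is also an assumption; the abstract does not mention deduplication.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_topic(
    topic: str,
    seed_urls: Iterable[str],
    fetch: Callable[[str], str],
    classify: Callable[[str], str],
    extract_urls: Callable[[str], List[str]],
) -> List[str]:
    """Return the URLs whose documents were classified as `topic`."""
    queue = deque(seed_urls)          # S1: initialize the URL task queue
    seen = set(queue)                 # assumption: never queue a URL twice
    matched = []
    while queue:                      # S5: repeat until no task remains
        url = queue.popleft()         # S2: take out one crawler task
        text = fetch(url)             #     and obtain the document content
        label = classify(text)        # S3: classify the content
        if label != topic:            # S4: off-topic, move on to the next task
            continue
        matched.append(url)
        for link in extract_urls(text):   # S4: on-topic, enqueue its URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return matched


# Toy in-memory "web" standing in for real HTTP fetching and a trained model.
# Each entry maps a page key to (its class label, its outgoing links).
PAGES = {
    "a": ("sports", ["b", "c"]),
    "b": ("sports", ["a", "d"]),
    "c": ("finance", ["e"]),
    "d": ("sports", []),
    "e": ("sports", []),
}

result = crawl_topic(
    topic="sports",
    seed_urls=["a"],
    fetch=lambda url: url,                     # the "document" is just its key
    classify=lambda text: PAGES[text][0],
    extract_urls=lambda text: PAGES[text][1],
)
print(result)
```

In this toy run, page `e` is on-topic but is never visited, because it is reachable only through the off-topic page `c`, whose links are not followed. That is the trade-off of the S4 rule: the crawler stays focused at the cost of missing on-topic pages that sit behind off-topic ones.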

Description

Technical field

[0001] The invention relates to the fields of artificial intelligence and deep learning, and in particular to a topic web crawler method and system based on text classification.

Background

[0002] The rapid growth of the network has brought people extremely rich information, but it has also made retrieving that information far more challenging. Even very large-scale search engines such as Baidu and Google struggle with this, and their information coverage is less than 50%. Moreover, search server resources lag far behind the growth rate of the network. If traditional information crawling methods are still used, the coverage of information retrieval will only keep shrinking.

[0003] General search engines use crawler programs to retrieve websites. Large search engines that serve all users, such as Google and Baidu, use seed pages as the starting point of the search and try to traverse the entire net...

Application Information

IPC(8): G06F16/35, G06F16/951, G06F16/955
CPC: G06F16/35, G06F16/951, G06F16/955
Inventor: 逄金辉, 徐叶强, 王纪津
Owner: BEIJING INSTITUTE OF TECHNOLOGY