Topic web crawler method and system based on text classification

A web crawler and text classification technology, applied in the fields of artificial intelligence and deep learning, that addresses problems such as high memory usage, information coverage below 50%, and poor support for personalized needs, and improves the degree of topic matching.

Pending Publication Date: 2022-01-11

BEIJING INSTITUTE OF TECHNOLOGY +1

Cites: 0; Cited by: 0


AI Technical Summary

Problems solved by technology

[0002] The rapid growth of the web has brought people extremely rich information, but it has also made retrieving that information very challenging. Even very large search engines such as Baidu and Google struggle, with information coverage below 50%.

For a specific topic, moreover, general search engines suffer from heavy information redundancy, high memory usage, high consumption of system resources, low precision, and poor support for personalized needs.

Embodiment Construction

[0038] To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0039] The topic web crawler method based on text classification proposed by the present invention, as shown in Figure 1, includes the following steps:

[0040] S1. Receive the topic and initialize the URL task queue;

[0041] S2. Take a crawler task from the URL task queue and obtain the content of the network ...

Abstract

The invention provides a topic web crawler method based on text classification. The method comprises the following steps: S1, receiving a topic and initializing a URL task queue; S2, taking a crawler task from the URL task queue and obtaining the network document content; S3, obtaining the classification of the network document content; S4, judging whether the classification is the same as the topic: if it is, extracting the URLs in the network document and adding them to the URL task queue; if it is not and tasks remain in the URL task queue, executing step S2; and S5, repeating steps S2-S4 until no task remains in the URL task queue. According to the invention, the classification model can learn the features of web text, which can effectively improve the accuracy of the classification task.
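The loop in steps S1-S5 can be illustrated with a minimal Python sketch. It is not the patent's implementation. The names `topic_crawl`, `LinkExtractor`, `fetch`, `classify`, and `toy_classify`, the `seen` set, the `max_pages` cap, and the in-memory `PAGES` table are all assumptions introduced for illustration. The keyword test stands in for the classification model, whose details this listing does not give.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def topic_crawl(topic, seed_urls, fetch, classify, max_pages=100):
    """Run steps S1-S5 and return the URLs of documents classified as `topic`.

    fetch(url) -> str and classify(text) -> str are caller-supplied
    (hypothetical) callables; max_pages is a safety cap not in the source.
    """
    queue = deque(seed_urls)            # S1: initialize the URL task queue
    seen = set(seed_urls)               # assumption: never queue a URL twice
    on_topic = []
    pages = 0
    while queue and pages < max_pages:  # S5: repeat until no task remains
        url = queue.popleft()           # S2: take out one crawler task
        pages += 1
        try:
            document = fetch(url)       # S2: obtain the document content
        except Exception:
            continue                    # assumption: skip unreachable pages
        label = classify(document)      # S3: classify the document content
        if label != topic:              # S4: off-topic, move to the next task
            continue
        on_topic.append(url)
        parser = LinkExtractor(url)
        parser.feed(document)
        for link in parser.links:       # S4: on-topic, enqueue the page's URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return on_topic


# Toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.test/a": '<a href="/b">b</a> <a href="/c">c</a> crawler queue',
    "http://example.test/b": '<a href="/d">d</a> recipe for soup',
    "http://example.test/c": 'crawler scheduling <a href="/a">back</a>',
    "http://example.test/d": "crawler page reachable only through b",
}


def toy_classify(text):
    # Stand-in for the trained text classifier described in the abstract.
    return "crawler" if "crawler" in text else "other"


result = topic_crawl("crawler", ["http://example.test/a"],
                     PAGES.__getitem__, toy_classify)
print(result)
```

In this toy run, page `b` is classified as off-topic, so its link to `d` is never queued. That is the pruning behaviour step S4 describes: only documents whose classification matches the topic contribute new URLs to the task queue.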

Description

Technical field

[0001] The invention relates to the fields of artificial intelligence and deep learning, and in particular to a topic web crawler method and system based on text classification.

Background

[0002] The rapid growth of the web has brought people extremely rich information, but it has also made retrieving that information very challenging. Even very large search engines such as Baidu and Google struggle, with information coverage below 50%. Search server resources also lag far behind the growth of the web. If traditional crawling methods continue to be used, retrieval coverage will only keep shrinking.

[0003] General search engines such as Google and Baidu, which serve all users, use crawler programs to retrieve websites: they take seed pages as the starting point of the search and try to traverse the entire net...

Application Information

IPC(8): G06F16/35, G06F16/951, G06F16/955

CPC: G06F16/35, G06F16/951, G06F16/955

Inventor: 逄金辉, 徐叶强, 王纪津

Owner: BEIJING INSTITUTE OF TECHNOLOGY
