Webpage content extraction method and system based on heuristic rule

Web content extraction based on heuristic rules, applied in fields such as website content management, network data retrieval, and character and pattern recognition, can solve problems such as low efficiency and the large amount of time spent maintaining and debugging code, thereby saving labor costs and achieving strong universality.

Pending Publication Date: 2020-12-29
ONE CONNECT SMART TECH CO LTD SHENZHEN

AI Technical Summary

Problems solved by technology

Manually writing network data collection programs often requires a different program for each URL, which is inefficient and demands considerable time to maintain and debug the code.


Examples

Embodiment 1

[0055] Referring to Figure 1, a flow chart of the steps of the method for extracting web page content based on heuristic rules according to Embodiment 1 of the present invention is shown. It can be understood that the flowchart in this method embodiment does not limit the order in which the steps are executed. An exemplary description is given below with the computer device 2 as the execution subject. The details are as follows.

[0056] Step S100, receiving a target URL of webpage content to be extracted, and obtaining a target webpage source code corresponding to the target URL according to the target URL.

[0057] Specifically, the user opens the input page and enters a target URL in the corresponding input box to open the target webpage, or opens a target webpage directly from the browser homepage. After the extraction request instruction for the opened target webpage is received, the source code corresponding to the target webpage, that is, the target webpage source code, is obtained.
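
Step S100 can be illustrated with a minimal Python sketch. It is not the implementation described in the patent: the function name `fetch_target_source`, the `opener` parameter, the 10-second timeout, and the fallback to UTF-8 are all assumptions made for illustration. The sketch only shows the general idea of receiving a target URL and returning the corresponding page source as text.

```python
import urllib.request
from urllib.parse import urlparse


def fetch_target_source(target_url, opener=urllib.request.urlopen, timeout=10):
    """Return the source code of the page at target_url (sketch of step S100).

    `opener` is a hypothetical hook so the network call can be replaced;
    by default the standard library's urlopen is used.
    """
    parsed = urlparse(target_url)
    if parsed.scheme not in ("http", "https"):
        # Assumption: only ordinary web URLs are accepted as target URLs.
        raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")

    with opener(target_url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header when present;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"

    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(charset, errors="replace")
```

A call such as `fetch_target_source("https://example.com/")` would return the page source as a string, which the later steps could then process.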

...

Embodiment 2

[0113] Continuing to refer to Figure 7, a schematic diagram of the program modules of Embodiment 2 of the system for extracting web page content based on heuristic rules of the present invention is shown. In this embodiment, the web page content extraction system 20 based on heuristic rules may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the present invention and implement the above method for extracting webpage content based on heuristic rules. The program module referred to in the embodiment of the present invention is a series of computer program instruction segments capable of completing specific functions, and is better suited than the program itself to describing the execution process of the heuristic rule-based web page content extraction system 20 in the storage medium. The following description will specifically introduce the functions of each ...

Embodiment 3

[0135] Referring to Figure 8, a schematic diagram of the hardware architecture of the computer device according to Embodiment 3 of the present invention is shown. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers), and the like. As shown in Figure 8, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, a network interface 23, and a web content extraction system 20 based on heuristic rules, which can communicate with one another through a system bus. Specifically:

[0136] In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, ...

Abstract

The invention discloses a webpage content extraction method and system based on heuristic rules. The method comprises the steps of: receiving a target URL of webpage content to be extracted, and obtaining the target webpage source code corresponding to the target URL; extracting from the target webpage source code based on a preset heuristic rule to obtain the webpage text source code; traversing the webpage text source code to obtain a target tag and the tag content corresponding to the target tag; and, according to the tag attribute corresponding to the target tag, storing the target tag content in a data table and sending the target tag content to a front end for display. With this method, webpage extraction can be carried out rapidly and efficiently, and the universality is high.
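
The traversal and storage steps summarized in the abstract can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented method: the abstract does not spell out the preset heuristic rule, so the sketch substitutes a placeholder that simply ignores `script` and `style` content. The set of target tags, the use of the `class` attribute as the stored tag attribute, the table name `tag_content`, and the choice of SQLite are likewise hypothetical.

```python
import sqlite3
from html.parser import HTMLParser

TARGET_TAGS = {"title", "h1", "p"}   # hypothetical choice of target tags
SKIPPED_TAGS = {"script", "style"}   # placeholder rule, not the patent's


class TagCollector(HTMLParser):
    """Collect (tag, class attribute, text) for each closed target tag."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._stack = []       # open target tags: [tag, class_attr, parts]
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in TARGET_TAGS and not self._skip_depth:
            self._stack.append([tag, dict(attrs).get("class") or "", []])

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif self._stack and self._stack[-1][0] == tag:
            name, class_attr, parts = self._stack.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            if text:
                self.rows.append((name, class_attr, text))

    def handle_data(self, data):
        if self._stack and not self._skip_depth:
            self._stack[-1][2].append(data)


def extract_to_table(source, conn):
    """Traverse the page source and store the collected rows in a table."""
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag_content "
        "(tag TEXT, class_attr TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO tag_content VALUES (?, ?, ?)", collector.rows
    )
    conn.commit()
    return collector.rows
```

Calling `extract_to_table(source, sqlite3.connect(":memory:"))` returns the collected rows and leaves them in the `tag_content` table, from which a front end could read them for display.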

Description

Technical field

[0001] The embodiments of the present invention relate to the field of webpage processing, and in particular to a method and system for extracting webpage content based on heuristic rules.

Background technique

[0002] At present, with the rapid development of the Internet, network information resources are growing exponentially. Web pages gather a large amount of valuable data from various industries and fields, such as resumes, corporate information, intellectual property, and product information, which can be of great help to applications such as retrieval and data mining. How to analyze pages more conveniently and accurately and extract valuable data has become an important research issue.

[0003] Existing website content extraction is mainly divided into two categories: the first category is news-type websites, and news-type website pages generally include title, time, author, and large text descriptions; for news-type website pages, exist...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/957, G06F16/958, G06K9/62, G06F16/35
CPC: G06F16/9577, G06F16/986, G06F16/35, G06F18/2323, G06F18/23
Inventor: 周威, 王大伟
Owner: ONE CONNECT SMART TECH CO LTD SHENZHEN