Content extraction method based on keyword matching

A keyword-matching technique, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses the difficulty of webpage text extraction and the high error rate and high time complexity of existing methods. Its stated effects are objectivity and reasonableness, stability, high accuracy and good versatility.

Active Publication Date: 2017-10-03
GUILIN UNIV OF ELECTRONIC TECH

AI Technical Summary

Problems solved by technology

However, existing webpages come in many types with different structures, and websites are revised from time to time. At the same time, a large number of advertisements and other noise are inserted into webpages. These problems make it difficult to extract the text of a webpage.
Existing text extraction methods mainly include:
(1) Text extraction by analyzing the WLR and the node hierarchy of DOM tree nodes. This kind of method has high time complexity and low efficiency.
(2) Designing a tag-path feature system that distinguishes text from noise from different perspectives. On the basis of feature similarity analysis, a feature fusion strategy based on combined feature selection can extract text quickly and efficiently. However, this type of method is highly dependent on the structure of the website.
(3) Automatic information extraction, which extracts content from a web page based only on the page's own characteristics. This method has a relatively high error rate when extracting text from short-text web pages.



Examples


Embodiment Construction

[0041] The content of the present invention will be further elaborated below in conjunction with the accompanying drawings, but it is not intended to limit the present invention.

[0042] As shown in Figure 1, the text extraction method based on keyword matching of the present invention specifically includes the following steps:

[0043] (1) Webpage preprocessing: count and extract the keywords in the Keywords tag of the webpage source code, and build a standard library from those keywords; then use regular expressions to preprocess the webpage to be processed, removing obvious noise text to obtain a rough webpage;
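The preprocessing step above might be sketched as follows. This is a minimal illustration in Python using only the standard library, not the patent's own implementation. It assumes the "Keywords tag" is the HTML `<meta name="keywords">` element; the names `extract_keywords`, `rough_page` and the `NOISE` pattern are hypothetical, and the source does not say which noise patterns its regular expressions remove.

```python
import re
from html.parser import HTMLParser


class KeywordsMetaParser(HTMLParser):
    """Collects keywords from <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            # Assumption: keywords are separated by ASCII or full-width
            # commas or semicolons.
            parts = re.split(r"[,，;；]", attr.get("content") or "")
            self.keywords.extend(p.strip() for p in parts if p.strip())


def extract_keywords(html):
    """Return the page's declared keywords (the 'standard library')."""
    parser = KeywordsMetaParser()
    parser.feed(html)
    return parser.keywords


# Assumption: "obvious noise" means scripts, styles and HTML comments.
NOISE = re.compile(
    r"<script\b.*?</script\s*>|<style\b.*?</style\s*>|<!--.*?-->",
    re.DOTALL | re.IGNORECASE,
)


def rough_page(html):
    """Strip the assumed noise patterns to obtain the rough webpage."""
    return NOISE.sub("", html)
```

Keywords are read from the original source before noise removal, so the order of the two calls does not matter for pages whose meta tags sit outside script and comment blocks.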

[0044] (2) Build a DOM tree: use the Jsoup tool to parse the HTML of the rough webpage and obtain its data. The DOM uses a set of structured nodes and objects to represent the structure of the document; that is, each component in the document is defined as a node, so as to connect the webpage, scripting language and progr...
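The idea of representing each document component as a node can be illustrated with a small tree builder. The source uses Jsoup (a Java library); the sketch below swaps in Python's standard `html.parser` so that it stays self-contained, and the `Node`, `TreeBuilder` and `build_tree` names are hypothetical. It is far less tolerant of malformed HTML than a real parser such as Jsoup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element

    def full_text(self):
        """Own text followed by the text of all descendants."""
        return self.text + "".join(c.full_text() for c in self.children)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Note that `full_text` places an element's own text before its children's text, so it does not preserve the exact interleaving of text and inline elements; that is enough for counting keywords but not for reproducing the page text verbatim.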



Abstract

The invention discloses a content extraction method based on keyword matching. The keywords in the Keywords tag of a webpage's source code are counted, a standard library is built from those keywords, and a corresponding DOM tree is built. The DOM tree is traversed level by level and the number of keywords contained in each node is counted. The keyword weight of a node is calculated as the ratio of the node's keyword count to the keyword count of its parent node. The node containing the body text is then screened out and located by examining the maximum keyword weight among each node's children, which completes content extraction. Because short texts cannot be extracted effectively through keyword matching, a similarity matching method is also proposed: paragraph texts and the page title are converted into 8-bit binary data, and similarity is judged by Hamming distance, which achieves content extraction for short texts. The method matches against the keywords set in the webpage itself, needs neither data training nor sample learning, is not limited by website structure, and has good universality.
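The two mechanisms in the abstract, the parent-relative keyword weight and the Hamming-distance comparison, might look roughly like this. The sketch makes several assumptions that the source does not state: the `threshold` stopping rule in `locate_content_node`, the character-level simhash-style `fingerprint` used to obtain 8-bit values, and all names are hypothetical. The patent's actual decision rule and binary encoding may differ.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class Node:
    text: str = ""
    children: list = field(default_factory=list)

    def full_text(self):
        return self.text + "".join(c.full_text() for c in self.children)


def keyword_count(node, keywords):
    """Number of keyword occurrences in the node's subtree."""
    text = node.full_text()
    return sum(text.count(k) for k in keywords)


def locate_content_node(root, keywords, threshold=0.5):
    """Descend into the child with the largest keyword weight.

    weight(child) = keyword_count(child) / keyword_count(parent).
    Assumed stopping rule: stop once no child holds more than
    `threshold` of its parent's keywords.
    """
    node = root
    while node.children:
        parent_count = keyword_count(node, keywords)
        if parent_count == 0:
            break
        best = max(node.children, key=lambda c: keyword_count(c, keywords))
        if keyword_count(best, keywords) / parent_count <= threshold:
            break
        node = best
    return node


def fingerprint(text, bits=8):
    """Assumed simhash-style fingerprint: each character votes on each bit."""
    votes = [0] * bits
    for ch in text:
        h = hashlib.md5(ch.encode("utf-8")).digest()[0]
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")
```

For a short-text page, a paragraph would be kept when `hamming(fingerprint(paragraph), fingerprint(title))` falls below some cutoff; the cutoff value is not given in the available text.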

Description

Technical field

[0001] The invention relates to the technical field of text mining, in particular to a text extraction method based on keyword matching.

Background technique

[0002] With the rapid development of Web technology, web pages have become the main carrier of information release and information consumption. Therefore, in the monitoring of public opinion on the Internet, it is very important to strengthen the information filtering of web pages; and in the information filtering of web pages, the information extraction or text extraction of web pages becomes the key. However, there are many types of existing webpages, different webpages have different structures, and the website will be revised from time to time. At the same time, a large number of advertisements and other noises are inserted in the webpages. These problems make it difficult to extract the text of the webpage. Existing text extraction methods mainly include: (1) Realize text extraction by analyzing ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30, G06F17/27
CPC: G06F16/9577, G06F40/284
Inventor: 武小年, 孟川, 王青芝, 叶志博, 奚玉昂, 张润莲
Owner: GUILIN UNIV OF ELECTRONIC TECH