Method and device for extracting forum post content

A forum post content extraction technology, applied in the field of Internet information technology, that addresses the problem of extracting forum post content and achieves automatic extraction.

Inactive Publication Date: 2016-04-20
NEW FOUNDER HLDG DEV LLC +2
2 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0005] The present invention aims to provide a method and device for extracting forum post content, so as to solve the problem of automatically extracting forum post content.


Detailed Description of the Embodiments

[0014] The present invention will be described in detail below with reference to the accompanying drawings and in combination with embodiments.

[0015] Figure 1 is a flowchart of a method for extracting forum post content according to an embodiment of the present invention. The method includes:

[0016] Step S10, generating an HTML tag tree from the source code of the forum post;

[0017] Step S20, merging the tag subtrees in the HTML tag tree whose text rate is greater than the first threshold to obtain the largest candidate subtree. According to the results of multiple experiments, the first threshold is preferably set to 0.8;

[0018] Step S30, selecting from the largest candidate subtree all node clusters with similar structures; these clusters correspond to the posts on each floor (the individual posts in the thread);

[0019] Step S40, selecting from the node clusters those whose text rate is greater than the second threshold. According to the results of multiple experiments, preferably, setting the second threshold t...
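Steps S10 and S20 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the excerpt does not define "text rate" or the merge rule, so the sketch assumes the text rate of a subtree is its text characters divided by its text plus tag characters, and it assumes "merging" means taking the lowest common ancestor of the topmost subtrees that exceed the first threshold. The names `Node`, `TagTreeBuilder`, `text_rate` and `largest_candidate_subtree` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_len = 0    # characters of text directly inside this node
        self.markup_len = 0  # characters of this node's own start/end tags


class TagTreeBuilder(HTMLParser):
    """Step S10 (sketch): build a tag tree from the page's HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        node.markup_len = len(self.get_starttag_text() or "")
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # stray end tags are ignored
            node.markup_len += len(tag) + 3  # "</" + tag + ">"
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def totals(node):
    """Return (text characters, markup characters) for the whole subtree."""
    text, markup = node.text_len, node.markup_len
    for child in node.children:
        t, m = totals(child)
        text, markup = text + t, markup + m
    return text, markup


def text_rate(node):
    # ASSUMPTION: text rate = text / (text + markup); not defined in the excerpt.
    text, markup = totals(node)
    return text / (text + markup) if text + markup else 0.0


def high_text_subtrees(node, threshold):
    """Topmost subtrees whose text rate exceeds the threshold."""
    found = []
    for child in node.children:
        if text_rate(child) > threshold:
            found.append(child)
        else:
            found.extend(high_text_subtrees(child, threshold))
    return found


def path_from_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path[::-1]


def largest_candidate_subtree(root, first_threshold=0.8):
    """Step S20 (sketch): ASSUMES merging = lowest common ancestor of the hits."""
    hits = high_text_subtrees(root, first_threshold)
    if not hits:
        return None
    common = None
    for level in zip(*(path_from_root(h) for h in hits)):
        if all(n is level[0] for n in level):
            common = level[0]
        else:
            break
    return common


SAMPLE = (
    '<html><body>'
    '<div id="nav"><a href="/">Home</a></div>'
    '<div id="thread">'
    '<div><p>First floor: this is a fairly long post body used for the demo.</p></div>'
    '<div><p>Second floor: another long reply with plenty of plain text in it.</p></div>'
    '</div></body></html>'
)
candidate = largest_candidate_subtree(build_tag_tree(SAMPLE), 0.8)
```

In this sample the two `<p>` elements are the only subtrees above 0.8, so the candidate is the enclosing `div` with `id="thread"`; the navigation block is left out because its text rate is low.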

Abstract

The invention provides a method for extracting forum post content, comprising: generating a hypertext markup language (HTML) tag tree from the source code of a forum post; merging the tag subtrees in the HTML tag tree whose text rate is greater than a first threshold to obtain a largest candidate subtree; selecting all node clusters with similar structures from the largest candidate subtree; selecting, from those node clusters, the node clusters whose text rate is greater than a second threshold; and extracting the text content of the selected node clusters. The invention further provides a device for extracting forum post content. The method and device achieve automatic extraction of forum post content.
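The last three operations in the abstract (selecting node clusters with similar structures, filtering them by text rate, and extracting their text) can be sketched on a small in-memory tree. Several things here are assumptions rather than details from the source: "similar structure" is taken to mean an identical nested tag signature, the text rate of a cluster is taken to be its text characters divided by text plus an estimated tag length, and the second threshold is passed in explicitly because its value is not given in this excerpt. The names `Node`, `signature`, `similar_clusters` and `extract_posts` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def signature(node):
    """Nested tag shape of a subtree; ASSUMED stand-in for 'similar structure'."""
    return (node.tag, tuple(signature(c) for c in node.children))


def totals(node):
    """Return (text characters, estimated markup characters) for a subtree."""
    text = len(node.text.strip())
    markup = 2 * len(node.tag) + 5  # length of "<tag></tag>" without attributes
    for child in node.children:
        t, m = totals(child)
        text, markup = text + t, markup + m
    return text, markup


def cluster_text_rate(cluster):
    # ASSUMPTION: cluster text rate = total text / (total text + total markup).
    text = sum(totals(n)[0] for n in cluster)
    markup = sum(totals(n)[1] for n in cluster)
    return text / (text + markup)


def all_text(node):
    parts = [node.text.strip()] + [all_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def similar_clusters(candidate):
    """Group the candidate's children by signature; keep repeated shapes."""
    groups = defaultdict(list)
    for child in candidate.children:
        groups[signature(child)].append(child)
    return [group for group in groups.values() if len(group) > 1]


def extract_posts(candidate, second_threshold):
    """Keep clusters above the threshold and return the text of their nodes."""
    posts = []
    for cluster in similar_clusters(candidate):
        if cluster_text_rate(cluster) > second_threshold:
            posts.extend(all_text(node) for node in cluster)
    return posts


def post(author, body):
    return Node("div", children=[Node("span", author), Node("p", body)])


thread = Node("div", children=[
    post("alice", "I think the tag tree approach works well here."),
    Node("hr"),
    post("bob", "Agreed, and the threshold filter removes the ads."),
    Node("hr"),
    Node("a", "Ad"),
    Node("a", "Ad"),
])

# 0.5 is a placeholder value chosen for this demo, not the patent's threshold.
posts = extract_posts(thread, second_threshold=0.5)
```

The separators and the repeated ad links also form structurally similar clusters, which is why the second filter matters: only the cluster of post nodes carries enough text to pass it.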

Description

Technical Field

[0001] The present invention relates to the field of Internet information technology, in particular to a method and device for extracting forum post content.

Background

[0002] With the popularization of Internet applications, online forums are booming, the number of forum users is increasing day by day, and the amount of data is growing explosively. Forums play an important role in the dissemination of public opinion, so applications such as forum data retrieval and mining are becoming more and more important. Correct extraction of web page data is the basis of these forum applications.

[0003] Currently, there are two ways to extract web page data: the first is to configure templates manually and use regular expressions to match the data; the second is to derive templates automatically from sample pages and then use those templates to match the data. The first method consumes a lot of manpower and r...
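For context, the first approach mentioned in the background (a manually configured template matched with regular expressions) might look like the sketch below. The markup, the class names `post`, `author` and `body`, and the function name `extract_with_template` are all hypothetical and chosen only for illustration.

```python
import re

# A hand-written template for one specific (hypothetical) forum layout.
POST_TEMPLATE = re.compile(
    r'<div class="post">\s*'
    r'<span class="author">(?P<author>.*?)</span>\s*'
    r'<p class="body">(?P<body>.*?)</p>\s*'
    r'</div>',
    re.DOTALL,
)


def extract_with_template(html):
    """Return (author, body) pairs for every block that matches the template."""
    return [
        (m.group("author").strip(), m.group("body").strip())
        for m in POST_TEMPLATE.finditer(html)
    ]


page = (
    '<div class="post"><span class="author">alice</span>'
    '<p class="body">Hello forum</p></div>'
)
matches = extract_with_template(page)

# The same content after a cosmetic redesign no longer matches the template.
redesigned = page.replace('class="post"', 'class="floor"')
matches_after_redesign = extract_with_template(redesigned)
```

A template of this kind has to be written for each forum and rewritten whenever the markup changes, which illustrates the manual effort the background refers to.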

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 张涛, 于晓明, 杨建武
Owner: NEW FOUNDER HLDG DEV LLC