Method for automatically extracting BBS (bulletin board system) data

An automatic forum data extraction technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc., that addresses poor adaptability to web page structure changes and the inability to automatically extract data from large-scale websites.

Inactive Publication Date: 2013-06-05
NINGBO CHENGDIAN TAIKE ELECTRONICS INFORMATION TECH DEV
6 Cites, 16 Cited by

AI Technical Summary

Problems solved by technology

[0009] Existing forum data processing methods cannot effectively automate data extraction from large-scale websites and adapt poorly to changes in web page structure. To solve these problems, a method for automatically extracting forum data is proposed.


Examples

Specific embodiment

[0060] Further, as a specific implementation manner, step c includes the following steps:

[0061] c1. Establish a four-dimensional feature vector for each visible string of the entry;

[0062] c2. Divide the data set according to the feature vectors;

[0063] c3. Assign a meaning to each visible string and form an extraction template.

[0064] The four-dimensional feature vector described in step c1 consists of F1, F2, F3 and F4, specifically:

[0065] F1: whether the string is a number;

[0066] F2: the length of the string;

[0067] F3: whether the string is in a time format. To judge the time format, the time expression formats used by the website are collected manually and turned into regular expressions, and each format is paired with a method for converting the string into a timestamp;

[0068] F4: whether the string is hyperlink text.
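The four features F1 to F4 can be illustrated with a minimal Python sketch. The function name `feature_vector`, the `is_link` argument and the two time patterns are assumptions made for illustration only; the source states that the real time formats are collected manually per website and does not list them.

```python
import re

# Hypothetical time formats. The source says these are collected by hand
# for each website, so this list is only a stand-in.
TIME_PATTERNS = [
    re.compile(r"\d{4}-\d{1,2}-\d{1,2}( \d{1,2}:\d{2}(:\d{2})?)?"),
    re.compile(r"\d{4}/\d{1,2}/\d{1,2}( \d{1,2}:\d{2}(:\d{2})?)?"),
]

# Assumed definition of "a number": optional sign, digits, optional decimals.
NUMBER_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def feature_vector(text, is_link):
    """Return (F1, F2, F3, F4) for one visible string.

    text    -- the visible string taken from the page
    is_link -- True if the string is the text of a hyperlink
               (assumed to be known from the DOM node it came from)
    """
    s = text.strip()
    f1 = int(NUMBER_RE.fullmatch(s) is not None)                        # F1: is a number
    f2 = len(s)                                                         # F2: length
    f3 = int(any(p.fullmatch(s) is not None for p in TIME_PATTERNS))    # F3: time format
    f4 = int(bool(is_link))                                             # F4: hyperlink text
    return (f1, f2, f3, f4)


if __name__ == "__main__":
    for text, link in [("12345", False), ("2012-03-05 14:30", False), ("alice", True)]:
        print(text, feature_vector(text, link))
```

Encoding the boolean features as 0 or 1 keeps the vector purely numeric, which is convenient if the data set is later divided by comparing vectors as in step c2.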

[0069] The feature vectors are put into the path dictionary, the entropy of the strings on each path is calculated, and the strings with entropy less than 0.4 a...
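As a rough illustration of the entropy calculation, the sketch below computes the Shannon entropy (base 2) of the string values observed on each path and reports the paths that fall below 0.4. The choice of Shannon entropy, the logarithm base, the shape of the path dictionary and the names `shannon_entropy` and `low_entropy_paths` are all assumptions; the source text is truncated and does not define the entropy formula or say what is done with the low-entropy strings.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_entropy_paths(path_dict, threshold=0.4):
    """Return {path: entropy} for paths whose strings have entropy below threshold.

    path_dict -- assumed layout: DOM path -> list of visible strings seen on it
    """
    result = {}
    for path, strings in path_dict.items():
        h = shannon_entropy(strings)
        if h < threshold:
            result[path] = h
    return result


if __name__ == "__main__":
    # Made-up example data: a constant label and a varying author name.
    path_dict = {
        "/html/body/div/span": ["Reply", "Reply", "Reply", "Reply"],
        "/html/body/div/a": ["alice", "bob", "carol", "dave"],
    }
    print(low_entropy_paths(path_dict))
```

Under this reading, a path whose string never changes (such as a fixed button label) has entropy 0, while a path carrying varying data such as author names has high entropy.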


Abstract

The invention discloses a method for automatically extracting data from BBS (bulletin board system) posts. The method includes the steps of: (a) identifying post web pages by clustering web page structures, based on the structural characteristics of BBS web pages; (b) calculating the entropy of the similar subtrees under each path of a cluster, according to how the number of similar subtrees varies across the post web pages of the cluster, so as to locate the entry information; (c) building a feature set for the visible strings of the post web page, dividing the feature set by statistical characteristics, identifying the specific meaning of each visible string by using prior knowledge, and generating a template; and (d) completing the final data extraction by using the template to parse the web page.
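Step (b) can be read in more than one way. One plausible reading is that post entries repeat a different number of times on different pages, so the path whose similar-subtree count varies most across a cluster is a candidate for the entry container. The sketch below follows that reading only. The input layout, the names `count_entropy` and `locate_entry_path`, and the rule of picking the highest-entropy path are assumptions, not details given in the abstract.

```python
import math
from collections import Counter


def count_entropy(counts):
    """Shannon entropy (base 2) of how often each subtree count occurs."""
    n = len(counts)
    if n == 0:
        return 0.0
    freq = Counter(counts)
    return -sum((c / n) * math.log2(c / n) for c in freq.values())


def locate_entry_path(pages):
    """Pick the path whose similar-subtree count varies most across pages.

    pages -- assumed layout: one dict per page in the cluster, mapping a
             DOM path to the number of similar subtrees found under it
    Returns (best_path, scores), or (None, {}) if there are no paths.
    """
    paths = sorted(set().union(*pages))  # sorted for deterministic tie-breaking
    scores = {p: count_entropy([page.get(p, 0) for page in pages]) for p in paths}
    if not scores:
        return None, scores
    return max(paths, key=lambda p: scores[p]), scores


if __name__ == "__main__":
    # Made-up cluster: the navigation block appears once on every page,
    # while the number of post entries differs from page to page.
    pages = [
        {"body/div.nav": 1, "body/div.posts": 10},
        {"body/div.nav": 1, "body/div.posts": 3},
        {"body/div.nav": 1, "body/div.posts": 7},
    ]
    best, scores = locate_entry_path(pages)
    print(best, scores)
```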

Description

Technical field

[0001] The invention belongs to the technical field of network information processing and relates to network information extraction technology, in particular to a method for automatically extracting forum data.

Background technique

[0002] A forum is a web page information publishing mode in which one person posts a topic or comment and multiple people comment or reply below it. The main content of a forum web page has a uniform structure and is mostly presented as a list of entries. The information is generated from a web page template and usually includes valid information such as the author, the post content and the post time. Whether it is an original post entry or a reply entry, the entries are highly consistent in structure.

[0003] In addition, forums are characterized by a large number of users and a rapid increase of information. The 29th Statistical Report on Internet Development in China issued by the China Internet Network Information Cen...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 郭成林, 彭春林, 刘红玉, 高云棋, 刘丹
Owner: NINGBO CHENGDIAN TAIKE ELECTRONICS INFORMATION TECH DEV