Method and device for structured analysis of web page data

A method for the structured analysis of web page data, applied in the fields of network data retrieval, electronic digital data processing, and other database retrieval, that addresses the heavy reliance on manual work in existing approaches.

Active Publication Date: 2018-02-23
SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

Existing web page parsing is carried out mainly by specialist engineers, who must invest substantial labor to discover the relevant patterns or rules, so the approach depends heavily on manual work.


Examples

Embodiment approach

[0039] Embodiment 1 of the present invention provides a method and device for structured parsing of web page data. As shown in Figure 1, in one implementable mode the method includes the following steps:

[0040] Step S110: collect multiple template web pages of the same type in a given field, extract the body text of the collected pages, parse the pages structurally according to preset rules, and use the extracted text together with the corresponding parsed data as the training corpus.

[0041] Step S111: collect template web pages of multiple types in this field, and obtain from them the structured item names and the various aliases those names take on different web pages.

[0042] Step S112: train an analysis model on the training corpus.

[0043] Construct an analysis model θ(N, M, A, B, p, q), described as follows:

[0044] N: the number of states. Let the state set be $S = \{s_1, s_2, \ldots, s_N\}$, which corresponds to the tag (Tag) of the item to be extracted in information extracti...
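
The training in Step S112 can be illustrated with a minimal sketch. It assumes the model is a standard hidden Markov model in which A is the state-transition matrix, B the emission matrix, and p the initial-state distribution; the source text is truncated before it defines these symbols, so that reading is an assumption, and q is left out. The function name `train_hmm`, the `(token, tag)` corpus layout, and the add-one smoothing are illustrative choices, not taken from the patent.

```python
from collections import Counter, defaultdict


def train_hmm(sequences, smoothing=1.0):
    """Estimate HMM parameters by counting over labeled (token, tag) sequences.

    Hypothetical helper: assumes A = transitions, B = emissions, p = initial
    distribution, which the truncated source does not confirm.
    """
    states = sorted({tag for seq in sequences for _, tag in seq})
    vocab = sorted({tok for seq in sequences for tok, _ in seq})

    init = Counter()              # how often each tag starts a sequence
    trans = defaultdict(Counter)  # trans[prev][cur] = transition count
    emit = defaultdict(Counter)   # emit[tag][token] = emission count
    for seq in sequences:
        if not seq:
            continue
        init[seq[0][1]] += 1
        for tok, tag in seq:
            emit[tag][tok] += 1
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans[prev][cur] += 1

    def normalize(counts, keys):
        # Additive smoothing so unseen events keep a small probability.
        total = sum(counts[k] for k in keys) + smoothing * len(keys)
        return {k: (counts[k] + smoothing) / total for k in keys}

    p = normalize(init, states)
    A = {s: normalize(trans[s], states) for s in states}
    B = {s: normalize(emit[s], vocab) for s in states}
    return states, vocab, p, A, B
```

Each row of A and B, and p itself, sums to 1 by construction, so the result can be passed directly to any standard HMM decoder.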

Embodiment 2

[0062] The web page data parsing method provided by Embodiment 2 of the present invention includes the steps of:

[0063] Step S210: for a website in a given field, collect a number of web pages that share a similar template. Use ContentExtractor-master to extract the body text of this batch of pages; use htmlunit to write parsing rules for the pages and obtain the content of the structured items. Save the valid structured data and the corresponding body text as the training corpus.
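
One way to pair the extracted body text with the rule-parsed items is sketched below. It assumes both outputs are already available as a plain string and a field-to-value dictionary, and it labels each non-empty line by exact match; the function `label_lines`, the `OTHER` tag, and the line-level granularity are hypothetical and are not specified by the source.

```python
def label_lines(body_text, structured_items, other_tag="OTHER"):
    """Tag each non-empty line of body text with the field whose value it equals.

    Hypothetical helper: lines that match no structured value get `other_tag`.
    """
    labeled = []
    for line in body_text.splitlines():
        line = line.strip()
        if not line:
            continue
        tag = other_tag
        for field, value in structured_items.items():
            # Exact match only; fuzzy matching is out of scope for this sketch.
            if value and value.strip() == line:
                tag = field
                break
        labeled.append((line, tag))
    return labeled
```

The resulting list of `(line, tag)` pairs is one possible shape for a single training-corpus entry.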

[0064] For example, body text could look like the following table:

[0065]

[0066]

[0067] The corresponding structured analysis text is shown in the following table:

[0068]

[0069] Step S211: obtain all possible names that the hidden state "field name" of the analysis model (that is, the name of the structured item to be parsed) takes on different web pages.

[0070] For a website in a certain field, webpage collection is performed to obtain a list of actual name...
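
The abstract states that candidate item names are acquired through IDF, but the visible text does not say how. The sketch below shows one plausible use, assuming IDF means inverse document frequency: on pages that share a template, terms that appear on almost every page (low IDF) are more likely to be field labels than field values. The function names and the `max_idf` threshold are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(pages):
    """Return idf(t) = log(N / df(t)) for every term across tokenized pages."""
    n = len(pages)
    df = Counter()
    for terms in pages:
        df.update(set(terms))  # count each term at most once per page
    return {term: math.log(n / count) for term, count in df.items()}


def candidate_field_names(pages, max_idf=0.5):
    """Keep terms with low IDF as candidate field names (assumed heuristic)."""
    idf = inverse_document_frequency(pages)
    return sorted(term for term, value in idf.items() if value <= max_idf)
```

With three pages tokenized as `["Address", "Beijing", "Phone", "123"]`, `["Address", "Jinan", "Phone", "456"]`, and `["Address", "Qingdao", "Tel", "789"]`, the default threshold keeps `Address` and `Phone` and drops the page-specific values.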

Abstract

The invention provides a method for structured parsing of web page data. The method comprises: writing, according to template web pages, a program that extracts information by rules, and thereby obtaining a training corpus; acquiring the possible names of the items to be structured by means of IDF; training a hidden Markov model on the training corpus to determine its parameters; and decoding the web page to be parsed with the hidden Markov model, using a relevant algorithm, to obtain the final structured data. The invention further provides a device for structured parsing of web page data, comprising a collection module, an acquisition module, a training module, and a decoding module. Because the method and device rely on the model's intelligent analysis and self-learning capabilities, domain experts need to devote little attention to the operation, manual dependence is low, and the accuracy, performance, and efficiency of parsing are greatly improved.
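
The decoding step can be sketched with the Viterbi algorithm, the standard way to find the most likely hidden-state sequence in a hidden Markov model. The abstract does not name the algorithm it uses, so this choice is an assumption, as are the dictionary-based parameter layout (`p`, `A`, `B`) and the probability floor for unseen tokens.

```python
import math


def viterbi(observations, states, p, A, B, floor=1e-12):
    """Return the most likely state path for `observations` (log-space Viterbi).

    Assumed layout: p[s] initial, A[r][s] transition, B[s][token] emission.
    """
    def log(x):
        return math.log(max(x, floor))  # floor avoids log(0) on unseen events

    if not observations:
        return []

    first = observations[0]
    score = [{s: log(p[s]) + log(B[s].get(first, 0.0)) for s in states}]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(A[r][s]))
            cur[s] = prev[best] + log(A[best][s]) + log(B[s].get(obs, 0.0))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

Working in log space keeps long pages from underflowing, and the returned path assigns one tag to each observed token, which is the structured labeling the method aims for.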

Description

Technical field

[0001] The invention relates to the field of computer application technology, in particular to a method and device for structured analysis of web page data.

Background technique

[0002] With the arrival of the big data era, companies worldwide have embraced big data, and big data analysis and processing have emerged in response. Big data processing comprises data collection, data storage and integration, data preprocessing, data mining and analysis, and data display applications. When enterprises in traditional industries develop big data capabilities, the first problem they face is how to connect internal data with external data, that is, how to obtain Internet data from outside the enterprise on the basis of internal data. However, data collected from the Internet is generally unstructured or semi-structured text, pictures, audio, video, and so on. How to parse and structure these data will be an essential work for data mining integration with organiz...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/2246, G06F16/958
Inventor: 范莹, 于治楼, 梁华勇
Owner: SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD