Method and device for structured analysis of web page data
A web page data parsing technology and method, applied in the fields of network data retrieval, electronic digital data processing, and other database retrieval, which addresses the problem of heavy dependence on manual work.
Embodiment approach
[0039] Embodiment 1 of the present invention provides a method and device for structured parsing of web page data. As shown in Figure 1, in one implementable mode, the method includes the following steps:
[0040] Step S110: collect multiple webpages in a given field that share the same template type, extract the body text of each collected page and parse it into structured data according to preset rules, then use the extracted text and the corresponding parsed data as the training corpus.
[0041] Step S111: collect webpages of multiple template types in the same field and obtain from them the structured item names, together with the aliases under which those names appear on different webpages.
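The alias collection in Step S111 can be pictured as a lookup table from every alias to one canonical item name. The following is a minimal sketch, not the patented implementation; the table contents (`title`, `price` and their aliases) and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical alias table: canonical item name -> aliases seen on different pages.
ALIASES = {
    "title": ["title", "name", "headline"],
    "price": ["price", "cost", "amount"],
}

def build_alias_index(aliases):
    """Invert the alias table into alias -> canonical name (case-insensitive)."""
    index = {}
    for canonical, names in aliases.items():
        for name in names:
            index[name.strip().lower()] = canonical
    return index

def canonical_name(label, index):
    """Return the canonical item name for a page label, or None if unknown."""
    # Drop surrounding whitespace and a trailing ASCII or full-width colon.
    key = label.strip().rstrip(":：").strip().lower()
    return index.get(key)

index = build_alias_index(ALIASES)
print(canonical_name("Cost:", index))
```

Keeping the table keyed by canonical name makes it easy to append newly observed aliases as more template types are collected.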
[0042] Step S112, training an analysis model according to the training corpus.
[0043] Construct an analytical model θ(N, M, A, B, p, q), and the model description is as follows:
[0044] N: number of states. Let the state set be $S = \{s_1, s_2, \ldots, s_N\}$, which corresponds to the tag (Tag) of the item to be extracted in information extracti...
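The state set and the matrices A and B resemble the components of a hidden Markov model, in which states are item tags and observations are text tokens. The sketch below assumes a standard HMM (initial distribution, transition matrix, emission matrix) trained by counting on a labeled corpus and decoded with the Viterbi algorithm. The roles of p and q are not visible in the text above, so they are not modeled here; the toy corpus, tag names, and add-one smoothing are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

UNK = "<unk>"  # placeholder for tokens never seen in training (assumption)

def log_probs(counts, keys, smoothing):
    """Turn raw counts into smoothed log-probabilities over `keys`."""
    total = sum(counts.values()) + smoothing * len(keys)
    return {k: math.log((counts[k] + smoothing) / total) for k in keys}

def train_hmm(tagged_sequences, smoothing=1.0):
    """Estimate initial, transition and emission log-probabilities by counting."""
    states = sorted({tag for seq in tagged_sequences for _, tag in seq})
    vocab = sorted({tok for seq in tagged_sequences for tok, _ in seq} | {UNK})
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        init[seq[0][1]] += 1
        for tok, tag in seq:
            emit[tag][tok] += 1
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans[prev][cur] += 1
    pi = log_probs(init, states, smoothing)
    A = {s: log_probs(trans[s], states, smoothing) for s in states}
    B = {s: log_probs(emit[s], vocab, smoothing) for s in states}
    return states, pi, A, B

def viterbi(tokens, states, pi, A, B):
    """Return the most probable tag sequence for a non-empty token list."""
    obs = [t if t in B[states[0]] else UNK for t in tokens]
    score = {s: pi[s] + B[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev_score, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev_score[r] + A[r][s])
            score[s] = prev_score[best] + A[best][s] + B[s][o]
            pointers[s] = best
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy labeled corpus: "O" marks field labels, other tags mark field values.
corpus = [
    [("name", "O"), ("widget", "NAME"), ("price", "O"), ("10", "PRICE")],
    [("name", "O"), ("gadget", "NAME"), ("price", "O"), ("25", "PRICE")],
]
states, pi, A, B = train_hmm(corpus)
print(viterbi(["name", "widget", "price", "25"], states, pi, A, B))
```

Working in log space avoids numeric underflow on long token sequences, and the smoothing term keeps unseen transitions and tokens from receiving zero probability.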
Embodiment 2
[0062] The web page data parsing method provided by Embodiment 2 of the present invention includes the steps of:
[0063] Step S210: for a website in a given field, collect a batch of webpages that share a similar template. Use ContentExtractor-master to extract the body text of these pages, and use htmlunit to write parsing rules that extract the structured item content from each page. Save the valid structured data together with the corresponding body text as the training corpus.
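The pairing of body text with rule-parsed items in Step S210 can be illustrated as follows. This sketch substitutes Python's standard `html.parser` for ContentExtractor-master and htmlunit, which are the tools the step actually names. The parsing rule (a label in a `<th>` cell followed by its value in a `<td>` cell), the class and function names, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collect visible text and (label, value) pairs from <th>/<td> cells."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # all non-empty text fragments, in page order
        self.fields = {}       # structured items found by the rule
        self._cell = None      # tag name of the cell currently open, if any
        self._label = None     # most recent <th> text awaiting its value

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        self.text_parts.append(data)
        if self._cell == "th":
            self._label = data
        elif self._cell == "td" and self._label is not None:
            self.fields[self._label] = data
            self._label = None

def make_training_example(html):
    """Return one corpus entry: extracted text plus its structured items."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "structured": parser.fields}

page = (
    "<html><body><h1>Widget page</h1><table>"
    "<tr><th>Name</th><td>Widget</td></tr>"
    "<tr><th>Price</th><td>10</td></tr>"
    "</table></body></html>"
)
example = make_training_example(page)
print(example)
```

Each page yields one text and structured-data pair, and a list of such pairs over the collected batch would form the training corpus described in this step.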
[0064] For example, body text could look like the following table:
[0065]
[0066]
[0067] The corresponding structured analysis text is shown in the following table:
[0068]
[0069] Step S211: obtain all possible names that the hidden state "field name" of the analytical model (that is, the structured item name to be parsed) takes in different web pages.
[0070] For a website in a certain field, webpage collection is performed to obtain a list of actual name...