Method and device for structured parsing of web page data
A web page data parsing technology, applied to network data retrieval, electronic digital data processing, and other database retrieval, that addresses the problem of heavy reliance on manual work.
Embodiments
[0041] Embodiment 1 of the present invention provides a method and device for structured parsing of web page data. As shown in Figure 1, in one embodiment the method includes the following steps:
[0042] Step S110: Collect multiple template web pages of the same type within a field, extract the body text from the collected pages, parse the pages into structured data according to preset rules, and use the extracted text together with the corresponding parsed data as the training corpus.
[0043] Step S111: Extract multiple template web pages of various types within the field, and obtain from them the names of the structured items and the aliases those names take on different web pages.
[0044] Step S112: Train a parsing model on the training corpus.
[0045] A parsing model θ(N, M, A, B, p, q) is constructed. The model is described as follows:
[0046] N: the number of states. Let the state set be S = {s_1, s_2, ..., s_N}, corresponding to the tag (Tag) of the item to be extracted ...
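The parameter list θ(N, M, A, B, p, q), with states that correspond to item tags, resembles a hidden Markov model. The excerpt does not define M, A, B, p or q, so the following Python sketch does not try to map onto them. It only illustrates, under the assumption of a conventional HMM, how a state set and start and transition probabilities could be estimated by counting over tagged sequences. The `estimate_hmm` helper, the tag names and the sample sequences are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_hmm(tag_sequences):
    """Estimate start and transition probabilities by relative frequency.

    Assumption: a conventional HMM over tag states. This is an
    illustration, not the model defined in the patent text.
    """
    if not tag_sequences or not all(tag_sequences):
        raise ValueError("need at least one non-empty tag sequence")

    # The state set S is every tag seen in the corpus; N = len(states).
    states = sorted({tag for seq in tag_sequences for tag in seq})

    start_counts = Counter(seq[0] for seq in tag_sequences)
    trans_counts = defaultdict(Counter)
    for seq in tag_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            trans_counts[prev][nxt] += 1

    total_starts = sum(start_counts.values())
    start = {s: start_counts[s] / total_starts for s in states}

    trans = {}
    for s in states:
        row_total = sum(trans_counts[s].values())
        # A state that never has a successor gets an all-zero row.
        trans[s] = {
            t: (trans_counts[s][t] / row_total) if row_total else 0.0
            for t in states
        }
    return states, start, trans


# Hypothetical tag sequences, one per parsed training page.
SEQUENCES = [
    ["title", "author", "date"],
    ["title", "date"],
    ["author", "title", "date"],
]
states, start, trans = estimate_hmm(SEQUENCES)
print(len(states), start, trans["title"])
```

Relative-frequency counting is the simplest estimator for fully tagged data; a real system would likely add smoothing so that unseen transitions do not get zero probability.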
Embodiment 2
[0064] The web page data parsing method provided by Embodiment 2 of the present invention includes the following steps:
[0065] Step S210: For a website in a given field, collect a number of web pages that share the same template. Use ContentExtractor-master to extract the body text of this batch of pages; use htmlunit to write parsing rules for the pages and obtain the content of the structured items. Save the valid structured data and the corresponding body text as the training corpus.
[0066] For example, the body text can look like the following table:
[0067]-[0068] (Table not reproduced in this text.)
[0069] The corresponding structured parsed text is shown in the following table:
[0070] (Table not reproduced in this text.)
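Step S210 pairs the body text of each template page with the structured items obtained by a hand-written parsing rule. The source names ContentExtractor-master and htmlunit for these two tasks; the sketch below does not use either tool. It substitutes Python's standard `html.parser` module to show the shape of one corpus entry. The template layout (items held in `<span class="...">` elements), the field names, the sample page and the `build_corpus_entry` helper are all hypothetical.

```python
from html.parser import HTMLParser


class TemplateRuleParser(HTMLParser):
    """Parsing rule for one hypothetical template.

    Assumption: each structured item sits in <span class="FIELD">.
    """

    FIELDS = {"title", "author", "date"}

    def __init__(self):
        super().__init__()
        self.items = {}        # structured items found by the rule
        self.text_parts = []   # visible text, used as the body text
        self._current = None   # field whose <span> is currently open

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in self.FIELDS:
            self._current = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.text_parts.append(text)
        if self._current:
            self.items[self._current] = text


def build_corpus_entry(html):
    """Return one training-corpus entry: body text plus parsed items."""
    parser = TemplateRuleParser()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "items": parser.items}


# Hypothetical template page.
PAGE = (
    '<div><span class="title">Sample notice</span>'
    '<p>Author: <span class="author">Li Hua</span></p>'
    '<p>Date: <span class="date">2017-05-01</span></p></div>'
)
entry = build_corpus_entry(PAGE)
print(entry)
```

Keeping the text and the parsed items in one record makes it straightforward to align each item value with its position in the body text when labels for training are generated later.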
[0071] Step S211: Obtain all names that the parsing model's hidden state "field name" (that is, the name of a structured item to be parsed out) can take on different web pages.
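One way to hold the names gathered in step S211 is a lookup table from every observed alias to a single canonical field name. The following Python sketch assumes that representation; the source does not specify one. The alias lists and the helpers `build_alias_index` and `canonical_field` are hypothetical.

```python
def build_alias_index(aliases_by_field):
    """Invert {canonical name: [aliases]} into {normalized alias: canonical name}."""
    index = {}
    for canonical, aliases in aliases_by_field.items():
        for alias in [canonical, *aliases]:
            key = alias.strip().lower()
            # Refuse an alias that would point at two different fields.
            if index.setdefault(key, canonical) != canonical:
                raise ValueError(f"alias {alias!r} maps to two fields")
    return index


def canonical_field(label, index):
    """Map a label seen on a page to its canonical field name, or None."""
    key = label.strip().rstrip(":：").strip().lower()
    return index.get(key)


# Hypothetical aliases collected from pages of different templates.
ALIASES = {
    "author": ["writer", "posted by"],
    "date": ["publish time", "released on"],
}
INDEX = build_alias_index(ALIASES)

print(canonical_field("Posted by:", INDEX))
print(canonical_field("  Publish Time ", INDEX))
print(canonical_field("price", INDEX))
```

Normalizing case, surrounding whitespace and a trailing colon before lookup keeps the table small, since pages often differ only in those details.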
[0072] For a website in a certain domain, web page collection is performed to obtain w...