Configurable data analysis method and computer readable storage medium

A data analysis and data object technology, applied in the field of Internet data crawling. It addresses three problems of existing approaches: the inability to parse web pages in JSON format, limited flexibility in adapting to different web pages, and the inability to adapt to different encapsulation modes. Its effects are improved parsing efficiency and flexibility, a reduced amount of data to parse, and easier machine recognition.

Inactive Publication Date: 2019-08-13
厦门商集网络科技有限责任公司

AI Technical Summary

Problems solved by technology

The disclosed technical solution uses Python's BeautifulSoup library to parse page text data, but this approach has two disadvantages: 1. Python's BeautifulSoup library can only parse files in HTML or XML format, so the supported types are limited; 2. Python's BeautifulSoup library is an encapsulatio



Examples


Example Embodiment

[0038] Example one

[0039] See Figure 1. A configurable data parsing method includes the following steps. Create a new parsing configuration page and, on that page, configure the URL of the target webpage to be crawled, the parsing type, the parsing attributes, and the name of the logical table used to save the parsing results; submit the configuration once it is complete. The logical table is generally a two-dimensional table, such as an Excel sheet. The parsing attributes include a parsing area and row positioning information, or row positioning information only. Create a new field configuration page and, on that page, configure the field name of each field in the logical table; each field name corresponds to a data object to be extracted. For example, for the field name reg_code the corresponding data object is the number string of the organization code, and for the field name law_person the corresponding data object is the name of the legal...
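The steps of this embodiment can be illustrated with a minimal Python sketch. It is not the patented implementation: the configuration keys (`area_id`, `row_tag`, `table_name`), the `RowExtractor` class, the sample page, and the placeholder URL are all assumptions made for illustration, and the standard-library `html.parser` module stands in for whatever parser the method actually uses. The sketch assumes well-formed HTML without void tags such as `<br>`.

```python
from html.parser import HTMLParser

# Hypothetical parsing configuration; the key names are illustrative only.
PARSE_CONFIG = {
    "url": "https://example.com/companies",  # placeholder, never fetched here
    "parse_type": "html",
    "parse_attrs": {"area_id": "result", "row_tag": "tr"},
    "table_name": "company_info",
}
# Field configuration: field order matches the extraction order of the cells.
FIELD_NAMES = ["reg_code", "law_person"]

# Inline stand-in for the captured target webpage.
SAMPLE_PAGE = (
    '<div id="other"><table><tr><td>x</td><td>y</td></tr></table></div>'
    '<table id="result">'
    "<tr><td>12345678-9</td><td>Zhang San</td></tr>"
    "<tr><td>98765432-1</td><td>Li Si</td></tr>"
    "</table>"
)


class RowExtractor(HTMLParser):
    """Collects the cell texts of every row inside the configured area."""

    def __init__(self, area_id, row_tag):
        super().__init__()
        self.area_id = area_id
        self.row_tag = row_tag
        self.depth = 0        # nesting depth inside the area; 0 means outside
        self.in_cell = False
        self.current = None   # cells of the row being read, if any
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.area_id:
            self.depth = 1    # entered the parsing area
        if self.depth and tag == self.row_tag:
            self.current = []
        elif self.current is not None and tag == "td":
            self.in_cell = True
            self.current.append("")

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "td":
            self.in_cell = False
        elif tag == self.row_tag and self.current is not None:
            self.rows.append(self.current)
            self.current = None
        self.depth -= 1

    def handle_data(self, data):
        if self.in_cell:
            self.current[-1] += data.strip()


def parse_to_table(page_text, config, field_names):
    """Maps extracted cells, in order, onto the configured field names."""
    extractor = RowExtractor(**config["parse_attrs"])
    extractor.feed(page_text)
    rows = [dict(zip(field_names, cells)) for cells in extractor.rows]
    return {"name": config["table_name"], "fields": list(field_names), "rows": rows}


table = parse_to_table(SAMPLE_PAGE, PARSE_CONFIG, FIELD_NAMES)
```

Because the site-specific details live in `PARSE_CONFIG` and `FIELD_NAMES`, pointing the same code at a different page layout only requires a new configuration, which is the flexibility the embodiment describes.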

Example Embodiment

[0061] Example two

[0062] This embodiment differs from the first embodiment in that the field configuration also includes configuring a field identifier. When data objects are extracted, each data object to be extracted is matched according to its field identifier and then mapped into the logical table.

[0063] The data parsing method performs the following steps: on the parsing configuration page, configure the URL of the target webpage to be crawled, the parsing type, the parsing attributes, and the name of the logical table for saving the parsing results, and submit the configuration once it is complete; the parsing attributes include a parsing area and row positioning information, or row positioning information only; as shown in Figure 14, configure the field name and field identifier of each field in the logical table on the field configuration page; create a blank logical table according to the logical table name, and write each fiel...
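Matching by field identifier rather than by position can be sketched as follows. This is only an illustration under stated assumptions: it assumes the parsing type is JSON, that the field identifier is a JSON key, and that row positioning is a top-level key named `data`. The identifiers `orgCode` and `legalRepresentative`, the function name, and the sample page are hypothetical and do not come from the source.

```python
import json

# Hypothetical field configuration: (field name, field identifier) pairs.
FIELD_CONFIG = [
    ("reg_code", "orgCode"),
    ("law_person", "legalRepresentative"),
]

# Inline stand-in for a captured JSON page; key order differs from field order.
SAMPLE_JSON = (
    '{"data": [{"legalRepresentative": "Zhang San",'
    ' "orgCode": "12345678-9", "extra": 1}]}'
)


def extract_by_identifier(page_text, row_key, field_config):
    """Returns (header, rows), matching each data object by its identifier."""
    records = json.loads(page_text)[row_key]
    header = [name for name, _ in field_config]
    # A missing identifier yields None instead of shifting later columns.
    rows = [[rec.get(ident) for _, ident in field_config] for rec in records]
    return header, rows


header, rows = extract_by_identifier(SAMPLE_JSON, "data", FIELD_CONFIG)
```

With identifiers, the order in which values appear on the page no longer matters, and unconfigured keys such as `extra` are simply ignored.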

Example Embodiment

[0068] Example three

[0069] A computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the following steps are performed: on the parsing configuration page, configure the URL of the target webpage to be crawled, the parsing type, the parsing attributes, and the name of the logical table for storing the parsing results, and submit the configuration once it is complete; the parsing attributes include a parsing area and row positioning information, or row positioning information only; configure the field name of each field in the logical table on the field configuration page, where each field name corresponds to a data object to be extracted; create a blank logical table according to the logical table name and write each field name into it, with the order of the fields consistent with the order in which the data objects are extracted during parsing; capture the target webpage corresponding to the UR...
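The step of creating a blank logical table and writing the field names into it can be sketched with Python's standard `csv` module. Representing the logical table as CSV text in memory is an assumption made for illustration, as are the function names; the source only says the table is generally two-dimensional.

```python
import csv
import io


def create_blank_table(field_names):
    """Creates an in-memory table whose only content is the header row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(field_names)
    return buffer


def append_objects(buffer, field_names, data_objects):
    """Appends one row; objects must arrive in the configured field order."""
    if len(data_objects) != len(field_names):
        raise ValueError("extracted objects do not match the configured fields")
    csv.writer(buffer, lineterminator="\n").writerow(data_objects)


fields = ["reg_code", "law_person"]
logical_table = create_blank_table(fields)
append_objects(logical_table, fields, ["12345678-9", "Zhang San"])
```

The length check is a cheap guard for the requirement that field order and extraction order agree; it catches a missing or extra data object but not two objects swapped in place.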


Abstract

The invention relates to a configurable data analysis method. The method comprises the following steps: making the parsing configuration; making the field configuration; creating a blank logical table and writing each field name into it; capturing the target webpage corresponding to the URL, extracting data objects according to the parsing type and the parsing attributes, and sequentially mapping the data objects into the logical table so that each corresponds to a field name, thereby converting the target webpage into structured data text. Because the parsing is driven by configuration, the method can flexibly handle webpages of different formats, complete the webpage parsing, and convert webpage data into structured data text, which facilitates the application and mining of the information.

Description

Technical field

[0001] The invention relates to a configurable data analysis method and a computer-readable storage medium, belonging to the field of Internet data crawling.

Background

[0002] Information on the Internet is complex and comes in many types and forms of expression. It is displayed for the convenience of human readers rather than as uniformly structured data, so machine recognition is not taken into account. Because computers lack natural-language understanding and human-like reading ability, the information carried by Internet web pages is not easy for computers to recognize and analyze. Over the course of past IT development, a large body of mining and analysis techniques based on structured data has accumulated. Unstructured web page data captured from the Internet must therefore first be converted into structured data to facilitate machine analysis and recognition, so as to facilitate the use of s...


Application Information

IPC(8): G06F16/25, G06F16/951, G06F16/955
CPC: G06F16/258, G06F16/951, G06F16/9566
Inventor: 邱涛, 丘水文, 陈成乐
Owner: 厦门商集网络科技有限责任公司