Configurable data analysis method and computer readable storage medium

A data parsing technology applied in the field of Internet data crawling. It addresses the inability of existing approaches to parse JSON-format web pages, their insufficient flexibility across differently formatted web pages, and the limits of an encapsulated parsing mode. The effects are improved parsing efficiency and flexibility, a reduced amount of data to parse, and easier machine recognition.

Inactive Publication Date: 2019-08-13
厦门商集网络科技有限责任公司
5 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

The previously disclosed technical solution uses Python's BeautifulSoup library to parse page text data, but this approach has two disadvantages: 1. BeautifulSoup can only parse files in HTML or XML format, so the supported types are limited. 2. BeautifulSoup works in an encapsulated mode. For web pages with different formats and page types, it does not adapt flexibly enough, especially when unconventional web pages need special handling or when users want to selectively filter the relevant data.
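To make the first limitation concrete, the following is a minimal sketch of routing a page to a different parser depending on a configured parsing type, so that JSON pages are handled alongside HTML pages. It uses only the Python standard library. The function name `extract_values`, the parse type strings, the page layouts, and the sample values are all assumptions made for illustration; they are not taken from the patent.

```python
import json
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collects the text of every <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data.strip()


def extract_values(page_text, parse_type):
    """Return the page's data values as a flat list, chosen by parse type."""
    if parse_type == "json":
        # Assumed layout: a flat JSON object whose values are the data objects.
        return [str(value) for value in json.loads(page_text).values()]
    if parse_type == "html":
        collector = _CellCollector()
        collector.feed(page_text)
        return collector.cells
    raise ValueError(f"unsupported parse type: {parse_type}")


# Made-up sample pages carrying the same two values in different formats.
html_page = "<table><tr><td>12345678-9</td><td>Zhang San</td></tr></table>"
json_page = '{"reg_code": "12345678-9", "law_person": "Zhang San"}'

print(extract_values(html_page, "html"))
print(extract_values(json_page, "json"))
```

Both calls yield the same list of values, which is the point of selecting the parser from configuration rather than fixing it in code.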

Examples

Embodiment 1

[0039] Referring to Figure 1, a configurable data parsing method comprises the following steps. Create a new parsing configuration page and, on that page, configure the URL of the target webpage to be captured, the parsing type, the parsing attribute, and the name of the logical table used to save the parsing results, then submit once complete. The logical table is generally a two-dimensional table, such as an Excel sheet. The parsing attribute includes a parsing area and row positioning information, or only row positioning information. Create a new field configuration page and, on that page, configure the field name of each field in the logical table. Each field name corresponds to a data object to be extracted: for example, the field name reg_code corresponds to the number string of the organization code, and the field name law_person corresponds to the name of the legal representative; the field name...
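The configuration and mapping steps described in this embodiment can be sketched roughly as follows. This is an illustration under stated assumptions, not the patent's implementation: the class names `ParseConfig` and `LogicalTable`, the helper functions, the keys inside `parse_attribute`, the placeholder URL, and the sample values are all hypothetical. Only the field names reg_code and law_person come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ParseConfig:
    """Settings entered on the parsing configuration page."""
    url: str
    parse_type: str
    parse_attribute: dict
    table_name: str


@dataclass
class LogicalTable:
    """Two-dimensional table: ordered field names plus rows of values."""
    name: str
    field_names: list
    rows: list = field(default_factory=list)


def create_blank_table(config, field_names):
    """Create an empty logical table and write the field names into it."""
    return LogicalTable(name=config.table_name, field_names=list(field_names))


def map_in_order(table, data_objects):
    """Map extracted data objects onto the fields by position."""
    if len(data_objects) != len(table.field_names):
        raise ValueError("number of data objects does not match number of fields")
    row = dict(zip(table.field_names, data_objects))
    table.rows.append(row)
    return row


config = ParseConfig(
    url="https://example.com/company/1",  # placeholder URL
    parse_type="html",
    # Hypothetical keys for the parsing area and row positioning information.
    parse_attribute={"area": "table#info", "row": "tr"},
    table_name="company_info",
)
table = create_blank_table(config, ["reg_code", "law_person"])
# Made-up data objects, standing in for values extracted from a page.
map_in_order(table, ["12345678-9", "Zhang San"])
print(table.rows)
```

Mapping by position keeps the sketch simple, but it relies on the field order matching the order in which data objects are extracted.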

Embodiment 2

[0062] This embodiment differs from Embodiment 1 in that the field configuration also includes a field identifier for each field. When data objects are extracted, each data object is matched according to its field identifier and then mapped into the logical table.

[0063] The data parsing method carries out the following steps: on the parsing configuration page, configure the URL of the target webpage to be captured, the parsing type, the parsing attribute, and the name of the logical table used to save the parsing results, then submit once complete. The parsing attribute includes a parsing area and row positioning information, or only row positioning information. As shown in Figure 14, configure the field name and field identifier of each field in the logical table on the field configuration page; create a blank logical table according to the logical table name, and write each field name and corresponding fie...
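One way to picture identifier-based matching is the sketch below. The visible text does not say what form a field identifier takes, so this example assumes it is the label that accompanies a value on the page. The function name `map_by_identifier`, the labels, and the sample values are hypothetical.

```python
def map_by_identifier(field_identifiers, extracted_pairs):
    """Build one logical-table row by matching identifiers, not positions.

    field_identifiers: dict of field name -> identifier expected on the page.
    extracted_pairs:   list of (identifier, value) tuples found on the page.
    Fields whose identifier is absent from the page are mapped to None.
    """
    lookup = dict(extracted_pairs)
    return {
        name: lookup.get(identifier)
        for name, identifier in field_identifiers.items()
    }


# Assumed identifiers: the labels shown next to each value on the page.
field_identifiers = {
    "reg_code": "Organization code",
    "law_person": "Legal representative",
}

# Made-up extraction result, in a different order and with an extra item.
extracted_pairs = [
    ("Legal representative", "Zhang San"),
    ("Address", "1 Example Road"),
    ("Organization code", "12345678-9"),
]

row = map_by_identifier(field_identifiers, extracted_pairs)
print(row)
```

Under this assumption the page order no longer matters, and items without a configured field are simply ignored.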

Embodiment 3

[0069] A computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the following steps are performed: on the parsing configuration page, configure the URL of the target webpage to be captured, the parsing type, the parsing attribute, and the name of the logical table used to save the parsing results, then submit once complete. The parsing attribute includes a parsing area and row positioning information, or only row positioning information. Configure the field name of each field in the logical table on the field configuration page, where each field name corresponds to a data object to be extracted. Create a blank logical table according to the logical table name and write each field name into it, with the fields in the same order in which the data objects are extracted during parsing. Capture the target webpage corresponding to ...

Abstract

The invention relates to a configurable data parsing method. The method comprises the following steps: performing parsing configuration; performing field configuration; creating a blank logical table and writing each field name into it; capturing the target webpage corresponding to the URL, extracting data objects according to the parsing type and the parsing attribute, and sequentially mapping the data objects into the logical table so that each corresponds to a field name, thereby converting the target webpage into structured data text. By configuring the parsing table, the method can flexibly handle webpages of different formats, complete webpage parsing, and convert webpage data into structured data text, which facilitates the application and mining of information.
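As a small illustration of the final step, a filled logical table can be serialized into structured text. The abstract does not name an output format, so CSV is an assumption here, as are the function name `table_to_text` and the sample row.

```python
import csv
import io


def table_to_text(field_names, rows):
    """Serialize logical-table rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up row, standing in for data objects mapped from a target webpage.
rows = [{"reg_code": "12345678-9", "law_person": "Zhang San"}]
text = table_to_text(["reg_code", "law_person"], rows)
print(text)
```

The header line carries the configured field names, so downstream tools can treat the output as ordinary tabular data.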

Description

Technical field

[0001] The invention relates to a configurable data parsing method and a computer-readable storage medium, belonging to the field of Internet data crawling.

Background

[0002] Information on the Internet is complex and comes in many types and forms of expression. It is displayed for the convenience of users browsing it rather than as unified structured data, so machine recognition is not taken into account. Because computers lack natural-language or human-like reading ability, the information displayed on Internet web pages is not easy for them to identify and analyze. Over the course of IT development, a large body of mining and analysis techniques based on structured data has accumulated. Unstructured web page data captured from the Internet must therefore first be converted into structured data to facilitate machine recognition, so as to facilitate the use of s...

Application Information

IPC(8): G06F16/25, G06F16/951, G06F16/955
CPC: G06F16/258, G06F16/951, G06F16/9566
Inventor: 邱涛, 丘水文, 陈成乐
Owner: 厦门商集网络科技有限责任公司