Method and system for collecting data based on PHP custom rules

A data collection and custom-rule technology, applied in the field of web crawlers. It addresses problems such as inconvenient program processing and storage, collection methods that are difficult to use, and a complicated collection process, so as to reduce the difficulty and the cost of learning and use and to improve collection efficiency.

Active Publication Date: 2019-05-24
四川商通实业有限公司
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0003] Traditional data collection methods usually require installing a third-party client; the collection process is complex and the collection method is difficult to use. When collecting pictures, the tag cannot be collected, and even when it is collected, it tends to make program processing and storage inconvenient.

Examples

Embodiment 1

[0048] As shown in Figure 1, a method for data collection based on PHP custom rules includes the following steps:

[0049] a. Generate a collection client based on the guzzle component;

[0050] b. Obtain the target website and read its text content;

[0051] c. Perform file slicing and complete data extraction.

[0052] This embodiment uses the PHP development language with the guzzle component as the collection client, which can simulate various collection platforms at random. After the text content is read, a text positioning and slicing method is used to slice the file. The method can serve as a general-purpose data collection tool: it reduces the difficulty of writing collection rules and the cost of learning and use, and data collection for a specific type of website can be completed within a few minutes.
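The three steps of Embodiment 1 can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the patent's PHP/guzzle implementation; the function names `make_client`, `slice_between`, and `collect`, the default User-Agent string, and the injectable `fetch` parameter are all assumptions made for the example.

```python
from typing import Callable, Optional
import urllib.request


def make_client(user_agent: str = "Mozilla/5.0 (sketch)") -> Callable[[str], str]:
    """Step a: build a collection client (here, a function that fetches a URL as text)."""
    def fetch(url: str) -> str:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=10) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    return fetch


def slice_between(text: str, start: str, end: str) -> Optional[str]:
    """Step c: return the text between the first `start` marker and the next `end` marker."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


def collect(url: str, start: str, end: str,
            fetch: Optional[Callable[[str], str]] = None) -> Optional[str]:
    """Run steps a to c: build the client, read the page text, slice out the data."""
    fetch = fetch or make_client()           # step a
    page = fetch(url)                        # step b
    return slice_between(page, start, end)   # step c
```

Passing a custom `fetch` function makes it possible to try the slicing step on saved page text without any network access.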

Embodiment 2

[0054] In this embodiment, building on Embodiment 1, step a includes the following:

[0055] As required, the generated collection client is made to simulate the corresponding collection platform. In use, the collection client can simulate various collection platforms as needed. This removes the need to install a third-party client, as traditional data collection requires, makes data collection more adaptable, and improves collection efficiency.
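One plausible way to simulate a collection platform is to set the request's User-Agent header. The patent text does not say how the simulation is done, so this is an assumption. The sketch below is Python rather than PHP; the `PLATFORM_USER_AGENTS` table, its example strings, and the `build_request` helper are hypothetical.

```python
import random
import urllib.request

# Hypothetical User-Agent strings, one per simulated platform (illustrative values only).
PLATFORM_USER_AGENTS = {
    "desktop": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "android": "Mozilla/5.0 (Linux; Android 13) ExampleBrowser/1.0 Mobile",
    "iphone": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) ExampleBrowser/1.0",
}


def build_request(url: str, platform: str = None) -> urllib.request.Request:
    """Build a request that presents itself as the given platform.

    When no platform is named, one is picked at random.
    """
    if platform is None:
        platform = random.choice(sorted(PLATFORM_USER_AGENTS))
    user_agent = PLATFORM_USER_AGENTS[platform]
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The returned request can be passed to `urllib.request.urlopen`; nothing is sent over the network until then.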

Embodiment 3

[0057] In this embodiment, building on Embodiment 1, step c includes the following steps:

[0058] After reading the text content, analyze its elements and locate the slice tag;

[0059] Define the rule based on the start tag and end tag at which the slice tag is located.

[0060] Select the target website, analyze its elements according to the HTML source code, and locate the slice tag, including its start tag and end tag; the rule is then "|". This makes it easy to locate the tag position of the required data and then collect that data.
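The rule-based slicing can be illustrated with a short sketch. It assumes that a rule is written as a start tag and an end tag joined by "|"; the source shows only the separator, so that format is a guess, and the names `parse_rule` and `extract_all` are hypothetical. It is written in Python, not the PHP the patent describes.

```python
from typing import List, Tuple


def parse_rule(rule: str) -> Tuple[str, str]:
    """Split an assumed 'start|end' rule into its start and end markers."""
    start, separator, end = rule.partition("|")
    if not separator or not start or not end:
        raise ValueError("rule must look like 'start|end'")
    return start, end


def extract_all(html: str, rule: str) -> List[str]:
    """Collect every piece of text that sits between the rule's start and end markers."""
    start, end = parse_rule(rule)
    results = []
    position = 0
    while True:
        begin = html.find(start, position)
        if begin == -1:
            break
        begin += len(start)
        stop = html.find(end, begin)
        if stop == -1:
            break
        results.append(html[begin:stop].strip())
        position = stop + len(end)
    return results


page = '<ul><li class="t">A</li><li class="t"> B </li></ul>'
items = extract_all(page, '<li class="t">|</li>')  # ['A', 'B']
```

Because the rule is split at the first "|", a start marker that itself contains "|" cannot be expressed in this simplified format.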


Abstract

The invention discloses a data collection method based on PHP custom rules. The method comprises the steps of: generating a collection client based on the guzzle component; obtaining a target website and reading its text content; and performing file slicing to complete data extraction. The invention also discloses a system for data collection based on PHP custom rules. The system comprises a collection generation module, a text reading module, and a data extraction module. The method uses the PHP development language with the guzzle component as the collection client, and after the text content is read, a text positioning and slicing method is used to slice the file. This reduces the difficulty of writing collection rules and the cost of learning and use, improves collection efficiency, and allows data collection for a specific type of website to be completed within several minutes.

Description

Technical field

[0001] The invention relates to the technical field of web crawlers, in particular to a method and system for collecting data based on PHP custom rules.

Background

[0002] Web crawlers (also known as web spiders or web robots, and more often called web chasers in the FOAF community) are programs or scripts that automatically grab information on the World Wide Web according to certain rules. Other names in use include ant, automatic indexer, simulator, and worm.

[0003] Traditional data collection methods usually require installing a third-party client. The collection process is complicated and the collection method is difficult to use. When collecting pictures, the tag cannot be collected, and even when it is collected, it tends to make program processing and storage inconvenient.

Contents of the invention

[0004] Based on this, in response to the above problems, it is necessary to propose a method and system for data collecti...


Application Information

IPC(8): G06F16/951, G06F16/955
Inventor: 任毅, 刘伟
Owner: 四川商通实业有限公司