Data analysis method and data analysis system for steel trade industry spot commodity resource

A data analysis technology for spot resources, applied in the field of data analysis, that addresses the low effective data conversion rate of existing approaches and improves that rate

Active Publication Date: 2015-04-29
SHANGHAI GANGFU E COMMERCE

AI-Extracted Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a data analysis method and system that address a problem in the prior art: the analysis of resource documents in the steel trade industry places relatively high requirements on the format specifications of the original...

Abstract

The invention provides a data analysis method and a data analysis system for spot resources in the steel trade industry. The method comprises the steps of: (1) acquiring a text document containing steel spot resources; (2) loading a steel spot exhaustive vocabulary and splitting each line of the text document with the steel specification as a new node to obtain a steel spot data set; (3) analyzing the steel spot data set and decomposing data containing parallel information into multiple pieces; and (4) cleaning the parsed data to obtain complete data information and storing it in the database. The steel spot exhaustive vocabulary allows the data to be parsed quickly and the data area to be delimited effectively; in actual measurement, the method improved the effective data conversion rate of the original resource documents by 70 percent.

Application Domain

Special data processing applications

Technology Topic

Data conversion; Data decomposition

Example Embodiment

[0011] The method and system for data analysis of spot resources in the steel trade industry provided by the present invention will be described in detail below in conjunction with the accompanying drawings.
[0012] Referring to Figure 1, a schematic flow chart of the method for data analysis of spot resources in the steel trade industry according to the present invention, the method includes: S12: obtain a text document containing steel spot resources; S14: load a steel spot exhaustive vocabulary and split each line of the text document with steel specifications as a new node to obtain a steel spot data set; S16: analyze the steel spot data set and decompose the data containing parallel information into multiple pieces; S18: clean the parsed data to obtain complete data information and store it in the database. The method of the present invention is described in detail below.
[0013] S12: Obtain a text file containing steel spot resources.
[0014] The obtained documents containing steel spot resources may include Word documents in .doc or .docx format and text documents in .txt format. Text documents can be analyzed directly with the method of the present invention, whereas Word documents must first be converted into text documents. Therefore, as a preferred embodiment, the method of the present invention further includes determining whether the obtained document containing the steel spot resources is a Word document and, if so, loading a Word document parsing program and converting the obtained Word document into a text document, so as to unify the document format.
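The format check described in S12 can be sketched as follows. This is only an illustration under stated assumptions, not the patent's parsing program: the function names load_resource_text and docx_to_text are hypothetical, .docx files are read with the Python standard library (a .docx file is a ZIP archive whose main text lives in word/document.xml), and the legacy binary .doc format is left to an external converter.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over one or more w:t runs.
        lines.append("".join(t.text or "" for t in paragraph.iter(W_NS + "t")))
    return "\n".join(lines)


def load_resource_text(path, encoding="utf-8"):
    """Unify the document format: always hand plain text to the later steps."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding=encoding)
    if suffix == ".docx":
        return docx_to_text(path)
    if suffix == ".doc":
        # Assumption: binary .doc files are converted by a separate tool.
        raise NotImplementedError("legacy .doc needs an external converter")
    raise ValueError(f"unsupported document type: {suffix}")
```

Keeping the dispatch in one function means the later steps only ever see plain text, which is the purpose of unifying the document format.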
[0015] S14: Load the steel spot exhaustive vocabulary, split each line of the text document with the steel specification as a new node, and obtain a steel spot data set.
[0016] The steel spot exhaustive vocabulary records the product name, material, steel mill, specification, thickness, width, warehouse, etc. of steel products; with it, the specific information represented by the data in each row of the obtained text document can be parsed. Steel specifications in the steel trade industry are expressed according to certain rules and generally contain the following characters: numbers, asterisks (*), slashes (/), backslashes (\), dashes (-), unit names (for example: mm, millimeter), the sum symbol (Σ), etc. When splitting each line with the steel specification as the new node, the text document is scanned line by line for strings with these characteristics, which are preliminarily identified as steel specification strings, and the line is split at the point immediately before each subsequent steel specification string begins. For example, take the data source row Bengang Q235B 2.5*1250=3650, 2.7*1250/1500 HPCC 3630. Scanning and analyzing this row with the steel spot exhaustive vocabulary identifies 2.5*1250 as one steel specification string and 2.7*1250/1500 as another, so the data from 2.7*1250/1500 onward is split from the original row as a new row.
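A minimal sketch of this splitting rule, assuming a specification can be recognized by a regular expression alone: groups of numbers joined by /, \ or -, with at least one * between groups. The pattern SPEC_RE and the function split_at_specs are hypothetical names, and the sketch ignores unit names and the Σ symbol mentioned above.

```python
import re

NUM = r"\d+(?:\.\d+)?"                      # 1250 or 2.5
GROUP = rf"{NUM}(?:[/\\-]{NUM})*"           # 1250/1500, 1250-1445, 3.5/3.7
SPEC_RE = re.compile(rf"{GROUP}(?:\*{GROUP})+")   # groups joined by "*"


def split_at_specs(line):
    """Split a row just before every specification string after the first."""
    starts = [m.start() for m in SPEC_RE.finditer(line)]
    if len(starts) <= 1:
        return [line.strip()]
    cuts = [0] + starts[1:] + [len(line)]
    # Strip the separators (spaces, commas, tabs) left at the segment edges.
    return [line[a:b].strip(" ,\t") for a, b in zip(cuts, cuts[1:])]


row = "Bengang Q235B 2.5*1250=3650, 2.7*1250/1500 HPCC 3630"
print(split_at_specs(row))
# ['Bengang Q235B 2.5*1250=3650', '2.7*1250/1500 HPCC 3630']
```

Requiring a * keeps bare prices such as 3650 from being taken for specifications; a production rule set would likely need the vocabulary lookup as well.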
[0017] To avoid misreading the data, the steel spot exhaustive vocabulary and a digit-to-Chinese-character code table can be loaded before splitting. The product name, material, steel mill, and warehouse are parsed, and the digits they contain are converted into Chinese characters, so that they are not mistaken for steel specifications and do not cause the split to fail. In the code table, each Arabic digit corresponds to one Chinese uppercase numeral; that is, "0123456789" corresponds to "零壹贰叁肆伍陆柒捌玖". For example, for the data 409L/2D, the steel spot exhaustive vocabulary identifies it as a steel material, and the code table then converts 409L/2D to 肆零玖L/贰D, which avoids misreading it when steel specifications are parsed. After parsing and splitting are complete, these Chinese characters are converted back into digits for the convenience of users.
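The digit masking can be illustrated with a character translation table. The sketch assumes the code table maps the ten digits to the Chinese uppercase numerals 零壹贰叁肆伍陆柒捌玖 and that vocabulary terms are matched by plain substring search; mask_terms, unmask, and the two-entry vocabulary are hypothetical.

```python
UPPERCASE_NUMERALS = "零壹贰叁肆伍陆柒捌玖"
TO_CHINESE = str.maketrans("0123456789", UPPERCASE_NUMERALS)
TO_DIGITS = str.maketrans(UPPERCASE_NUMERALS, "0123456789")


def mask_terms(line, vocabulary):
    """Replace digits inside known vocabulary terms with Chinese numerals."""
    # Longest terms first, so a long term is masked before any term it contains.
    for term in sorted(vocabulary, key=len, reverse=True):
        line = line.replace(term, term.translate(TO_CHINESE))
    return line


def unmask(line):
    """Convert the Chinese numerals back to digits after splitting."""
    return line.translate(TO_DIGITS)


vocabulary = {"Q235B", "409L/2D"}   # assumption: a tiny stand-in vocabulary
masked = mask_terms("Zhongtian 409L/2D 0.5*1250-1445 4750-4900", vocabulary)
print(masked)            # Zhongtian 肆零玖L/贰D 0.5*1250-1445 4750-4900
print(unmask(masked))    # Zhongtian 409L/2D 0.5*1250-1445 4750-4900
```

Plain substring replacement is the simplest choice here; it would also mask a term that happens to occur inside a longer token, so a real implementation would probably match on token boundaries.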
[0018] To ensure the integrity of each row of data after splitting, as a preferred embodiment, the present invention further defines global variables and brings them into the corresponding lower layer after each row is split. The global variables include at least one of product name, material, steel mill, and warehouse. That is, when a row contains data such as product name, material, steel mill, or warehouse, these data are brought into the lower layer as global variables to ensure the integrity of each row of data after splitting.
[0019] The defined global variables can be brought into the corresponding lower layer during the splitting in step S14, or after the splitting is complete. The local variables of each row have higher priority than the global variables, so that no cross-row carry-in occurs when the global variables are brought into the lower layer. That is, the global variables of a line are brought only into the lower layer split from that line; when parsing moves to the next line, the global variables of that line are obtained for subsequent carry-in.
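One way to picture the carry-in is shown below, under several assumptions that the text does not settle: the global variables are taken from the first segment of a source row, fields are recognized by exact token lookup in a small stand-in vocabulary (VOCAB), and the segments are shown with ordinary digits rather than masked ones. The names find_fields and carry_globals are hypothetical.

```python
# Assumption: a tiny vocabulary keyed by field; insertion order sets output order.
VOCAB = {
    "mill": {"Bengang", "Zhongtian"},
    "material": {"Q235B", "409L/2D"},
}


def find_fields(segment, vocab):
    """Return the first vocabulary hit per field found in a segment."""
    found = {}
    for token in segment.split():
        for field, terms in vocab.items():
            if token in terms and field not in found:
                found[field] = token
    return found


def carry_globals(segments, vocab=VOCAB):
    """Prepend the row's global fields to the segments split from that row.

    `segments` must come from ONE source row, so nothing crosses rows.
    A field already present in a segment (a local variable) is kept as is.
    """
    if not segments:
        return []
    row_globals = find_fields(segments[0], vocab)
    result = [segments[0]]
    for segment in segments[1:]:
        local = find_fields(segment, vocab)
        inherited = [row_globals[f] for f in vocab
                     if f in row_globals and f not in local]
        result.append(" ".join(inherited + [segment]))
    return result


print(carry_globals(["Bengang Q235B 2.5*1250=3650", "2.7*1250/1500 HPCC 3630"]))
# ['Bengang Q235B 2.5*1250=3650', 'Bengang Q235B 2.7*1250/1500 HPCC 3630']
```

Because the function is called once per source row, the globals of one row never reach the segments of another, which is the no-cross-row behavior described above.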
[0020] S16: Analyze the steel spot data collection, and decompose the data containing parallel information into multiple pieces.
[0021] The steel spot data set is obtained through step S14. Because different data sources use different data formats (for example, the specification may be 0.4*315, 0.4*295/305/315/355, or 0.5*1250-1445, and the price may be 4030 or 4750-4900), each row in the set may contain multiple pieces of parallel information, which need to be split further.
[0022] As a preferred embodiment, the present invention further decomposes steel specification and/or steel price data containing parallel information into multiple pieces according to the correspondence between steel specifications and steel prices; that is, this split mainly targets specifications and prices. For example, the original string 0.4*295/305 4030 is split into 0.4*295 4030 and 0.4*305 4030, and the original string 0.5*1250-1445 4750-4900 is split into 0.5*1250 4750 and 0.5*1445 4900.
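The two examples above suggest a pairing rule that can be sketched as follows. The rule itself is an assumption inferred from those examples: alternatives inside a specification are separated by /, \ or -, a single price is shared by every resulting specification, and several prices are paired with the specifications by position. The function name expand_parallel is hypothetical.

```python
import re
from itertools import product

ALTERNATIVES = re.compile(r"[/\\-]")   # separators between parallel values


def expand_parallel(spec, price):
    """Expand one (spec, price) pair into one pair per parallel value."""
    # "0.4*295/305" -> [["0.4"], ["295", "305"]] -> "0.4*295", "0.4*305"
    dimensions = [ALTERNATIVES.split(part) for part in spec.split("*")]
    specs = ["*".join(combo) for combo in product(*dimensions)]
    prices = ALTERNATIVES.split(price)
    if len(prices) == 1:
        prices = prices * len(specs)        # one price applies to every spec
    if len(prices) != len(specs):
        raise ValueError(f"cannot pair {len(specs)} specs with {len(prices)} prices")
    return list(zip(specs, prices))


print(expand_parallel("0.4*295/305", "4030"))
# [('0.4*295', '4030'), ('0.4*305', '4030')]
print(expand_parallel("0.5*1250-1445", "4750-4900"))
# [('0.5*1250', '4750'), ('0.5*1445', '4900')]
```

Following the examples, a dash is treated as separating two discrete values rather than as a continuous range; rows whose price count matches neither rule are rejected instead of guessed.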
[0023] S18: Clean the parsed data to obtain complete data information and store it in the database.
[0024] Data cleaning removes invalid data from the results, such as duplicate data, obviously abnormal prices, non-existent suppliers, and non-existent models. The data can be cleaned by setting filtering rules; this is prior art and is not described further here.
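Since the filtering rules are left to prior art, the following is only an illustrative sketch of what such rules might look like. The record layout (dictionaries with mill, material, spec, and a numeric price), the price bounds, and the function name clean_records are all assumptions.

```python
def clean_records(records, known_mills, known_materials, price_range=(1000, 20000)):
    """Drop duplicates, unknown mills or materials, and out-of-range prices."""
    low, high = price_range      # assumption: illustrative bounds only
    seen = set()
    kept = []
    for record in records:
        key = (record["mill"], record["material"], record["spec"], record["price"])
        if key in seen:
            continue                                   # duplicate data
        if record["mill"] not in known_mills:
            continue                                   # non-existent supplier
        if record["material"] not in known_materials:
            continue                                   # non-existent model
        if not low <= record["price"] <= high:
            continue                                   # obviously abnormal price
        seen.add(key)
        kept.append(record)
    return kept
```

Each rule is a separate check, so rules can be added or removed without affecting the others.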
[0025] The steel spot exhaustive vocabulary allows the data to be parsed quickly and the data area to be delimited effectively. In actual measurement, the data analysis method of the present invention increased the effective data conversion rate of the original resource documents by about 70%, a substantial improvement.
[0026] An embodiment of the present invention is given below to further explain the data analysis method of the present invention.
[0027] Assume that the original document obtained contains two rows of spot resource data as shown below:
[0028] Bengang Q235B 2.5*1250=3650, 2.7*1250/1500 HPCC 3630 3.5/3.7/3.75/3.95*1250 3550A
[0029] Zhongtian 409L/2D 0.5*1250-1445 4750-4900.
[0030] Load the steel spot exhaustive vocabulary and the number corresponding to Chinese character codes, analyze the product name, material, steel plant, warehouse, etc., and perform digital conversion of the product name, material, steel plant, and warehouse into Chinese:
[0031] Benxi Iron and Steel Q II Sanwu B 2.5*1250=3650,2.7*1250/1500 HPCC 3630 3.5/3.7/3.75/3.95*1250 3550A
[0032] Zhongtian Si Lingjiu L/II D 0.5*1250-1445 4750-4900.
[0033] Exhaustive thesaurus of steel stocks is used, and each row is split with steel specifications as the new node to obtain the steel stock data collection:
[0034] Benxi Iron and Steel Q II Sanwu B 2.5*1250=3650
[0035] 2.7*1250/1500 HPCC 3630
[0036] 3.5/3.7/3.75/3.95*1250 3550A
[0037] Zhongtian Si Lingjiu L/II D 0.5*1250-1445 4750-4900.
[0038] Bring the global variables into the whole world. Bengang Q2 and 3B to the next level, because the local variables in the row of Zhongtian Silingjiu L/ⅡD (ie Zhongtian Silingjiu L/ⅡD) have higher priority than global variables The priority of Benxi Iron and Steel Q2 and Sanwu B, so that in Benxi Iron and Steel Q2 and Sanwu B will not be brought into the row of Zhongtian Si Lingjiu L/II D, correspondingly:
[0039] Benxi Iron and Steel Q II Sanwu B 2.5*1250=3650
[0040] Benxi Iron and Steel Q II 3 B 2.7*1250/1500 HPCC 3630
[0041] Bengang Q II Sanwu B 3.5/3.7/3.75/3.95*1250 3550A
[0042] Zhongtian Si Lingjiu L/II D 0.5*1250-1445 4750-4900.
[0043] Re-split the steel specifications and/or steel price data containing parallel information to obtain:
[0044] Benxi Iron and Steel Q II Sanwu B 2.5*1250=3650
[0045] Bengang Q II Sanwu B 2.7*1250 HPCC 3630
[0046] Benxi Iron and Steel Q II Sanwu B 2.7*1500 HPCC 3630
[0047] Benxi Iron and Steel Q II Sanwu B 3.5*1250 3550A
[0048] Benxi Iron and Steel Q II Sanwu B 3.7*1250 3550A
[0049] Benxi Iron and Steel Q II Sanwu B 3.75*1250 3550A
[0050] Benxi Iron and Steel Q II Sanwu B 3.95*1250 3550A
[0051] Zhongtian Si Lingjiu L/II D 0.5*1250 4750
[0052] Zhongtian Si Lingjiu L/II D 0.5*1445 4900.
[0053] Utilize the exhaustive vocabulary of steel stocks and the corresponding Chinese character codes of numbers to convert the Chinese of product name, material, steel mill, warehouse, etc. into numbers accordingly:
[0054] Bengang Q235B 2.5*1250=3650
[0055] Bengang Q235B 2.7*1250 HPCC 3630
[0056] Bengang Q235B 2.7*1500 HPCC 3630
[0057] Bengang Q235B 3.5*1250 3550A
[0058] Bengang Q235B 3.7*1250 3550A
[0059] Bengang Q235B 3.75*1250 3550A
[0060] Bengang Q235B 3.95*1250 3550A
[0061] Zhongtian 409L/2D 0.5*1250 4750
[0062] Zhongtian 409L/2D 0.5*1445 4900.
[0063] At this point, a regular data table that meets the requirements of the steel trade industry website is obtained; invalid data in the result can then be removed and the table stored in the database.
[0064] Referring to Figure 2, a schematic structural diagram of the data analysis system for spot resources in the steel trade industry according to the present invention, the system includes a document acquisition unit 22, a splitting unit 24, a parsing unit 26, and a data cleaning unit 28. Detailed explanations are given below.
[0065] The document acquisition unit 22 is configured to obtain a text document containing steel spot resources. The obtained documents may include Word documents in .doc or .docx format and text documents in .txt format. Text documents can be analyzed directly, whereas Word documents must first be converted into text documents. Therefore, as a preferred embodiment, the system of the present invention further includes a judging unit 21 for judging whether the obtained document containing the steel spot resources is a Word document and, if so, loading a Word document parsing program and converting the obtained Word document into a text document, so as to unify the document format.
[0066] The splitting unit 24 is connected to the document acquisition unit 22 and is configured to load the steel spot exhaustive vocabulary, split each line of the text document with steel specifications as a new node, and obtain a steel spot data set. The steel spot exhaustive vocabulary records the product name, material, steel mill, specification, thickness, width, warehouse, etc. of steel products; with it, the specific information represented by the data in each row of the obtained text document can be parsed. Steel specifications in the steel trade industry are expressed according to certain rules and generally contain the following characters: numbers, asterisks (*), slashes (/), backslashes (\), dashes (-), unit names (for example: mm, millimeter), the sum symbol (Σ), etc. When splitting each line with the steel specification as the new node, the text document is scanned line by line for strings with these characteristics, which are preliminarily identified as steel specification strings, and the line is split at the point immediately before each subsequent steel specification string begins. For example, take the data source row Bengang Q235B 2.5*1250=3650, 2.7*1250/1500 HPCC 3630. Scanning and analyzing this row with the steel spot exhaustive vocabulary identifies 2.5*1250 as one steel specification string and 2.7*1250/1500 as another, so the data from 2.7*1250/1500 onward is split from the original row as a new row.
[0067] To avoid misreading the data, the system further includes a conversion processing unit 23, which is connected to the document acquisition unit 22 and is used to load the steel spot exhaustive vocabulary and the digit-to-Chinese-character code table and to convert the digits in the product names, materials, steel mills, and warehouses contained in the text document into Chinese characters. That is, before splitting, the product name, material, steel mill, and warehouse are parsed and their digits converted into Chinese characters, so that they are not mistaken for steel specifications and do not cause the split to fail. In the code table, each Arabic digit corresponds to one Chinese uppercase numeral; that is, "0123456789" corresponds to "零壹贰叁肆伍陆柒捌玖". For example, for the data 409L/2D, the steel spot exhaustive vocabulary identifies it as a steel material, and the code table then converts 409L/2D to 肆零玖L/贰D, which avoids misreading it when steel specifications are parsed. After parsing and splitting are complete, these Chinese characters are converted back into digits for the convenience of users.
[0068] To ensure the integrity of each row of data after splitting, as a preferred embodiment, the splitting unit 24 further includes a global variable definition module 241. The global variable definition module 241 is used to define the parsed global variables and bring them into the corresponding lower layer after each row is split, wherein the local variables of each row have higher priority than the global variables, and the global variables include at least one of product name, material, steel mill, and warehouse. That is, when a row contains data such as product name, material, steel mill, or warehouse, these data are brought into the lower layer as global variables to ensure the integrity of each row of data after splitting. The defined global variables can be brought into the corresponding lower layer during the splitting, or after the splitting is complete. Because local variables take priority, no cross-row carry-in occurs: the global variables of a line are brought only into the lower layer split from that line, and when parsing moves to the next line, the global variables of that line are obtained for subsequent carry-in.
[0069] The parsing unit 26 is connected to the splitting unit 24 and is used to analyze the steel spot data set and decompose the data containing parallel information into multiple pieces.
[0070] Because different data sources use different data formats (for example, the specification may be 0.4*315, 0.4*295/305/315/355, or 0.5*1250-1445, and the price may be 4030 or 4750-4900), each row in the steel spot data set obtained by the system may contain multiple pieces of parallel information, which need to be split further.
[0071] As a preferred embodiment, the parsing unit 26 of the present invention is further used for decomposing steel specification and/or steel price data containing parallel information into multiple pieces according to the correspondence between steel specifications and steel prices; that is, this split mainly targets specifications and prices. For example, the original string 0.4*295/305 4030 is split into 0.4*295 4030 and 0.4*305 4030, and the original string 0.5*1250-1445 4750-4900 is split into 0.5*1250 4750 and 0.5*1445 4900.
[0072] The data cleaning unit 28 is connected to the parsing unit 26 and is used to clean the parsed data to obtain complete data information and store it in the database. Data cleaning removes invalid data from the results, such as duplicate data, obviously abnormal prices, non-existent suppliers, and non-existent models. The data can be cleaned by setting filtering rules; this is prior art and is not described further here.
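The chain of units 22, 24, 26, and 28 described above could be wired together roughly as below. This is a structural sketch only: the class AnalysisSystem and its field names are hypothetical, each unit is reduced to a callable supplied by the caller, and storage in the database is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AnalysisSystem:
    acquire: Callable[[str], List[str]]        # unit 22: source -> text lines
    split: Callable[[str], List[str]]          # unit 24: line -> segments
    parse: Callable[[str], List[str]]          # unit 26: segment -> expanded rows
    clean: Callable[[List[str]], List[str]]    # unit 28: rows -> valid rows

    def run(self, source):
        """Pass every line through splitting and parsing, then clean once."""
        rows = []
        for line in self.acquire(source):
            for segment in self.split(line):
                rows.extend(self.parse(segment))
        return self.clean(rows)


# Usage with trivial stand-in units:
system = AnalysisSystem(
    acquire=lambda source: source.splitlines(),
    split=lambda line: [part.strip() for part in line.split(",")],
    parse=lambda segment: [segment],
    clean=lambda rows: list(dict.fromkeys(rows)),   # drop duplicates, keep order
)
print(system.run("a, b\nb, c"))   # ['a', 'b', 'c']
```

Passing the units in as callables mirrors the connections between the units in Figure 2 and lets each one be replaced or tested on its own.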
[0073] The above are only the preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
