Building data cleaning and merging method and device and storage medium

A building data cleaning and merging technology, applied in the field of data cleaning and merging. It addresses problems such as inefficient merging rules and unvalidated building data that hinder later use of the data, so as to achieve accurate deduplication, ensure data validity, and improve cleaning efficiency.

Pending Publication Date: 2022-04-26
广州探迹科技有限公司
0 Cites 0 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0003] (1) The same building has different names and addresses in each source, and the validity of the data has not been ensured through data cleaning and data merging;
[0004] (2) Ordinary data cleaning and merging rules ...

Abstract

The invention discloses a building data cleaning and merging method, device and storage medium. The method comprises the following steps: obtaining multiple pieces of POI information corresponding to each building name according to the building names in the second building data; obtaining multiple pieces of POI information corresponding to each building address according to the building addresses in the second building data; combining the POI information corresponding to the building name and the POI information corresponding to the building address of the same building to obtain a POI information set for each building; selecting one piece of POI information from the POI information set of each building as the first POI information of that building; and judging, according to the first POI information of each building, whether any two buildings are the same building, and deduplicating the building data judged to be the same building in the second building data to obtain third building data. This technical scheme merges building data efficiently and greatly reduces redundant data.

Application Domain

Data processing applications; Geographical information databases; +1 more

Technology Topic

Systems engineering; Database; +1 more

Example Embodiment

[0057] The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
[0058] As shown in Figure 1, an embodiment of the present invention provides a building data cleaning and merging method, comprising the following steps:
[0059] Step S101: Obtain the original building data, which includes the building name data and the building address data of each building, and establish a mapping relationship between the building name data and the building address data of the same building.
[0060] Step S102: Filter the building data according to the building name data to obtain the first building data; pre-process the building address data in the first building data to obtain the second building data.
[0061] As one embodiment, step S102 comprises the following sub-steps:
[0062] Sub-step S1021: Filter the building data according to the building name data, specifically:
[0063] Delete the building data whose building name contains "main entrance", "front door", "back door", "left door", "right door", "side door", "east gate", "south gate", "west gate", "north gate" or "number gate";
[0064] Delete the building data whose building name contains "parking";
[0065] Delete the building data whose building name contains "entrance", "exit" or "entrance";
[0066] Delete the building data whose building name contains "Investment Center" or "Community";
[0067] Delete the building data whose building name contains "shop" but does not contain "hotel", to obtain the first building data.
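The name filter of sub-step S1021 can be sketched as a simple keyword check. This is only an illustration: the keyword strings below are copied from the translated text (the original keywords are presumably Chinese), and the record layout (a dict with a `name` key) and the function names are assumptions, not part of the source.

```python
from typing import Dict, List

# Keywords transcribed from the translated paragraphs [0063]-[0066].
# The original keyword strings are presumably Chinese; treat these as placeholders.
EXCLUDED_WORDS = (
    "main entrance", "front door", "back door", "left door", "right door",
    "side door", "east gate", "south gate", "west gate", "north gate",
    "number gate", "parking", "entrance", "exit",
    "Investment Center", "Community",
)


def keep_building(name: str) -> bool:
    """Return True if a building name survives the S1021 filter."""
    if any(word in name for word in EXCLUDED_WORDS):
        return False
    # Paragraph [0067]: drop names containing "shop" unless they also contain "hotel".
    if "shop" in name and "hotel" not in name:
        return False
    return True


def filter_buildings(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose 'name' field passes the filter (hypothetical layout)."""
    return [r for r in records if keep_building(r["name"])]
```

Matching is by plain substring, which is the most literal reading of "contains" in the text; a production filter might need case or script normalization.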
[0068] Sub-step S1022: Pre-process the building address data in the first building data, specifically:
[0069] Remove the special symbols from the building address; the special symbols include spaces, double quotation marks and single quotation marks;
[0070] Remove the brackets in the building address together with the content inside them;
[0071] Split the building address at its commas to obtain several sub-addresses, retain one sub-address selected according to a preset selection logic, and delete the other sub-addresses;
[0072] Remove the digits after the "number" word in the building address together with the address information after those digits, and remove the letters after the "number" word together with the address information after those letters; other characters may appear between the "number" word and the digits or letters that follow it;
[0073] Remove the address information after the "number" word and the address information after "self-editing" in the building address, where the "number" word and "self-editing" are not separated by other characters; the second building data is thereby obtained.
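The first three pre-processing rules of sub-step S1022 can be sketched as follows. This is a minimal illustration under stated assumptions: the "preset selection logic" is not specified in the text, so `pick_sub_address` simply keeps the longest sub-address as a placeholder; handling of full-width brackets and commas is also an assumption. The rules built around the "number" word are omitted because their exact patterns depend on the original-language text.

```python
import re
from typing import List

# Bracketed content in ASCII or full-width brackets (assumed variants).
_BRACKETED = re.compile(r"\([^()]*\)|（[^（）]*）")
# ASCII or full-width comma (assumed variants).
_COMMA = re.compile(r"[,，]")


def pick_sub_address(parts: List[str]) -> str:
    """Hypothetical stand-in for the unspecified 'preset selection logic'."""
    return max(parts, key=len)


def preprocess_address(address: str) -> str:
    # Rule [0069]: remove spaces, double quotation marks and single quotation marks.
    for ch in (" ", '"', "'"):
        address = address.replace(ch, "")
    # Rule [0070]: remove brackets together with their content.
    address = _BRACKETED.sub("", address)
    # Rule [0071]: split at commas and keep one sub-address.
    parts = [p for p in _COMMA.split(address) if p]
    return pick_sub_address(parts) if parts else ""
```

The order of the rules follows the order in which the text lists them; removing brackets before splitting avoids splitting on commas that sit inside bracketed remarks.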
[0074] Step S103: According to the building names in the second building data, obtain multiple pieces of POI information corresponding to each building name; according to the building addresses in the second building data, obtain multiple pieces of POI information corresponding to each building address.
[0075] As one embodiment, Baidu Map is queried with a building name from the second building data to obtain multiple pieces of POI information corresponding to that building name, and the first 10 pieces are taken; Baidu Map is then queried with the building address of the same building from the second building data to obtain multiple pieces of POI information corresponding to that address, and the first 10 pieces are taken and merged with the POI information corresponding to the building name to obtain the POI information set of the current building.
[0076] Step S104: According to the mapping relationship between the building name and the building address, combine the multiple pieces of POI information corresponding to the building name and the multiple pieces of POI information corresponding to the building address of the same building to obtain the POI information set of each building.
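Steps S103 and S104 amount to two lookups per building followed by a merge. The sketch below assumes a caller-supplied `search` function that returns a ranked list of POI dicts for a query string; it is a hypothetical placeholder and does not represent any real map service API. The limit of 10 results per query comes from paragraph [0075].

```python
from typing import Callable, Dict, List

Poi = Dict[str, object]


def build_poi_set(
    name: str,
    address: str,
    search: Callable[[str], List[Poi]],
    top_n: int = 10,
) -> List[Poi]:
    """Merge the top results of a name query and an address query."""
    by_name = search(name)[:top_n]        # first 10 results for the building name
    by_address = search(address)[:top_n]  # first 10 results for the building address
    return by_name + by_address           # POI information set of this building
```

The text does not say whether POIs returned by both queries are collapsed, so the sketch keeps them as-is.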
[0077] One piece of POI information includes the following:
[0078]
[0079] Here, "city": "Dongguan City" is the city name of the POI, and "poi_type": "company enterprise; company" is the POI type.
[0080] Step S105: After deleting the POI information that does not meet the first preset conditions from the POI information set of each building, select one piece of POI information from the POI information set of each building as the first POI information of that building.
[0081] As one embodiment, step S105 comprises the following sub-steps:
[0082] Sub-step S1051: Delete the POI information that does not meet the first preset conditions from the POI information set of each building, specifically:
[0083] Delete the POI information in the POI information set that does not contain a POI city name or a POI type;
[0084] Delete the POI information in the POI information set whose POI city name is inconsistent with the city name of the corresponding building in the building data;
[0085] Delete the POI information in the POI information set whose POI type is not "real estate", "corporate business; park", "shopping; shopping mall", "hotel; star hotel", "corporate business; other" or "shopping; shopping mall".
[0086] After the deletion operation of sub-step S1051, if the number of pieces of POI information for a building is 0, the building information is deleted; the building information includes the building name, the building address and other information related to the building.
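The three deletion rules of sub-step S1051 can be sketched as a single pass over a building's POI set. The keys `city` and `poi_type` follow the field names quoted in paragraph [0079]; the type strings are copied from the translated text and may differ from the original values, and the function name is hypothetical.

```python
from typing import Dict, List

# Allowed POI types transcribed from the translated paragraph [0085].
ALLOWED_TYPES = {
    "real estate",
    "corporate business; park",
    "shopping; shopping mall",
    "hotel; star hotel",
    "corporate business; other",
}


def filter_pois(pois: List[Dict[str, str]], building_city: str) -> List[Dict[str, str]]:
    """Keep POIs that have a city and type, match the building's city, and have an allowed type."""
    kept = []
    for poi in pois:
        city, poi_type = poi.get("city"), poi.get("poi_type")
        if not city or not poi_type:
            continue  # rule [0083]: missing city name or type
        if city != building_city:
            continue  # rule [0084]: city mismatch
        if poi_type not in ALLOWED_TYPES:
            continue  # rule [0085]: type not in the allowed list
        kept.append(poi)
    return kept
```

An empty result corresponds to the case in paragraph [0086], where the building itself is dropped.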
[0087] Sub-step S1052: Obtain the first latitude and longitude of the current building from the second building data, and calculate the first latitude-longitude distance score between the first latitude and longitude and the second latitude and longitude of each piece of POI information in the current building's POI information set. Specifically, calculate the first distance d between the first latitude and longitude and the second latitude and longitude; if d exceeds 2 kilometers, delete the POI information; if d does not exceed 2 kilometers, the first latitude-longitude distance score = (1 - d / 2000), with d expressed in meters.
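The distance score can be illustrated with a short sketch. The text does not say how the distance d is computed from two coordinate pairs; the haversine great-circle formula with a mean Earth radius is used here purely as an assumption. The 2 km cutoff and the formula 1 - d / 2000 come from the paragraph above.

```python
import math
from typing import Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_score(d_m: float, limit_m: float = 2000.0) -> Optional[float]:
    """Return 1 - d/2000 for d within 2 km, or None when the POI should be deleted."""
    if d_m > limit_m:
        return None
    return 1.0 - d_m / limit_m
```

A POI at the building's own coordinates scores 1, one at exactly 2 km scores 0, and anything farther is discarded.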
[0088] Calculate the first edit distance similarity between the building name of the current building and the name of each piece of POI information in the current building's POI information set;
[0089] Calculate the second edit distance similarity between the building address of the current building and the address of each piece of POI information in the current building's POI information set;
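An edit distance similarity between two strings can be sketched as below. The text does not state how an edit distance is turned into a similarity; normalizing the Levenshtein distance by the longer string's length, so that the result lies in [0, 1], is an assumption made for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance using two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 for identical strings, 0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this normalization, each of the three components of the sum score lies between 0 and 1, which is consistent with the thresholds of 2 and 2.5 quoted elsewhere in the text.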
[0090] Calculate the first sum similarity score of each piece of POI information in the current building's POI information set, and select the POI information with the highest first sum similarity score as the first POI information of the current building; the first sum similarity score is the sum of the first latitude-longitude distance score, the first edit distance similarity and the second edit distance similarity.
[0091] When the first sum similarity score is less than the third preset threshold, the POI information corresponding to that score is deleted from the POI information set. Preferably, POI information whose first sum similarity score is less than 2 is deleted, and among the POI information whose score is greater than or equal to 2, the one with the highest first sum similarity score is selected as the first POI information of the current building.
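The selection rule can be sketched as follows, assuming the three component scores have already been computed for each candidate. The tuple layout and the function name are hypothetical; the threshold of 2 is the preferred value given in the text.

```python
from typing import Dict, List, Optional, Tuple

# (poi, distance score, name similarity, address similarity): hypothetical layout
ScoredPoi = Tuple[Dict[str, object], float, float, float]


def select_first_poi(scored: List[ScoredPoi], threshold: float = 2.0) -> Optional[Dict[str, object]]:
    """Return the POI with the highest sum score among those scoring at least `threshold`."""
    totals = [(dist + name_sim + addr_sim, poi) for poi, dist, name_sim, addr_sim in scored]
    kept = [t for t in totals if t[0] >= threshold]  # drop candidates below the threshold
    if not kept:
        return None  # no candidate qualifies as the first POI information
    return max(kept, key=lambda t: t[0])[1]
```

For example, candidates with sum scores 2.4, 1.5 and 2.9 yield the third one, while a set in which every candidate scores below 2 yields no first POI information.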
[0092] Step S106: According to the first POI information of each building, determine whether any two buildings are the same building, and deduplicate the building data judged to be the same building in the second building data to obtain the third building data.
[0093] As one embodiment, determining whether any two buildings are the same building according to the first POI information of each building is specifically:
[0094] Calculate the latitude-longitude distance s between the first POI information of the two buildings. When s is greater than the first preset threshold, the two buildings are judged not to be the same building; when s is less than or equal to the first preset threshold, calculate the second latitude-longitude distance score between the first POI information of the two buildings; preferably, the second latitude-longitude distance score = (1 - s / 2000).
[0095] Calculate the first edit distance similarity between the building name of the current building and the name of each piece of POI information in the current building's POI information set;
[0096] Calculate the third edit distance similarity between the building names in the first POI information of the two buildings;
[0097] Calculate the fourth edit distance similarity between the addresses in the first POI information of the two buildings;
[0098] Calculate the second sum similarity score between the first POI information of the two buildings. When the second sum similarity score is greater than the second preset threshold, the two buildings are judged to be the same building; when it is less than or equal to the second preset threshold, they are judged not to be the same building. The second sum similarity score is the sum of the second latitude-longitude distance score, the third edit distance similarity and the fourth edit distance similarity. Preferably, the second preset threshold is 2.5.
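The pairwise decision can be sketched as below, assuming the distance s (in meters) and the two edit distance similarities have already been computed. The text gives no value for the first preset threshold; the default of 2000 m is an assumption chosen to match the divisor in the score formula. The second threshold of 2.5 is the preferred value from the text.

```python
def is_same_building(
    s_m: float,
    name_similarity: float,
    address_similarity: float,
    first_threshold_m: float = 2000.0,  # assumed value; not specified in the text
    second_threshold: float = 2.5,      # preferred value from paragraph [0098]
) -> bool:
    """Decide whether two buildings are the same from their first POI information."""
    if s_m > first_threshold_m:
        return False  # too far apart: not the same building
    distance_score = 1.0 - s_m / 2000.0
    total = distance_score + name_similarity + address_similarity
    return total > second_threshold  # strictly greater, as stated in the text
```

Two POIs 100 m apart with name and address similarities of 0.9 each give 0.95 + 0.9 + 0.9 = 2.75, which exceeds 2.5, so the buildings are treated as the same.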
[0099] As shown in Figure 2, another embodiment of the present invention provides a building data cleaning and merging device, comprising a building data acquisition module, a building data screening module, a POI information acquisition module, a POI information calculation module and a building data deduplication module;
[0100] The building data acquisition module is used to obtain the original building data, which includes the building name data and the building address data of each building, and to establish a mapping relationship between the building name data and the building address data of the same building;
[0101] The building data screening module is used to filter the building data according to the building name data to obtain the first building data, and to pre-process the building address data in the first building data to obtain the second building data;
[0102] The POI information acquisition module is used to obtain multiple pieces of POI information corresponding to each building name according to the building names in the second building data; to obtain multiple pieces of POI information corresponding to each building address according to the building addresses in the second building data; and, according to the mapping relationship between the building name and the building address, to combine the POI information corresponding to the building name and the POI information corresponding to the building address of the same building to obtain the POI information set of each building;
[0103] The POI information calculation module is used to delete the POI information that does not meet the first preset conditions from the POI information set of each building, and to select one piece of POI information from the POI information set of each building as the first POI information of that building;
[0104] The building data deduplication module is used to determine whether any two buildings are the same building according to the first POI information of each building, and to deduplicate the building data judged to be the same building in the second building data to obtain the third building data.
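One way the deduplication module could turn pairwise "same building" judgments into a deduplicated list is sketched below. The text does not describe how matched records are grouped or which record is kept; using a union-find structure so that matches are treated transitively, and keeping the first record of each group, are both assumptions. The `same` predicate is supplied by the caller.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def deduplicate(buildings: List[T], same: Callable[[T, T], bool]) -> List[T]:
    """Group records linked by `same` and keep the earliest record of each group."""
    parent = list(range(len(buildings)))

    def find(i: int) -> int:
        # Follow parent links to the group root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    n = len(buildings)
    for i in range(n):
        for j in range(i + 1, n):
            if same(buildings[i], buildings[j]):
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Always attach the later root to the earlier one,
                    # so each root is the first record of its group.
                    parent[max(ri, rj)] = min(ri, rj)
    return [b for i, b in enumerate(buildings) if find(i) == i]
```

The all-pairs comparison is quadratic in the number of buildings; because pairs farther apart than the first preset threshold can never match, a spatial index could be used to limit the comparisons in a larger dataset.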
[0105] As one embodiment, selecting one piece of POI information from the POI information set of each building as the first POI information of that building is specifically:
[0106] Obtain the first latitude and longitude of the current building from the second building data, and calculate the first latitude-longitude distance score between the first latitude and longitude and the second latitude and longitude of each piece of POI information in the current building's POI information set;
[0107] Calculate the first edit distance similarity between the building name of the current building and the name of each piece of POI information in the current building's POI information set;
[0108] Calculate the second edit distance similarity between the building address of the current building and the address of each piece of POI information in the current building's POI information set;
[0109] Calculate the first sum similarity score of each piece of POI information in the current building's POI information set, and select the POI information with the highest first sum similarity score as the first POI information of the current building; the first sum similarity score is the sum of the first latitude-longitude distance score, the first edit distance similarity and the second edit distance similarity.
[0110] On the basis of the above method embodiments, the present invention provides a corresponding readable storage medium embodiment;
[0111] Another embodiment of the present invention provides a readable storage medium comprising a stored computer program; when the computer program is executed, the device in which the readable storage medium is located is controlled to perform the building data cleaning and merging method described in any one of the method embodiments of the present invention.
[0112] By way of example, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing particular functions, the instruction segments describing the execution of the computer program in the terminal device.
[0113] The terminal device may be a computing device such as a desktop computer, notebook, handheld computer or cloud server. The terminal device may include, but is not limited to, a processor and a memory.
[0114] The processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device and uses various interfaces and lines to connect the parts of the entire terminal device.
[0115] The memory may be used to store the computer program and/or modules. The processor implements the various functions of the terminal device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the mobile phone (such as audio data or a telephone book). Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash memory card (Flash Card), at least one disk storage device, flash memory device, or other volatile solid-state storage device.
[0116] If the modules/units integrated in the terminal device are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium (i.e., the readable storage medium described above). Based on this understanding, all or part of the flow of the above method embodiments may also be completed by a computer program instructing the relevant hardware; the computer program may be stored in a computer-readable storage medium and, when executed by the processor, may implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, read-only memory (ROM), random access memory (RAM), carrier signals, telecommunications signals, software distribution media and the like.
[0117] It should be noted that the device embodiment described above is merely schematic. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the present embodiment. Further, in the accompanying drawings of the device embodiment provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, which may be implemented as one or more communication buses or signal lines.
[0118] Those of ordinary skill in the art can understand and implement this without creative effort. The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements are also regarded as falling within the scope of protection of the present invention. Those of ordinary skill in the art will appreciate that all or part of the process of implementing the above embodiments may be completed by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above embodiments. The storage medium may be a magnetic disk, an optical disk, read-only memory (ROM), random access memory (RAM) and the like.
