Data restoration method and device and data processing equipment
1. A method of data repair, the method comprising:
acquiring a plurality of pieces of data to be processed from at least two different channels; each piece of data to be processed comprises at least two attributes and corresponding attribute values;
determining the identification attribute and the general attribute of the data to be processed;
taking each piece of data to be processed as a node to construct a weighted undirected graph, wherein connecting lines are arranged among nodes with at least one same identification attribute value, and the weight values of the connecting lines are positively correlated with the number of the identification attributes with the same attribute value among the nodes;
performing connected subgraph detection and splitting on the weighted undirected graph through a connected graph algorithm to obtain a set of connected subgraphs;
if the connected subgraph has nodes with inconsistent effective attribute values of the identification attributes, identifying the connected subgraph as an information conflict subgraph;
for each information conflict subgraph, carrying out community division on nodes in the information conflict subgraph through a community detection algorithm;
and for a pair of conflict nodes that are joined by a connecting line but located in different communities of the information conflict subgraph, taking the identification attribute corresponding to the connecting line as a conflict attribute, and modifying the attribute value of the conflict attribute in the pair of conflict nodes to remove the connecting line between the pair of conflict nodes.
2. The method of claim 1, further comprising:
for each community in each information conflict subgraph, counting the occurrence frequency of the different attribute values of each attribute, and, for a node in the community that is missing the value of an attribute, completing the attribute with the attribute value that occurs most frequently for that attribute in the community.
3. The method of claim 1, further comprising:
and performing attribute integration and supplementation on the plurality of pieces of data to be processed to ensure that each piece of data to be processed has the same number of attributes.
4. The method according to claim 1, wherein the step of releasing the connection relationship between the pair of conflict nodes by modifying the attribute value of the conflict attribute in the pair of conflict nodes comprises:
and for each conflict node in the pair of conflict nodes, counting the occurrence probability of different attribute values of the conflict attribute in the community where the conflict node is located, and replacing the attribute value of the conflict attribute in the conflict node with the attribute value with the highest occurrence probability.
5. The method according to claim 1, wherein before the step of constructing the weighted undirected graph with each piece of the to-be-processed data as a node, the method further comprises:
performing deduplication on the data to be processed that is acquired from the same channel.
6. The method of claim 1, wherein the step of determining the identification attribute and the general attribute of the data to be processed comprises:
for each attribute, the discrimination value D for that attribute is calculated by the following formula:
wherein A is the total number of pieces of data to be processed, and Dis(A) is the number of distinct valid values of the attribute after deduplication;
if the discrimination value D of the attribute is larger than a preset discrimination threshold value, identifying the attribute as an identification attribute, otherwise identifying the attribute as a general attribute.
7. The method according to claim 1, wherein the step of constructing the weighted undirected graph with each piece of the data to be processed as a node comprises:
taking each piece of data to be processed as a node, and constructing a connecting line between nodes with identification attribute values with the same attribute value;
determining the weight value w of the connecting line according to the following formula:
wherein p_k is the k-th identification attribute, and δ(p_k) is the same-value judgment function of identification attribute p_k: if the two nodes have the same value for identification attribute p_k, δ(p_k) is 1, and if the values differ, δ(p_k) is 0; β_k is the preset importance coefficient of the k-th identification attribute;
n_m is the m-th general attribute, and δ(n_m) is the same-value judgment function of general attribute n_m: if the two nodes have the same value for general attribute n_m, δ(n_m) is 1, and if the values differ, δ(n_m) is 0; λ_m is the preset importance coefficient of the m-th general attribute.
8. The method of claim 1, further comprising:
if the connected subgraph does not have a node with inconsistent effective attribute values of the identification attributes, identifying the connected subgraph as an information non-conflict subgraph;
and for each node in the information non-conflict subgraph, if the node has a missing attribute whose attribute value is absent, counting the frequency of the different attribute values of the missing attribute within the information non-conflict subgraph and using the most frequent attribute value as the attribute value of the missing attribute.
9. A data recovery apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring a plurality of pieces of data to be processed from at least two different channels; each piece of data to be processed comprises at least two attributes and corresponding attribute values;
the data sorting module is used for determining the identification attribute and the general attribute of the data to be processed;
the data restoration module is used for constructing a weighted undirected graph by taking each piece of the data to be processed as a node, wherein a connecting line is arranged between nodes with at least one same identification attribute value, and the weight value of the connecting line is positively correlated with the number of the identification attributes with the same attribute value between the nodes; performing connected subgraph detection and splitting on the weighted undirected graph through a connected graph algorithm to obtain a set of connected subgraphs; if the connected subgraph has nodes with inconsistent effective attribute values of the identification attributes, identifying the connected subgraph as an information conflict subgraph; for each information conflict subgraph, carrying out community division on nodes in the information conflict subgraph through a community detection algorithm; and aiming at a pair of conflict nodes which have connecting lines and are positioned in different communities in the information conflict subgraph, taking the corresponding identification attribute of the connecting line as a conflict attribute, and modifying the attribute value of the conflict attribute in the pair of conflict nodes to release the connecting line relation between the pair of conflict nodes.
Background
In big data processing, data acquired from a plurality of different data sources often needs to be integrated, but the quality of data acquired from different channels often differs: some data sources have good data quality and others poor. For example, wrongly filled information, missing information, and erroneous repairs may arise when people take part in data storage, copying, and repair. When data to be processed is taken from different channels, it is therefore important to identify problematic data effectively and repair it.
The traditional approach to data repair is, when a conflict in data information is found, to trace the upstream data source by manual inquiry, investigate the root cause of the error, and repair the data again. This approach works when the data was generated recently, the upstream and downstream participants can be found, and the data volume is small. In the big data era, however, data volumes are huge and some data has a long history, so it is difficult to repair data problems by having personnel make inquiries upstream and downstream.
Disclosure of Invention
In order to overcome the above-mentioned deficiencies in the prior art, the present application aims to provide a data repair method, which includes:
acquiring a plurality of pieces of data to be processed from at least two different channels; each piece of data to be processed comprises at least two attributes and corresponding attribute values;
determining the identification attribute and the general attribute of the data to be processed;
taking each piece of data to be processed as a node to construct a weighted undirected graph, wherein connecting lines are arranged among nodes with at least one same identification attribute value, and the weight values of the connecting lines are positively correlated with the number of the identification attributes with the same attribute value among the nodes;
performing connected subgraph detection and splitting on the weighted undirected graph through a connected graph algorithm to obtain a set of connected subgraphs;
if the connected subgraph has nodes with inconsistent effective attribute values of the identification attributes, identifying the connected subgraph as an information conflict subgraph;
for each information conflict subgraph, carrying out community division on nodes in the information conflict subgraph through a community detection algorithm;
and for a pair of conflict nodes that are joined by a connecting line but located in different communities of the information conflict subgraph, taking the identification attribute corresponding to the connecting line as a conflict attribute, and modifying the attribute value of the conflict attribute in the pair of conflict nodes to remove the connecting line between the pair of conflict nodes.
In one possible implementation, the method further includes:
for each community in each information conflict subgraph, counting the occurrence frequency of the different attribute values of each attribute, and, for a node in the community that is missing the value of an attribute, completing the attribute with the attribute value that occurs most frequently for that attribute in the community.
In one possible implementation, the method further includes:
and performing attribute integration and supplementation on the plurality of pieces of data to be processed to ensure that each piece of data to be processed has the same number of attributes.
In a possible implementation manner, the step of canceling the connection relationship between the pair of conflict nodes by modifying the attribute value of the conflict attribute in the pair of conflict nodes includes:
and for each conflict node in the pair of conflict nodes, counting the occurrence probability of different attribute values of the conflict attribute in the community where the conflict node is located, and replacing the attribute value of the conflict attribute in the conflict node with the attribute value with the highest occurrence probability.
In a possible implementation manner, before the step of constructing the weighted undirected graph by using each piece of the to-be-processed data as a node, the method further includes:
performing deduplication on the data to be processed that is acquired from the same channel.
In a possible implementation manner, the step of determining the identification attribute and the general attribute of the data to be processed includes:
for each attribute, the discrimination value D for that attribute is calculated by the following formula:
wherein A is the total number of pieces of data to be processed, and Dis(A) is the number of distinct valid values of the attribute after deduplication;
if the discrimination value D of the attribute is larger than a preset discrimination threshold value, identifying the attribute as an identification attribute, otherwise identifying the attribute as a general attribute.
In a possible implementation manner, the step of constructing a weighted undirected graph by using each piece of the to-be-processed data as a node includes:
taking each piece of data to be processed as a node, and constructing a connecting line between nodes with identification attribute values with the same attribute value;
determining the weight value w of the connecting line according to the following formula:
wherein p_k is the k-th identification attribute, and δ(p_k) is the same-value judgment function of identification attribute p_k: if the two nodes have the same value for identification attribute p_k, δ(p_k) is 1, and if the values differ, δ(p_k) is 0; β_k is the preset importance coefficient of the k-th identification attribute;
n_m is the m-th general attribute, and δ(n_m) is the same-value judgment function of general attribute n_m: if the two nodes have the same value for general attribute n_m, δ(n_m) is 1, and if the values differ, δ(n_m) is 0; λ_m is the preset importance coefficient of the m-th general attribute.
In one possible implementation, the method further includes:
if the connected subgraph does not have a node with inconsistent effective attribute values of the identification attributes, identifying the connected subgraph as an information non-conflict subgraph;
and for each node in the information non-conflict subgraph, if the node has a missing attribute whose attribute value is absent, counting the frequency of the different attribute values of the missing attribute within the information non-conflict subgraph and using the most frequent attribute value as the attribute value of the missing attribute.
Another object of the present application is to provide a data recovery apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a plurality of pieces of data to be processed from at least two different channels; each piece of data to be processed comprises at least two attributes and corresponding attribute values;
the data sorting module is used for determining the identification attribute and the general attribute of the data to be processed;
the data restoration module is used for constructing a weighted undirected graph by taking each piece of the data to be processed as a node, wherein a connecting line is arranged between nodes with at least one same identification attribute value, and the weight value of the connecting line is positively correlated with the number of the identification attributes with the same attribute value between the nodes; performing connected subgraph detection and splitting on the weighted undirected graph through a connected graph algorithm to obtain a set of connected subgraphs; if the connected subgraph has nodes with inconsistent effective attribute values of the identification attributes, identifying the connected subgraph as an information conflict subgraph; for each information conflict subgraph, carrying out community division on nodes in the information conflict subgraph through a community detection algorithm; and aiming at a pair of conflict nodes which have connecting lines and are positioned in different communities in the information conflict subgraph, taking the corresponding identification attribute of the connecting line as a conflict attribute, and modifying the attribute value of the conflict attribute in the pair of conflict nodes to release the connecting line relation between the pair of conflict nodes.
Another object of the present application is to provide a data processing apparatus, which includes a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and the machine-executable instructions, when executed by the processor, implement the data recovery method provided in the present application.
Another object of the present application is to provide a machine-readable storage medium, wherein the machine-readable storage medium stores machine executable instructions, which when executed by one or more processors, implement the data recovery method provided by the present application.
Compared with the prior art, the method has the following beneficial effects:
according to the data repair method, the data repair device and the data processing equipment, a weighted undirected graph is constructed from the data to be processed, the graph is split into connected subgraphs, and attributes with information conflicts are found automatically by community division, so that problematic data in heterogeneous data can be identified automatically and effectively and repaired as far as possible.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic diagram of a data recovery method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 3 is a schematic functional block diagram of a data recovery apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it is further noted that, unless expressly stated or limited otherwise, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating steps of a data recovery method according to this embodiment, and each step of the method is described in detail below.
Step S100, a plurality of pieces of data to be processed are acquired from at least two different channels. Each piece of data to be processed comprises at least two attributes and corresponding attribute values.
Because pieces of data to be processed from different channels may have different data structures, in some possible implementations, after the plurality of pieces of data to be processed are obtained, attribute integration and supplementation may first be performed on them so that each piece of data to be processed has the same number of attributes. For example, if a certain piece of data to be processed lacks an attribute X that other pieces of data to be processed have, attribute X is added to that piece of data and its attribute value is set to null (for example, represented by NULL).
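The attribute integration described above can be illustrated with a minimal Python sketch. It assumes each piece of data is held as a dictionary and that a missing value is represented by None; the function name unify_attributes and the sample attribute names are hypothetical and do not come from the embodiment.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Optional[Any]]


def unify_attributes(records: List[Record]) -> List[Record]:
    """Return copies of the records that all carry the same attribute set.

    Attributes a record lacks are added with the value None (the null value).
    """
    all_attrs: List[str] = []
    for rec in records:
        for name in rec:
            if name not in all_attrs:
                all_attrs.append(name)  # keep first-seen attribute order
    return [{name: rec.get(name) for name in all_attrs} for rec in records]


# Two pieces of data from channels with different structures (made-up values).
records = [
    {"name": "Li Lei", "phone": "13800000001"},
    {"name": "Li Lei", "id_no": "A100", "city": "Chengdu"},
]
unified = unify_attributes(records)
```

After the call, both records carry the attributes name, phone, id_no and city, with None wherever the source channel supplied no value.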
In addition, in this embodiment, a channel source attribute may also be appended to each piece of data to be processed, so as to identify the channel from which the data is sourced.
In some possible implementation manners, in order to reduce unnecessary data processing actions, before constructing the weighted undirected graph using each piece of the to-be-processed data as a node, the obtained to-be-processed data may be subjected to de-duplication processing with respect to the to-be-processed data obtained from the same channel.
For example, data to be processed from the same channel may be deduplicated on all attributes, so that among pieces with identical attribute values only one, chosen at random, is retained. In addition, if a plurality of pieces of data to be processed carry an acquisition-date attribute and are otherwise identical, only the piece with the latest date may be retained.
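A minimal sketch of this per-channel deduplication follows. It assumes, beyond what the text states, that each record stores its channel under a "channel" key, that dates are ISO-formatted strings (so string comparison orders them correctly), and that the first record seen is kept when no date decides; the names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Optional[Any]]


def dedupe_within_channel(records: List[Record], date_attr: str = "date") -> List[Record]:
    """Keep one record per distinct combination of non-date attribute values.

    The channel is itself one of the compared attributes, so records from
    different channels are never merged. Among duplicates, the record with
    the latest date is kept.
    """
    best: Dict[Tuple, Record] = {}
    for rec in records:
        key = tuple(sorted((k, repr(v)) for k, v in rec.items() if k != date_attr))
        kept = best.get(key)
        if kept is None or (rec.get(date_attr) or "") > (kept.get(date_attr) or ""):
            best[key] = rec
    return list(best.values())


rows = [
    {"channel": "A", "name": "Li Lei", "phone": "1", "date": "2020-01-01"},
    {"channel": "A", "name": "Li Lei", "phone": "1", "date": "2021-06-01"},
    {"channel": "B", "name": "Li Lei", "phone": "1", "date": "2019-01-01"},
]
deduped = dedupe_within_channel(rows)
```

Here the two channel A rows collapse into the 2021 row, while the channel B row is left alone because deduplication is applied only within a channel.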
Step S200, determining the identification attribute and the general attribute of the data to be processed, wherein the identification attribute is an attribute representing the identity information of the data to be processed.
Illustratively, an identification attribute is an attribute that can distinguish the uniqueness of the data to be processed to a certain extent, and a general attribute is an attribute that does not strongly identify the data. For example, in personal information, the phone number and the name are identification attributes, although they may not be unique (several people may share a name, and a phone number may be reassigned by the operator), whereas information such as height, age, and province of residence are general attributes, for which many pieces of data to be processed may share the same value.
In some possible implementations, the identifying attributes and the generic attributes may be automatically distinguished by calculating a distinguishing value for each attribute.
Specifically, for each attribute, the discrimination value D of the attribute is calculated by the following formula:
wherein A is the total number of pieces of data to be processed, and Dis(A) is the number of distinct valid values of the attribute after deduplication;
if the discrimination value D of the attribute is larger than a preset discrimination threshold value, identifying the attribute as an identification attribute, otherwise identifying the attribute as a general attribute.
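The formula for D is not reproduced in this text. As an illustration only, the sketch below assumes that D is the ratio Dis(A) / A, which is one reading consistent with the definitions of A and Dis(A) given above; the threshold of 0.5 and all names are likewise hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Optional[Any]]


def discrimination(records: List[Record], attr: str) -> float:
    """Assumed form D = Dis(A) / A: distinct valid values over total records."""
    total = len(records)
    distinct_valid = {rec.get(attr) for rec in records if rec.get(attr) is not None}
    return len(distinct_valid) / total if total else 0.0


def split_attributes(
    records: List[Record], attrs: List[str], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Attributes with D above the threshold are treated as identification attributes."""
    identification: List[str] = []
    general: List[str] = []
    for attr in attrs:
        if discrimination(records, attr) > threshold:
            identification.append(attr)
        else:
            general.append(attr)
    return identification, general


people = [
    {"name": "Li Lei", "phone": "1", "province": "Sichuan"},
    {"name": "Han Mei", "phone": "2", "province": "Sichuan"},
    {"name": "Wang Fang", "phone": None, "province": "Sichuan"},
    {"name": "Li Lei", "phone": "3", "province": "Hebei"},
]
id_attrs, general_attrs = split_attributes(people, ["name", "phone", "province"])
```

With this sample, name and phone each have three distinct valid values among four records and exceed the threshold, while province has only two and is classified as a general attribute.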
In other possible implementations, the identification attribute and the general attribute may be distinguished by manual identification.
Step S300, constructing a weighted undirected graph by taking each piece of data to be processed as a node, wherein a connecting line is arranged between nodes with at least one same identification attribute value, and the weight value of the connecting line is positively correlated with the number of identification attributes with the same attribute value between the nodes.
In some possible implementations, two nodes may be considered connected if an identification attribute has the same attribute value in both. For example, if two nodes have the same value for the "name" attribute, the data to be processed corresponding to the two nodes may belong to the same user. Therefore, in this embodiment, each piece of data to be processed may be used as a node, and a connecting line may be constructed between nodes that have the same attribute value for an identification attribute.
The more attributes two nodes have with the same attribute value, the closer the relationship between the corresponding pieces of data to be processed, so the weight value w of the connecting line can be determined according to the following formula:
wherein p_k is the k-th identification attribute, and δ(p_k) is the same-value judgment function of identification attribute p_k: if the two nodes have the same value for identification attribute p_k, δ(p_k) is 1, and if the values differ, δ(p_k) is 0; β_k is the preset importance coefficient of the k-th identification attribute;
n_m is the m-th general attribute, and δ(n_m) is the same-value judgment function of general attribute n_m: if the two nodes have the same value for general attribute n_m, δ(n_m) is 1, and if the values differ, δ(n_m) is 0; λ_m is the preset importance coefficient of the m-th general attribute. The coefficients β_k and λ_m, which express the importance of the different attributes, may be preset manually.
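The weight formula itself is not reproduced in this text. The sketch below assumes a weighted sum, w = Σ_k β_k·δ(p_k) + Σ_m λ_m·δ(n_m), which agrees with the symbol definitions above but is an assumption rather than the confirmed formula. It also assumes that a missing value (None) never counts as a match. The function names and sample coefficients are hypothetical.

```python
from itertools import combinations
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Optional[Any]]


def same_value(a: Record, b: Record, attr: str) -> int:
    """delta(attr): 1 if both records hold the same non-missing value, else 0."""
    va, vb = a.get(attr), b.get(attr)
    return int(va is not None and va == vb)


def build_graph(
    records: List[Record], beta: Dict[str, float], lam: Dict[str, float]
) -> Dict[Tuple[int, int], float]:
    """Return {(i, j): w} for every pair sharing at least one identification value.

    beta maps identification attributes to coefficients, lam maps general
    attributes to coefficients (assumed weighted-sum form of w).
    """
    edges: Dict[Tuple[int, int], float] = {}
    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        if not any(same_value(a, b, p) for p in beta):
            continue  # no shared identification value, so no connecting line
        w = sum(coef * same_value(a, b, p) for p, coef in beta.items())
        w += sum(coef * same_value(a, b, n) for n, coef in lam.items())
        edges[(i, j)] = w
    return edges


recs = [
    {"name": "Li Lei", "phone": "1", "city": "Chengdu"},
    {"name": "Li Lei", "phone": "1", "city": "Chengdu"},
    {"name": "Li Lei", "phone": "2", "city": "Beijing"},
    {"name": "Han Mei", "phone": "3", "city": "Chengdu"},
]
graph = build_graph(recs, beta={"name": 1.0, "phone": 2.0}, lam={"city": 0.5})
```

Records 0 and 3 share a city but no identification value, so no connecting line is created between them; general attributes only add weight to lines that already exist.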
Step S400, performing connected subgraph detection and splitting on the weighted undirected graph through a connected graph algorithm to obtain a set of connected subgraphs.
In a possible implementation, after splitting, no connecting line exists between nodes of different connected subgraphs, while any two nodes of the same connected subgraph are joined either directly by a connecting line or through a path of connecting lines.
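The embodiment does not name a specific connected graph algorithm. As one possible illustration, the sketch below splits the graph with a breadth-first search; nodes are assumed to be numbered 0 to n-1, and the function name is hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def connected_subgraphs(n_nodes: int, edges: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Split nodes 0..n_nodes-1 into connected subgraphs using breadth-first search."""
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen: Set[int] = set()
    components: List[List[int]] = []
    for start in range(n_nodes):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component: List[int] = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in sorted(adjacency[u]):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components


components = connected_subgraphs(6, [(0, 1), (1, 2), (3, 4)])
```

A node without any connecting line, such as node 5 here, forms a connected subgraph of its own.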
Step S500, if the connected subgraph has nodes with inconsistent effective attribute values of the identification attributes, identifying the connected subgraph as an information conflict subgraph.
Specifically, in this embodiment, if, within a connected subgraph, there are nodes whose values for any one identification attribute are inconsistent (null values excluded), the connected subgraph is identified as an information conflict subgraph; otherwise, the connected subgraph is identified as an information non-conflict subgraph.
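The conflict test can be sketched as follows, assuming records are dictionaries, a null value is represented by None, and a connected subgraph is given as a list of record indices; the function name is hypothetical.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Optional[Any]]


def is_conflict_subgraph(
    records: List[Record], component: List[int], id_attrs: List[str]
) -> bool:
    """True if any identification attribute takes more than one non-null value."""
    for attr in id_attrs:
        values = {records[i].get(attr) for i in component} - {None}
        if len(values) > 1:
            return True
    return False


sample = [
    {"name": "Li Lei", "phone": "1"},
    {"name": "Li Lei", "phone": None},
    {"name": "Li Lei", "phone": "2"},
]
no_conflict = is_conflict_subgraph(sample, [0, 1], ["name", "phone"])
conflict = is_conflict_subgraph(sample, [0, 1, 2], ["name", "phone"])
```

The first subgraph is not in conflict because the missing phone value is ignored; the second is in conflict because the phone attribute takes two different valid values.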
Step S600, for each information conflict subgraph, performing community division on the nodes in the information conflict subgraph through a community detection algorithm.
In this embodiment, the Fast Unfolding algorithm (also known as the Louvain method) may be adopted to perform community detection on each information conflict subgraph and divide the nodes in the information conflict subgraph into a plurality of different communities.
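Fast Unfolding searches greedily for a partition with high modularity. The sketch below does not implement that search; it only computes the weighted modularity Q of a given partition, to show the quantity being optimized. The edge dictionary layout and the function name are assumptions, and self-loops are not handled.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def modularity(
    edges: Dict[Tuple[int, int], float], community_of: Dict[int, Hashable]
) -> float:
    """Weighted modularity Q = sum over communities of L_c/m - (d_c/(2m))^2.

    m is the total edge weight, L_c the weight of edges inside community c,
    and d_c the summed weighted degree of the nodes in community c.
    """
    m = sum(edges.values())
    if m == 0:
        return 0.0
    internal: Dict[Hashable, float] = defaultdict(float)
    degree: Dict[Hashable, float] = defaultdict(float)
    for (u, v), w in edges.items():
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge (2, 3); all weights are 1.
toy_edges = {
    (0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
    (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
    (2, 3): 1.0,
}
q_split = modularity(toy_edges, {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"})
q_single = modularity(toy_edges, {n: "A" for n in range(6)})
```

Splitting the toy graph into its two triangles gives a higher modularity than keeping all nodes in one community, which is why a modularity-based method would place the bridge (2, 3) between two communities.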
Step S700, regarding a pair of conflict nodes having a connection line and located in different communities in the information conflict subgraph, using the identification attribute corresponding to the connection line as a conflict attribute, and removing the connection line relationship between the pair of conflict nodes by modifying the attribute value of the conflict attribute in the pair of conflict nodes.
In this embodiment, if the two nodes joined by a connecting line in the information conflict subgraph are in two different communities, the connecting line is marked as an abnormal connecting line. In theory, nodes in different communities should be loosely related data; an abnormal connecting line indicates that the conflict nodes share an identification attribute value, and that this value is likely abnormal, for example recorded incorrectly. In this way, such incorrect values can be detected automatically and quickly.
In a possible implementation manner, for each of the pair of conflict nodes, the probability of occurrence of different attribute values of the conflict attribute in the community in which the conflict node is located is counted, and the attribute value with the highest occurrence probability is used to replace the attribute value of the conflict attribute in the conflict node.
Specifically, suppose that for a certain abnormal connecting line e, the two corresponding conflict nodes are v and w, which belong to communities C_j and C_k respectively, and the conflict attribute is attribute X. For conflict node v, the occurrence frequency of the different attribute values of attribute X is counted within community C_j, and the most frequent attribute value is taken as the value of attribute X for conflict node v; for conflict node w, the occurrence frequency of the different attribute values of attribute X is counted within community C_k, and the most frequent attribute value is taken as the value of attribute X for conflict node w. In this way, problematic attributes can be automatically identified and repaired.
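The repair of a pair of conflict nodes can be sketched as below. The sketch assumes the community assignment is already available as a mapping from node index to community label, that frequencies are counted on the original (unrepaired) values, and that None marks a missing value; all names and sample values are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Hashable, List, Optional, Tuple

Record = Dict[str, Optional[Any]]


def repair_conflicts(
    records: List[Record],
    edges: List[Tuple[int, int]],
    community_of: Dict[int, Hashable],
    id_attrs: List[str],
) -> List[Record]:
    """Return repaired copies of the records; the input list is left unchanged."""
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for node, community in community_of.items():
        members[community].append(node)

    repaired = [dict(rec) for rec in records]
    for u, v in edges:
        if community_of[u] == community_of[v]:
            continue  # only lines that cross communities are abnormal
        for attr in id_attrs:
            value = records[u].get(attr)
            if value is None or value != records[v].get(attr):
                continue  # this attribute is not what joins u and v
            for node in (u, v):
                counts = Counter(
                    records[i][attr]
                    for i in members[community_of[node]]
                    if records[i].get(attr) is not None
                )
                # Replace with the most frequent value in the node's own community.
                repaired[node][attr] = counts.most_common(1)[0][0]
    return repaired


data = [
    {"name": "Li Lei", "phone": "111"},
    {"name": "Li Lei", "phone": "111"},
    {"name": "Li Lei", "phone": "111"},
    {"name": "Han Mei", "phone": "222"},
    {"name": "Han Mei", "phone": "222"},
    {"name": "Han Mei", "phone": "111"},  # phone likely recorded incorrectly
]
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (0, 5), (1, 5), (2, 5)]
groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
fixed = repair_conflicts(data, links, groups, ["name", "phone"])
```

Node 5 is joined to community A only through the phone value 111; its phone is replaced by 222, the most frequent value in its own community B, which removes the cross-community lines. Nodes in community A keep 111 because that is already the most frequent value there.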
In some possible implementations, after step S700, for each community in each information conflict subgraph, the occurrence frequency of the different attribute values of each attribute may be counted, and for a node in the community that is missing the value of an attribute, the attribute is completed with the value that occurs most frequently for that attribute in the community. Thus, attributes with missing values can be repaired automatically.
In some possible implementation manners, if there is no node in the connected subgraph where the effective attribute values of the identifying attributes are inconsistent, the connected subgraph is identified as an information non-conflict subgraph.
For each node in the information non-conflict subgraph, if the node has a missing attribute whose attribute value is absent, the frequency of the different attribute values of the missing attribute within the information non-conflict subgraph is counted, and the most frequent attribute value is used as the attribute value of the missing attribute.
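Completing missing values by the most frequent value works the same way whether the group of nodes is a community of a conflict subgraph or a whole non-conflict subgraph. A minimal sketch, assuming None marks a missing value and the group is given as a list of record indices (hypothetical names throughout):

```python
from collections import Counter
from typing import Any, Dict, List, Optional

Record = Dict[str, Optional[Any]]


def fill_missing(
    records: List[Record], group: List[int], attrs: List[str]
) -> Dict[int, Record]:
    """Return {index: completed copy} for the nodes in the group.

    For each attribute, nodes lacking a value receive the most frequent
    non-missing value found in the group; the input records are unchanged.
    """
    filled = {i: dict(records[i]) for i in group}
    for attr in attrs:
        counts = Counter(
            records[i][attr] for i in group if records[i].get(attr) is not None
        )
        if not counts:
            continue  # nobody in the group has a value to borrow
        most_frequent = counts.most_common(1)[0][0]
        for i in group:
            if filled[i].get(attr) is None:
                filled[i][attr] = most_frequent
    return filled


entries = [
    {"name": "Li Lei", "city": "Chengdu"},
    {"name": "Li Lei", "city": None},
    {"name": "Li Lei", "city": "Chengdu"},
    {"name": "Li Lei", "city": "Beijing"},
]
completed = fill_missing(entries, [0, 1, 2, 3], ["name", "city"])
```

Only the missing value is filled; node 3 keeps its existing city even though it differs from the most frequent one, since a general attribute may legitimately vary within the group.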
In addition, the present embodiment further provides a data processing device, and the data processing device may execute the data recovery method provided in the present embodiment. The data processing apparatus may be, but is not limited to, an electronic apparatus having a data processing capability, such as a server, a Personal Computer (PC), a workstation, and the like.
Referring to fig. 2, fig. 2 is a schematic diagram of a hardware structure of a data processing apparatus 100 according to the present embodiment. The data processing apparatus 100 includes a data recovery device 110, a machine-readable storage medium 120, and a processor 130.
The elements of the machine-readable storage medium 120 and the processor 130 are electrically connected to each other directly or indirectly to enable data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The data recovery device 110 includes at least one software function module which may be stored in the form of software or firmware (firmware) in the machine-readable storage medium 120 or solidified in an Operating System (OS) of the data processing apparatus 100. The processor 130 is configured to execute executable modules stored in the machine-readable storage medium 120, such as software functional modules and computer programs included in the data recovery device 110.
The machine-readable storage medium 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The machine-readable storage medium 120 is used for storing a program, and the processor 130 executes the program after receiving an execution instruction. Access to the machine-readable storage medium 120 by the processor 130, and possibly other components, may be under the control of the storage controller 212.
The processor 130 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Referring to fig. 3, the present embodiment further provides a data recovery device 110, which includes at least one functional module that can be stored in a machine-readable storage medium in the form of software. Divided by function, the data recovery device 110 may include a data obtaining module 111, a data sorting module 112, and a data recovery module 113.
The data obtaining module 111 is configured to obtain a plurality of pieces of data to be processed from at least two different channels. Each piece of data to be processed comprises at least two attributes and corresponding attribute values.
In this embodiment, the data obtaining module 111 may be configured to execute step S100 shown in fig. 1, and for a detailed description of the data obtaining module 111, reference may be made to the description of step S100.
The data sorting module 112 is configured to determine an identification attribute and a general attribute of the data to be processed.
In this embodiment, the data sorting module 112 may be configured to execute step S200 shown in fig. 1, and for a detailed description of the data sorting module 112, reference may be made to the description of step S200.
The data recovery module 113 is configured to: construct a weighted undirected graph with each piece of data to be processed as a node, where a connecting line is placed between nodes that share at least one identical identification attribute value, and the weight of the connecting line is positively correlated with the number of identification attributes whose values are identical between the two nodes; perform connected subgraph detection and splitting on the weighted undirected graph through a connected graph algorithm to obtain a set of connected subgraphs; identify a connected subgraph as an information conflict subgraph if it contains nodes whose effective attribute values of an identification attribute are inconsistent; for each information conflict subgraph, divide its nodes into communities through a community detection algorithm; and, for a pair of conflict nodes in the information conflict subgraph that are joined by a connecting line but lie in different communities, take the identification attribute corresponding to that connecting line as a conflict attribute and modify the value of the conflict attribute in the pair of conflict nodes so as to release the connecting line between them.
In this embodiment, the data recovery module 113 may be configured to execute steps S300 to S700 shown in fig. 1, and the detailed description about the data recovery module 113 may refer to the description about the steps S300 to S700.
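The first three operations of the data recovery module 113 (graph construction, connected subgraph splitting, and conflict identification) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation of the present embodiment: the function names, the dictionary representation of a record, the use of `None` for a missing value, and the choice of the weight as exactly the number of shared identification attribute values (one simple positively correlated function) are all hypothetical. Community detection and the release of conflicting connecting lines are not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_weighted_graph(records, id_attrs):
    """Return {(i, j): weight} for every pair of records sharing an identification value.

    The weight is the count of identification attributes with identical,
    non-missing values (assumed weighting).
    """
    edges = {}
    for i, j in combinations(range(len(records)), 2):
        shared = sum(
            1
            for attr in id_attrs
            if records[i].get(attr) is not None
            and records[i].get(attr) == records[j].get(attr)
        )
        if shared > 0:
            edges[(i, j)] = shared
    return edges


def connected_components(node_count, edges):
    """Split the graph into connected subgraphs with an iterative depth-first search."""
    adjacency = defaultdict(set)
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, components = set(), []
    for start in range(node_count):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def is_conflict_subgraph(component, records, id_attrs):
    """True if some identification attribute has more than one distinct non-missing value."""
    for attr in id_attrs:
        values = {records[i][attr] for i in component if records[i].get(attr) is not None}
        if len(values) > 1:
            return True
    return False


records = [
    {"phone": "111", "id_no": "X"},
    {"phone": "111", "id_no": "X"},
    {"phone": "111", "id_no": "Y"},
    {"phone": "222", "id_no": "Z"},
    {"phone": "222", "id_no": None},
]
id_attrs = ["phone", "id_no"]
edges = build_weighted_graph(records, id_attrs)
components = connected_components(len(records), edges)
conflicts = [is_conflict_subgraph(c, records, id_attrs) for c in components]
```

Here the first three records form one connected subgraph through the shared phone value, and that subgraph is flagged as an information conflict subgraph because two different values of `id_no` occur in it. The pairwise comparison is quadratic in the number of records; a practical implementation would more likely index records by attribute value first.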
In summary, with the data recovery method, the data recovery device, and the data processing apparatus provided herein, a weighted undirected graph is constructed from the data to be processed and split into connected subgraphs, and the attributes with information conflicts are located automatically through community division, so that problem data in heterogeneous data can be identified automatically and effectively, and the data can be recovered as far as possible.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems which perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.