Data quality health degree analysis method and system based on multidimensional analysis technology
1. A data quality health degree analysis method based on a multidimensional analysis technology, characterized by comprising the following steps:
acquiring a first number of target service data samples;
constructing a data analysis model by utilizing a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule and a preset relevance evaluation rule;
receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report;
displaying the quality health degree analysis report in a graphical format;
wherein the target evaluation type is one or more of a similarity evaluation, an integrity evaluation, a uniqueness evaluation and a relevance evaluation;
the method for constructing the data analysis model by utilizing the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset relevance evaluation rule comprises the following steps:
constructing an initial network model;
setting four network nodes in the initial network model;
placing the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset relevance evaluation rule in one-to-one correspondence with the four network nodes;
after the correspondence is completed, detecting the stability of each network node;
when the stability of each network node is qualified, confirming that the initial network model has converged, to obtain the data analysis model;
wherein detecting the stability of each network node after the correspondence is completed comprises:
acquiring the number of heartbeat detection timeouts of each network node within a preset time period;
sorting the four network nodes in descending order of the number of heartbeat detection timeouts to obtain a sorted result;
determining the network connection state of each network node in the sorted result;
when the network connection state of every network node is normal, judging that the working states of the four network nodes are normal; when the network connection state of any network node is disconnected, determining the disconnected network node as a first target network node, judging that the working state of the first target network node is abnormal, generating an abnormality report for display, and judging that the stability of the first target network node is poor;
when the working state of every network node is judged to be normal, taking each network node as an initiating node;
sending the first resource occupation state of each initiating node to its adjacent network nodes;
forcibly closing the first resource occupation state of each initiating node and confirming whether the first resource occupation state received by the adjacent network node has changed;
if it has changed, detecting whether a second resource occupation state of the adjacent network node is the same as the first resource occupation state; if so, determining that the adjacent network node is abnormal and judging that the stability of the adjacent network node is poor; otherwise, determining that the adjacent network node is normal;
when the network nodes are confirmed to be normal, starting the four network nodes simultaneously and confirming whether interference occurs among them; if so, marking each second target network node at which interference occurs and judging that the stability of the second target network node is poor; otherwise, confirming that the network nodes operate normally;
detecting the difference between the target data output by each network node and preset data; if the target data output by each network node is the same as the preset data, confirming that the precision of the output data of the network nodes is normal and judging that the stability of each network node is excellent; if the target data output by any network node differs from the preset data, extracting the third target network node whose output target data differs from the preset data and judging that the stability of the third target network node is poor.
2. The method of claim 1, wherein prior to obtaining the first number of target service data samples, the method further comprises:
determining a first number of data samples according to a preset condition;
determining a state function based on the first number;
determining a screening condition according to the state function, and screening a first number of initial service data samples meeting the screening condition from a second number of initial service data samples, wherein the second number is greater than the first number;
determining the first number of initial service data samples as the first number of target service data samples.
3. The method of claim 1, wherein before receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report, the method further comprises: inspecting the data analysis model, comprising the steps of:
acquiring a fourth number of preset service data samples;
predetermining a first integrity of each preset service data sample, a first similarity between each preset service data sample and the other preset service data samples, a first uniqueness of each preset service data sample and a first relevance between each preset service data sample and the other preset service data samples, to obtain a first determination result;
inputting the fourth number of preset service data samples into the data analysis model, and receiving, as output of the data analysis model, a second integrity of each preset service data sample, a second similarity between each preset service data sample and the other preset service data samples, a second uniqueness of each preset service data sample and a second relevance between each preset service data sample and the other preset service data samples, to obtain a second determination result;
and confirming whether the first determination result is the same as the second determination result; if so, confirming that the data analysis model is accurate; otherwise, confirming that the data output by the data analysis model deviates, and sending a prompt to repair the data analysis model to the target user.
4. The method of claim 1, wherein receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report comprises:
recommending four preset evaluation types to the target user;
receiving the target evaluation type selected by the target user from the four preset evaluation types;
when the target evaluation type is similarity evaluation, extracting the classification code and metadata of each target service data sample in the first number of target service data samples, and evaluating the similarity between the classification code and metadata of each target service data sample and those of the other target service data samples by using a similarity algorithm based on lexical analysis and syntactic analysis, to generate a first evaluation result;
when the target evaluation type is integrity evaluation, performing integrity detection on the classification code and metadata of each target service data sample, the integrity detection comprising: detecting whether the data is empty, detecting the data length, detecting the data enumeration values and detecting the data consistency, to generate a second evaluation result;
when the target evaluation type is uniqueness evaluation, detecting whether the classification code and metadata of each target service data sample are unique; if so, confirming that the first number of target service data samples pass the uniqueness detection; otherwise, extracting the duplicated target classification codes and target metadata, together with the defective target service data samples to which they belong, to generate a third evaluation result;
when the target evaluation type is relevance evaluation, evaluating the relevance between the classification code and metadata of each target service data sample and those of the other target service data samples, to obtain a fourth evaluation result;
and performing comprehensive analysis by using the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result to obtain the quality health degree analysis report.
5. The method of claim 4, wherein displaying the quality health degree analysis report in a graphical format comprises:
drawing and displaying each of the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result in the format of a first radar chart;
and drawing and displaying, in the format of a second radar chart, the quality health degree analysis report obtained by comprehensive analysis of the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result.
6. The method of claim 1, wherein after obtaining the first number of target service data samples and before constructing the data analysis model by using the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset relevance evaluation rule, the method further comprises: performing qualification detection on the first number of target service data samples, specifically comprising the following steps:
acquiring a confidentiality coefficient of each target service data sample;
calculating a target security index of each target service data sample according to the confidentiality coefficient of each target service data sample:
wherein P_i is the target security index of the ith target service data sample, S_i is the degree of freedom of the ith target service data sample, Γ() is the gamma function, π is the ratio of a circle's circumference to its diameter, ln is the natural logarithm, and X_i is the confidentiality coefficient of the ith target service data sample;
scanning the sample data content of each target service data sample, and determining the integrity and the truth degree of each target service data sample according to its sample data content;
calculating a target qualification coefficient of each target service data sample by using the target security index, the integrity and the truth degree of each target service data sample:
wherein θ_i1 is the weight of the target security index of the ith target service data sample in calculating the qualification coefficient of the ith target service data sample, Q_i is the integrity of the ith target service data sample, θ_i2 is the weight of the integrity of the ith target service data sample in calculating the qualification coefficient of the ith target service data sample, U_i is the truth degree of the ith target service data sample, θ_i3 is the weight of the truth degree of the ith target service data sample in calculating the qualification coefficient of the ith target service data sample, N is the first number, M_i is the score assigned to the ith target service data sample under a preset scoring rule, taking a value in [0.5, 1], a is an error factor in the calculation process, taking a value in [0.05, 0.1], and W_i is the target qualification coefficient of the ith target service data sample;
determining whether the target qualification coefficient of each target service data sample is greater than or equal to a preset qualification coefficient, and counting the third target service data samples whose target qualification coefficient is smaller than the preset qualification coefficient to obtain a target number;
confirming that the target number of third target service data samples fail the qualification detection, and generating a detection report;
and displaying the detection report.
7. A data quality health analysis system based on multidimensional analysis techniques, the system comprising:
the acquisition module is used for acquiring a first number of target service data samples;
the construction module is used for constructing a data analysis model by utilizing a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule and a preset relevance evaluation rule;
the generation module is used for receiving a target evaluation type selected by a target user, analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type and generating a quality health degree analysis report;
the display module is used for displaying the quality health degree analysis report in a graphical format;
wherein the target evaluation type is one or more of a similarity evaluation, an integrity evaluation, a uniqueness evaluation and a relevance evaluation;
the method for constructing the data analysis model by utilizing the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset relevance evaluation rule comprises the following steps:
constructing an initial network model;
setting four network nodes in the initial network model;
placing the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset relevance evaluation rule in one-to-one correspondence with the four network nodes;
after the correspondence is completed, detecting the stability of each network node;
when the stability of each network node is qualified, confirming that the initial network model has converged, to obtain the data analysis model;
wherein detecting the stability of each network node after the correspondence is completed comprises:
acquiring the number of heartbeat detection timeouts of each network node within a preset time period;
sorting the four network nodes in descending order of the number of heartbeat detection timeouts to obtain a sorted result;
determining the network connection state of each network node in the sorted result;
when the network connection state of every network node is normal, judging that the working states of the four network nodes are normal; when the network connection state of any network node is disconnected, determining the disconnected network node as a first target network node, judging that the working state of the first target network node is abnormal, generating an abnormality report for display, and judging that the stability of the first target network node is poor;
when the working state of every network node is judged to be normal, taking each network node as an initiating node;
sending the first resource occupation state of each initiating node to its adjacent network nodes;
forcibly closing the first resource occupation state of each initiating node and confirming whether the first resource occupation state received by the adjacent network node has changed;
if it has changed, detecting whether a second resource occupation state of the adjacent network node is the same as the first resource occupation state; if so, determining that the adjacent network node is abnormal and judging that the stability of the adjacent network node is poor; otherwise, determining that the adjacent network node is normal;
when the network nodes are confirmed to be normal, starting the four network nodes simultaneously and confirming whether interference occurs among them; if so, marking each second target network node at which interference occurs and judging that the stability of the second target network node is poor; otherwise, confirming that the network nodes operate normally;
detecting the difference between the target data output by each network node and preset data; if the target data output by each network node is the same as the preset data, confirming that the precision of the output data of the network nodes is normal and judging that the stability of each network node is excellent; if the target data output by any network node differs from the preset data, extracting the third target network node whose output target data differs from the preset data and judging that the stability of the third target network node is poor.
Background
In the course of enterprise data standardization, enterprises expect data standardization management to return value to the business, and the importance of data quality can hardly be overstated. During normal operation of enterprise standardized data, the generation of low-quality data is inevitable: large-batch data initialization, the spread of problems caused by unprocessed historical data, and low-quality data produced by urgent business needs all affect the quality of the standard data coding library. Enterprise data quality management should therefore be understood correctly. Generating no low-quality data at all is only a theoretical goal. In practice, the aim is to reduce and control the generation rate and the existence rate of low-quality data through scientific, effective and professional management and technical support, to find and handle low-quality data in time, and to keep the standard coding library at a high degree of health. However, because the data coding library is huge, the data information is complex and the professional requirements are high, manual quality assurance is impractical. The standard data coding library is instead inspected with professional quality management tools to find and handle missing data, duplicate data to be removed, noise data to be removed and abnormal (but real) data to be processed. The data health degree analysis provided by a specialized data quality management platform supplies a basis for data cleaning and governance, and a data cleaning platform then performs the cleaning and governance, so as to ensure data quality in terms of integrity, uniqueness, consistency, accuracy, validity and timeliness.
The data quality management methods of the prior art cannot analyze data quality comprehensively and efficiently. As a result, useless data is not cleaned thoroughly, which occupies data storage, affects users' retrieval of data and seriously degrades the user experience.
Disclosure of Invention
In view of the problems described above, the invention provides a data quality health degree analysis method and system based on a multidimensional analysis technology, to solve the problem noted in the Background that the data quality management methods of the prior art cannot analyze data quality comprehensively and efficiently, so that useless data is not cleaned thoroughly, occupies data storage, affects users' retrieval of data and seriously degrades the user experience.
A data quality health degree analysis method based on a multidimensional analysis technology comprises the following steps:
acquiring a first number of target service data samples;
constructing a data analysis model by utilizing a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule and a preset relevance evaluation rule;
receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report;
displaying the quality health degree analysis report in a graphical format;
wherein the target evaluation type is one or more of a similarity evaluation, an integrity evaluation, a uniqueness evaluation and a relevance evaluation.
Preferably, before obtaining the first number of target service data samples, the method further includes:
determining a first number of data samples according to a preset condition;
determining a state function based on the first number;
determining a screening condition according to the state function, and screening a first number of initial service data samples meeting the screening condition from a second number of initial service data samples, wherein the second number is greater than the first number;
determining the first number of initial service data samples as the first number of target service data samples.
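By way of non-limiting illustration, the screening step may be sketched in Python as follows. The form of the state function is not specified, so the sketch assumes that the screening condition derived from it can be expressed as a boolean predicate over one sample. The names `screen_samples` and `condition`, and the dictionary layout of a sample, are hypothetical and introduced only for this example.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def screen_samples(
    initial_samples: Iterable[T],
    first_number: int,
    condition: Callable[[T], bool],
) -> List[T]:
    """Return the first `first_number` samples that satisfy `condition`.

    Assumption: the screening condition derived from the state function
    can be written as a boolean predicate over a single sample.
    """
    if first_number <= 0:
        raise ValueError("first_number must be positive")
    selected: List[T] = []
    for sample in initial_samples:
        if condition(sample):
            selected.append(sample)
            if len(selected) == first_number:
                return selected
    # Fewer than first_number samples met the screening condition.
    raise ValueError("not enough samples satisfy the screening condition")


# Hypothetical usage: keep two samples whose classification code is non-empty.
initial = [{"code": "A01"}, {"code": ""}, {"code": "B02"}, {"code": "C03"}]
targets = screen_samples(initial, 2, lambda s: bool(s["code"]))
```

Here the second number is four and the first number is two; the selected initial samples are then treated as the target samples.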
Preferably, constructing a data analysis model by using a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule and a preset relevance evaluation rule includes:
constructing an initial network model;
setting four network nodes in the initial network model;
placing the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset relevance evaluation rule in one-to-one correspondence with the four network nodes;
after the correspondence is completed, detecting the stability of each network node;
and when the stability of each network node is qualified, confirming that the initial network model has converged, and obtaining the data analysis model.
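The construction steps above may be illustrated by the following minimal Python sketch. It assumes that each rule is a callable scoring one sample and that the stability check is supplied from outside; `NetworkNode`, `build_model`, `RULE_NAMES` and the placeholder rules are hypothetical names, not part of the described method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

RULE_NAMES = ("similarity", "integrity", "uniqueness", "relevance")


@dataclass
class NetworkNode:
    name: str
    rule: Callable[[dict], float]  # hypothetical rule signature


def build_model(
    rules: Dict[str, Callable[[dict], float]],
    is_stable: Callable[[NetworkNode], bool],
) -> Optional[Dict[str, NetworkNode]]:
    """Map each preset rule to its own node.

    Returns the four-node model only if every node passes the stability
    check; returns None otherwise (the model is not treated as converged).
    """
    if set(rules) != set(RULE_NAMES):
        raise ValueError("exactly the four preset rules are required")
    nodes = {name: NetworkNode(name, rules[name]) for name in RULE_NAMES}
    if all(is_stable(node) for node in nodes.values()):
        return nodes
    return None


# Hypothetical usage with placeholder rules that always score 1.0.
placeholder_rules = {name: (lambda sample: 1.0) for name in RULE_NAMES}
model = build_model(placeholder_rules, is_stable=lambda node: True)
unstable = build_model(
    placeholder_rules, is_stable=lambda node: node.name != "integrity"
)
```

In this sketch a single unstable node is enough to withhold the model, mirroring the requirement that the stability of each network node be qualified.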
Preferably, before receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report, the method further includes: performing authenticity detection on the first number of target service data samples, comprising the steps of:
segmenting each target service data sample to obtain a plurality of data segments;
applying a hash function to each data segment of each target service data sample to obtain a hash value of each data segment;
acquiring a source weight value of each target service data sample according to the plurality of hash values of that sample;
calculating a target truth degree of each target service data sample by using a preset truth degree algorithm according to the plurality of hash values and the source weight value of that sample;
deleting each first target service data sample whose target truth degree is smaller than a preset truth degree, and retaining each second target service data sample whose target truth degree is greater than or equal to the preset truth degree;
and counting the second target service data samples to obtain a third number of second target service data samples.
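A possible shape of the authenticity detection is sketched below in Python. The preset truth algorithm and the way the source weight is derived are not specified, so the sketch makes two loudly stated assumptions: segments have a fixed size and are hashed with SHA-256, and the target truth of a sample is the mean of per-hash source weights looked up in a registry (unknown hashes weigh 0.0). All function names and the registry are hypothetical.

```python
import hashlib
from typing import Dict, List


def segment_hashes(sample: bytes, segment_size: int) -> List[str]:
    """Split a sample into fixed-size segments and hash each with SHA-256."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = [
        sample[i:i + segment_size] for i in range(0, len(sample), segment_size)
    ]
    return [hashlib.sha256(seg).hexdigest() for seg in segments]


def truth_degree(
    sample: bytes, source_weights: Dict[str, float], segment_size: int
) -> float:
    """Assumed truth algorithm: mean source weight over the segment hashes.

    Hashes missing from `source_weights` count as weight 0.0.
    """
    hashes = segment_hashes(sample, segment_size)
    if not hashes:
        return 0.0
    return sum(source_weights.get(h, 0.0) for h in hashes) / len(hashes)


def filter_authentic(
    samples: List[bytes],
    source_weights: Dict[str, float],
    preset_truth: float,
    segment_size: int,
) -> List[bytes]:
    """Keep samples whose truth degree is >= the preset truth degree."""
    return [
        s for s in samples
        if truth_degree(s, source_weights, segment_size) >= preset_truth
    ]


# Hypothetical usage: segments of a known-good sample form the registry.
trusted = {h: 1.0 for h in segment_hashes(b"abcdefgh", 4)}
samples = [b"abcdefgh", b"abcdzzzz", b"zzzzzzzz"]
kept = filter_authentic(samples, trusted, preset_truth=0.5, segment_size=4)
third_number = len(kept)
```

The retained samples play the role of the second target service data samples, and `third_number` is their count.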
Preferably, before receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report, the method further includes: inspecting the data analysis model, comprising the steps of:
acquiring a fourth number of preset service data samples;
predetermining a first integrity of each preset service data sample, a first similarity between each preset service data sample and the other preset service data samples, a first uniqueness of each preset service data sample and a first relevance between each preset service data sample and the other preset service data samples, to obtain a first determination result;
inputting the fourth number of preset service data samples into the data analysis model, and receiving, as output of the data analysis model, a second integrity of each preset service data sample, a second similarity between each preset service data sample and the other preset service data samples, a second uniqueness of each preset service data sample and a second relevance between each preset service data sample and the other preset service data samples, to obtain a second determination result;
and confirming whether the first determination result is the same as the second determination result; if so, confirming that the data analysis model is accurate; otherwise, confirming that the data output by the data analysis model deviates, and sending a prompt to repair the data analysis model to the target user.
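The inspection can be pictured as a comparison of predetermined results against model output, as in the following Python sketch. It assumes each determination result is a dictionary of metric values per sample and that "the same" means exact equality; `inspect_model`, `toy_model` and the prompt text are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, float]


def inspect_model(
    model: Callable[[dict], Result],
    preset_samples: List[dict],
    first_results: List[Result],
) -> Tuple[bool, List[int]]:
    """Compare predetermined results with model output, sample by sample.

    Returns (accurate, indices of samples whose results differ).
    """
    if len(preset_samples) != len(first_results):
        raise ValueError("one predetermined result is required per sample")
    second_results = [model(sample) for sample in preset_samples]
    mismatches = [
        i
        for i, (first, second) in enumerate(zip(first_results, second_results))
        if first != second
    ]
    return (not mismatches, mismatches)


# Hypothetical model: integrity is 1.0 when a classification code is present.
def toy_model(sample: dict) -> Result:
    return {"integrity": 1.0 if sample.get("code") else 0.0}


samples = [{"code": "A01"}, {"code": ""}]
accurate, bad = inspect_model(
    toy_model, samples, [{"integrity": 1.0}, {"integrity": 1.0}]
)
prompt = None
if not accurate:
    # Stand-in for the repair prompt sent to the target user.
    prompt = f"Model output deviates for samples {bad}; please repair the model."
```

A real deployment would likely compare with a tolerance rather than exact equality; the exact comparison is kept here because the text requires the two results to be the same.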
Preferably, receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report includes:
recommending four preset evaluation types to the target user;
receiving the target evaluation type selected by the target user from the four preset evaluation types;
when the target evaluation type is similarity evaluation, extracting the classification code and metadata of each target service data sample in the first number of target service data samples, and evaluating the similarity between the classification code and metadata of each target service data sample and those of the other target service data samples by using a similarity algorithm based on lexical analysis and syntactic analysis, to generate a first evaluation result;
when the target evaluation type is integrity evaluation, performing integrity detection on the classification code and metadata of each target service data sample, the integrity detection comprising: detecting whether the data is empty, detecting the data length, detecting the data enumeration values and detecting the data consistency, to generate a second evaluation result;
when the target evaluation type is uniqueness evaluation, detecting whether the classification code and metadata of each target service data sample are unique; if so, confirming that the first number of target service data samples pass the uniqueness detection; otherwise, extracting the duplicated target classification codes and target metadata, together with the defective target service data samples to which they belong, to generate a third evaluation result;
when the target evaluation type is relevance evaluation, evaluating the relevance between the classification code and metadata of each target service data sample and those of the other target service data samples, to obtain a fourth evaluation result;
and performing comprehensive analysis by using the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result to obtain the quality health degree analysis report.
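Two of the four evaluations, integrity and uniqueness, lend themselves to a short Python sketch. The field names (`code`, `category`, `status`), the length limit, the enumeration and the consistency rule are all assumptions chosen for illustration; the similarity and relevance evaluations are not covered by this sketch.

```python
from collections import Counter
from typing import Dict, List

Sample = Dict[str, str]

ALLOWED_STATUS = {"active", "retired"}  # hypothetical enumeration values
MAX_CODE_LENGTH = 6                     # hypothetical length limit


def integrity_issues(sample: Sample) -> List[str]:
    """Run the four integrity checks: empty, length, enumeration, consistency."""
    issues = []
    code = sample.get("code", "")
    if not code:
        issues.append("empty code")
    if len(code) > MAX_CODE_LENGTH:
        issues.append("code too long")
    if sample.get("status") not in ALLOWED_STATUS:
        issues.append("status outside enumeration")
    # Assumed consistency rule: a code must start with its category prefix.
    category = sample.get("category", "")
    if code and category and not code.startswith(category):
        issues.append("code inconsistent with category")
    return issues


def duplicated_codes(samples: List[Sample]) -> List[str]:
    """Return classification codes that occur in more than one sample."""
    counts = Counter(s.get("code", "") for s in samples)
    return sorted(code for code, n in counts.items() if n > 1)


samples = [
    {"code": "A01", "category": "A", "status": "active"},
    {"code": "A01", "category": "A", "status": "retired"},
    {"code": "", "category": "B", "status": "unknown"},
]
second_result = [integrity_issues(s) for s in samples]
third_result = duplicated_codes(samples)
```

An empty `third_result` would correspond to the samples passing the uniqueness detection; a non-empty list identifies the duplicated codes to extract.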
Preferably, displaying the quality health degree analysis report in a graphical format includes:
drawing and displaying each of the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result in the format of a first radar chart;
and drawing and displaying, in the format of a second radar chart, the quality health degree analysis report obtained by comprehensive analysis of the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result.
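The geometry behind a radar chart can be sketched without any plotting library: each score is placed along its own axis, and the axes are spread evenly around a centre. The Python sketch below assumes scores normalised to [0, 1] and a clockwise layout starting at the top; `radar_vertices` and the example scores are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def radar_vertices(scores: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (label, x, y) for each axis of a radar chart.

    Axes are spaced evenly, starting at the top and proceeding clockwise;
    each score is the distance from the centre along its axis.
    """
    n = len(scores)
    vertices = []
    for k, (label, value) in enumerate(scores.items()):
        angle = math.pi / 2 - 2 * math.pi * k / n
        vertices.append(
            (label, value * math.cos(angle), value * math.sin(angle))
        )
    return vertices


# Hypothetical scores for the four evaluation results.
report = {"similarity": 1.0, "integrity": 0.5, "uniqueness": 1.0, "relevance": 0.5}
polygon = radar_vertices(report)
```

Joining the returned vertices in order gives the polygon that a charting tool would draw for the report.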
Preferably, detecting the stability of each network node after the correspondence is completed includes:
acquiring the number of heartbeat detection timeouts of each network node within a preset time period;
sorting the four network nodes in descending order of the number of heartbeat detection timeouts to obtain a sorted result;
determining the network connection state of each network node in the sorted result;
when the network connection state of every network node is normal, judging that the working states of the four network nodes are normal; when the network connection state of any network node is disconnected, determining the disconnected network node as a first target network node, judging that the working state of the first target network node is abnormal, generating an abnormality report for display, and judging that the stability of the first target network node is poor;
when the working state of every network node is judged to be normal, taking each network node as an initiating node;
sending the first resource occupation state of each initiating node to its adjacent network nodes;
forcibly closing the first resource occupation state of each initiating node and confirming whether the first resource occupation state received by the adjacent network node has changed;
if it has changed, detecting whether a second resource occupation state of the adjacent network node is the same as the first resource occupation state; if so, determining that the adjacent network node is abnormal and judging that the stability of the adjacent network node is poor; otherwise, determining that the adjacent network node is normal;
when the network nodes are confirmed to be normal, starting the four network nodes simultaneously and confirming whether interference occurs among them; if so, marking each second target network node at which interference occurs and judging that the stability of the second target network node is poor; otherwise, confirming that the network nodes operate normally;
detecting the difference between the target data output by each network node and preset data; if the target data output by each network node is the same as the preset data, confirming that the precision of the output data of the network nodes is normal and judging that the stability of each network node is excellent; if the target data output by any network node differs from the preset data, extracting the third target network node whose output target data differs from the preset data and judging that the stability of the third target network node is poor.
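A simplified Python sketch of this stability detection follows. It covers only the heartbeat ordering, the connection check and the output comparison; the resource occupation exchange and the interference check are omitted for brevity. `NodeStatus`, `assess_stability` and `model_converged` are hypothetical names, and the node observations are assumed to have been collected already.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NodeStatus:
    name: str
    heartbeat_timeouts: int      # timeouts within the preset period
    connected: bool              # network connection state
    output_matches_preset: bool  # output equals the preset data


def assess_stability(nodes: List[NodeStatus]) -> Tuple[List[str], Dict[str, str]]:
    """Sort nodes by timeout count (descending) and grade each one."""
    ordered = sorted(nodes, key=lambda n: n.heartbeat_timeouts, reverse=True)
    grades: Dict[str, str] = {}
    for node in ordered:
        if not node.connected:
            grades[node.name] = "poor"  # disconnected: abnormal working state
        elif not node.output_matches_preset:
            grades[node.name] = "poor"  # output differs from the preset data
        else:
            grades[node.name] = "excellent"
    return [n.name for n in ordered], grades


def model_converged(grades: Dict[str, str]) -> bool:
    """The model is treated as converged only if every node is stable."""
    return all(grade == "excellent" for grade in grades.values())


# Hypothetical observations for the four rule nodes.
observed = [
    NodeStatus("similarity", 0, True, True),
    NodeStatus("integrity", 3, False, True),
    NodeStatus("uniqueness", 1, True, False),
    NodeStatus("relevance", 2, True, True),
]
order, grades = assess_stability(observed)
```

Sorting by timeout count first means the nodes most likely to be unhealthy are examined before the others.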
Preferably, after obtaining the first number of target service data samples, before constructing the data analysis model by using the preset similarity contrast rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule, and the preset association evaluation rule, the method further includes: and performing qualification detection on the first number of target service data samples, wherein the method specifically comprises the following steps:
acquiring a security coefficient of each target service data sample;
calculating a target security index of each target service data sample according to the confidentiality coefficient of each target service data sample:
wherein, PiTarget safety index, S, expressed as ith target business data sampleiDenoted as the degree of freedom of the ith target business data sample, and Γ () denoted as gammaA ma function, where pi is a circumferential ratio, ln is a natural logarithm, and XiA privacy coefficient expressed as an ith target traffic data sample;
scanning the sample data content of each target service data sample, and determining the integrity and the truth of each target service data sample according to the sample data content of each target service data sample;
calculating a target qualification coefficient of each target service data sample by using the target security index, the integrity and the truth degree of each target service data sample:
wherein $\theta_{i1}$ denotes the weight of the target security index of the i-th target service data sample in calculating the qualification coefficient of that sample, $Q_i$ denotes the integrity of the i-th target service data sample, $\theta_{i2}$ denotes the weight of the integrity of the i-th target service data sample in calculating the qualification coefficient of that sample, $U_i$ denotes the truth degree of the i-th target service data sample, $\theta_{i3}$ denotes the weight of the truth degree of the i-th target service data sample in calculating the qualification coefficient of that sample, $N$ denotes the first number, $M_i$ denotes the score assigned to the i-th target service data sample by a preset scoring rule, with a value in [0.5, 1], $a$ denotes an error factor of the calculation process, with a value in [0.05, 0.1], and $W_i$ denotes the target qualification coefficient of the i-th target service data sample;
determining whether the target qualification coefficient of each target service data sample is greater than or equal to a preset qualification coefficient, and counting the third target service data samples whose target qualification coefficient is smaller than the preset qualification coefficient;
confirming that the counted third target service data samples fail the qualification detection, and generating a detection report;
and displaying the detection report.
A data quality health analysis system based on multidimensional analysis techniques, the system comprising:
the acquisition module is used for acquiring a first number of target service data samples;
the construction module is used for constructing a data analysis model by utilizing a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule and a preset relevance evaluation rule;
the generation module is used for receiving a target evaluation type selected by a target user, analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type and generating a quality health degree analysis report;
the display module is used for displaying the quality health degree analysis report in a graphical format;
wherein the target evaluation type is: one or more of a similarity assessment, an integrity assessment, a uniqueness assessment, and an association assessment.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a flowchart illustrating a method for analyzing data quality and health based on multidimensional analysis;
FIG. 2 is another flowchart of a method for analyzing data quality and health based on multidimensional analysis provided in the present invention;
FIG. 3 is a flowchart illustrating a method for analyzing data quality and health based on multidimensional analysis;
FIG. 4 is a screenshot of a workflow of a data quality and health analysis platform based on a multidimensional analysis technique according to the present invention;
FIG. 5 is a functional diagram of a data quality health analysis platform based on a multidimensional analysis technique according to the present invention;
FIG. 6 is a data quality health analysis dimension screenshot of a data quality health analysis platform based on a multidimensional analysis technique according to the present invention;
FIG. 7 is a schematic structural diagram of a data quality health degree analysis system based on a multidimensional analysis technique according to the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In enterprise data standardization, the aim is to deliver value back to the business through standardized data management, so the importance of data quality cannot be overstated. During normal operation of enterprise standardized data, low-quality data is inevitable: large-batch data initialization, problems that spread from unprocessed historical data, and low-quality data created by emergency business all degrade the quality of the standard data coding library. Enterprise data quality management should therefore be understood correctly. Producing no low-quality data at all is only a theoretical goal. In practice, the goal is to reduce and control the generation rate and the existence rate of low-quality data through scientific, effective and professional management and technical support, to find and handle low-quality data in time, and thereby to keep the standard coding library at a high health degree. Because the coding library holds a huge volume of data, the data information is complex, and the professional requirements are high, quality assurance is difficult to carry out manually. Instead, the standard data coding library is checked with a professional quality management tool, which finds missing data, duplicate data to be removed, noise data to be removed, and abnormal (but real) data to be handled. The data health degree analysis provided by a specialized data quality management platform supplies the basis for data cleaning and treatment, and a data cleaning platform then performs the cleaning and treatment, so that data quality attributes such as integrity, uniqueness, consistency, accuracy, legality and timeliness are ensured. Data quality management methods in the prior art cannot analyze data quality comprehensively and efficiently, so useless data is not cleaned completely; it occupies storage, interferes with the user's data calls, and seriously degrades the user experience. To solve this problem, the present embodiment discloses a data quality health degree analysis method based on a multidimensional analysis technology.
A data quality health degree analysis method based on a multidimensional analysis technology is shown in FIG. 1, and comprises the following steps:
step S101, obtaining a first number of target service data samples;
step S102, constructing a data analysis model by utilizing a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule and a preset relevance evaluation rule;
step S103, receiving a target evaluation type selected by a target user, and analyzing and evaluating the first number of target service data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report;
step S104, displaying the quality health degree analysis report in a graphical format;
wherein the target evaluation type is: one or more of a similarity assessment, an integrity assessment, a uniqueness assessment, and an association assessment.
The working principle of the technical scheme is as follows: a first number of target service data samples are obtained; a data analysis model is constructed by using a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule and a preset relevance evaluation rule; a target evaluation type selected by a target user is received; the first number of target service data samples are analyzed and evaluated by the data analysis model according to the target evaluation type to generate a quality health degree analysis report; and the report is displayed in a graphical format.
The beneficial effects of the above technical scheme are: analyzing the quality health degree of the service data samples with the data analysis model avoids the labor wasted on manual inspection and allows the data quality of the samples to be analyzed comprehensively and efficiently. Useless data can be removed in time so that it neither occupies storage nor interferes with the user, which improves the user experience. Furthermore, the user can choose the angle from which the service data samples are analyzed; because each angle is analyzed on its own, the final data quality health degree analysis result is more accurate, and stability is improved.
In one embodiment, as shown in fig. 2, before obtaining the first number of target traffic data samples, the method further comprises:
step S201, determining a first number of data samples according to preset conditions;
step S202, determining a state function based on the first number;
step S203, determining a screening condition according to the state function, and screening a first number of initial service data samples meeting the screening condition from a second number of initial service data samples, wherein the second number is greater than the first number;
step S204, determining the first number of initial service data samples as the first number of target service data samples.
The beneficial effects of the above technical scheme are: by determining the screening condition with the state function, the first number of target service data samples that meet the condition can be reasonably screened out of the customer's initial service data samples, so that the selected samples are more practical and representative, the accuracy of the data is ensured, and good samples are provided for the subsequent data quality health degree analysis.
In one embodiment, as shown in fig. 3, the building the data analysis model by using the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset association evaluation rule includes:
step S301, constructing an initial network model;
step S302, four network nodes are set in the initial network model;
step S303, respectively corresponding the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule and the preset association evaluation rule to the four network nodes;
step S304, after the correspondence is finished, the stability of each network node is detected;
step S305, when the stability of each network node is qualified, confirming the convergence of the initial network model, and obtaining the data analysis model.
The beneficial effects of the above technical scheme are: assigning each rule to its own network node lets each node independently complete one analysis item for the service data samples, which avoids a confused final result caused by mixing several analysis items together and further improves stability.
In one embodiment, before receiving a target evaluation type selected by a target user, performing analysis evaluation on the first number of target business data samples by using the data analysis model according to the target evaluation type, and generating a quality health analysis report, the method further includes: and performing authenticity detection on the first number of target service data samples, wherein the steps comprise:
segmenting each target service data sample to obtain a plurality of data segments;
applying a hash function to each data segment of each target service data sample to obtain a hash value of each data segment;
acquiring a source weighted value of each target business data sample according to the plurality of hash values of each target business data sample;
calculating the target truth of each target business data sample by utilizing a preset truth algorithm according to the plurality of hash values and the source weighted value of each target business data sample;
deleting the first target service data sample with the target truth smaller than the preset truth, and reserving a second target service data sample with the target truth larger than or equal to the preset truth;
and counting the number of the second target service data samples to obtain a third number of second target service data samples.
The beneficial effects of the above technical scheme are: detecting the authenticity of the service data samples further ensures data accuracy. Because the evaluation uses the hash values unique to each target service data sample, the truth degree of each sample can be calculated more accurately, and security is improved.
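The authenticity detection steps above can be sketched as follows. This is a minimal illustration only: the text does not define the preset truth algorithm or how the source weighted value is derived from the hash values, so the sketch assumes that reference hashes from a trusted copy are available, takes the source weight as an input, and uses the share of matching segments as the truth degree. The function names and the fixed segment size are hypothetical.

```python
import hashlib

def segment(data: bytes, size: int):
    """Split a sample into fixed-size segments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def segment_hashes(data: bytes, size: int = 16):
    """SHA-256 digest of every segment, in order."""
    return [hashlib.sha256(seg).hexdigest() for seg in segment(data, size)]

def truth_degree(sample: bytes, reference_hashes, source_weight: float, size: int = 16) -> float:
    """Assumed rule: share of segments whose hash matches the reference,
    scaled by the source weight. Not the patent's (unspecified) algorithm."""
    hashes = segment_hashes(sample, size)
    total = max(len(hashes), len(reference_hashes))
    if total == 0:
        return 0.0
    matches = sum(1 for h, r in zip(hashes, reference_hashes) if h == r)
    return source_weight * matches / total

def filter_by_truth(samples, references, weights, threshold, size: int = 16):
    """Keep samples whose truth degree is >= threshold and return them
    together with their count (the 'third number' in the text)."""
    kept = [
        s for s, r, w in zip(samples, references, weights)
        if truth_degree(s, r, w, size) >= threshold
    ]
    return kept, len(kept)
```

A sample that differs from its reference in any segment receives a lower truth degree and is dropped when it falls below the preset threshold.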
In one embodiment, before receiving a target evaluation type selected by a target user, performing analysis evaluation on the first number of target business data samples by using the data analysis model according to the target evaluation type, and generating a quality health analysis report, the method further includes: inspecting the data analysis model, comprising the steps of:
acquiring a fourth number of preset service data samples;
predetermining the first integrity of each preset service data sample, the first similarity of each preset service data sample and other preset service data samples, the first uniqueness of each preset service data sample and the first relevance of each preset service data sample and other preset service data samples, and obtaining a first determination result;
inputting the fourth number of preset service samples into the data analysis model, receiving a second integrity of each preset service data sample, a second similarity of each preset service data sample and other preset service data samples, a second uniqueness of each preset service data sample and a second relevance of each preset service data sample and other preset service data samples output by the data analysis model, and obtaining a second determination result;
and confirming whether the first determination result is the same as the second determination result, if so, confirming that the data analysis model is accurate, otherwise, confirming that the data output by the data analysis model has deviation, and sending a prompt for repairing the data analysis model to a target user.
The beneficial effects of the above technical scheme are: checking the data analysis model ensures that its final quality health degree analysis results match the actual results and that useless data is not missed, which further improves stability and the user experience.
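The inspection described above amounts to comparing predetermined results with the model's output for the same preset samples. A minimal sketch, assuming the model is a callable that returns one result dictionary per sample with the four keys shown (the callable interface and the key names are hypothetical):

```python
METRICS = ("integrity", "similarity", "uniqueness", "relevance")

def inspect_model(model, preset_samples, expected):
    """Compare the first determination result (expected) with the second
    determination result (model output). Returns (is_accurate, mismatches)."""
    actual = model(preset_samples)
    mismatches = []
    for idx, (exp, act) in enumerate(zip(expected, actual)):
        for metric in METRICS:
            if exp[metric] != act[metric]:
                # Record which sample and metric deviated, and both values.
                mismatches.append((idx, metric, exp[metric], act[metric]))
    return (not mismatches), mismatches
```

When the returned flag is false, the mismatch list indicates which metric deviated and can accompany the prompt to repair the model.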
In one embodiment, the receiving a target evaluation type selected by a target user, and performing analysis and evaluation on the first number of target business data samples by using the data analysis model according to the target evaluation type to generate a quality health degree analysis report includes:
recommending four preset evaluation types to the target user;
receiving a target evaluation type selected by the user from four preset evaluation types;
when the target evaluation type is similarity evaluation, extracting the classified codes and metadata of each target business data sample in the first number of target business data samples, and performing similarity evaluation on the classified codes and metadata of each target business data sample and the classified codes and metadata of other target business data samples by using a similarity algorithm based on lexical analysis and syntactic analysis to generate a first evaluation result;
when the target evaluation type is integrity evaluation, performing integrity process detection on the classification code and the metadata of each target service data sample, wherein the integrity process detection comprises: detecting whether the data is empty, detecting the data length, detecting the data enumeration values, and detecting data consistency, to generate a second evaluation result;
when the target evaluation type is uniqueness evaluation, detecting whether the classification code and the metadata of each target business data sample are unique; if so, confirming that the first number of target business data samples pass the uniqueness detection; otherwise, extracting the duplicated target classification codes and target metadata, together with the defective target business data samples to which they belong, and generating a third evaluation result;
when the target evaluation type is relevance evaluation, performing relevance evaluation on the classified codes and metadata of each target service data sample and the classified codes and metadata of other target service data samples to obtain a fourth evaluation result;
and performing comprehensive analysis by using the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result to obtain the quality health degree analysis report.
The beneficial effects of the above technical scheme are: the target service data samples are analyzed from all angles to obtain several evaluation results, and the quality health degree analysis report is then generated by comprehensive analysis of those results. Each evaluation item is therefore independent and unaffected by the other items, which guarantees the accuracy of each evaluation result and of the final quality health degree analysis report.
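The selection-driven evaluation above can be sketched as a small dispatcher. The sketch makes several assumptions: each sample is a dictionary with hypothetical fields `code` (classification code), `name` and `unit` (metadata); `difflib.SequenceMatcher` stands in for the similarity algorithm based on lexical and syntactic analysis, which the text does not specify; and the association evaluation is omitted because its rule is not defined in the text.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

def evaluate_similarity(samples, threshold=0.8):
    """Pairs of samples whose metadata strings are at least `threshold` similar."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(samples), 2):
        ratio = SequenceMatcher(None, a["name"], b["name"]).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs

def evaluate_integrity(samples, max_len=20, allowed_units=("pcs", "kg")):
    """Empty-value, length and enumeration checks on each sample."""
    issues = []
    for i, s in enumerate(samples):
        if not s["code"] or not s["name"]:
            issues.append((i, "empty value"))
        if len(s["name"]) > max_len:
            issues.append((i, "name too long"))
        if s["unit"] not in allowed_units:
            issues.append((i, "unit not in enumeration"))
    return issues

def evaluate_uniqueness(samples):
    """Classification codes that occur more than once, with the affected indices."""
    counts = Counter(s["code"] for s in samples)
    return {
        code: [i for i, s in enumerate(samples) if s["code"] == code]
        for code, n in counts.items() if n > 1
    }

# One evaluator per evaluation type, so each result is produced independently.
EVALUATORS = {
    "similarity": evaluate_similarity,
    "integrity": evaluate_integrity,
    "uniqueness": evaluate_uniqueness,
}

def run_evaluations(samples, selected_types):
    """Run only the evaluation types the user selected."""
    return {t: EVALUATORS[t](samples) for t in selected_types}
```

Keeping one function per evaluation type mirrors the one-rule-per-node design, so a defect in one check does not affect the others.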
In one embodiment, displaying the quality health degree analysis report in a graphical format includes:
drawing and displaying the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result in a first radar chart format respectively;
and drawing and displaying the quality health degree analysis report subjected to comprehensive analysis by using the first evaluation result, the second evaluation result, the third evaluation result and the fourth evaluation result in a format of a second radar map.
The beneficial effects of the above technical scheme are: the detection items in the evaluation results of the first number of target service samples can be accurately and comprehensively displayed through the radar map, so that a user can look up and understand the quality and health degree analysis report at a glance, and the experience of the user is further improved.
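As an illustration of the radar chart display, the sketch below computes the polygon vertices for one chart. It assumes that each evaluation result has already been reduced to a score in [0, 1], which the text does not state; the function name is hypothetical, and the actual drawing is left to whatever plotting tool the platform uses.

```python
import math

def radar_vertices(scores):
    """Map {axis_name: score} to (axis_name, x, y) polygon vertices.
    The first axis points straight up and the rest follow clockwise."""
    n = len(scores)
    points = []
    for k, (axis, value) in enumerate(scores.items()):
        angle = math.pi / 2 - 2 * math.pi * k / n
        points.append((axis, value * math.cos(angle), value * math.sin(angle)))
    return points
```

With four axes (similarity, integrity, uniqueness, association), the vertices fall on the up, right, down and left directions, and connecting them in order gives the radar polygon.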
In one embodiment, after the mapping is completed, detecting the stability of each network node includes:
acquiring the number of times of heartbeat detection overtime of each node within a preset time length;
sequencing the four network nodes according to the overtime times of heartbeat detection in a sequence from the maximum to the minimum to obtain a sequencing result;
determining the network connection state of each network node in the sequencing result;
when the network connection state of every network node is unobstructed, judging that the working states of the four network nodes are normal; when the network connection state of any network node is disconnected, determining the disconnected node as a first target network node, judging that its working state is abnormal, generating and displaying an abnormality report, and judging that the stability of the first target network node is poor;
when the working state of each network node is judged to be normal, each network node is used as an initiating node;
sending the first resource occupation state of each initiating node to the adjacent network nodes;
forcibly closing the first resource occupation state of each initiating node and confirming whether the first resource occupation state received by the adjacent network node is changed;
if the change occurs, detecting whether a second resource occupation state of the adjacent network node is the same as a first resource occupation state, if so, determining that the adjacent network node is abnormal, and judging that the stability of the adjacent network node is poor, otherwise, determining that the network node is normal;
when the network nodes are confirmed to be normal, starting the four network nodes simultaneously and confirming whether interference occurs among them; if so, marking the second target network node where the interference occurs and judging that its stability is poor; otherwise, confirming that the network nodes are working normally;
detecting the difference between the target data output by each network node and preset data; if the target data output by every network node is the same as the preset data, confirming that the output precision of the network nodes is normal and judging that the stability of each network node is excellent; if the target data output by any network node differs from the preset data, extracting the third target network node whose output differs from the preset data and judging that its stability is poor.
The beneficial effects of the above technical scheme are: judging the stability of each network node from several angles determines, at a macro level, whether the node's operation meets actual requirements. This reduces risk, guarantees the working performance of each node, and further ensures the accuracy of the subsequent quality health degree evaluation of the target business data samples. It also improves the stability of the model, so that the data analysis model can evaluate the quality health degree of a large number of business data samples, which improves working efficiency.
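The staged stability check above can be sketched as follows. The sketch simplifies the procedure: the observations for each node (timeout count, connection state, stale resource state, interference, output) are assumed to have been collected already, and the stages are applied per node rather than gating all nodes together. The class, its field names and the verdict strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    name: str
    heartbeat_timeouts: int      # timeouts within the preset time length
    connected: bool              # network connection state
    keeps_stale_state: bool      # still mirrors a neighbor's resource state after forced close
    interferes: bool             # interference seen when all nodes run together
    output: object               # data produced by the node
    expected_output: object      # preset data

def assess_stability(nodes):
    """Order nodes by heartbeat timeouts (descending) and apply the checks in turn."""
    ordered = sorted(nodes, key=lambda n: n.heartbeat_timeouts, reverse=True)
    verdicts = {}
    for node in ordered:
        if not node.connected:
            verdicts[node.name] = "poor: disconnected"
        elif node.keeps_stale_state:
            verdicts[node.name] = "poor: stale resource state"
        elif node.interferes:
            verdicts[node.name] = "poor: interference"
        elif node.output != node.expected_output:
            verdicts[node.name] = "poor: output differs from preset data"
        else:
            verdicts[node.name] = "excellent"
    return verdicts

def model_converged(verdicts):
    """The initial network model is accepted only if every node is stable."""
    return all(v == "excellent" for v in verdicts.values())
```

Sorting by timeout count first means the least reliable nodes are examined first, which matches the ordering step in the text.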
In one embodiment, after obtaining the first number of target business data samples and before constructing the data analysis model by using the preset similarity comparison rule, the preset integrity evaluation rule, the preset uniqueness evaluation rule, and the preset association evaluation rule, the method further includes: performing qualification detection on the first number of target service data samples, which specifically comprises the following steps:
acquiring a confidentiality coefficient of each target service data sample;
calculating a target security index of each target service data sample according to the confidentiality coefficient of each target service data sample:
wherein $P_i$ denotes the target security index of the i-th target service data sample, $S_i$ denotes the degree of freedom of the i-th target service data sample, $\Gamma(\cdot)$ denotes the gamma function, $\pi$ is the ratio of a circle's circumference to its diameter, $\ln$ is the natural logarithm, and $X_i$ denotes the confidentiality coefficient of the i-th target service data sample;
scanning the sample data content of each target service data sample, and determining the integrity and the truth of each target service data sample according to the sample data content of each target service data sample;
calculating a target qualification coefficient of each target service data sample by using the target security index, the integrity and the truth degree of each target service data sample:
wherein $\theta_{i1}$ denotes the weight of the target security index of the i-th target service data sample in calculating the qualification coefficient of that sample, $Q_i$ denotes the integrity of the i-th target service data sample, $\theta_{i2}$ denotes the weight of the integrity of the i-th target service data sample in calculating the qualification coefficient of that sample, $U_i$ denotes the truth degree of the i-th target service data sample, $\theta_{i3}$ denotes the weight of the truth degree of the i-th target service data sample in calculating the qualification coefficient of that sample, $N$ denotes the first number, $M_i$ denotes the score assigned to the i-th target service data sample by a preset scoring rule, with a value in [0.5, 1], $a$ denotes an error factor of the calculation process, with a value in [0.05, 0.1], and $W_i$ denotes the target qualification coefficient of the i-th target service data sample;
determining whether the target qualification coefficient of each target service data sample is greater than or equal to a preset qualification coefficient, and counting the third target service data samples whose target qualification coefficient is smaller than the preset qualification coefficient;
confirming that the counted third target service data samples fail the qualification detection, and generating a detection report;
and displaying the detection report.
The beneficial effects of the above technical scheme are: calculating the target security index of each target service data sample allows the integrity and truth degree of the sample to be roughly estimated, since data with higher security tends to have higher integrity and truth degree. The qualification coefficient of each sample is therefore calculated from its integrity, truth degree and security index, so that qualification is determined both from external factors and from the sample's own parameters, which ensures the accuracy of the final qualification detection. Furthermore, displaying the counted unqualified third target service data samples lets the user selectively replace them, which further ensures the accuracy of the subsequent quality health degree evaluation and provides qualified data samples for it.
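The final thresholding and reporting step can be sketched as below. The sketch deliberately takes the target qualification coefficients as given inputs, because the formulas for the security index and the qualification coefficient are not reproduced in this text; the function name and the report fields are hypothetical.

```python
def qualification_report(coefficients, preset_coefficient):
    """Count the samples whose qualification coefficient is below the preset
    value and return a small report listing them by index."""
    failed = [i for i, w in enumerate(coefficients) if w < preset_coefficient]
    return {
        "total": len(coefficients),
        "failed_count": len(failed),
        "failed_indices": failed,
    }
```

A sample whose coefficient equals the preset value passes, in line with the "greater than or equal to" condition in the steps above.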
In one embodiment, as shown in FIGS. 4-6:
a data quality health degree analysis platform based on a multidimensional analysis technology uses the method of the invention. Its workflow includes: obtaining business data by using an entity data model; determining the dynamic data, namely the master data, in the business data; performing health analysis on the business data according to the similarity rule, the integrity rule, the uniqueness rule and the association rule in the data analysis model; displaying the analysis result in a graphical format; and generating a data quality analysis report.
The platform also has the following functions:
supporting configuration of duplicate-code matching conditions;
supporting regular duplicate-code checks of the master data and providing a duplicate-code list of the master data;
supporting exact duplicate checking with configurable duplicate-checking rules;
supporting the establishment of a unified review process;
supporting publication of, and opinion collection on, the duplicate-code list: the master data duplicate-code list is disclosed only to the subsidiaries or business units whose business systems use the master data to be deleted;
performing multiple checks on the data through configurable data checking conditions;
supporting batch export of the master data duplicate-code list;
supporting tracking of how each business system handles the issued duplicate-code list: establishing a mapping relation between duplicate master data codes, and tracking the business processing status of the deleted master data (including uncleared business and the master data processing state);
establishing data constraint rules;
implementing a mandatory-field check function;
implementing a related-field check function;
supporting regular master data health degree analysis, checking the master data for duplicate codes, and providing a duplicate-code list of the master data;
supporting review, publication, opinion collection, release and export of the duplicate-code list;
supporting processing and tracking of the issued duplicate-code list.
The data management platform supports multiple check rules and allows custom check rules, for example: value range checks, related attached-table checks, regular expression checks, homonym library checks, and custom rule checks.
Input selection includes: value list template selection, user-defined auxiliary table selection, uploading of arbitrary attachments, and data health degree analysis.
Configuration of health degree analysis parameters is supported, so that the standard coding library is monitored and analyzed on a routine basis; state analysis reports for the various master data coding libraries are generated according to the health degree parameter model, and a list of data to be processed is provided as a basis for data cleaning.
The beneficial effects of the above technical scheme are: through the data quality management platform, corresponding quality control and analysis parameters are configured for different types of data models, and routine quality monitoring is carried out on different types of standard data. Exact and fuzzy duplicate checking of the data can be realized, various configurable data checking functions can be provided, and checks of data uniqueness, integrity and consistency are supported.
The embodiment also discloses a data quality and health degree analysis system based on the multidimensional analysis technology, as shown in fig. 7, the system includes:
an obtaining module 701, configured to obtain a first number of target service data samples;
a building module 702, configured to build a data analysis model by using a preset similarity comparison rule, a preset integrity evaluation rule, a preset uniqueness evaluation rule, and a preset association evaluation rule;
a generating module 703, configured to receive a target evaluation type selected by a target user, perform analysis and evaluation on the first number of target service data samples by using the data analysis model according to the target evaluation type, and generate a quality and health degree analysis report;
a display module 704, configured to display the quality health analysis report in a graphical format;
wherein the target evaluation type is: one or more of a similarity assessment, an integrity assessment, a uniqueness assessment, and an association assessment.
The working principle and the advantageous effects of the above technical solution have been explained in the method embodiments above and are not repeated here.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention only refer to different stages of application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.