Missing energy consumption data interpolation method, device, and storage medium based on the MIDAE model
1. A missing energy consumption data interpolation method based on an MIDAE model is characterized by comprising the following steps:
S1, dividing a data set $X$ into an incomplete data set $X_m$ and a complete data set $X_c$; the data set $X$ is a set of observations acquired in time-series form by an energy management system, that is, a set of data records describing the features that influence extruder energy consumption, and consists of $d$ attributes;
S2, under the guidance of the missing pattern of the incomplete data set $X_m$, introducing synthetic MVs into the complete data set $X_c$ to generate a corrupted data set for training;
S3, identifying the missing pattern of the incomplete data set $X_m$ based on the missing indicator matrix $S_m$; if the missing pattern is a univariate missing pattern or a monotonic missing pattern, going to step S4, otherwise going to step S5;
S4, dividing the incomplete data set $X_m$ into several subsets, each containing only one incomplete attribute, and then interpolating the MVs on each incomplete attribute independently and sequentially based on the MIDAE model;
S5, training a unified MIDAE model to interpolate the MVs of the incomplete data set $X_m$ in batch;
S6, outputting the interpolated data set $X^*$.
2. The method for interpolating missing energy consumption data based on MIDAE model as claimed in claim 1, wherein the MIDAE model comprises:
corrupted input generation: a corrupted input $\tilde{x}$ is generated by marking some values of the original input $x$ as synthetic MVs and filling the synthetic MVs with default values; for effective MV interpolation, the missing pattern of the corrupted inputs generated in the training set needs to be consistent with the missing pattern of the incomplete data set to be interpolated;
encoding: the encoder converts the corrupted input $\tilde{x}$ into an $h$-dimensional embedding $y$, $y = f(\tilde{x}W + b)$;
decoding: the decoder takes the embedding $y$ learned by the encoder as input and converts it back to $z = g(yW' + b')$, whose purpose is to reconstruct the original input $x$;
wherein $f(\cdot)$ of the encoder and $g(\cdot)$ of the decoder are nonlinear activation functions that generate the embedding $y$ and the reconstruction $z$, respectively; $W$ is a $d \times h$ encoding weight matrix and $b$ is an $h$-dimensional encoding bias vector; $W'$ is an $h \times d$ decoding weight matrix and $b'$ is a $d$-dimensional decoding bias vector.
3. The method for interpolating missing energy consumption data based on the MIDAE model as claimed in claim 2, wherein the objective function of the MIDAE model is to minimize the reconstruction error of the MVs, wherein:
$x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d}) \in X_t$ is an original, i.e. uncorrupted, observation in the training set, and $z_i = (z_{i,1}, z_{i,2}, \ldots, z_{i,d}) \in Z_t$ is the reconstruction output for $x_i$; the missing indicator vector $s_i = (s_{i,1}, s_{i,2}, \ldots, s_{i,d})$ corresponds to $x_i$ and marks the MVs in $x_i$, wherein $s_{i,j} = 1$ if $x_{i,j}$ is a missing value and $s_{i,j} = 0$ otherwise; $s_i \cdot x_i$ and $s_i \cdot z_i$ denote the inner products of $s_i$ with $x_i$ and of $s_i$ with $z_i$, respectively.
4. The method for interpolating missing energy consumption data based on the MIDAE model according to claim 3, characterized in that in the MIDAE model, a squared error loss function is adopted for numerical data;
for categorical data, a cross-entropy loss function is adopted after one-hot encoding;
for mixed data, the two loss functions are weighted and combined into a final loss function;
wherein $x'_i = s_i \cdot x_i$, $z'_i = s_i \cdot z_i$, and $d_n$ is the number of numerical attributes in the mixed data; $\omega_n$ and $\omega_c$ are the weights of the numerical attributes and the categorical attributes, respectively, with $\omega_n + \omega_c = 1$.
5. The MIDAE model-based missing energy consumption data interpolation method as claimed in claim 4, characterized in that an effective MIDAE model is trained on the complete data set $X_c$, and the MVs in $X_m$ are interpolated based on the learned MIDAE model; in the model training phase, the missing pattern of the corrupted inputs generated in MIDAE depends on the incomplete data set to be interpolated.
6. The method for interpolating missing energy consumption data based on the MIDAE model as claimed in claim 5, wherein in step S4, given a data set $X$ and assuming there are $p$ incomplete attributes and $d - p$ complete attributes in the observations, a MIDAE model is trained for each incomplete attribute $a_i$ ($1 \le i \le p$), and the MVs on $a_i$ are interpolated using the observed values on the complete attributes; once the MVs of an incomplete attribute $a_i$ have been interpolated, $a_i$ is regarded as a complete attribute and is used later for interpolating the MVs on the other incomplete attributes; to reduce the influence of inaccurate interpolated values, sequential interpolation starts from the incomplete attribute with the fewest MVs; step S4 specifically comprises:
a model training stage: given an incomplete data set $X_m$, for each incomplete attribute there is an incomplete subset $X_{m,i}$, $1 \le i \le p$, consisting of the observations whose value on the $i$-th incomplete attribute is missing; further, the observations in this subset contain only the $i$-th incomplete attribute and all complete attributes, that is, the incomplete attributes other than the $i$-th are discarded when preparing the training data;
for each incomplete subset $X_{m,i}$, $1 \le i \le p$, a complete data set containing the same attributes as $X_{m,i}$ is used as input in the model training stage; to adapt the learned model to the target data set, i.e. to interpolate the MVs in the incomplete subset $X_{m,i}$ effectively, the missing pattern of the synthetic MVs in the corrupted input should be consistent with the pattern in $X_{m,i}$; since the incomplete subset $X_{m,i}$ contains only one incomplete attribute and all values on that attribute are missing, i.e. the missing pattern of $X_{m,i}$ is univariate, the corrupted input is generated by deleting all observed values on the $i$-th incomplete attribute and replacing them with default values; next, based on the generated corrupted input, a MIDAE model for interpolating the MVs in the incomplete subset $X_{m,i}$ is trained;
an MV interpolation stage: using the corresponding trained MIDAE model, the MVs in each incomplete subset $X_{m,i}$ are interpolated, where MVs appear only on the $i$-th incomplete attribute; interpolation starts from the incomplete attribute with the fewest MVs, i.e. the incomplete subset with the fewest observations, and the interpolated MVs are used in subsequent interpolation; for the $i$-th incomplete attribute, the MVs in $X_{m,i}$ are first initialized with the default values used in the training stage; only the MVs on the $i$-th incomplete attribute are treated as MVs to be interpolated, while the MVs previously interpolated on other incomplete attributes are treated as "ground truth"; the initialized incomplete subset $X_{m,i}$ is taken as the corrupted input, and the interpolation results for the MVs in $X_m$ are found in the reconstruction $Z_{m,i}$ produced by the mapping functions of encoding and decoding; after the MVs of all incomplete subsets have been interpolated sequentially, the final interpolated data set $X^*$ is obtained.
7. The method for interpolating missing energy consumption data based on MIDAE model according to claim 5, wherein in the step S5:
in the model training phase, the complete data set $X_c$ is used as input; the first step is to generate a corrupted input by selecting a small number of elements of $X_c$ as MVs and replacing them with default values; to make the missing pattern of the corrupted input consistent with that of $X_m$, the proportion of each MV arrangement occurring in $X_m$ is calculated from the missing indicator matrix $S_m$; in the missing indicator matrix $S_m$, a vector $S_{m,i} \in S_m$ is defined as a possible MV arrangement, indicating where MVs are present in the corresponding observation $o_i \in X_m$.
8. The method for interpolating missing energy consumption data based on the MIDAE model according to any one of claims 1 to 7, wherein the identification of the missing pattern in the step S3 specifically comprises:
given an incomplete data set $X_m$, its missing pattern is determined from the corresponding missing indicator matrix $S_m$; when the value on the $j$-th attribute of the $i$-th observation is missing, $s_{ij} = 1$, otherwise $s_{ij} = 0$; thus, in the matrix $S_m$, the sum of each row is the number of MVs in that observation, and the sum of each column is the number of MVs on that attribute; if a row sum is zero, the observation contains no MV, and if a column sum is zero, the attribute contains no MV; since complete attributes, i.e. attributes without MVs, do not affect the identification of the missing pattern, the columns whose sum is zero are removed from $S_m$, and the result $S'_m$ is taken as the reduced missing indicator matrix with $d'$ incomplete attributes; whether the pattern is a univariate missing pattern is determined by checking whether only one column (attribute) remains in $S'_m$;
to distinguish the monotonic missing pattern from the general missing pattern, the incomplete attributes in $S'_m$ are reordered according to the number of MVs on each attribute; then, for each row of $S'_m$, once the first "1", i.e. the first MV, appears, all values on the subsequent attributes are "1" in the case of the monotonic missing pattern; specifically, for the $i$-th row of $S'_m$, assuming the index of the first "1" is $j$ (counting from 0), the number of "1"s in this row should be $d' - j$, i.e. the sum of the $i$-th row of $S'_m$ should equal $d' - j$; if all rows of $S'_m$ satisfy this condition, the missing pattern of the incomplete data set $X_m$ is monotonic; otherwise, the missing pattern of the incomplete data set $X_m$ is general.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
Background
Industrial production enterprises integrate functions such as energy consumption data acquisition, energy consumption data query, historical energy consumption, energy consumption trend analysis, and energy consumption status. During data acquisition, the sensor devices that collect the data suffer from problems such as insufficient current, excessive voltage, network interruption, and equipment failure, so the collected data is interrupted or deviates from the normal range. As a result, the collected data exhibits noise, imbalance, and missing values, which degrades anomaly detection accuracy. Because of various uncontrollable factors, missing values (MVs) are widespread in real-world data sets. For many algorithms used in data analysis, data mining, and machine learning, data completeness is a prerequisite, because these algorithms cannot process data sets with MVs effectively. Furthermore, the presence of MVs causes loss of information, which may reduce the performance of the algorithm employed.
At present, there are several common ways of handling missing data in general: (1) deleting or discarding every observation or sample that has missing values; (2) estimating missing values manually, i.e. replacing them with known values; (3) replacing missing values with the mean; (4) replacing missing values with values from similar samples; and (5) building a missing data interpolation model with machine learning algorithms and neural networks to interpolate the missing data. The first four methods can handle missing data segments, but they all rely on sample values. Because of the uncertainty in the known values and sample means, these methods produce biased estimates and understate the variance and standard deviation. Building an interpolation model with machine learning algorithms and neural networks is a complex process. In some prior-art methods, such as hot deck interpolation and KNN interpolation, a missing value on an attribute of a data record is interpolated as a weighted average of the values on the same attribute of similar data records; these methods define a similarity function and interpolate the missing value from the top k most similar data records. However, choosing a suitable similarity function and a suitable size for the set of similar records is very difficult.
In summary, the above methods improve the accuracy and efficiency of missing data interpolation to some extent, but existing missing value interpolation work rarely captures the nonlinear correlations among data attributes effectively, so the accuracy of missing data interpolation in industrial production environments is not high. Further, before missing data is processed, it is generally classified into a monotonic missing pattern, a univariate missing pattern, or a general missing pattern. Existing methods study only the general missing pattern; in industrial data acquisition, however, missing data takes many forms and is typically caused by factors such as equipment faults, changes in working conditions, changes in the extrusion process, and climate change. Existing methods therefore cannot effectively capture the nonlinear correlations among the attributes of extruder energy consumption data and adapt poorly to the variety of missing patterns, which leads to low interpolation accuracy and low efficiency.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a missing energy consumption data interpolation method, equipment and a storage medium based on an MIDAE model, which can efficiently process a plurality of missing data modes in an industrial energy management system.
In order to solve the technical problems, the invention adopts the technical scheme that: a missing energy consumption data interpolation method based on an MIDAE model comprises the following steps:
S1, dividing a data set $X$ into an incomplete data set $X_m$ and a complete data set $X_c$; the data set $X$ is a set of observations acquired in time-series form by an energy management system, that is, a set of data records describing the features that influence extruder energy consumption, and consists of $d$ attributes;
S2, under the guidance of the missing pattern of the incomplete data set $X_m$, introducing synthetic MVs into the complete data set $X_c$ to generate a corrupted data set for training;
S3, identifying the missing pattern of the incomplete data set $X_m$ based on the missing indicator matrix $S_m$; if the missing pattern is a univariate missing pattern or a monotonic missing pattern, going to step S4, otherwise going to step S5;
S4, dividing the incomplete data set $X_m$ into several subsets, each containing only one incomplete attribute, and then interpolating the MVs on each incomplete attribute independently and sequentially based on the MIDAE model;
S5, training a unified MIDAE model to interpolate the MVs of the incomplete data set $X_m$ in batch;
S6, outputting the interpolated data set $X^*$.
Further, the MIDAE model includes:
corrupted input generation: a corrupted input $\tilde{x}$ is generated by marking some values of the original input $x$ as synthetic MVs and filling the synthetic MVs with default values; for effective MV interpolation, the missing pattern of the corrupted inputs generated in the training set needs to be consistent with the missing pattern of the incomplete data set to be interpolated;
encoding: the encoder converts the corrupted input $\tilde{x}$ into an $h$-dimensional embedding $y$, $y = f(\tilde{x}W + b)$;
decoding: the decoder takes the embedding $y$ learned by the encoder as input and converts it back to $z = g(yW' + b')$, whose purpose is to reconstruct the original input $x$;
wherein $f(\cdot)$ of the encoder and $g(\cdot)$ of the decoder are nonlinear activation functions that generate the embedding $y$ and the reconstruction $z$, respectively; $W$ is a $d \times h$ encoding weight matrix and $b$ is an $h$-dimensional encoding bias vector; $W'$ is an $h \times d$ decoding weight matrix and $b'$ is a $d$-dimensional decoding bias vector.
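The corruption, encoding, and decoding steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the invention's implementation: the function names `corrupt` and `midae_forward`, the choice of `tanh` for $f$ and the identity for $g$, the zero default value, and the random weights are all assumptions made for the example.

```python
import numpy as np

def corrupt(x, s, default=0.0):
    """Fill positions marked as synthetic MVs (s == 1) with a default value."""
    return np.where(np.asarray(s) == 1, default, np.asarray(x, dtype=float))

def midae_forward(x_tilde, W, b, W_prime, b_prime, f=np.tanh, g=lambda a: a):
    """Single-layer forward pass: y = f(x~ W + b), z = g(y W' + b')."""
    y = f(x_tilde @ W + b)          # h-dimensional embedding
    z = g(y @ W_prime + b_prime)    # d-dimensional reconstruction
    return y, z

# Hypothetical sizes: d = 4 attributes, h = 2 embedding dimensions.
d, h = 4, 2
rng = np.random.default_rng(0)
W, b = rng.normal(size=(d, h)), np.zeros(h)
W_prime, b_prime = rng.normal(size=(h, d)), np.zeros(d)

x = np.array([0.2, 0.5, 0.1, 0.9])   # original input
s = np.array([0, 1, 0, 0])           # second value marked as a synthetic MV
x_tilde = corrupt(x, s)
y, z = midae_forward(x_tilde, W, b, W_prime, b_prime)
```

The weights here are untrained, so `z` is not yet a meaningful reconstruction; the sketch only shows the shapes and the order of operations.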
Further, the objective function of the MIDAE model is to minimize the reconstruction error of the MVs, wherein:
$x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d}) \in X_t$ is an original, i.e. uncorrupted, observation in the training set, and $z_i = (z_{i,1}, z_{i,2}, \ldots, z_{i,d}) \in Z_t$ is the reconstruction output for $x_i$; the missing indicator vector $s_i = (s_{i,1}, s_{i,2}, \ldots, s_{i,d})$ corresponds to $x_i$ and marks the MVs in $x_i$, wherein $s_{i,j} = 1$ if $x_{i,j}$ is a missing value and $s_{i,j} = 0$ otherwise; $s_i \cdot x_i$ and $s_i \cdot z_i$ denote the inner products of $s_i$ with $x_i$ and of $s_i$ with $z_i$, respectively.
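One way to write an objective consistent with these definitions is sketched below. The summation over the training set, the parameter symbol $\theta$, and the generic loss $L$ are assumptions taken over from the DAE description in the detailed description; the exact form used by the invention may differ.

```latex
\theta^{*} = \arg\min_{\theta} \sum_{x_i \in X_t} L\left(s_i \cdot x_i,\; s_i \cdot z_i\right)
```

If $s_i \cdot x_i$ is read as a mask applied to $x_i$, only the positions marked as MVs contribute to the error, because $s_{i,j} = 0$ on every observed position.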
Further, in the MIDAE model, a squared error loss function is adopted for numerical data;
for categorical data, a cross-entropy loss function is adopted after one-hot encoding;
for mixed data, the two loss functions are weighted and combined into a final loss function;
wherein $x'_i = s_i \cdot x_i$, $z'_i = s_i \cdot z_i$, and $d_n$ is the number of numerical attributes in the mixed data; $\omega_n$ and $\omega_c$ are the weights of the numerical attributes and the categorical attributes, respectively, with $\omega_n + \omega_c = 1$.
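The three loss functions can be illustrated with a short NumPy sketch. Several choices here are assumptions and not taken from the text: $s_i \cdot x_i$ is treated as an element-wise mask, the first $d_n$ positions are numerical and the remaining positions are one-hot encoded categorical values, the errors are summed rather than averaged, and all function names are hypothetical.

```python
import numpy as np

def masked_squared_error(x, z, s):
    """Squared error restricted to positions where s == 1 (the MVs)."""
    x_p, z_p = s * x, s * z            # x' = s . x, z' = s . z (element-wise)
    return float(np.sum((x_p - z_p) ** 2))

def masked_cross_entropy(x_onehot, z_prob, s, eps=1e-12):
    """Cross entropy between one-hot targets and predicted probabilities on MVs."""
    return float(-np.sum(s * x_onehot * np.log(z_prob + eps)))

def mixed_loss(x, z, s, d_n, w_n=0.5):
    """Weighted sum of both losses, with w_n + w_c = 1."""
    w_c = 1.0 - w_n
    numeric = masked_squared_error(x[:d_n], z[:d_n], s[:d_n])
    categorical = masked_cross_entropy(x[d_n:], z[d_n:], s[d_n:])
    return w_n * numeric + w_c * categorical

# Two numerical attributes followed by one categorical attribute with two classes.
x = np.array([1.0, 2.0, 1.0, 0.0])
z = np.array([1.5, 0.0, 0.5, 0.5])
s = np.array([1, 0, 1, 1])             # second numerical value is observed
loss = mixed_loss(x, z, s, d_n=2, w_n=0.5)
```

In this example the large error on the second attribute is ignored because that value is observed, which is the intended effect of restricting the loss to the MVs.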
Furthermore, the interpolation method trains an effective MIDAE model on the complete data set $X_c$ and interpolates the MVs in $X_m$ based on the learned MIDAE model; in the model training phase, the missing pattern of the corrupted inputs generated in MIDAE depends on the incomplete data set to be interpolated.
Further, in the step S4, given a data set $X$ and assuming there are $p$ incomplete attributes and $d - p$ complete attributes in the observations, a MIDAE model is trained for each incomplete attribute $a_i$ ($1 \le i \le p$), and the MVs on $a_i$ are interpolated using the observed values on the complete attributes; once the MVs of an incomplete attribute $a_i$ have been interpolated, $a_i$ is regarded as a complete attribute and is used later for interpolating the MVs on the other incomplete attributes; to reduce the influence of inaccurate interpolated values, sequential interpolation starts from the incomplete attribute with the fewest MVs.
Further, the step S4 specifically includes the following steps:
a model training stage: given an incomplete data set $X_m$, for each incomplete attribute there is an incomplete subset $X_{m,i}$, $1 \le i \le p$, consisting of the observations whose value on the $i$-th incomplete attribute is missing; further, the observations in this subset contain only the $i$-th incomplete attribute and all complete attributes, that is, the incomplete attributes other than the $i$-th are discarded when preparing the training data;
for each incomplete subset $X_{m,i}$, $1 \le i \le p$, a complete data set containing the same attributes as $X_{m,i}$ is used as input in the model training stage; to adapt the learned model to the target data set, i.e. to interpolate the MVs in the incomplete subset $X_{m,i}$ effectively, the missing pattern of the synthetic MVs in the corrupted input should be consistent with the pattern in $X_{m,i}$; since the incomplete subset $X_{m,i}$ contains only one incomplete attribute and all values on that attribute are missing, i.e. the missing pattern of $X_{m,i}$ is univariate, the corrupted input is generated by deleting all observed values on the $i$-th incomplete attribute and replacing them with default values; next, based on the generated corrupted input, a MIDAE model for interpolating the MVs in the incomplete subset $X_{m,i}$ is trained;
an MV interpolation stage: using the corresponding trained MIDAE model, the MVs in each incomplete subset $X_{m,i}$ are interpolated, where MVs appear only on the $i$-th incomplete attribute; interpolation starts from the incomplete attribute with the fewest MVs, i.e. the incomplete subset with the fewest observations, and the interpolated MVs are used in subsequent interpolation; for the $i$-th incomplete attribute, the MVs in $X_{m,i}$ are first initialized with the default values used in the training stage; only the MVs on the $i$-th incomplete attribute are treated as MVs to be interpolated, while the MVs previously interpolated on other incomplete attributes are treated as "ground truth"; the initialized incomplete subset $X_{m,i}$ is taken as the corrupted input, and the interpolation results for the MVs in $X_m$ are found in the reconstruction $Z_{m,i}$ produced by the mapping functions of encoding and decoding; after the MVs of all incomplete subsets have been interpolated sequentially, the final interpolated data set $X^*$ is obtained.
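The control flow of this sequential procedure can be sketched as below. The sketch is only an outline under stated assumptions: MVs are encoded as NaN, and a linear least-squares regressor (`least_squares_stand_in`, a hypothetical name) stands in for the trained MIDAE model so that the example stays short and runnable. Only the ordering of attributes and the reuse of interpolated values follow the text.

```python
import numpy as np

def least_squares_stand_in(F_train, t_train, F_test):
    """Stand-in for a trained MIDAE model: linear least squares with intercept."""
    A = np.column_stack([F_train, np.ones(len(F_train))])
    w, *_ = np.linalg.lstsq(A, t_train, rcond=None)
    return np.column_stack([F_test, np.ones(len(F_test))]) @ w

def sequential_impute(X, fit_predict=least_squares_stand_in):
    X = np.array(X, dtype=float)
    S = np.isnan(X)                          # True marks an MV (assumed NaN encoding)
    X_c = X[~S.any(axis=1)]                  # complete observations, used for training
    mv_counts = S.sum(axis=0)
    usable = [j for j in range(X.shape[1]) if mv_counts[j] == 0]
    # Visit incomplete attributes from the fewest MVs to the most MVs.
    for j in np.argsort(mv_counts, kind="stable"):
        if mv_counts[j] == 0:
            continue
        rows = S[:, j]
        # Train on X_c with attribute j as the target, then fill attribute j.
        X[rows, j] = fit_predict(X_c[:, usable], X_c[:, j], X[rows][:, usable])
        usable.append(j)                     # attribute j is now treated as complete
    return X

# Toy monotonic example: b = 2a and c = 3a, with MVs on b and c.
X_demo = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, np.nan], [5, np.nan, np.nan]]
filled = sequential_impute(X_demo)
```

In the toy data, attribute b (one MV) is filled first, and its interpolated value is then used as an input when attribute c (two MVs) is filled.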
Further, in the step S5:
in the model training phase, the complete data set $X_c$ is used as input; the first step is to generate a corrupted input by selecting a small number of elements of $X_c$ as MVs and replacing them with default values; to make the missing pattern of the corrupted input consistent with that of $X_m$, the proportion of each MV arrangement occurring in $X_m$ is calculated from the missing indicator matrix $S_m$; in the missing indicator matrix $S_m$, a vector $S_{m,i} \in S_m$ is defined as a possible MV arrangement, indicating where MVs are present in the corresponding observation $o_i \in X_m$.
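A possible way to compute the proportions of MV arrangements and use them to corrupt $X_c$ is sketched below. The text specifies only that the proportions are computed from $S_m$; drawing one arrangement per complete observation at random according to those proportions is an assumption of this sketch, as are the function names and the zero default value.

```python
import numpy as np

def arrangement_ratios(S_m):
    """Distinct MV arrangements (rows of S_m) and the proportion of each."""
    patterns, counts = np.unique(np.asarray(S_m), axis=0, return_counts=True)
    return patterns, counts / counts.sum()

def corrupt_like(X_c, S_m, default=0.0, seed=0):
    """Corrupt X_c so that its MV arrangements follow the proportions seen in S_m."""
    rng = np.random.default_rng(seed)
    patterns, ratios = arrangement_ratios(S_m)
    # Assumed sampling step: draw one arrangement per complete observation.
    idx = rng.choice(len(patterns), size=len(X_c), p=ratios)
    S_t = patterns[idx]                       # synthetic missing indicator matrix
    X_tilde = np.where(S_t == 1, default, X_c)
    return X_tilde, S_t

S_m_demo = np.array([[1, 0], [1, 0], [0, 1], [1, 0]])
X_c_demo = np.arange(12.0).reshape(6, 2) + 1.0
patterns, ratios = arrangement_ratios(S_m_demo)
X_tilde, S_t = corrupt_like(X_c_demo, S_m_demo)
```

The returned `S_t` plays the role of the missing indicator vectors $s_i$ during training, since it records which positions of each corrupted input are synthetic MVs.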
Further, the identification of the missing pattern in step S3 specifically includes:
given an incomplete data set $X_m$, its missing pattern is determined from the corresponding missing indicator matrix $S_m$; when the value on the $j$-th attribute of the $i$-th observation is missing, $s_{ij} = 1$, otherwise $s_{ij} = 0$; thus, in the matrix $S_m$, the sum of each row is the number of MVs in that observation, and the sum of each column is the number of MVs on that attribute; if a row sum is zero, the observation contains no MV, and if a column sum is zero, the attribute contains no MV; since complete attributes, i.e. attributes without MVs, do not affect the identification of the missing pattern, the columns whose sum is zero are removed from $S_m$, and the result $S'_m$ is taken as the reduced missing indicator matrix with $d'$ incomplete attributes; whether the pattern is a univariate missing pattern is determined by checking whether only one column (attribute) remains in $S'_m$;
to distinguish the monotonic missing pattern from the general missing pattern, the incomplete attributes in $S'_m$ are reordered according to the number of MVs on each attribute; then, for each row of $S'_m$, once the first "1", i.e. the first MV, appears, all values on the subsequent attributes are "1" in the case of the monotonic missing pattern; specifically, for the $i$-th row of $S'_m$, assuming the index of the first "1" is $j$ (counting from 0), the number of "1"s in this row should be $d' - j$, i.e. the sum of the $i$-th row of $S'_m$ should equal $d' - j$; if all rows of $S'_m$ satisfy this condition, the missing pattern of the incomplete data set $X_m$ is monotonic; otherwise, the missing pattern of the incomplete data set $X_m$ is general.
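The identification procedure can be sketched as a short function. The function name and the returned labels are hypothetical, and the ascending sort order (attribute with the most MVs last) is an assumption inferred from the row condition, since the text does not state the direction of the reordering.

```python
import numpy as np

def identify_missing_pattern(S_m):
    """Classify a 0/1 missing indicator matrix as univariate, monotonic, or general."""
    S = np.asarray(S_m, dtype=int)
    S_red = S[:, S.sum(axis=0) > 0]          # drop complete attributes (zero column sum)
    d_red = S_red.shape[1]                   # d': number of incomplete attributes
    if d_red == 0:
        return "complete"
    if d_red == 1:
        return "univariate"
    # Reorder incomplete attributes by MV count, fewest first (assumed direction).
    order = np.argsort(S_red.sum(axis=0), kind="stable")
    S_sorted = S_red[:, order]
    for row in S_sorted:
        ones = np.flatnonzero(row)
        if ones.size == 0:
            continue                         # observation without MVs
        j = ones[0]                          # index of the first "1", counted from 0
        if row.sum() != d_red - j:           # every later attribute must also be missing
            return "general"
    return "monotonic"

pattern = identify_missing_pattern([[0, 0, 1], [0, 1, 1], [1, 1, 1]])
```

In the overall method, a result of "univariate" or "monotonic" would route the data set to step S4, and "general" would route it to step S5.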
The invention also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method described above when executing the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method described above.
Compared with the prior art, the beneficial effects are:
1. Compared with the traditional DAE model, the MIDAE model provided by the invention improves on two main aspects. First, in the MIDAE model, corrupted inputs are generated using the missing pattern of the incomplete data set to be interpolated. Second, the objective function of the MIDAE model minimizes the reconstruction error of the MVs only, rather than the reconstruction error of the entire input as in the DAE.
2. The invention designs two interpolation methods aiming at different missing modes, and can better adapt to data with various missing modes.
In conclusion, the method provided by the invention is well suited to interpolating a variety of missing data patterns and has high interpolation accuracy; it helps avoid wasted extruder energy consumption, supports stable production in aluminum profile enterprises, and is of significance for energy conservation and emission reduction in such enterprises.
Drawings
FIG. 1 is a time series, subsequence, sliding window relationship diagram.
FIG. 2 is a denoising autoencoder architecture.
Fig. 3 is a schematic frame diagram of a missing data interpolation method based on the MIDAE model according to the present invention.
FIG. 4 is a common missing data pattern.
Fig. 5 is a schematic diagram of the architecture of the MIDAE model proposed by the present invention.
FIG. 6 shows interpolation accuracy for various activation functions in an embodiment of the present invention.
FIG. 7 shows interpolation performance of all comparison methods in various missing modes according to an embodiment of the present invention.
Detailed Description
The drawings are for illustration purposes only and are not to be construed as limiting the invention; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the invention.
First, definitions related to the present embodiment
Definition 1: the data set $X$ is a set of observations acquired in time-series form by an energy management system; it is a set of data records describing the features that influence extruder energy consumption and consists of $d$ attributes.
Time series: a series of values recorded at equal time intervals and arranged in time order. Let the recorded value of the time series at time $t_i$ be $v_i(t_i)$, where the recording times $t_i$ are strictly increasing. The time series can be written as $X = \langle x_1 = (t_1, v_1(t_1)), x_2 = (t_2, v_2(t_2)), \ldots, x_n = (t_n, v_n(t_n)) \rangle$, abbreviated as $X = \langle x_1, x_2, \ldots, x_n \rangle$. If there exists a set $X_1 = \langle x_{i_1}, x_{i_2}, \ldots, x_{i_m} \rangle$ with $1 \le i_1 < i_2 < \cdots < i_m \le n$, then $X_1$ is called a subsequence of $X$. On the basis of the time series, a sliding window is defined as follows: for a given time series $X = \langle x_1, x_2, \ldots, x_n \rangle$ of length $n$ and a sliding window of length $l$, the sliding window is first placed at the start of $X$, where it covers a segment of $X$ of length $l$; the window then moves forward and takes the 2nd element of the series as its starting point to form another sliding window of length $l$; and so on. In total, $n - l + 1$ sliding windows of length $l$ are obtained, denoted $B = (b_1, b_2, \ldots, b_{n-l+1})$. The relationship between time series, subsequences, and sliding windows is shown in FIG. 1, in which $x_1, x_2, \ldots, x_n$ is a piece of data with time-series properties, FM denotes a time-series subsequence, and F denotes a sliding window of length $l$.
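The sliding window definition can be illustrated with a few lines of Python. The sketch assumes the window advances one element at a time, which yields $n - l + 1$ windows for a series of length $n$; the function name is hypothetical.

```python
def sliding_windows(series, l):
    """Return all length-l windows of a series, moving one element at a time."""
    n = len(series)
    if not 1 <= l <= n:
        raise ValueError("window length l must satisfy 1 <= l <= n")
    return [series[i:i + l] for i in range(n - l + 1)]

windows = sliding_windows([10, 11, 12, 13, 14], 3)
```

For the five-element series above and $l = 3$, the function produces three windows, the second of which starts at the 2nd element of the series.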
Definition 2: incomplete observations are data records where values on some attributes are missing, while complete observations are data records where values on all attributes exist.
Definition 3: given a data set consisting of N observations, it can be represented by an N X d data matrix X, where each vector XiIs an observed value, denoted by z. Furthermore, a dataset consists of two disjoint subsets: incomplete data set (X)m) And complete data set (X)c) And respectively containing the incomplete observation value and the complete observation value in the X.
Definition 4: given a data set X, the missing index matrix is given by S ═ S1,s2,...,sN]Is shown in the formula XThe number of missing values of (a),the missing value is expressed as MV,Multiple missing values for MVsRepresents, wherein the ith vector si=(si,1,si,2,...,si,d) Corresponding to observed value xi. If the observed value xiS is lost on the jth attribute ofi,j1, otherwise si,j=0。
Second, a brief introduction to the denoising autoencoder
The denoising autoencoder (DAE) is a powerful nonlinear mapping model for learning efficient low-dimensional representations of raw data. Without loss of generality, the single-layer DAE model in FIG. 2 is taken as an example.
First, to make the learned model more robust and avoid overfitting, the DAE corrupts the original input $x$ into $\tilde{x}$ by adding small noise (e.g., isotropic Gaussian noise) or by forcing some elements of $x$ to default values (e.g., zero or the average of the corresponding attribute). Next, the encoder maps the corrupted input $\tilde{x}$ to an $h$-dimensional hidden representation (embedding) $y = f(\tilde{x}W + b)$, where $f(\cdot)$ is a user-specified activation function, $W$ is a $d \times h$ encoding weight matrix, and $b$ is an $h$-dimensional encoding bias vector. Typically, the dimension of the embedding layer is smaller than that of the input, i.e. $h < d$, which is the mechanism by which the DAE compresses data. Finally, the decoder maps the resulting embedding $y$ back to a reconstruction $z$ of the original input $x$ with a similar formula, $z = g(yW' + b')$, where $g(\cdot)$ is also a user-specified activation function, and $W'$ and $b'$ are the $h \times d$ decoding weight matrix and the $d$-dimensional decoding bias vector, respectively.
The objective function of the DAE minimizes the reconstruction error between the original input $x$ and the reconstruction $z$, i.e. $\arg\min_{\theta} L(x, z)$, where $\theta$ denotes the parameters to be optimized and $L(\cdot)$ is the loss function that measures the distance between $x$ and $z$. Notably, the output $z$ of the DAE is reconstructed from the corrupted input $\tilde{x}$ rather than from the original input $x$, yet $z$ should be as close as possible to $x$. The basic idea of parameter optimization is that if the embedding derived from the corrupted version $\tilde{x}$ captures the useful features of the original input $x$, it can reconstruct a $z$ that matches $x$ well. Therefore, minimizing the reconstruction error by training the model is equivalent to producing a good embedding that preserves most of the information in the original input.
In summary, the DAE has two advantages for handling missing data: (1) it tries to encode the corrupted input into a good embedding; (2) it tries to undo the effect of the corruption process randomly applied to the original input. In other words, besides learning a good embedding, the DAE performs a denoising process and can therefore be used for MV interpolation.
Thirdly, introducing the missing energy consumption data interpolation method based on the MIDAE model provided by the invention
Given a data set with MVs, the MIDAE model aims to capture the hidden correlation between MVs and non-MVs, and then estimate MVs for interpolation. In addition, the proposed MIDAE is an MV-driven model, i.e. the model training process and MV estimation strategy are different for various missing patterns. In the examples, three common deletion patterns are highlighted: univariate missing patterns (MVs occur on only a single attribute), monotonic missing patterns (where MVs are concentrated on several attributes and attributes can be conveniently sorted according to the percentage of missing values on each attribute), and general missing patterns (where MVs may appear on any attribute). In the embodiment, two MV interpolation methods, namely MIDAE-Sequential and MIDAE-Batch, are designed based on the MIDAE model so as to better adapt to data with various missing modes. The MIDAE-Sequential is used to train a separate MIDAE model (with MV attributes) for each incomplete attribute, and to assign MVs to different incomplete attributes based on the corresponding learned MIDAE model. In addition, in order to further improve the interpolation precision, the MIDAE-Sequential interpolates the MVs to different incomplete attributes in sequence; the interpolation starts from the incomplete attribute with the least MV, and the interpolated MV can be used for sequentially interpolating MVs on other incomplete attributes later. On the other hand, the designed MIDAE-Batch, MIDAE Batch processing, can train a unified MIDAE model and insert MVs in Batch. MIDAE-Sequential and MIDAE-Batch can be reduced to the same method for processing datasets with univariate deletion patterns. MIDAE-Sequential is capable of handling datasets with monotonic miss patterns, while MIDAE-Batch is capable of handling datasets with regular miss patterns.
Fig. 3 is a schematic system framework diagram of the missing data interpolation method based on the MIDAE model according to this embodiment. A data set X with MVs is given as input and is divided into an incomplete data set X_m and a complete data set X_c. The MV interpolation process includes the following six steps:
S1, dividing the data set X into an incomplete data set X_m and a complete data set X_c; the data set X is a set of observations acquired by an energy management system in time-series form, is a set of data records describing the energy consumption influence characteristics of the extruder, and consists of d attributes;
S2, under the guidance of the missing pattern of the incomplete data set X_m, introducing synthetic MVs into the complete data set X_c to generate a corrupted data set for training;
S3, identifying the missing pattern of the incomplete data set X_m based on the missing indicator matrix S_m; if the missing pattern is a univariate missing pattern or a monotonic missing pattern, going to step S4, otherwise going to step S5;
S4, dividing the incomplete data set X_m into several subsets, each containing only one incomplete attribute, and then interpolating the MVs on each incomplete attribute independently and sequentially based on the MIDAE model, i.e., performing sequential interpolation of the MVs on each incomplete attribute based on MIDAE-Sequential;
S5, training a unified MIDAE model to interpolate the MVs in the incomplete data set X_m in batch, i.e., performing batch interpolation of the MVs based on MIDAE-Batch;
S6, outputting the interpolated data set X*.
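As an illustration of step S1 and of the missing indicator matrix used in step S3, the following is a minimal sketch and not the implementation of the embodiment. It assumes that the data set is held in a NumPy array and that an MV is encoded as NaN; the function name `split_dataset` is hypothetical.

```python
import numpy as np

def split_dataset(X):
    """Split X into incomplete rows X_m, complete rows X_c, and the indicator matrix S_m.

    Assumption: a missing value (MV) is encoded as NaN.
    """
    missing = np.isnan(X)                       # True where a value is missing
    incomplete_rows = missing.any(axis=1)       # rows containing at least one MV
    X_m = X[incomplete_rows]
    X_c = X[~incomplete_rows]
    S_m = missing[incomplete_rows].astype(int)  # 1 = missing, 0 = observed
    return X_m, X_c, S_m

# Toy data set with d = 3 attributes and two incomplete observations.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [1.5, np.nan, 2.5],
])
X_m, X_c, S_m = split_dataset(X)
print(X_c.shape, X_m.shape)
print(S_m)
```

In this toy case both incomplete observations miss the second attribute only, so S_m has a single non-zero column.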
3.1, MIDAE model
The invention provides a DAE-based missing data interpolation denoising autoencoder (MIDAE) model, which is suitable for MV interpolation. FIG. 5 illustrates a single-layer MIDAE model in which the red portion in each node of the corrupted input represents the synthetically generated MVs, while the yellow portion in each node of the output layer is the corresponding reconstruction. Given an original input x, the data conversion between layers in the MIDAE is described below.
Corrupted input generation: a corrupted input x̃ is generated by marking some values of the original input x as synthetic MVs and filling the synthetic MVs with default values. For effective MV interpolation, the missing pattern of the corrupted inputs generated in the training set needs to be consistent with the missing pattern of the incomplete data set to be interpolated;
Encoding: the encoder converts the corrupted input x̃ into an h-dimensional embedding y, y = f(x̃W + b);
Decoding: the decoder takes as input the embedding y learned by the encoder and converts it back to z, z = g(yW' + b'); its purpose is to reconstruct the original input x;
wherein f(·) of the encoder and g(·) of the decoder are nonlinear activation functions that generate the embedding y and the reconstruction z, respectively; W is the d × h encoding weight matrix and b is the h-dimensional encoding bias vector; W' is the h × d decoding weight matrix and b' is the d-dimensional decoding bias vector.
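The encoding and decoding steps can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: both f and g are taken to be the Sigmoid function (the default activation chosen in the experiments of this embodiment), the parameters are drawn at random rather than trained, and the names `sigmoid`, `midae_forward` and `x_tilde` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic function, used here for both f and g (assumption)."""
    return 1.0 / (1.0 + np.exp(-a))

def midae_forward(x_tilde, W, b, W_prime, b_prime):
    """One encoding/decoding pass of a single-hidden-layer model."""
    y = sigmoid(x_tilde @ W + b)              # embedding, shape (n, h)
    z = sigmoid(y @ W_prime + b_prime)        # reconstruction, shape (n, d)
    return y, z

rng = np.random.default_rng(0)
d, h = 5, 3                                   # input and embedding dimensions
W = rng.normal(scale=0.1, size=(d, h))        # d x h encoding weights
b = np.zeros(h)                               # h-dimensional encoding bias
W_prime = rng.normal(scale=0.1, size=(h, d))  # h x d decoding weights
b_prime = np.zeros(d)                         # d-dimensional decoding bias

x_tilde = rng.random((4, d))                  # four corrupted inputs in [0, 1)
y, z = midae_forward(x_tilde, W, b, W_prime, b_prime)
print(y.shape, z.shape)
```

With a Sigmoid output layer every reconstructed value lies in (0, 1), which is compatible with the normalization pre-processing mentioned in the model set-up.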
In this embodiment, the objective function of MIDAE is modified to minimize the reconstruction error of the MVs, rather than the reconstruction error of the entire input observation, as shown in the following equation:
θ* = argmin_θ Σ_i L(s_i·x_i, s_i·z_i)    (1)
wherein x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) ∈ X_t is an original (uncorrupted) observation in the training set X_t, and z_i = (z_{i,1}, z_{i,2}, ..., z_{i,d}) ∈ Z_t is the reconstruction of x_i. The missing indicator vector s_i = (s_{i,1}, s_{i,2}, ..., s_{i,d}) corresponds to x_i and indicates the MVs in x_i, wherein s_{i,j} = 1 if x_{i,j} is a missing value and s_{i,j} = 0 otherwise. s_i·x_i and s_i·z_i denote the element-wise products of s_i with x_i and with z_i, respectively, so that only the entries at missing positions are retained. Finally, the parameters θ = {W, W', b, b'} are initialized randomly and optimized by stochastic gradient descent. It is noted that the loss function may be customized for different data types. For numerical data, a squared error loss function L(·) is used as in equation (2); for categorical data, one-hot encoding is applied and a cross-entropy loss function is used as in equation (3). In addition, for mixed-type data, the two loss functions are weighted and combined to generate the final loss function, as shown in equation (4).
wherein x'_i = s_i·x_i, z'_i = s_i·z_i, and d_n is the number of numerical attributes in the mixed data type. In addition, ω_n and ω_c are the weights of the numerical attributes and the categorical attributes, respectively, with ω_n + ω_c = 1.
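The masked loss functions described above can be sketched as follows. This is a minimal sketch and not a reproduction of equations (2) to (4): it assumes element-wise masking by s, averaging over the observations, a column layout in which the first d_n columns are numerical and the remaining columns are one-hot encoded categorical attributes, and a plain weighted sum for the mixed case. All function names are hypothetical.

```python
import numpy as np

def masked_squared_error(x, z, s):
    """Squared error restricted to the positions where s == 1 (the MVs)."""
    diff = s * x - s * z
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def masked_cross_entropy(x_onehot, z_prob, s, eps=1e-12):
    """Cross entropy on one-hot targets, restricted to positions where s == 1."""
    terms = -(s * x_onehot) * np.log(z_prob + eps)
    return float(np.mean(np.sum(terms, axis=1)))

def mixed_loss(x, z, s, d_n, w_n=0.5, w_c=0.5):
    """Weighted sum of both losses; columns [0, d_n) are numerical (assumed layout)."""
    numeric = masked_squared_error(x[:, :d_n], z[:, :d_n], s[:, :d_n])
    categorical = masked_cross_entropy(x[:, d_n:], z[:, d_n:], s[:, d_n:])
    return w_n * numeric + w_c * categorical

x = np.array([[1.0, 2.0, 3.0]])   # original observation
z = np.array([[1.5, 0.0, 3.0]])   # reconstruction
s = np.array([[0.0, 1.0, 0.0]])   # only the second attribute is an MV
print(masked_squared_error(x, z, s))  # the error on the first attribute is ignored
```

Because of the mask, the reconstruction error on observed positions (here the first attribute) does not contribute to the loss, which is the difference from a conventional DAE objective described in this section.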
The MIDAE model proposed in this embodiment improves on the conventional DAE model in two ways. First, in the MIDAE model, the corrupted input is generated using the missing pattern of the incomplete data set to be interpolated. Second, the objective function of the MIDAE model minimizes the reconstruction error of the MVs, rather than the reconstruction error of the entire input as in the DAE. The reasons for this are mainly the following two points:
(1) The machine learning model to be built is data dependent, which has two intuitive meanings: (i) the training set and the test set are similar in data content; (ii) the training set and the test set have relatively close distributions.
In line with this intuition about data dependence in machine learning, the MIDAE model proposed in this embodiment is used to fill in the MVs in a given incomplete data set (i.e., the test set). Thus, it is assumed that the MVs in the training set and the test set follow the same distribution (within a certain deviation limit), especially when the missing rate is relatively high.
(2) As previously introduced, the proposed MIDAE aims to accurately recover MVs. To achieve this goal, MIDAE extracts the correlation between MVs and non-MVs by focusing only on the reconstruction accuracy of MVs, thereby achieving efficient interpolation of MVs.
Given a data set X with MVs, from the viewpoint of deep learning the complete data set X_c and the incomplete data set X_m serve as the training set for model learning and the test set for MV interpolation, respectively. Interpolating MVs with a DAE involves two stages:
1) Model training: a DAE model is trained on the complete data set X_c;
2) MV interpolation: the MVs in the incomplete data set X_m are interpolated with the learned DAE model.
In the model training phase, we use the complete data set X_c as the original input. The corrupted input is generated by randomly selecting some values of the original input X_c as synthetic MVs and replacing the ground truth of these MVs with default values (generated according to a user-specified scheme). Let the embedding and the output (reconstruction) be Y_c and Z_c, respectively. The parameters θ are optimized (i.e., the DAE model is trained) by minimizing the reconstruction error between X_c and its reconstruction Z_c.
3.2 interpolation method based on MIDAE model
Given a data set X with MVs, our goal is to interpolate the MVs in the incomplete data set X_m effectively. To achieve this goal, as with DAE-based MV interpolation, there are two stages: learning an effective MIDAE model based on the complete data set X_c, and interpolating the MVs in X_m based on the learned MIDAE model.
In the model training phase, the corrupted inputs in MIDAE are generated according to the missing pattern of the incomplete data set to be interpolated. The missing pattern describes the arrangement of missing and non-missing values in the data. There are three commonly discussed missing patterns, namely univariate, monotonic, and general, as shown in FIG. 4, where the observations are assumed to have five attributes a_1 to a_5. For the univariate missing pattern, MVs in the data appear on only a single attribute. As shown in FIG. 4a, MVs exist only on the third attribute (i.e., a_3). In the monotonic missing pattern, MVs in the data appear on multiple attributes; in addition, when the value on attribute a_i of an observation is missing, the values on all subsequent attributes a_j (j > i) of the same observation are also missing. The interpolation strategy differs for different missing patterns. Therefore, we propose two MV interpolation methods based on the MIDAE model to adapt to the various missing data patterns.
3.2.1 Interpolation method based on MIDAE-Sequential
The basic idea of the MIDAE-Sequential method is to interpolate the MVs on each incomplete attribute independently and sequentially.
Given a data set X, assume that an observation has p incomplete attributes and d-p complete attributes. For each incomplete attribute a_i (1 ≤ i ≤ p), a MIDAE model is trained, and the MVs on a_i are interpolated using the observed values on the complete attributes. In addition, once the MVs on an incomplete attribute a_i have been interpolated, a_i is regarded as a complete attribute and is used when interpolating the MVs on the other incomplete attributes later. To reduce the influence of inaccurate interpolated values, the sequential interpolation starts from the incomplete attribute with the fewest MVs. For example, suppose there are three incomplete attributes in the extruder energy consumption data: extrusion pressure, extrusion speed, and extrusion time, with 5, 2 and 1 MVs, respectively. Then the extrusion time, the extrusion speed, and the extrusion pressure are interpolated in this order.
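The ordering rule in the example above can be sketched as follows. This is a minimal illustration, assuming that the MVs are described by a 0/1 missing indicator matrix whose columns correspond to the attributes; the function name `interpolation_order` and the fourth attribute in the toy data are hypothetical.

```python
import numpy as np

def interpolation_order(S_m, attribute_names):
    """Return the incomplete attributes sorted by ascending number of MVs."""
    counts = S_m.sum(axis=0)                                  # MVs per attribute
    incomplete = [j for j in range(S_m.shape[1]) if counts[j] > 0]
    ordered = sorted(incomplete, key=lambda j: counts[j])     # fewest MVs first
    return [attribute_names[j] for j in ordered]

names = ["extrusion pressure", "extrusion speed", "extrusion time", "outlet temperature"]
# Indicator matrix with 5, 2, 1 and 0 MVs on the four attributes, respectively.
S_m = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
print(interpolation_order(S_m, names))
# → ['extrusion time', 'extrusion speed', 'extrusion pressure']
```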
Model training stage: given an incomplete data set X_m, each incomplete attribute has an incomplete subset X_{m,i}, which consists of the observations whose value on the i-th incomplete attribute is missing. Further, the observations in X_{m,i} contain only the i-th incomplete attribute and all complete attributes, i.e., incomplete attributes other than the i-th attribute are discarded in the training data preparation.
For the incomplete subset X_{m,i} (1 ≤ i ≤ p), the model training stage takes as input a complete data set containing the same attributes as X_{m,i}. As described in the previous section, for the learned model to interpolate the MVs in the target data set (i.e., the incomplete subset X_{m,i}) effectively, the missing pattern of the synthetic MVs in the corrupted input should be consistent with the pattern in X_{m,i}. Since the incomplete subset X_{m,i} contains only one incomplete attribute and all values on that attribute are missing, we generate the corrupted input by deleting all observed values on the i-th incomplete attribute and replacing them with default values (generated according to a user-specified scheme). Next, based on the generated corrupted input, a MIDAE model is trained to interpolate the MVs in the incomplete subset X_{m,i}.
MV interpolation stage: at this stage, the MVs in each incomplete subset X_{m,i} (where MVs appear only on the i-th incomplete attribute) are interpolated using the corresponding trained MIDAE model. As mentioned above, the interpolation starts from the incomplete attribute with the fewest MVs, i.e., the incomplete subset with the fewest observations, and the interpolated MVs are used in the subsequent interpolation. For the i-th incomplete attribute, we first initialize the MVs in X_{m,i} with the default values adopted in the training phase. Note that only the MVs on the i-th incomplete attribute are taken as MVs to be interpolated; MVs previously interpolated on other incomplete attributes (which are treated as complete attributes when the i-th incomplete attribute is interpolated) are taken as "ground truth". With the initialized incomplete subset X_{m,i} as the corrupted input, the interpolation results for the MVs in X_{m,i} can be found in the reconstruction Z_{m,i} produced by the mapping functions of the encoding and decoding. After the MVs in all incomplete subsets have been interpolated sequentially, the final interpolated data set X* is obtained.
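The control flow of the sequential procedure can be sketched as follows. This is an illustration of the ordering and of the reuse of interpolated attributes only: a linear least-squares predictor is swapped in for the trained MIDAE model so that the sketch stays short and runnable, and MVs are assumed to be encoded as NaN. The names `fit_linear`, `predict_linear` and `sequential_interpolate` are hypothetical.

```python
import numpy as np

def fit_linear(features, target):
    """Stand-in for model training: least-squares fit with an intercept."""
    A = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

def predict_linear(coef, features):
    """Stand-in for the reconstruction produced by a trained model."""
    A = np.column_stack([features, np.ones(len(features))])
    return A @ coef

def sequential_interpolate(X_m, X_c):
    """Interpolate incomplete attributes one by one, fewest MVs first."""
    X_out = X_m.copy()
    missing = np.isnan(X_m)
    counts = missing.sum(axis=0)
    n_attrs = X_m.shape[1]
    complete_cols = [j for j in range(n_attrs) if counts[j] == 0]
    incomplete_cols = sorted((j for j in range(n_attrs) if counts[j] > 0),
                             key=lambda j: counts[j])
    for j in incomplete_cols:
        # Train on the complete data set restricted to attribute j and the
        # attributes currently regarded as complete.
        coef = fit_linear(X_c[:, complete_cols], X_c[:, j])
        rows = missing[:, j]
        X_out[rows, j] = predict_linear(coef, X_out[rows][:, complete_cols])
        complete_cols.append(j)   # attribute j now counts as complete
    return X_out

# Complete data: the third attribute equals the sum of the first two.
X_c = np.array([[1.0, 2.0, 3.0, 1.0],
                [2.0, 1.0, 3.0, 0.0],
                [3.0, 4.0, 7.0, 2.0],
                [4.0, 3.0, 7.0, 1.0],
                [5.0, 7.0, 12.0, 3.0]])
X_m = np.array([[2.0, 2.0, np.nan, np.nan],
                [3.0, 1.0, 4.0, np.nan]])
X_star = sequential_interpolate(X_m, X_c)
print(X_star)
```

The third attribute (one MV) is filled first, and its interpolated value is then used as an input when the fourth attribute (two MVs) is filled.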
3.2.2 Interpolation method based on MIDAE-Batch
The basic idea of the MIDAE-Batch interpolation method is to train a unified MIDAE model and to interpolate the MVs in the incomplete data set X_m in batch.
In the model training phase, the complete data set X_c is used as the input. The first step is to generate a corrupted input by selecting a small number of elements of X_c as MVs and replacing them with some default value. To make the missing pattern of the corrupted input consistent with that of X_m, the ratio of each MV arrangement occurring in X_m is calculated from the missing indicator matrix S_m; in the missing indicator matrix S_m, a vector S_{m,i} ∈ S_m is defined as a possible MV arrangement that indicates the presence of MVs in the corresponding observation o_i ∈ X_m.
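One possible way to use these ratios when corrupting X_c is sketched below. This is an assumption about how the ratios are applied, not a procedure stated in the text: each complete observation is assigned one of the MV arrangements found in S_m, sampled with probability equal to its ratio, and the marked values are replaced by a default value. The names `corrupt_like` and `default_value` are hypothetical.

```python
import numpy as np

def corrupt_like(X_c, S_m, default_value=0.0, seed=0):
    """Corrupt X_c with MV arrangements sampled in proportion to their ratio in S_m."""
    rng = np.random.default_rng(seed)
    # Distinct MV arrangements (rows of S_m) and how often each occurs.
    patterns, counts = np.unique(S_m, axis=0, return_counts=True)
    ratios = counts / counts.sum()
    chosen = rng.choice(len(patterns), size=len(X_c), p=ratios)
    S_c = patterns[chosen]                          # synthetic indicator matrix
    X_tilde = np.where(S_c == 1, default_value, X_c)
    return X_tilde, S_c

S_m = np.array([[1, 0, 0],
                [1, 0, 0],
                [0, 1, 1],
                [1, 0, 0]])
X_c = np.ones((8, 3))
X_tilde, S_c = corrupt_like(X_c, S_m)
print(S_c)
print(X_tilde)
```

In this toy case the arrangement (1, 0, 0) has ratio 0.75 and (0, 1, 1) has ratio 0.25, and every row of the synthetic indicator matrix is one of these two arrangements.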
3.3 Missing pattern recognition
There are three commonly discussed missing data patterns, namely the univariate missing pattern, the monotonic missing pattern, and the general missing pattern.
As shown in FIG. 4, assume that the observations have five attributes a_1 to a_5. For the univariate missing pattern, MVs in the data appear on only a single attribute. As shown in FIG. 4a, MVs exist only on the third attribute (i.e., a_3). In the monotonic missing pattern, MVs in the data appear on multiple attributes; in addition, when the value on attribute a_i of an observation is missing, the values on all subsequent attributes a_j (j > i) of the same observation are also missing. As shown in FIG. 4b, when the value on attribute a_2 of an observation is missing, the values on a_3 to a_5 are also missing, i.e., the proportion of MVs on the incomplete attributes is monotonic. For the general missing pattern, MVs may appear randomly on any attribute.
Specifically, given an incomplete data set X_m, its missing pattern is determined from the corresponding missing indicator matrix S_m. According to Definition 4, s_ij = 1 when the value on the j-th attribute of the i-th observation is missing, and s_ij = 0 otherwise. Thus, in the matrix S_m, the sum of each row is the number of MVs in the corresponding observation, and the sum of each column is the number of MVs on the corresponding attribute. If the sum of a row (column) is zero, there are no MVs in that observation (on that attribute). Since the complete attributes (i.e., attributes without MVs) do not affect the identification of the missing pattern, they (i.e., the zero columns) are removed from the missing indicator matrix S_m, and S'_m denotes the resulting simplified missing indicator matrix with d' incomplete attributes. By inspecting S'_m, the univariate missing pattern can be determined easily: it holds if only one column (attribute) is left. Next, how to determine whether the missing pattern of X_m is monotonic is described.
First, we reorder the incomplete attributes in S'_m in ascending order of the number of MVs on each attribute. Then, for each row of S'_m, in the case of the monotonic missing pattern, once the first "1" (i.e., the first MV) appears, all values on the following attributes are "1". This decision criterion follows from the definition of the monotonic missing pattern introduced above, i.e., when a value on an attribute of an observation is missing, all values on the subsequent attributes of the same observation are also missing. Specifically, for the i-th row of S'_m, if the index of the first "1" is j (starting from 0), then the number of "1"s in this row should be d'-j, i.e., the sum of the i-th row of S'_m should be equal to d'-j. If all rows of S'_m satisfy this condition, the missing pattern of the incomplete data set X_m is monotonic; otherwise, the missing pattern of the incomplete data set X_m is general.
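The recognition procedure described in this section can be sketched as follows. This is a minimal sketch assuming a 0/1 NumPy indicator matrix; the function name `identify_missing_pattern` and the returned labels are hypothetical, and the extra label "complete" for a matrix without any MV is an addition for robustness.

```python
import numpy as np

def identify_missing_pattern(S_m):
    """Classify a 0/1 missing indicator matrix as univariate, monotonic or general."""
    S = np.asarray(S_m, dtype=int)
    col_counts = S.sum(axis=0)               # number of MVs on each attribute
    keep = col_counts > 0                    # drop the complete attributes
    S_reduced = S[:, keep]
    d_prime = S_reduced.shape[1]
    if d_prime == 0:
        return "complete"
    if d_prime == 1:
        return "univariate"
    # Reorder the incomplete attributes by ascending number of MVs.
    order = np.argsort(col_counts[keep], kind="stable")
    S_sorted = S_reduced[:, order]
    for row in S_sorted:
        ones = np.flatnonzero(row)
        if ones.size == 0:
            continue                         # observation without MVs
        j = ones[0]                          # index of the first "1"
        if row.sum() != d_prime - j:         # all later entries must be "1"
            return "general"
    return "monotonic"

print(identify_missing_pattern([[0, 0, 1], [0, 0, 1]]))
# → univariate
print(identify_missing_pattern([[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]))
# → monotonic
print(identify_missing_pattern([[1, 0, 1], [0, 1, 0], [1, 1, 0]]))
# → general
```

The returned label can then drive the branch of step S3: "univariate" or "monotonic" leads to the sequential method, and "general" leads to the batch method.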
Examples
The experiments of this embodiment were run on a system with an i5-8265U @ 1.6 GHz processor, 16 GB of memory, and 64-bit Windows 7, and all programs were implemented in Python to evaluate experimentally the effectiveness of the interpolation method proposed herein.
1. Test data
The experimental data of the embodiment mainly come from a database collected by the energy management system of an aluminum profile enterprise in South China. The production data collected from the selected extruder, model SY-3600Ton, from June to September 2020 are used as the sample set; the data are time-series data collected once every 20 seconds. The sample set contains 8991 observations, and each observation has 6 attribute values: electrical energy, outlet temperature, extrusion pressure, extrusion time, and extrusion speed. To evaluate the performance of the proposed MV interpolation method on the collected observations comprehensively, the missing rate is used as a variable, and the interpolation accuracy of the various MV interpolation methods is shown under varying missing rates in the following evaluation. Therefore, the collected data set is regarded as the original clean data set, and incomplete data sets of different sizes are generated according to different missing rates. For example, when the missing rate is 5%, 5% of the observations in the entire data set are randomly selected to form the incomplete data set on which the interpolation results are reported, and the remaining observations form the complete data set used to train the learning model.
In addition, to generate incomplete data sets X_m with various missing patterns, we introduce MVs in different ways, the details of which are as follows.
Univariate missing pattern: an attribute a_i (1 ≤ i ≤ d) is randomly selected as the incomplete attribute, and all values on attribute a_i of the observations in X_m are marked as MVs.
Monotonic missing pattern: first, half of the attributes are randomly selected as incomplete attributes. Then, in X_m, the proportion of MVs on successive incomplete attributes decreases gradually from 100% with a step size of 10%. In addition, the MVs on incomplete attribute a_i are placed in observations randomly selected from those selected for the previous incomplete attribute a_{i-1}. Finally, the selected values are marked as MVs.
General missing pattern: for X_m, half of the attributes are randomly selected as incomplete attributes, and values on the incomplete attributes are marked as MVs.
2. Evaluation index
To evaluate the performance of the various MV interpolation methods, the present embodiment adopts the Root Mean Square Error (RMSE), which is commonly used in the field of data prediction, as the evaluation index.
The RMSE is the square root of the mean of the squared errors. Mathematically, it can be expressed as:
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (Y'_i − Y_i)^2 )
where Y'_i denotes the i-th predicted value, Y_i denotes the corresponding actual value, and n is the number of values compared.
The Root Mean Square Error (RMSE) is used to measure the interpolation deviation between the interpolated result and the true value of the data. The lower the RMSE, the closer the interpolation result is to reality, so the better the interpolation performance.
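The evaluation index can be computed as in the following minimal sketch. The helper `rmse_on_missing`, which restricts the comparison to the positions marked as MVs in an indicator matrix, reflects an assumption about how the index is applied to interpolation results; both function names are hypothetical.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error between predicted and actual values."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def rmse_on_missing(X_interpolated, X_true, S):
    """RMSE computed only over the positions where the indicator matrix S is 1."""
    mask = np.asarray(S, dtype=bool)
    return rmse(np.asarray(X_interpolated)[mask], np.asarray(X_true)[mask])

X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_interpolated = np.array([[1.0, 5.0], [3.0, 4.0]])
S = np.array([[0, 1], [0, 0]])           # one MV, interpolated as 5 instead of 2
print(rmse_on_missing(X_interpolated, X_true, S))  # → 3.0
```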
3. Model set-up
To facilitate experimental evaluation and faster convergence, the collected observations are first normalized in a pre-processing step. In addition, each test was repeated 10 times and the average results are reported, to obtain reliable experimental results. For the statistical interpolation method KNN, the parameter k is set to 10, because with k = 10 the interpolation error is low for all data sets within an acceptable running time. For the machine learning method, i.e., the Bayesian model, the dimension of the submatrix or latent space is specified as half the input dimension. For the MIDAE-Sequential and MIDAE-Batch methods based on deep learning models, each model was trained for 1000 epochs with a learning rate of 0.01 and a batch size of 256. In addition, the interpolation accuracy based on the simple model is lower than that of the conventional MV interpolation method. Therefore, the MIDAE model in the present embodiment has one hidden (embedding) layer.
4. Selecting activation function
The interpolation method based on the MIDAE model is a deep learning method, and the deep neural network structure with the best performance needs to be determined. Since the best-performing activation function may depend on the data set and the application, various activation functions (including Sigmoid, TanH, ReLU, Softplus, and ELU) are used in this embodiment to verify the performance of MIDAE-Sequential and MIDAE-Batch; the experimental performance of the various activation functions is shown in FIG. 6.
FIG. 6 shows the interpolation performance of MIDAE-Sequential and MIDAE-Batch with different activation functions under different missing rates. For the experimental data set, both MIDAE-Sequential and MIDAE-Batch achieve the lowest RMSE with the Sigmoid activation function under the various missing patterns, indicating that both methods presented herein perform best with the Sigmoid activation function. Therefore, we adopt the Sigmoid function as the default activation function of the MIDAE model.
5. Comparative analysis
The MIDAE-Sequential and MIDAE-Batch methods proposed by the invention are compared with existing MV interpolation methods, and the experimental results are shown in FIG. 7.
FIG. 7 shows the interpolation performance of all compared methods on extruder energy consumption data sets with various missing patterns. The neighborhood-based interpolation method (i.e., KNN) interpolates MVs by using the correlation between observations, whereas the MIDAE-based interpolation methods proposed by the invention interpolate MVs by using an internal interpolation method. Naive Bayes interpolates MVs by exploring the covariance matrix of the data, so the complex correlation in the data is easily underestimated and the interpolation accuracy is unsatisfactory. As can be seen from FIG. 7, the MIDAE-based methods achieve the lowest RMSE, because the proposed MIDAE model effectively captures the non-linear correlation between MVs and non-MVs in the data, and they show good performance under the different missing data patterns.
Table 1. Average RMSE values of the proposed methods and other methods under different missing data patterns
Table 1 summarizes the average RMSE values of the proposed methods and the other methods on the extruder energy consumption data sets with different missing patterns. It can be seen that the RMSE values of the MIDAE-Sequential and MIDAE-Batch methods are always the lowest under the different missing data patterns, which shows that the MIDAE-based interpolation methods provided by the invention consistently achieve the best interpolation accuracy among the compared methods. In addition, both MIDAE-Sequential and MIDAE-Batch perform well for data sets with univariate missing patterns. For data sets with monotonic missing patterns, MIDAE-Sequential performs better than MIDAE-Batch, while for data sets with general missing patterns, MIDAE-Batch performs better than MIDAE-Sequential. Therefore, the method provided by the invention is well suited to the interpolation of various missing data patterns and has high interpolation accuracy.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.