Electrocardiogram data processing method based on statistical learning
1. An electrocardiogram data processing method based on statistical learning, characterized by comprising the following steps:
S1, reading an electrocardiogram XML data file and converting it into csv or txt format data;
S2, performing anomaly detection on the data from step S1 and checking whether the file conversion succeeded;
S3, when step S2 determines that the conversion succeeded, performing feature extraction on the electrocardiogram data and computing an orthonormal Fourier basis;
S4, when step S2 determines that the conversion failed, determining that the format of the input data is wrong;
S5, computing a feature matrix of the sample library data by the least squares method according to statistical learning;
and S6, computing the distances between the input sample data to be classified and the sample data in a specific sample library and classifying accordingly, thereby obtaining quantitative index values, or index-value function curves, that compare the electrocardiosignal features of different populations.
2. The method for processing electrocardiographic data based on statistical learning according to claim 1, wherein said step S1 comprises the steps of:
S11, reading in the sample data files of specific population 1 in XML format;
S12, reading in the sample data files of specific population 2 in XML format;
S13, converting each XML data file from steps S11 and S12 into csv or txt format, which comprises parsing the XML, extracting the data between each pair of digits feature points in the file, arranging that data into 12-dimensional data, and storing it in csv or txt format;
and S14, determining whether the csv or txt format electrocardiogram data contains singularities.
3. The method for processing electrocardiographic data based on statistical learning according to claim 2, wherein said step S1 further comprises the steps of:
S15, reading the csv or txt data of a specific population 1 sample, which is either a 12-lead time series of size 20000 × 12 or a single-lead time series of size 20000 × 1;
S16, normalizing the one-dimensional time series obtained in step S15, as follows:
S161, computing the second-order difference of the data with indices 2000 to 18000, determining the minimum of the second-order difference, defining a threshold, and extracting all segments whose second-order difference is smaller than the minimum multiplied by the threshold;
S162, for the segments obtained in S161, computing the median of the corresponding electrocardiosignal values and returning the unit at which that median occurs, namely the representative maximum-value unit;
S163, cutting out the 2000 units after the representative maximum-value unit as the normalized time series (a 2000 × 1 vector);
S17, converting the normalized time series into functional data, comprising the following steps:
S171, setting a fixed time interval [0, 1000] and determining 300 orthonormal Fourier basis functions on that interval;
S172, using the least squares method, computing the linear combination of Fourier basis functions closest to the normalized data, and outputting the coefficients of that combination as the feature representation of the functional data (a 300 × 1 vector), as follows:
wherein S denotes the sequence to be fitted, extracted in S163; e_i is the i-th Fourier basis function from S171; and v(S) is the desired feature representation of the functional data;
S18, storing the functional-data coefficients from step S17 as the feature database of specific population 1 (a 300 × n matrix), where n is the total number of samples of specific population 1;
and S19, repeating the above steps for the data of specific population 2 to obtain the feature database of specific population 2 (a 300 × m matrix), where m is the total number of samples of specific population 2.
4. The method for processing electrocardiographic data based on statistical learning according to claim 3, wherein said step S15 further comprises the steps of:
S151, when the input is a 12-lead electrocardiosignal, applying the VCG (vectorcardiogram) linear transformation and outputting the first VCG component as the one-dimensional time series;
S152, when the input is a single-lead electrocardiosignal, keeping the data as the one-dimensional time series.
5. The method for processing electrocardiographic data based on statistical learning according to claim 1, further comprising analyzing the electrocardiosignals to be classified, which comprises the following steps:
S61, reading in the electrocardiogram data file to be classified in XML format;
S62, converting the XML data into csv or txt format;
S63, taking the csv or txt format data as input, performing steps S15, S16 and S17 of the data-file normalization and functional-data conversion procedure, and obtaining the functional feature representation of the electrocardiosignal (a 300 × 1 vector);
S64, performing regression analysis on the functional feature representation and outputting a result, wherein the algorithms used for the regression analysis include the KNN (K-nearest neighbors) algorithm and the SVM (support vector machine) algorithm;
the KNN algorithm:
S651, computing the distance between the feature representation obtained in step S63 and the feature data of specific population 1 from step S18, namely the minimum distance between that feature representation and the column vectors of the feature database matrix of specific population 1, wherein the distance computation methods include:
S6511, the Euclidean distance;
S6512, the harmonic mean of the K smallest Euclidean distances:
S652, computing the distance between the feature representation obtained in step S63 and the feature data of specific population 2 from step S19, namely the minimum distance between that feature representation and the column vectors of the feature database matrix of specific population 2, using the same distance computation as in S651;
S653, comparing the minima output by steps S651 and S652, outputting the comparison data, and outputting a classification result;
the SVM algorithm:
S66, using a soft-margin cost function to linearly separate the two feature databases obtained in steps S18 and S19, wherein the optimization formula is as follows (the labels of specific population 1 and specific population 2 are -1 and 1 respectively, and the sizes of their feature sample libraries are m0 and m1 respectively), where x_i and y_i are respectively a feature vector and its label in the sample library, and b and λ are adjustable parameters;
for newly entered data x, wx-b is computed; if wx-b is greater than or equal to 1, x is classified as specific population 2, and if wx-b is less than or equal to -1, x is classified as specific population 1.
Background
Electrocardiography (ECG) is an important means of analyzing cardiac electrical signals because it is noninvasive, nondestructive, simple to operate and backed by mature clinical experience, and it has been in development for more than a hundred years since its invention.
However, the conventional electrocardiogram method is a graphical analysis technique that relies mainly on manual judgment, so its efficiency, accuracy and stability are poor. Although advances in computer technology in recent years have enabled some intelligent analysis techniques and improved efficiency, the information that can be represented is still limited to the graph, and the underlying technique has not been fundamentally advanced.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide an electrocardiogram data processing method based on statistical learning, which describes electrocardiogram data quantitatively from a statistical perspective and achieves interpretable, fine-grained classification beyond what the conventional electrocardiogram method offers. The method combines statistical feature extraction, feature matrix calculation and feature distance classification, which minimizes the distortion introduced by conventional approximation algorithms and allows the statistical features of electrocardiogram data to be identified quickly and accurately. The method also improves as it accumulates data: as the sample library is enriched and expanded, classification accuracy and stability increase, so the method has broad application prospects as a basic technology for electrocardiogram data analysis.
The invention provides an electrocardiogram data processing method based on statistical learning, which comprises the following steps:
S1, reading an electrocardiogram XML data file and converting it into csv or txt format data;
S2, performing anomaly detection on the data from step S1 and checking whether the file conversion succeeded;
S3, when step S2 determines that the conversion succeeded, performing feature extraction on the electrocardiogram data and computing an orthonormal Fourier basis;
S4, when step S2 determines that the conversion failed, determining that the format of the input data is wrong;
S5, computing a feature matrix of the sample library data by the least squares method according to statistical learning;
and S6, computing the distances between the input sample data to be classified and the sample data in a specific sample library and classifying accordingly, thereby obtaining quantitative index values, or index-value function curves, that compare the electrocardiosignal features of different populations.
Preferably, the step S1 includes the steps of:
S11, reading in the sample data files of specific population 1 in XML format;
S12, reading in the sample data files of specific population 2 in XML format;
S13, converting each XML data file from steps S11 and S12 into csv or txt format, which comprises parsing the XML, extracting the data between each pair of digits feature points in the file, arranging that data into 12-dimensional data, and storing it in csv or txt format;
and S14, determining whether the csv or txt format electrocardiogram data contains singularities.
Preferably, the step S1 further includes the steps of:
S15, reading the csv or txt data of a specific population 1 sample, which is either a 12-lead time series of size 20000 × 12 or a single-lead time series of size 20000 × 1;
S16, normalizing the one-dimensional time series obtained in step S15, as follows:
S161, computing the second-order difference of the data with indices 2000 to 18000, determining the minimum of the second-order difference, defining a threshold, and extracting all segments whose second-order difference is smaller than the minimum multiplied by the threshold;
S162, for the segments obtained in S161, computing the median of the corresponding electrocardiosignal values and returning the unit at which that median occurs, namely the representative maximum-value unit;
S163, cutting out the 2000 units after the representative maximum-value unit as the normalized time series (a 2000 × 1 vector);
S17, converting the normalized time series into functional data, comprising the following steps:
S171, setting a fixed time interval [0, 1000] and determining 300 orthonormal Fourier basis functions on that interval;
S172, using the least squares method, computing the linear combination of Fourier basis functions closest to the normalized data, and outputting the coefficients of that combination as the feature representation of the functional data (a 300 × 1 vector), as follows:
wherein S denotes the sequence to be fitted, extracted in S163; e_i is the i-th Fourier basis function from S171; and v(S) is the desired feature representation of the functional data;
S18, storing the functional-data coefficients from step S17 as the feature database of specific population 1 (a 300 × n matrix), where n is the total number of samples of specific population 1;
and S19, repeating the above steps for the data of specific population 2 to obtain the feature database of specific population 2 (a 300 × m matrix), where m is the total number of samples of specific population 2.
Preferably, the step S15 further includes the steps of:
S151, when the input is a 12-lead electrocardiosignal, applying the VCG (vectorcardiogram) linear transformation and outputting the first VCG component as the one-dimensional time series;
S152, when the input is a single-lead electrocardiosignal, keeping the data as the one-dimensional time series.
Preferably, the method further comprises analyzing newly entered electrocardiosignals, which comprises the following steps:
S61, reading in the electrocardiogram data file to be analyzed in XML format;
S62, converting the XML data into csv or txt format;
S63, taking the csv or txt format data as input, performing steps S15, S16 and S17 of the data-file normalization and functional-data conversion procedure, and obtaining the functional feature representation of the electrocardiosignal (a 300 × 1 vector);
S64, performing regression analysis on the functional feature representation and outputting a result, wherein the algorithms used for the regression analysis include the KNN (K-nearest neighbors) algorithm and the SVM (support vector machine) algorithm;
the KNN algorithm:
S651, computing the distance between the feature representation obtained in step S63 and the feature data of specific population 1 from step S18, namely the minimum distance between that feature representation and the column vectors of the feature database matrix of specific population 1, wherein the distance computation methods include:
S6511, the Euclidean distance;
S6512, the harmonic mean of the K smallest Euclidean distances:
S652, computing the distance between the feature representation obtained in step S63 and the feature data of specific population 2 from step S19, namely the minimum distance between that feature representation and the column vectors of the feature database matrix of specific population 2, using the same distance computation as in S651;
S653, comparing the minima output by steps S651 and S652, outputting the comparison data, and outputting a classification result;
the SVM algorithm:
S66, using a soft-margin cost function to linearly separate the two feature databases obtained in steps S18 and S19, wherein the optimization formula is as follows (the labels of specific population 1 and specific population 2 are -1 and 1 respectively, and the sizes of their feature sample libraries are m0 and m1 respectively), where x_i and y_i are respectively a feature vector and its label in the sample library, and b and λ are adjustable parameters;
for newly entered data x, wx-b is computed; if wx-b is greater than or equal to 1, x is classified as specific population 2, and if wx-b is less than or equal to -1, x is classified as specific population 1.
Compared with the prior art, the invention has the following beneficial effects: because the statistical learning method searches the entire sample database, increasing the number of samples leads to more accurate analysis results. During classification, the sample library does not need to be re-normalized or recomputed; a sample to be classified only requires the feature function matrix of the sample library to be downloaded or updated, after which fine classification is carried out directly. This minimizes the distortion introduced by conventional approximation algorithms and allows the statistical features of electrocardiogram data to be identified quickly and accurately.
Drawings
Fig. 1 is a schematic structural diagram of an electrocardiogram data processing method based on statistical learning according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to Fig. 1, an electrocardiogram data processing method based on statistical learning includes:
S1, reading an electrocardiogram XML data file and converting it into csv or txt format data;
S2, performing anomaly detection on the data from step S1 and checking whether the file conversion succeeded;
S3, when step S2 determines that the conversion succeeded, performing feature extraction on the electrocardiogram data and computing an orthonormal Fourier basis;
S4, when step S2 determines that the conversion failed, determining that the format of the input data is wrong;
S5, computing a feature matrix of the sample library data by the least squares method according to statistical learning;
and S6, computing the distances between the input sample data to be classified and the sample data in a specific sample library and classifying accordingly, thereby obtaining quantitative index values, or index-value function curves, that compare the electrocardiosignal features of different populations.
Further, the step S1 includes the following steps:
S11, reading in the sample data files of specific population 1 in XML format;
S12, reading in the sample data files of specific population 2 in XML format;
S13, converting each XML data file from steps S11 and S12 into csv or txt format, which comprises parsing the XML, extracting the data between each pair of digits feature points in the file, arranging that data into 12-dimensional data, and storing it in csv or txt format;
and S14, determining whether the csv or txt format electrocardiogram data contains singularities.
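Steps S11 to S14 can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: it assumes each lead is stored as whitespace-separated numbers inside an element named digits, and it follows Example 1 in treating NaN, 0 and blank values as singularities. The function names xml_to_rows, has_singularity and rows_to_csv are hypothetical.

```python
import csv
import io
import math
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text):
    """Collect every 'digits' element and return one row per time step."""
    root = ET.fromstring(xml_text)
    leads = []
    for elem in root.iter():
        # Drop an optional XML namespace prefix such as "{urn:...}digits".
        if elem.tag.rsplit("}", 1)[-1] == "digits":
            tokens = (elem.text or "").split()
            if not tokens:
                raise ValueError("blank digits element")
            leads.append([float(t) for t in tokens])
    if not leads:
        raise ValueError("no digits elements found: input format is wrong")
    if len({len(lead) for lead in leads}) != 1:
        raise ValueError("leads have different lengths")
    # Transpose so that each row is a time step and each column a lead.
    return [list(row) for row in zip(*leads)]


def has_singularity(rows):
    """Return True if any sample is NaN or exactly 0 (assumed reading of S14)."""
    return any(math.isnan(v) or v == 0.0 for row in rows for v in row)


def rows_to_csv(rows):
    """Serialize the rows as csv text, one time step per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

A file with twelve digits elements of 20000 samples each would yield 20000 rows of 12 values, matching the 20000 × 12 layout read in step S15.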
Further, the step S1 further includes the following steps:
S15, reading the csv or txt data of a specific population 1 sample, which is either a 12-lead time series of size 20000 × 12 or a single-lead time series of size 20000 × 1;
S16, normalizing the one-dimensional time series obtained in step S15, as follows:
S161, computing the second-order difference of the data with indices 2000 to 18000, determining the minimum of the second-order difference, defining a threshold, and extracting all segments whose second-order difference is smaller than the minimum multiplied by the threshold;
S162, for the segments obtained in S161, computing the median of the corresponding electrocardiosignal values and returning the unit at which that median occurs, namely the representative maximum-value unit;
S163, cutting out the 2000 units after the representative maximum-value unit as the normalized time series (a 2000 × 1 vector);
S17, converting the normalized time series into functional data, comprising the following steps:
S171, setting a fixed time interval [0, 1000] and determining 300 orthonormal Fourier basis functions on that interval;
S172, using the least squares method, computing the linear combination of Fourier basis functions closest to the normalized data, and outputting the coefficients of that combination as the feature representation of the functional data (a 300 × 1 vector), as follows:
wherein S denotes the sequence to be fitted, extracted in S163; e_i is the i-th Fourier basis function from S171; and v(S) is the desired feature representation of the functional data;
S18, storing the functional-data coefficients from step S17 as the feature database of specific population 1 (a 300 × n matrix), where n is the total number of samples of specific population 1;
and S19, repeating the above steps for the data of specific population 2 to obtain the feature database of specific population 2 (a 300 × m matrix), where m is the total number of samples of specific population 2.
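The normalization of S161 to S163 and the functional-data conversion of S171 and S172 might be sketched as below, using NumPy. Several points are assumptions rather than statements of the patented method: the threshold value 0.5, the reading of S162 as "pick the candidate whose amplitude is closest to the median", the mapping of the 2000 samples onto [0, 1000), and the particular ordering of the constant, cosine and sine basis functions. The least-squares step is taken to mean choosing the coefficient vector c that minimizes ||S - sum_i c_i e_i||^2.

```python
import numpy as np


def normalize_beat(x, lo=2000, hi=18000, threshold=0.5, length=2000):
    """S161-S163: find a representative peak and cut `length` samples after it."""
    x = np.asarray(x, dtype=float)
    d2 = np.diff(x[lo:hi], n=2)              # second-order difference
    cutoff = d2.min() * threshold            # minimum multiplied by the threshold
    # d2[k] is centred on sample lo + k + 1 of the original series.
    candidates = lo + 1 + np.flatnonzero(d2 < cutoff)
    if candidates.size == 0:
        raise ValueError("no sharp peak found in the search window")
    values = x[candidates]
    # Assumed reading of S162: the candidate whose value is closest to the median.
    peak = candidates[np.argmin(np.abs(values - np.median(values)))]
    return x[peak:peak + length]


def fourier_basis(n_samples=2000, n_basis=300, period=1000.0):
    """S171: sample n_basis orthonormal Fourier functions on [0, period)."""
    t = np.linspace(0.0, period, n_samples, endpoint=False)
    cols = [np.full(n_samples, 1.0 / np.sqrt(period))]
    for i in range(1, n_basis):
        k = (i + 1) // 2                     # frequency index of the i-th function
        trig = np.cos if i % 2 == 1 else np.sin
        cols.append(np.sqrt(2.0 / period) * trig(2.0 * np.pi * k * t / period))
    return np.column_stack(cols)


def fourier_features(s, basis):
    """S172: least-squares coefficients of s with respect to the basis columns."""
    coeffs, *_ = np.linalg.lstsq(basis, np.asarray(s, dtype=float), rcond=None)
    return coeffs
```

Stacking the 300-element outputs of fourier_features column by column for all samples of one population gives a 300 × n matrix of the kind described in S18.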
Further, the step S15 further includes the following steps:
S151, when the input is a 12-lead electrocardiosignal, applying the VCG (vectorcardiogram) linear transformation and outputting the first VCG component as the one-dimensional time series;
S152, when the input is a single-lead electrocardiosignal, keeping the data as the one-dimensional time series.
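Steps S151 and S152 can be illustrated with the short sketch below. The Vx coefficients are copied as printed in Example 1, with leads that have no printed term given a weight of 0. The column order of the 12-lead matrix is an assumption (it depends on the measuring instrument), and the names VX_COEFFS, DEFAULT_LEADS and to_one_dimensional are hypothetical.

```python
import numpy as np

# Vx coefficients as printed in Example 1; leads without a printed term get 0.
VX_COEFFS = {"I": 0.38, "II": -0.07, "V1": -0.13, "V2": 0.05,
             "V4": -0.01, "V5": 0.06, "V6": 0.54}

# Assumed column order of the 20000 x 12 input (hypothetical, instrument dependent).
DEFAULT_LEADS = ("I", "II", "III", "aVR", "aVL", "aVF",
                 "V1", "V2", "V3", "V4", "V5", "V6")


def to_one_dimensional(ecg, lead_names=DEFAULT_LEADS):
    """Return the first VCG component for 12-lead input, or the signal itself."""
    ecg = np.asarray(ecg, dtype=float)
    if ecg.ndim == 1:
        return ecg                            # S152: already one-dimensional
    if ecg.shape[1] == 1:
        return ecg[:, 0]                      # S152: single-lead column
    if ecg.shape[1] != len(lead_names):
        raise ValueError("number of columns does not match lead_names")
    weights = np.array([VX_COEFFS.get(name, 0.0) for name in lead_names])
    return ecg @ weights                      # S151: linear combination of leads
```

Because the transformation is a fixed linear map, it costs one matrix-vector product per recording and needs no fitting.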
Further, the method also comprises analyzing newly entered electrocardiosignals, which comprises the following steps:
S61, reading in the electrocardiogram data file to be classified in XML format;
S62, converting the XML data into csv or txt format;
S63, taking the csv or txt format data as input, performing steps S15, S16 and S17 of the data-file normalization and functional-data conversion procedure, and obtaining the functional feature representation of the electrocardiosignal (a 300 × 1 vector);
S64, performing regression analysis on the functional feature representation and outputting a result, wherein the algorithms used for the regression analysis include the KNN (K-nearest neighbors) algorithm and the SVM (support vector machine) algorithm;
the KNN algorithm:
S651, computing the distance between the feature representation obtained in step S63 and the feature data of specific population 1 from step S18, namely the minimum distance between that feature representation and the column vectors of the feature database matrix of specific population 1, wherein the distance computation methods include:
S6511, the Euclidean distance;
S6512, the harmonic mean of the K smallest Euclidean distances:
S652, computing the distance between the feature representation obtained in step S63 and the feature data of specific population 2 from step S19, namely the minimum distance between that feature representation and the column vectors of the feature database matrix of specific population 2, using the same distance computation as in S651;
S653, comparing the minima output by steps S651 and S652, outputting the comparison data, and outputting a classification result;
the SVM algorithm:
S66, using a soft-margin cost function to linearly separate the two feature databases obtained in steps S18 and S19, wherein the optimization formula is as follows (the labels of specific population 1 and specific population 2 are -1 and 1 respectively, and the sizes of their feature sample libraries are m0 and m1 respectively), where x_i and y_i are respectively a feature vector and its label in the sample library, and b and λ are adjustable parameters;
for newly entered data x, wx-b is computed; if wx-b is greater than or equal to 1, x is classified as specific population 2, and if wx-b is less than or equal to -1, x is classified as specific population 1.
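The distance-based classification of S651 to S653 might look like the sketch below. It assumes each feature database is a matrix whose columns are feature vectors, as in S18 and S19. The harmonic mean uses the standard definition K / (1/d_1 + ... + 1/d_K), which is an assumption because the formula for S6512 is not reproduced in the text. The tie-breaking rule (population 2 unless the first distance is strictly smaller) follows Example 1. All function names are hypothetical.

```python
import numpy as np


def distances_to_library(v, library):
    """Euclidean distance from v to every column of the feature database."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    return np.linalg.norm(np.asarray(library, dtype=float) - v, axis=0)


def min_distance(v, library):
    """S6511: smallest Euclidean distance to any column."""
    return float(distances_to_library(v, library).min())


def harmonic_mean_k(v, library, k):
    """S6512: harmonic mean of the k smallest Euclidean distances."""
    d = np.sort(distances_to_library(v, library))[:k]
    if np.any(d == 0.0):
        return 0.0                            # an exact match dominates the mean
    return float(len(d) / np.sum(1.0 / d))


def classify(v, library1, library2, distance=min_distance):
    """S653: compare the two distances and return (label, d1, d2)."""
    d1 = distance(v, library1)
    d2 = distance(v, library2)
    return (1 if d1 < d2 else 2), d1, d2
```

To classify with the harmonic-mean variant, one could pass distance=lambda v, lib: harmonic_mean_k(v, lib, 5), where 5 is an arbitrary illustrative choice of K.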
Example 1
The invention is further described below:
reading in the sample data files of specific population 1 in XML format;
reading in the sample data files of specific population 2 in XML format;
for each XML data file, traversing all feature points, finding all digits pairs, extracting all data between each pair, rearranging the data into multi-dimensional parallel data, and storing it in a new csv or txt file;
checking whether each csv or txt electrocardiogram data file contains NaN values, 0 values or blank values;
determining the data dimensions of the input csv or txt electrocardiogram data file, which is either a 12-lead time series of size 20000 × 12 or a single-lead time series of size 20000 × 1 (depending on the measuring instrument):
if a 12-lead electrocardiosignal is input, applying the following VCG linear transformation and outputting the first VCG component as the one-dimensional time series;
Vx=0.38*I-0.07*II-0.13*V1+0.05*V2-0.01*V4+0.06*V5+0.54*V6;
Vy=-0.07*I+0.93*II+0.06*V1-0.02*V2-0.05*V3+0.06*V4-0.17*V5+0.13*V6;
Vz=0.11*I-0.23*II-0.43*V1-0.06*V2-0.14*V3-0.20*V4-0.11*V5+0.31*V6;
wherein I, II, V1, V2, V3, V4, V5 and V6 are leads of the standard 12-lead electrocardiogram;
if a single-lead electrocardiosignal is input, keeping the data as the one-dimensional time series;
normalizing the one-dimensional time series obtained in [0019-0024]: finding the representative maximum-value unit between indices 2000 and 18000, and cutting out the 2000 units after it as the normalized time series (a 2000 × 1 vector);
the representative maximum-value unit is selected as described in steps S161 and S162;
converting the normalized time series into functional data: setting a fixed time interval [0, 1000] and determining 300 orthonormal Fourier basis functions on that interval;
using the least squares method, computing the linear combination of Fourier basis functions closest to the normalized data, and outputting the coefficients of that combination as the feature representation of the functional data (a 300 × 1 vector);
storing the functional-data coefficients as the feature database of specific population 1 (a 300 × n matrix), where n is the total number of samples of specific population 1;
for the data of the specific crowd 2, repeating [ 0016-;
the newly entered electrocardiosignal is analyzed as follows:
reading in an electrocardiogram data file to be analyzed in an xml format, and repeating the steps [ 0016-;
repeating [ 0025-;
if the KNN algorithm is used, computing the minimum Euclidean distance between the feature representation and the column vectors of the specific population 1 feature database matrix obtained in [0029];
computing the minimum Euclidean distance between the feature representation and the column vectors of the specific population 2 feature database matrix obtained in [0030];
if the output of [0034] is smaller than the output of [0035], classifying the electrocardiogram data as specific population 1; otherwise, classifying it as specific population 2;
and outputting the classification result.
If the SVM algorithm is used, refer to step S66 for the specific steps.
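The soft-margin separation of step S66 can be illustrated with the sketch below. Since the optimization formula is not reproduced in the text, the sketch assumes the common hinge-loss form, minimizing (1/n) Σ max(0, 1 - y_i(w·x_i - b)) + λ||w||^2, and solves it by plain subgradient descent; here w and b are both learned and only λ, the step size and the iteration count are set by hand, which may differ from the patent's treatment of b as adjustable. The function names and default values are hypothetical.

```python
import numpy as np


def train_soft_margin_svm(X, y, lam=0.01, lr=0.1, epochs=1000):
    """Fit w and b by subgradient descent on the regularized hinge loss."""
    X = np.asarray(X, dtype=float)            # one sample per row
    y = np.asarray(y, dtype=float)            # labels -1 (population 1) or 1 (population 2)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        active = margins < 1.0                # samples violating the margin
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * w
        grad_b = y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def classify_svm(x, w, b):
    """Apply the decision rule of S66; return None inside the margin."""
    score = float(np.dot(w, x) - b)
    if score >= 1.0:
        return 2
    if score <= -1.0:
        return 1
    return None
```

With feature databases stored as 300 × n and 300 × m matrices, the training matrix would be the transpose of their horizontal concatenation, with n labels of -1 followed by m labels of 1.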
The number of devices and the scale of the processes described herein are intended to simplify the description of the invention, and applications, modifications and variations of the invention will be apparent to those skilled in the art.
While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.