Method for predicting the storage years of baijiu
1. A method for predicting the storage years of baijiu, characterized by comprising the following steps:
step 1, acquiring volatile flavor component fingerprints of baijiu with different storage times by GC-MS;
step 2, screening modeling features from the fingerprints by extremely randomized trees (extra-trees) regression and sklearn feature selection;
step 3, establishing a prediction model by using the modeling features as the features of an XGBoost regression model;
and step 4, predicting the storage years of the baijiu with the prediction model.
2. The method for predicting the storage years of baijiu according to claim 1, wherein in step 1, the specific method for acquiring the volatile flavor component fingerprints of baijiu with different storage times by GC-MS comprises the following steps:
step 101, taking base baijiu with different storage times as samples, reducing the alcohol content of each sample to below a set value with ultrapure water, and adding sodium chloride and an internal standard to obtain the sample to be tested;
step 102, extracting volatile compounds from the headspace of the sample to be tested with an extraction fiber by headspace solid-phase microextraction;
and step 103, after the extraction fiber is desorbed at the injection port, acquiring volatile component fingerprint information by GC-MS (gas chromatography-mass spectrometry), and compiling the corresponding data to obtain the volatile flavor component fingerprints of baijiu with different storage times.
3. The method for predicting the storage years of baijiu according to claim 2, wherein the specific method for obtaining the sample to be tested in step 101 comprises the following steps:
taking baijiu with different aging times as samples, reducing the alcohol content of each sample to 5-10% vol, placing 4-8 mL of the diluted sample in a sample vial, adding sodium chloride at 0.2 g/mL until the solution is saturated, and adding 10 μL of an internal standard to obtain the sample to be tested; wherein the internal standard is tert-amyl alcohol at a concentration of 8.05 g/L.
4. The method for predicting the storage years of baijiu according to claim 2, wherein in step 102, the parameters of the headspace solid-phase microextraction are as follows: equilibration at 40-60 ℃ for 1-25 min, and extraction for 5-180 min.
5. The method for predicting the storage years of baijiu according to claim 2, wherein in step 103, the GC analysis conditions are as follows: a 60 m × 0.25 mm × 0.50 μm TG-WAXMS capillary gas chromatography column is used; the carrier gas is high-purity helium at a flow rate of 1.0 mL/min; the split ratio is 20:1; the temperature program is: hold at 50 ℃ for 2 min, raise to 145 ℃ at 3 ℃/min, then raise to 230 ℃ at 15 ℃/min and hold for 3 min; and the injection port temperature is 250 ℃.
6. The method for predicting the storage years of baijiu according to claim 5, wherein in step 103, the MS analysis conditions are as follows: transfer line temperature 200 ℃; ion source temperature 260 ℃; scan range m/z 33-350 amu; ionization mode EI+; electron energy 70 eV.
7. The method for predicting the storage years of baijiu according to claim 2, wherein in step 2, the specific method for screening modeling features from the fingerprints by extremely randomized trees regression and sklearn feature selection comprises the following steps:
step 201, dividing the compiled data into a training set and a test set at a set ratio;
step 202, training an extremely randomized trees regression model on the training set and screening the top N1 to N2 features ranked by their contribution to the regression analysis of the baijiu storage years, wherein N1 and N2 are positive integers and N1 < N2;
step 203, screening the top N1 to N2 features most relevant to the storage years of the baijiu by using f_regression and mutual_info_regression in the sklearn feature selection module;
and step 204, taking the intersection of the features screened in step 202 and step 203 as the modeling features.
8. The method for predicting the storage years of baijiu according to claim 7, wherein in step 3, the model evaluation index of the prediction model is R², and the effective features are the features common to the top N3 features screened in steps 202 and 203, wherein N3 is a positive integer and N1 < N3 < N2.
9. The method for predicting the storage years of baijiu according to claim 8, wherein in step 4, the specific method for predicting the storage years of the baijiu with the prediction model comprises the following steps:
performing Venn analysis on the top N3 features screened in steps 202 and 203, establishing the prediction model with the N4 shared features as the modeling features, and applying the prediction model to the test set for prediction, wherein N4 is a positive integer and N4 < N3.
10. The method for predicting the storage years of baijiu according to claim 9, wherein the modeling features include: ethyl elaeate, ethyl linoleate, undecanol, 2-phenylethyl acetate, 1-methylene-1H-indene, butyric acid, ethyl 3-hexenoate, hexanoic acid, isobutyraldehyde, ethyl pentadecanoate, diethyl succinate, 3-methylbutyl heptanoate, ethyl hexadecanoate, phytone, ethyl 9-hexadecenoate, octyl octanoate, ethyl tridecanoate, ethyl L(-)-lactate, 2-phenylethyl hexanoate, octyl 3-methylbutyrate, ethyl trans-4-decenoate, heptanoic acid, furfural, 2,4-di-tert-butylphenol, butyl valerate, 2-pentadecanone, n-propyl acetate, octyl butyrate, 1-methylhexyl hexanoate, ethyl undecanoate, ethyl tetradecanoate, and 3-methylbutyl octanoate.
Background
Baijiu is widely held to grow more fragrant with age. Because of this, many "aged baijiu" products labeled with a storage year have appeared on the market at relatively high prices. However, some enterprises currently label storage years arbitrarily, which adds to the disorder of the aged baijiu market and harms the image of the baijiu industry. Consumers therefore strongly call for stricter standards for the aged baijiu market, and baijiu enterprises are stepping up research and development to establish a method for identifying storage years and set the record straight. Given this market demand, analyzing the quality characteristics of aged baijiu, presenting its product quality in scientific language and data, and establishing a stable and practical method for identifying aged baijiu is a problem the baijiu industry must now address.
In the field of supervision and identification of baijiu storage years, there is currently no applicable national standard. The main identification techniques proposed by researchers include the following. Xu Zhancheng proposed a volatility coefficient method for identifying aged baijiu, which identifies the storage years by constructing a functional relationship between the storage life of the baijiu and its volatile matter content. Yang Tao et al. identified aged baijiu from several angles, using the variation in the contents of Al, Fe, Cu and other metal ions across storage years, the relationship between the viscosity of the liquor body and storage time, and the relationship between trace molecules with conjugated unsaturated double bonds and storage time. Qin Renwei proposed using the relationship between the carbon-14 decay rate and the storage time of aged baijiu to determine its production year. These studies offer a variety of schemes for identifying aged baijiu, but they either require specialized large instruments or involve complicated analysis steps and long analysis times.
In addition, baijiu is a complex system whose volatile components are affected by many factors, so aging information is often buried in a noisy background. The infrared, fluorescence, Raman and electrochemical methods currently in common use for identifying baijiu storage years judge the age from a data set covering the volatile components as a whole, so noise interference is difficult to eliminate and detection accuracy is low. There is therefore still considerable room for research on rigorous statistical analysis and on quantitative association with specific marker compounds. As the aged baijiu market keeps growing and the number of samples to be tested keeps increasing, developing a simple, rapid and accurate detection and identification technique has become an urgent need.
Disclosure of Invention
The invention aims to provide a method for predicting the storage years of baijiu that achieves rapid and accurate prediction of the storage years.
To achieve this aim, the invention adopts the following technical scheme. The method for predicting the storage years of baijiu comprises the following steps:
step 1, acquiring volatile flavor component fingerprints of baijiu with different storage times by GC-MS;
step 2, screening modeling features from the fingerprints by extremely randomized trees (extra-trees) regression and sklearn feature selection;
step 3, establishing a prediction model by using the modeling features as the features of an XGBoost regression model;
and step 4, predicting the storage years of the baijiu with the prediction model.
Further, in step 1, the specific method for acquiring the volatile flavor component fingerprints of baijiu with different storage times by GC-MS comprises the following steps:
step 101, taking base baijiu with different storage times as samples, reducing the alcohol content of each sample to below a set value with ultrapure water, and adding sodium chloride and an internal standard to obtain the sample to be tested;
step 102, extracting volatile compounds from the headspace of the sample to be tested with an extraction fiber by headspace solid-phase microextraction;
and step 103, after the extraction fiber is desorbed at the injection port, acquiring volatile component fingerprint information by GC-MS (gas chromatography-mass spectrometry), and compiling the corresponding data to obtain the volatile flavor component fingerprints of baijiu with different storage times.
Further, in step 101, the specific method for obtaining the sample to be tested comprises:
taking baijiu with different aging times as samples, reducing the alcohol content of each sample to 5-10% vol, placing 4-8 mL of the diluted sample in a sample vial, adding sodium chloride at 0.2 g/mL until the solution is saturated, and adding 10 μL of an internal standard to obtain the sample to be tested; wherein the internal standard is tert-amyl alcohol at a concentration of 8.05 g/L.
Further, in step 102, the parameters of the headspace solid-phase microextraction are as follows: equilibration at 40-60 ℃ for 1-25 min, and extraction for 5-180 min.
Further, in step 103, the GC analysis conditions are: a 60 m × 0.25 mm × 0.50 μm TG-WAXMS capillary gas chromatography column is used; the carrier gas is high-purity helium at a flow rate of 1.0 mL/min; the split ratio is 20:1; the temperature program is: hold at 50 ℃ for 2 min, raise to 145 ℃ at 3 ℃/min, then raise to 230 ℃ at 15 ℃/min and hold for 3 min; and the injection port temperature is 250 ℃.
Further, in step 103, the MS analysis conditions are: transfer line temperature 200 ℃; ion source temperature 260 ℃; scan range m/z 33-350 amu; ionization mode EI+; electron energy 70 eV.
Further, in step 2, the specific method for screening modeling features from the fingerprints by extremely randomized trees regression and sklearn feature selection comprises the following steps:
step 201, dividing the compiled data into a training set and a test set at a set ratio;
step 202, training an extremely randomized trees regression model on the training set and screening the top N1 to N2 features ranked by their contribution to the regression analysis of the baijiu storage years, wherein N1 and N2 are positive integers and N1 < N2;
step 203, screening the top N1 to N2 features most relevant to the storage years of the baijiu by using f_regression and mutual_info_regression in the sklearn feature selection module;
and step 204, taking the intersection of the features screened in step 202 and step 203 as the modeling features.
Further, in step 3, the model evaluation index of the prediction model is R², and the effective features are the features common to the top N3 features screened in steps 202 and 203, wherein N3 is a positive integer and N1 < N3 < N2.
Further, in step 4, the specific method for predicting the storage years of the baijiu with the prediction model comprises the following steps:
performing Venn analysis on the top N3 features screened in steps 202 and 203, establishing the prediction model with the N4 shared features as the modeling features, and applying the prediction model to the test set for prediction, wherein N4 is a positive integer and N4 < N3.
Further, the modeling features include ethyl elaeate, ethyl linoleate, undecanol, 2-phenylethyl acetate, 1-methylene-1H-indene, butyric acid, ethyl 3-hexenoate, hexanoic acid, isobutyraldehyde, ethyl pentadecanoate, diethyl succinate, 3-methylbutyl heptanoate, ethyl hexadecanoate, phytone, ethyl 9-hexadecenoate, octyl octanoate, ethyl tridecanoate, ethyl L(-)-lactate, 2-phenylethyl hexanoate, octyl 3-methylbutyrate, ethyl trans-4-decenoate, heptanoic acid, furfural, 2,4-di-tert-butylphenol, butyl valerate, 2-pentadecanone, n-propyl acetate, octyl butyrate, 1-methylhexyl hexanoate, ethyl undecanoate, ethyl tetradecanoate, and 3-methylbutyl octanoate.
The method has simple processing steps and is convenient to operate, which makes it suitable for processing and screening large numbers of samples. Gas chromatography-mass spectrometry (GC-MS) is a stable and mature technique with high analytical precision, small error between samples, high repeatability, reliable results and high analysis throughput. Extremely randomized trees regression and sklearn feature selection are used to select effective modeling features, which compresses the dimensionality of the feature space and reliably improves modeling quality. The XGBoost algorithm tolerates missing values, supports multi-threaded computation and uses built-in regularization to limit overfitting, and it can markedly improve the accuracy of identifying the storage years of baijiu.
Drawings
FIG. 1 is a flow chart of the method for predicting the storage years of baijiu according to the invention.
FIG. 2 is an example Venn analysis of the 59 features of the invention.
FIG. 3 is a schematic diagram of an embodiment in which the prediction model is applied to the test set for prediction.
FIG. 4 is a schematic diagram comparing model accuracy on the training set before and after feature screening.
FIG. 5 is a schematic diagram of the classification confusion matrices on the test set before and after feature screening.
Detailed Description
The method for predicting the storage years of baijiu disclosed by the invention is shown in FIG. 1 and comprises the following steps:
step S1, acquiring volatile flavor component fingerprints of baijiu with different storage times by GC-MS;
step S2, screening modeling features from the fingerprints by extremely randomized trees (extra-trees) regression and sklearn feature selection;
step S3, establishing a prediction model by using the modeling features as the features of an XGBoost regression model;
and step S4, predicting the storage years of the baijiu with the prediction model.
In step S1, the specific method for acquiring the volatile flavor component fingerprints of baijiu with different storage times by GC-MS comprises the following steps:
step 101, taking base baijiu with different storage times as samples, reducing the alcohol content of each sample to below a set value with ultrapure water, and adding sodium chloride and an internal standard to obtain the sample to be tested;
step 102, extracting volatile compounds from the headspace of the sample to be tested with an extraction fiber by headspace solid-phase microextraction;
and step 103, after the extraction fiber is desorbed at the injection port, acquiring volatile component fingerprint information by GC-MS (gas chromatography-mass spectrometry), and compiling the corresponding data to obtain the volatile flavor component fingerprints of baijiu with different storage times.
In step 101, the specific method for obtaining the sample to be tested comprises:
taking baijiu with different aging times as samples, reducing the alcohol content of each sample to 5-10% vol, placing 4-8 mL of the diluted sample in a sample vial, adding sodium chloride at 0.2 g/mL until the solution is saturated, and adding 10 μL of an internal standard to obtain the sample to be tested; wherein the internal standard is tert-amyl alcohol at a concentration of 8.05 g/L. Tert-amyl alcohol is chemically stable, so the result is not easily affected by side reactions, and it does not change during storage; its physicochemical properties are similar to those of the volatile components in baijiu, which makes it a suitable internal standard and reduces error.
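The quantities in step 101 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the dilution uses the simple relation C1·V1 = C2·V2 (an assumption that ignores the volume contraction of ethanol-water mixtures), and the 52% vol starting strength is borrowed from Example 1.

```python
def dilution_volumes(c_initial, c_target, v_final_ml):
    """Return (baijiu mL, water mL) for a simple C1*V1 = C2*V2 dilution.

    Assumption: volumes are additive, which is only approximately true
    for ethanol-water mixtures.
    """
    if not 0 < c_target <= c_initial:
        raise ValueError("target strength must be positive and not exceed the initial strength")
    v_baijiu = c_target * v_final_ml / c_initial
    return v_baijiu, v_final_ml - v_baijiu


def internal_standard_mass_ug(volume_ul, conc_g_per_l):
    """Mass of internal standard in micrograms (1 g/L equals 1 ug/uL)."""
    return volume_ul * conc_g_per_l


def nacl_mass_g(sample_ml, dose_g_per_ml=0.2):
    """Mass of sodium chloride for the stated 0.2 g/mL dose."""
    return sample_ml * dose_g_per_ml


# 52% vol base baijiu diluted to 10% vol in an 8 mL vial portion
v_baijiu, v_water = dilution_volumes(52.0, 10.0, 8.0)
# 10 uL of tert-amyl alcohol solution at 8.05 g/L
is_mass = internal_standard_mass_ug(10.0, 8.05)
salt = nacl_mass_g(8.0)
print(f"{v_baijiu:.2f} mL baijiu + {v_water:.2f} mL water, "
      f"{is_mass:.1f} ug internal standard, {salt:.1f} g NaCl")
```

Under these assumptions, each vial receives about 80.5 μg of tert-amyl alcohol, which is the reference amount against which peak areas can be normalized.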
In step 102, the parameters of the headspace solid-phase microextraction are as follows: equilibration at 40-60 ℃ for 1-25 min, and extraction for 5-180 min.
In step 103, the GC analysis conditions are: a 60 m × 0.25 mm × 0.50 μm TG-WAXMS capillary gas chromatography column is used; the carrier gas is high-purity helium at a flow rate of 1.0 mL/min; the split ratio is 20:1; the temperature program is: hold at 50 ℃ for 2 min, raise to 145 ℃ at 3 ℃/min, then raise to 230 ℃ at 15 ℃/min and hold for 3 min; and the injection port temperature is 250 ℃.
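The temperature program above fixes the chromatographic run time per injection, which matters when estimating throughput for large sample sets. The following sketch simply adds up the hold and ramp segments; the function name and the tuple layout are hypothetical choices made for illustration.

```python
def oven_program_minutes(initial_temp, initial_hold, ramps):
    """Total run time of a GC oven program.

    ramps is a list of (rate in deg C per min, target temp in deg C, hold in min).
    """
    total = initial_hold
    temp = initial_temp
    for rate, target, hold in ramps:
        total += (target - temp) / rate + hold  # ramp duration plus hold at target
        temp = target
    return total


# 50 C for 2 min, 3 C/min to 145 C (no hold), 15 C/min to 230 C, hold 3 min
run_time = oven_program_minutes(50, 2, [(3, 145, 0), (15, 230, 3)])
print(f"oven program: {run_time:.1f} min per injection")
```

The program works out to roughly 42.3 min per injection, excluding equilibration, extraction and oven cool-down.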
In step 103, the MS analysis conditions are: transfer line temperature 200 ℃; ion source temperature 260 ℃; scan range m/z 33-350 amu; ionization mode EI+; electron energy 70 eV.
In step S2, the specific method for screening modeling features from the fingerprints by extremely randomized trees regression and sklearn feature selection comprises:
step 201, dividing the compiled data into a training set and a test set at a set ratio;
step 202, training an extremely randomized trees regression model on the training set and screening the top N1 to N2 features ranked by their contribution to the regression analysis of the baijiu storage years, wherein N1 and N2 are positive integers and N1 < N2;
step 203, screening the top N1 to N2 features most relevant to the storage years of the baijiu by using f_regression and mutual_info_regression in the sklearn feature selection module;
and step 204, taking the intersection of the features screened in step 202 and step 203 as the modeling features.
The feature importance ranking obtained with the extremely randomized trees regression algorithm is shown in Table 2;
the feature importance ranking obtained with f_regression in the sklearn feature selection module is shown in Table 3;
the feature importance ranking obtained with mutual_info_regression in the sklearn feature selection module is shown in Table 4.
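Steps 202 to 204 can be sketched with scikit-learn as follows. This is a minimal illustration, not the inventors' code: the data are synthetic (70 rows mimicking 7 batches at 10 storage times, with two artificial columns made to track storage time), and the names `screen_features`, `top_k` and `compound_i`, as well as the choices `k=8` and `n_estimators=200`, are assumptions. In practice `X` would be the training-set peak table.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import f_regression, mutual_info_regression


def top_k(scores, names, k):
    """Names of the k highest-scoring features."""
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]


def screen_features(X, y, names, k, seed=0):
    """Rank features three ways and keep those present in every top-k list."""
    forest = ExtraTreesRegressor(n_estimators=200, random_state=seed).fit(X, y)
    f_scores, _ = f_regression(X, y)  # returns (F statistics, p-values)
    mi_scores = mutual_info_regression(X, y, random_state=seed)
    rankings = {
        "extra_trees": top_k(forest.feature_importances_, names, k),
        "f_regression": top_k(f_scores, names, k),
        "mutual_info_regression": top_k(mi_scores, names, k),
    }
    shared = set.intersection(*(set(r) for r in rankings.values()))
    return rankings, sorted(shared)


# Synthetic stand-in for the training data (NOT measured values)
rng = np.random.default_rng(0)
months = np.repeat([0, 2, 4, 6, 9, 12, 15, 17, 21, 24], 7).astype(float)
X = rng.normal(size=(months.size, 30))
X[:, 0] = months + rng.normal(scale=0.5, size=months.size)   # rises with storage
X[:, 1] = -months + rng.normal(scale=0.5, size=months.size)  # falls with storage
names = [f"compound_{i}" for i in range(30)]

rankings, modeling_features = screen_features(X, months, names, k=8)
print(modeling_features)
```

Taking the intersection rather than the union keeps only features that a tree-based importance, a linear F-test and a nonlinear mutual-information score all agree on, which is one plausible reading of why the method combines them.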
In step S3, the model evaluation index of the prediction model is R², and the effective features are the features common to the top N3 features screened in steps 202 and 203, wherein N3 is a positive integer and N1 < N3 < N2.
In step S4, the specific method for predicting the storage years of the baijiu with the prediction model comprises:
performing Venn analysis on the top N3 features screened in steps 202 and 203, establishing the prediction model with the N4 shared features as the modeling features, and applying the prediction model to the test set for prediction, wherein N4 is a positive integer and N4 < N3.
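The Venn analysis amounts to partitioning three top-N3 lists into the seven regions of a three-set diagram and keeping the central region as the N4 shared features. A minimal sketch with plain Python sets follows; the function name `venn_regions` and the feature labels `f1` to `f6` are hypothetical placeholders.

```python
def venn_regions(a, b, c):
    """Split three feature lists into the seven regions of a Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


# Placeholder top-N3 lists from the three rankings
extra_trees_top = ["f1", "f2", "f3", "f4"]
f_regression_top = ["f2", "f3", "f4", "f5"]
mutual_info_top = ["f3", "f4", "f6", "f1"]

regions = venn_regions(extra_trees_top, f_regression_top, mutual_info_top)
shared_features = sorted(regions["all_three"])  # the N4 modeling features
print(shared_features)
```

The region sizes always sum to the size of the union of the three lists, which gives a quick consistency check on a drawn diagram such as FIG. 2.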
The modeling features include ethyl elaeate, ethyl linoleate, undecanol, 2-phenylethyl acetate, 1-methylene-1H-indene, butyric acid, ethyl 3-hexenoate, hexanoic acid, isobutyraldehyde, ethyl pentadecanoate, diethyl succinate, 3-methylbutyl heptanoate, ethyl hexadecanoate, phytone, ethyl 9-hexadecenoate, octyl octanoate, ethyl tridecanoate, ethyl L(-)-lactate, 2-phenylethyl hexanoate, octyl 3-methylbutyrate, ethyl trans-4-decenoate, heptanoic acid, furfural, 2,4-di-tert-butylphenol, butyl valerate, 2-pentadecanone, n-propyl acetate, octyl butyrate, 1-methylhexyl hexanoate, ethyl undecanoate, ethyl tetradecanoate, and 3-methylbutyl octanoate. When the model is built with these 32 compounds, the R² between the predicted and actual values on the test set can reach 0.987.
Example 1
The method for predicting the storage years of baijiu comprises the following steps:
A. preparing baijiu samples: reducing the alcohol content of 7 production batches of strong-aroma (Luzhou-flavor) base baijiu to 52% vol, filtering, dividing each batch into 10 sample bottles, and storing them for 0, 2, 4, 6, 9, 12, 15, 17, 21 and 24 months respectively, giving 70 sample points in total for the 7 batches at different storage times;
B. preparing extraction samples: taking the base baijiu with different storage times as samples, reducing the alcohol content to below 10% vol with ultrapure water, and adding sodium chloride and an internal standard to obtain the samples to be tested;
C. extracting volatile compounds: extracting volatile compounds from the headspace of the samples obtained in step B by headspace solid-phase microextraction;
D. collecting fingerprints: after the extraction fiber is desorbed at the injection port, acquiring volatile component fingerprint information by GC-MS (gas chromatography-mass spectrometry) and compiling the corresponding data;
Gas chromatography (GC) conditions:
a 60 m × 0.25 mm × 0.50 μm TG-WAXMS capillary gas chromatography column is used; the carrier gas is high-purity helium at a flow rate of 1.0 mL/min; the split ratio is 20:1; the temperature program is: hold at 50 ℃ for 2 min, raise to 145 ℃ at 3 ℃/min, then raise to 230 ℃ at 15 ℃/min and hold for 3 min; and the injection port temperature is 250 ℃.
Mass spectrometry (MS) conditions:
transfer line temperature 200 ℃; ion source temperature 260 ℃; scan range m/z 33-350 amu; ionization mode EI+; electron energy 70 eV.
E. dividing the data set into a training set and a test set at a ratio of 8:2;
F. on the training set, using the extremely randomized trees regression algorithm to collect the top 25 to 80 features ranked by contribution to the regression analysis of the baijiu storage years (first feature screening method); using f_regression and mutual_info_regression in the sklearn feature selection module to screen the top 25 to 80 important features most relevant to the storage years of the baijiu (second feature screening method); taking the intersection features obtained by the two feature screening methods as the features of the XGBoost regression model to build the model, with R² as the model evaluation index; the final most effective modeling features are the features common to the top 59 features of the three screening algorithms;
wherein, for the intersection of the top 25 to 80 important features obtained by the two feature screening methods, the data are divided into a training set and a test set at a ratio of 8:2, and the results of 10-fold cross-validation on the training set are shown in Table 5.
G. performing Venn analysis on the top 59 features from the feature screening methods in step F, modeling with their intersection as the modeling features, and applying the model to the test set for prediction. The Venn diagram is shown in FIG. 2; the features come from those selected by extra-trees (extremely randomized trees) regression and by f_regression and mutual_info_regression in the sklearn feature selection module.
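The 8:2 split and the 10-fold cross-validation in steps E and F can be sketched with the standard library alone. The helper names, the fixed seed and the plain random shuffle are assumptions; the source does not say whether the split was random, stratified by storage time, or grouped by batch.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


# 70 sample points as in Example 1: 56 for training, 14 for testing
train_idx, test_idx = train_test_split_indices(70)
cv_folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(test_idx), [len(val) for _, val in cv_folds])
```

With 56 training samples, each validation fold holds only 5 or 6 samples, so fold-to-fold scores can vary noticeably and are best read as an average.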
FIG. 3 shows the model that performed better on the training set applied to the test set for prediction. In FIG. 3, the abscissa is the true value and the ordinate is the predicted value. MSE is the mean squared error, and the smaller its value, the better the fit. A fitted line of y = x means the predicted and actual values are identical, and the closer the fitted line is to y = x, the better the fit.
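For reference, the two quantities used to judge the fit can be computed directly from their definitions. The sketch below uses made-up storage times in months purely to show the arithmetic; the values are not taken from the experiments.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (months of storage)
y_true = [0, 6, 12, 24]
y_pred = [1, 6, 11, 24]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.3f}, R2 = {r2:.4f}")
```

When every prediction equals the true value, all points lie on y = x, the MSE is 0 and R² is 1.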
FIG. 4 compares the accuracy on the training set before and after feature screening. The accuracy obtained with the optimized features is much higher than that obtained with the full feature set.
After screening by this method, the most effective modeling features are 32 compounds such as ethyl elaeate, ethyl formate and butyl caproate (see Table 1); using the model, the R² between the predicted results and the actual values on the test set can reach 0.987.
Table 1. Common compounds screened by the two screening methods
Table 2. Feature importance ranking obtained with the extremely randomized trees regression algorithm
Table 3. Feature importance ranking obtained with f_regression in the sklearn feature selection module
Table 4. Feature importance ranking obtained with mutual_info_regression in the sklearn feature selection module
Table 5. Results of 10-fold cross-validation on the training set for the intersection of the top 25 to 80 important features, after dividing the data into a training set and a test set at 8:2
Example 2
The method for predicting the storage years of baijiu comprises the following steps:
A. taking 5 brands of strong-aroma bottled baijiu and dividing them into 4 groups according to the factory labels: 0-1 year, 1-2 years, 2-3 years and 3-4 years, with 1 sample per brand in each group and 20 samples in total; each sample is measured 6 times in parallel. The rest of the analysis is consistent with Example 1;
B-D. the method for obtaining the fingerprints of the volatile flavor substances of the baijiu is consistent with steps B to D of Example 1;
E. dividing the data set into a training set and a test set at a ratio of 8:2;
F. on the training set, using an extremely randomized trees classification model to collect the top 25 to 80 features ranked by contribution to the classification analysis of the baijiu storage years (first feature screening method); using f_classif and mutual_info_classif in the sklearn feature selection module to screen the top 25 to 80 features most relevant to the classification of the baijiu storage years (second feature screening method); taking the intersection features obtained by the two feature screening methods as the features of the XGBoost classification model to build the model, with accuracy as the model evaluation index.
FIG. 5 shows the classification confusion matrices on the test set before and after feature screening, where panel a is the full-feature model and panel b is the optimized-feature model. The number in each cell is the count of samples with that combination of true and predicted class; for example, the cell at (1, 1) counts samples whose true and predicted classes are both 1, and the cell at (2, 2) counts samples whose true and predicted classes are both 2. Cells on the diagonal therefore count samples whose predicted class matches the actual class. The diagonal counts in b are larger than those in a, which indicates that the optimized-feature model classifies better than the full-feature model.
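The way a confusion matrix is read, and how accuracy follows from its diagonal, can be shown with a small sketch. The labels 1 to 4 stand for the four storage-year groups; the true and predicted labels below are invented for illustration and are not the data behind FIG. 5.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal (predicted class equals true class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Invented labels for the four storage-year groups
labels = [1, 2, 3, 4]
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(cm)
for row in cm:
    print(row)
print(f"accuracy = {acc:.2f}")
```

Off-diagonal entries next to the diagonal, as in this example, correspond to samples assigned to a neighboring storage-year group.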
After feature screening, the accuracy of the classification model is greatly improved, and its classification performance on the test set is clearly better than that of the classification model built without feature screening.
In conclusion, the invention achieves rapid and accurate prediction of the storage years of baijiu.