Model training method and device, behavior prediction method and device, equipment and medium
1. A model training method, comprising:
constructing time sequence data sequence samples of a plurality of users, wherein each time sequence data sequence sample comprises a plurality of samples under continuous time points, and each sample comprises at least one preset type of data of one user under the current time point;
extracting data of the time sequence data sequence sample by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
and training a behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point.
2. The method of claim 1, wherein training a behavior prediction model based on window data for each of the sliding time points corresponding to a sliding time window comprises:
carrying out data derivation processing on the window data of the sliding time window corresponding to each sliding time point to obtain various types of derived data;
training a behavior prediction model based on the window data and the plurality of types of derivative data.
3. The method according to claim 1, wherein extracting data from the time sequence data sequence samples by using the preset sliding time window to obtain the window data of the sliding time window corresponding to each sliding time point comprises:
searching for a second time point of the plurality of consecutive time points;
and sliding the sliding time window from the second time point to the last time point to extract data, so as to obtain window data of the sliding time window corresponding to each sliding time point.
4. The method of claim 1, wherein the types of samples include default samples and normal samples, the method further comprising:
when default samples appear in the sliding time window, determining default users corresponding to the default samples;
removing all samples of the default user from the time sequence data sequence samples.
5. The method of claim 1, wherein the method further comprises:
acquiring a current sliding time point and a time point corresponding to a current sliding time window;
and rejecting samples in the sliding time window whose effective time is earlier than the current sliding time point of the sliding time window.
6. The method according to any of claims 1 to 5, wherein the time length of the sliding time window comprises at least two consecutive time points.
7. The method of any one of claims 1 to 5, wherein the at least one predetermined type of data is loan-related data.
8. The method according to any one of claims 1 to 5, wherein before training the behavior prediction model based on the window data of each sliding time point corresponding to the sliding time window, the method comprises:
calculating stability indexes of each type of data in the time sequence data sequence samples under different time windows;
judging whether the stability index of each type of data is larger than a preset threshold value or not;
and if the stability index of one type of data is larger than the preset threshold value, removing that type of data.
9. The method according to claim 2, wherein the deriving the data of the window data of the sliding time window corresponding to each sliding time point to obtain multiple types of derived data includes:
counting the number of types of data in all samples in the window data;
calculating an entropy value for each type of data;
and performing derivative processing on the data in each sample according to the entropy value of each type of data.
10. A method of behavioral prediction, comprising:
acquiring a time sequence data sequence of a user to be predicted, wherein the time sequence data sequence comprises at least one preset type of data under a plurality of continuous time points;
extracting data of the time sequence data sequence by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
performing behavior prediction on the user to be predicted based on window data of a sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model to obtain a prediction result related to the data of at least one preset type;
wherein the pre-trained behavior prediction model is obtained by training according to the model training method of any one of claims 1 to 9.
11. A model training apparatus comprising:
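a construction module, used for constructing time sequence data sequence samples of a plurality of users, wherein each time sequence data sequence sample comprises a plurality of samples under continuous time points, and each sample comprises at least one preset type of data of one user under the current time point;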
the first extraction module is used for extracting data of the time sequence data sequence sample by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
and the training module is used for training a behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point.
12. A behavior prediction device, comprising:
an acquisition module, used for acquiring a time sequence data sequence of a user to be predicted, wherein the time sequence data sequence comprises at least one preset type of data under a plurality of continuous time points;
the second extraction module is used for extracting data of the time sequence data sequence by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
the prediction module is used for carrying out behavior prediction on the user to be predicted based on window data of a sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model to obtain a prediction result related to the data of at least one preset type;
wherein the pre-trained behavior prediction model is trained by the model training apparatus according to claim 11.
13. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-9 or perform the method of claim 10.
14. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 9 or to perform the method of claim 10.
Background
"Small and micro enterprises" is a collective term for small enterprises, micro enterprises, family-workshop enterprises, and individual industrial and commercial businesses. In recent years, small and micro enterprises in China have maintained rapid growth. However, most of them have unstable operating conditions and opaque finances, their internal supervision mechanisms have serious loopholes, and some do not even have a fixed place of business.
At present, credit management for small and micro enterprises relies on customer managers who manually perform early warning, intervention, and follow-up collection. This approach depends heavily on human judgment, requires a large staff, places high demands on the customer manager, involves many uncontrollable factors, is prone to misjudgment, and carries high operational risk.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a model training method and apparatus, a behavior prediction method and apparatus, a device, a medium, and a program product, which can predict a default behavior of a user.
According to a first aspect of the present disclosure, there is provided a model training method, comprising:
constructing time sequence data sequence samples of a plurality of users, wherein each time sequence data sequence sample comprises a plurality of samples under continuous time points, and each sample comprises at least one preset type of data of one user under the current time point;
extracting data of the time sequence data sequence sample by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
and training a behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point.
In an embodiment, the training the behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point includes:
carrying out data derivation processing on the window data of the sliding time window corresponding to each sliding time point to obtain various types of derived data;
training a behavior prediction model based on the window data and the plurality of types of derivative data.
In an embodiment, extracting data from the time sequence data sequence samples by using the preset sliding time window to obtain the window data of the sliding time window corresponding to each sliding time point includes:
searching for a second time point of the plurality of consecutive time points;
and sliding the sliding time window from the second time point to the last time point to extract data, so as to obtain window data of the sliding time window corresponding to each sliding time point.
In an embodiment, the types of samples include default samples and normal samples, the method further comprising:
when default samples appear in the sliding time window, determining default users corresponding to the default samples;
removing all samples of the default user from the time sequence data sequence samples.
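The culling step above can be sketched as follows. This is only an illustration: the dictionary layout and the field names `user`, `time`, and `is_default` are assumptions introduced here, not part of the disclosure.

```python
def remove_default_users(all_samples, window_samples):
    """Drop every sample belonging to a user who defaults inside the window."""
    # Users with at least one default sample in the current window.
    default_users = {s["user"] for s in window_samples if s["is_default"]}
    # Remove all samples of those users, not only the default samples.
    return [s for s in all_samples if s["user"] not in default_users]


# Hypothetical toy data: u1 defaults in 2018-06, u2 never defaults.
samples = [
    {"user": "u1", "time": "2018-05", "is_default": False},
    {"user": "u1", "time": "2018-06", "is_default": True},
    {"user": "u2", "time": "2018-05", "is_default": False},
    {"user": "u2", "time": "2018-06", "is_default": False},
]
window = [s for s in samples if s["time"] in ("2018-05", "2018-06")]
kept = remove_default_users(samples, window)
```

In this sketch the normal sample of u1 from 2018-05 is removed as well, because the step removes all samples of a default user.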
In an embodiment, the method further comprises:
acquiring a current sliding time point and a time point corresponding to a current sliding time window;
and rejecting samples in the sliding time window whose effective time is earlier than the current sliding time point of the sliding time window.
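One possible reading of this step is sketched below. It assumes that each sample carries a hypothetical `effective_until` field holding the last time point at which the sample is still valid, and that time points are written as "YYYY-MM" strings; neither detail is specified in the disclosure.

```python
def drop_expired(window_samples, current_point):
    """Keep samples whose effective time is not earlier than the current sliding time point."""
    # "YYYY-MM" strings compare correctly in lexicographic order.
    return [s for s in window_samples if s["effective_until"] >= current_point]


# Hypothetical window contents for the sliding time point 2018-08.
window = [
    {"user": "u1", "effective_until": "2018-07"},
    {"user": "u2", "effective_until": "2018-09"},
    {"user": "u3", "effective_until": "2018-08"},
]
valid = drop_expired(window, "2018-08")
```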
In an embodiment, the time length of the sliding time window comprises at least two consecutive time points.
In one embodiment, the at least one predetermined type of data is loan-related data.
In an embodiment, before training the behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point, the method includes:
calculating stability indexes of each type of data in the time sequence data sequence samples under different time windows;
judging whether the stability index of each type of data is larger than a preset threshold value or not;
and if the stability index of one type of data is larger than the preset threshold value, removing that type of data.
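The disclosure does not name a specific stability index. As a sketch only, the following uses the population stability index (PSI), a measure commonly used in credit modeling, as a stand-in; the binned proportions, the feature names, and the threshold of 0.25 are all illustrative assumptions.

```python
import math


def psi(expected, actual):
    """Population stability index between two binned distributions (proportions)."""
    # Every proportion must be positive, otherwise the logarithm is undefined.
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def keep_stable(features, threshold=0.25):
    """features: name -> (baseline proportions, comparison-window proportions)."""
    # A data type is removed when its index is larger than the threshold.
    return {name for name, (e, a) in features.items() if psi(e, a) <= threshold}


# Hypothetical per-bin proportions of two data types in two time windows.
features = {
    "balance": ([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]),
    "turnover": ([0.25, 0.25, 0.25, 0.25], [0.70, 0.10, 0.10, 0.10]),
}
stable = keep_stable(features)
```

Here `balance` has identical distributions in both windows and is kept, while `turnover` shifts strongly toward the first bin and is removed.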
In an embodiment, the deriving the data of the window data of the sliding time window corresponding to each sliding time point to obtain multiple types of derived data includes:
counting the number of types of data in all samples in the window data;
calculating an entropy value for each type of data;
and performing derivative processing on the data in each sample according to the entropy value of each type of data.
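The counting and entropy steps can be sketched as follows, assuming Shannon entropy over the observed values of each data type. The field names are hypothetical, and the sketch stops at the entropy values because the disclosure does not detail how they drive the derivative processing.

```python
import math
from collections import Counter


def entropy_per_type(window_samples):
    """Shannon entropy (in bits) of the value distribution of each data type."""
    fields = {key for sample in window_samples for key in sample}
    result = {}
    for field in sorted(fields):
        # Count how often each value of this data type occurs in the window.
        counts = Counter(s[field] for s in window_samples if field in s)
        total = sum(counts.values())
        result[field] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return result


# Hypothetical window data with two character-type data types.
window = [
    {"industry": "retail", "region": "east"},
    {"industry": "retail", "region": "east"},
    {"industry": "catering", "region": "east"},
    {"industry": "catering", "region": "east"},
]
entropies = entropy_per_type(window)
```

A data type that takes a single value in the window has zero entropy, which suggests it carries no information for distinguishing samples in that window.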
A second aspect of the present disclosure provides a behavior prediction method, including:
acquiring a time sequence data sequence of a user to be predicted, wherein the time sequence data sequence comprises at least one preset type of data under a plurality of continuous time points;
extracting data of the time sequence data sequence by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
performing behavior prediction on the user to be predicted based on window data of a sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model to obtain a prediction result related to the data of at least one preset type;
wherein the pre-trained behavior prediction model is obtained by training according to the model training method of the first aspect.
In an embodiment, the performing behavior prediction on the user to be predicted based on the window data of the sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model, and obtaining a prediction result includes:
carrying out data derivation processing on the window data of the sliding time window corresponding to each sliding time point to obtain various types of derived data;
and inputting the various types of derivative data into the pre-trained behavior prediction model to obtain a prediction result.
In an embodiment, extracting data from the time sequence data sequence by using the preset dynamic sliding time window to obtain the window data of the sliding time window corresponding to each sliding time point includes:
searching for a second time point of the plurality of consecutive time points;
and sliding the sliding time window from the second time point to the last time point to extract data, so as to obtain window data of the sliding time window corresponding to each sliding time point.
In an embodiment, the method further comprises:
acquiring a current sliding time point and a time point corresponding to a current sliding time window;
and rejecting data in the sliding time window whose effective time is earlier than the current sliding time point of the sliding time window.
In one embodiment, the at least one predetermined type of data is loan-related data;
the prediction result related to the at least one preset type of data is the default result of the user to be predicted.
A third aspect of the present disclosure provides a model training apparatus, comprising:
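a construction module, used for constructing time sequence data sequence samples of a plurality of users, wherein each time sequence data sequence sample comprises a plurality of samples under continuous time points, and each sample comprises at least one preset type of data of one user under the current time point;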
the first extraction module is used for extracting data of the time sequence data sequence sample by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
and the training module is used for training a behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point.
In one embodiment, the training module comprises:
the data derivation submodule is used for carrying out data derivation processing on the window data of the sliding time window corresponding to each sliding time point to obtain various types of derived data;
and the training sub-module is used for training a behavior prediction model based on the window data and the multiple types of derivative data.
In one embodiment, the first extraction module comprises:
a search sub-module for searching for a second time point of said plurality of successive time points;
and the sliding submodule is used for performing data extraction by sliding the sliding time window from the second time point to the last time point to obtain window data of the sliding time window corresponding to each sliding time point.
In one embodiment, the types of samples include default samples and normal samples, the apparatus further comprising:
the determining module is used for determining default users corresponding to the default samples when the default samples appear in the sliding time window;
a first culling module for culling all samples of the default user from the time series data sequence samples.
In one embodiment, the apparatus further comprises:
the acquisition module is used for acquiring a current sliding time point and a time point corresponding to a current sliding time window;
and the second eliminating module is used for eliminating the samples of which the effective time of the samples in the sliding time window is less than the current sliding time point of the sliding time window.
In an embodiment, the time length of the sliding time window comprises at least two consecutive time points.
In one embodiment, the at least one predetermined type of data is loan-related data.
A fourth aspect of the present disclosure provides a behavior prediction apparatus comprising:
an acquisition module, used for acquiring a time sequence data sequence of a user to be predicted, wherein the time sequence data sequence comprises at least one preset type of data under a plurality of continuous time points;
the second extraction module is used for extracting data of the time sequence data sequence by adopting a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point;
the prediction module is used for carrying out behavior prediction on the user to be predicted based on window data of a sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model to obtain a prediction result related to the data of at least one preset type;
wherein the pre-trained behavior prediction model is obtained by training with the model training apparatus according to the third aspect.
A fifth aspect of the present disclosure provides an electronic device, comprising: one or more processors; memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to perform the model training method of the first aspect or the behavior prediction method of the second aspect.
The sixth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to perform the model training method of the first aspect or the behavior prediction method of the second aspect.
A seventh aspect of the present disclosure also provides a computer program product comprising a computer program that, when executed by a processor, implements the model training method of the first aspect or the behavior prediction method of the second aspect.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a model training method and apparatus, a behavior prediction method and apparatus, a device, a medium, and a program product according to embodiments of the disclosure;
FIG. 2 schematically illustrates a flow diagram of a model training method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a behavior prediction method according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a behavior prediction apparatus according to an embodiment of the present disclosure; and
FIG. 10 schematically illustrates a block diagram of an electronic device suitable for implementing a model training method or a behavior prediction method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
It should be noted that the model training method and apparatus, the behavior prediction method and apparatus, the device, the medium, and the program product provided by the present disclosure may be used in the financial field, and may also be used in any field other than the financial field. In the present disclosure, the financial field is taken as an example for illustrative purposes.
In the technical solution of the present disclosure, the acquisition, storage, and use of users' personal information comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
An embodiment of the present disclosure provides a model training method, including: constructing time sequence data sequence samples of a plurality of users, wherein each time sequence data sequence sample comprises samples at a plurality of continuous time points and each sample comprises at least one preset type of data of one user at the current time point; extracting data from the time sequence data sequence samples by using a preset dynamic sliding time window to obtain window data of the sliding time window corresponding to each sliding time point; and training a behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point.
An embodiment of the present disclosure further provides a behavior prediction method, including: acquiring a time sequence data sequence of a user to be predicted, wherein the time sequence data sequence comprises at least one preset type of data at a plurality of continuous time points; extracting data from the time sequence data sequence by using a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point; and predicting the behavior of the user to be predicted based on the window data of the sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model to obtain a prediction result related to the at least one preset type of data.
Fig. 1 schematically illustrates an application scenario diagram of a model training method and apparatus, a behavior prediction method and apparatus, a device, a medium, and a program product according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as shopping applications, financial applications, mailbox clients, and/or social platform software (examples only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the model training method and the behavior prediction method provided by the embodiments of the present disclosure may be generally executed by the server 105. Accordingly, the model training device and the behavior prediction device provided by the embodiments of the present disclosure may be generally disposed in the server 105. The model training method and the behavior prediction method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the model training device and the behavior prediction device provided by the embodiments of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the model training method and the behavior prediction method provided by the embodiment of the present disclosure may also be executed by the terminal device 101, 102, or 103, or may also be executed by another terminal device different from the terminal device 101, 102, or 103. Accordingly, the model training apparatus and the behavior prediction apparatus provided in the embodiments of the present disclosure may also be disposed in the terminal device 101, 102, or 103, or disposed in another terminal device different from the terminal device 101, 102, or 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
FIG. 2 schematically shows a flow diagram of a model training method according to an embodiment of the disclosure.
As shown in fig. 2, the method includes operations S201 to S203.
In operation S201, time sequence data sequence samples of a plurality of users are constructed, wherein each time sequence data sequence sample includes a plurality of samples at continuous time points, and each sample includes at least one preset type of data of one user at the current time point.
In the present disclosure, the user may be an individual-side loan user and/or a legal-person-side loan user. An individual-side loan user is an individual who runs a business, is the legal representative of a small or micro enterprise, and has successfully used a purely credit-based consumer loan or an unsecured business loan product. A legal-person-side loan user is a small or micro enterprise that has successfully used the fast-loan-in-business product.
In the present disclosure, for the individual-side loan user, the preset type of data may be at least one of personal basic information, personal asset and liability information, personal fund transaction information, personal credit investigation information, and personal loan information. For the legal-person-side loan user, the preset type of data may be at least one of basic enterprise information, enterprise fund flow information, enterprise transaction purpose information, and enterprise credit investigation information. Related data such as investment and financing relationships and guarantee relationships may also be included.
In the present disclosure, the interval between the plurality of continuous time points may be a week, a month, or a quarter. For example, for samples taken between May 2018 and April 2019 at monthly intervals, the plurality of continuous time points are May 2018, June 2018, July 2018, August 2018, September 2018, October 2018, November 2018, December 2018, January 2019, February 2019, March 2019, and April 2019.
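The monthly time points in this example can be generated programmatically. The sketch below is illustrative only; the function name and the "YYYY-MM" label format are assumptions made here.

```python
from datetime import date


def monthly_time_points(start: date, end: date) -> list:
    """Return consecutive month labels (YYYY-MM) from start to end, inclusive."""
    points = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        points.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            # Roll over from December to January of the next year.
            year, month = year + 1, 1
    return points


time_points = monthly_time_points(date(2018, 5, 1), date(2019, 4, 1))
```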
In operation S202, a preset sliding time window is used to perform data extraction on the time series data sequence sample, so as to obtain window data of the sliding time window corresponding to each sliding time point.
In the present disclosure, the number of sliding time points may be determined according to the number of time points in the plurality of continuous time points. For example, the two numbers may be the same, that is, one time point corresponds to one sliding time point. Alternatively, the two numbers may differ, in which case one sliding time point may correspond to two or three continuous time points. The present disclosure does not limit this.
In the present disclosure, the time lengths of the sliding time windows corresponding to different sliding time points may be the same or different. For example, the sliding time window may span 4 months at the first sliding time point, 5 months at the second sliding time point, and 5 months at the third sliding time point.
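A minimal sketch of the extraction in operation S202 follows. It assumes one sliding time point per time point, windows that end at the sliding time point and reach back at most `max_len` time points, and sliding that starts from the second time point so that every window covers at least two time points. The parameter `max_len` and the dictionary layout are hypothetical choices made for illustration.

```python
def extract_windows(samples, time_points, max_len=3):
    """samples: time point -> list of samples; returns sliding point -> window data."""
    windows = {}
    # Start at the second time point and slide to the last one.
    for i in range(1, len(time_points)):
        start = max(0, i + 1 - max_len)
        span = time_points[start:i + 1]
        # Window data is every sample whose time point falls inside the span.
        windows[time_points[i]] = [s for t in span for s in samples.get(t, [])]
    return windows


# Hypothetical toy data: one sample per month for a single user.
time_points = ["2018-05", "2018-06", "2018-07", "2018-08"]
samples = {t: [{"user": "u1", "time": t}] for t in time_points}
windows = extract_windows(samples, time_points, max_len=3)
```

With these inputs the window grows from two to three time points and then keeps its length, which is one way the window lengths at different sliding time points can differ.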
In operation S203, a behavior prediction model is trained based on window data of a sliding time window corresponding to each of the sliding time points.
In the present disclosure, the behavior prediction model may be any one of a logistic regression model, a random forest, a gradient boosting decision tree (GBDT), a decision tree, a multilayer perceptron, and the like.
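As an illustration of operation S203 with one of the listed model types, the sketch below fits a logistic regression by plain stochastic gradient descent using only the standard library. The two features, the toy data, and the hyperparameters are hypothetical; a practical implementation would more likely rely on an established machine learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss with respect to the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of the positive class (label 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical window features per sample: [overdue ratio, repayment ratio].
X = [[0.1, 1.0], [0.2, 0.9], [0.9, 0.1], [0.8, 0.2]]
y = [0, 0, 1, 1]  # 1 marks a default sample, 0 a normal sample
w, b = train_logistic_regression(X, y)
```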
In the present disclosure, after the window data of the sliding time window corresponding to each sliding time point is obtained, exploratory data analysis (EDA) may be performed on all the window data. The exploratory data analysis includes descriptive statistics and correlation analysis. For interval variables, the descriptive statistics include, for example, the minimum, maximum, mean, standard deviation, skewness, kurtosis, number of missing values, and the 5%, 10%, 25%, 50%, 75%, 90%, and 95% quantiles. For character-type variables, they include the number of levels, the frequency distribution, the distribution of the number of target users, and the number of missing values.
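A subset of the descriptive statistics for an interval variable can be computed with the Python standard library, as sketched below. Skewness and kurtosis are omitted for brevity, missing values are assumed to be encoded as `None`, and the function and key names are illustrative.

```python
import statistics


def describe_interval(values):
    """Summarize an interval variable; None marks a missing value."""
    observed = [v for v in values if v is not None]
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(observed, n=20, method="inclusive")
    pick = {"p5": 0, "p10": 1, "p25": 4, "p50": 9, "p75": 14, "p90": 17, "p95": 18}
    summary = {
        "min": min(observed),
        "max": max(observed),
        "mean": statistics.mean(observed),
        "std": statistics.stdev(observed),
        "missing": len(values) - len(observed),
    }
    summary.update({name: cuts[i] for name, i in pick.items()})
    return summary


stats = describe_interval([1, 2, None, 3, 4, 5])
```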
In the present disclosure, after the window data of the sliding time window corresponding to each sliding time point is obtained, data cleaning and conversion may be performed on the window data. The window data are adjusted according to their different characteristics, mainly by missing-value filling, abnormal value processing, variable conversion and the like.
Specifically, missing-value filling: the cause of the missing values is analyzed first, and the handling is decided according to the cause. There are basically three treatments: a) deleting the sample; b) deleting the variable; c) filling in the missing values. There are many filling methods, and each variable with missing values is analyzed to choose a suitable one. Common filling methods for discrete variables include mode filling and special value substitution (e.g., 'unknown', -999); common filling methods for interval variables include median filling, mean filling, and the like. In addition to these conventional methods, filling rules proposed by the business side are also very important. Some more complex filling methods also exist, such as hot-deck imputation, heuristic imputation, regression imputation, and the like.
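The conventional filling methods named above can be sketched as follows. This is a minimal example under stated assumptions: missing values are represented by `None`, and the helper names `fill_discrete` and `fill_interval` are hypothetical.

```python
from statistics import median, mode

def fill_discrete(values, special=None):
    """Fill missing entries of a discrete variable with the mode,
    or with a special value (e.g. 'unknown', -999) when one is given."""
    observed = [v for v in values if v is not None]
    fill = special if special is not None else mode(observed)
    return [fill if v is None else v for v in values]

def fill_interval(values):
    """Fill missing entries of an interval variable with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

filled_mode = fill_discrete(["a", "b", None, "a"])
filled_special = fill_discrete(["a", None], special="unknown")
filled_median = fill_interval([1.0, None, 3.0, 10.0])
```

Mean filling would follow the same pattern with `statistics.fmean` in place of `median`.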
Abnormal value processing: abnormal values are identified first by combining business knowledge with the results of the exploratory data analysis, and second by methods commonly used in statistics. The cause of each abnormal value is traced as far as possible, and the abnormal value is then handled accordingly to eliminate its influence on model training.
Variable conversion: variable conversion mainly includes mathematical transformation of variables and binning of variables. Mathematical transformation is needed because some models place requirements on the distribution of the data, which actual data often cannot meet, so the original variables are transformed appropriately to satisfy the model; commonly used transformations include logarithmic, exponential and reciprocal transformations. Common binning methods include optimal binning, quantile binning, bucket binning, weight of evidence (WOE) conversion, and the like.
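As a hedged sketch of the variable conversion described above, the following example applies a logarithmic transformation, an equal-frequency (quantile) binning, and a WOE conversion. The sign convention (share of normal samples divided by share of default samples) and the 0.5 smoothing term are assumptions of this example, since conventions vary; the label 1 denotes a default sample and 0 a normal sample.

```python
import math

def log_transform(values):
    """Logarithmic transformation log(1 + v), suitable for non-negative values."""
    return [math.log1p(v) for v in values]

def quantile_bins(values, n_bins):
    """Assign each value an equal-frequency bin index 0..n_bins-1 by rank.
    Note: equal values may land in different bins in this simple version."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def woe_by_bin(bins, labels):
    """WOE per bin = ln(share of normal samples / share of default samples)."""
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    woe = {}
    for b in sorted(set(bins)):
        bad = sum(1 for x, y in zip(bins, labels) if x == b and y == 1)
        good = sum(1 for x, y in zip(bins, labels) if x == b and y == 0)
        # 0.5 smoothing (an assumption) avoids log(0) for one-sided bins
        woe[b] = math.log(((good + 0.5) / total_good) / ((bad + 0.5) / total_bad))
    return woe

# Hypothetical variable and default labels.
values = [10, 200, 30, 4000, 50, 600, 70, 8000]
labels = [1, 0, 1, 0, 1, 0, 0, 0]
bins = quantile_bins(values, 4)
woe = woe_by_bin(bins, labels)
```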
In the present disclosure, in the model verification stage, cross-period verification may be adopted to ensure the generalization capability of the model. For example, when the samples used to train the model cover May 2018 to April 2019, the samples from May 2019 to October 2019 are used for cross-period verification, with the data of the first month removed.
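The cross-period split in the example above might be sketched as follows. The sketch assumes that each row carries a `month` field stored as a `(year, month)` tuple and that "the first month" refers to the first month of the verification period; both are assumptions of this example, and the helper names are hypothetical.

```python
def month_seq(start, count):
    """Return `count` consecutive (year, month) tuples beginning at `start`."""
    year, month = start
    base = year * 12 + (month - 1)
    return [((base + k) // 12, (base + k) % 12 + 1) for k in range(count)]

def cross_period_split(rows, train_range, valid_range):
    """Split rows by period; the first month of the verification range is dropped."""
    train = [r for r in rows if train_range[0] <= r["month"] <= train_range[1]]
    # strict '<' on the lower bound removes the first verification month
    valid = [r for r in rows if valid_range[0] < r["month"] <= valid_range[1]]
    return train, valid

# Hypothetical rows: one per month from May 2018 to October 2019.
rows = [{"month": m, "user": "u1"} for m in month_seq((2018, 5), 18)]
train, valid = cross_period_split(
    rows, ((2018, 5), (2019, 4)), ((2019, 5), (2019, 10))
)
```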
According to the embodiment of the present disclosure, time series data sequence samples of a plurality of users are constructed, where each time series data sequence sample includes samples at a plurality of consecutive time points, and each sample includes at least one preset type of data of one user at the current time point. Data extraction is performed on the time series data sequence samples by using a preset sliding time window to obtain window data of the sliding time window corresponding to each sliding time point, and a behavior prediction model is trained based on that window data. The trained behavior prediction model can then predict whether a user is likely to default.
In an embodiment of the present disclosure, the time length of the sliding time window covers at least two consecutive time points, which can improve the stability of the user loan default rate across the sliding time points of the sliding time window. For example, the time length may be set to 5 months, 6 months, 7 months, etc., which the present disclosure does not limit. Let T be the sliding time point of the sliding time window. Taking a time length of 6 months as an example, the sliding time window can be represented as (T-6, T), that is, the observation period of each sample is 6 months.
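As a minimal sketch of extracting the window data for one sliding time point T, the following example treats the observation period as the six months immediately before T. The boundary convention (months from T-6 up to, but not including, T), the `(year, month)` key format, and the names `month_index` and `window_data` are assumptions of this example.

```python
from typing import Dict, List, Tuple

Month = Tuple[int, int]  # (year, month)

def month_index(m: Month) -> int:
    """Map (year, month) to a running month count so that months can be subtracted."""
    year, month = m
    return year * 12 + (month - 1)

def window_data(samples: Dict[Month, dict], t: Month, length: int = 6) -> List[dict]:
    """Return the samples whose month lies in the `length` months before T."""
    t_idx = month_index(t)
    return [
        sample
        for month, sample in sorted(samples.items())
        if t_idx - length <= month_index(month) < t_idx
    ]

# Hypothetical single-user sequence: one sample per month, May 2018 to April 2019.
user_samples = {
    (2018 + (4 + k) // 12, (4 + k) % 12 + 1): {"month_no": k, "loan_balance": 1000 + 10 * k}
    for k in range(12)
}
obs = window_data(user_samples, t=(2019, 1), length=6)
```

With T set to January 2019, the window holds the six samples from July 2018 to December 2018.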
In an embodiment of the present disclosure, before operation S301, that is, before model training, the method further includes: calculating a stability index of each type of data in the time series data sequence samples under different time windows, determining whether the stability index of each type of data is greater than a preset threshold, and removing a type of data if its stability index is greater than the preset threshold. The different time windows may be a modeling window, a cross-period window, a scoring window, and the like. The type of data is, for example, the number of transactions in three months, the number of historical loans, and the like. The Population Stability Index (PSI) reflects the stability of the distribution of an index across its segments. Generally, a model or variable may be considered stable if the deviation between the future sample distribution and the historical sample distribution is within an acceptable range. In the modeling process, the training samples are generally used as the historical samples and the cross-period verification samples as the future samples, and the distribution change of each index between the two is evaluated.
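The source does not give a formula for the stability index. A commonly used definition of PSI sums (a_i - e_i) * ln(a_i / e_i) over segments, where e_i and a_i are the shares of historical and future samples in segment i. The sketch below assumes that definition, assumes pre-computed per-segment counts, and uses a hypothetical `eps` floor to avoid taking the logarithm of zero.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between a historical and a future distribution,
    given the number of samples that fall into each segment."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)  # floor avoids log(0) for empty segments
        a_share = max(a / a_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value

# Hypothetical segment counts: training samples vs. cross-period verification samples.
stable = psi([10, 20, 30], [10, 20, 30])
shifted = psi([50, 50], [80, 20])
```

A type of data whose PSI exceeds the preset threshold would then be removed before training; the threshold itself is a design choice not fixed here.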
FIG. 3 schematically shows a flow chart of a model training method according to an embodiment of the present disclosure.
As shown in FIG. 3, the method includes operations S301 to S304.
In operation S301, time series data sequence samples of a plurality of users are constructed, where each time series data sequence sample includes samples at a plurality of consecutive time points, and each sample includes at least one preset type of data of one user at the current time point.
In operation S302, a preset sliding time window is used to perform data extraction on the time series data sequence samples, so as to obtain window data of the sliding time window corresponding to each sliding time point.
In operation S303, data derivation is performed on the window data of the sliding time window corresponding to each sliding time point, so as to obtain multiple types of derived data.
In operation S304, a behavior prediction model is trained based on the window data and the plurality of types of derivative data.
In an embodiment of the present disclosure, the operation S303 specifically includes counting the number of types of data in all samples in the window data, calculating an entropy value of each type of data, and performing derivation processing on the data in each sample according to the entropy value of each type of data.
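The entropy calculation in operation S303 might be sketched as follows. The source does not specify the entropy formula or how the entropy value drives the derivation, so this example assumes Shannon entropy (base 2) over the value frequencies of each type of data and stops at computing the per-type entropy; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_by_type(window):
    """Entropy of each type of data across all samples in the window data."""
    types = sorted({key for sample in window for key in sample})
    return {t: entropy([s[t] for s in window if t in s]) for t in types}

# Hypothetical window data with two types of data.
window = [
    {"loan_type": "mortgage", "overdue": 0},
    {"loan_type": "mortgage", "overdue": 0},
    {"loan_type": "consumer", "overdue": 0},
    {"loan_type": "consumer", "overdue": 0},
]
entropies = entropy_by_type(window)
```

In this example `loan_type` carries one bit of entropy while `overdue` carries none, which could inform which types of data are worth deriving further.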
In the present disclosure, data derivation may be performed based on the business scenario. For example, for a business scenario concerning the repayment capability of a borrowing user, the following aspects may be considered: 1) the total amount, monthly average, change, etc. of assets, liabilities, deposits and loan amounts in different time windows; 2) the amount, number, extreme values, purpose, change and concentration of transactions of legal persons and individuals; 3) derivation and processing of credit investigation information of individuals and legal persons; 4) derivation of variables in the loan information such as type, purpose, approval status, disbursement status and repayment status; 5) derivation of information such as investment, financing and guarantee relationships, and the credit and assets of associated parties; 6) derivation of business registration information, etc.
According to the embodiment of the disclosure, the window data and the data derived based on the window data are used as training data together to train the behavior prediction model, so that the diversity of the training data can be increased, and the prediction accuracy of the behavior prediction model can be improved.
FIG. 4 schematically shows a flow chart of a model training method according to an embodiment of the present disclosure.
As shown in FIG. 4, the method includes operations S401 to S404.
In operation S401, time series data sequence samples of a plurality of users are constructed, where each time series data sequence sample includes samples at a plurality of consecutive time points, and each sample includes at least one preset type of data of one user at the current time point.
In operation S402, a second time point of the plurality of consecutive time points is searched.
In operation S403, data extraction is performed by sliding the sliding time window from the second time point to the last time point, and window data corresponding to the sliding time window at each sliding time point is obtained.
In operation S404, a behavior prediction model is trained based on window data of a sliding time window corresponding to each of the sliding time points.
In the present disclosure, for example, if samples from May 2018 to April 2019 are taken, the sliding time window needs to start sliding from June 2018, which avoids data crossover in, for example, credit investigation data and in-bank rating data. In one possible case, a user's first default appears in the first month's data, that is, the preset type of data is abnormal at T, so the user would have to be given an early warning at T-1. Removing the first-month users avoids such cases, which do not conform to the actual scenario, and ensures that the default users at the second time point are newly added default users.
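Sliding from the second time point to the last might be sketched as follows. This is an illustrative example only: it assumes that time points are sortable month strings, that the window at T holds up to `length` time points preceding T, and that early windows are simply shorter; the function name `slide_from_second` is hypothetical.

```python
def slide_from_second(time_points, length):
    """Return (T, time points in the window) for every sliding time point
    from the second time point to the last one."""
    ordered = sorted(time_points)
    result = []
    for i in range(1, len(ordered)):  # index 0, the first time point, is skipped
        start = max(0, i - length)
        result.append((ordered[i], ordered[start:i]))
    return result

# Hypothetical monthly time points.
points = ["2018-05", "2018-06", "2018-07", "2018-08"]
windows = slide_from_second(points, length=2)
```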
FIG. 5 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure, in which the types of samples include default samples and normal samples.
As shown in FIG. 5, the method includes operations S501 and S502 in addition to operations S201 to S203.
In operation S501, when a default sample occurs in the sliding time window, a default user corresponding to the default sample is determined.
In operation S502, all samples of the default user are removed from the time series data sequence samples.
In the present disclosure, in the context of loan transactions, a default sample refers to a loan that is not repaid within a specified term. The term may be 7 days, 15 days, and the like. Furthermore, default samples may be graded: loans that are not repaid within the specified term and are above a preset grade are default samples, and those below the preset grade are normal samples.
In the disclosure, as long as a default sample is detected in the current sliding time window, all samples of the default user corresponding to the default sample are removed from the time series data sequence sample, and it is ensured that no default sample appears in the next sliding time window.
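Operations S501 and S502 can be sketched as below, assuming each sample is a dictionary with hypothetical `user`, `month` and `is_default` fields and that the current window is given as a set of months.

```python
def reject_default_users(samples, window_months):
    """Find users with a default sample inside the current window and
    remove all of their samples from the data set."""
    default_users = {
        s["user"]
        for s in samples
        if s["month"] in window_months and s["is_default"]
    }
    kept = [s for s in samples if s["user"] not in default_users]
    return kept, default_users

# Hypothetical samples for two users.
samples = [
    {"user": "u1", "month": "2018-05", "is_default": False},
    {"user": "u1", "month": "2018-06", "is_default": True},
    {"user": "u1", "month": "2018-07", "is_default": False},
    {"user": "u2", "month": "2018-06", "is_default": False},
    {"user": "u2", "month": "2018-07", "is_default": False},
]
kept, removed = reject_default_users(samples, {"2018-05", "2018-06"})
```

The returned `kept` list would be the input for the next sliding time window.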
FIG. 6 schematically shows a flow chart of a model training method according to an embodiment of the present disclosure.
As shown in FIG. 6, the method includes operations S601 and S602 in addition to operations S201 to S203.
In operation S601, a current sliding time point and a time point corresponding to the current sliding time window are obtained.
In operation S602, samples within the sliding time window whose effective time is less than the current sliding time point of the sliding time window are rejected.
In the present disclosure, in the context of loan transactions, the effective time refers to the expiration time of the loan data. In one example, the sliding time window includes three loans whose expiration times are May 10, 2018, June 10, 2018 and June 28, 2018, and the sliding time point T of the sliding time window is June 2018 (taken as June 1, 2018). The loan expiring on May 10, 2018 is removed to ensure that the extracted loans are unexpired, so that the loan data cover the period during the loan and before the loan.
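The three-loan example above can be reproduced with the following sketch. The field names `loan_id` and `expires` and the function name `drop_expired` are hypothetical.

```python
from datetime import date

def drop_expired(loans, t):
    """Keep loans whose expiration date is not earlier than the sliding time point t."""
    return [loan for loan in loans if loan["expires"] >= t]

loans = [
    {"loan_id": "A", "expires": date(2018, 5, 10)},
    {"loan_id": "B", "expires": date(2018, 6, 10)},
    {"loan_id": "C", "expires": date(2018, 6, 28)},
]
unexpired = drop_expired(loans, date(2018, 6, 1))
```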
Fig. 7 schematically illustrates a flow chart of a behavior prediction method according to an embodiment of the present disclosure.
As shown in fig. 7, the method includes operations S701 to S703.
In operation S701, a time series data sequence of a user to be predicted is acquired, the time series data sequence including at least one preset type of data at a plurality of consecutive time points.
In operation S702, data extraction is performed on the time sequence data sequence by using a preset sliding time window, so as to obtain window data of the sliding time window corresponding to each sliding time point.
In operation S703, the behavior of the user to be predicted is predicted based on the window data of the sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model, so as to obtain a prediction result related to the data of the at least one preset type.
In the present disclosure, the pre-trained behavior prediction model is a behavior prediction model trained according to the embodiments shown in fig. 2 to 6. The time series data sequence of the user to be predicted is processed in the same way as in the embodiments shown in fig. 2 to 6, which is not described in detail here.
In an embodiment of the present disclosure, when the trained model is used for behavior prediction, the KS (Kolmogorov-Smirnov) statistic, the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the ROC Curve) value can be used to assess the risk differentiation capability of the model.
KS is used to evaluate the risk discrimination capability of the model. The index measures the difference between the cumulative distributions of the positive and negative samples (default samples and normal samples), that is, it measures the difference between the two distributions from the perspective of probability. The greater the cumulative difference between positive and negative samples, the greater the KS index and the stronger the risk discrimination capability of the model.
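A minimal sketch of the KS index, assuming the usual definition as the maximum gap between the cumulative share of positive samples and the cumulative share of negative samples when samples are ordered by predicted score, is given below. The label 1 denotes a default sample and 0 a normal sample; the function name is hypothetical.

```python
def ks_statistic(scores, labels):
    """Maximum absolute gap between cumulative positive and negative shares."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # consume tied scores together so a threshold never splits equal scores
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_pos += pairs[j][1]
            cum_neg += 1 - pairs[j][1]
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best

# Hypothetical predicted default probabilities and true labels.
ks = ks_statistic([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 0, 1, 0, 0])
ks_perfect = ks_statistic([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```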
The ROC curve and AUC are often used to evaluate the quality of a binary classifier, reflecting the classifier's ability to rank samples. AUC essentially reflects the probability that, given a randomly chosen positive sample and a randomly chosen negative sample, the classifier assigns the positive sample a higher probability of being positive than it assigns the negative sample. To compare different models, their ROC curves only need to be drawn in the same coordinate system so that their relative merits can be identified visually; the learner whose ROC curve lies closest to the upper left corner has the highest accuracy. Besides reflecting the discrimination capability of the model, the ROC curve also shows its ranking capability. With the trained model, when the volume of early warnings is too large for the users' account managers to investigate all of them, users with a high default probability can be investigated first.
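The pairwise interpretation of AUC given above can be computed directly, as in the following sketch, which counts a tie between a positive and a negative score as one half. The quadratic pairwise loop is chosen for clarity rather than efficiency, and the user identifiers are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive sample scores higher than a random
    negative sample (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
value = auc(scores, labels)

# Ranking use: investigate the users with the highest predicted default probability first.
users = ["u1", "u2", "u3", "u4", "u5", "u6"]
priority = [u for _, u in sorted(zip(scores, users), reverse=True)][:3]
```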
In the present disclosure, the predicted result may be a specific value of credit, or a default level, which is not limited by the present disclosure. The default levels are, for example, four levels of red, yellow, green and blue, five levels of ABCDE, and the like, and different management is performed for customers of different levels.
In an embodiment of the present disclosure, operation S703 includes: carrying out data derivation processing on the window data of the sliding time window corresponding to each sliding time point to obtain various types of derived data; and inputting the various types of derivative data into the pre-trained behavior prediction model to obtain a prediction result.
In an embodiment of the present disclosure, operation S702 includes: and searching a second time point in the plurality of continuous time points, and sliding the sliding time window from the second time point to the last time point to extract data so as to obtain window data of the sliding time window corresponding to each sliding time point.
In an embodiment of the present disclosure, the behavior prediction method shown in fig. 7 further includes: and acquiring the current sliding time point and a time point corresponding to the current sliding time window, and rejecting data of which the effective time of the sample in the sliding time window is less than the current sliding time point of the sliding time window.
In an embodiment of the present disclosure, the at least one predetermined type of data in operation S703 is data related to loan. The prediction result related to the at least one preset type of data is the default result of the user to be predicted.
Based on the model training method shown in fig. 2 to 6, the present disclosure also provides a model training device. The apparatus will be described in detail below with reference to fig. 8.
Fig. 8 schematically shows a block diagram of a model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the model training apparatus 800 of this embodiment includes a building module 810, a first extraction module 820, and a training module 830.
The building module 810 is configured to construct time series data sequence samples of a plurality of users, where each time series data sequence sample includes samples at a plurality of consecutive time points, and each sample includes at least one preset type of data of one user at the current time point. In an embodiment, the building module 810 may be configured to perform the operation S201 described above, which is not described herein again.
The first extraction module 820 is configured to perform data extraction on the time series data sequence samples by using a preset sliding time window, so as to obtain window data of the sliding time window corresponding to each sliding time point. In an embodiment, the first extraction module 820 may be configured to perform the operation S202 described above, which is not described herein again.
The training module 830 is configured to train a behavior prediction model based on the window data of the sliding time window corresponding to each sliding time point. In an embodiment, the training module 830 may be configured to perform the operation S203 described above, which is not described herein again.
In an embodiment of the present disclosure, the training module 830 includes:
the data derivation submodule is used for carrying out data derivation processing on the window data of the sliding time window corresponding to each sliding time point to obtain various types of derived data;
and the training submodule is used for training the behavior prediction model based on the window data and the various types of derivative data.
In an embodiment of the present disclosure, the first extraction module 820 includes:
a search sub-module for searching for a second time point of the plurality of consecutive time points;
and the sliding submodule is used for performing data extraction by sliding the sliding time window from the second time point to the last time point to obtain window data of the sliding time window corresponding to each sliding time point.
In an embodiment of the present disclosure, the types of the samples include default samples and normal samples, and the apparatus further includes:
the determining module is used for determining default users corresponding to the default samples when the default samples appear in the sliding time window;
and the first rejection module is used for rejecting all samples of the default user from the time series data sequence samples.
In an embodiment of the present disclosure, the apparatus further includes:
the acquisition module is used for acquiring a current sliding time point and a time point corresponding to a current sliding time window;
and the second eliminating module is used for eliminating the samples of which the effective time of the samples in the sliding time window is less than the current sliding time point of the sliding time window.
In an embodiment of the present disclosure, the time length of the sliding time window includes at least two consecutive time points.
In one embodiment of the present disclosure, the at least one predetermined type of data is data associated with a loan.
Based on the behavior prediction method shown in fig. 7, the present disclosure also provides a behavior prediction apparatus. The apparatus will be described in detail below with reference to fig. 9.
Fig. 9 schematically shows a block diagram of a behavior prediction apparatus according to an embodiment of the present disclosure.
As shown in fig. 9, the behavior prediction apparatus 900 of this embodiment includes an acquisition module 910, a second extraction module 920, and a prediction module 930.
The acquisition module 910 is configured to acquire a time series data sequence of a user to be predicted, where the time series data sequence includes at least one preset type of data at a plurality of consecutive time points. In an embodiment, the acquisition module 910 may be configured to perform the operation S701 described above, which is not described herein again.
The second extraction module 920 is configured to perform data extraction on the time series data sequence by using a preset sliding time window, so as to obtain window data of the sliding time window corresponding to each sliding time point. In an embodiment, the second extraction module 920 may be configured to perform the operation S702 described above, which is not described herein again.
The prediction module 930 is configured to perform behavior prediction on the user to be predicted based on the window data of the sliding time window corresponding to each sliding time point and a pre-trained behavior prediction model, so as to obtain a prediction result related to the at least one preset type of data. In an embodiment, the prediction module 930 may be configured to perform the operation S703 described above, which is not described herein again.
In an embodiment of the present disclosure, the prediction module 930 is specifically configured to perform data derivation processing on window data of a sliding time window corresponding to each sliding time point to obtain multiple types of derived data; and inputting the various types of derivative data into the pre-trained behavior prediction model to obtain a prediction result.
In an embodiment of the present disclosure, the second extraction module 920 is specifically configured to: search for a second time point of the plurality of consecutive time points; and slide the sliding time window from the second time point to the last time point to perform data extraction, so as to obtain window data of the sliding time window corresponding to each sliding time point.
In an embodiment of the present disclosure, the apparatus further includes:
the time acquisition module is used for acquiring a current sliding time point and a time point corresponding to a current sliding time window;
and the data removing module is used for removing the data of which the effective time of the samples in the sliding time window is less than the current sliding time point of the sliding time window.
In one embodiment of the present disclosure, the at least one predetermined type of data is data associated with a loan. The prediction result related to the at least one preset type of data is the default result of the user to be predicted.
According to an embodiment of the present disclosure, any plurality of the building module 810, the first extraction module 820, and the training module 830 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the disclosure, at least one of the building module 810, the first extraction module 820, and the training module 830 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware. Or at least one of the building module 810, the first extraction module 820 and the training module 830 may be at least partially implemented as a computer program module, which when executed may perform the respective functions.
FIG. 10 schematically illustrates a block diagram of an electronic device suitable for implementing a model training method or a behavior prediction method according to an embodiment of the present disclosure.
As shown in fig. 10, an electronic device 1000 according to an embodiment of the present disclosure includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. Processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1001 may also include onboard memory for caching purposes. The processor 1001 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the present disclosure.
In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are stored. The processor 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the program may also be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 1000 may also include an input/output (I/O) interface 1005, which is also connected to the bus 1004. The electronic device 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 1002 and/or the RAM 1003 described above and/or one or more memories other than the ROM 1002 and the RAM 1003.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to implement the model training method or the behavior prediction method provided by the embodiments of the present disclosure.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 1001. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via the communication part 1009, and/or installed from the removable medium 1011. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming language includes, but is not limited to, programming languages such as Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or associations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or associations are not expressly recited in the present disclosure. In particular, various combinations and/or associations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.