Data mining method based on social network analysis technology

Document number: 8209; publication date: 2021-09-17

1. A data mining method based on social network analysis technology, comprising four modules: credit risk control modeling, a feature variable library, a simulation database, and scene variable data mining;

the specific operation steps of the credit risk control modeling are as follows:

S1, preparing data and collecting existing data samples;

S2, preprocessing the data samples and building a feature variable wide-table data set according to the attributes of the data samples;

S3, dividing the data set into a training set, a test set and a validation set;

S4, building a training model;

S5, training with the training model and adjusting the loss function and the optimizer;

S6, generating a scorecard and evaluating the test accuracy of the training model using the test set obtained in S3;

S7, releasing the model and applying it in business scenarios;

S8, accumulating the data generated in application and feeding it back into the samples collected in S1;

the feature variable library is used for constructing behavior sequences from the behavior data of credit customers and generating feature vectors; it is further used for extracting the attribute feature variables, relationship feature variables, behavior feature variables and rule variables of the credit customers, and for constructing a feature variable table from these variables;

the simulation database is a manually constructed data set that closely simulates real data and strictly follows the data source format;

the specific steps of the scene variable data mining are as follows: acquiring a behavior feature query statement that contains the behavior feature information to be queried; querying an annotated scene database for the annotated scene data corresponding to the behavior feature information; and extracting the corresponding scene data from the original scene database according to the time label of the retrieved annotated scene data, to generate feature scene data.

2. The data mining method based on social network analysis technology as claimed in claim 1, wherein the data samples include basic user information data, bank transaction record data, product holding data and credit report data.

3. The data mining method based on social network analysis technology of claim 1, further comprising credit risk control scorecard modeling, wherein the credit risk control scorecard modeling comprises a customer information collection module and a credit scoring model building module, and the customer information collection module is used for collecting the customer's credit history, behavior preferences, ability to perform obligations, identity traits and interpersonal relationships.

4. The data mining method based on social network analysis technology of claim 3, wherein the customer information collection module comprises a search engine and a storage unit, and is used for querying related credit report data in the system according to the age, income, occupation, educational background, assets and liabilities submitted by the borrower and recording the data in the storage unit; the customer information collection module further comprises a crawler engine, which is used for continuously capturing, in real time and according to the borrower's information, internet data such as social, e-commerce, communication and travel data, and for storing the processed internet data in the storage unit; the credit risk control scorecard modeling is based on deep learning and combines multiple algorithms to construct dozens of risk control models: features that distinguish user risk are identified, the models are then built, users are scored, and the average default rate is calculated.

5. The method of claim 1, wherein the simulation database is used for acquiring the location information and social activity track of the customer's mobile phone through a device identification script embedded in a website or mobile terminal, so as to identify whether the user frequently changes SIM cards, deliberately hides personal information, deliberately exposes personal information within a short period, and the like.

6. The data mining method based on social network analysis technology as claimed in claim 1, wherein the feature variable library implements the process of converting raw data into features that better describe the underlying problem to the prediction model, thereby improving the model's accuracy on unseen data.

7. The data mining method based on social network analysis technology of claim 1, wherein the simulation database comprises a data mining system, a data acquisition system, a data analysis system, a data filtering system and a data forming system, wherein:

the data mining system: first, a given large data sample set M is input, where M = {M1, M2, ..., Mn}; the input sample set is then integrated and normalized; a value n and H = (H1, H2, ..., Hm) are selected as, respectively, the number of clusters to generate and the initial quality parameters of the mean clustering algorithm; the mean clustering algorithm is performed to obtain f clusters {F1, F2, ..., Fm}; each Fi of the f clusters is taken as a sub-cluster of the initial cluster; a feature vector K is computed, where K = (K1, K2, ..., Km); an exploration interest parameter d is set, and Ki is output as an interest feature when Ki < d; otherwise no processing is performed;

the data acquisition system is used for acquiring data, and specifically operates as follows: first, keywords are set for the search engine used in social network data acquisition; the input keywords are expanded by synonym into a plurality of subscription requests by a data preprocessing module in the data mining system; a task scheduling module then submits the acquisition tasks to a data acquisition module; the documents obtained by the acquisition module are preprocessed according to their validity period, documents past their validity period are discarded, and the retained documents are stored in a database and passed to the data analysis system;

the data analysis system is used for processing the acquired data, specifically: when a trigger condition is met, a task scheduling module triggers a semantic analysis module to perform a document analysis task; the document analysis task performs a general classification of the collected documents, that is, text segmentation and semantic analysis of the words; when the abstract of a text is extracted, semantic analysis is performed on the abstract to judge whether the document content is accurate; the accurate information is extracted and provided to the data filtering system;

the data filtering system is used for further analyzing and processing the data: the analyzed data are read into data tables, which are collectively referred to as containers; configuration data in a container are set as filtering and screening configuration nodes, each of which sets a filtering attribute or a screening attribute; the configuration data in the container are then displayed hierarchically, as a tree structure, on an analysis interface according to the configuration node settings;

the data forming system is used for forming the final database: when the processing task of each mining algorithm is executed, the task is distributed through a Map/Reduce mechanism to Map tasks that are executed in parallel, and the results of those Map tasks are merged by the corresponding Reduce tasks to obtain the result of that processing task.

8. The data mining method based on social network analysis technology as claimed in claim 1, wherein the data acquisition system and the data analysis system are integrated with the application system in a loosely coupled manner; when the analysis results of the data acquisition system and the data analysis system trigger a service request of the CRM module and the service request is dispatched to the application system, the requested content and the related user information are displayed to the relevant personnel, who judge from this information whether to follow up with the user: if so, a user communication process is triggered and the interaction with the user takes place within that process; if further processing is required, a background process is entered.

9. The method of claim 1, wherein the data forming system: performs correlation analysis on the preprocessed target data set and compiles the correlated information content from the database; performs distributed classification and clustering, computing on the data in distributed shards, then summarizing the results and processing them in parallel; divides a group of data objects stored in the database into different classes according to their common features and the classification scheme, and maps the data items in the database to a given class through an information classification algorithm; and groups the events by classification type and feature, performs multi-dimensional analysis, and compiles the essential information data to form the big database.

Background

With the continuous development of social informatization and the continuous expansion of the application fields of information technology, more and more data are being accumulated in every application field, including the economy, medicine, construction and the environment. Since the 1980s, the total amount of data worldwide has grown rapidly, at times doubling within a few months, and how to use and analyze these data effectively and obtain the useful information hidden in them has become a major challenge. Part of this huge volume of data is arranged in time order; such data are called time series. Time series exist in many application fields, and discovering the latent patterns and valuable information behind them through in-depth research has great social significance and economic value.

In recent years, as data volumes have grown, some data analysis methods can no longer extract the more valuable information effectively, so a new data analysis method, data mining, has emerged. Data mining can not only analyze existing data but also predict unknown future information from the original data; for example, the sales volume of a given market in the next month can be predicted through data mining. What is data mining? It can be defined in many different ways; in brief, data mining extracts valuable information from massive amounts of data. Most original data are noisy and imprecise, yet they hold a great deal of potential value. The mining process uses technical knowledge from various fields to process and analyze massive data and to uncover content that helps people make higher-level analyses and decisions.

At present, although research on data mining at home and abroad has produced many results, time series mining methods are not universal across application fields; for example, a data mining method designed for the financial field performs poorly when applied in the medical field. Most existing methods perform well in only one respect and cannot be combined to perform well in others. Previous research on time series clearly still has shortcomings: traditional mining methods are not applicable to time series mining problems in different fields, and how to find new data mining methods is a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

In view of the above, the present invention aims to provide a data mining method based on social network analysis technology that can process massive data effectively, improve the speed and accuracy of data mining, and effectively extract the required exploration interest feature data, and that features wide coverage, strong flexibility and a high risk identification rate.

To achieve this aim, the invention provides the following technical solution:

a data mining method based on social network analysis technology, comprising four modules: credit risk control modeling, a feature variable library, a simulation database, and scene variable data mining;

the specific operation steps of the credit risk control modeling are as follows:

S1, preparing data and collecting existing data samples;

S2, preprocessing the data samples and building a feature variable wide-table data set according to the attributes of the data samples;

S3, dividing the data set into a training set, a test set and a validation set;

S4, building a training model;

S5, training with the training model and adjusting the loss function and the optimizer;

S6, generating a scorecard and evaluating the test accuracy of the training model using the test set obtained in S3;

S7, releasing the model and applying it in business scenarios;

S8, accumulating the data generated in application and feeding it back into the samples collected in S1;

the feature variable library is used for constructing behavior sequences from the behavior data of credit customers and generating feature vectors; it is further used for extracting the attribute feature variables, relationship feature variables, behavior feature variables and rule variables of the credit customers, and for constructing a feature variable table from these variables;

the simulation database is a manually constructed data set that closely simulates real data and strictly follows the data source format;

the specific steps of the scene variable data mining are as follows: acquiring a behavior feature query statement that contains the behavior feature information to be queried; querying an annotated scene database for the annotated scene data corresponding to the behavior feature information; and extracting the corresponding scene data from the original scene database according to the time label of the retrieved annotated scene data, to generate feature scene data.
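For illustration only, the division of the data set in S3 and the accuracy evaluation in S6 might be sketched in Python as follows. The function names `split_dataset` and `accuracy`, the split fractions and the fixed random seed are assumptions made for this sketch; the invention does not specify them, nor the model whose predictions are evaluated.

```python
import random


def split_dataset(rows, train_frac=0.7, test_frac=0.15, seed=0):
    """Shuffle rows and split them into training, test and validation sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and test sets becomes the validation set.
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


def accuracy(predict, labelled_rows):
    """Fraction of (features, label) rows for which predict(features) == label."""
    rows = list(labelled_rows)
    if not rows:
        raise ValueError("cannot evaluate on an empty set")
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)
```

Here `predict` stands in for whatever trained model S4 and S5 produce; it is called only on the test set, which is kept apart from the training data.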

Preferably, in the above data mining method based on social network analysis technology, the data samples include basic user information data, bank transaction record data, product holding data and credit report data.

Preferably, in the above data mining method based on social network analysis technology, the method further includes credit risk control scorecard modeling, which includes a customer information collection module and a credit scoring model building module; the customer information collection module is used for collecting the customer's credit history, behavior preferences, ability to perform obligations, identity traits and interpersonal relationships.

Preferably, in the above data mining method based on social network analysis technology, the customer information collection module comprises a search engine and a storage unit, and is used for querying related credit report data in the system according to the age, income, occupation, educational background, assets and liabilities submitted by the borrower and recording the data in the storage unit; the customer information collection module further comprises a crawler engine, which is used for continuously capturing, in real time and according to the borrower's information, internet data such as social, e-commerce, communication and travel data, and for storing the processed internet data in the storage unit; the credit risk control scorecard modeling is based on deep learning and combines multiple algorithms to construct dozens of risk control models: features that distinguish user risk are identified, the models are then built, users are scored, and the average default rate is calculated.

Preferably, in the above data mining method based on social network analysis technology, the simulation database is used for acquiring the location information and social activity track of the customer's mobile phone through a device identification script embedded in a website or mobile terminal, so as to identify whether the user frequently changes SIM cards, deliberately hides personal information, deliberately exposes personal information within a short period, and the like.

Preferably, in the above data mining method based on social network analysis technology, the feature variable library implements the process of converting raw data into features that better describe the underlying problem to the prediction model, thereby improving the model's accuracy on unseen data.

Preferably, in the above data mining method based on social network analysis technology, the simulation database includes a data mining system, a data acquisition system, a data analysis system, a data filtering system and a data forming system, wherein:

the data mining system: first, a given large data sample set M is input, where M = {M1, M2, ..., Mn}; the input sample set is then integrated and normalized; a value n and H = (H1, H2, ..., Hm) are selected as, respectively, the number of clusters to generate and the initial quality parameters of the mean clustering algorithm; the mean clustering algorithm is performed to obtain f clusters {F1, F2, ..., Fm}; each Fi of the f clusters is taken as a sub-cluster of the initial cluster; a feature vector K is computed, where K = (K1, K2, ..., Km); an exploration interest parameter d is set, and Ki is output as an interest feature when Ki < d; otherwise no processing is performed;

the data acquisition system is used for acquiring data, and specifically operates as follows: first, keywords are set for the search engine used in social network data acquisition; the input keywords are expanded by synonym into a plurality of subscription requests by a data preprocessing module in the data mining system; a task scheduling module then submits the acquisition tasks to a data acquisition module; the documents obtained by the acquisition module are preprocessed according to their validity period, documents past their validity period are discarded, and the retained documents are stored in a database and passed to the data analysis system;

the data analysis system is used for processing the acquired data, specifically: when a trigger condition is met, a task scheduling module triggers a semantic analysis module to perform a document analysis task; the document analysis task performs a general classification of the collected documents, that is, text segmentation and semantic analysis of the words; when the abstract of a text is extracted, semantic analysis is performed on the abstract to judge whether the document content is accurate; the accurate information is extracted and provided to the data filtering system;

the data filtering system is used for further analyzing and processing the data: the analyzed data are read into data tables, which are collectively referred to as containers; configuration data in a container are set as filtering and screening configuration nodes, each of which sets a filtering attribute or a screening attribute; the configuration data in the container are then displayed hierarchically, as a tree structure, on an analysis interface according to the configuration node settings;

the data forming system is used for forming the final database: when the processing task of each mining algorithm is executed, the task is distributed through a Map/Reduce mechanism to Map tasks that are executed in parallel, and the results of those Map tasks are merged by the corresponding Reduce tasks to obtain the result of that processing task.
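As a hedged illustration of the data mining system described above, the normalization, clustering and interest-filtering steps might look like the Python sketch below. Several points are assumptions of this sketch rather than statements of the invention: the "mean clustering algorithm" is read as k-means, H is read as the list of initial cluster centres, min-max scaling is used for normalization, and each feature value Ki is taken to be the mean distance of a cluster's members to its centre, since the text does not define how K is computed.

```python
import math


def normalize(samples):
    """Min-max scale every dimension of the sample set to [0, 1]."""
    dims = range(len(samples[0]))
    lows = [min(s[j] for s in samples) for j in dims]
    highs = [max(s[j] for s in samples) for j in dims]
    return [
        tuple((s[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
              for j in dims)
        for s in samples
    ]


def assign(samples, centres):
    """Put each sample into the cluster of its nearest centre."""
    clusters = [[] for _ in centres]
    for s in samples:
        nearest = min(range(len(centres)), key=lambda i: math.dist(s, centres[i]))
        clusters[nearest].append(s)
    return clusters


def k_means(samples, initial_centres, max_iterations=50):
    """Plain k-means starting from the given centres (assumed meaning of H)."""
    centres = [tuple(c) for c in initial_centres]
    for _ in range(max_iterations):
        clusters = assign(samples, centres)
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centres[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centres:  # converged: no centre moved
            break
        centres = updated
    return assign(samples, centres), centres


def interest_features(clusters, centres, d):
    """Return {cluster index: Ki} for every non-empty cluster with Ki < d.

    Ki is the mean member-to-centre distance, a hypothetical choice.
    """
    selected = {}
    for i, (members, centre) in enumerate(zip(clusters, centres)):
        if not members:
            continue
        k_i = sum(math.dist(m, centre) for m in members) / len(members)
        if k_i < d:
            selected[i] = k_i
    return selected
```

Under these assumptions a small d keeps only the tightest clusters, which matches the rule that Ki is output only when it is less than d.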

Preferably, in the above data mining method based on social network analysis technology, the data acquisition system and the data analysis system are integrated with the application system in a loosely coupled manner; when the analysis results of the data acquisition system and the data analysis system trigger a service request of the CRM module and the service request is dispatched to the application system, the requested content and the related user information are displayed to the relevant personnel, who judge from this information whether to follow up with the user: if so, a user communication process is triggered and the interaction with the user takes place within that process; if further processing is required, a background process is entered.

Preferably, in the above data mining method based on social network analysis technology, the data forming system: performs correlation analysis on the preprocessed target data set and compiles the correlated information content from the database; performs distributed classification and clustering, computing on the data in distributed shards, then summarizing the results and processing them in parallel; divides a group of data objects stored in the database into different classes according to their common features and the classification scheme, and maps the data items in the database to a given class through an information classification algorithm; and groups the events by classification type and feature, performs multi-dimensional analysis, and compiles the essential information data to form the big database.

Compared with the prior art, the above technical solution gives the invention the following beneficial effects:

coverage of the effective information used for modeling (risk control) reaches 95-99%, supporting multiple business scenarios such as card issuing, credit limit adjustment and charging;

batch runs are efficient; modeling is fast, with the time shortened from 1-3 months to 1-3 days; for 10,000 credit reports, 90% of response times are under 1000 milliseconds and the average response time is 759 milliseconds;

with 100 concurrent users, 99% of real-time synchronous calculations complete within one second, which meets the requirements of a decision engine;

the accuracy of the feature variable processing code reaches 99.99%, which effectively guarantees the reliability of the model; the risk identification rate and the profit opportunity identification rate are high, and the bank's ability to analyze its own data is greatly improved; the pressure of ad hoc development tasks on the information technology department is greatly reduced, and business knowledge is organized into a systematic framework; incremental data are updated in real time, and further online loan information and newly added joint-borrowing flags are brought in in stages;

the credit feature variable library is rich in business knowledge, with more than 700,000 feature variables in total; it can respond flexibly and effectively to the industry's pain points, is compatible with the requirements of a wide range of business scenarios, and can fuse data sources from multiple parties.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The invention discloses a data mining method based on social network analysis technology, which specifically comprises four modules: credit risk control modeling, a feature variable library, a simulation database, and scene variable data mining;

the specific operation steps of the credit risk control modeling are as follows:

S1, preparing data and collecting existing data samples;

S2, preprocessing the data samples and building a feature variable wide-table data set according to the attributes of the data samples;

S3, dividing the data set into a training set, a test set and a validation set;

S4, building a training model;

S5, training with the training model and adjusting the loss function and the optimizer;

S6, generating a scorecard and evaluating the test accuracy of the training model using the test set obtained in S3;

S7, releasing the model and applying it in business scenarios;

S8, accumulating the data generated in application and feeding it back into the samples collected in S1;

the feature variable library is used for constructing behavior sequences from the behavior data of credit customers and generating feature vectors; it is further used for extracting the attribute feature variables, relationship feature variables, behavior feature variables and rule variables of the credit customers, and for constructing a feature variable table from these variables;

the simulation database is a manually constructed data set that closely simulates real data and strictly follows the data source format;

the specific steps of the scene variable data mining are as follows: acquiring a behavior feature query statement that contains the behavior feature information to be queried; querying an annotated scene database for the annotated scene data corresponding to the behavior feature information; and extracting the corresponding scene data from the original scene database according to the time label of the retrieved annotated scene data, to generate feature scene data.
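The three steps of scene variable data mining described above can be sketched, under stated assumptions, as follows. The record types `AnnotatedScene` and `RawRecord`, the `behavior=<name>` query syntax, and the use of integer timestamps with an inclusive start and end as the time label are all hypothetical choices made for this example; the invention does not define the storage format or the query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedScene:
    behavior: str  # annotated behavior feature (hypothetical field)
    start: int     # time label: start of the annotated interval
    end: int       # time label: end of the annotated interval, inclusive


@dataclass(frozen=True)
class RawRecord:
    timestamp: int
    payload: str


def parse_query(statement):
    """Step 1: read the requested behavior feature from 'behavior=<name>'."""
    key, sep, value = statement.partition("=")
    if key.strip() != "behavior" or not sep or not value.strip():
        raise ValueError("expected a statement of the form 'behavior=<name>'")
    return value.strip()


def mine_scene_data(statement, annotated_db, raw_db):
    """Steps 2 and 3: look up annotated scenes, then extract raw data by time label."""
    behavior = parse_query(statement)
    # Step 2: annotated scene data matching the requested behavior feature.
    matches = [scene for scene in annotated_db if scene.behavior == behavior]
    # Step 3: raw records whose timestamp falls inside any matched interval.
    return [
        record for record in raw_db
        if any(scene.start <= record.timestamp <= scene.end for scene in matches)
    ]
```

In this sketch the annotated database acts as an index: it is searched by behavior, and only its time labels are used to pull the feature scene data out of the original scene database.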

To further optimize the technical solution, the data samples include basic user information data, bank transaction record data, product holding data and credit report data.

To further optimize the technical solution, the method further includes credit risk control scorecard modeling, which includes a customer information collection module and a credit scoring model building module; the customer information collection module is used for collecting the customer's credit history, behavior preferences, ability to perform obligations, identity traits and interpersonal relationships.

To further optimize the technical solution, the customer information collection module comprises a search engine and a storage unit, and is used for querying related credit report data in the system according to the age, income, occupation, educational background, assets and liabilities submitted by the borrower and recording the data in the storage unit; the customer information collection module further comprises a crawler engine, which is used for continuously capturing, in real time and according to the borrower's information, internet data such as social, e-commerce, communication and travel data, and for storing the processed internet data in the storage unit; the credit risk control scorecard modeling is based on deep learning and combines multiple algorithms to construct dozens of risk control models: features that distinguish user risk are identified, the models are then built, users are scored, and the average default rate is calculated.

To further optimize the technical solution, the simulation database is used for acquiring the location information and social activity track of the customer's mobile phone through a device identification script embedded in a website or mobile terminal, so as to identify whether the user frequently changes SIM cards, deliberately hides personal information, deliberately exposes personal information within a short period, and the like.

To further optimize the technical solution, the feature variable library implements the process of converting raw data into features that better describe the underlying problem to the prediction model, thereby improving the model's accuracy on unseen data.
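The final step of the scorecard modeling, scoring users and calculating the average default rate, might be sketched as below. This is an illustration only: the `(score, defaulted)` input pairs, the function names and the band width of 100 points are assumptions of the sketch, and the scores are taken as already produced by whichever risk control model is in use.

```python
def average_default_rate(scored_users):
    """Share of users who defaulted, given (score, defaulted) pairs."""
    users = list(scored_users)
    if not users:
        raise ValueError("no users to evaluate")
    return sum(1 for _, defaulted in users if defaulted) / len(users)


def default_rate_by_band(scored_users, band_width=100):
    """Average default rate per score band, keyed by the band's lower bound.

    The band width is an illustrative assumption.
    """
    bands = {}
    for score, defaulted in scored_users:
        lower = (score // band_width) * band_width
        total, bad = bands.get(lower, (0, 0))
        bands[lower] = (total + 1, bad + int(defaulted))
    return {lower: bad / total for lower, (total, bad) in sorted(bands.items())}
```

A scorecard that separates risk well should show the default rate falling as the score band rises, which is one simple way to check whether the chosen features actually distinguish user risk.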
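As a minimal sketch of converting raw behavior data into features, the example below orders a customer's raw events into a behavior sequence and turns that sequence into a count-based feature vector. The event format `(timestamp, action)`, the action names and the fixed vocabulary are hypothetical; the invention does not specify how the feature vector is encoded.

```python
from collections import Counter


def behavior_sequence(events):
    """Order raw (timestamp, action) events by time and keep only the actions."""
    return [action for _, action in sorted(events)]


def feature_vector(sequence, vocabulary):
    """Count how often each action in the vocabulary occurs in the sequence.

    Actions outside the vocabulary are ignored; missing actions count as 0.
    """
    counts = Counter(sequence)
    return [counts[action] for action in vocabulary]


# Usage with made-up events for one customer.
events = [(3, "repay"), (1, "login"), (2, "borrow"), (4, "login")]
sequence = behavior_sequence(events)
vector = feature_vector(sequence, ["login", "borrow", "repay", "overdue"])
```

A fixed vocabulary keeps every customer's vector the same length and in the same order, so the vectors can be placed side by side as rows of a feature variable table.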

In order to further optimize the technical solution, the simulation database comprises: the system comprises a data mining system, a data acquisition system, a data analysis system, a data filtering system and a data forming system; wherein the content of the first and second substances,

the data mining system: first inputting a given large data sample set M, where M ═ { M1, M2.., Mn }; then, integrating and normalizing the input sample set; selecting an n value and an H (H1, H2.. Hm) as parameters of the number of generated clusters and the initial quality of the mean clustering algorithm respectively; performing a mean clustering algorithm to obtain F clusters { F1, F2., Fm }; taking each Fi of the f clusters as a sub-cluster of the initial cluster; computing a feature vector K, where the feature vector K is represented as: k ═ K (K1, K2.., Km); setting an exploration interest parameter d, outputting an interest characteristic Ki when Ki is less than d, or else, not processing;

the data acquisition system is used for acquiring data, and specifically comprises the following operations: firstly, setting keywords as a search engine for social network data acquisition; decomposing keywords used for input into a plurality of subscription requests through a data preprocessing module in the data mining system according to synonyms, then submitting acquisition tasks to a data acquisition module through a task scheduling module, preprocessing documents obtained by the acquisition module according to effective time, discarding documents exceeding the time effectiveness, and storing the retained documents in a database and transmitting the documents to a data analysis system;

the data analysis system is used for processing the acquired data, specifically: when the triggering condition is met, the task scheduling module triggers the semantic analysis module to run a document analysis task; the document analysis task performs a general classification of the collected documents, that is, text segmentation and semantic analysis of the words; when the abstract of a text is extracted, semantic analysis is performed on the abstract to judge whether the document content is accurate; the accurate information is extracted and provided to the data filtering system;

the data filtering system is used for further analyzing and processing the data: the analyzed data are read into data tables, which are collectively referred to as containers; configuration data in a container are set as filtering or screening configuration nodes, each configuration node setting a filtering attribute or a screening attribute; then, according to the configuration node settings, the configuration data in the container are displayed hierarchically on an analysis interface as a tree structure;

the data forming system is used for forming the final database: when each mining algorithm is executed, its processing task is distributed through a Map/Reduce mechanism to Map tasks that execute in parallel, and the results of those Map tasks are merged by the corresponding Reduce task to obtain the processing result of that mining algorithm.
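The clustering step of the data mining system can be illustrated with a minimal sketch. Several points are assumptions, since the description does not fix them: normalization is taken to be min-max scaling, the "mean clustering algorithm" is taken to be standard k-means (Lloyd's algorithm) with H used as the initial cluster centers, and each Ki is taken to be the mean distance of the members of cluster Fi to its center. The function names and the toy data are hypothetical.

```python
import math

def normalize(samples):
    """Min-max scale every column of the sample set to [0, 1] (assumed normalization)."""
    cols = list(zip(*samples))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in samples
    ]

def assign(samples, centers):
    """Put every sample into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for s in samples:
        nearest = min(range(len(centers)), key=lambda i: math.dist(s, centers[i]))
        clusters[nearest].append(s)
    return clusters

def k_means(samples, centers, iterations=20):
    """Lloyd's algorithm, starting from the given initial centers H."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = assign(samples, centers)
        # Move each center to the mean of its members; an empty cluster keeps its center.
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centers)
        ]
    return assign(samples, centers), centers

def interest_features(clusters, centers, d):
    """Assumed K_i: mean member-to-center distance of cluster F_i; select those with K_i < d."""
    K = [
        sum(math.dist(s, c) for s in members) / len(members) if members else math.inf
        for members, c in zip(clusters, centers)
    ]
    return K, [i for i, k in enumerate(K) if k < d]

# Toy sample set M and initial centers H (illustrative values only).
M = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0), (9.5, 6.0), (6.5, 9.5)]
H = [(0.0, 0.0), (1.0, 1.0)]
clusters, centers = k_means(normalize(M), H)
K, interesting = interest_features(clusters, centers, d=0.1)
print(interesting)
```

With these toy values the tight cluster around (1, 1) falls below the threshold d and is reported, while the more scattered second cluster is left unprocessed.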

To further optimize the technical solution, the data acquisition system and the data analysis system are integrated with the application system in a loosely coupled manner. When the analysis results of the data acquisition system and the data analysis system trigger a service request in the CRM module and the service request is dispatched to the application system, the requested content and the relevant user information are displayed to the responsible personnel, who decide from this information whether to follow up with the user: if so, the user communication process is triggered and the interaction with the user takes place within that process; if further processing is required, a background process is entered.
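On the acquisition side, the synonym-based expansion of keywords into subscription requests and the discarding of documents that are past their validity period, as described for the data acquisition system, might look as follows. This is only a sketch: the synonym table, the document layout (a dict with a `published` timestamp) and the function names are hypothetical and are not specified in the description.

```python
from datetime import datetime, timedelta

# Hypothetical synonym table; the description does not say where synonyms come from.
SYNONYMS = {"loan": ["credit", "borrowing"], "default": ["delinquency"]}

def expand_keywords(keywords, synonyms=SYNONYMS):
    """Expand each keyword into one subscription request per synonym, plus the keyword itself."""
    requests = []
    for kw in keywords:
        for term in [kw, *synonyms.get(kw, [])]:
            if term not in requests:  # avoid submitting the same request twice
                requests.append(term)
    return requests

def drop_stale(documents, now, max_age):
    """Keep only documents published within max_age of now; the rest are discarded."""
    return [doc for doc in documents if now - doc["published"] <= max_age]

# Illustrative usage with made-up documents.
requests = expand_keywords(["loan", "fraud"])
docs = [
    {"id": 1, "published": datetime(2021, 9, 16)},
    {"id": 2, "published": datetime(2021, 8, 1)},
]
fresh = drop_stale(docs, now=datetime(2021, 9, 17), max_age=timedelta(days=7))
print(requests, [doc["id"] for doc in fresh])
```

Only the retained documents would then be stored and passed on to the data analysis system.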

To further optimize the technical solution, the data forming system performs correlation analysis on the preprocessed target data set and compiles statistics on the correlated information content in the database; performs distributed classification and clustering; performs distributed sharded computation on the data, then aggregates the results and processes them in parallel; divides a group of data objects stored in the database into different classes according to their common characteristics under a classification scheme; maps the data items in the database to given classes through an information classification algorithm; groups events by classification type and characteristics; performs multi-dimensional analysis; and compiles statistics on the substantive information data to form a large database.
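The Map/Reduce pattern used by the data forming system, in which data shards are processed by parallel Map tasks and the partial results are merged by a Reduce task, can be sketched in a single process. This is an illustration under stated assumptions, not the described implementation: a thread pool stands in for a distributed Map/Reduce cluster, counting class labels stands in for a real mining algorithm, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_task(shard):
    """Map step: count how often each class label occurs in one shard of the data."""
    return Counter(shard)

def reduce_task(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

def run_map_reduce(shards, workers=4):
    """Run the Map tasks in parallel, then fold their results with the Reduce task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, shards))
    return reduce(reduce_task, partials, Counter())

# Illustrative shards of class labels (made-up data).
shards = [["a", "b"], ["a"], ["c", "a"]]
totals = run_map_reduce(shards)
print(totals["a"], totals["b"], totals["c"])
```

Because the Reduce step is associative, the partial results can be merged in any grouping, which is what allows the Map tasks to run independently on separate shards.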

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is brief, and the relevant details can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
