Method for constructing a knowledge graph system based on real-time big data
1. A method for constructing a knowledge graph system based on real-time big data, characterized by comprising the following steps:
a knowledge extraction step: acquiring user information from a business database in real time, and extracting graph data based on the user information, wherein the business database comprises a plurality of data sources, the user information comprises a plurality of variables, and the graph data comprises entities, attributes and relations;
a knowledge fusion step: fusing variables from different data sources that concern the same entity into a single variable;
and a knowledge representation step: constructing a knowledge graph based on the graph data.
2. The method for constructing a knowledge graph system based on real-time big data according to claim 1, wherein in the knowledge extraction step, the acquired user information is sent to message middleware in real time, the user information is received from the message middleware, and the graph data is then extracted based on the user information.
3. The method for constructing a knowledge graph system based on real-time big data according to claim 2, wherein in the knowledge extraction step, after the user information is received from the message middleware, it is determined whether the user information is a JSON file; when the user information is a JSON file, the JSON file is parsed and the unstructured data in the JSON file is disassembled into structured data; basic variables are extracted from the structured data;
and it is determined whether the structured data comprises preset variables; if so, the preset variables are extracted and assembled to generate derived variables.
4. The method for constructing a knowledge graph system based on real-time big data according to claim 3, wherein the message middleware is Kafka.
5. The method for constructing a knowledge graph system based on real-time big data according to claim 4, wherein in the knowledge representation step, Resource Description Framework (RDF) triples are adopted as the knowledge structure of the knowledge graph.
6. The method for constructing a knowledge graph system based on real-time big data according to claim 5, further comprising a knowledge storage step of storing the knowledge graph in a database.
7. The method for constructing a knowledge graph system based on real-time big data according to claim 6, wherein in the knowledge storage step, an ArangoDB database is used to store the knowledge graph.
8. The method for constructing a knowledge graph system based on real-time big data according to claim 7, further comprising a knowledge visualization step of acquiring a query requirement, searching the knowledge graph based on the query requirement, and returning visualization information.
9. The method for constructing a knowledge graph system based on real-time big data according to claim 8, wherein the query requirement comprises an entity query, a shortest path query, and an index query.
10. The method for constructing a knowledge graph system based on real-time big data according to claim 9, wherein, during the index query, the first-degree association count, the first-degree blacklist-hit count and the first-degree blacklist-hit rate of the user are calculated.
Background
Owing to the particular business model of online finance, the business faces a variety of risks throughout its development. Once a risk materializes, its impact spreads rapidly, and it is difficult to trace and remedy afterwards.
A financial knowledge graph is a large-scale semantic network for the financial industry. It combines graph structure with systematically organized knowledge, organizes knowledge intelligently and efficiently, and helps a user or a financial institution to query, associate and mine potential information quickly and accurately so as to reduce risk. However, most existing knowledge graphs are offline knowledge graphs. An offline knowledge graph cannot describe, in real time, the association network among prospective credit applicants during the online transaction stage, and cannot effectively solve problems such as data silos and inaccurate data.
Therefore, an online knowledge graph is needed to overcome the data silo problem and thereby reduce risk.
Disclosure of Invention
The invention provides a method for constructing a knowledge graph system based on real-time big data, which can effectively overcome the data silo problem.
In order to solve the technical problem, the present application provides the following technical solutions:
a method for constructing a knowledge graph system based on real-time big data, comprising the following steps:
a knowledge extraction step: acquiring user information from a business database in real time, and extracting graph data based on the user information, wherein the business database comprises a plurality of data sources, the user information comprises a plurality of variables, and the graph data comprises entities, attributes and relations;
a knowledge fusion step: fusing variables from different data sources that concern the same entity into a single variable;
and a knowledge representation step: constructing a knowledge graph based on the graph data.
The principle and beneficial effects of the basic scheme are as follows:
In this scheme the business database comprises a plurality of data sources, so user information acquired from it in real time is drawn from multiple data sources and is therefore richer and more complete. Entities, attributes and relations are extracted from the user information; these three dimensions effectively reflect the user's situation and provide the elements needed to build the knowledge graph.
Compared with an offline knowledge graph, acquiring user information from the business database in real time ensures that the knowledge graph is updated promptly, so that risk assessment can be carried out in time during the online transaction stage, which effectively reduces fraud risk.
Because there are multiple data sources, and they may describe variables about the same entity differently, extracting from each source separately may produce duplicate entities in the knowledge graph and affect subsequent queries. In this scheme, variables from different data sources that concern the same entity are fused into a single variable, which avoids duplication caused by inconsistent variable descriptions and ensures the accuracy of the user information acquired from multiple data sources.
In summary, the data silo problem can be effectively overcome: user information is acquired from multiple data sources to construct the knowledge graph, which makes it easier to uncover risks hidden in the underlying network, such as individual fraud and group fraud.
Further, in the knowledge extraction step, the acquired user information is sent to message middleware in real time, the user information is received from the message middleware, and the graph data is then extracted based on the user information.
Because the business database comprises a plurality of data sources, relaying the user information through the message middleware avoids information congestion and ensures that the user information is acquired smoothly.
Further, in the knowledge extraction step, after the user information is received from the message middleware, it is determined whether the user information is a JSON file; when the user information is a JSON file, the JSON file is parsed and the unstructured data in the JSON file is disassembled into structured data; basic variables are extracted from the structured data;
and it is determined whether the structured data comprises preset variables; if so, the preset variables are extracted and assembled to generate derived variables.
In this way, irregular user information is standardized, which facilitates the subsequent extraction of graph data.
Further, the message middleware adopts Kafka.
Kafka is a high-throughput distributed publish-subscribe messaging system; it can effectively avoid information congestion and facilitates importing user information in real time.
Further, in the knowledge representation step, Resource Description Framework (RDF) triples are adopted as the knowledge structure of the knowledge graph.
Further, the method also comprises a knowledge storage step of storing the knowledge graph through a database.
Further, in the knowledge storage step, an ArangoDB database is used for storing the knowledge graph.
ArangoDB is a native multi-model database that supports key/value, graph and document data models and provides a unified query language covering all three, so the three models can be mixed in a single query, which facilitates subsequent queries.
Further, the method also comprises a knowledge visualization step, in which a query requirement is acquired, the knowledge graph is searched based on the query requirement, and visualization information is returned.
Through the visualization information, the user can grasp the query result intuitively and quickly judge whether a risk exists.
Further, the query requirements include entity queries, shortest path queries, and index queries.
Further, during the index query, the first-degree association count, the first-degree blacklist-hit count and the first-degree blacklist-hit rate of the user are calculated.
Drawings
FIG. 1 is a diagram illustrating the knowledge graph visualization according to the first embodiment;
FIG. 2 is a diagram illustrating a shortest path query according to the first embodiment.
Detailed Description
The following is further detailed by way of specific embodiments:
Example one
As shown in fig. 1, the method for constructing a knowledge graph system based on real-time big data according to this embodiment includes the following steps:
A knowledge extraction step: user information is acquired from a business database.
The business database comprises a plurality of data sources. In this embodiment, the data sources comprise the bank's internal data, credit data from the People's Bank of China, and data from third-party data sources. Third-party data sources are data sources other than the bank's internal data and the People's Bank of China credit data.
The user information includes several variables, such as:
User basic information: name, identity card number, mobile phone number, device fingerprint, contacts, spouse information, marital status, gender, year and month of birth, etc.
Data on the user's company: the company's unified social credit code, the company's operating status, etc.
Credit limit application data: the user's credit limit application number, whether the application was approved, the rejection rules triggered, etc.
Loan application data: the time of the user's loan application, the loan amount, the time of borrowing, the reason for rejection, etc.
Credit reporting data: the user's overdue status in credit records, blacklist hits, etc.
Deposit data: credit balance, repayment amount, overdue status, etc.
Vehicle loan data: vehicle frame number, engine number, license plate number, etc.
The identity card number, the device fingerprint, the unified social credit code, the credit limit application number and so on are all variables.
Specifically, the incremental log of the business database is parsed, and the incremental user information in the business database is thereby acquired in real time. In this embodiment, incremental log parsing is implemented with the open-source framework Canal.
The acquired user information is sent to the message middleware Kafka of the real-time platform. In this embodiment, the real-time platform is built on the open-source stream processing framework Apache Flink.
The user information is received from the message middleware Kafka, and it is determined whether the user information is a JSON file. When the user information is a JSON file, the JSON file is parsed and the unstructured data in the JSON file is disassembled into structured data.
For example, in the JSON file, the user information of Zhang San is: name: Zhang San; {idcard: 1******0, cell: 156****0000, 180****0000, …}. The three items of user information all sit in one array, which is inconvenient for subsequent use. After the unstructured data is disassembled into structured data, the three items of user information take the form of a list:
name: Zhang San;
idcard: 1******0;
cell: 156****0000;
180****0000;
……
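The JSON check and the disassembly into a list can be sketched as follows. This is a minimal illustration only: the nested layout of the sample message, the key name `info`, and the helper names `parse_message` and `flatten` are assumptions made for the example and are not taken from the embodiment.

```python
import json


def parse_message(raw):
    """Return the parsed object if raw is a JSON document, otherwise None."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError
        return None


def flatten(record):
    """Disassemble nested dicts and lists into flat (variable, value) rows."""
    rows = []

    def walk(variable, value):
        if isinstance(value, dict):
            for key, item in value.items():
                walk(key, item)          # the innermost key names the variable
        elif isinstance(value, list):
            for item in value:
                walk(variable, item)     # list items keep the enclosing name
        else:
            rows.append((variable, value))

    walk("", record)
    return rows


# Hypothetical message layout: identifiers packed into one array.
raw = ('{"name": "Zhang San", "info": [{"idcard": "1******0"}, '
       '{"cell": ["156****0000", "180****0000"]}]}')

record = parse_message(raw)
rows = flatten(record) if record is not None else []
for variable, value in rows:
    print(f"{variable}: {value}")
```

Messages that are not JSON are simply skipped here; how the embodiment treats them is not stated.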
Basic variables are extracted from the structured data. In this embodiment, the basic variables are variables that can be used directly, such as IP address, device fingerprint, contacts, spouse information and marital status.
It is then determined whether the structured data comprises preset variables; if so, the preset variables are extracted and assembled to generate derived variables. In this embodiment, a preset variable is a variable that needs further processing.
For example, the overdue record of Zhang San is a preset variable:
500 yuan in June 2020, 100 yuan in July 2020, and 1000 yuan in December 2020.
The total overdue amount over the last five years, assembled from the overdue record above (i.e., a derived variable), is 500 + 100 + 1000 = 1600 yuan.
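The assembly of this derived variable can be sketched as below. The record layout, the function name `total_overdue`, the day of the month in each date, and the reference date `as_of` are all assumptions for illustration; the embodiment gives only the months and amounts.

```python
from datetime import date

# Overdue record from the example above (day of month is a placeholder).
overdue_records = [
    (date(2020, 6, 1), 500),
    (date(2020, 7, 1), 100),
    (date(2020, 12, 1), 1000),
]


def total_overdue(records, as_of, years=5):
    """Sum overdue amounts that fall within the last `years` years before as_of."""
    # Use the first day of the month so the cutoff date is always valid.
    cutoff = date(as_of.year - years, as_of.month, 1)
    return sum(amount for when, amount in records if cutoff <= when <= as_of)


# Hypothetical evaluation date.
total = total_overdue(overdue_records, as_of=date(2021, 1, 1))
print(total)
```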
Graph data is extracted based on the user information; the graph data includes entities, attributes and relations.
For example, the entities include the identity card number, the mobile phone number, the company's unified social credit code, the license plate number and the like.
For example, the entity Zhang San corresponds to a mobile phone number, and the relation is that Zhang San owns the mobile phone number.
For example, for the entity Zhang San, the gender, year and month of birth, marital status and the like are the corresponding attributes.
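One possible way to split flat rows into entities, attributes and relations is sketched below. The variable names in `ENTITY_VARIABLES` and `ATTRIBUTE_VARIABLES`, the relation label `owns`, and the sample attribute value are hypothetical; the embodiment does not specify how variables are classified.

```python
# Hypothetical classification of variable names.
ENTITY_VARIABLES = {"idcard", "cell", "company_code", "plate"}
ATTRIBUTE_VARIABLES = {"gender", "birth", "marital_status"}


def extract_graph_data(person, rows):
    """Split (variable, value) rows into entities, attributes and relations."""
    owner = ("person", person)
    entities = {owner}
    attributes = {}
    relations = []
    for variable, value in rows:
        if variable in ENTITY_VARIABLES:
            entity = (variable, value)
            entities.add(entity)
            relations.append((owner, "owns", entity))
        elif variable in ATTRIBUTE_VARIABLES:
            attributes[variable] = value
        # Other variables are ignored in this sketch.
    return entities, attributes, relations


rows = [
    ("idcard", "1******0"),
    ("cell", "156****0000"),
    ("gender", "male"),          # made-up sample value
]
entities, attributes, relations = extract_graph_data("Zhang San", rows)
```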
A knowledge fusion step: variables from different data sources that concern the same entity are fused into a single variable.
The three kinds of data sources differ and may describe the same variable differently. For example, with the court case number as the entity, data source 1 returns the variable "court case number: 11*0", while data source 2 returns "criminal case number: 22*1". The criminal case number returned by data source 2 is converted to "court case number: 22*1", and 11*0 and 22*1 together form the content of the court case number entity. If "court case number: 11*0" and "criminal case number: 11*0" are identical, only one is retained after the conversion.
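The fusion rule in this example can be sketched with an alias table that maps each source-specific variable name to one canonical name, followed by de-duplication of the values. The table `VARIABLE_ALIASES`, the function name `fuse`, and the masked case numbers are placeholders for illustration.

```python
# Hypothetical alias table: source-specific name -> canonical variable name.
VARIABLE_ALIASES = {"criminal case number": "court case number"}


def fuse(sources):
    """Merge rows from several data sources under canonical variable names."""
    fused = {}
    for rows in sources:
        for variable, value in rows:
            canonical = VARIABLE_ALIASES.get(variable, variable)
            values = fused.setdefault(canonical, [])
            if value not in values:      # identical values are kept only once
                values.append(value)
    return fused


source_1 = [("court case number", "11*0")]
source_2 = [("criminal case number", "22*1"), ("criminal case number", "11*0")]

fused = fuse([source_1, source_2])
print(fused)
```

A list is used instead of a set so that the order in which values were first seen is preserved.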
A knowledge representation step: Resource Description Framework (RDF) triples are adopted as the knowledge structure of the knowledge graph, and the knowledge graph is constructed based on the graph data.
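As a rough illustration of the triple structure, each fact can be held as a (subject, predicate, object) tuple. The sketch below uses plain Python tuples rather than an RDF library; the identifier scheme (`person:…`, `cell:…`), the predicate names and the sample attribute value are assumptions.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def to_triples(person, attributes, owned):
    """Express a person's attributes and owned entities as triples."""
    triples = [Triple(person, name, value) for name, value in attributes.items()]
    triples += [Triple(person, "owns", entity) for entity in owned]
    return triples


triples = to_triples(
    "person:Zhang San",
    {"gender": "male"},              # made-up sample attribute
    ["cell:156****0000"],
)
for triple in triples:
    print(triple)
```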
A knowledge storage step: the knowledge graph is stored in an ArangoDB database.
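ArangoDB stores graph vertices as documents identified by `_key`, and edges as documents whose `_from` and `_to` attributes hold document ids of the form `collection/key`. The sketch below only prepares such documents as Python dictionaries and makes no database call; the collection name `entities`, the fields `kind`, `value` and `relation`, and the hashing of keys are assumptions for illustration.

```python
import hashlib


def make_key(kind, value):
    """Derive a stable document key that contains only hex characters."""
    return hashlib.sha1(f"{kind}:{value}".encode("utf-8")).hexdigest()


def to_documents(triples, vertex_collection="entities"):
    """Turn ((kind, value), predicate, (kind, value)) triples into documents."""
    vertices, edges = {}, []
    for subject, predicate, obj in triples:
        for kind, value in (subject, obj):
            key = make_key(kind, value)
            vertices[key] = {"_key": key, "kind": kind, "value": value}
        edges.append({
            "_from": f"{vertex_collection}/{make_key(*subject)}",
            "_to": f"{vertex_collection}/{make_key(*obj)}",
            "relation": predicate,
        })
    return list(vertices.values()), edges


triples = [
    (("person", "Zhang San"), "owns", ("cell", "156****0000")),
    (("person", "Zhang San"), "owns", ("idcard", "1******0")),
]
vertices, edges = to_documents(triples)
```

Keying vertices by a hash of their kind and value means the same entity arriving from two messages maps to the same document, which fits the de-duplication goal of the fusion step.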
As shown in FIG. 1, a knowledge visualization step: a query requirement is acquired, the knowledge graph is searched based on the query requirement, and visualization information is returned. In this embodiment, the query requirement includes an entity query, a shortest path query, and an index query.
For example, an entity query on a borrower's identity card number returns the entity data related to the borrower and the entity data of other people related to that information, such as contact information, loan application information and the labels of each entity (for example, blacklisted or dishonest debtor). Visualization presents complex information intuitively, so that the origin and connections of hidden information are clear at a glance.
The shortest path refers to the association distance, i.e. the degree of association, between two or more users. A shortest path query returns the shortest association distance between two entities. As shown in FIG. 2, for example, by inputting the identity card numbers of two users, the mobile phone numbers associated with both identity card numbers can be queried.
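A shortest path query over an unweighted graph can be sketched with breadth-first search. This is an in-memory illustration, not the query the embodiment runs against the database; the adjacency-dictionary layout and the node identifiers are assumptions.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of a shortest path or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Two users linked through a shared mobile phone number (hypothetical data).
graph = {
    "idcard:A": ["cell:156****0000"],
    "cell:156****0000": ["idcard:A", "idcard:B"],
    "idcard:B": ["cell:156****0000"],
}
path = shortest_path(graph, "idcard:A", "idcard:B")
distance = len(path) - 1 if path else None
```

Here the path passes through the shared mobile phone number, which is the kind of association the query in FIG. 2 is meant to surface.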
In this embodiment, during the index query, the first-degree association count, the first-degree blacklist-hit count, the first-degree blacklist-hit rate and the like of the user are also calculated, so as to further mine hidden fraud risk information. The first-degree association count is the number of entities at distance 1 from the target node in the knowledge graph; the first-degree blacklist-hit count is the number of blacklisted entities at distance 1 from the target node; the first-degree blacklist-hit rate is the ratio of the first-degree blacklist-hit count to the first-degree association count. In other embodiments, the second-degree association count, the second-degree blacklist-hit count, the second-degree blacklist-hit rate, the third-degree association count and the like of the user may also be calculated.
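The three first-degree indexes follow directly from these definitions. The sketch below assumes an adjacency-dictionary graph and a set of blacklisted node identifiers; returning a rate of 0.0 when the target has no neighbours is a choice made for the example, not something the embodiment states.

```python
def first_degree_metrics(graph, blacklist, target):
    """Return (association count, blacklist-hit count, blacklist-hit rate)."""
    neighbours = set(graph.get(target, ())) - {target}
    association_count = len(neighbours)
    hit_count = len(neighbours & blacklist)
    rate = hit_count / association_count if association_count else 0.0
    return association_count, hit_count, rate


# Hypothetical neighbourhood: four entities at distance 1, one blacklisted.
graph = {
    "idcard:A": ["cell:1", "cell:2", "device:1", "company:1"],
}
blacklist = {"cell:2"}

metrics = first_degree_metrics(graph, blacklist, "idcard:A")
print(metrics)
```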
Example two
This embodiment differs from the first embodiment in that it further includes a recording step, in which the users corresponding to the entities involved in entity queries and shortest path queries are obtained, and a queried-user information table is generated and stored. For example, when a shortest path query asks for the association distance between user A and user B, user A and user B are obtained in the recording step.
In the knowledge extraction step, it is also determined whether the incremental user information in the business database exceeds a threshold. If so, it is determined whether any user in the incremental user information belongs to the queried-user information table, and if so, the user information of those users is extracted first.
When the incremental user information in the business database exceeds the threshold, a large amount of user information has been added, which can easily congest subsequent processing. Extracting first the user information of users already in the queried-user information table allows the knowledge graph for those users to be completed first. Because these users are queried more frequently, timely updates reduce the probability of risk-control errors.
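The prioritization rule can be sketched as a stable reordering of the incremental batch. Measuring the increment by the number of records, the record field `user`, and the function name `order_for_extraction` are assumptions for illustration.

```python
def order_for_extraction(increment, queried_users, threshold):
    """Put previously queried users first when the increment exceeds threshold."""
    if len(increment) <= threshold:
        return list(increment)
    # sorted() is stable: False (queried) sorts before True (not queried),
    # and the original order is kept within each group.
    return sorted(increment, key=lambda record: record["user"] not in queried_users)


increment = [{"user": "C"}, {"user": "A"}, {"user": "D"}, {"user": "B"}]
queried_users = {"A", "B"}        # contents of the queried-user information table

ordered = order_for_extraction(increment, queried_users, threshold=3)
print([record["user"] for record in ordered])
```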
The above are merely examples of the present invention, and the present invention is not limited to the field of these embodiments. Common general knowledge, such as well-known specific structures and characteristics in the schemes, is not described in detail here. Those skilled in the art know all the common technical knowledge in the technical field before the application date or the priority date, have access to all the prior art in this field, and are able to apply the conventional experimental means available before that date. In light of the teaching provided in the present application, those skilled in the art can use their own abilities to complete and implement the scheme, and typical known structures or known methods should not become obstacles to their implementing the present invention. It should be noted that those skilled in the art can make several changes and modifications without departing from the structure of the present invention; these also fall within the protection scope of the present invention and do not affect the effect of implementing the invention or the practicability of the patent. The scope of protection of the present application is determined by the contents of the claims, and the description of the embodiments and the like in the specification may be used to interpret the contents of the claims.