Real-time and off-line service unified processing method and device based on big data technology
1. A real-time and off-line service unified processing method based on big data technology is characterized by comprising the following steps:
writing the service data collected in real time into the KAFKA cluster;
processing and integrating data in the KAFKA cluster based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range;
and performing, on the first detail data, the analysis processing corresponding to the real-time service, wherein the second detail data is used by a subsequent offline service for its corresponding analysis processing at a preset time.
2. The method for processing the real-time and offline services uniformly based on big data technology as claimed in claim 1, wherein said processing and integrating the data in the KAFKA cluster based on the real-time service requirement and the offline service range to obtain the first detail data of the real-time service and the second detail data corresponding to the offline service range comprises:
determining the attribute dimensions of the processed and integrated data based on the real-time service requirements and the offline service range;
performing extraction, conversion and dimension association operations on the data in the KAFKA cluster based on the attribute dimensions to obtain the processed and integrated detail data;
and screening first detail data corresponding to the real-time service from the detail data, and screening second detail data corresponding to the offline service range from the detail data.
3. The method according to claim 2, wherein the extraction condition of the extraction operation is determined based on the attribute dimension, the type of the associated data corresponding to the target data in the conversion operation is determined based on the attribute dimension, and the dimension association operation is used to generate a view table associating the integrated data.
4. The method of claim 3, wherein the extracting, converting and dimension associating operations are performed on the data in the KAFKA cluster based on the attribute dimensions to obtain detail data after processing integration, and the method comprises:
controlling a Flink computing framework, based on a target script statement manually written on a preset Web streaming platform, to perform extraction, conversion and dimension association processing on the data in the KAFKA cluster to obtain the processed and integrated detail data;
the target script statement is an SQL statement comprising the instructions for the extraction, conversion and dimension association processing, and is packaged and uploaded to the Flink computing framework by the preset Web streaming platform so that the Flink computing framework converts it into Flink statements for execution.
5. The big data technology-based real-time and offline service unified processing method according to claim 4, wherein the preset Web streaming platform comprises pre-bound and associated topic data in KAFKA cluster and pre-encapsulated function modules;
correspondingly, the process of manually writing the target script statement comprises the following steps:
manually configuring a processing template statement formed on the basis of the topic data and the function modules to obtain the target script statement.
6. The big data technology-based real-time and offline service unified processing method according to any one of claims 2 to 5, wherein the screening out the first detail data corresponding to the real-time service from the detail data includes:
screening out first detail data corresponding to the real-time service from the detail data;
and creating a KAFKA real-time detail wide table, and importing the first detail data into the KAFKA real-time detail wide table for subsequent analysis and processing of the real-time service.
7. The big data technology-based real-time and offline service unified processing method according to any one of claims 2 to 5, wherein the step of screening out the second detail data corresponding to the offline service range from the detail data comprises:
screening out second detail data corresponding to the offline service range from the detail data;
creating an offline real-time detail wide table, and importing the second detail data into the offline real-time detail wide table for subsequent analysis and processing of the offline service;
wherein the offline real-time detail wide table is a HIVE real-time detail wide table or an HDFS real-time detail wide table.
8. A real-time and off-line service unified processing device based on big data technology is characterized by comprising:
the writing unit is used for writing the service data acquired in real time into the KAFKA cluster;
the unified integration unit is used for processing and integrating the data in the KAFKA cluster based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range;
and the branch unit is used for performing, on the first detail data, the analysis processing corresponding to the real-time service, wherein the second detail data is used by a subsequent offline service for its corresponding analysis processing at a preset time.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the big data technology-based real-time and offline business unified processing method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the big data technology based real-time and offline business unified processing method according to any of claims 1 to 7.
Background
Across industries, increasing attention is paid to both the short-term, real-time use and the long-term use of service data collected day to day. For a shopping website, a user's clicking, browsing and purchasing behaviors are critical to forming an accurate user portrait; for ride-hailing software, a user's travel information is likewise critical to predicting that user's ride-hailing demand in a future time period. Some data analysis and mining has high real-time requirements: for example, the hot-search lists of social software and the click counts of trending news need to be updated in real time. Other data analysis and mining has low real-time requirements: the user portraits and ride-hailing demand predictions in the preceding examples can be computed when the server is not busy, for example in the early morning, when browsing volume and order volume are low, using the user data already collected. In still more cases, the service data of business software is simply stored so that it can be reused for future business analysis needs. For example, the analysis and mining demand within the current business scope may be concentrated on only 5 types, but as the company grows, its business expands, it cooperates with other companies, participates in charitable activities and so on, the previously stored service data may be pulled again for analysis and mining of new business types.
Merchants in various industries therefore continuously record their own service data. Some of this data is processed and analyzed by real-time services to obtain real-time results for display; some is stored and later pulled by offline services in a specific period for analysis and mining, yielding results with low real-time requirements for display; and some is only stored for offline services that may be added in the future, and thus holds potential value to be analyzed and mined.
At present, for streaming data and batch data, the traditional approach uses separate, independent acquisition systems and data analysis and mining systems; that is, the real-time data analysis and mining service and the offline data analysis and mining service are split into two independent pipelines, each with its own acquisition, analysis and mining. With this approach, two sets of programs must be developed for acquisition, access and processing in the data acquisition and access stage and the data integration stage. This directly leads to redundancy in the development process, increased development time cost and higher environment resource requirements, and the data is also not unified.
Therefore, a problem that still needs to be solved by those skilled in the art is how to avoid the following drawbacks of processing real-time analysis and mining and offline analysis and mining of the collected data separately: part of the data is processed and calculated repeatedly and redundantly; two separate sets of programs must be developed to access and process the data for real-time and offline analysis and mining respectively, which raises the development time cost of the analysis and mining function; and the data is not unified.
Disclosure of Invention
The invention provides a method and a device for unified processing of real-time and offline services based on big data technology, which are used to solve the problems of the existing approach of processing real-time analysis and mining and offline analysis and mining of collected data separately: part of the data is processed and calculated repeatedly and redundantly; two separate sets of programs must be developed to access and process the data for real-time and offline analysis and mining respectively, so that the development time cost of the analysis and mining function is raised; and the data is not unified. In the method, the service data collected in real time is written into a KAFKA cluster, and the data in the KAFKA cluster is uniformly processed and integrated based on the real-time service requirements and the offline service requirements to obtain detail data. One part of the detail data is screened out as first detail data for analysis and mining by the real-time service, and another part is screened out as second detail data for analysis and mining by the offline service. The first detail data directly enters the analysis processing stage corresponding to the real-time service and is analyzed and mined to obtain the analysis result of the real-time service; the second detail data is stored as a database detail wide table and, after being fetched upon request by the offline service, is analyzed and mined to obtain the analysis result of the offline service.
In this way, the steps of processing and integrating the collected data for the real-time service and for the offline service are merged and unified. This avoids the repeated processing and integration of part of the collected data that occurs when two programs are developed to handle the collected data for the real-time service and the offline service separately. The data collected for the real-time service can be stored directly in a database detail wide table after processing and integration, so the offline service does not need to process and integrate the same data a second time; the integration processing of the offline service is merged into that of the real-time service. Developing two separate programs for the access processing of the real-time service and the offline service is thus avoided, development cost is saved, and the form and structure of the processed and integrated access data are unified across the real-time service and the offline service.
The invention provides a real-time and offline service unified processing method based on big data technology, which comprises the following steps:
writing the service data collected in real time into the KAFKA cluster;
processing and integrating data in the KAFKA cluster based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range;
and performing, on the first detail data, the analysis processing corresponding to the real-time service, wherein the second detail data is used by a subsequent offline service for its corresponding analysis processing at a preset time.
According to the method for processing the real-time and offline services uniformly based on the big data technology, the data in the KAFKA cluster are processed and integrated based on the real-time service requirement and the offline service range to obtain the first detail data of the real-time service and the second detail data corresponding to the offline service range, and the method comprises the following steps:
determining the attribute dimensions of the processed and integrated data based on the real-time service requirements and the offline service range;
performing extraction, conversion and dimension association operations on the data in the KAFKA cluster based on the attribute dimensions to obtain the processed and integrated detail data;
and screening first detail data corresponding to the real-time service from the detail data, and screening second detail data corresponding to the offline service range from the detail data.
According to the real-time and offline service unified processing method based on the big data technology provided by the invention, the extraction condition of the extraction operation is determined based on the attribute dimension, the type of the associated data corresponding to the target data in the conversion operation is determined based on the attribute dimension, and the dimension association operation is used for generating a view table associating the integrated data.
According to the method for processing the real-time and offline services uniformly based on the big data technology, the extraction, conversion and dimension association operations are carried out on the data in the KAFKA cluster based on the attribute dimension to obtain the detail data after processing and integration, and the method comprises the following steps:
controlling a Flink computing framework, based on a target script statement manually written on a preset Web streaming platform, to perform extraction, conversion and dimension association processing on the data in the KAFKA cluster to obtain the processed and integrated detail data;
the target script statement is an SQL statement comprising the instructions for the extraction, conversion and dimension association processing, and is packaged and uploaded to the Flink computing framework by the preset Web streaming platform so that the Flink computing framework converts it into Flink statements for execution.
According to the real-time and offline service unified processing method based on the big data technology, the preset Web streaming platform comprises pre-bound and associated topic data in the KAFKA cluster and pre-encapsulated function modules;
correspondingly, the process of manually writing the target script statement comprises the following steps:
manually configuring a processing template statement formed on the basis of the topic data and the function modules to obtain the target script statement.
According to the method for uniformly processing the real-time service and the off-line service based on the big data technology, the step of screening out the first detail data corresponding to the real-time service from the detail data comprises the following steps:
screening out first detail data corresponding to the real-time service from the detail data;
and creating a KAFKA real-time detail wide table, and importing the first detail data into the KAFKA real-time detail wide table for subsequent analysis and processing of the real-time service.
According to the real-time and offline service unified processing method based on the big data technology, the step of screening out second detail data corresponding to the offline service range from the detail data comprises the following steps:
screening out second detail data corresponding to the offline service range from the detail data;
creating an offline real-time detail wide table, and importing the second detail data into the offline real-time detail wide table for subsequent analysis and processing of the offline service;
the offline real-time detail wide table is a HIVE real-time detail wide table or an HDFS (Hadoop Distributed File System) real-time detail wide table.
The invention also provides a real-time and off-line service unified processing device based on big data technology, comprising:
the writing unit is used for writing the service data acquired in real time into the KAFKA cluster;
the unified integration unit is used for processing and integrating the data in the KAFKA cluster based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range;
and the branch unit is used for performing, on the first detail data, the analysis processing corresponding to the real-time service, wherein the second detail data is used by a subsequent offline service for its corresponding analysis processing at a preset time.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the steps of the real-time and offline service unified processing method based on the big data technology when executing the program.
The present invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method for unified processing of real-time and offline services based on big data technology as described in any of the above.
In the method and the device for unified processing of real-time and offline services based on big data technology provided by the invention, the service data collected in real time is written into the KAFKA cluster; the data in the KAFKA cluster is processed and integrated based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range; and the analysis processing corresponding to the real-time service is performed on the first detail data, while the second detail data is used by a subsequent offline service for its corresponding analysis processing at a preset time. The steps of processing and integrating the collected data for the real-time service and for the offline service are thereby merged and unified, which avoids the repeated processing and integration of part of the collected data that occurs when two programs are developed for the two services separately. The data collected for the real-time service can be stored directly in a database detail wide table after processing and integration, so the offline service does not need to process and integrate the same data a second time; the integration processing of the offline service is merged into that of the real-time service. Developing two separate programs for the access processing of the real-time service and the offline service is avoided, development cost is saved, and the form and structure of the processed and integrated access data are unified across the real-time service and the offline service.
Therefore, the method and the device provided by the invention avoid redundant calculation of repeated access processing of partial data, and only one set of program needs to be developed to carry out uniform access processing on the data which are analyzed and mined in real time and analyzed and mined in offline, so that the development time cost of the analysis and mining function is reduced, and the form and the structure of the integrated access data processing of the real-time service and the offline service are uniform.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flow chart of processing of an offline service and a real-time service provided by the prior art;
Fig. 2 is a schematic flow chart of a method for processing real-time and offline services in a unified manner based on big data technology according to the present invention;
Fig. 3 is a schematic diagram of an operation instruction script for extracting a real-time driving log table structure according to the present invention;
Fig. 4 is a schematic diagram of an operation instruction script for creating a station and platform dimension table according to the present invention;
Fig. 5 is an exemplary diagram of an implementation instruction script for a process of forming driving detail view table data according to the present invention;
Fig. 6 is an exemplary diagram of an implementation instruction script for the KAFKA real-time wide table creation, late vehicle data generation and import process provided by the present invention;
Fig. 7 is an exemplary diagram of an implementation instruction script for creating a schedule table based on HIVE and importing late vehicle data according to the present invention;
Fig. 8 is a schematic flow chart of a method for uniformly collecting and integrating rail transit driving log data based on the batch-stream integrated big data technology provided by the present invention;
Fig. 9 is a schematic structural diagram of a real-time and offline service unified processing apparatus based on big data technology according to the present invention;
Fig. 10 is a schematic physical structure diagram of an electronic device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The existing approach of processing real-time analysis and mining and offline analysis and mining of collected data separately generally suffers from the following problems: part of the data is processed and calculated repeatedly and redundantly; two separate sets of programs must be developed to access and process the data for real-time and offline analysis and mining respectively, which raises the development time cost of the analysis and mining function; and the data is not unified. The unified processing method for real-time and offline services based on big data technology according to the present invention is described below with reference to Figs. 1 to 7.
First, the prior-art flow in which real-time analysis and mining and offline analysis and mining of the collected data are processed separately is described. Fig. 1 is a flow chart of the processing of an offline service and a real-time service provided in the prior art. As shown in Fig. 1, in a conventional data analysis and mining process, the offline service and the real-time service are split into two sets of access processing programs. The flow of the offline service is shown on the left side of the dotted line. The offline service flow shown in Fig. 1 assumes by default that a database or storage module storing the data collected in real time already exists. The data in that database or storage module is first pulled and accessed by a scheduled batch data acquisition system; the pulled data is then extracted, converted and dimension-associated by a batch computing framework specially developed for processing offline-pulled data; an offline detail table is then generated, and the data in the offline detail table is used for subsequent offline data analysis and mining.
Extraction, conversion and dimension association are explained here. Extraction means extracting, from the pulled data, the specific data relevant to the offline service. Corresponding extraction conditions (i.e. query conditions) are usually set according to the offline service, so that data of no use to the offline service is eliminated from the large amount of pulled data. Data conversion follows: the extracted data is associated with locally stored related data, and the types of data to be associated are determined. For example, suppose a shopping website needs to build a portrait of the users who spend more than two thousand on clothing per month. The extraction condition is then that monthly clothing consumption exceeds two thousand, and data conversion integrates the information of the extracted users. This user information includes locally stored basic user information, such as age, registered mailbox and gender, as well as information that must be extracted from the large amount of pulled data, such as daily browsing volume on the shopping site, daily browsing time on the shopping site, and recent consumption related to the user's clothing (such as cosmetics). Finally, dimension association inserts the related data into a detail table indexed by user account or user name, generating a view table that associates the integrated data.
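The three operations can be illustrated with a minimal Python sketch of the shopping example. It is only an illustration under stated assumptions: the record layout, the names PULLED, USER_INFO, extract, convert and associate, and all sample values are hypothetical and do not come from the embodiments.

```python
from collections import defaultdict

# Hypothetical pulled records for one month: (user_id, category, amount).
PULLED = [
    ("u1", "clothing", 1500.0),
    ("u1", "clothing", 800.0),
    ("u2", "clothing", 300.0),
    ("u2", "cosmetics", 2500.0),
    ("u3", "clothing", 2100.0),
]

# Hypothetical locally stored basic user information.
USER_INFO = {
    "u1": {"age": 31, "gender": "F"},
    "u3": {"age": 45, "gender": "M"},
}


def extract(records, category, threshold):
    """Extraction: keep users whose monthly spend in `category` exceeds `threshold`."""
    totals = defaultdict(float)
    for user_id, cat, amount in records:
        if cat == category:
            totals[user_id] += amount
    return {u: t for u, t in totals.items() if t > threshold}


def convert(extracted, user_info):
    """Conversion: associate each extracted user with locally stored attributes."""
    return {
        u: {"clothing_spend": total, **user_info.get(u, {})}
        for u, total in extracted.items()
    }


def associate(converted):
    """Dimension association: build detail rows indexed by user account."""
    return [{"user_id": u, **attrs} for u, attrs in sorted(converted.items())]


detail_rows = associate(convert(extract(PULLED, "clothing", 2000.0), USER_INFO))
for row in detail_rows:
    print(row)
```

Only u1 and u3 pass the extraction condition here, so the resulting detail table has two rows, each carrying both the aggregated spend and the locally stored attributes.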
The right side of the dotted line in Fig. 1 shows the flow of the real-time service. In the real-time service flow shown in Fig. 1, a real-time data monitoring system collects the service data in real time; the collected data is then written in real time by a streaming computing framework, and extraction, conversion and dimension association are performed by a streaming computing framework specially developed for processing real-time pulled data. The extraction, conversion and dimension association operations in this framework are carried out by code that a developer writes after determining the specific requirements of the real-time service. Finally, the generated real-time detail table, composed of the related data used specifically for the analysis and mining of the corresponding real-time service, is sent to a real-time data analysis and mining module for analysis and mining.
As can be seen from Fig. 1, for streaming data and batch data, the traditional approach uses mutually independent acquisition systems and data analysis and mining systems. Two sets of programs must be developed for acquisition, access and processing in the data acquisition and access stage and the data integration stage, which directly leads to redundancy in the development process, increased development time cost and higher environment resource requirements, and leaves the data not unified.
Fig. 2 is a schematic flow chart of the method for processing real-time and offline services in a unified manner based on big data technology provided by the present invention. As shown in Fig. 2, the method includes:
Step 210, writing the service data collected in real time into the KAFKA cluster.
Optionally, the method for unified processing of real-time and offline services based on big data technology provided by the present invention provides unified access and processing for offline services and real-time services. The service data collected by the real-time monitoring system is first written into the KAFKA cluster; that is, all collected data is uniformly accessed into the KAFKA cluster. The KAFKA cluster is briefly introduced here. It is a distributed messaging system characterized by horizontal scalability and high throughput; in the KAFKA cluster there is no concept of a "central master node", and all nodes in the cluster are peers. Some KAFKA concepts referred to below are introduced: 1. Topic: KAFKA classifies messages, each class of messages is called a topic, and consumers can process different topics; here a topic corresponds to the set of service data of one service type in the real-time service or the offline service. 2. Broker: each broker is a KAFKA service instance, and multiple brokers form a KAFKA cluster; messages published by producers are stored in brokers, and consumers pull messages from brokers for consumption. 3. Producer: responsible for generating messages and sending them to a broker. 4. Consumer: responsible for consuming the topic messages in a broker. It should be noted that the service data may come from different fields: it may be user behavior data of a shopping website, route driving log data in a rail transit system, and so on, which is not specifically limited here.
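The relationship between topic, broker, producer and consumer can be pictured with a small in-memory Python model. This is a toy stand-in, not the KAFKA client API: the classes ToyBroker, ToyProducer and ToyConsumer, the topic names and the message fields are all hypothetical, and partitions, replication and consumer groups are deliberately left out.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only message log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def append(self, topic, message):
        self._logs[topic].append(message)

    def read(self, topic, offset):
        """Return the messages of `topic` from position `offset` onward."""
        return self._logs[topic][offset:]


class ToyProducer:
    """Generates messages and sends them to the broker under a topic."""

    def __init__(self, broker):
        self._broker = broker

    def send(self, topic, message):
        self._broker.append(topic, message)


class ToyConsumer:
    """Pulls messages of one topic and remembers how far it has read."""

    def __init__(self, broker, topic):
        self._broker = broker
        self._topic = topic
        self._offset = 0

    def poll(self):
        batch = self._broker.read(self._topic, self._offset)
        self._offset += len(batch)
        return batch


broker = ToyBroker()
producer = ToyProducer(broker)
producer.send("train_log", {"train": "T1", "delay_s": 240})
producer.send("passenger_flow", {"station": "S1", "count": 120})

consumer = ToyConsumer(broker, "train_log")
first_poll = consumer.poll()    # only the train_log message sent so far
producer.send("train_log", {"train": "T2", "delay_s": 0})
second_poll = consumer.poll()   # only the message that arrived after the first poll
```

The consumer pulls at its own pace and sees only the topic it subscribed to, which is the property the unified method relies on when several downstream services read from the same cluster.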
Step 220, processing and integrating the data in the KAFKA cluster based on the real-time service requirement and the offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range.
Optionally, the unified access processing of the offline service and the real-time service provided by the present invention requires processing and integrating the data collected and stored in the KAFKA cluster. The processing and integration must follow certain rules, and these rules are determined based on the real-time service requirements and the offline service range. For example, suppose the service data is route driving log data collected in real time in a rail transit system and then stored in the KAFKA cluster, the real-time service requirement is to mine information on trains running early or late, and the offline service range includes statistics on the causes of train faults and analysis of passenger flow. The rules then need to use early or late running as a query condition to extract all information related to early and late trains from the data stored in the KAFKA cluster, and also need to use train faults and passenger flow distribution intervals as query conditions to extract train fault information and passenger flow information from that data. Extraction, conversion and dimension association under such specified query conditions constitute fine-grained processing and integration. The other data in the KAFKA cluster can be extracted, converted and dimension-associated at coarse granularity according to offline services that may be added in the future, so that corresponding detail data is prepared in advance for them; although the granularity is coarse, the data can be extracted further when it is actually needed. Overall, completing this processing and integration in advance, together with that of the real-time service, reduces the total amount of calculation and prevents the same data from being processed and integrated multiple times.
Therefore, the first detail data is the related data corresponding to the current real-time service requirements, while the second detail data comprises all the data pulled from the KAFKA cluster, simply in the form produced by the unified processing and integration. The second detail data includes both the related data corresponding to the currently existing offline service requirements and, at coarser granularity, the related data corresponding to offline service requirements that may be added in the future; the term offline service range is therefore used to cover both the currently existing offline services and the offline services added in the future.
It should be noted that the first detail data and the second detail data are detail table data with a uniform structure. The difference is that the first detail data is stored in the KAFKA cluster as streaming data to form a KAFKA real-time detail table, which the subsequent real-time service pulls directly for analysis and mining, whereas the second detail data is written in separate batches to a database to form a database detail table, which the subsequent offline service pulls directly for analysis and mining.
Step 230, performing analysis processing corresponding to the real-time service on the first detail data, where the second detail data is used for performing corresponding analysis processing on the subsequent offline service at a preset time.
Optionally, the real-time service acts as a consumer and pulls the first detail data directly, and analysis and mining corresponding to the real-time service are performed on it. The second detail data is stored as a database detail table and waits to be pulled by the corresponding offline service at the preset time, after which analysis and mining corresponding to the offline service are performed on the pulled data. In other words, once the detail data has been generated by unified processing and integration, the flow branches. One branch corresponds to the real-time service: the streaming first detail data is pulled in real time for real-time analysis and mining. The other branch corresponds to the offline service: the second detail data is written to the database in separate batches to form the database detail table, and when the scheduled analysis time arrives, the offline service pulls the database detail table and analyzes and mines it.
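The branch flow described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the class OfflineDetailTable, the function run_branches, the batch size and the record fields train_id and schedule are all hypothetical names chosen for illustration, and an in-memory list stands in for both the KAFKA real-time detail table and the database detail table.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class OfflineDetailTable:
    """In-memory stand-in (hypothetical) for the database detail table."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.batches: List[List[Record]] = []
        self._buffer: List[Record] = []

    def write(self, row: Record) -> None:
        # Buffer rows and store them in separate batches.
        self._buffer.append(row)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.batches.append(self._buffer)
            self._buffer = []

    def read_all(self) -> List[Record]:
        return [row for batch in self.batches for row in batch]


def run_branches(
    first_detail: List[Record],
    second_detail: List[Record],
    on_realtime: Callable[[Record], None],
    table: OfflineDetailTable,
) -> None:
    # Real-time branch: each row is handed to the consumer immediately.
    for row in first_detail:
        on_realtime(row)
    # Offline branch: rows are stored in batches and read later.
    for row in second_detail:
        table.write(row)
    table.flush()


second = [
    {"train_id": "T1", "schedule": 3},
    {"train_id": "T2", "schedule": 1},
    {"train_id": "T3", "schedule": 4},
]
first = [row for row in second if row["schedule"] in (3, 4)]

alerts: List[Record] = []
table = OfflineDetailTable(batch_size=2)
run_branches(first, second, alerts.append, table)

# Later, at the preset time, the offline service pulls the stored table.
late_total = sum(1 for row in table.read_all() if row["schedule"] in (3, 4))
```

In this sketch the real-time consumer sees only the late trains as they arrive, while the offline job reads the complete table once the batches have been written.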
In the method provided by the invention, service data collected in real time is written into the KAFKA cluster; the data in the KAFKA cluster is processed and integrated based on the real-time service requirement and the offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range; and analysis processing corresponding to the real-time service is performed on the first detail data, while the second detail data is used for the corresponding analysis processing of the subsequent offline service at a preset time. Because the processing and integration steps for data acquired by the real-time service and the offline service are merged, part of the acquired data no longer has to be processed and integrated twice by two separately developed programs. Data acquired for the real-time service can be stored directly in a database detail wide table after processing and integration, so the offline service does not have to process and integrate the same data again; the integration processing of the offline service is folded into that of the real-time service. This avoids developing two programs for the access processing of the real-time service and the offline service, saves development cost, and unifies the table structure of the integrated access data for both services. The method therefore avoids the redundant computation of repeated access processing of part of the data, requires only one program for the unified access processing of data used in real-time and offline analysis and mining, reduces the development time of the analysis and mining functions, and gives the integrated access data of the real-time service and the offline service a unified table structure.
Based on the above embodiment, in the method, the processing and integrating data in the KAFKA cluster based on the real-time service requirement and the offline service range to obtain the first detail data of the real-time service and the second detail data corresponding to the offline service range includes:
determining attribute dimensionality after data processing integration based on real-time service requirements and an offline service range;
extracting, converting and carrying out dimension correlation operation on the data in the KAFKA cluster based on the attribute dimension to obtain detail data after processing and integration;
and screening first detail data corresponding to the real-time service from the detail data, and screening second detail data corresponding to the offline service range from the detail data.
Optionally, this embodiment further defines how the data in the KAFKA cluster is processed and integrated according to the real-time service requirement and the offline service range, that is, how the processing and integration rule is derived from them. The example in which the service data is running log data in a rail transit system is used again. If the real-time service needs to determine in real time whether each running train is early or late, the attribute dimensions of the data to be extracted include the running log data, the station dimension table and the platform dimension table, so that the early or late status of each train can be published to the corresponding station and platform. Therefore, once the real-time service requirement is determined to be broadcasting early and late train information in real time at the corresponding station platform, the attribute dimensions after processing and integration for the real-time service can be determined to be the running log data, the station dimension table and the platform dimension table. The extraction, conversion and dimension association operations can then set the corresponding extraction query conditions, association mode and view form of the integrated detail data according to the required running log data, station dimension table and platform dimension table. Similarly, for the offline service, the attribute dimensions after processing and integration are also determined. If the offline service is statistics on the number of late trains by year, month, day, station, train and other dimensions, the corresponding attribute dimensions are the train running log data, the station dimension table and the platform dimension table, and the extraction, conversion and dimension association operations can set the corresponding extraction query conditions (for example, the late arrivals of a given train number in a given month, or of a given train number at a given platform in a given year), association mode and view form of the integrated detail data accordingly.
After the data in the KAFKA cluster has been extracted, converted and dimension-associated according to the processing and integration rule determined by the real-time service requirement and the offline service range, a detail data set is obtained. The first detail data, which the real-time service uses immediately, is selected from this set. The confirmed offline detail data, which existing offline services will use at a corresponding future time, and the candidate offline detail data, which new offline services added in the future may use, are also selected from the set and combined into the second detail data. Processed and integrated detail data is thus available to both existing and future offline services, which saves steps in offline analysis and mining. The first detail data is the data related to the current real-time service requirement, while the second detail data covers all data pulled from the KAFKA cluster and is generated by the same unified processing and integration. The second detail data includes the data related to the existing offline service requirements and, at a coarser granularity, the data related to offline service requirements that may be added in the future, so the offline service range covers both the existing offline services and future extensions.
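The screening step can be sketched as follows, under stated assumptions. The function screen_detail_data, the two predicates, the granularity label and the record fields (train_id, schedule, fault) are hypothetical; the sketch only illustrates that the first detail data is a filtered subset, while the second detail data keeps every row and marks whether it serves an existing offline service (fine) or is kept for possible future services (coarse).

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def screen_detail_data(
    detail_rows: List[Record],
    realtime_pred: Callable[[Record], bool],
    offline_pred: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record]]:
    """Screen first and second detail data from one unified detail data set."""
    # First detail data: only rows the real-time service needs right now.
    first_detail = [row for row in detail_rows if realtime_pred(row)]
    # Second detail data: all rows, labelled by how they were prepared.
    second_detail = [
        {**row, "granularity": "fine" if offline_pred(row) else "coarse"}
        for row in detail_rows
    ]
    return first_detail, second_detail


detail_rows = [
    {"train_id": "T1", "schedule": 3, "fault": None},
    {"train_id": "T2", "schedule": 1, "fault": "door"},
    {"train_id": "T3", "schedule": 1, "fault": None},
]

first_detail, second_detail = screen_detail_data(
    detail_rows,
    realtime_pred=lambda row: row["schedule"] in (3, 4),  # late trains
    offline_pred=lambda row: row["fault"] is not None,    # fault statistics
)
```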
Based on the above embodiment, in the method, the extraction condition of the extraction operation is determined based on the attribute dimension, and the type of the associated data corresponding to the target data in the conversion operation is determined based on the attribute dimension, where the dimension association operation is used to generate a view table associating the integrated data.
Optionally, this embodiment further defines the specific actions of the extraction, conversion and dimension association operations in the processing and integration. The extraction operation first determines the data object to extract: a query condition for the target data object is derived from the attribute dimensions and is then used to extract the target data object from the streaming big data in the KAFKA cluster. The conversion operation determines, based on the attribute dimensions, the type of associated data corresponding to the target data, and the dimension association operation generates a view of the associated and integrated data. Taking running log data in a rail transit system as the service data again, the processing and integration rule for the real-time service that broadcasts early and late train information at the corresponding station platform is as follows: first, the data for the early/late status field schedule (3 indicates late and 4 indicates seriously late) and the train arrival flag arflg (1 indicates that the train has arrived and stopped, and 3 indicates that the train has left the station) is extracted from the running log; second, the extracted data is joined (JOIN) with the station dimension table and the platform dimension table to obtain unified detail data; and finally, the integrated detail data is stored in the tmp_trainlog_with view.
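The three operations can be illustrated with a small, self-contained sketch. Python's built-in sqlite3 module is used here purely as a stand-in for the SQL engine; it is not Flink, and the actual scripts of the invention are those shown in the figures. The table names dim_station and dim_platform, all column names other than schedule and arflg, and the reading of the extraction condition as a filter on schedule IN (3, 4) and arflg IN (1, 3) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Train log source table (column layout is hypothetical).
    CREATE TABLE ods_trainlog (
        train_id TEXT, station_id INTEGER, platform_id INTEGER,
        schedule INTEGER, arflg INTEGER
    );
    -- Station and platform dimension tables (hypothetical names).
    CREATE TABLE dim_station (station_id INTEGER PRIMARY KEY, station_name TEXT);
    CREATE TABLE dim_platform (platform_id INTEGER PRIMARY KEY, platform_name TEXT);

    INSERT INTO ods_trainlog VALUES
        ('T1', 1, 10, 3, 1),
        ('T2', 1, 11, 1, 1),
        ('T3', 2, 20, 4, 3),
        ('T4', 2, 20, 3, 2);
    INSERT INTO dim_station VALUES (1, 'Central'), (2, 'North');
    INSERT INTO dim_platform VALUES (10, 'P1'), (11, 'P2'), (20, 'P1');

    -- Extraction (WHERE), dimension association (JOIN) and storage as a view.
    CREATE VIEW tmp_trainlog_with AS
    SELECT l.train_id, s.station_name, p.platform_name, l.schedule, l.arflg
    FROM ods_trainlog AS l
    JOIN dim_station AS s ON s.station_id = l.station_id
    JOIN dim_platform AS p ON p.platform_id = l.platform_id
    WHERE l.schedule IN (3, 4) AND l.arflg IN (1, 3);
    """
)

late_rows = conn.execute(
    "SELECT train_id, station_name, platform_name "
    "FROM tmp_trainlog_with ORDER BY train_id"
).fetchall()
```

With this sample data the view keeps T1 and T3; T2 is on time and T4 has an arrival flag outside the extracted values.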
Based on the above embodiment, in the method, the extracting, converting, and dimension associating the data in the KAFKA cluster based on the attribute dimension to obtain the detail data after processing and integration includes:
controlling a Flink computing framework, by means of a manually written target script statement on a preset Web streaming platform, to perform extraction, conversion and dimension association processing on the data in the KAFKA cluster to obtain the detail data after processing and integration;
the target script statement is an SQL statement comprising the instructions for the extraction, the conversion and the dimension association processing, and is packaged and uploaded to the Flink computing framework by the preset Web streaming platform so that the Flink computing framework converts it into a Flink statement for execution.
Specifically, in the prior art the offline service and the real-time service are processed separately. For the stages of extraction, conversion and dimension association that integrate data into detail data, the offline service uses batch processing and storage frameworks such as MapReduce, Spark and Hive, while the real-time service uses computing frameworks such as Storm and Flink. In contrast, the present invention defines a table structure script on a self-built preset Web streaming platform; by following the Flink syntax rules alone, the script binds and associates the KAFKA topic data, so that the KAFKA running log topic data is monitored and acquired in real time. Specifically, a developer creates a processing and integration task on the preset Web streaming platform and writes the operation code for the extraction, conversion and dimension association operations of a specific real-time service, which yields the target script statement. The preset Web streaming platform packages the edited target script statement and uploads it to the Flink computing framework, which converts the unpacked script into Flink statements and executes their instructions; the instructions include extracting, converting and dimension-associating the corresponding data in the KAFKA cluster to form the detail data. During packaging, the background system of the Web streaming platform automatically packages the target script statement and converts it into a DAG (directed acyclic graph) that the Flink engine recognizes.
The foregoing describes a method for implementing unified extraction, conversion and dimension association processing on collected data for real-time services and offline services.
Based on the above embodiment, in the method, the preset Web streaming platform includes topic data in the KAFKA cluster that is bound and associated in advance, and functional modules that are encapsulated in advance;
correspondingly, the process of manually writing the target script statement comprises the following steps:
and manually configuring a processing template statement formed on the basis of the topic data and the functional modules to obtain the target script statement.
Optionally, the preset Web streaming platform allows developers to write scripts on it in SQL. A script defines table structures to implement the extraction, conversion and dimension association of the big data in the KAFKA cluster, and by following the Flink syntax rules alone it binds and associates the KAFKA topic data, so that the KAFKA running log topic data is monitored and acquired in real time. Because the preset Web streaming platform is bound and associated in advance with the topic data in the KAFKA cluster and integrates pre-packaged functional modules, a developer writing a script on the platform only needs to call a processing template statement and fill in the configuration information to obtain the target script statement. This embodiment thus provides an optimization for building the preset Web streaming platform: by encapsulating functional modules and binding the associated KAFKA topics, processing template statements are made available on the platform, and developers can assemble and configure them directly into target script statements, which reduces their coding workload.
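How a processing template statement might be turned into a target script statement by configuration alone can be sketched as follows. The template text, the function render_source_statement and all configuration values (column names, server address, consumer group) are hypothetical and do not reproduce the platform of the invention; the WITH options follow the option names of the Flink SQL Kafka connector.

```python
from string import Template
from typing import Dict

# Hypothetical processing template for a table bound to a KAFKA topic.
KAFKA_SOURCE_TEMPLATE = Template(
    """CREATE TABLE $table_name (
$columns
) WITH (
  'connector' = 'kafka',
  'topic' = '$topic',
  'properties.bootstrap.servers' = '$bootstrap_servers',
  'properties.group.id' = '$group_id',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
)"""
)


def render_source_statement(
    table_name: str,
    topic: str,
    columns: Dict[str, str],
    bootstrap_servers: str,
    group_id: str,
) -> str:
    """Fill the template with configuration to obtain a target script statement."""
    column_lines = ",\n".join(
        f"  {name} {sql_type}" for name, sql_type in columns.items()
    )
    return KAFKA_SOURCE_TEMPLATE.substitute(
        table_name=table_name,
        columns=column_lines,
        topic=topic,
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
    )


statement = render_source_statement(
    table_name="ods_trainlog",
    topic="ods_trainlog",
    columns={"train_id": "STRING", "schedule": "INT", "arflg": "INT"},
    bootstrap_servers="localhost:9092",  # placeholder address
    group_id="trainlog-consumer",        # placeholder consumer group
)
```

The developer supplies only the configuration values; the statement structure comes from the template, which is the workload reduction this embodiment describes.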
Based on the above embodiment, in the method, the screening out the first detail data corresponding to the real-time service from the detail data includes:
screening out first detail data corresponding to the real-time service from the detail data;
and creating a KAFKA real-time detail wide table, and importing the first detail data into the KAFKA real-time detail wide table for subsequent analysis and processing of the real-time service.
Optionally, the detail data used by the real-time service is taken from the obtained detail data as the first detail data. For the real-time service, the real-time data source is stored in KAFKA, which defines the storage form of the first detail data: the real-time detail data is stored directly in a real-time detail wide table in the existing KAFKA cluster, and the corresponding connector is configured as the upsert-kafka type so that the table supports both read and write operations. For example, with the service data described above, namely line running log data collected in real time in a rail transit system and stored in the KAFKA cluster, and with the real-time service requirement of analyzing and mining early and late train information, the big data in the KAFKA cluster is first pulled, converted and dimension-associated. Fig. 3 is a schematic diagram of an operation instruction script for extracting the real-time running log table structure provided by the invention. As shown in fig. 3, the structure of the created ods_trainlog table needs to be determined according to the data structure and field types of the KAFKA topic; the connector of the created table is of the KAFKA type, and the KAFKA topic bound to the table is named ods_trainlog. To build the detail table, besides the ods_trainlog real-time running log table, the method of the invention needs DIM dimension tables to be constructed in advance, and their dimension data can be stored in a structured database such as MySQL as required. Fig. 4 is a schematic diagram of an operation instruction script for creating the station and platform dimension tables provided by the invention; the scripts on the left and right sides of fig. 4 define the station dimension table and the platform dimension table, respectively. Taking the DIM dimension tables required for constructing the running log detail wide table as an example, the dimension table data is stored in a JDBC-type database, and the number and types of the table fields must correspond one to one with those in the JDBC database; the detailed script is shown in fig. 4.
Regarding the unified integration of early-stage data: in prior-art schemes, the offline service and the real-time service each rely on an independent program for the extraction, conversion and association stages of early-stage data. In contrast, the present invention integrates the early-stage data uniformly and forms unified detail wide table data by developing only a FlinkSQL script. Fig. 5 is an exemplary diagram of an instruction script that forms the running log detail view data. As shown in fig. 5, the unified processing of the extraction, conversion and integration stages of the early-stage data proceeds as follows:
first, the data for the early/late status field schedule (3 indicates late and 4 indicates seriously late) and the train arrival flag arflg (1 indicates that the train has arrived and stopped, and 3 indicates that the train has left the station) is extracted from the running log; second, the extracted data is joined (JOIN) with the station dimension table and the platform dimension table to obtain unified detail data; and finally, the integrated detail data is stored in the tmp_trainlog_with view.
Fig. 6 is an exemplary diagram of an instruction script for creating the KAFKA real-time wide table and generating and importing late train data provided by the invention. As shown in fig. 6, because the real-time data source is stored in KAFKA, the table should be configured as the upsert-kafka type so that it supports read and write operations. The current real-time service requirement for the running log is a real-time count of late trains per minute, so the data is divided into windows using the TUMBLE (tumbling) window. To ensure data uniqueness, the PRIMARY KEY in this example is set to time_window_start, time_window_end, train id, server number and order number.
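The effect of a one-minute tumbling window combined with a primary key can be illustrated in plain Python. This is a sketch of the windowing arithmetic only, not the FlinkSQL script of fig. 6: the functions tumble_start and count_late_per_window are hypothetical, timestamps are plain seconds, and the key is simplified to the window bounds plus a train identifier (the server number and order number of the original key are omitted).

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

WINDOW_SECONDS = 60  # one-minute tumbling window


def tumble_start(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the fixed, non-overlapping window containing event_ts."""
    return event_ts - event_ts % size


def count_late_per_window(
    events: Iterable[Tuple[int, str, int]],
) -> Dict[Tuple[int, int], int]:
    """Count distinct late trains per window; events are (ts, train_id, schedule)."""
    seen: Set[Tuple[int, int, str]] = set()
    counts: Counter = Counter()
    for ts, train_id, schedule in events:
        if schedule not in (3, 4):  # keep late and seriously late trains only
            continue
        start = tumble_start(ts)
        key = (start, start + WINDOW_SECONDS, train_id)
        if key in seen:  # same key again: an upsert, not a new row
            continue
        seen.add(key)
        counts[(start, start + WINDOW_SECONDS)] += 1
    return dict(counts)


events = [
    (0, "T1", 3),
    (30, "T1", 3),   # same train, same window: counted once
    (45, "T2", 4),
    (59, "T3", 1),   # on time: ignored
    (60, "T1", 3),   # next window
    (119, "T4", 4),
]
late_per_minute = count_late_per_window(events)
```

Each event belongs to exactly one window, and repeated records with the same key update the existing entry rather than adding to the count, which is why the key is needed for uniqueness.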
Based on the above embodiment, in the method, the screening out second detail data corresponding to the offline service range from the detail data includes:
screening out second detail data corresponding to the offline service range from the detail data;
creating an offline real-time detail wide table, and importing the second detail data into the offline real-time detail wide table for subsequent analysis and processing of the offline service;
and the offline real-time detail wide table is a HIVE real-time detail wide table or an HDFS real-time detail wide table.
Optionally, for the second detail data corresponding to the offline service range, a corresponding offline detail wide table is created in the designated database, and the second detail data is then stored in it in separate batches; the designated database is HIVE or HDFS. Continuing the example in which train running log data is the service data, fig. 7 is an exemplary diagram of an instruction script, provided by the invention, that creates a HIVE-based detail table and imports late train data. As shown in fig. 7, to meet the service requirements of offline analysis and mining of running log data, the data can be separated into batches and stored in HIVE or HDFS. The current offline service requirement for the running log is to count late trains by year, month, day, station, train and other dimensions. With the method of the invention, it is only necessary to configure the script with the HIVE connector and set daily partitioning; the running log detail data is then stored in the HIVE partitioned table every day, providing the data required by the offline running log service.
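Daily partitioning and a later offline count can be sketched as follows. The sketch uses in-memory dictionaries rather than HIVE or HDFS, and the functions partition_by_day and late_count_by_station, the dt=YYYY-MM-DD partition naming and the record fields are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, List

Record = Dict[str, object]


def partition_by_day(rows: List[Record]) -> Dict[str, List[Record]]:
    """Group detail rows into one partition per calendar day."""
    partitions: Dict[str, List[Record]] = defaultdict(list)
    for row in rows:
        day = datetime.fromisoformat(str(row["event_time"])).date().isoformat()
        partitions[f"dt={day}"].append(row)
    return dict(partitions)


def late_count_by_station(partition_rows: List[Record]) -> Dict[str, int]:
    """Offline statistic: number of late records per station in one partition."""
    counts: Counter = Counter()
    for row in partition_rows:
        if row["schedule"] in (3, 4):
            counts[str(row["station"])] += 1
    return dict(counts)


detail_rows = [
    {"event_time": "2024-03-01T08:15:00", "station": "Central", "train_id": "T1", "schedule": 3},
    {"event_time": "2024-03-01T09:00:00", "station": "Central", "train_id": "T2", "schedule": 4},
    {"event_time": "2024-03-01T09:30:00", "station": "North", "train_id": "T3", "schedule": 1},
    {"event_time": "2024-03-02T08:15:00", "station": "North", "train_id": "T1", "schedule": 3},
]

partitions = partition_by_day(detail_rows)
# At the preset time, the offline job reads a single day's partition.
march_first = late_count_by_station(partitions["dt=2024-03-01"])
```

Because each day's data lands in its own partition, an offline job scheduled for a given day reads only that partition instead of scanning the whole table.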
Based on the above embodiments, the present invention provides a method for uniformly acquiring and integrating rail transit running log data based on batch-stream integrated big data technology. Fig. 8 is a schematic flow chart of this method. As shown in fig. 8, the flow includes the following steps: 1. monitoring and acquiring the running log data of the rail transit line in real time and writing it into the KAFKA cluster; 2. developing a Flink script on the self-developed Web streaming platform to consume the KAFKA running log topic data in real time; 3. integrating the early-stage data, in which the data is uniformly extracted, converted, dimension-associated and integrated by a FlinkSQL script to form unified detail data; 4. separating the offline service data from the real-time service data, in which the integrated detail data is written to HIVE/HDFS and to KAFKA respectively, the data for real-time service requirements is analyzed and mined from the KAFKA data source, and the data for offline service requirements is analyzed and mined from HIVE/HDFS. Compared with prior-art methods, this method provides a unified way to acquire, access and integrate the data required by the offline service and the real-time service, and, being developed on the self-developed Web platform, it achieves unified data acquisition and construction of the detail wide table data with a single set of SQL scripts.
The real-time and offline service unified processing device based on big data technology provided by the present invention is described below, and the real-time and offline service unified processing device based on big data technology described below and the real-time and offline service unified processing method based on big data technology described above can be referred to correspondingly.
Fig. 9 is a schematic structural diagram of a real-time and offline service unified processing apparatus based on big data technology according to the present invention, as shown in fig. 9, the apparatus includes a writing unit 910, a unified integration unit 920, and a branching unit 930, wherein,
the writing unit 910 is configured to write the service data acquired in real time into the KAFKA cluster;
the unified integration unit 920 is configured to process and integrate data in the KAFKA cluster based on a real-time service requirement and an offline service range, so as to obtain first detail data of the real-time service and second detail data corresponding to the offline service range;
the branch unit 930 is configured to perform analysis processing corresponding to the real-time service on the first detail data, and the second detail data is used for performing corresponding analysis processing on the subsequent offline service at a preset time.
In the device provided by the invention, service data collected in real time is written into the KAFKA cluster; the data in the KAFKA cluster is processed and integrated based on the real-time service requirement and the offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range; and analysis processing corresponding to the real-time service is performed on the first detail data, while the second detail data is used for the corresponding analysis processing of the subsequent offline service at a preset time. Because the processing and integration steps for data acquired by the real-time service and the offline service are merged, part of the acquired data no longer has to be processed and integrated twice by two separately developed programs. Data acquired for the real-time service can be stored directly in a database detail wide table after processing and integration, so the offline service does not have to process and integrate the same data again; the integration processing of the offline service is folded into that of the real-time service. This avoids developing two programs for the access processing of the real-time service and the offline service, saves development cost, and unifies the table structure of the integrated access data for both services. The device therefore avoids the redundant computation of repeated access processing of part of the data, requires only one program for the unified access processing of data used in real-time and offline analysis and mining, reduces the development time of the analysis and mining functions, and gives the integrated access data of the real-time service and the offline service a unified table structure.
Based on the above embodiment, in the apparatus, the unified integration unit is specifically configured to:
determining attribute dimensionality after data processing integration based on real-time service requirements and an offline service range;
extracting, converting and carrying out dimension correlation operation on the data in the KAFKA cluster based on the attribute dimension to obtain detail data after processing and integration;
and screening first detail data corresponding to the real-time service from the detail data, and screening second detail data corresponding to the offline service range from the detail data.
Based on the above-described embodiments, in this device,
the extraction condition of the extraction operation is determined based on the attribute dimension, the type of associated data corresponding to the target data in the conversion operation is determined based on the attribute dimension, and the dimension association operation is used to generate a view table of the associated and integrated data.
Based on the above embodiment, in the apparatus, the extracting, converting, and dimension associating operations on the data in the KAFKA cluster based on the attribute dimension to obtain the detail data after processing and integration includes:
controlling a Flink computing framework, by means of a manually written target script statement on a preset Web streaming platform, to perform extraction, conversion and dimension association processing on the data in the KAFKA cluster to obtain the detail data after processing and integration;
the target script statement is an SQL statement comprising the instructions for the extraction, the conversion and the dimension association processing, and is packaged and uploaded to the Flink computing framework by the preset Web streaming platform so that the Flink computing framework converts it into a Flink statement for execution.
Based on the above embodiment, in the apparatus, the preset Web streaming platform includes topic data in the KAFKA cluster that is bound and associated in advance, and functional modules that are encapsulated in advance;
correspondingly, the process of manually writing the target script statement comprises the following steps:
and manually configuring a processing template statement formed on the basis of the topic data and the functional modules to obtain the target script statement.
Based on the above embodiment, in the apparatus, the screening out the first detail data corresponding to the real-time service from the detail data includes:
screening out first detail data corresponding to the real-time service from the detail data;
and creating a KAFKA real-time detail wide table, and importing the first detail data into the KAFKA real-time detail wide table for subsequent analysis and processing of the real-time service.
Based on the above embodiment, in the apparatus, the screening out the second detail data corresponding to the offline service range from the detail data includes:
screening out second detail data corresponding to the offline service range from the detail data;
creating an offline real-time detail wide table, and importing the second detail data into the offline real-time detail wide table for subsequent analysis and processing of the offline service;
and the offline real-time detail wide table is a HIVE real-time detail wide table or an HDFS real-time detail wide table.
Fig. 10 is a schematic physical structure diagram of an electronic device provided in the present invention, and as shown in fig. 10, the electronic device may include: a processor (processor)1010, a communication Interface (Communications Interface)1020, a memory (memory)1030, and a communication bus 1040, wherein the processor 1010, the communication Interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may call logic instructions in the memory 1030 to perform a big data technology-based real-time and offline business unified processing method, the method comprising: writing the service data collected in real time into the KAFKA cluster; processing and integrating data in the KAFKA cluster based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range; and analyzing and processing the first detail data corresponding to the real-time service, wherein the second detail data is used for analyzing and processing the subsequent off-line service at a preset time.
Furthermore, the logic instructions in the memory 1030 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the real-time and off-line service unified processing method based on big data technology provided by the above method embodiments, the method comprising: writing the service data collected in real time into the KAFKA cluster; processing and integrating data in the KAFKA cluster based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range; and analyzing and processing the first detail data corresponding to the real-time service, wherein the second detail data is used for analyzing and processing the subsequent off-line service at a preset time.
In still another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the real-time and off-line service unified processing method based on big data technology, the method comprising: writing the service data collected in real time into the KAFKA cluster; processing and integrating data in the KAFKA cluster based on real-time service requirements and an offline service range to obtain first detail data of the real-time service and second detail data corresponding to the offline service range; and analyzing and processing the first detail data corresponding to the real-time service, wherein the second detail data is used for analyzing and processing the subsequent off-line service at a preset time.
The above-described apparatus embodiments are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.