Data processing method, apparatus, and device for an online analytical processing engine
1. A data processing method for an online analytical processing engine, comprising:
performing dimensional modeling on operation data by using an online analytical processing engine to obtain a corresponding data report; and
storing the data report in a database associated with the online analytical processing engine, so that the data report can be queried through the online analytical processing engine.
2. The method of claim 1, wherein performing dimensional modeling on the operation data by using the online analytical processing engine to obtain the corresponding data report comprises:
embedding an offline computing engine in the online analytical processing engine; and
performing dimensional modeling on the operation data by using the offline computing engine embedded in the online analytical processing engine to obtain the corresponding data report.
3. The method of claim 2, wherein performing dimensional modeling on the operation data by using the offline computing engine embedded in the online analytical processing engine to obtain the corresponding data report comprises, by using the offline computing engine embedded in the online analytical processing engine:
performing dimensional modeling on the operation data to obtain a corresponding fact table and a corresponding dimension table; and
associating the dimension table with the fact table to obtain the corresponding data report.
4. The method of claim 1, wherein storing the data report in a database associated with the online analytical processing engine comprises:
storing the data report in an application data layer of the database associated with the online analytical processing engine.
5. The method of claim 1, further comprising:
in response to a report query request hitting a column of an aggregation query preprocessing task, querying the data report by using the online analytical processing engine.
6. The method of claim 5, further comprising:
in response to the report query request not hitting a column of the aggregation query preprocessing task, querying the data report by using a preset offline computing engine.
7. A data processing apparatus for an online analytical processing engine, comprising:
a data modeling module configured to perform dimensional modeling on operation data by using an online analytical processing engine to obtain a corresponding data report; and
a report storage module configured to store the data report in a database associated with the online analytical processing engine, so that the data report can be queried through the online analytical processing engine.
8. The apparatus of claim 7, wherein the data modeling module comprises:
an engine processing unit configured to embed an offline computing engine in the online analytical processing engine; and
a data modeling unit configured to perform dimensional modeling on the operation data by using the offline computing engine embedded in the online analytical processing engine to obtain the corresponding data report.
9. The apparatus of claim 8, wherein the data modeling unit comprises:
a table generation subunit configured to perform dimensional modeling on the operation data by using the offline computing engine embedded in the online analytical processing engine to obtain a corresponding fact table and a corresponding dimension table; and
a table association subunit configured to associate the dimension table with the fact table by using the offline computing engine embedded in the online analytical processing engine to obtain the corresponding data report.
10. The apparatus of claim 7, wherein the report storage module is further configured to:
store the data report in an application data layer of the database associated with the online analytical processing engine.
11. The apparatus of claim 7, further comprising:
a first report query module configured to, in response to a report query request hitting a column of an aggregation query preprocessing task, query the data report by using the online analytical processing engine.
12. The apparatus of claim 11, further comprising:
a second report query module configured to, in response to the report query request not hitting a column of the aggregation query preprocessing task, query the data report by using a preset offline computing engine.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
Background
The business data of an internet company usually involves multi-source data such as logs and back-end databases. Problems such as widely scattered data sources, poor metric extensibility, non-standard event tracking, duplicated development, slow queries, difficult backtracking, and demand-driven construction have increasingly become pain points of offline data construction in internet companies.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, device, storage medium, and computer program product for an online analytics processing engine.
According to an aspect of the present disclosure, there is provided a data processing method for an online analytical processing engine, including: performing dimensional modeling on operation data by using an online analytical processing engine to obtain a corresponding data report; and storing the data report in a database associated with the online analytical processing engine, so that the data report can be queried through the online analytical processing engine.
According to another aspect of the present disclosure, there is provided a data processing apparatus for an online analytical processing engine, including: a data modeling module configured to perform dimensional modeling on operation data by using an online analytical processing engine to obtain a corresponding data report; and a report storage module configured to store the data report in a database associated with the online analytical processing engine, so that the data report can be queried through the online analytical processing engine.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 illustrates a system architecture suitable for embodiments of the present disclosure;
FIG. 2 illustrates a flow diagram of a data processing method for an online analytical processing engine according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a report query for an online analytics processing engine, in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of dimensional modeling in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of data warehouse layering according to an embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of a data processing apparatus for an online analytics processing engine in accordance with an embodiment of the present disclosure; and
FIG. 7 illustrates a block diagram of an electronic device for implementing a data processing method for an online analytics processing engine of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be understood that the offline data construction of large internet companies at present generally adopts the following two ways:
In the first way, offline ETL (Extract-Transform-Load) is performed based on the Hadoop MapReduce computing engine or the Spark computing engine, where ETL describes the process of extracting, transforming, and loading data from a source end to a destination end. This is the current mainstream offline data processing scheme and can be used for dimensional modeling, data warehouse layering, complex logic processing, multi-format conversion, and ETL of PB-level data volumes.
It should be appreciated that Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. It allows a user to develop distributed programs without knowing the underlying details of the distributed system.
It should also be understood that the MapReduce computation engine is a distributed computation engine implemented based on the MapReduce algorithm.
It should also be appreciated that the Spark calculation engine is a fast, general purpose calculation engine designed specifically for large scale data processing.
In the second way, an offline data processing scheme is based on an OLAP (Online Analytical Processing) engine, such as ClickHouse or Kylin. This is currently a popular offline data processing scheme and can be used for multi-dimensional data query, pre-calculation over large data volumes, ad hoc query, and the like.
It should be understood that ClickHouse is a columnar database management system for OLAP, and Kylin is an open-source distributed analysis engine.
It should also be understood that, for the first way, the biggest drawback of the processing scheme based on the MapReduce or Spark computing engine is that ETL processing takes too long, and queries in Hive or Spark SQL (Structured Query Language) take minutes or even hours, so ad hoc query cannot be achieved. In addition, the first way cannot support multi-dimensional data query, and it lacks cube query capability and the capability to pre-calculate over large data volumes. For the second way, the processing scheme based on the OLAP engine cannot adapt to complex application scenarios such as data warehouse layering, dimensional modeling, complex logic processing, and multi-format conversion.
It should be noted that Hive is a Hadoop-based data warehouse tool used for data extraction, transformation, and loading; it provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop.
In this regard, the embodiments of the present disclosure provide an improved data processing scheme for an OLAP engine that combines the advantages of an offline computing engine and the OLAP engine. That is, the scheme can perform dimensional modeling, data warehouse layering, complex logic processing, multi-format conversion, and ETL of PB-level data volumes, and can also perform multi-dimensional data query, pre-calculation over large data volumes, and ad hoc query.
The present disclosure will be described in detail below with reference to the drawings and specific embodiments.
A system architecture of a data processing method and apparatus for an online analytical processing engine suitable for embodiments of the present disclosure is presented below.
FIG. 1 illustrates a system architecture suitable for embodiments of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be used in other environments or scenarios.
As shown in fig. 1, the system architecture 100 may include: an online analysis processing engine 101, an offline calculation engine 102, a reporting side 103 and a data warehouse 104.
In the embodiment of the present disclosure, the online analysis processing engine 101 is associated with the data warehouse 104, and the online analysis processing engine 101 may obtain the data report requested to be queried by the user from the data warehouse 104 and feed back the data report to the user in response to the report query request.
The data warehouse 104 may include, in order from bottom to top: an operation data layer (ODS), a detail data layer (DWD), a summary data layer (DWS), and an application data layer (ADS).
In the embodiment of the present disclosure, the offline calculation engine 102 embedded in the online analysis processing engine 101 may be used to perform dimensional modeling on the operation data of multiple data sources, so as to obtain corresponding data reports.
Specifically, the offline computing engine 102 embedded in the online analysis processing engine 101 may perform ETL processing on operation data (including intermediate tables) from multiple data sources, and store the operation data obtained after ETL processing in the ODS layer. Further, the offline computing engine 102 may read the corresponding operation data from the ODS layer, perform complex aggregation on it to obtain corresponding detail data, such as a multi-transaction fact table, and store the obtained detail data in the DWD layer. Further, the offline computing engine 102 may summarize the detail data in the DWD layer to obtain a corresponding snapshot table (a fact table) and multiple dimension tables, and store the snapshot table and the dimension tables in the DWS layer. Further, the offline computing engine 102 may associate at least one corresponding dimension table with the fact table, generate a corresponding data report, and store the data report in the ADS layer. That is, the data report is stored in a database (the data warehouse) associated with the OLAP engine, so that the online analysis processing engine 101 can query the data report based on the database in response to a report query request from the reporting end 103.
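As a non-limiting illustration, the layer-by-layer flow described above may be sketched as follows. This is a minimal sketch under stated assumptions: the in-memory `warehouse` dictionary, the function names, and the row fields (`order_id`, `store_id`, `amount`) are hypothetical and are not part of the disclosure, and each step is greatly simplified compared with what a real offline computing engine would do.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the layered data warehouse 104.
warehouse = {"ODS": [], "DWD": [], "DWS": {}, "ADS": []}


def load_ods(raw_rows):
    """ETL step: drop rows without an order ID, normalise types, store in ODS."""
    warehouse["ODS"] = [
        {"order_id": r["order_id"], "store_id": r["store_id"], "amount": float(r["amount"])}
        for r in raw_rows
        if r.get("order_id") is not None
    ]


def build_dwd():
    """Detail layer: one row per order (a simplified fact table)."""
    warehouse["DWD"] = [dict(r) for r in warehouse["ODS"]]


def build_dws(store_dim):
    """Summary layer: per-store totals plus the store dimension table."""
    totals = defaultdict(float)
    for row in warehouse["DWD"]:
        totals[row["store_id"]] += row["amount"]
    warehouse["DWS"] = {"store_totals": dict(totals), "store_dim": store_dim}


def build_ads():
    """Application layer: associate totals with the dimension table as a report."""
    dws = warehouse["DWS"]
    warehouse["ADS"] = [
        {"store_name": dws["store_dim"][sid], "total_amount": total}
        for sid, total in sorted(dws["store_totals"].items())
    ]


raw = [
    {"order_id": 1, "store_id": "s1", "amount": "10.5"},
    {"order_id": 2, "store_id": "s2", "amount": "4"},
    {"order_id": None, "store_id": "s1", "amount": "99"},  # malformed row
    {"order_id": 3, "store_id": "s1", "amount": "2.5"},
]
load_ods(raw)
build_dwd()
build_dws({"s1": "Store One", "s2": "Store Two"})
build_ads()
```

In this sketch, a report query would read only `warehouse["ADS"]`, which mirrors the idea that the report is stored where the OLAP engine can serve it directly.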
It should be understood that the number of data warehouses in FIG. 1 is merely illustrative. There may be any number of data warehouses, as desired for implementation.
An application scenario of the data processing method and apparatus for an online analytical processing engine suitable for the embodiments of the present disclosure is described below.
It should be understood that the data processing scheme for the online analysis processing engine provided by the embodiment of the present disclosure may be used in an intelligent search scenario related to report presentation, and particularly, may be used in an ad hoc query scenario of a multidimensional data table.
According to an embodiment of the present disclosure, a data processing method for an online analytical processing engine is provided.
FIG. 2 illustrates a flow diagram of a data processing method for an online analytical processing engine according to an embodiment of the present disclosure.
As shown in FIG. 2, a data processing method 200 for an online analytical processing engine may include: operations S210 and S220.
In operation S210, an online analysis processing engine is used to perform dimensional modeling on the operation data, so as to obtain a corresponding data report.
In operation S220, the data report is stored in a database associated with the online analysis processing engine so that the data report can be queried by the online analysis processing engine.
It should be understood that in the disclosed embodiments, dimensional modeling is a data modeling method in data warehouse construction, a logical design method for structuring data, which divides the objective world into metrics and contexts. In brief, dimensional modeling may be understood as building data warehouses and data marts, etc. in terms of fact tables and dimension tables.
It should be understood that, in the related art, dimensional modeling can only be applied to offline computing engines such as the Spark computing engine and the MapReduce computing engine, and cannot be applied to an OLAP engine. As a result, offline data construction that uses an OLAP engine cannot adapt to complex application scenarios such as data warehouse layering, dimensional modeling, complex logic processing, and multi-format conversion.
In the embodiment of the present disclosure, the dimension modeling is introduced into the OLAP engine, the OLAP engine may be used to perform the dimension modeling on the operation data from one or more data sources, and finally obtain the corresponding data report, and store the obtained data report in the database associated with the OLAP engine, so as to query the data reports through the OLAP engine.
Through the embodiments of the present disclosure, dimensional modeling is introduced into the offline data construction scheme based on the OLAP engine, so that the OLAP engine has dimensional modeling capability. This solves the problem in the related art that the OLAP engine, lacking dimensional modeling capability, cannot adapt to complex application scenarios such as data warehouse layering, dimensional modeling, complex logic processing, and multi-format conversion, and it achieves the technical effect of combining the advantages of the OLAP engine and the offline computing engine. That is, the scheme can perform dimensional modeling, data warehouse layering, complex logic processing, multi-format conversion, and ETL of PB-level data volumes, and can also perform multi-dimensional data query, pre-calculation over large data volumes, and ad hoc query.
In other words, in the related art, an independent offline computation engine is used for dimensional modeling, but in this scheme, the offline computation engine needs to perform batch processing on operation data, so that the report query speed is slow. In addition, in the related art, data can be queried in real time by using a separate OLAP engine, but the data modeling capability of the scheme is poor. Through the embodiment of the disclosure, dimension modeling is introduced into the OLAP engine, so that the advantages of the OLAP engine and the offline computing engine can be taken into consideration.
Experiments show that, in the embodiments of the present disclosure, introducing dimensional modeling into the OLAP engine improves the execution efficiency of data tasks: the average execution time of a single-day task with complex logic is ultimately less than 1 second, and fast backtracking of data over a long time span is supported.
Experiments also show that, with the embodiments of the present disclosure, the query time at the report end for about 7 days of data can be reduced from more than 3 seconds on average to less than 0.1 second, so that results are returned as soon as the query is issued and the user perceives no query delay. In addition, the amount of data model code at the report end can be reduced from hundreds of lines per report to dozens of lines, realizing a lightweight code model. Fast backtracking of data over a long time span and multi-dimensional data query with complex logic are both supported, and the presentation layer becomes less dependent on upstream tasks. Furthermore, the OLAP engine gains data modeling capability and metric extension capability, so that it can handle complex logic queries and layered data scheduling for PB-level data volumes. Moreover, dimensionally modeled data retains the historical details of the data and reflects historical changes, so that the data structure based on dimensional aggregation is clearer.
As an alternative embodiment, performing the dimension modeling on the operation data by using the OLAP engine to obtain the corresponding data report may include the following operations.
An offline computing engine is embedded in the OLAP engine.
Dimensional modeling is performed on the operation data by using the offline computing engine embedded in the OLAP engine to obtain a corresponding data report.
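As a non-limiting illustration, the embedding relationship and operations S210 and S220 may be sketched as simple object composition. The class names `OfflineEngine` and `OlapEngine`, their methods, and the toy counting step that stands in for dimensional modeling are all assumptions made for this sketch; they do not correspond to the API of Spark, MapReduce, or any particular OLAP engine.

```python
class OfflineEngine:
    """Hypothetical stand-in for a batch engine such as Spark or MapReduce."""

    def dimensional_model(self, operation_data):
        # Toy modelling step: count rows per channel and call that the report.
        report = {}
        for row in operation_data:
            report[row["channel"]] = report.get(row["channel"], 0) + 1
        return report


class OlapEngine:
    """Hypothetical OLAP engine that owns an embedded offline engine."""

    def __init__(self, offline_engine):
        self.offline_engine = offline_engine  # the embedded engine
        self.database = {}  # database associated with the OLAP engine

    def build_report(self, name, operation_data):
        # Operation S210: model the data through the embedded engine.
        report = self.offline_engine.dimensional_model(operation_data)
        # Operation S220: store the report where the OLAP engine can query it.
        self.database[name] = report
        return report

    def query(self, name, key):
        # Queries are served from the associated database, not recomputed.
        return self.database[name].get(key)


engine = OlapEngine(OfflineEngine())
engine.build_report(
    "orders_by_channel",
    [{"channel": "app"}, {"channel": "h5"}, {"channel": "app"}],
)
```

Because the offline engine is held inside the OLAP engine object, modeling and querying share one store, which is the point the embodiment makes about avoiding data flow across separate platforms.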
Through the embodiment of the disclosure, the offline computing engine is embedded in the OLAP engine, so that the OLAP engine has the dimension modeling capability. Compared with the dimension modeling capability of an independent offline calculation engine, the dimension modeling capability of the OLAP engine with the embedded offline calculation engine is stronger, and the processing efficiency of offline data is higher, so that the processing efficiency of data/tasks can be improved, and the OLAP engine can be used for realizing ad hoc query on a data report.
Further, as an alternative embodiment, embedding an offline computation engine in the OLAP engine may include: a Spark calculation engine or a MapReduce calculation engine is embedded in the OLAP engine.
By the embodiment of the disclosure, the advantages of the OLAP engine and the Spark calculation engine (or the MapReduce calculation engine) can be considered. That is, a Spark calculation engine (or MapReduce calculation engine) is embedded in the OLAP engine to provide the OLAP engine with dimensional modeling capability. Compared with the dimension modeling capability of an independent offline calculation engine, the dimension modeling capability of the OLAP engine with the embedded offline calculation engine is stronger, and the processing efficiency of offline data is higher, so that the processing efficiency of data/tasks can be improved, and the OLAP engine can be used for realizing ad hoc query on a data report.
In one embodiment of the present disclosure, an offline computation engine may be embedded in the OLAP engine, and the OLAP engine preprocesses the operation data or the intermediate table using the embedded offline computation engine, so that the operation data or the intermediate table from different data sources can be preprocessed into the fact table and the multiple dimension tables associated therewith, thereby implementing the dimension modeling. In addition, the natural real-time query capability of the OLAP engine can be utilized to perform ad hoc query on the data report generated based on the dimensional modeling, so that real-time multi-dimensional data query is realized.
It should be understood that neither the Spark offline computing engine nor the MapReduce offline computing engine can achieve real-time query, and the OLAP engine cannot achieve offline batch processing of data, whereas data query at the report end requires both real-time query and backtracking of data over a long time span. The offline computing engine can therefore be combined with the OLAP engine to take advantage of both engines. However, simply combining the two engines requires offline data to be processed across multiple platforms, which results in a lengthy data flow.
In contrast, the embodiments of the present disclosure propose embedding a Spark offline computing engine or a MapReduce offline computing engine into the OLAP engine, so as to resolve the tension between real-time, accurate data query and an overly long data flow while combining the advantages of both engines.
In the embodiments of the present disclosure, the embedded offline computing engine (i.e., an offline data platform) is responsible for offline batch processing of the operation data of the ODS layer and the detail data of the DWD layer in the data warehouse, and the OLAP engine is responsible for real-time query of the detail data of the DWD layer and the data reports of the ADS layer in the data warehouse. All of the modified data in the data warehouse associated with the OLAP engine may also facilitate company-level data flow.
For an exemplary report query process for an OLAP engine, reference may be made to FIG. 3. The specific process may include the following operations: storing the operation data extracted from a plurality of data sources in a data warehouse; scheduling the ODS layer data in the data warehouse and performing ETL processing on it; importing the processing result into the data warehouse of the OLAP engine; scheduling the DWD layer data and the DWS layer data in the data warehouse by using the embedded offline computing engine and performing ETL processing on them; importing the processing result into the data warehouse of the OLAP engine again; for data circulation, optionally importing an intermediate table obtained by ETL processing of the scheduled DWD layer data and DWS layer data into the ODS layer of the data warehouse; and presenting the data report and/or performing ad hoc data runs based on the data layers of the data warehouse.
Further, as an optional embodiment, performing dimensional modeling on the operation data by using the offline computing engine embedded in the OLAP engine to obtain a corresponding data report may include performing the following operations with the offline computing engine embedded in the OLAP engine.
Dimensional modeling is performed on the operation data to obtain a corresponding fact table and a corresponding dimension table.
The dimension table obtained by the above operation is associated with the fact table to obtain a corresponding data report.
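As a non-limiting illustration, associating a dimension table with a fact table may be sketched as a key-based join over rows. The function `associate` and the sample columns below are hypothetical and chosen only for this sketch; an actual embodiment would perform the association inside the embedded offline computing engine.

```python
def associate(fact_rows, dim_rows, key):
    """Attach dimension attributes to each fact row that shares `key`."""
    # Index the dimension table once so each fact row is a dictionary lookup.
    dim_index = {d[key]: d for d in dim_rows}
    report = []
    for fact in fact_rows:
        dim = dim_index.get(fact[key], {})  # no match: keep the bare fact row
        # Fact columns win on name clashes; the shared key appears once.
        report.append({**dim, **fact})
    return report


fact_table = [
    {"order_id": 1, "user_id": 7, "quantity": 2},
    {"order_id": 2, "user_id": 8, "quantity": 1},
]
user_dim = [
    {"user_id": 7, "purchase_preference": "books"},
    {"user_id": 8, "purchase_preference": "music"},
]
report = associate(fact_table, user_dim, "user_id")
```

Each resulting row carries both the measures from the fact table and the descriptive attributes from the dimension table, which is what allows the report to be filtered or grouped by dimension.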
In an embodiment of the present disclosure, an offline computing engine is embedded in the OLAP engine. Based on data sources that follow an event-tracking specification, and relying on the complex logic processing capability of the embedded offline computing engine (such as a Spark offline computing engine), operation data is extracted from the data sources, data cleaning and format conversion are performed on the extracted data, and the resulting operation data is imported into the ODS layer of the data warehouse of the OLAP engine. Further, the embedded offline computing engine is used within the OLAP engine to perform offline batch processing on the operation data of the ODS layer, and the result is imported into the DWD layer of the data warehouse. Further, the embedded offline computing engine is used within the OLAP engine to perform complex aggregation on the data in the DWD layer, and the result data is imported into the DWS layer of the data warehouse. Finally, the fact table and the dimension tables are mapped to each other based on the data in the DWS layer, and the resulting data report can be stored directly in the ADS layer of the data warehouse.
Illustratively, for dimensional modeling implemented by the offline computing engine embedded in the OLAP engine, reference may be made to FIG. 4. As shown in FIG. 4, the finally generated data report may include the XX transaction multi-transaction fact table, as well as the category dimension table, after-sales dimension table, miscellaneous dimension table, user dimension table, store dimension table, and commodity dimension table associated with the fact table. The XX transaction multi-transaction fact table may include: order ID, user ID, store ID, commodity ID, purchase quantity, after-sales ID, category ID, order time, payment time, order status update time, refund amount, partition time, and order date. The category dimension table may include: primary category ID and primary category name. The after-sales dimension table may include: after-sales ID, after-sales order application time, after-sales order status, after-sales order update time, and the like. The miscellaneous dimension table may include: order ID, payment status, order channel, external content source, order content source, payment channel, risk identifier, device type, and service source identifier. The user dimension table may include: user ID, user shipping address ID, user purchase preferences, user last login time, and the like. The store dimension table may include: store ID, store name, store establishment time, first transaction time of the store, and the like. The commodity dimension table may include: commodity ID, commodity payment amount, commodity unit price, and the like.
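As a non-limiting illustration, a reduced version of the star schema of FIG. 4 and a multi-dimensional query over it may be sketched with an in-memory SQLite database. The table names, the reduced column set, and the sample rows are assumptions made for this sketch; SQLite merely stands in for the database associated with the OLAP engine, and only two of the six dimension tables are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two dimension tables (a subset of those named for FIG. 4).
CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE category_dim (category_id INTEGER PRIMARY KEY, category_name TEXT);
-- Fact table: one row per order, with keys pointing at the dimensions.
CREATE TABLE order_fact (
    order_id INTEGER PRIMARY KEY,
    store_id INTEGER REFERENCES store_dim(store_id),
    category_id INTEGER REFERENCES category_dim(category_id),
    purchase_quantity INTEGER
);
INSERT INTO store_dim VALUES (1, 'Store A'), (2, 'Store B');
INSERT INTO category_dim VALUES (10, 'Books'), (20, 'Toys');
INSERT INTO order_fact VALUES
    (100, 1, 10, 2), (101, 1, 20, 1), (102, 2, 10, 5), (103, 1, 10, 3);
""")

# Multi-dimensional query: total quantity grouped by store and category.
rows = conn.execute("""
    SELECT s.store_name, c.category_name, SUM(f.purchase_quantity)
    FROM order_fact AS f
    JOIN store_dim AS s ON s.store_id = f.store_id
    JOIN category_dim AS c ON c.category_id = f.category_id
    GROUP BY s.store_name, c.category_name
    ORDER BY s.store_name, c.category_name
""").fetchall()
```

The fact table holds only keys and measures, so adding another dimension table from FIG. 4 would require one more join rather than a change to the fact rows.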
Through the embodiments of the present disclosure, the OLAP engine and the offline computing engine are connected, and an OLAP engine offline data processing scheme based on dimensional modeling can be realized, which supports dimensional modeling, complex logic query, and fast routine scheduling.
In the embodiments of the present disclosure, the full data-flow link across the OLAP engine, the offline computing engine, and report-end multi-dimensional query is connected end to end for the first time. The scheme thus provides both a fast query capability for complex statistical results and a multi-dimensional query capability for detail data.
Additionally, as an alternative embodiment, storing the data report in a database associated with the OLAP engine may include: the data report is stored in an application data layer of a database associated with an OLAP engine, which is used in response to a report query request.
Illustratively, reference may be made to FIG. 5 for a bin hierarchy implemented by an offline compute engine built into the OLAP engine. As shown in fig. 5, the data warehouse may include a DWD layer and an ADS layer. The DWD layer is detail data and can include various fact tables, such as a transaction multiple-transaction fact table, an applet multiple-transaction fact table, an App multiple-transaction fact table, an H5 multiple-transaction fact table, a live multiple-transaction fact table, and the like. The statistical monitoring information and the operation decision information obtained based on the transaction multi-transaction fact table can be stored in the ADS layer. The statistical monitoring information obtained based on the transaction multi-transaction fact table may include various snapshot tables, such as a store transaction snapshot table, a user transaction snapshot table, a buyer-seller transaction snapshot table, a gross transaction snapshot table, a commodity transaction snapshot table, and the like. The operation decision information obtained based on the transaction multiple transaction fact table may include: user life cycle, after sale, E-commerce GMV, transaction wind control, explosive/commodity sales and the like. 
The statistical monitoring information obtained based on the applet multi-transaction fact table, the App multi-transaction fact table, the H5 multi-transaction fact table, and the like may include various snapshot tables, such as an applet traffic snapshot table (e.g., number of launches, duration, and the like), an applet retention snapshot table (e.g., new-user retention, active-user retention, and the like), an App traffic snapshot table (e.g., number of launches, duration, and the like), an App retention snapshot table (e.g., new-user retention, active-user retention, and the like), an H5 traffic snapshot table (e.g., number of launches, duration, and the like), an H5 retention snapshot table (e.g., new-user retention, active-user retention, and the like), and the like. The operation decision information obtained based on the applet multi-transaction fact table, the App multi-transaction fact table, the H5 multi-transaction fact table, and the like may include: full-end traffic (e.g., user scale, retention, daily new users, channel sources, and the like), user profile, user behavior trajectory, user preferences, and the like. The statistical monitoring information obtained based on the live multi-transaction fact table may include a merchant/live snapshot table. The operation decision information obtained based on the live multi-transaction fact table may include the number/duration of live sessions, the number of merchants/anchors, the viewing duration/online peak of live sessions, the live interaction rate, the live conversion funnel, and the like.
As shown in FIG. 5, the DWD layer data can satisfy the 10% of requirements that are temporary (e.g., freeing up manpower). The ADS layer data can be presented as user reports, and can satisfy the 70% of requirements that are long-term statistical monitoring requirements (e.g., standardization, fast query, and avoidance of repeated development) as well as the 20% of requirements that are operation decision requirements (e.g., standardization and fast query).
As shown in FIG. 5, the long-term statistical monitoring data in the ADS layer (the 70% share) may provide core indicators (e.g., coarsest granularity, most recently needed, data that needs to be viewed every day, and the like) and may also provide basic indicators (e.g., long-term observation, finer granularity than the core indicators, more dimensions covered, indicators common across service lines, and the like). As shown in FIG. 5, the operation decision information in the ADS layer (the 20% share) and the temporary requirement data in the DWD layer (the 10% share) may provide decision indicators (e.g., temporary, personalized, activity-monitoring, and computationally complex indicators). The core indicators, decision indicators, basic indicators, and the like may be data relating to aspects such as user growth, content ecology, users, advertisement delivery, live broadcast, and e-commerce, and may in turn support operation decisions in those same aspects.
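The relationship between a DWD-layer fact table and an ADS-layer snapshot table described with reference to FIG. 5 can be illustrated with a minimal Python sketch. The row layout, the field names (`store_id`, `user_id`, `amount`), and the helper `build_store_snapshot` are hypothetical and chosen only for illustration; the disclosure does not specify table schemas or the aggregation logic.

```python
from collections import defaultdict

# Hypothetical DWD-layer detail rows: one row per transaction event.
dwd_transaction_fact = [
    {"store_id": "s1", "user_id": "u1", "amount": 30.0},
    {"store_id": "s1", "user_id": "u2", "amount": 20.0},
    {"store_id": "s2", "user_id": "u1", "amount": 5.0},
]


def build_store_snapshot(fact_rows):
    """Aggregate detail rows into a per-store snapshot (ADS-layer style)."""
    snapshot = defaultdict(lambda: {"order_count": 0, "gmv": 0.0})
    for row in fact_rows:
        entry = snapshot[row["store_id"]]
        entry["order_count"] += 1       # number of transactions per store
        entry["gmv"] += row["amount"]   # summed transaction amount per store
    return dict(snapshot)


# The snapshot is what a report would read; the detail rows remain
# available in the DWD layer for temporary, ad hoc needs.
ads_store_snapshot = build_store_snapshot(dwd_transaction_fact)
```

In this sketch the detail rows are kept unchanged and the snapshot is derived from them, which mirrors the division of roles between the DWD layer and the ADS layer described above.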
It should be understood that an offline data warehouse based on a Spark computing engine or a MapReduce computing engine adopts dimensional modeling and data layering, so that multiple layers of isolation can be placed between the presentation of a data report and the data source. This ensures that the output data has unified and complete indicators and clear data lineage. However, this is an advantage of a data warehouse built on a standalone Spark computing engine or MapReduce computing engine. The OLAP engine by itself does not have dimensional modeling capability; it serves multidimensional analysis and fast data computation. Once the OLAP engine is connected with a Spark (or MapReduce) computing engine, both fast task/data execution and dimensional modeling become possible.
Furthermore, in an embodiment of the present disclosure, a data warehouse associated with the OLAP engine may include, in order from bottom to top: ODS layer, DWD layer, DWS layer, ADS layer. The data stored in the ODS layer, the DWD layer, the DWS layer, and the ADS layer may refer to the description in other embodiments, and are not described herein again.
Through the embodiment of the disclosure, after the dimension modeling is introduced into the OLAP engine, the corresponding warehouse layering can be realized, so that the data structure is clearer.
It should be appreciated that in embodiments of the present disclosure, the OLAP engine may be provided with dimensional modeling capabilities, and the ODS layer data sources corresponding to the data warehouse may satisfy company-level data flows. The multi-transaction fact tables of the DWD layer can meet the 10% of requirements that are temporary, which greatly frees up human resources. The ADS layer can cover the 70% of requirements that are long-term statistical monitoring requirements and the 20% that are personalized operation decision indicator requirements. In addition, the dimensionally modeled data preserves the historical details of the data, can reflect historical changes, and has a clearer data structure after dimensional aggregation. In particular, compared with the hour-level execution time of offline tasks common in the industry, computing the dimensionally modeled data with the OLAP engine usually takes an execution time on the scale of seconds.
In addition, as an optional embodiment, the method further comprises: in response to the report query request hitting a column of an aggregation query preprocessing task, performing the data report query using the OLAP engine.
And/or, as an alternative embodiment, the method further comprises: in response to the report query request missing the columns of the aggregation query preprocessing task, performing the data report query using a preset offline computing engine.
Through the embodiments of the present disclosure, dimensional modeling, data warehouse layering, and instant data query can be realized based on the OLAP engine. Further, for scenarios that the OLAP engine cannot handle, an external, independent offline computing engine (for example, a Spark computing engine or a MapReduce computing engine that is independent of the OLAP engine and distinct from the embedded offline computing engine) may be used to perform the data query, thereby enhancing the data query capability of the system.
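The routing between the two query paths can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the disclosure does not define precisely what constitutes a "hit", so the sketch assumes a request hits when every column it asks for is covered by the columns of the aggregation query preprocessing task. The function name `choose_engine`, the set `PREPROCESSED_COLUMNS`, and the column names are hypothetical.

```python
from typing import Iterable

# Hypothetical set of columns covered by the aggregation query
# preprocessing task (assumption: not specified in the disclosure).
PREPROCESSED_COLUMNS = {"dt", "store_id", "gmv"}


def choose_engine(requested_columns: Iterable[str],
                  preprocessed_columns: Iterable[str]) -> str:
    """Pick the engine that should serve a report query.

    Assumed rule: the request "hits" when all requested columns are
    covered by the preprocessed columns; otherwise it "misses".
    """
    if set(requested_columns) <= set(preprocessed_columns):
        return "olap"      # hit: serve from the OLAP engine
    return "offline"       # miss: fall back to the preset offline engine


hit_route = choose_engine(["dt", "gmv"], PREPROCESSED_COLUMNS)
miss_route = choose_engine(["dt", "user_id"], PREPROCESSED_COLUMNS)
```

A stricter or looser hit rule (for example, matching on dimensions only) would change just the condition inside `choose_engine`; the two-way dispatch itself follows the hit and miss branches described above.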
According to an embodiment of the present disclosure, the present disclosure also provides a data processing apparatus for an online analysis processing engine.
FIG. 6 illustrates a block diagram of a data processing apparatus for an online analytics processing engine, in accordance with an embodiment of the present disclosure.
As shown in fig. 6, the data processing apparatus 600 for the online analysis processing engine may include: a data modeling module 610 and a report storage module 620.
The data modeling module 610 is configured to perform dimensional modeling on the operation data by using an online analysis processing engine to obtain a corresponding data report.
The report storage module 620 is configured to store the data report in a database associated with the online analysis processing engine, so that the data report can be queried through the online analysis processing engine.
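The cooperation of the two modules can be sketched in Python as below. The class names, method names, and the in-memory dictionary standing in for the database are all hypothetical, and the "modeling" step is a deliberately trivial placeholder (counting records per event type) rather than the dimensional modeling of the disclosure; the sketch only shows how a modeling module and a storage module might be wired together.

```python
from dataclasses import dataclass, field


class DataModelingModule:
    """Stand-in for module 610: turns operation data into a data report."""

    def model(self, operation_data):
        # Placeholder for dimensional modeling: count records per event type.
        report = {}
        for record in operation_data:
            report[record["event"]] = report.get(record["event"], 0) + 1
        return report


@dataclass
class ReportStorageModule:
    """Stand-in for module 620: keeps reports where they can be queried."""

    database: dict = field(default_factory=dict)  # hypothetical in-memory store

    def store(self, name, report):
        self.database[name] = report

    def query(self, name):
        return self.database[name]


@dataclass
class DataProcessingApparatus:
    """Stand-in for apparatus 600: composes the two modules."""

    modeling: DataModelingModule = field(default_factory=DataModelingModule)
    storage: ReportStorageModule = field(default_factory=ReportStorageModule)

    def process(self, name, operation_data):
        self.storage.store(name, self.modeling.model(operation_data))


apparatus = DataProcessingApparatus()
apparatus.process(
    "daily_events",
    [{"event": "pay"}, {"event": "view"}, {"event": "pay"}],
)
```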
As an alternative embodiment, the data modeling module comprises: an engine processing unit, configured to embed an offline computing engine in the online analysis processing engine; and a data modeling unit, configured to perform dimensional modeling on the operation data by using the offline computing engine embedded in the online analysis processing engine to obtain the corresponding data report.
As an alternative embodiment, the data modeling unit includes: a table generation subunit, configured to perform dimensional modeling on the operation data by using the offline computing engine embedded in the online analysis processing engine to obtain a corresponding fact table and a corresponding dimension table; and a table association subunit, configured to associate the dimension table with the fact table by using the offline computing engine embedded in the online analysis processing engine to obtain the corresponding data report.
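Associating a dimension table with a fact table to obtain a report can be illustrated with the following Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in for the embedded offline computing engine, which in the disclosure may be Spark or MapReduce; the table names `dim_store` and `fact_transaction`, their columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_store (
        store_id TEXT PRIMARY KEY, store_name TEXT, city TEXT
    );
    CREATE TABLE fact_transaction (
        order_id TEXT, store_id TEXT, amount REAL
    );
    INSERT INTO dim_store VALUES
        ('s1', 'Store A', 'Beijing'),
        ('s2', 'Store B', 'Shanghai');
    INSERT INTO fact_transaction VALUES
        ('o1', 's1', 30.0),
        ('o2', 's1', 20.0),
        ('o3', 's2', 5.0);
    """
)

# Associate the dimension table with the fact table on the shared key,
# then aggregate the measures by the descriptive dimension attributes.
report = conn.execute(
    """
    SELECT d.city, d.store_name,
           COUNT(*) AS order_count,
           SUM(f.amount) AS gmv
    FROM fact_transaction AS f
    JOIN dim_store AS d ON d.store_id = f.store_id
    GROUP BY d.city, d.store_name
    ORDER BY d.store_name
    """
).fetchall()
conn.close()
```

Here the fact table carries the measures (`amount`) and the dimension table carries the descriptive attributes (`store_name`, `city`); the join on `store_id` is what turns raw measures into a readable report row.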
As an alternative embodiment, the engine processing unit is further configured to: embed a Spark computing engine or a MapReduce computing engine in the online analysis processing engine.
As an optional embodiment, the report storage module is further configured to: store the data report in an application data layer of a database associated with the online analysis processing engine.
As an alternative embodiment, the apparatus further comprises: a first report query module, configured to perform the data report query using the online analysis processing engine in response to the report query request hitting a column of an aggregation query preprocessing task.
As an alternative embodiment, the apparatus further comprises: a second report query module, configured to perform the data report query using a preset offline computing engine in response to the report query request missing the columns of the aggregation query preprocessing task.
It should be understood that the embodiments of the apparatus part of the present disclosure are the same as or similar to the embodiments of the method part of the present disclosure, and that the technical problems solved and the technical effects achieved are likewise the same or similar; they are not described again here.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as the data processing method for the OLAP engine. For example, in some embodiments, the data processing method for the OLAP engine may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the data processing method for the OLAP engine described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the data processing method for the OLAP engine.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the drawbacks of high management difficulty and weak service scalability found in conventional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
In the technical solution of the present disclosure, the recording, storage, application, and the like of the user data involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.