Retrieval method and system for travel tracks of vehicles in urban road network
1. A retrieval method for travel tracks of urban road network vehicles, characterized by comprising the following steps:
obtaining urban road network vehicle travel track data and constructing a time-space data set;
constructing and storing a Hilbert-DR tree according to the time-space data set;
and inputting retrieval conditions, traversing the Hilbert-DR tree according to the retrieval conditions, and determining an urban road network vehicle travel track data set corresponding to the retrieval conditions.
2. The method for retrieving urban road network vehicle travel tracks according to claim 1, wherein said constructing and storing a Hilbert-DR tree according to said time-space data set comprises:
fragmenting the space-time data set according to time periods to obtain a plurality of fragmented data sets;
let i have a value of 1;
respectively taking the plurality of fragmented data sets as corresponding cluster sets of a plurality of i-th-level intermediate nodes;
clustering the corresponding cluster sets of the ith-level intermediate nodes by adopting a k-means clustering algorithm to obtain a plurality of containing cluster sets of the ith-level intermediate nodes;
respectively judging whether each cluster contained in each ith-level intermediate node meets the leaf node generation condition; the leaf node generating condition is that the number of data in the cluster is smaller than a node capacity threshold value;
taking each contained cluster set that meets the leaf node generation condition as a leaf node of the i-th-level intermediate node in which that contained cluster set is located;
taking each contained cluster set that does not meet the leaf node generation condition as the corresponding cluster set of an (i+1)-th-level intermediate node under the i-th-level intermediate node in which that contained cluster set is located;
and increasing the value of i by 1, returning to the step of performing clustering processing on the corresponding cluster set of each i-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of contained cluster sets of each i-level intermediate node until each contained cluster set meets the leaf node generation condition, and obtaining the Hilbert-DR tree.
3. The method for retrieving urban road network vehicle travel tracks according to claim 2, wherein after the time-space data sets are segmented according to time periods to obtain a plurality of segmented data sets, the method further comprises:
and performing Hilbert coding on the data in each partitioned data set to obtain a plurality of coded partitioned data sets.
4. The method according to claim 2, wherein said clustering the corresponding cluster set of each i-th level intermediate node by using a k-means clustering algorithm to obtain a plurality of containing cluster sets of each i-th level intermediate node, specifically comprises:
determining a plurality of cluster centers of the corresponding cluster set of the n-th inter-level node, where n = 1, 2, ..., N, and N is the number of the coded partitioned data sets;
calculating the Euclidean distance between each datum in the corresponding cluster set of the n-th inter-level node and each cluster center;
according to the Euclidean distance, distributing data in a corresponding cluster set of the nth inter-level node to a cluster corresponding to a cluster center corresponding to the minimum Euclidean distance;
calculating the cluster center change amount of each cluster after data distribution;
and updating the cluster centers of the clusters with the cluster center change larger than or equal to the change threshold, and returning to the step of calculating the Euclidean distance between the data in the nth coded fragment data set and each cluster center until all the cluster center change values are smaller than the change threshold, so as to obtain a plurality of cluster-containing sets.
5. The method for retrieving urban road network vehicle travel tracks according to claim 4, wherein said Euclidean distance is calculated by the formula:
$$d(t_i, o_j) = \sqrt{\sum_{z=1}^{m} \left(t_{iz} - o_{jz}\right)^2}$$
where $d(t_i, o_j)$ is the Euclidean distance from the i-th sample point $t_i$ to the j-th cluster center $o_j$, $m$ is the dimension of the sample point feature vector, $t_{iz}$ is the z-th dimension of the feature vector of the i-th sample point, and $o_{jz}$ is the z-th dimension of the feature vector of the j-th cluster center.
6. The retrieval method for travel tracks of urban road network vehicles according to claim 4, wherein the calculation formula of the change amount of the clustering center is as follows:
where $\omega_c$ is the cluster center change amount at the c-th iteration, $T_{c,i}$ is the i-th cluster at the c-th iteration, $T_{c-1,i}$ is the i-th cluster at the (c-1)-th iteration, $|T_i|$ is the number of data in the i-th cluster, and $t_j$ is the j-th sample point.
7. A retrieval system for travel tracks of vehicles in urban road network is characterized in that the system comprises:
a time-space data set construction module, used for acquiring travel track data of vehicles in an urban road network and constructing a time-space data set;
the Hilbert-DR tree building module is used for building and storing a Hilbert-DR tree according to the space-time data set;
and the retrieval module is used for inputting retrieval conditions, traversing the Hilbert-DR tree according to the retrieval conditions and determining an urban road network vehicle travel track data set corresponding to the retrieval conditions.
8. The retrieval system for travel tracks of vehicles in urban road network according to claim 7, wherein said Hilbert-DR tree construction module specifically comprises:
the fragmentation data set determining unit is used for fragmenting the time-space data set according to time periods to obtain a plurality of fragmentation data sets;
the assignment unit is used for enabling the value of i to be 1;
a corresponding cluster determining unit, configured to use the multiple fragmented data sets as corresponding clusters of multiple i-th-level intermediate nodes, respectively;
the included cluster determining unit is used for carrying out clustering processing on the corresponding cluster set of each ith-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of included cluster sets of each ith-level intermediate node;
the first judging unit is used for respectively judging whether each cluster contained in each ith-level intermediate node meets the leaf node generating condition; the leaf node generating condition is that the number of data in the cluster is smaller than a node capacity threshold value;
a leaf node generating unit, configured to take each contained cluster set that meets the leaf node generation condition as a leaf node of the i-th-level intermediate node in which that contained cluster set is located;
an intermediate node generating unit, configured to take each contained cluster set that does not meet the leaf node generation condition as the corresponding cluster set of an (i+1)-th-level intermediate node under the i-th-level intermediate node in which that contained cluster set is located;
and the Hilbert-DR tree determining unit is used for increasing the value of i by 1, returning to the step of performing clustering processing on the corresponding cluster set of each i-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of contained cluster sets of each i-level intermediate node until each contained cluster set meets the generation condition of leaf nodes, and obtaining the Hilbert-DR tree.
9. The system for retrieving urban road network vehicle travel tracks according to claim 8, wherein said Hilbert-DR tree construction module further comprises:
and the Hilbert coding unit is used for performing Hilbert coding on the data in each partitioned data set to obtain a plurality of coded partitioned data sets.
10. The system for retrieving urban road network vehicle travel tracks according to claim 8, wherein said inclusion cluster determining unit specifically comprises:
a cluster center determining subunit, configured to determine a plurality of cluster centers of the corresponding cluster set of the n-th inter-level node, where n = 1, 2, ..., N, and N is the number of the coded partitioned data sets;
the Euclidean distance calculating subunit is used for calculating Euclidean distances between data in a corresponding cluster set of the nth inter-level node and each cluster center respectively;
a data distribution subunit, configured to distribute, according to the euclidean distance, data in a cluster set corresponding to the nth inter-level node to a cluster corresponding to a cluster center corresponding to a minimum euclidean distance;
a cluster center variation calculating subunit, configured to calculate a cluster center variation of each cluster after data distribution;
and the contained cluster determining subunit is used for updating the cluster centers of the clusters with the cluster center change greater than or equal to the change threshold, and returning to the step of calculating the Euclidean distance between the data in the n-th coded fragment data set and each cluster center until all the cluster center change values are smaller than the change threshold, so that a plurality of contained clusters are obtained.
Background
Urban road network vehicle travel track data are multidimensional and very large in volume. During track data retrieval, an HBase database that relies only on the RowKey design principle has difficulty meeting the retrieval requirements for vehicle track data, and suffers from uneven data storage distribution and low retrieval efficiency. In this regard, the prior art proposes the following solutions: (1) Combining the spatial relationship of network objects with Hilbert hierarchical codes into a multi-layer network improves spatial retrieval efficiency, but a spatial range must be preset, which causes imbalance of the index structure, and the method is only suitable for retrieving point objects. (2) Clustering the data by using a Z curve and then, based on the clustering result, using the HBase database as the overall retrieval structure of a space-time association algorithm offers high real-time and dynamic performance, but the indexing efficiency is low. (3) Constructing a distributed space-time index with a double-layer structure based on the quadtree and the 3DR tree persistently supports dynamic loading of disk subtrees and improves query efficiency, but the storage cost is high.
Therefore, a data retrieval technique with uniform storage distribution, high retrieval efficiency and low storage cost is needed.
Disclosure of Invention
The invention aims to provide a method and a system for searching travel tracks of vehicles in an urban road network, which have the advantages of uniform storage distribution, high searching efficiency and low storage cost.
In order to achieve the purpose, the invention provides the following scheme:
a retrieval method for travel tracks of urban road network vehicles comprises the following steps:
obtaining urban road network vehicle travel track data and constructing a time-space data set;
constructing and storing a Hilbert-DR tree according to the time-space data set;
and inputting retrieval conditions, traversing the Hilbert-DR tree according to the retrieval conditions, and determining an urban road network vehicle travel track data set corresponding to the retrieval conditions.
Optionally, the constructing and storing a Hilbert-DR tree according to the space-time data set specifically includes:
fragmenting the space-time data set according to time periods to obtain a plurality of fragmented data sets;
let i have a value of 1;
respectively taking the plurality of fragmented data sets as corresponding cluster sets of a plurality of i-th-level intermediate nodes;
clustering the corresponding cluster sets of the ith-level intermediate nodes by adopting a k-means clustering algorithm to obtain a plurality of containing cluster sets of the ith-level intermediate nodes;
respectively judging whether each cluster contained in each ith-level intermediate node meets the leaf node generation condition; the leaf node generating condition is that the number of data in the cluster is smaller than a node capacity threshold value;
taking each contained cluster set that meets the leaf node generation condition as a leaf node of the i-th-level intermediate node in which that contained cluster set is located;
taking each contained cluster set that does not meet the leaf node generation condition as the corresponding cluster set of an (i+1)-th-level intermediate node under the i-th-level intermediate node in which that contained cluster set is located;
and increasing the value of i by 1, returning to the step of performing clustering processing on the corresponding cluster set of each i-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of contained cluster sets of each i-level intermediate node until each contained cluster set meets the leaf node generation condition, and obtaining the Hilbert-DR tree.
Optionally, after the time-space data set is fragmented according to a time period to obtain a plurality of fragmented data sets, the method further includes:
and performing Hilbert coding on the data in each partitioned data set to obtain a plurality of coded partitioned data sets.
Optionally, the clustering processing is performed on the cluster set corresponding to each i-th level intermediate node by using a k-means clustering algorithm, so as to obtain a plurality of included cluster sets of each i-th level intermediate node, and the method specifically includes:
determining a plurality of cluster centers of the corresponding cluster set of the n-th inter-level node, where n = 1, 2, ..., N, and N is the number of the coded partitioned data sets;
calculating the Euclidean distance between each datum in the corresponding cluster set of the n-th inter-level node and each cluster center;
according to the Euclidean distance, distributing data in a corresponding cluster set of the nth inter-level node to a cluster corresponding to a cluster center corresponding to the minimum Euclidean distance;
calculating the cluster center change amount of each cluster after data distribution;
and updating the cluster centers of the clusters with the cluster center change larger than or equal to the change threshold, and returning to the step of calculating the Euclidean distance between the data in the nth coded fragment data set and each cluster center until all the cluster center change values are smaller than the change threshold, so as to obtain a plurality of cluster-containing sets.
Optionally, the calculation formula of the euclidean distance is:
$$d(t_i, o_j) = \sqrt{\sum_{z=1}^{m} \left(t_{iz} - o_{jz}\right)^2}$$
where $d(t_i, o_j)$ is the Euclidean distance from the i-th sample point $t_i$ to the j-th cluster center $o_j$, $m$ is the dimension of the sample point feature vector, $t_{iz}$ is the z-th dimension of the feature vector of the i-th sample point, and $o_{jz}$ is the z-th dimension of the feature vector of the j-th cluster center.
Optionally, the calculation formula of the cluster center change amount is as follows:
where $\omega_c$ is the cluster center change amount at the c-th iteration, $T_{c,i}$ is the i-th cluster at the c-th iteration, $T_{c-1,i}$ is the i-th cluster at the (c-1)-th iteration, $|T_i|$ is the number of data in the i-th cluster, and $t_j$ is the j-th sample point.
A retrieval system for urban road network vehicle travel tracks comprises:
a time-space data set construction module, used for acquiring travel track data of vehicles in an urban road network and constructing a time-space data set;
the Hilbert-DR tree building module is used for building and storing a Hilbert-DR tree according to the space-time data set;
and the retrieval module is used for inputting retrieval conditions, traversing the Hilbert-DR tree according to the retrieval conditions and determining an urban road network vehicle travel track data set corresponding to the retrieval conditions.
Optionally, the Hilbert-DR tree building module specifically includes:
the fragmentation data set determining unit is used for fragmenting the time-space data set according to time periods to obtain a plurality of fragmentation data sets;
the assignment unit is used for enabling the value of i to be 1;
a corresponding cluster determining unit, configured to use the multiple fragmented data sets as corresponding clusters of multiple i-th-level intermediate nodes, respectively;
the included cluster determining unit is used for carrying out clustering processing on the corresponding cluster set of each ith-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of included cluster sets of each ith-level intermediate node;
the first judging unit is used for respectively judging whether each cluster contained in each ith-level intermediate node meets the leaf node generating condition; the leaf node generating condition is that the number of data in the cluster is smaller than a node capacity threshold value;
a leaf node generating unit, configured to take each contained cluster set that meets the leaf node generation condition as a leaf node of the i-th-level intermediate node in which that contained cluster set is located;
an intermediate node generating unit, configured to take each contained cluster set that does not meet the leaf node generation condition as the corresponding cluster set of an (i+1)-th-level intermediate node under the i-th-level intermediate node in which that contained cluster set is located;
and the Hilbert-DR tree determining unit is used for increasing the value of i by 1, returning to the step of performing clustering processing on the corresponding cluster set of each i-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of contained cluster sets of each i-level intermediate node until each contained cluster set meets the generation condition of leaf nodes, and obtaining the Hilbert-DR tree.
Optionally, the Hilbert-DR tree constructing module further includes:
and the Hilbert coding unit is used for performing Hilbert coding on the data in each partitioned data set to obtain a plurality of coded partitioned data sets.
Optionally, the include cluster determining unit specifically includes:
a cluster center determining subunit, configured to determine a plurality of cluster centers of the corresponding cluster set of the n-th inter-level node, where n = 1, 2, ..., N, and N is the number of the coded partitioned data sets;
the Euclidean distance calculating subunit is used for calculating Euclidean distances between data in a corresponding cluster set of the nth inter-level node and each cluster center respectively;
a data distribution subunit, configured to distribute, according to the euclidean distance, data in a cluster set corresponding to the nth inter-level node to a cluster corresponding to a cluster center corresponding to a minimum euclidean distance;
a cluster center variation calculating subunit, configured to calculate a cluster center variation of each cluster after data distribution;
and the contained cluster determining subunit is used for updating the cluster centers of the clusters with the cluster center change greater than or equal to the change threshold, and returning to the step of calculating the Euclidean distance between the data in the n-th coded fragment data set and each cluster center until all the cluster center change values are smaller than the change threshold, so that a plurality of contained clusters are obtained.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses a method and a system for retrieving travel tracks of vehicles in an urban road network, wherein the method comprises the following steps: obtaining urban road network vehicle travel track data and constructing a time-space data set; constructing and storing a Hilbert-DR tree according to the time-space data set; and inputting retrieval conditions, traversing the Hilbert-DR tree according to the retrieval conditions, and determining an urban road network vehicle travel track data set corresponding to the retrieval conditions. The invention aims to provide a method and a system for searching travel tracks of vehicles in an urban road network, which have the advantages of uniform storage distribution, high searching efficiency and low storage cost.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for retrieving travel tracks of vehicles in an urban road network according to an embodiment of the present invention;
FIG. 2 is a multidimensional space diagram of travel track data of vehicles in an urban road network according to an embodiment of the present invention;
FIG. 3 is a 1st-order Hilbert code diagram according to an embodiment of the present invention;
FIG. 4 is a 2nd-order Hilbert code diagram according to an embodiment of the present invention;
FIG. 5 is a 3rd-order Hilbert code diagram according to an embodiment of the present invention;
FIG. 6 is a daily data distribution diagram of travel tracks of vehicles in an urban road network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a Hilbert-DR tree structure provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a retrieval system for travel tracks of vehicles in an urban road network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for searching travel tracks of vehicles in an urban road network, which have the advantages of uniform storage distribution, high searching efficiency and low storage cost.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a flowchart of a method for retrieving travel tracks of vehicles in an urban road network according to an embodiment of the present invention. As shown in FIG. 1, the method provided by the invention includes:
step 101: obtaining urban road network vehicle travel track data and constructing a time-space data set;
step 102: constructing and storing a Hilbert-DR tree according to the time-space data set;
step 103: and inputting retrieval conditions, traversing the Hilbert-DR tree according to the retrieval conditions, and determining an urban road network vehicle travel track data set corresponding to the retrieval conditions.
Specifically, step 102 specifically includes:
fragmenting the time-space data set according to time periods to obtain a plurality of fragmented data sets;
let i have a value of 1;
respectively taking the plurality of fragmented data sets as corresponding cluster sets of a plurality of ith-level intermediate nodes;
clustering the corresponding cluster sets of each ith-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of containing cluster sets of each ith-level intermediate node;
respectively judging whether each cluster contained in each ith-level intermediate node meets the leaf node generation condition; the leaf node generating condition is that the number of data in the cluster is smaller than a node capacity threshold value;
taking each contained cluster set that meets the leaf node generation condition as a leaf node of the i-th-level intermediate node in which that contained cluster set is located;
taking each contained cluster set that does not meet the leaf node generation condition as the corresponding cluster set of an (i+1)-th-level intermediate node under the i-th-level intermediate node in which that contained cluster set is located;
and increasing the value of i by 1, returning to the step of performing clustering processing on the corresponding cluster set of each i-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of contained cluster sets of each i-level intermediate node until each contained cluster set meets the leaf node generation condition, and obtaining the Hilbert-DR tree.
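The level-by-level construction above can be sketched as a recursion, which visits the same nodes as the loop over i. This is a minimal illustration and not the patented implementation: the names `Node`, `build_subtree`, `build_tree`, `leaf_data` and the pluggable `cluster_fn` are assumptions introduced here, and the `no_progress` guard (stop when clustering fails to split a set) is an added safeguard that the text does not mention.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    level: int
    data: Optional[list] = None                            # set only on leaf nodes
    children: List["Node"] = field(default_factory=list)   # used by intermediate nodes

    @property
    def is_leaf(self) -> bool:
        return self.data is not None


def build_subtree(cluster_set: list, level: int, capacity: int,
                  cluster_fn: Callable[[list], List[list]]) -> Node:
    """Cluster one corresponding cluster set; small clusters become leaves,
    large ones become (level + 1) intermediate nodes and are clustered again."""
    node = Node(level=level)
    for cluster in cluster_fn(cluster_set):
        too_small = len(cluster) < capacity               # leaf node generation condition
        no_progress = len(cluster) == len(cluster_set)    # assumed guard against endless recursion
        if too_small or no_progress:
            # The leaf hangs under the current level-i intermediate node.
            node.children.append(Node(level=level, data=list(cluster)))
        else:
            node.children.append(
                build_subtree(cluster, level + 1, capacity, cluster_fn))
    return node


def build_tree(shards: List[list], capacity: int,
               cluster_fn: Callable[[list], List[list]]) -> List[Node]:
    """One level-1 intermediate node per time-fragmented data set."""
    return [build_subtree(shard, 1, capacity, cluster_fn) for shard in shards]


def leaf_data(node: Node) -> List[list]:
    """Collect the data stored at the leaves, left to right."""
    if node.is_leaf:
        return [node.data]
    out: List[list] = []
    for child in node.children:
        out.extend(leaf_data(child))
    return out
```

Here `cluster_fn` stands in for the k-means step; any function that maps a list of data to a list of clusters can be passed in, which keeps the tree logic separate from the clustering details.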
Specifically, after the time-space data set is fragmented according to the time period to obtain a plurality of fragmented data sets, the method further includes: and performing Hilbert coding on the data in each partitioned data set to obtain a plurality of coded partitioned data sets.
In addition, in the invention, the corresponding cluster set of each ith-level intermediate node is clustered by adopting a k-means clustering algorithm to obtain a plurality of containing cluster sets of each ith-level intermediate node, and the method specifically comprises the following steps:
determining a plurality of cluster centers of the corresponding cluster set of the n-th inter-level node, where n = 1, 2, ..., N, and N is the number of the coded partitioned data sets;
calculating the Euclidean distance between each datum in the corresponding cluster set of the n-th inter-level node and each cluster center;
distributing data in a corresponding cluster set of the nth inter-level node to a cluster corresponding to a cluster center corresponding to the minimum Euclidean distance according to the Euclidean distance;
calculating the cluster center change amount of each cluster after data distribution;
and updating the cluster centers of the clusters with the cluster center change value larger than or equal to the change threshold value, and returning to the step of calculating the Euclidean distance between the data in the n-th coded fragment data set and each cluster center until the change value of all the cluster centers is smaller than the change threshold value to obtain a plurality of cluster-containing sets.
The calculation formula of the Euclidean distance is as follows:
$$d(t_i, o_j) = \sqrt{\sum_{z=1}^{m} \left(t_{iz} - o_{jz}\right)^2}$$
where $d(t_i, o_j)$ is the Euclidean distance from the i-th sample point $t_i$ to the j-th cluster center $o_j$, $m$ is the dimension of the sample point feature vector, $t_{iz}$ is the z-th dimension of the feature vector of the i-th sample point, and $o_{jz}$ is the z-th dimension of the feature vector of the j-th cluster center.
The calculation formula of the cluster center variation is as follows:
where $\omega_c$ is the cluster center change amount at the c-th iteration, $T_{c,i}$ is the i-th cluster at the c-th iteration, $T_{c-1,i}$ is the i-th cluster at the (c-1)-th iteration, $|T_i|$ is the number of data in the i-th cluster, and $t_j$ is the j-th sample point.
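The expression for the cluster center change amount is not reproduced in the text above. One reading consistent with the listed symbols, offered as an assumption rather than as the original expression, takes each cluster center to be the mean of the cluster members and the change amount to be the magnitude of the difference between the centers of two successive iterations. The notation $o(\cdot)$ for the center is introduced here for illustration only.

```latex
o(T_{c,i}) = \frac{1}{|T_{c,i}|} \sum_{t_j \in T_{c,i}} t_j,
\qquad
\omega_c = \left| \, o(T_{c,i}) - o(T_{c-1,i}) \, \right|
```

Under this reading, iteration for the i-th cluster stops once $\omega_c$ falls below the change threshold.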
Specifically, the method for retrieving the travel tracks of vehicles in the urban road network provided by the invention comprises the following steps:
Step 1: Floating-car GPS track data of a certain northern city, for a period in 2018 ending in November, are collected. The track data comprise only parameters such as vehicle ID, time, longitude and latitude, and the data acquisition interval is 10 seconds.
Step 2: Data cleaning is performed on the data from step 1. Records in the original data that share the same vehicle ID, longitude and latitude but differ in time are deleted, as are data that do not conform to the continuity of the track data (single track points with excessive deviation).
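The two cleaning rules in step 2 can be sketched as follows. This is an illustrative reading only: the record layout `(vehicle_id, t, lon, lat)`, the function name `clean_track`, and the deviation threshold `max_jump` (in degrees) are assumptions, and a point is treated as an outlier here only when it is far from both of its neighbours on the same vehicle's track.

```python
from math import hypot


def clean_track(records, max_jump=0.01):
    """records: list of (vehicle_id, t, lon, lat), sorted by vehicle_id then t."""
    # Rule 1: drop records that repeat the same (vehicle, lon, lat) at a
    # different time, keeping the earliest occurrence.
    seen = set()
    deduped = []
    for vid, t, lon, lat in records:
        key = (vid, lon, lat)
        if key in seen:
            continue
        seen.add(key)
        deduped.append((vid, t, lon, lat))

    # Rule 2: drop single points that deviate strongly from both neighbours
    # of the same vehicle (assumed interpretation of "excessive deviation").
    cleaned = []
    for k, (vid, t, lon, lat) in enumerate(deduped):
        prev = deduped[k - 1] if k > 0 and deduped[k - 1][0] == vid else None
        nxt = (deduped[k + 1]
               if k + 1 < len(deduped) and deduped[k + 1][0] == vid else None)
        if prev is not None and nxt is not None:
            far_prev = hypot(lon - prev[2], lat - prev[3]) > max_jump
            far_next = hypot(lon - nxt[2], lat - nxt[3]) > max_jump
            if far_prev and far_next:
                continue
        cleaned.append((vid, t, lon, lat))
    return cleaned
```

The first and last points of each vehicle's track are always kept, since they have only one neighbour to compare against.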
Step 3: The spatio-temporal data structure is divided into a temporal part, which is sliced by time, and a spatial part, which is partitioned by clustering. FIG. 6 is a daily data distribution diagram of urban road network vehicle travel tracks provided by the embodiment of the present invention, in which the abscissa represents the time period and the ordinate represents the amount of vehicle track data. As shown in FIG. 6, few vehicles are on the road between 00:00:00 and 07:59:59, so the data in this period form a single slice, and the remaining data are sliced once per hour, completing the time division of the data.
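The slicing rule in step 3 (one slice for 00:00:00 to 07:59:59, then one slice per hour) can be sketched as below. The label format, the function names `slice_label` and `shard_by_time`, and the record layout with the timestamp in position 1 are assumptions made for illustration; the text does not say whether slices are also separated by calendar day, so this sketch keys on time of day only.

```python
from collections import defaultdict
from datetime import datetime


def slice_label(ts: datetime) -> str:
    """Map a timestamp to its time-slice label."""
    if ts.hour < 8:
        return "00-08"                       # low-traffic period kept as one slice
    return f"{ts.hour:02d}-{ts.hour + 1:02d}"  # one slice per hour afterwards


def shard_by_time(records):
    """Group records (vehicle_id, timestamp, lon, lat) by time slice."""
    shards = defaultdict(list)
    for rec in records:
        shards[slice_label(rec[1])].append(rec)
    return dict(shards)
```

Under this rule a day is divided into 17 slices: one for the early-morning period and 16 hourly slices from 08:00 to 24:00.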
Step 4: FIG. 2 is a multidimensional space diagram of travel track data of vehicles in an urban road network provided by the embodiment of the invention, in which t, x and y are the three coordinate axes and $t_1$ to $t_8$ are times. As shown in FIG. 2, the track data from step 2 have spatio-temporal characteristics: they are three-dimensional data composed of two-dimensional track points (x, y) and one-dimensional time (t). The higher the dimension of the travel track data of urban road network vehicles, the lower the retrieval efficiency. Therefore, a space-filling curve is used to convert the high-dimensional spatial data in each slice into a one-dimensional continuous space.
Specifically, Hilbert coding is performed on the travel track data of the vehicles in the urban road network to obtain a Hilbert curve. The Hilbert curve is one of the space-filling curves mentioned above. FIG. 3 is a 1st-order Hilbert code diagram provided in the embodiment of the present invention; FIG. 4 is a 2nd-order Hilbert code diagram; FIG. 5 is a 3rd-order Hilbert code diagram. As shown in FIGS. 3-5, the Hilbert curve recursively divides a square space into 4 subspaces and then connects the center points of the resulting small squares to obtain a one-dimensional continuous space curve; a multi-order Hilbert curve has a better spatial clustering effect.
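As an illustration of the Hilbert coding step, the sketch below uses the widely known iterative coordinate-to-index conversion for a two-dimensional Hilbert curve. It assumes the curve is applied to the two spatial coordinates within a time slice, after quantizing longitude and latitude onto a grid of 2^order by 2^order cells; the function names `to_cell` and `hilbert_index` and the bounding-box parameters are hypothetical and are not taken from the text.

```python
def to_cell(value: float, lo: float, hi: float, order: int) -> int:
    """Quantize a coordinate in [lo, hi] to a grid cell in [0, 2**order - 1]."""
    n = 1 << order
    cell = int((value - lo) / (hi - lo) * n)
    return min(max(cell, 0), n - 1)  # clamp the upper boundary into the grid


def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve of the given order."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d
```

For a 1st-order curve this visits the four cells in the order (0, 0), (0, 1), (1, 1), (1, 0), so cells that are consecutive on the curve are always spatial neighbours, which is the clustering property the text relies on.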
Step 5: Because R-trees often produce a large amount of overlap and dead space in a spatial data index structure, a clustering algorithm is used so that the adjacent data obtained in step 4 are stored under the same subtree, reducing the redundancy of spatial data storage and the I/O seek time.
Specifically, the segmented data are respectively clustered by using a k-means clustering algorithm, so that the division of the data space is completed.
Taking a single slice of data as an example, the clustering algorithm is as follows:
1. Initialize the cluster centers, and assign each datum to the nearest cluster according to the Euclidean distance to obtain a plurality of clusters.
In order to reduce overlap after clustering and achieve a better clustering effect, the absolute error is taken as the clustering measure function. The absolute error is evaluated in each iteration of the cluster division until the value of the clustering measure function converges, and the cluster number k is then determined.
2. In the iterative process, respectively calculate the absolute error (cluster center change amount) between the new cluster center point $o_n(T_i)$ and the original cluster center point $o_l(T_i)$.
In the formula, $\omega_c$ is the change of the cluster center at the c-th iteration, $T_{c,i}$ is the i-th cluster at the c-th iteration, $T_{c-1,i}$ is the i-th cluster at the (c-1)-th iteration, and $|T_i|$ is the number of data in the i-th cluster.
3. Remove from the sample set the data in each cluster whose cluster center change is smaller than the change threshold.
For each cluster whose cluster center change is greater than or equal to the change threshold, take the new cluster center point o_n(T_i) as the cluster center and repeat steps 1-3 until the data of the slice are fully assigned to the k clusters.
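Steps 1-3 can be sketched as follows. This is a simplified illustration rather than the exact procedure of the embodiment: it iterates over all clusters until the largest cluster center shift falls below the change threshold, instead of removing converged clusters from the sample set one by one, and it measures the change of each center as the Euclidean distance between the old and the new center. The names `kmeans`, `threshold` and `max_iter` are assumptions made for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centers, threshold=1e-6, max_iter=100):
    """Simplified k-means: assign, recompute centers, stop when the
    largest center shift is below `threshold` (hypothetical parameter).

    Returns (centers, clusters), where clusters[j] holds the points
    assigned to centers[j].
    """
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            clusters[j].append(p)
        # Step 2: recompute each center as the mean of its members.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))
                ))
            else:
                new_centers.append(old)  # keep an empty cluster's center
        # Step 3: stop once no center moved by `threshold` or more.
        change = max(euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if change < threshold:
            break
    return centers, clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(pts, centers=[(0, 0), (10, 10)]))
```

The initial centers are passed in explicitly so that the example is deterministic; how the embodiment chooses them is not stated.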
Step 6, judging whether the number of data in each cluster is larger than a node capacity threshold value M.
If not, generating leaf nodes of the Hilbert-DR tree according to the cluster position.
If yes, generating an intermediate node according to the cluster position, taking the cluster data as a new clustering object, and calling the dynamic clustering algorithm to cluster the Hilbert values of the cluster data, generating leaf nodes or intermediate nodes under the intermediate node until the number of data in every generated cluster is smaller than M, thereby generating the Hilbert-DR tree. The Hilbert-DR tree has two node structures, the intermediate node and the leaf node, and stores the space-time data according to the storage mode of the HBase database to form a hierarchical index mechanism. The specific structure of the Hilbert-DR tree is shown in FIG. 7, in which t1 to tn are times and m1 to m9 are the data sets stored at the leaf nodes.
An HBase database is used to organize the unique time value corresponding to each vehicle track data set of the first layer, through which the corresponding data set is accessed. Finally, HBase is used to organize the intermediate node information of the Hilbert-DR tree; the intermediate node information stores the maximum Hilbert value of the data at the leaf nodes, realizing an indexed storage structure, and stores the time attribute of the data set.
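The recursive node generation of step 6 can be sketched in memory as follows. The sketch makes several assumptions that are not part of the embodiment: nodes are plain dictionaries rather than HBase rows, the data of a slice are represented only by their Hilbert values, and the clustering call is replaced by a stand-in (`split_sorted`) that cuts the sorted Hilbert values into k contiguous groups. Each node records the maximum Hilbert value below it, as described for the intermediate node information.

```python
def split_sorted(values, k):
    """Stand-in for the clustering step (an assumption): cut the sorted
    Hilbert values into at most k contiguous, non-empty groups."""
    values = sorted(values)
    size = -(-len(values) // k)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def build_node(values, capacity, k=2):
    """Build a subtree over Hilbert values.

    A cluster with fewer than `capacity` (threshold M) items becomes a
    leaf; otherwise it becomes an intermediate node whose data are
    clustered again.
    """
    if capacity < 2 or k < 2:
        raise ValueError("capacity and k must both be at least 2")
    if not values:
        raise ValueError("cannot build a node from an empty cluster")
    if len(values) < capacity:
        return {"leaf": True, "max_hilbert": max(values), "data": sorted(values)}
    children = [build_node(group, capacity, k) for group in split_sorted(values, k)]
    return {
        "leaf": False,
        "max_hilbert": max(child["max_hilbert"] for child in children),
        "children": children,
    }


def leaves(node):
    """Collect the data sets stored at the leaf nodes, left to right."""
    if node["leaf"]:
        return [node["data"]]
    return [data for child in node["children"] for data in leaves(child)]


if __name__ == "__main__":
    tree = build_node(list(range(10)), capacity=4)
    print(leaves(tree))
```

Requiring a capacity and a group count of at least 2 guarantees that every split produces strictly smaller groups, so the recursion terminates.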
The spatial clustering and the time attribute of the data set are combined through the steps, a hierarchical index architecture of the Hilbert-DR tree is established, and the hierarchical index architecture is applied to the retrieval of the vehicle travel track. Time slices are searched through time indexes, the spatial clustering information of the time slices is determined, and then the target object is located by utilizing the efficient Hilbert-DR tree.
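The two-stage lookup described above (time index first, then the tree of that time slice) can be sketched as follows. The node layout (dictionaries with `leaf`, `max_hilbert`, `children` and `data` keys), the record format (Hilbert value, vehicle identifier) and the function names are hypothetical and serve only to illustrate the traversal; the embodiment stores this information in HBase.

```python
def find_leaf(node, h):
    """Descend to the leaf whose Hilbert range may contain h.

    Children are assumed to be ordered by ascending max_hilbert, so the
    first child with h <= max_hilbert is the only candidate.
    """
    while not node["leaf"]:
        for child in node["children"]:
            if h <= child["max_hilbert"]:
                node = child
                break
        else:
            return None  # h is larger than every stored Hilbert value
    return node


def retrieve(time_index, time_slice, h):
    """Return the records of `time_slice` whose Hilbert value equals h.

    `time_index` maps a time-slice key to the root node of that slice.
    """
    root = time_index.get(time_slice)
    if root is None:
        return []
    leaf = find_leaf(root, h)
    if leaf is None:
        return []
    return [record for record in leaf["data"] if record[0] == h]


if __name__ == "__main__":
    index = {
        "08:00-09:00": {
            "leaf": False,
            "max_hilbert": 9,
            "children": [
                {"leaf": True, "max_hilbert": 4, "data": [(1, "A"), (4, "B")]},
                {"leaf": True, "max_hilbert": 9, "data": [(7, "C"), (9, "A")]},
            ],
        }
    }
    print(retrieve(index, "08:00-09:00", 7))
```

Because each intermediate node keeps only the maximum Hilbert value of its subtree, a lookup visits one node per level instead of scanning every leaf of the time slice.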
The invention provides a retrieval method of urban road network vehicle travel tracks, in which the fragmentation classifies the data by time and the clustering is based on the two-dimensional (spatial) coordinates, the purpose being to group adjacent points into one cluster. Hilbert coding is performed on the clusters, each cluster is approximately represented by its MBR (Minimum Bounding Rectangle), the MBRs are sorted in ascending order, and the coded clusters are clustered again, so that data adjacent in two-dimensional coordinates are stored at close positions, the data storage distribution is uniform, and the retrieval efficiency is improved; meanwhile, the number of nodes is reduced and the storage cost is reduced.
Fig. 8 is a schematic structural diagram of a retrieval system for travel tracks of vehicles in an urban road network according to an embodiment of the present invention, and as shown in fig. 8, the present invention provides a retrieval system for travel tracks of vehicles in an urban road network, including:
a time-space data set construction module 801, configured to acquire travel trajectory data of vehicles in an urban road network, and construct a time-space data set;
a Hilbert-DR tree construction module 802 for constructing and storing a Hilbert-DR tree from the spatio-temporal data sets;
and the retrieval module 803 is used for inputting retrieval conditions, traversing the Hilbert-DR tree according to the retrieval conditions, and determining the urban road network vehicle travel track data set corresponding to the retrieval conditions.
The Hilbert-DR tree construction module 802 specifically includes:
a fragmented data set determining unit, configured to fragment the time-space data set according to time periods to obtain a plurality of fragmented data sets;
the assignment unit is used for enabling the value of i to be 1;
a corresponding cluster determining unit, configured to use the multiple fragmented data sets as corresponding clusters of the multiple ith-level intermediate nodes, respectively;
the included cluster determining unit is used for carrying out clustering processing on the corresponding cluster set of each ith-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of included cluster sets of each ith-level intermediate node;
the first judging unit is used for respectively judging whether each cluster contained in each ith-level intermediate node meets the leaf node generating condition; the leaf node generating condition is that the number of data in the cluster is smaller than a node capacity threshold value;
the leaf node generating unit is used for taking an included cluster meeting the leaf node generating condition as a leaf node of the ith-level intermediate node where that included cluster is located;
the intermediate node generating unit is used for taking an included cluster which does not meet the leaf node generating condition as a corresponding cluster of the (i + 1)th-level intermediate node under the ith-level intermediate node where that included cluster is located;
and the Hilbert-DR tree determining unit is used for increasing the value of i by 1 and returning to the step of performing clustering processing on the corresponding cluster set of each ith-level intermediate node by adopting a k-means clustering algorithm to obtain a plurality of included cluster sets of each ith-level intermediate node, until every included cluster meets the leaf node generation condition, so as to obtain the Hilbert-DR tree.
The Hilbert-DR tree construction module further comprises a Hilbert coding unit, which is used for performing Hilbert coding on the data in each partitioned data set to obtain a plurality of coded partitioned data sets.
The included cluster determining unit includes:
a cluster center determining subunit, configured to determine a plurality of cluster centers of the corresponding cluster set of the n-th intermediate-level node; n = 1, 2, ..., N; N is the number of the coded partitioned data sets;
the Euclidean distance calculating subunit is used for calculating the Euclidean distance between the data in the corresponding cluster set of the n-th intermediate-level node and each cluster center;
the data distribution subunit is used for distributing, according to the Euclidean distance, the data in the corresponding cluster set of the n-th intermediate-level node to the cluster whose cluster center has the minimum Euclidean distance;
a cluster center variation calculating subunit, configured to calculate a cluster center variation of each cluster after data distribution;
and the included cluster determining subunit is used for updating the cluster centers of the clusters with the cluster center change value larger than or equal to the change threshold value, and returning to the step of calculating the Euclidean distance between the data in the n-th coded partitioned data set and each cluster center until the change value of all the cluster centers is smaller than the change threshold value, so as to obtain a plurality of included cluster sets.
Specifically, the calculation formula of the Euclidean distance is as follows:
$$d(t_i, o_j) = \sqrt{\sum_{z=1}^{m} \left(t_{iz} - o_{jz}\right)^2}$$
in the formula, $d(t_i, o_j)$ is the Euclidean distance from the i-th sample point $t_i$ to the j-th cluster center $o_j$, $t_i$ is the i-th sample point, $o_j$ is the j-th cluster center, $m$ is the dimension of the sample point feature vector, $t_{iz}$ is the z-th dimension of the i-th sample point feature vector, and $o_{jz}$ is the z-th dimension of the feature vector of the j-th cluster center.
The calculation formula of the cluster center variation is as follows:
in the formula, $\omega_c$ is the change of the cluster center at the c-th iteration, $T_{c,i}$ is the i-th cluster at the c-th iteration, $T_{c-1,i}$ is the i-th cluster at the (c-1)-th iteration, $|T_i|$ is the number of data in the i-th cluster, and $t_j$ is the j-th sample point.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.