Semantic track query method based on activity influence
1. A semantic track query method based on activity influence, characterized by comprising the following steps:
S1: acquiring basic information of a semantic track data set D, wherein the basic information comprises the track number of every semantic track T in the semantic track data set D and the activity influence of the keywords corresponding to every track point p on every semantic track T; each track point corresponds to at least one keyword, the keywords corresponding to the track points are not all identical, and the activity influence of the same keyword at different track points is not all identical; the activity influence of a keyword at any track point is the occurrence frequency of the keyword at that track point;
S2: constructing a hybrid grid index HGI based on the basic information of the semantic track data set D, wherein the hybrid grid index HGI consists of a hybrid quadtree index HQ-tree and a hybrid inverted index HIF;
S3: for a query requirement Q = (loc, acts, $d_{max}$) given by a user, finding, based on the hybrid grid index HGI, the k semantic tracks in the semantic track data set D that best match the query requirement Q, wherein loc is the query position set by the user, acts is the keyword set specified by the user, $d_{max}$ is the maximum expected travel distance set by the user, and k is at least 3.
2. The activity influence-based semantic track query method according to claim 1, wherein the method for constructing the hybrid quadtree index HQ-tree in step S2 is as follows:
S21: dividing the real geographic area in which the semantic track data set D is located into d levels of grids, wherein the d-th level comprises $2^{d-1} \times 2^{d-1}$ grids, d is at least 3, and each grid corresponds to a grid number;
S22: determining, for each grid of each level, whether it contains a track point, eliminating the grids that contain no track point, and constructing a quadtree from the remaining grids according to the hierarchical containment relation of the grids, wherein nodes without child nodes are leaf nodes and the remaining nodes are non-leaf nodes;
S23: associating an index entry and bitmap information with each node of the quadtree, specifically:
for any non-leaf node $g_0$, the associated index entry is a triple ($Gid_0$, {g' ∈ $g_0$.subgrids}, ifile), where $Gid_0$ represents the number of the grid corresponding to the non-leaf node $g_0$, {g' ∈ $g_0$.subgrids} represents a list of pointers to all child nodes of the non-leaf node $g_0$, and ifile represents a first inverted index of the keywords corresponding to all track points located in the grid corresponding to the non-leaf node $g_0$, the first inverted index comprising the following information: the activity influence of each keyword in the grid corresponding to the non-leaf node $g_0$, and the grid numbers corresponding to the child nodes of $g_0$ in which each keyword appears;
for any leaf node $g_1$, the associated index entry is a pair ($Gid_1$, ilist), where $Gid_1$ represents the number of the grid corresponding to the leaf node $g_1$, and ilist represents a second inverted index of the keywords corresponding to all track points in the grid corresponding to the leaf node $g_1$, the second inverted index comprising the following information: the activity influence of each keyword in the grid corresponding to the leaf node $g_1$, and the numbers of the semantic tracks in which each keyword appears;
for any node of the quadtree, the bitmap information is a data sequence consisting of 0 and 1, and each data bit on the data sequence corresponds to a semantic track, wherein 0 indicates that the semantic track corresponding to the data bit does not pass through the grid corresponding to the current node, and 1 indicates that the semantic track corresponding to the data bit passes through the grid corresponding to the current node.
3. The semantic track query method based on activity influence according to claim 2, wherein the activity influence of the keyword in the grid corresponding to any node is calculated by:
acquiring the activity influence of the keyword at all track points in the grid corresponding to the current node, and taking the maximum value as the activity influence of the keyword in the grid corresponding to the current node.
4. The method according to claim 1, wherein the hybrid inverted index HIF in step S2 is composed of an activity list, an activity inverted list and a sub-track length list, which correspond to each semantic track in the semantic track dataset D, wherein the activity list is used to store keywords corresponding to all track points on the semantic track, the activity inverted list is used to store association relations between the keywords and track points where the keywords appear, and the sub-track length list is used to store lengths from each track point to a start track point on the semantic track.
5. The activity influence-based semantic track query method according to claim 1, wherein finding, in step S3, the k semantic tracks in the semantic track data set D that best match the query requirement Q based on the hybrid grid index HGI specifically comprises:
S31: setting up a heap and traversing the hybrid quadtree index HQ-tree from top to bottom: first putting the root node into the heap and acquiring the child nodes of the root node, then removing the root node from the heap and adding to the heap those child nodes of the root node that satisfy the set pruning rule;
S32: finding the node with the maximum activity influence in the current heap, removing that node from the heap, adding to the heap those child nodes of the node that satisfy the set pruning rule, and repeating this step until a leaf node is reached;
S33: according to the index entry associated with the leaf node reached by the traversal, obtaining the semantic tracks in the leaf node that contain keywords of the keyword set acts specified by the user, and taking the obtained semantic tracks as tracks to be verified; acquiring the activity influence of each track to be verified, and taking the k tracks with the largest activity influence as candidate tracks;
S34: removing the leaf node of step S33 from the heap, and repeating steps S32 to S33 on the current heap until the early termination condition is met or all nodes have been traversed, the k candidate tracks finally obtained being the semantic tracks that best match the query requirement Q = (loc, acts, $d_{max}$) given by the user.
6. The activity influence-based semantic track query method according to claim 5, wherein the pruning rule set in step S31 is: the distance between at least one track point in the grid corresponding to the child node and the query position loc set by the user is not more than $d_{max}$; the activity influence of the child node is not 0; and at least one of the semantic tracks passing through the grid corresponding to the child node has not been visited.
7. The activity influence-based semantic track query method according to claim 5, wherein the activity influence of any node in step S32 is calculated by:
finding, among all keywords corresponding to the node, the keywords that belong to the keyword set acts specified by the user; summing the maximum activity influence in the node of each keyword found, and dividing the sum by the total number of keywords in the keyword set acts to obtain the activity influence of the node.
8. The semantic track query method according to claim 5, wherein the activity influence of each track to be verified in step S33 is calculated as follows:
S331: initializing the window endpoints s and e, the sub-track extracted by the window each time being denoted T[s, e];
S332: initially fixing the left endpoint s at the first track point, and advancing the right endpoint e from the first track point along the direction of the track to be verified; each time the right endpoint advances, extracting a sub-track T[s, e] and determining whether the distance between the currently extracted sub-track T[s, e] and the query position loc set by the user is greater than $d_{max}$; if so, keeping the right endpoint e at its current position and advancing the left endpoint s until the distance between the sub-track T[s, e] and the query position loc is again not more than $d_{max}$, and then calculating the activity influence of the sub-track T[s, e] according to the following formula; if not, directly calculating the activity influence of the sub-track T[s, e] according to the following formula:

$$Inf(T[s,e],Q) = \frac{1}{|Q.acts|} \sum_{w \in Q.acts} \frac{\max_{p \in T[s,e],\, w \in p.act} Inf_{p.poi}(w)}{Inf_{max}(w)}$$

wherein Inf(T[s, e], Q) denotes the activity influence of the sub-track T[s, e], |Q.acts| represents the total number of keywords in the keyword set acts, $Inf_{p.poi}(w)$ represents the activity influence of a keyword that appears in the sub-track T[s, e] and belongs to the keyword set acts, $Inf_{max}(w)$ represents the maximum activity influence, over all track points, of the keyword belonging to the keyword set acts, Q.acts is the keyword set specified by the user, w is a keyword belonging to Q.acts, and p is a track point appearing in the sub-track T[s, e];
S333: keeping the left endpoint s at its current position, continuing to advance the right endpoint e along the direction of the track to be verified, continuing to apply the condition judgment to each extracted sub-track and obtaining the activity influence of the sub-tracks that satisfy the condition, and so on until the right endpoint e reaches the last track point;
S334: taking the maximum of the activity influences obtained for all sub-tracks as the activity influence of the current track to be verified.
9. The activity influence-based semantic track query method according to claim 8, wherein the distance between the sub-track T[s, e] and the query position loc set by the user in step S332 is calculated as:

$$d(T[s,e],Q) = dist(Q.loc, p_s) + \sum_{i=s}^{e-1} dist(p_i, p_{i+1})$$

wherein d(T[s, e], Q) is the distance between the sub-track T[s, e] and the query position loc set by the user, dist(Q.loc, $p_s$) is the Euclidean distance between the query position loc and the left endpoint $p_s$, and dist($p_i$, $p_{i+1}$) is the Euclidean distance between two adjacent track points $p_i$ and $p_{i+1}$ of the sub-track T[s, e], $p_i$ being any track point of the sub-track other than the right endpoint $p_e$.
10. The activity impact-based semantic track query method according to claim 5, wherein the early termination condition in step S34 is: the first sum is smaller than the minimum value of the track activity influence in the k candidate tracks obtained by current calculation, wherein the calculation method of the first sum is as follows:
taking each keyword in the keyword set acts specified by the user in turn as the current keyword and performing the following operation: finding the maximum activity influence of the current keyword over all track points corresponding to all nodes of the current heap, and taking this maximum as the maximum influence corresponding to the current keyword;
summing the maximum influences corresponding to all keywords in the keyword set acts, and dividing the sum by the total number of keywords in the keyword set acts to obtain the first sum.
Background
With the continuous development of mobile social networks and location-based service applications, a great deal of semantic track data is formed. The semantic track not only contains longitude and latitude and timestamp information in the traditional spatiotemporal track, but also is accompanied by text information describing user behavior activities. Query studies on semantic track data have generated many application values, for example: recommending friends with the same interest or similar habits in a social network, providing suitable advertising targets or selecting suitable advertising topics for merchants in advertising marketing, providing personalized travel route recommendations for tourists on a travel application, and the like.
Efficient querying of trajectories relies on an efficient indexing mechanism, and IR-trees are often used to index semantic trajectory data. The essence of the IR-Tree is to extend the inverted index on the basis of the R-Tree, and at each node of the IR-Tree, there is a pointer to the inverted file describing all the keywords contained within the minimum bounding rectangle represented by the node. Each inverted file associated with the intermediate node records all sub-nodes with the keywords; the inverted list in the inverted file associated with the leaf node records the specific track point. There are also several variants of IR trees, such as DIR trees, CIR trees, etc. In the construction process of the DIR tree, the spatial information and the semantic information of the spatial text object are considered at the same time, so that the objects in the same node are similar as much as possible in semantic attributes. The CIR tree divides the objects into different clusters according to the spatial proximity of all the objects in the nodes, and records the distribution condition of the keywords in each cluster.
The existing patent 'Reverse nearest neighbor query method and device based on semantic track big data' designs a query method for reverse nearest neighbors in semantic track data, but that method cannot be applied to the query problem addressed by the present invention.
In order to realize fast query processing of semantic tracks, a plurality of difficulties still exist at present: (1) with the continuous development of location-based service applications, users can put forward more personalized and diversified query requirements, and the traditional query technology cannot meet increasingly complex query requests. (2) When the index structures such as IR-Tree and the like face massive customized query, the efficiency is not high enough because various information in the semantic track is not fully utilized for pruning. (3) At present, a query processing framework and an optimization mechanism aiming at the traditional track are not suitable for semantic track query, and the existing algorithm framework and the existing optimization mechanism for processing the semantic track query are not completely suitable for the query problem provided by the invention.
Disclosure of Invention
In order to solve the problems, the invention provides a semantic track query method based on activity influence, which can search tracks with greater activity influence preferentially.
A semantic track query method based on activity influence comprises the following steps:
S1: acquiring basic information of a semantic track data set D, wherein the basic information comprises the track number of every semantic track T in the semantic track data set D and the activity influence of the keywords corresponding to every track point p on every semantic track T; each track point corresponds to at least one keyword, the keywords corresponding to the track points are not all identical, and the activity influence of the same keyword at different track points is not all identical; the activity influence of a keyword at any track point is the occurrence frequency of the keyword at that track point;
S2: constructing a hybrid grid index HGI based on the basic information of the semantic track data set D, wherein the hybrid grid index HGI consists of a hybrid quadtree index HQ-tree and a hybrid inverted index HIF;
S3: for a query requirement Q = (loc, acts, $d_{max}$) given by a user, finding, based on the hybrid grid index HGI, the k semantic tracks in the semantic track data set D that best match the query requirement Q, wherein loc is the query position set by the user, acts is the keyword set specified by the user, $d_{max}$ is the maximum expected travel distance set by the user, and k is at least 3.
Further, the method for constructing the hybrid quadtree index HQ-tree in step S2 is as follows:
S21: dividing the real geographic area in which the semantic track data set D is located into d levels of grids, wherein the d-th level comprises $2^{d-1} \times 2^{d-1}$ grids, d is at least 3, and each grid corresponds to a grid number;
S22: determining, for each grid of each level, whether it contains a track point, eliminating the grids that contain no track point, and constructing a quadtree from the remaining grids according to the hierarchical containment relation of the grids, wherein nodes without child nodes are leaf nodes and the remaining nodes are non-leaf nodes;
S23: associating an index entry and bitmap information with each node of the quadtree, specifically:
for any non-leaf node $g_0$, the associated index entry is a triple ($Gid_0$, {g' ∈ $g_0$.subgrids}, ifile), where $Gid_0$ represents the number of the grid corresponding to the non-leaf node $g_0$, {g' ∈ $g_0$.subgrids} represents a list of pointers to all child nodes of the non-leaf node $g_0$, and ifile represents a first inverted index of the keywords corresponding to all track points located in the grid corresponding to the non-leaf node $g_0$, the first inverted index comprising the following information: the activity influence of each keyword in the grid corresponding to the non-leaf node $g_0$, and the grid numbers corresponding to the child nodes of $g_0$ in which each keyword appears;
for any leaf node $g_1$, the associated index entry is a pair ($Gid_1$, ilist), where $Gid_1$ represents the number of the grid corresponding to the leaf node $g_1$, and ilist represents a second inverted index of the keywords corresponding to all track points in the grid corresponding to the leaf node $g_1$, the second inverted index comprising the following information: the activity influence of each keyword in the grid corresponding to the leaf node $g_1$, and the numbers of the semantic tracks in which each keyword appears;
for any node of the quadtree, the bitmap information is a data sequence consisting of 0 and 1, and each data bit on the data sequence corresponds to a semantic track, wherein 0 indicates that the semantic track corresponding to the data bit does not pass through the grid corresponding to the current node, and 1 indicates that the semantic track corresponding to the data bit passes through the grid corresponding to the current node.
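As a minimal sketch of steps S21 to S23, the following Python code assigns track points to the grids of each level and records, for every non-empty grid, a bitmap with one bit per semantic track. The unit-square coordinates, the `(level, row, col)` grid numbering and all function names are assumptions made for illustration; the text above does not fix a numbering scheme.

```python
from collections import defaultdict

def grid_id(x, y, level):
    """Grid containing point (x, y) at a 1-based level.

    Level L splits the unit square into 2**(L-1) x 2**(L-1) cells.
    The (level, row, col) numbering is an assumption of this sketch.
    """
    n = 2 ** (level - 1)
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return (level, row, col)

def build_bitmaps(tracks, levels):
    """Map each non-empty grid to a bitmap; bit i is 1 if track i passes through."""
    bitmaps = defaultdict(int)
    for bit, points in enumerate(tracks):
        for x, y in points:
            for level in range(1, levels + 1):
                bitmaps[grid_id(x, y, level)] |= 1 << bit
    return dict(bitmaps)

def children(gid, bitmaps):
    """Non-empty child grids of gid; empty grids never enter the quadtree."""
    level, row, col = gid
    cand = [(level + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]
    return [c for c in cand if c in bitmaps]

# Two hypothetical tracks given as lists of (x, y) track points.
tracks = [[(0.1, 0.1), (0.6, 0.2)], [(0.7, 0.8)]]
bitmaps = build_bitmaps(tracks, levels=3)
```

Storing the bitmap as a Python integer keeps the "has any unvisited track" test of the later pruning rule down to a single bitwise operation.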
Further, the method for calculating the activity influence of the keyword in the grid corresponding to any node comprises the following steps:
acquiring the activity influence of the keyword at all track points in the grid corresponding to the current node, and taking the maximum value as the activity influence of the keyword in the grid corresponding to the current node.
Further, the hybrid inverted index HIF in step S2 is composed of an activity list, an activity inverted list and a sub-track length list corresponding to each semantic track in the semantic track data set D, where the activity list stores the keywords corresponding to all track points on the semantic track, the activity inverted list stores the association between each keyword and the track points at which the keyword appears, and the sub-track length list stores the length from each track point to the initial track point of the semantic track.
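The three lists of the hybrid inverted index can be sketched for a single track as follows. The input layout (a list of `(position, keyword set)` pairs) and the function names are assumptions for illustration only. Because the sub-track length list holds cumulative lengths from the initial track point, the length of any sub-track follows from one subtraction.

```python
import math

def build_hif_entry(points):
    """Build the three HIF lists for one track.

    points: list of ((x, y), set_of_keywords) in travel order
            (hypothetical input layout).
    """
    activity_list = sorted({w for _, acts in points for w in acts})
    inverted = {w: [] for w in activity_list}   # keyword -> track point indices
    lengths = [0.0]                             # length from the initial point
    for i, (pos, acts) in enumerate(points):
        for w in acts:
            inverted[w].append(i)
        if i > 0:
            lengths.append(lengths[-1] + math.dist(points[i - 1][0], pos))
    return activity_list, inverted, lengths

def sub_track_length(lengths, s, e):
    """Path length between track points s and e (0-based indices)."""
    return lengths[e] - lengths[s]

points = [((0, 0), {"a"}), ((3, 4), {"a", "b"}), ((3, 10), {"c"})]
activity_list, inverted, lengths = build_hif_entry(points)
```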
Further, in step S3, finding k semantic tracks that are most matched with the query requirement Q in the semantic track data set D based on the hybrid grid index HGI specifically includes:
S31: setting up a heap and traversing the hybrid quadtree index HQ-tree from top to bottom: first putting the root node into the heap and acquiring the child nodes of the root node, then removing the root node from the heap and adding to the heap those child nodes of the root node that satisfy the set pruning rule;
S32: finding the node with the maximum activity influence in the current heap, removing that node from the heap, adding to the heap those child nodes of the node that satisfy the set pruning rule, and repeating this step until a leaf node is reached;
S33: according to the index entry associated with the leaf node reached by the traversal, obtaining the semantic tracks in the leaf node that contain keywords of the keyword set acts specified by the user, and taking the obtained semantic tracks as tracks to be verified; acquiring the activity influence of each track to be verified, and taking the k tracks with the largest activity influence as candidate tracks;
S34: removing the leaf node of step S33 from the heap, and repeating steps S32 to S33 on the current heap until the early termination condition is met or all nodes have been traversed, the k candidate tracks finally obtained being the semantic tracks that best match the query requirement Q = (loc, acts, $d_{max}$) given by the user.
Further, the pruning rule set in step S31 is: the distance between at least one track point in the grid corresponding to the child node and the query position loc set by the user is not more than $d_{max}$; the activity influence of the child node is not 0; and at least one of the semantic tracks passing through the grid corresponding to the child node has not been visited.
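The three conditions of the pruning rule can be written as one predicate. The sketch below assumes that the track points of a grid are available as a list of coordinates and that both the node bitmap and the set of already visited tracks are stored as integer bitmaps; the function and parameter names are hypothetical.

```python
import math

def passes_pruning(loc, d_max, grid_points, node_influence, bitmap, visited):
    """Return True if a child node should be added to the heap.

    loc            : query position (x, y)
    grid_points    : track points inside the child's grid (assumed available)
    node_influence : activity influence of the child node for the query
    bitmap         : bit i set if track i passes through the grid
    visited        : bit i set if track i has already been visited
    """
    near = any(math.dist(loc, p) <= d_max for p in grid_points)
    has_influence = node_influence > 0
    has_unvisited = (bitmap & ~visited) != 0
    return near and has_influence and has_unvisited
```

For example, with `bitmap = 0b11` and `visited = 0b01` the second track is still unvisited, so the node survives the third condition.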
Further, the method for calculating the activity influence of any node in step S32 includes:
finding, among all keywords corresponding to the node, the keywords that belong to the keyword set acts specified by the user; summing the maximum activity influence in the node of each keyword found, and dividing the sum by the total number of keywords in the keyword set acts to obtain the activity influence of the node.
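A minimal sketch of the node influence and of selecting the most influential node in step S32 is given below. It assumes that each node's inverted index is available as a dictionary from keyword to the maximum activity influence of that keyword in the node's grid; Python's `heapq` is a min-heap, so the influence is stored negated. All names are illustrative.

```python
import heapq

def node_influence(ifile, q_acts):
    """Sum the per-keyword maxima of the query keywords found in the node,
    then divide by the number of query keywords."""
    if not q_acts:
        return 0.0
    return sum(ifile.get(w, 0) for w in q_acts) / len(q_acts)

def push(heap, gid, ifile, q_acts):
    """Add a node to the heap, keyed by negated influence (max-heap behaviour)."""
    heapq.heappush(heap, (-node_influence(ifile, q_acts), gid))

def pop_most_influential(heap):
    """Remove and return (grid id, influence) of the most influential node."""
    neg_inf, gid = heapq.heappop(heap)
    return gid, -neg_inf

q_acts = ["w1", "w2", "w4"]
heap = []
push(heap, "g1", {"w1": 1, "w2": 2, "w3": 5}, q_acts)  # (1 + 2 + 0) / 3 = 1.0
push(heap, "g2", {"w1": 3, "w4": 3}, q_acts)           # (3 + 0 + 3) / 3 = 2.0
```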
Further, the method for calculating the activity influence of each to-be-verified track in step S33 is as follows:
S331: initializing the window endpoints s and e, the sub-track extracted by the window each time being denoted T[s, e];
S332: initially fixing the left endpoint s at the first track point, and advancing the right endpoint e from the first track point along the direction of the track to be verified; each time the right endpoint advances, extracting a sub-track T[s, e] and determining whether the distance between the currently extracted sub-track T[s, e] and the query position loc set by the user is greater than $d_{max}$; if so, keeping the right endpoint e at its current position and advancing the left endpoint s until the distance between the sub-track T[s, e] and the query position loc is again not more than $d_{max}$, and then calculating the activity influence of the sub-track T[s, e] according to the following formula; if not, directly calculating the activity influence of the sub-track T[s, e] according to the following formula:

$$Inf(T[s,e],Q) = \frac{1}{|Q.acts|} \sum_{w \in Q.acts} \frac{\max_{p \in T[s,e],\, w \in p.act} Inf_{p.poi}(w)}{Inf_{max}(w)}$$

wherein Inf(T[s, e], Q) denotes the activity influence of the sub-track T[s, e], |Q.acts| represents the total number of keywords in the keyword set acts, $Inf_{p.poi}(w)$ represents the activity influence of a keyword that appears in the sub-track T[s, e] and belongs to the keyword set acts, $Inf_{max}(w)$ represents the maximum activity influence, over all track points, of the keyword belonging to the keyword set acts, Q.acts is the keyword set specified by the user, w is a keyword belonging to Q.acts, and p is a track point appearing in the sub-track T[s, e];
S333: keeping the left endpoint s at its current position, continuing to advance the right endpoint e along the direction of the track to be verified, continuing to apply the condition judgment to each extracted sub-track and obtaining the activity influence of the sub-tracks that satisfy the condition, and so on until the right endpoint e reaches the last track point;
S334: taking the maximum of the activity influences obtained for all sub-tracks as the activity influence of the current track to be verified.
Further, in step S332, the distance between the sub-track T[s, e] and the query position loc set by the user is calculated as:

$$d(T[s,e],Q) = dist(Q.loc, p_s) + \sum_{i=s}^{e-1} dist(p_i, p_{i+1})$$

wherein d(T[s, e], Q) is the distance between the sub-track T[s, e] and the query position loc set by the user, dist(Q.loc, $p_s$) is the Euclidean distance between the query position loc and the left endpoint $p_s$, and dist($p_i$, $p_{i+1}$) is the Euclidean distance between two adjacent track points $p_i$ and $p_{i+1}$ of the sub-track T[s, e], $p_i$ being any track point of the sub-track other than the right endpoint $p_e$.
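The sliding-window procedure of steps S331 to S334 can be sketched as follows. Several points are assumptions of this sketch rather than statements of the source: indices are 0-based, each track point is given as a `(position, {keyword: influence})` pair, a window that lacks any query keyword is scored 0 (following the matching condition of Definition 5), and the per-keyword influence inside a window is taken as the maximum over its track points, divided by $Inf_{max}(w)$, which is consistent with the worked example (1/1 + 1/2 + 1/1)/3.

```python
import math

def sub_track_distance(loc, pts, s, e):
    """dist(loc, p_s) plus the hop lengths from p_s to p_e."""
    hops = sum(math.dist(pts[i], pts[i + 1]) for i in range(s, e))
    return math.dist(loc, pts[s]) + hops

def sub_track_influence(track, s, e, q_acts, inf_max):
    """Normalized influence of T[s, e]; 0.0 if a query keyword is missing."""
    total = 0.0
    for w in q_acts:
        vals = [infl[w] for _, infl in track[s:e + 1] if w in infl]
        if not vals:
            return 0.0  # window does not contain every query keyword
        total += max(vals) / inf_max[w]
    return total / len(q_acts)

def track_influence(loc, q_acts, d_max, track, inf_max):
    """Best influence over the windows visited by the two-pointer scan."""
    pts = [pos for pos, _ in track]
    best, s = 0.0, 0
    for e in range(len(track)):
        # Advance the left endpoint while the window is too far from loc.
        while s <= e and sub_track_distance(loc, pts, s, e) > d_max:
            s += 1
        if s <= e:  # a window ending at e fits within d_max
            best = max(best, sub_track_influence(track, s, e, q_acts, inf_max))
    return best

track = [((10, 0), {"w3": 1}),
         ((1, 0), {"w1": 1, "w2": 1}),
         ((2, 0), {"w3": 1, "w4": 1})]
inf_max = {"w1": 1, "w2": 2, "w4": 1}
result = track_influence((0, 0), ["w1", "w2", "w4"], 5.0, track, inf_max)
```

Because the left endpoint only moves forward, each track point enters and leaves the window at most once; the scan is a heuristic and does not enumerate every possible sub-track.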
Further, the early termination condition in step S34 is: the first sum is smaller than the minimum value of the track activity influence in the k candidate tracks obtained by current calculation, wherein the calculation method of the first sum is as follows:
taking each keyword in the keyword set acts specified by the user in turn as the current keyword and performing the following operation: finding the maximum activity influence of the current keyword over all track points corresponding to all nodes of the current heap, and taking this maximum as the maximum influence corresponding to the current keyword;
summing the maximum influences corresponding to all keywords in the keyword set acts, and dividing the sum by the total number of keywords in the keyword set acts to obtain the first sum.
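The early termination test can be sketched as below. The sketch assumes that every node still in the heap exposes a dictionary from keyword to its maximum activity influence in that node, and that these values are on the same scale as the candidate track influences they are compared with; the text does not spell out this scaling, so it is an assumption. Function names are hypothetical.

```python
def first_sum(heap_nodes, q_acts):
    """Upper bound on what any remaining node can still contribute.

    heap_nodes: list of dicts, keyword -> max influence in that node.
    """
    total = 0.0
    for w in q_acts:
        total += max((node.get(w, 0) for node in heap_nodes), default=0)
    return total / len(q_acts)

def can_stop(heap_nodes, q_acts, candidate_scores, k):
    """True if the bound is below the weakest of the current k candidates."""
    if len(candidate_scores) < k:
        return False  # fewer than k candidates found so far
    kth_best = sorted(candidate_scores, reverse=True)[k - 1]
    return first_sum(heap_nodes, q_acts) < kth_best

nodes = [{"w1": 1, "w2": 0.5}, {"w2": 1}]
acts = ["w1", "w2", "w4"]
```

With these hypothetical nodes the bound is (1 + 1 + 0)/3, so the search may stop once the k-th best candidate exceeds about 0.667.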
Beneficial effects:
1. The invention provides a semantic track query method based on activity influence and studies in depth the index structure, query processing algorithm and query optimization technology for semantic track data. Specifically, the invention introduces the concept of activity influence in semantic track data and, based on this concept, defines a semantic track query requirement based on activity influence. To process semantic track queries efficiently, the invention designs a hybrid grid index structure that integrates the spatial position, keyword and activity influence information of semantic tracks; this structure has stronger pruning capability than existing index techniques, and based on it tracks with greater activity influence can be searched preferentially.
2. The invention provides a semantic track query method based on activity influence that designs and implements an efficient heuristic search method on top of the hybrid grid index structure: the index is traversed by means of a heap to quickly find the k semantic tracks in the semantic track data set that best match the query requirement Q and contain the keywords required by the user. In other words, the invention preferentially matches the k tracks with the greatest activity influence within the user-specified distance threshold, which greatly improves the processing efficiency of track queries.
3. The invention provides a semantic track query method based on activity influence, which introduces an early termination condition in a heuristic search process and can accelerate a query processing process.
Drawings
FIG. 1 is a flow chart of a semantic track query method based on activity influence according to the present invention;
FIG. 2 is a schematic diagram of a semantic track dataset based on activity influence according to the present invention;
FIG. 3 is a schematic diagram of two-dimensional geospatial meshing provided by the present invention;
FIG. 4(a) is a schematic diagram of a quad-tree provided by the present invention;
FIG. 4(b) is a diagram illustrating an index entry of a non-leaf node according to the present invention;
FIG. 4(c) is a diagram illustrating an index entry of a leaf node according to the present invention;
FIG. 5 is a schematic diagram of a hybrid inverted file provided by the present invention;
FIG. 6 is a schematic diagram of the heuristic search framework provided by the present invention;
FIG. 7 is a schematic structural diagram of a main stack and an auxiliary stack provided by the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The invention provides and solves a novel semantic track query in large-scale semantic track data, and deeply researches an index structure, a query processing algorithm and a query optimization technology of the semantic track data. Specifically, the invention provides a concept of Activity Influence in semantic track data, and defines a semantic track Query model (AITQ) based on the Activity Influence according to the concept. In order to realize the efficient processing of the query, the invention designs a mixed grid index structure which simultaneously integrates multiple information of semantic track space positions, activity keywords and activity influence, and designs and realizes a heuristic search framework based on the index, wherein the framework can preferentially search tracks with larger activity influence. In the heuristic search process, the invention introduces an early termination condition to accelerate the query processing process.
First, the symbols and meanings adopted in the present invention as shown in table 1 are given.
TABLE 1 symbols and meanings of the invention
Some basic definitions are given below:
definition 1: a semantic track T containing n POI (Point Of Interest) positions is defined as T ═ p1,p2,…,pnIn which p isiPosi represents an arbitrary two-dimensional spatial location that can be semantically labeled, act represents a textual record consisting of some keywords describing the user's behavioral activity, and t represents a timestamp.
Definition 2: Given a semantic track T = ($p_1$, $p_2$, …, $p_n$) composed of n track points, the point sequence ($p_s$, $p_{s+1}$, …, $p_e$) formed by consecutive track points, where 1 ≤ s ≤ e ≤ n, is called a sub-track of T and is denoted T[s, e]. W[s, e] denotes all activity keywords contained in the sub-track T[s, e], i.e. $W[s,e] = \bigcup_{i=s}^{e} p_i.act$.
Definition 3: A user-specified query is given as Q = (loc, acts, $d_{max}$), where loc represents the query location expressed in latitude and longitude coordinates; acts is a set of activity keywords describing the activities that the user intends to perform; and $d_{max}$ is the maximum distance the user expects to travel, starting from the query location, to complete all activities.
Definition 4 (distance between a query and a sub-track): Given a query Q = (loc, acts, $d_{max}$) and a sub-track T[s, e] of a semantic track T, the distance between the query Q and the sub-track T[s, e] is denoted d(T[s, e], Q), and

$$d(T[s,e],Q) = dist(Q.loc, p_s) + \sum_{i=s}^{e-1} dist(p_i, p_{i+1}) \quad (1)$$

where dist denotes the Euclidean distance between two positions.
Definition 5: Given a semantic track T and a query Q = (loc, acts, $d_{max}$), if a sub-track T[s, e] of the track T satisfies (1) $Q.acts \subseteq W[s,e]$ and (2) $d(T[s,e],Q) \le d_{max}$, then the sub-track T[s, e] is said to match the query Q.
Definition 6: given a query Q, if there is at least one sub-trajectory T [ s, e ] in a trajectory T in the semantic trajectory data set D that matches the query Q, the trajectory T is referred to as a candidate trajectory for the query Q.
In FIG. 2, the query activities currently submitted by the user are {$w_1$, $w_2$, $w_4$}, and d is the maximum travel distance. Track $T_1$ does not contain activity $w_4$, so $T_1$ has no sub-track matching the query and is not a candidate trajectory for the query. In $T_2$, only the sub-track $T_2[1,5]$ meets the activity requirement of the query, but the figure shows that the distance $d(T_2[1,5], Q)$ between this sub-track and Q is large; assuming $d(T_2[1,5], Q) > d$, $T_2$ is not a candidate trajectory either. In track $T_3$, the sub-track $T_3[3,4]$ matches Q, because its activity set $W[3,4] = \{w_1, w_2, w_3, w_4\}$ contains the query activity set $\{w_1, w_2, w_4\}$ and its distance to Q is small enough to satisfy the travel limit (i.e. $d(T_3[3,4], Q) < d$); therefore $T_3$ is a candidate trajectory for Q.
Definition 7: Given a semantic track data set D, consider the set of all POIs in D and the set of all activity keywords in D. For an activity represented by a keyword w from the keyword set, the influence of activity w at a given poi is defined as the number of semantic tracks that contain the activity keyword w at this poi, i.e.:

$$Inf_{poi}(w) = \left|\{\, T \in D \mid \exists\, p \in T:\ p.poi = poi \ \wedge\ w \in p.act \,\}\right| \quad (2)$$
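Counting the influence of Definition 7 amounts to collecting, for every (poi, keyword) pair, the distinct tracks in which the keyword occurs at that POI. The sketch below assumes tracks are given as lists of `(poi, keyword set)` pairs with hashable POI identifiers; this layout and the function name are illustrative only.

```python
from collections import defaultdict

def activity_influence(dataset):
    """Return {(poi, keyword): number of distinct tracks with keyword at poi}.

    dataset: list of tracks; each track is a list of (poi, set_of_keywords).
    """
    seen = defaultdict(set)  # (poi, keyword) -> indices of contributing tracks
    for tid, track in enumerate(dataset):
        for poi, acts in track:
            for w in acts:
                seen[(poi, w)].add(tid)
    return {key: len(tids) for key, tids in seen.items()}

# Hypothetical data: the second track visits POI "A" twice with keyword w2,
# but it is counted only once.
dataset = [
    [("A", {"w1", "w2"}), ("B", {"w2"})],
    [("A", {"w2"}), ("A", {"w2"})],
]
inf = activity_influence(dataset)
```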
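Definition 8: given a query $Q = (loc, acts, d_{max})$ and a candidate track T, if sub-track $T[s, e]$ matches Q, the activity influence of the sub-track under Q is defined as:

$$Inf(T[s,e],Q) = \frac{1}{|Q.acts|} \sum_{w \in Q.acts} \max_{p \in T[s,e]} \frac{Inf_{p.poi}(w)}{Inf_{max}(w)}$$

where the maximum is taken over the track points p of $T[s, e]$ at which w appears,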
wherein $Inf_{max}(w)$ is the maximum influence of activity w over all POIs, i.e. $Inf_{max}(w) = \max_{poi} Inf_{poi}(w)$. Because the influences of the activities in query Q may differ greatly, the sub-track activity influence could otherwise be dominated by a few activities; dividing by $Inf_{max}(w)$ yields relative activity influences, so that the influences of all query activities are aggregated without bias and the activity influence of a sub-track is normalized to [0, 1]. A track T may contain several sub-tracks that match Q, and the activity influence of candidate track T under Q is defined as:
$$Inf(T,Q) = \max_{1 \le s \le e \le n,\ T[s,e] \text{ matches } Q} Inf(T[s,e],Q) \quad (4)$$
Referring to FIG. 2, the given query Q has only one candidate trajectory, $T_3$, and among all sub-tracks of $T_3$ only $T_3[3, 4]$ matches Q. In the data set consisting of these three tracks, the maximum activity influences of the keywords $w_1$, $w_2$ and $w_4$ are $Inf_{max}(w_1) = 1$, $Inf_{max}(w_2) = 2$ and $Inf_{max}(w_4) = 1$, respectively.
According to Definition 8, $Inf(T_3, Q) = Inf(T_3[3, 4], Q) = (1/1 + 1/2 + 1/1)/3 \approx 0.833$, i.e. the activity influence of candidate trajectory $T_3$ under query Q is 0.833.
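The arithmetic of this example can be checked with a short Python sketch. The per-point influence values below are hypothetical and were chosen only so that the three terms 1/1, 1/2 and 1/1 of the example are reproduced; the function name and data layout are likewise assumptions.

```python
def sub_track_influence(point_influences, inf_max, acts):
    """Relative activity influence of a sub-track, averaged over the query activities.

    point_influences: one dict per track point, keyword -> influence at that point's POI.
    inf_max: keyword -> maximum influence of the keyword over all POIs.
    """
    total = 0.0
    for w in acts:
        # Best influence of w among the track points of the sub-track (0 if w is absent).
        best = max((p.get(w, 0) for p in point_influences), default=0)
        total += best / inf_max[w]
    return total / len(acts)

# Hypothetical values for the two track points of T3[3,4].
t3_3_4 = [{"w1": 1, "w2": 1}, {"w3": 1, "w4": 1}]
inf_max = {"w1": 1, "w2": 2, "w4": 1}
print(round(sub_track_influence(t3_3_4, inf_max, ["w1", "w2", "w4"]), 3))  # 0.833
```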
Definition 9: given a query $Q = (loc, acts, d_{max})$, a semantic track data set D and a positive integer k, let the set C contain all candidate tracks of Q in D. An activity-influence-based semantic track query (AITQ) returns a result set RS from the candidate set C that satisfies: (1) $RS \subseteq C$ and $|RS| = k$; (2) for all $T \in RS$ and $T' \in C \setminus RS$, $Inf(T, Q) \ge Inf(T', Q)$.
Specifically, as shown in fig. 1, a semantic track query method based on activity influence includes the following steps:
S1: acquire basic information of a semantic track data set D, wherein the basic information comprises the track numbers of all semantic tracks T in the semantic track data set D and the activity influence of the keywords corresponding to all track points p on all the semantic tracks T; each track point corresponds to at least one keyword, the keywords corresponding to the track points are not all identical, and the activity influence of the same keyword at different track points is not necessarily identical.
It should be noted that the activity influence of the keyword on any track point is the number of occurrences of the keyword on the current track point.
S2: construct a hybrid grid index HGI (Hybrid Grid Index) based on the basic information of the semantic track data set D, wherein the HGI consists of a hybrid quadtree index HQ-tree (Hybrid Quad-tree) and a hybrid inverted index HIF (Hybrid Inverted File).
It should be noted that the HGI index takes the track as an index object, and integrates the space, the activity keyword and the activity influence into the index structure. The hybrid quadtree index HQ-tree resides in memory, and the hybrid inverted index HIF is stored on disk.
S3: for a query requirement $Q = (loc, acts, d_{max})$ given by the user, find the k semantic tracks in the semantic track data set D that best match Q based on the hybrid grid index HGI, where loc is the query location set by the user, acts is the keyword set specified by the user, $d_{max}$ is the maximum expected travel distance set by the user, and k is at least 3.
The construction method of the hybrid grid index HGI is set forth in detail below.
In a first aspect, the hybrid quadtree index HQ-tree is constructed as follows:
S21: divide the real geographic area in which the semantic track data set D is located into grids of d levels, wherein the i-th level ($1 \le i \le d$) contains $2^{i-1} \times 2^{i-1}$ grids, each grid has a grid number, and d is at least 3.
That is, the HQ-tree organizes semantic track information by spatial grid partitioning: the entire two-dimensional space is recursively divided into four equal subspaces, forming in turn the hierarchy 1-Grid, 2-Grid, ..., (d-1)-Grid, d-Grid, until the last level consists of the $2^{d-1} \times 2^{d-1}$ smallest grids. A Quad-tree is then used to index all grids, and track information, activity keywords and activity influence are embedded in the Quad-tree index to form the HQ-tree index. Each node in the HQ-tree index represents the MBR (minimum bounding rectangle) of a grid area, and each node is associated with an inverted index of activity keywords.
Referring to FIG. 2, a schematic diagram of the real geographic region where the semantic track data set D is located, the region contains 3 semantic tracks $T_1$ to $T_3$: the figure marks the 4 track points on track $T_1$, the 4 track points on track $T_2$ and the 5 track points on track $T_3$, while $w_1$ to $w_6$ denote the keywords corresponding to the track points on $T_1$ to $T_3$.
Referring to FIG. 3, a schematic diagram of the grid division is shown. The whole two-dimensional space is divided into three levels, and every grid has a grid number Gid: grid 0 forms the 1-Grid, grid 0 is divided into four grids (grids 1 to 4) that form the 2-Grid, and grids 5 to 20 form the 3-Grid.
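The grid counts and grid number ranges of FIG. 3 follow from the level sizes. A minimal Python sketch, assuming that grid numbers are assigned consecutively level by level starting from 0 (the helper names are hypothetical):

```python
def level_size(level):
    """Number of grids along one side of the given level (1-based): 2^(level-1)."""
    return 2 ** (level - 1)

def level_id_range(level):
    """Grid numbers used by a level when levels are numbered consecutively from 0."""
    first = (4 ** (level - 1) - 1) // 3   # total number of grids in all coarser levels
    return range(first, first + level_size(level) ** 2)

for level in (1, 2, 3):
    ids = level_id_range(level)
    print(f"{level}-Grid: grids {ids[0]} to {ids[-1]}")
```

For d = 3 this yields grid 0, grids 1 to 4 and grids 5 to 20, which agrees with the numbering described for FIG. 3.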
S22: respectively judging whether each grid of each hierarchy has a track point, eliminating grids which do not contain the track point, and constructing a quadtree by the rest grids according to the hierarchy inclusion relation of the grids, wherein nodes without child nodes are leaf nodes, and the rest nodes are non-leaf nodes;
S23: associate an index entry and bitmap information with each node of the quadtree, specifically as follows:
For any non-leaf node $g_0$, the associated index entry is a triple ($Gid_0$, {g' ∈ $g_0$.subGrids}, ifile), where $Gid_0$ is the number of the grid corresponding to $g_0$, {g' ∈ $g_0$.subGrids} is a list of pointers to all child nodes of $g_0$, and ifile is the first inverted index of the keywords corresponding to all track points in the grid of $g_0$. The first inverted index contains the following information: the activity influence of each keyword in the grid corresponding to $g_0$, and the grid numbers of the child nodes of $g_0$ in which each keyword appears.
It should be noted that the first inverted index is implemented with two hash tables whose keys are the keywords. For the current keyword w, the value in the first hash table records the maximum influence of w in g, i.e. $Inf_w(g) = \max_{poi \in g} Inf_{poi}(w) / Inf_{max}(w)$, where $Inf_{max}(w)$ is the maximum influence of w over all POIs; the value in the second hash table is a list of pointers to the child nodes that contain the activity.
Meanwhile, the activity influence of a keyword in the grid corresponding to any node is calculated as follows: obtain the activity influence of the keyword at all track points in the grid corresponding to the current node, and take the maximum value as the activity influence of the keyword in that grid.
For any leaf node $g_1$, the associated index entry is a pair ($Gid_1$, ilist), where $Gid_1$ is the number of the grid corresponding to $g_1$ and ilist is the second inverted index of the keywords corresponding to all track points in the grid of $g_1$. The second inverted index contains the following information: the activity influence of each keyword in the grid corresponding to $g_1$, and the numbers of the semantic tracks in which each keyword appears.
It should be noted that the ilist of a leaf node differs from the ifile of a non-leaf node in that the value in the second hash table of the inverted list pointed to by ilist stores the ids of the tracks that contain the keyword within the MBR of the current node, and a track can be quickly fetched from disk through its track id. Both ifile and ilist are implemented with hash tables, so the inverted index of each node can be accessed in O(1) time.
For any node of the quadtree, the bitmap information is a data sequence consisting of 0 and 1, and each data bit on the data sequence corresponds to a semantic track, wherein 0 indicates that the semantic track corresponding to the data bit does not pass through the grid corresponding to the current node, and 1 indicates that the semantic track corresponding to the data bit passes through the grid corresponding to the current node.
That is, each node in the present invention maintains a bitmap SIG that marks all tracks passing through the node. Specifically, a hash function maps the id of each track to one bit of SIG: if track T has a track point inside the grid represented by the node and the hash function maps the id of T to i, i.e. hash(T) = i, the i-th bit of the node's SIG (initially 0) is set to 1.
For example, FIG. 4(a) shows the index structure obtained after indexing the grids of FIG. 3 with an HQ-tree. Grid 7 does not contain any track point, so its node is set to null in the index structure, and null nodes are not shown in FIG. 4(a). FIG. 4(b) shows the structure of non-leaf node 2, which has two child nodes, node 9 and node 11, together with the inverted lists that index activities $w_1$ and $w_2$ within node 2. FIG. 4(c) shows the structure of leaf node 11, one of the child nodes of node 2. There are three tracks in the whole grid, so the length of the bitmap is set to 3. If hash($T_1$) = 1 and hash($T_2$) = 2, and the only tracks passing through node 2 and node 11 are $T_1$ and $T_2$, then the bitmaps SIG of node 2 and node 11 are both 110.
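The bitmap of this example can be reproduced with a small Python sketch. The hash function is modelled here as a fixed lookup table; the mapping of $T_3$ to bit 3 is an assumption added only to make the table complete.

```python
def build_sig(track_ids, hash_fn, length):
    """Return the bitmap SIG as a list of bits; bit i (1-based) is set when a track hashes to i."""
    sig = [0] * length
    for tid in track_ids:
        sig[hash_fn(tid) - 1] = 1
    return sig

# Assumed hash: track T_i maps to bit i, in line with hash(T1) = 1 in the example.
hash_fn = {"T1": 1, "T2": 2, "T3": 3}.__getitem__

sig_node_11 = build_sig(["T1", "T2"], hash_fn, 3)
print("".join(map(str, sig_node_11)))  # 110
```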
In general, the offline procedure for indexing a track data set with the HQ-tree can be summarized as the following steps 2-1-1 to 2-1-3:
Step 2-1-1: divide the two-dimensional space covered by the data set into grids, and after the division organize the grids of all levels with a Quad-tree.
Step 2-1-2: process each track point of each track in the data set in turn through an insertion operation, updating the ifile and SIG of the intermediate nodes from top to bottom until the track point reaches a leaf node, and return after updating the ilist and SIG of the leaf node.
Step 2-1-3: after all tracks have been processed and all nodes updated, set leaf nodes that contain no track point to null, then apply the same treatment bottom-up to intermediate nodes whose child nodes are all null, until the root node is reached. In the end, the inverted file of the root node contains all activity keywords together with their maximum activity influence, and every bit of the root node's bitmap SIG is set to 1.
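A minimal in-memory sketch of the insertion operation of step 2-1-2 is given below in Python. It is an illustration under stated assumptions, not the claimed implementation: the space is taken to be the unit square, child grids are numbered 4·Gid+1 to 4·Gid+4 (a hypothetical scheme, chosen because it gives node 2 the children 9 to 12 and the last level the numbers 5 to 20 when d = 3), the two hash tables of ifile and ilist are modelled as the dictionaries `max_inf` and `where`, and SIG is modelled as a set of bit positions. Nodes are created lazily, so grids without track points simply never exist, which plays the role of the null nodes of step 2-1-3.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    gid: int
    level: int
    children: dict = field(default_factory=dict)  # quadrant index -> Node
    max_inf: dict = field(default_factory=dict)   # keyword -> max influence in this grid
    where: dict = field(default_factory=dict)     # keyword -> child gids (non-leaf) or track ids (leaf)
    sig: set = field(default_factory=set)         # bit positions set in the bitmap SIG

def insert_point(root, depth, x, y, track_bit, track_id, influences):
    """Insert one track point at (x, y) in [0,1) x [0,1), updating nodes top-down.

    influences maps each keyword of the point to its activity influence there.
    """
    node, x0, y0, size = root, 0.0, 0.0, 1.0
    while True:
        node.sig.add(track_bit)
        for w, inf in influences.items():
            node.max_inf[w] = max(node.max_inf.get(w, 0), inf)
        if node.level == depth:                    # leaf node: ilist keeps track ids
            for w in influences:
                node.where.setdefault(w, set()).add(track_id)
            return
        size /= 2
        col, row = int(x >= x0 + size), int(y >= y0 + size)
        x0, y0 = x0 + col * size, y0 + row * size
        quadrant = 2 * row + col
        child = node.children.get(quadrant)
        if child is None:
            # Hypothetical numbering: children of grid g get ids 4*g+1 .. 4*g+4.
            child = Node(gid=4 * node.gid + 1 + quadrant, level=node.level + 1)
            node.children[quadrant] = child
        for w in influences:                       # non-leaf node: ifile keeps child grid numbers
            node.where.setdefault(w, set()).add(child.gid)
        node = child

root = Node(gid=0, level=1)
insert_point(root, 3, 0.6, 0.1, 1, "T1", {"w1": 2})
insert_point(root, 3, 0.6, 0.3, 2, "T2", {"w1": 1, "w2": 3})
print(sorted(c.gid for c in root.children[1].children.values()))  # [9, 11]
```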
In a second aspect, the hybrid inverted index HIF consists of an activity list, an activity inverted list and a sub-track length list, wherein the activity list stores the keywords corresponding to all track points on a semantic track, the activity inverted list stores the association between each keyword and the track points at which it appears, and the sub-track length list stores the length from each track point on the semantic track back to the initial track point.
It should be noted that the first list in the HIF index contains all activities in T in ascending order and is used to quickly filter out tracks that do not meet the activity requirement of the query; the second list stores the activity inverted index of the track points in T and can be used to filter out track points that need not be processed; the third list stores the lengths len(T[1, e]) of all sub-tracks of T that start at $p_1$, from which the length of a target sub-track can be obtained quickly.
Referring to FIG. 5, a schematic diagram of the hybrid inverted files of two tracks on disk is shown. If the currently matched sub-track is T[3, 5], then len(T[3, 5]) = len(T[1, 5]) - len(T[1, 3]). When the track data set is small, the activity lists can be loaded into memory in advance for fast filtering, and the other lists are loaded from disk when a track that needs further verification is obtained.
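The sub-track length list behaves like a prefix-sum array. A short Python sketch, with hypothetical coordinates and function names, shows how len(T[s, e]) is obtained from two stored values:

```python
import math
from itertools import accumulate

def prefix_lengths(positions):
    """len(T[1,e]) for e = 1..n: path length from the first track point to point e."""
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    return list(accumulate(steps, initial=0.0))

def sub_track_length(prefix, s, e):
    """len(T[s,e]) = len(T[1,e]) - len(T[1,s]) with 1-based, inclusive indices."""
    return prefix[e - 1] - prefix[s - 1]

# Hypothetical track with five points.
positions = [(0, 0), (3, 4), (3, 8), (6, 12), (6, 13)]
prefix = prefix_lengths(positions)
print(prefix)                          # [0.0, 5.0, 9.0, 14.0, 15.0]
print(sub_track_length(prefix, 3, 5))  # 6.0
```

Storing only the n prefix values keeps the list linear in the track length while any sub-track length is available with one subtraction.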
The following describes in detail how the heuristic search is performed based on the hybrid grid index HGI when a user submits an AITQ query Q online. To process AITQ queries, the invention designs a heuristic query framework based on the HGI index: the framework heuristically traverses the HQ-tree and preferentially obtains to-be-verified tracks with larger activity influence; in the track verification stage it uses the HIF index to quickly calculate the actual activity influence of each track; and during the traversal it can terminate processing early according to a termination strategy.
Specifically, referring to FIG. 6, a schematic diagram of the heuristic search framework, the inputs are a track data set D, a query $Q = (loc, acts, d_{max})$, a positive integer k, the HQ-tree and the HIF. A result set RS and a to-be-verified set VS are initialized, and a bitmap BM is initialized to record the tracks that have been searched. While traversing the HQ-tree nodes, a max-heap Heap orders the nodes that need further access by their F(node, Q) values, so that the heap-top node has the largest F(node, Q); the root node of the HQ-tree is pushed onto the heap first.
Further, the step S3 of finding k semantic tracks that best match the query requirement Q in the semantic track data set D based on the hybrid grid index HGI specifically includes the following steps:
S31: set up a heap set and traverse the hybrid quadtree index HQ-tree from top to bottom: first put the root node into the heap set and obtain its child nodes, then remove the root node from the heap set and add to the heap set those child nodes of the root node that satisfy the set pruning rules.
The set pruning rules are: (1) the distance between at least one track point in the grid corresponding to the child node and the query location loc set by the user is not more than $d_{max}$; (2) the activity influence of the child node is not 0; (3) at least one of the semantic tracks passing through the grid corresponding to the child node has not been accessed.
Note that, for pruning rule 1, if $d(Q, node) > d_{max}$, no track point in the node can satisfy the distance constraint, and all tracks in the node can be safely pruned. For pruning rule 2, if Inf(node, Q) calculated by formula (1) is 0, the node contains none of the activity keywords in Q.acts, and the node is pruned. For pruning rule 3, a node is pruned if all tracks passing through it have been accessed before; whether the node contains an unaccessed track is checked with the function SigCheck(BM, node.SIG). Specifically, if the result of the bitwise operation (node.SIG & BM) ^ node.SIG contains a 1 bit, the node contains unaccessed tracks and SigCheck returns false; otherwise SigCheck returns true and the node may be pruned. For example, with BM = [1,0,1,0,1] and a node whose SIG has its 2nd and 4th bits set, (SIG & BM) ^ SIG = [0,1,0,1,0], so the tracks mapped to the 2nd and 4th positions of SIG have not been accessed and the node needs to be accessed.
It should be noted that whether the semantic tracks passing through the grid corresponding to a child node have been accessed is determined quickly through the bitmap information in the HQ-tree index.
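The bitmap test of pruning rule 3 can be sketched in Python as follows. Bitmaps are modelled as equal-length lists of 0/1 values; the function name `sig_check` and the sample SIG values are hypothetical.

```python
def sig_check(bm, sig):
    """Return True when every track passing through the node was already accessed.

    A 1 in (sig & bm) ^ sig marks a track that passes through the node
    but is not yet recorded in the bitmap bm of accessed tracks.
    """
    return not any((s & b) ^ s for s, b in zip(sig, bm))

bm = [1, 0, 1, 0, 1]
print(sig_check(bm, [0, 1, 0, 1, 0]))  # False: bits 2 and 4 are unaccessed, keep the node
print(sig_check(bm, [1, 0, 1, 0, 0]))  # True: nothing new in the node, it may be pruned
```

Because several track ids may hash to the same bit, a check of this kind can only be as precise as the hash function allows.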
S32: find the node with the largest activity influence in the current heap set, remove this node from the heap set, add to the heap set those child nodes of the node that satisfy the set pruning rules, and repeat this step until a leaf node is reached.
That is, while the heap set Heap is not empty, the heap-top node is popped: if it is a leaf node, the tracks to be verified are obtained from it; if it is a non-leaf node, its child nodes are obtained and those that pass the 3 pruning rules are pushed onto the heap.
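The pop-and-expand loop described above is a best-first traversal. A minimal Python sketch using the standard `heapq` module is given below; the nested-dictionary tree, the `score` callback (standing in for F(node, Q)) and the `keep` callback (standing in for the three pruning rules) are assumptions made for illustration.

```python
import heapq
from itertools import count

def traverse(root, score, keep):
    """Yield leaf nodes in best-first order.

    root: nested dict {"id": ..., "children": [...]}; a node without children is a leaf.
    score(node): priority of the node; keep(node): True if the node passes the pruning rules.
    """
    tie = count()                                  # tie-breaker so dicts are never compared
    heap = [(-score(root), next(tie), root)]       # heapq is a min-heap, so negate the score
    while heap:
        _, _, node = heapq.heappop(heap)
        if not node["children"]:
            yield node                             # leaf: its tracks go to verification
            continue
        for child in node["children"]:
            if keep(child):
                heapq.heappush(heap, (-score(child), next(tie), child))

tree = {"id": "root", "children": [
    {"id": "A", "children": [{"id": "a1", "children": []}, {"id": "a2", "children": []}]},
    {"id": "B", "children": []},
]}
scores = {"root": 1.0, "A": 0.9, "B": 0.5, "a1": 0.2, "a2": 0.8}
print([n["id"] for n in traverse(tree, lambda n: scores[n["id"]], lambda n: True)])
# ['a2', 'B', 'a1']
```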
Further, the activity influence of any node is calculated as follows: among all keywords corresponding to the node, find those that belong to the keyword set acts set by the user; sum the maximum activity influence of the found keywords in the node, and divide the sum by the total number of keywords in acts to obtain the activity influence of the node.
It should be noted that a keyword may appear at several track points in the grid corresponding to the node, and its activity influence (i.e. number of occurrences) need not be the same at different track points; therefore the activity influence at the track point where the keyword has the greatest influence is taken as the maximum activity influence of the keyword in the current node.
Further, the activity influence of any node is calculated by:

$$Inf(node,Q) = \frac{1}{|Q.acts|} \sum_{w \in Q.acts} Inf_w(node) \quad (1)$$

where the maximum influence $Inf_w(node)$ of activity keyword w in the node can be obtained from the inverted index of the node. The maximum influence of each activity keyword w in Q.acts in the node is obtained in turn, and Inf(node, Q) is then calculated by formula (1). So that the heuristic traversal takes into account both the spatial proximity between the node and the query Q and the activity influence of the node under Q, a constant c is used to combine the spatial distance between the node and Q with the activity influence into a function F(node, Q). Here d(Q, node) denotes the minimum distance between the MBR of the node and Q.loc (if Q.loc lies inside the node, the distance is taken as 0), normalized by $d_{max}$, and $c \in [0, 1]$ controls the weight of spatial proximity, with default c = 0.2. While traversing the HQ-tree nodes, nodes with larger F values are accessed first.
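For illustration, the node influence and one possible ranking function can be sketched in Python. The averaging in `node_influence` follows the textual description of the node activity influence. The exact form of F(node, Q) is not reproduced here, so the convex combination in `priority` is purely an assumption: it is merely one function that grows with activity influence, shrinks with normalized distance, and weights proximity by c.

```python
def node_influence(node_inf, acts):
    """Inf(node, Q): per-keyword influence in the node, averaged over the query keywords.

    node_inf maps keyword -> the value stored in the node's inverted index;
    query keywords absent from the node contribute 0.
    """
    return sum(node_inf.get(w, 0.0) for w in acts) / len(acts)

def priority(node_inf, acts, d_node, d_max, c=0.2):
    """ASSUMED form of F(node, Q): c * proximity + (1 - c) * influence."""
    proximity = 1.0 - min(d_node, d_max) / d_max   # 1 when Q.loc lies inside the node
    return c * proximity + (1.0 - c) * node_influence(node_inf, acts)

inf = {"w1": 1.0, "w2": 0.5}
print(node_influence(inf, ["w1", "w2", "w4"]))                       # 0.5
print(round(priority(inf, ["w1", "w2", "w4"], d_node=0.0, d_max=10.0), 3))  # 0.6
```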
S33: according to the index entry associated with the leaf node reached by the traversal, obtain the semantic tracks in the leaf node that contain keywords in the set acts set by the user, and take them as tracks to be verified; obtain the activity influence of each track to be verified, and take the top k tracks with the largest activity influence as candidate tracks.
It should be noted that the tracks to be verified that have just been obtained are marked as searched in the bitmap BM, the tracks that have not yet been verified are screened out using the HIF index, and the verification of S331 to S334 is then performed, i.e. the activity influence of each track to be verified is calculated and the top k tracks are selected. If the activity list of a track to be verified does not contain all activity keywords in Q.acts, the track is skipped directly and its activity influence is not calculated.
Further, the calculation method of the activity influence of each to-be-verified track is as follows:
S331: initialize the window end points s and e, and denote the sub-track extracted by the window each time as T[s, e];
S332: the left end point s is initially fixed at the first track point, and the right end point e advances from the first track point along the track to be verified; each time e advances, a sub-track T[s, e] is intercepted and it is judged whether the distance between the intercepted sub-track and the query location loc set by the user is greater than $d_{max}$. If it is, the current position of e is kept and s is advanced until the distance between the sub-track T[s, e] and loc is again not more than $d_{max}$, and the activity influence of T[s, e] is then calculated by formula (7); if it is not, the activity influence of T[s, e] is calculated directly by formula (7):

$$Inf(T[s,e],Q) = \frac{1}{|Q.acts|} \sum_{w \in Q.acts} \max_{p \in T[s,e]} \frac{Inf_{p.poi}(w)}{Inf_{max}(w)} \quad (7)$$

where Inf(T[s, e], Q) denotes the activity influence of sub-track T[s, e], |Q.acts| is the total number of keywords in the set acts, $Inf_{p.poi}(w)$ is the activity influence of a keyword that appears in T[s, e] and belongs to acts, $Inf_{max}(w)$ is the maximum activity influence of that keyword over all track points, Q.acts is the keyword set specified by the user, w is a keyword belonging to Q.acts, and p is a track point of T[s, e];
Meanwhile, in step S332, the distance between the sub-track T[s, e] and the query location loc set by the user is calculated by:

$$d(T[s,e],Q) = dist(Q.loc, p_s) + \sum_{i=s}^{e-1} dist(p_i, p_{i+1})$$

where d(T[s, e], Q) is the distance between T[s, e] and loc, $dist(Q.loc, p_s)$ is the Euclidean distance between loc and the left end point s, and $dist(p_i, p_{i+1})$ is the Euclidean distance between two adjacent track points $p_i$ and $p_{i+1}$ of T[s, e], with $p_i$ ranging over all track points of T[s, e] except the right end point e.
S333: keeping the current position of the left end point s unchanged, continuously increasing the right end point e along the direction of the track to be verified, continuously carrying out condition judgment on the intercepted sub-track, obtaining the activity influence of the sub-track meeting the condition, and repeating the steps until the right end point e reaches the last track point;
S334: take the maximum of the activity influences of all sub-tracks obtained as the activity influence of the current track to be verified;
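Steps S331 to S334 can be sketched as a sliding window in Python. This is a simplified illustration under stated assumptions: each track point is a pair of a position and a dictionary of keyword influences, the prefix lengths play the role of the sub-track length list of the HIF, a window is scored only when it contains every query keyword, and the function name `verify_track` is hypothetical.

```python
import math

def verify_track(points, loc, acts, d_max, inf_max):
    """Return the largest activity influence among the matching windows, or 0.0 if none.

    points: list of (position, {keyword: influence at that point}) in track order.
    inf_max: keyword -> maximum influence of the keyword over all track points.
    """
    # Prefix path lengths: prefix[i] is the length from the first point to point i (0-based).
    prefix = [0.0]
    for (a, _), (b, _) in zip(points, points[1:]):
        prefix.append(prefix[-1] + math.dist(a, b))

    def distance(s, e):                  # d(T[s,e], Q) with 0-based indices
        return math.dist(loc, points[s][0]) + prefix[e] - prefix[s]

    best, s = 0.0, 0
    for e in range(len(points)):         # the right end point advances along the track
        while s < e and distance(s, e) > d_max:
            s += 1                       # advance the left end point until within d_max
        if distance(s, e) > d_max:
            continue                     # even the single point p_e is too far away
        window = [kw for _, kw in points[s:e + 1]]
        if all(any(w in kw for kw in window) for w in acts):
            score = sum(max(kw.get(w, 0) for kw in window) / inf_max[w] for w in acts)
            best = max(best, score / len(acts))
    return best

track = [((0, 0), {"w1": 1}), ((1, 0), {"w2": 1}), ((2, 0), {"w4": 1})]
inf_max = {"w1": 1, "w2": 2, "w4": 1}
print(round(verify_track(track, (0, 0), ["w1", "w2", "w4"], 2.0, inf_max), 3))  # 0.833
```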
S34: remove the leaf node of step S33 from the heap set, and repeat steps S32 to S33 on the current heap set until the early termination condition is met or all nodes have been traversed; the k candidate tracks finally obtained are the semantic tracks that best match the query requirement $Q = (loc, acts, d_{max})$ given by the user.
It should be noted that k candidate tracks are obtained each time steps S32 to S33 are executed. After the heuristic search process finishes, the top-k tracks with the largest activity influence under query Q are returned, which completes the query processing method.
Further, the early termination condition is: the first sum is smaller than the minimum value of the track activity influence in the k candidate tracks obtained by current calculation, wherein the calculation method of the first sum is as follows:
take each keyword in the set acts set by the user in turn as the current keyword and perform the following operation: find the maximum activity influence of the current keyword over all track points corresponding to all nodes of the current heap set, and take it as the maximum influence of the current keyword;
then sum the maximum influences of all keywords in acts, and divide the sum by the total number of keywords in acts to obtain the first sum.
That is, in order to end the node traversal as early as possible before the heap Heap becomes empty, the invention seeks an upper bound $Inf_{ub}(Heap, Q)$ on the activity influence of all tracks in the nodes within Heap. The search process maintains the k-th largest activity influence $Inf_k$ in the result set RS, and if $Inf_k \ge Inf_{ub}(Heap, Q)$, the search can end early.
Definition 10: given a query $Q = (loc, acts, d_{max})$ and the heap Heap of HQ-tree nodes, the activity influence of Heap under query Q is defined as:

$$Inf(Heap,Q) = \frac{1}{|Q.acts|} \sum_{w \in Q.acts} Inf_w(Heap,Q) \quad (9)$$

where $Inf_w(Heap, Q) = \max_{node \in Heap} Inf_w(node)$ is the maximum influence of activity w in Heap, and the maximum influence $Inf_w(node)$ of activity w within a node can be obtained from the inverted index of the node.
Theorem 1: Inf(Heap, Q) is an upper bound on the activity influence of all tracks not yet verified, i.e. for any unverified track T, $Inf(T, Q) \le Inf(Heap, Q)$.
Proof: let T be an unverified track. T must pass through at least one node in Heap, and for any w ∈ Q.acts, any track point of T that may match w must lie in some node in Heap, so $Inf_T(w) \le Inf_w(Heap, Q)$. By Definition 8 and Definition 10, $Inf(T, Q) \le Inf(Heap, Q)$. This completes the proof of Theorem 1.
To update Inf(Heap, Q) efficiently while traversing the HQ-tree nodes, a max-heap $Heap_w$ is maintained for each w ∈ Q.acts, in which the nodes are ordered by their $Inf_w(node)$ values. These heaps contain exactly the same nodes as the main heap Heap, but in a different order. So that a node can be found in these heaps immediately, the nodes in these heaps are linked to the same node in Heap. Whenever a node is popped from Heap, the nodes linked to it are popped from all $Heap_w$ at the same time; when a child node is added to Heap, it must also be added to each $Heap_w$. For any w ∈ Q.acts, the $Inf_w(node)$ value of the heap-top node of $Heap_w$ is the maximum influence $Inf_w(Heap, Q)$ of activity w in the main heap. At any stage of the traversal, the corresponding $Inf_w(Heap, Q)$ can be read from the top node of each $Heap_w$ and the activity influence Inf(Heap, Q) computed by formula (9); when the k-th largest activity influence $Inf_k$ in the result set satisfies $Inf_k \ge Inf(Heap, Q)$, the search can end.
Referring to FIG. 7, a schematic diagram of the structure of the main heap and the auxiliary heaps, two heaps $Heap_{w_1}$ and $Heap_{w_2}$ exist in addition to the heap Heap. All three heaps hold the same nodes $N_1$ to $N_5$, the order of the nodes differs from heap to heap, and the nodes in both auxiliary heaps are linked to the same nodes in the main heap. In $Heap_{w_1}$, the $Inf_{w_1}$ value of the heap-top node $N_4$ is the maximum influence of activity $w_1$ in Heap, so the upper bound Inf(Heap, Q) on the activity influence of the unverified tracks is obtained from the heap-top values of $Heap_{w_1}$ and $Heap_{w_2}$.
That is, the main heap in the middle orders the nodes from largest to smallest by their overall activity influence, the left heap $Heap_{w_1}$ orders the nodes from largest to smallest by the activity influence of keyword $w_1$, and the right heap $Heap_{w_2}$ orders them from largest to smallest by the activity influence of keyword $w_2$. Assuming $w_1$ and $w_2$ are the keywords given by the user, the maximum activity influence over all nodes not yet searched is obtained by adding the activity influence of keyword $w_1$ in node $N_4$ to that of keyword $w_2$ in node $N_3$ and dividing the sum by the number of query keywords; if this maximum activity influence is less than the minimum activity influence of the top k candidate tracks calculated so far, the iteration ends early.
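The upper bound and the early termination test can be illustrated with a short Python sketch. For brevity it recomputes the per-keyword maximum directly over the nodes currently in the heap rather than maintaining the linked auxiliary heaps, which is a deliberate simplification; the node values are hypothetical and were chosen so that $N_4$ is the top node for $w_1$ and $N_3$ the top node for $w_2$.

```python
def heap_upper_bound(heap_nodes, acts):
    """Inf(Heap, Q): largest influence of each query keyword over the heap nodes, averaged."""
    total = 0.0
    for w in acts:
        total += max((inf.get(w, 0.0) for inf in heap_nodes.values()), default=0.0)
    return total / len(acts)

def can_stop(scores, k, bound):
    """Early termination: k results are known and the k-th largest is at least the bound."""
    return len(scores) >= k and sorted(scores, reverse=True)[k - 1] >= bound

# Hypothetical per-node influences of the query keywords.
nodes = {
    "N1": {"w1": 0.2, "w2": 0.1},
    "N3": {"w1": 0.3, "w2": 0.9},
    "N4": {"w1": 0.8, "w2": 0.4},
}
bound = heap_upper_bound(nodes, ["w1", "w2"])
print(round(bound, 2))                       # 0.85
print(can_stop([0.9, 0.88, 0.86], 3, bound))  # True
```

With linked auxiliary heaps as described above, the same per-keyword maxima are available from the heap tops without scanning every node.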
In conclusion, the invention designs a new index structure for semantic track queries based on activity influence. The structure integrates geographic location information, activity keyword information and activity influence information, and has stronger pruning capability than existing indexing techniques. In addition, the query processing framework and optimization mechanisms designed on top of this index structure process activity-influence-based semantic track queries efficiently, at a processing efficiency the prior art does not achieve.
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it will be understood by those skilled in the art that various changes and modifications may be made herein without departing from the spirit and scope of the invention as defined in the appended claims.