City burglary crime risk assessment method based on mobile phone user and POI data
1. A risk assessment method for urban burglary crime adopts a random forest algorithm to construct an urban burglary crime risk assessment and prediction model based on multi-scale characteristic variables, and utilizes the model to assess and predict burglary risks, wherein the construction of the model comprises the following steps:
A. extracting multi-scale characteristic variables serving for a city burglary crime risk assessment model from mobile phone user data, city POI data, city road network data and night light data;
B. screening and optimizing multi-scale characteristic variables for constructing a city burglary crime risk assessment prediction model: firstly, analyzing the influence of different scales of each characteristic on burglary risk modeling precision, and determining the optimal spatial scale of each characteristic; then, calculating the feature importance based on the OOB error by using a recursive elimination method, and screening and optimizing the feature set; the feature importance is calculated by randomly replacing the ith feature in the process of iteratively training a single tree t in the forest, then comparing the OOB error change of the estimator before and after the feature is replaced, and changing the variable XiThe difference of model accuracy before and after replacement is used as the measure of the importance degree of the feature to the estimator, and is recorded as VItiI.e. the feature i has an importance value for each tree in the random forest; importance VI of feature i for the entire modeltiIs the average value of the importance of the characteristics i corresponding to all the trees in the forest, and the calculation formula is as follows:
where N represents the number of decision trees in the random forest, EtiRepresents variable XiOOB error, EP, of the t-th tree before being replacedtiRepresents a variable XiReplacing the OOB error of the t-th tree;
C. carrying out four-area division on burglary sample data of an urban area, wherein grids in two areas are used as training set training models, a grid in one area is used as a verification set for model parameter adjustment, and a grid in the other area is used as a test set for testing model accuracy;
D. on the basis of the optimal feature set determined in the step B, model training and parameter selection are carried out by comparing the precision of the model on the verification set, and the most important 3 parameters in the random forest are determined: the number of decision trees, the maximum feature number of the decision trees and the maximum depth of the decision trees; continuously changing parameters, evaluating the model through the performance of the verification set to obtain the optimal parameters of the model, and finishing the training of the model;
E. and verifying and evaluating the model precision.
2. The city burglary crime risk assessment method according to claim 1, wherein in the step a, multi-scale socioeconomic feature variables are extracted from mobile phone user data, and the extraction method comprises: and taking a grid of mobile phone user data as a center, and carrying out aggregation filtering analysis by using a plurality of moving windows of K multiplied by K grids, wherein K is an odd number, so as to obtain a multi-scale characteristic variable capable of reflecting social and economic characteristics.
3. The city burglary crime risk assessment method according to claim 2, wherein the multi-scale socioeconomic feature variables extracted from the mobile phone user data include the number of residential population in a plurality of spatial scales, and the proportion of residential population in each age group, each level of abundance, and local foreign population.
4. The method as claimed in claim 1, wherein the POI related to burglary is selected in step a, and the POI data is classified into a plurality of categories to extract the spatial density of each category.
5. The city burglary crime risk assessment method according to claim 1, wherein in step a, road network accessibility features are extracted from city road network data, intersections where 3 or more important roads intersect are first used as city important nodes, and each road of different levels is given different speeds according to actual conditions of research areas; then, the shortest travel time from each node to other urban nodes is calculated based on the urban road network, the average time from each node to other nodes is calculated, and the reciprocal of the average time is taken as the road network reachability of the node, as shown in formula (1):
wherein A (i) is the reachability of node i, N is the total number of nodes, tijRepresenting the shortest time from the node i to the node j;
and finally, generating a road network accessibility grid image of the research area grid by using the Krigin interpolation.
6. The method as claimed in claim 1, wherein the step a of obtaining the image data of the city night light by remote sensing satellite, resampling the night light image with lower resolution to higher resolution, and calculating the average value of the night light image with higher resolution in the grid as the night light value of the grid.
7. The city burglary crime risk assessment method of claim 1, wherein step B comprises:
B1. determining an optimal spatial scale for each feature by adopting a random forest algorithm based on Mean Square Error (MSE), specifically:
firstly, training based on all feature sets F to obtain a model M, and calculating the mean square error of the model, and recording the mean square error as MSE{F}(ii) a Then replacing feature F with a random valueiThe value of (2) is obtained as a new characteristic set value, and the MSE of the model after replacement is obtained based on model M prediction and is recorded asRepeating the replacement for multiple times to eliminate the randomness of the selected features of the random forest, wherein the process is a replacement test, then calculating the average mean square error increment of the features, recording the average mean square error increment as IMSE, and calculating the proportion of the IMSE of the features to the sum of all the IMSEs of the features, and recording the proportion as normalized mean square error increment NIMSE; the NIMSE value is positive, which indicates that the model precision is reduced after the feature replacement, and the NIMSE value is negative, which indicates that the model precision is improved after the feature replacement; selecting the scale with the maximum NIMSE as the optimal spatial scale of the same feature with different scales; the calculation formula is as follows:
in the formula, IMSEiRepresenting the value of the mean square error increase after the displacement of the ith feature,to replace the mean square error, MSE, of the model after the ith feature{F}For the mean square error, NIMSE, of the model obtained based on the training of all the feature setsiThe proportion of the IMSE of the ith characteristic to the total IMSE of all the characteristics is represented, k represents the randomness of repeated k-th elimination characteristic selection, j represents the jth characteristic, N represents the total characteristic number, and H represents the repetition times for eliminating the randomness in the random forest construction process;
B2. and (3) carrying out feature screening based on feature importance: and after the optimal spatial scale of each feature is determined, removing the features of the rest scales, and screening the optimal spatial scale feature set by using a dichotomy method and a recursive feature elimination method.
8. The city burglary crime risk assessment method according to claim 7, wherein the screening method of step B2 is: 1) after training is carried out on the basis of the feature set with the optimal scale, variable importance obtained by random forest training is subjected to descending order arrangement; 2) keeping the characteristics with the importance of the first 50% by using a bisection method as a new characteristic set, and training the model again to obtain new characteristic importance and a root mean square error; 3) if the precision is improved or unchanged, repeating the step 2), and if the precision is reduced, taking the feature set screened in the step 2) as the feature set after primary screening; and after the dichotomy preliminary screening, adding the features removed last time into the feature set with the optimal precision one by one according to the importance, and selecting the feature set with the minimum error as a final feature set.
9. The city burglary crime risk assessment method according to claim 1, wherein step E uses a decision coefficient R2And a Root Mean Square Error (RMSE) as model evaluation indexes, and the calculation formula is as follows:
in the formula, yiIs the true value of the ith sample,for the prediction value of the i-th sample,the average value of the real values of the samples is obtained, and n is the number of the samples;
the precision ratio test is respectively carried out on the areas which are 5 percent at the top, 10 percent at the top and 20 percent at the top of the risk value in the prediction result, namely, the areas which are 5 percent at the top, 10 percent at the top and 20 percent at the top of the prediction risk value are respectively used as positive examples, the residual areas are used as negative examples, the proportion which is really used as a positive example in the result that the prediction is a positive example is calculated, and the calculation formula is as follows:
in the formula, PreratioThe ratio represents the precision ratio, the ratio is the proportion of the high-risk area, the TP represents the number of grids predicted as the high-risk area is really the high-risk area, and the FP represents the number of grids predicted as the high-risk area is really the non-high-risk area.
Background
Socio-economic conditions and geographical environment are often closely related to the occurrence of criminal acts and affect their spatial distribution. The data of points of interest (POI) in cities can reflect the surrounding geographical environment, and has become an important data source for researching the influence of the geographical environment on crimes. The socioeconomic factors are concerned about instability of community housing, socioeconomic status, land use conditions of residential areas, and the like. Therefore, the socio-economic factors such as neighborhoods, the age of the community, income, occupation, the population structure of the floating population, social connection and collective effectiveness are very important research perspectives in criminal geography.
Most of the existing social and economic data are government statistical data, and the defects of large data statistical scale, slow updating and the like exist. With the popularization of smart phones and the development of big data technologies, it has become possible to obtain regional socioeconomic indexes by using mobile phone user data. At present, in the field of crime, attribute data behind mobile phone user data are not deeply mined, and social and economic characteristic data with fine scale obtained from the mobile phone user data can provide a new means for developing urban crime risk assessment.
Disclosure of Invention
In order to overcome the limitation of social and economic statistical data used in crime risk research, the invention provides a novel city burglary crime risk assessment method based on mobile phone user data and city interest point data. The method involves the following concepts:
mobile phone user data (MPU): the method is characterized in that a mobile phone user economic and social characteristic data set with a certain grid size is generated by a mobile service provider or a third-party related company by adopting an algorithm on the basis of mobile phone signaling data and user registration information when a phone card is purchased, and each 250-meter grid generally comprises the following fields: the number of residential population, the number of local residential population, the number of non-local residential population, the number of residential population with unknown identity, the number of residential population in each age group, the abundance index of the residential population and the like.
POI: for the abbreviation of "Point of Interest", Chinese translates to "Point of Interest". Each POI contains four-way information, name, category, coordinates, classification. For example: in the geographic information system, one POI may be a building, a shop, a mailbox, a bus station, etc.
Night light data: the night light data is obtained by an LJ-1 satellite system which is autonomously emitted by China.
Urban road network data: the urban road network data is road network provided by osm (open Street map), and the data includes road position, road grade and road name. The study uses roads and their connections with road categories motorway, trunk, primary, secondary, tertiary. And taking the nodes where 3 or more than 3 grade roads intersect as important nodes of the city.
The invention aims to overcome the defects brought by traditional socioeconomic statistical data counted according to administrative units by introducing mobile phone user data, city POI data and satellite night light data, and adopts a random forest algorithm to construct a multi-scale characteristic variable-based city burglary crime risk assessment and prediction model.
The technical scheme provided by the invention is as follows:
a city burglary crime risk assessment method based on mobile phone user data and POI data comprises the following steps: a multi-scale characteristic variable-based urban burglary crime risk assessment and prediction model is built by adopting a random forest algorithm, and burglary risks are assessed and predicted by utilizing the model, wherein the model building comprises the following steps:
A. extracting multi-scale characteristic variables serving for a city burglary crime risk assessment model, namely extracting the characteristic variables from mobile phone user data, city POI data, city road network data and night light data, and specifically comprising the following steps of:
A1. the method for extracting the multi-scale socioeconomic characteristic variables based on the mobile phone user data comprises the following steps:
and taking each grid in the mobile phone user data as a center, carrying out aggregation filtering analysis by using a plurality of moving windows of K multiplied by K grids, and taking the average value in the moving windows as the characteristic value of the center grid under the scale. K is an odd number, preferably an odd number of 1-11, and for example, aggregation filter analysis is performed by using moving windows of 1 × 1, 3 × 3, 5 × 5 and 7 × 7 grids respectively to obtain a multi-scale feature variable capable of reflecting socioeconomic features. In order to reflect age composition, economic condition difference and foreign population proportion difference of different areas, in the embodiment of the invention, the number of the residential population is not directly used, but the proportion of each age group, each level of richness margin and local foreign population to the residential population is calculated. A total of 20 characteristic 80 feature variables of 4 spatial scales (1 × 1 grid, 3 × 3 grids, 5 × 5 grids, 7 × 7 grids) are obtained.
A2. The multi-scale density feature extraction method based on the urban POI data comprises the following steps:
in one embodiment of the present invention, the POI point data is classified into 12 categories including ATM, restaurant, supermarket, police, shop, market, internet cafe, office, hospital, residential, entertainment, and exclusive store.
The density values of POI of each category within a range of 250m, 500m, 750m, 1000m, 1500m of the distance grid are extracted in one embodiment of the invention, taking into account that the influence ranges of POI of different categories are different. The specific implementation mode is that a buffer area with a corresponding distance is generated for each POI point, and if the buffer area is intersected with the grid, the POI point is considered to be within the corresponding distance of the grid. And finally, 60 POI density characteristics of the 12 POIs under 5 spatial scales are obtained through calculation.
A3. The method for extracting the road network reachability characteristics comprises the following steps:
the urban road connects the criminal background space and the criminal place space, and the spatial accessibility of the street also influences the urban land utilization mode and social and economic activities, thereby influencing the space-time mode of burglary crimes. The road network accessibility is the reciprocal of the average time of resident trip or vehicle running between urban districts or road network nodes, and can reflect the space accessibility to a certain extent. And extracting the road network accessibility of each grid of the research area based on the OSM urban road network. Specifically, intersections where 3 or more important roads intersect are first used as city important nodes, and roads of different levels are given different speeds according to actual conditions of the research area. And then, calculating the shortest travel time from each node to other urban nodes based on the urban road network, calculating the average time from each node to other nodes, and taking the reciprocal of the average time as the road network reachability of the node. And finally, generating a road network accessibility grid image of a research area grid (with the size of 250m) by using the Krigin interpolation.
Wherein A (i) is the reachability of node i, N is the total number of nodes, tijRepresenting the shortest duration from node i to node j.
A4. The night light feature extraction method comprises the following steps:
the method comprises the steps of obtaining image data of urban night light through a remote sensing satellite, firstly resampling a night light image with lower resolution to higher resolution, and then calculating the mean value of the night light image with higher resolution in a grid to be used as a night light value of the grid. In an embodiment of the invention, night light data are acquired by a Lopa I remote sensing satellite (LJ-1) which is autonomously emitted in China, 130m night light is resampled to 10m resolution, and then the average value of the night light image values with 10m resolution in 250m grids is calculated to be used as the night light values of 1 250m grids.
B. The method for realizing the characteristics selection and the scale optimization of the urban burglary crime risk assessment model comprises the following specific steps:
firstly, analyzing the influence of different scales of each characteristic on the burglary risk modeling precision, determining the optimal spatial scale of each characteristic,the specific calculation method is as described in B1. Then, the feature importance is calculated by using recursive elimination method based on the OOB (out Of bag) error, and the feature set is screened and optimized, wherein the specific calculation method is as described in B2. The feature Importance (VI) is calculated by randomly replacing the ith feature in the iterative training process of a single tree t in the forest, then comparing the OOB error change of the estimator before and after the feature is replaced, and adding the Variable XiThe difference of model accuracy before and after replacement is used as the measure of the importance degree of the feature to the estimator, and is recorded as VItiI.e. feature i has an importance value for each tree in the random forest. Importance VI of feature i for the entire modeltiIs the average value of the importance of the characteristics i corresponding to all the trees in the forest, and the calculation formula is as follows:
where N represents the number of decision trees in the random forest, EtiRepresents variable XiOOB error, EP, of the t-th tree before being replacedtiRepresents a variable XiAnd replacing the OOB error of the t-th tree.
B1. Determining an optimal spatial scale for each selected feature based on Mean Square Error (MSE), wherein the specific implementation method comprises the following steps:
in order to measure the influence of various characteristics under different scales on modeling precision, the change of the MSE of the model after each characteristic is replaced by a random value is calculated respectively. Specifically, a model M is obtained based on training of all feature sets F, and Mean Square Errors (MSEs) of the model are calculated and recorded as MSEs{F}. Then a random value is used to replace the feature FiThe value of (2) is obtained as a new characteristic set value, and the MSE of the model after replacement is obtained based on model M prediction and is recorded asRepeating the replacement a plurality of times (e.g. H-20 times) to eliminate the randomness of the selected features in the random forest, which becomes a permutation test, and then calculating the randomness of the featuresThe Mean Square error increment is denoted as IMSE (normalized error increment in the Mean Square error), and then the proportion of the feature IMSE to the sum of all the features IMSE is calculated and denoted as normalized Mean Square error increment NIMSE (normalized Mean Square error increment in the Mean Square error). The NIMSE value is positive to indicate that the model precision is reduced after the feature replacement, and the NIMSE value is negative to indicate that the model precision is improved after the feature replacement. The same feature with different scales selects the scale with the maximum NIMSE as the optimal spatial scale of the feature.
In the formula, IMSEiRepresenting the value of the mean square error increase after the displacement of the ith feature,to replace the mean square error, MSE, of the model after the ith feature{F}For the mean square error, NIMSE, of the model obtained based on the training of all the feature setsiThe proportion of the IMSE of the ith feature to the sum of all the IMSEs of the features, k represents the randomness of repeated k times of feature elimination selection, j represents the jth feature, N represents the total number of the features, and H represents the repeated times of replacement test.
B2. And (3) carrying out feature screening based on feature importance, which comprises the following specific steps:
and after the optimal spatial scale of each feature is determined, removing the features of the rest scales, and screening the optimal spatial scale feature set by using a dichotomy method and a recursive feature elimination method. Specifically, the method comprises the following steps: 1) after model training is carried out based on the feature set with the optimal scale, variable importance obtained by random forest training is subjected to descending order arrangement; 2) keeping the characteristics with the importance of the first 50% by using a bisection method as a new characteristic set, training the model again and obtaining new characteristic importance and Root Mean Square Error (RMSE); 3) and if the precision is improved or unchanged, repeating the step 2), and if the precision is reduced, taking the feature set screened in the step 2) as the feature set after primary screening. Meanwhile, part of more important features can be directly removed in the feature screening process by the dichotomy. Therefore, after the dichotomy preliminary screening, the features removed last time are added into the feature set with the optimal precision one by one according to the importance level, and the feature set with the minimum model error is selected as the final feature set.
C. Dividing a random forest model sample set according to the following modes:
in order to avoid spatial autocorrelation among training samples, four-region division is carried out on burglary sample data of an urban region, wherein grids in two regions are used as training set training models, grids in one region are used as a verification set for model parameter adjustment, and grids in the other region are used as a test set for testing model accuracy (fig. 2). When samples are divided, the grid number and the resident population number of the test set are both ensured to be more than 1/5 of the population, and the grid proportion of the training set in which burglary crimes occur is basically the same as that of the test set.
D. The method for training the burglary crime risk assessment model is realized as follows:
on the basis of the optimal feature set determined in the step B, model parameters are selected by comparing the precision of the model on the verification set, and the most important 3 parameters in the random forest are determined: the number of decision trees, the maximum feature number of decision trees, and the maximum depth of decision trees. The parameters determine the prediction accuracy and computational efficiency of the model. And continuously changing parameters, comparing the performances of the verification set, evaluating the model to obtain the optimal parameters of the model, and finishing the model training.
E. The model precision verification and evaluation method is realized as follows:
the model training process of the random forest is based on the error minimization principle to realize the optimal fitting of training samples, and the generalization capability of the model in the process is also an important factor for measuring the quality of the model. Therefore, the evaluation and inspection of the model is an important task in the whole modeling process. The present study uses a coefficient of determinism (R)2) And Root mean square error (RMSE, Root)Mean Square Errors) as model evaluation indexes, and the calculation formula is as follows:
in the formula, yiIs the true value of the ith sample,for the prediction value of the i-th sample,the average value of the real values of the samples, and n is the number of the samples.
The two indexes are the comprehensive judgment of the accuracy of each grid predicted value, wherein R2The larger the RMSE, the smaller the RMSE, indicating the higher the accuracy of the model prediction. The relative height of the crime risk is more concerned in the crime risk, and the accuracy of the high-risk area in the prediction result is of great significance for reducing the police patrol cost. Therefore, it is considered that the accuracy (precision) test is performed on the regions 5%, 10%, and 20% before the risk value of the prediction result, that is, the regions 5%, 10%, and 20% before the risk value are regarded as positive examples, and the remaining regions are regarded as negative examples, respectively, and the ratio of the positive examples in the result of the positive examples is calculated, and the calculation formula is as follows:
in the formula, PreratioRepresenting the precision ratio, ratio is the proportion of high risk areas, 5%, 10% and 20% in this study, respectively, TP represents the number of grids predicting the correct high risk areas, FP represents the number of grids predicting the high risk areas as truly non-high risk areas.
Compared with the prior art, the invention has the beneficial effects that:
the method is based on mobile phone user data and city POI data, a new method for predicting and evaluating the crime risk of city burglary is established by utilizing a random forest algorithm, and the defect that the modeling characteristics are extracted mainly by depending on social and economic statistics acquired according to administrative divisions in the current crime risk evaluation model is overcome. The multi-scale features are extracted through more dynamic and more refined mobile phone user grid data, POI density features, night light remote sensing data, road accessibility and the like, the optimized feature set and the space scale thereof are optimized and screened by utilizing a random forest algorithm, more complete expression of crime influence factors in different urban areas and precise evaluation and prediction of burglary risks are realized, and the accuracy and generalization capability of an urban crime risk prediction model are improved.
Drawings
FIG. 1 shows the distribution of burglary cases in a certain market and the results of nuclear density analysis thereof.
Fig. 2 is a schematic diagram of a city population distribution and a sample set region division method.
FIG. 3 shows the NIMSE values of the social features of different scales obtained in the embodiment of the invention.
Fig. 4 shows the NIMSE values of the POI density features at different scales obtained in the embodiment of the present invention.
FIG. 5 shows the importance ratios of the feature variables calculated based on node purity in an embodiment of the present invention.
FIG. 6 shows a scatter plot of the real values of the risk of the centralized grid and the predicted values of the random forest in the test according to the embodiment of the invention.
Detailed Description
The following describes the implementation process of the present invention by taking the burglary risk assessment of a certain city as an example.
1. Data acquisition and processing
(1) Burglary case
The data source of the burglary case is case data acquired from a Chinese judge document network (http:// wenshu. court.court.gov. cn /), the time span is 1 month and 1 day in 2014 to 12 months and 31 days in 2018, and each record comprises position and time information of the case. The spatial position of the case is accurate to a residential building, the time unit is accurate to the day, each case is endowed with an accurate geographical position through geocoding operation, and finally 849 cases meeting the conditions are obtained. The cell names in the case are used to obtain the coordinates in the GCJ-02 coordinate system by inverse geocoding using the Goods API, and then the cell location is converted from the GCJ-02 coordinate system to the WGS-84 coordinate system by coordinate conversion, and the spatial distribution is shown in FIG. 1.
(2) Mobile phone user data
The method comprises the steps that mobile phone user data products from Unicom China are obtained after anonymization and data cleaning, and for privacy protection reasons, the obtained data are all statistical data in a 250m grid and comprise fields such as residential population, local residential population, foreign residential population, unknown identity residential population, residential population of each age group, abundance index of residential population and the like. The judgment rule of the resident population is as follows: the method comprises the steps of firstly, carrying out monthly accumulation on seconds observed at each residence point from 21:00 days to 08:00 days every day in the urban area of a certain city center by a user, screening the place with the longest residence time, and judging that the user is the resident population of the position if the condition that the user is accumulated in the place for 10 days or more in the current month is met. The local foreign population is judged according to the top 4 of the identity card number of the mobile phone user authenticated by the real name, the place of the birth user is judged, the place of the birth user is a local population in a certain city, otherwise, the place of the birth user is a foreign population. The age group information is derived from the birth date in the identity card of the mobile phone user authenticated by the real name, and is divided into 0-15 year residential population, 16-24 year residential population, 25-34 year residential population, 35-44 year residential population, 45-54 year residential population, 55-64 year residential population, over 65 year residential population and unknown, wherein the unknown and the resident are unknown, and are users with uncertain identity information, so that only the unknown resident population of the resident is reserved. The abundance index is comprehensively calculated according to 9 types of data such as user internet flow, monthly telephone charge, equipment price, incoming and outgoing call quantity, local residence number, local residence time, number appearing in foreign cities, airplane trip time and user standing plot house price, and forms the abundance index reflecting the customer consumption capacity, and the larger the value is, the higher the abundance is. The abundance index is divided into 8 grades, wherein 1 and 2 grades belong to low income, 3 grades belong to medium income, 4 grades belong to well, 5 grades belong to medium yield, 6 grades belong to the root, 7 and 8 grades belong to luxury, and the information loss is unknown.
(3) City POI and road network data
The city data used for the study includes POI data and urban road network data. The POI data disclosed by the God company of 2018 is used, and the God company is a main navigation map provider in China, and the data is accurate. POI data, with reference to previous studies ([1] shows L, Ribeiro H V, Rodrigues F A. Crime prediction through urethane metrics and static learning [ J ]. Physica A: State Mechanics and its Applications,2017, 505: 435. 443.[2] Liu L, Feng J, F Ren, et al. learning the relationship between the relative positions and the relative positions, 2018,82(DEC. 10-18.), 12 types of POI related to the entrance office are selected, including the theft ATM, catering, nursing homes, police stores, police buildings, shopping malls, public buildings, hospitals, public houses, and public places. The POI data includes the spatial location and name of the various sites, infrastructures, and also transfers the POI location from the GCJ-02 coordinate system to the WGS-84 coordinate system.
The urban road network data is road network provided by osm (open Street map), and the data includes road position, road grade and road name. Roads of motorway, trunk, primary, secondary and tertiary roads and their links (links) are used in the research, and nodes where 3 or more roads of the above-mentioned level intersect are used as important nodes in cities.
(4) Remote sensing data of light at night
The used night light data is image data acquired by a Lopa A1 (LJ-1) remote sensing satellite in 25 days in 2 months in 2019, and the Lopa A1 satellite is transmitted in 12 days in 6 months in 2018, and is the first professional noctilucent remote sensing satellite in the world. The whole satellite of the satellite is 20kg, carries a large-view-field high-sensitivity noctilucent remote sensing camera, and has the noctilucent imaging capability of 130m resolution and 260km breadth. The night light image can reflect city edges and city economic activities, and provides certain reference for crime risk analysis.
2. Feature extraction and scale screening results
(1) Socioeconomic feature extraction
From the mobile phone user data, 80 feature variables (table 1) at 4 spatial scales (1 × 1 grid, 3 × 3 grid, 5 × 5 grid, 7 × 7 grid) of each of the 20 features shown in table 1 are obtained.
Table 1. social characteristic variable names extracted based on mobile phone signaling data
(2) Multi-scale POI density feature extraction
The POI density of each category of 250m, 500m, 750m, 1000m and 1500m distance grids is extracted by considering different POI influence ranges of different categories. Firstly, a buffer area with a corresponding distance is generated for the POI point, and if the buffer area is intersected with the grid, the POI point is considered to be within the corresponding distance of the grid. A total of 60 POI density features (table 2) were calculated at 5 spatial scales for each of the 12 POIs shown in table 2.
TABLE 2 POI Density variable names selected
(3) Road network reachability feature extraction
The shortest travel time from each node to other city nodes is calculated based on the city road network (table 3), the average time from each node to other nodes is calculated, and the reciprocal of the average time is taken as the road network reachability of the node. And finally, generating a road network accessibility grid image with the grid size of 250m in the research area by using the Krigin interpolation.
TABLE 3 road types and corresponding speeds selected by experiment
(4) Night light feature extraction
And resampling the night light with the resolution of 130m to the resolution of 10m, and then calculating the average value of all night light images with the resolution of 10m in a grid of 250m to be used as the average night light value in the grid.
3. Partitioning instances of sample sets
The study area is divided into four areas (fig. 2) by referring to the population distribution of a certain city and the distribution of burglary crime points, wherein grids in two areas are used as training set training models (areas 2 and 3 in fig. 2), grids in one area are used as verification sets for model parameter adjustment (area 4 in fig. 2), and grids in the other area are used as test set to test model accuracy (area 1 in fig. 2). All grids with population less than 3 were removed, and finally, the training set had 5540 grids, the validation set had 2398 grids, and the test set had 2750 grids. The number of grids in the test set and the number of resident population both account for about 1/4, and the proportion of grids in the training set and the test set with crimes is basically the same, about 5.5%.
4. Feature screening and optimal scale determination
The method is used for extracting the characteristics of the research area, and finally, an original characteristic set (table 4) of 80 social characteristics, 60 POI density characteristics, night light characteristics, road network accessibility characteristics and category average crime number which are 143 characteristics is obtained.
TABLE 4 introduction of primitive feature set
(1) Optimal spatial scale
The NIMSE of 143 features is calculated based on the optimal spatial scale selection method provided by the invention, the spatial scale with the maximum NIMSE value of each feature is selected as the optimal scale of the feature, the NIMSE of each scale of the social features is shown in FIG. 3, and the NIMSE of each scale of the POI density features is shown in FIG. 4.
The final determined optimal spatial scale for each feature is shown in table 5, where the 55-65 year old population proportion, the abundance 2, 4 and 8 scales, nimes, are all low or negative, and do not improve the model accuracy, and are therefore removed in the optimal spatial scale feature set. The optimal spatial dimension of the social features is mostly the window size of 5 multiplied by 5 and 7 multiplied by 7 grids, which is equivalent to the spatial range of a common residential area. The optimal scale of the POI density characteristics is 1500m at most, and a few POI density characteristics are 1000 m.
TABLE 5 optimal spatial dimensions for each feature
(2) Dichotomy characteristic primary screen
In the iteration process, each iteration model can obtain a verification set R2And RMSE as model evaluation indexes of corresponding feature sets. And then, arranging according to the importance of the features in a descending order, selecting the first 50% of the features as new features for next model training until the model precision reaches the highest and the feature set is the simplest, and obtaining the initially screened optimal feature set. The results of the dichotomy iteration process are as follows:
TABLE 6 dichotomy characterization of preliminary screening Process
R of the model during the second iteration2The highest is reached and the RMSE is the lowest, which is only 8 features. Then, the feature with the importance of the time being 50% is used as the feature set of the random forest model of the third iteration, and the result precision is greatly reduced, so that 8 of the original features are reserved in the initially screened optimal feature set, namely the proportion of the 35-44 years old population of a 5 x 5 grid and the proportion of the local population of a 7 x 7 gridFor example, the number of residential population in a 7 × 7 grid, the level 1 abundance ratio in a 7 × 7 grid, the road network accessibility, the internet cafe density within 1500 meters, the entertainment place density within 1000 meters, and the residential area density within 1000 meters, and these 8 features are denoted as feature set F1.
(3) Optimal feature set determination
In the last step, the characteristics of the model are preliminarily screened by using a bisection method and a recursive elimination method, redundant variables which do not contribute much to the model are preliminarily removed, but some characteristics which contribute to the model precision to a certain extent are discarded, so that the discarded variables in the second iteration need to be gradually added into the characteristic set F1 according to the importance level, and the parameters are compared with the optimal precision obtained by the bisection method to obtain a model scheme with the minimum characteristic quantity and the optimal precision, as shown in Table 7.
TABLE 7 optimal feature set determination Process
From table 7, it can be seen that the verification set has the highest accuracy when the feature set is F5. In this case, the number of features is 12, and the number of features is 35 to 44 years old resident population ratio in a 5 × 5 grid, local resident population ratio in a 7 × 7 grid, resident population number in a 7 × 7 grid, level 1 abundance ratio in a 7 × 7 grid, network accessibility, internet cafe density in 1500 meters, entertainment place density in 1000 meters, residential district density in 1000 meters, ATM density in 1500m, hospital density in 1500m, non-local population ratio in a 5 × 5 grid, and 25 to 34 years old population ratio in a 7 × 7 grid.
5. Random forest model result for burglary risk assessment
And performing grid searching parameter on the random forest model by comparing the precision of the verification set, and finally determining three model parameter values as follows: number of decision trees 40, maximum feature number 4, maximum depth 10.
(1) Random forest variable importance analysis
By using the variable importance calculation method introduced by the invention, the variable importance of the final 12 features is calculated, and fig. 5 is a proportion of the variable importance calculated based on the node purity of each feature. The importance of the social variable in the importance of the variable calculated based on the node purity accounted for 31.3%. The results show that social features play an important role in the burglary risk analysis.
(2) Test set accuracy
The trained model is used for 2750 grids in the test set to obtain a burglary risk prediction value of the test set, the decision coefficient of the model in the test set reaches 0.821, the root mean square error RMSE is 1.178, and the fitting effect of the model is ideal. Fig. 6 shows a scatter plot between the true observed value and the estimated value, where each point represents a 250m × 250m grid in space, the horizontal axis represents the true risk obtained by performing kernel density analysis on historical crime data, and the vertical axis represents the risk value of random forest prediction. On the whole, the correlation between the predicted value and the true value is high, the low value and the medium value are distributed around a 1:1 line uniformly, and underestimation exists to a certain degree for the high value.
The high risk areas of the first 5%, the first 10% and the first 20% of the model prediction are further checked for precision ratio, and as can be seen from table 8, the model can effectively identify the crime hot areas. When the first 10% of the regions with the risk values arranged from large to small are used as high risk regions, the precision ratio of the high risk regions reaches 80.7%, and certain practical reference significance can be provided for police service prevention and control.
TABLE 8 model precision calculation