Text data based standardized processing method and equipment
1. A method of normalization processing based on text data, wherein the method comprises:
determining a target data type corresponding to at least one piece of original text data to be processed;
calling a target data dictionary corresponding to the target data type, wherein the target data dictionary comprises at least one classification field and one or more preset data objects corresponding to each classification field;
and normalizing the at least one piece of original text data based on the target data dictionary to obtain normalized data structure data corresponding to the at least one piece of original text data, wherein the normalized data structure data comprises one or more classification fields in the at least one classification field and one or more target data objects corresponding to each classification field in the one or more classification fields.
2. The method of claim 1, wherein the determining a target data type corresponding to at least one piece of original text data to be processed comprises:
acquiring at least one piece of original text data to be processed;
and prejudging extractable fields of the at least one piece of original text data, and determining a target data type corresponding to the at least one piece of original text data.
3. The method of claim 1, wherein the method further comprises:
and presetting a data dictionary corresponding to different data types, wherein the data dictionary comprises at least one field and one or more preset data objects corresponding to each field.
4. The method of any of claims 1-3, wherein the normalizing the at least one piece of original text data based on the target data dictionary to obtain normalized data structure data corresponding to the at least one piece of original text data, wherein the normalized data structure data comprises one or more classification fields of the at least one classification field and one or more target data objects corresponding to each classification field of the one or more classification fields, comprises:
carrying out case conversion, space removal and special character filtering on each piece of original text data in the at least one piece of original text data to obtain at least one piece of preprocessed original text data;
performing word segmentation processing on the preprocessed at least one piece of original text data in sequence based on the target data dictionary to obtain one or more classification fields corresponding to the at least one piece of original text data and one or more preset data objects corresponding to each classification field in the one or more classification fields;
and labeling field information of one or more preset data objects corresponding to each classification field in the one or more classification fields based on the target data dictionary to obtain one or more target data objects corresponding to each classification field in the one or more classification fields so as to obtain standardized data structure data corresponding to the at least one piece of original text data.
5. A non-transitory storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 4.
6. A text data based normalization processing apparatus, wherein the apparatus comprises:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
Background
In the prior art, as the volume of data keeps growing, data has become an asset with great potential. However, data generated through different channels, such as different platforms or different companies, does not share a common format, so it cannot be standardized in a uniform way. For example, one platform holds text data consisting of commodity titles, and another platform holds text data that also consists of commodity titles; it is unclear how to count the number of brands appearing in the text data of the two platforms, or how to count the number of commodities of each color in that text data. Because the data exists only in text form, the various data indicators cannot be mined in depth, and the full value of the data cannot be obtained.
Disclosure of Invention
An object of the present application is to provide a method and an apparatus for text data-based standardization processing, which implement standardization processing on text data so as to further mine the value of data assets.
According to an aspect of the present application, a text data-based standardization processing method is provided, wherein the method includes:
determining a target data type corresponding to at least one piece of original text data to be processed;
calling a target data dictionary corresponding to the target data type, wherein the target data dictionary comprises at least one classification field and one or more preset data objects corresponding to each classification field;
and normalizing the at least one piece of original text data based on the target data dictionary to obtain normalized data structure data corresponding to the at least one piece of original text data, wherein the normalized data structure data comprises one or more classification fields in the at least one classification field and one or more target data objects corresponding to each classification field in the one or more classification fields.
Further, in the above method, the determining a target data type corresponding to at least one piece of original text data to be processed includes:
acquiring at least one piece of original text data to be processed;
and prejudging extractable fields of the at least one piece of original text data, and determining a target data type corresponding to the at least one piece of original text data.
Further, in the above method, the method further includes:
and presetting a data dictionary corresponding to different data types, wherein the data dictionary comprises at least one field and one or more preset data objects corresponding to each field.
Further, in the above method, the normalizing the at least one piece of original text data based on the target data dictionary to obtain normalized data structure data corresponding to the at least one piece of original text data, where the normalized data structure data includes one or more classification fields in the at least one classification field and one or more target data objects corresponding to each classification field in the one or more classification fields, includes:
carrying out case conversion, space removal and special character filtering on each piece of original text data in the at least one piece of original text data to obtain at least one piece of preprocessed original text data;
performing word segmentation processing on the preprocessed at least one piece of original text data in sequence based on the target data dictionary to obtain one or more classification fields corresponding to the at least one piece of original text data and one or more preset data objects corresponding to each classification field in the one or more classification fields;
and labeling field information of one or more preset data objects corresponding to each classification field in the one or more classification fields based on the target data dictionary to obtain one or more target data objects corresponding to each classification field in the one or more classification fields so as to obtain standardized data structure data corresponding to the at least one piece of original text data.
According to another aspect of the present application, there is also provided a non-volatile storage medium having computer-readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the text data based normalization processing method as described above.
According to another aspect of the present application, there is also provided a text data-based normalization processing apparatus, wherein the apparatus includes:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a method of standardized processing based on textual data as described above.
Compared with the prior art, the method and the device have the advantages that the target data type corresponding to at least one piece of original text data to be processed is determined firstly; then, calling a target data dictionary corresponding to the target data type, wherein the target data dictionary comprises at least one classification field and one or more preset data objects corresponding to each classification field; and finally, standardizing the at least one piece of original text data based on the target data dictionary to obtain standardized data structure data corresponding to the at least one piece of original text data, wherein the standardized data structure data comprises one or more classification fields in the at least one classification field and one or more target data objects corresponding to each classification field in the one or more classification fields, so that the text data is standardized to facilitate the subsequent deep mining of the value of the data assets.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a method for text data based normalization processing according to an aspect of the subject application;
FIG. 2 is a schematic diagram illustrating a practical application scenario of a method for normalization processing based on textual data according to an aspect of the subject application;
FIG. 3 illustrates a flow diagram of a text data-based normalization processing method in an actual application scenario according to an aspect of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transient media), such as modulated data signals and carrier waves.
As shown in fig. 1, one aspect of the present application provides a flowchart of a text data-based normalization method, where the method includes steps S11, S12, and S13, and specifically includes the following steps:
Step S11, a target data type corresponding to at least one piece of original text data to be processed is determined. The target data types include, but are not limited to, the types of data corresponding to various application fields, such as a smart terminal type, a sales type, a paper type, a finance type, and a performance type.
Step S12, a target data dictionary corresponding to the target data type is called, where the target data dictionary includes at least one classification field and one or more preset data objects corresponding to each classification field. For example, if the target data type is the mobile phone class, the classification fields included in the target data dictionary corresponding to the mobile phone class may include a brand, a model, a color, a memory, a network type, a remark, and the like, where the network type indicates one or more data networks that the mobile phone can support, such as China Mobile, China Unicom, and China Telecom.
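The dictionary called in step S12 can be pictured as a nested mapping from classification fields to preset data objects. The following is a minimal sketch under stated assumptions, not the implementation of the present application: the names DataDictionary, DICTIONARIES, and call_target_dictionary are hypothetical, and the color and memory codes are illustrative values chosen only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataDictionary:
    """One data type's classification fields and their preset data objects."""
    data_type: str
    # classification field -> {surface form in text: preset data object (code)}
    fields: dict[str, dict[str, int]] = field(default_factory=dict)


# Hypothetical registry of preset dictionaries, keyed by data type.
DICTIONARIES = {
    "mobile_phone": DataDictionary(
        data_type="mobile_phone",
        fields={
            "brand": {"apple": 100, "iphone": 100, "huawei": 200},
            "color": {"green": 300, "black": 301},    # illustrative codes
            "memory": {"128g": 400, "256g": 401},     # illustrative codes
        },
    ),
}


def call_target_dictionary(target_type: str) -> DataDictionary:
    """Step S12: return the dictionary preset for the target data type."""
    try:
        return DICTIONARIES[target_type]
    except KeyError:
        raise ValueError(f"no data dictionary preset for type {target_type!r}") from None


phone_dictionary = call_target_dictionary("mobile_phone")
```

Mapping several surface forms ("apple", "iphone") to the same code is one way to let different spellings resolve to a single preset data object.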
Step S13, normalizing the at least one piece of original text data based on the target data dictionary to obtain normalized data structure data corresponding to the at least one piece of original text data, wherein the normalized data structure data includes one or more classification fields in the at least one classification field and one or more target data objects corresponding to each classification field in the one or more classification fields, so that the text data is normalized through the steps S11 to S13, so as to deeply mine the value of the data assets.
As shown in FIG. 2, suppose three pieces of original text data to be processed are obtained in step S11, namely the first three lines of original text data in FIG. 2, where each line represents one piece of original text data; the target data type corresponding to the three pieces of original text data is then determined from them to be the mobile phone type. In step S12, the target data dictionary corresponding to the target data type, namely the mobile phone type, is called. The target data dictionary includes at least one classification field and one or more preset data objects corresponding to each classification field; here the classification fields include the brand, model, color, memory, network type, remark, and the like shown in FIG. 2, where the network type indicates the data networks that the mobile phone can support, such as one or more of China Mobile, China Unicom, and China Telecom. In step S13, the three pieces of original text data in FIG. 2 are normalized based on the target data dictionary corresponding to the mobile phone type to obtain the normalized data structure data corresponding to the three pieces of original text data, such as the table contents indicated by the arrows in FIG. 2. The three pieces of original text data in the first three rows of FIG. 2 are thereby normalized, so that the value of the data assets can be mined further.
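Once text data has been normalized into classification fields, the counting questions raised in the background (how many brands, how many commodities of each color) reduce to simple aggregation. The sketch below illustrates this with hypothetical records; the actual rows of FIG. 2 are not reproduced here, and the function name count_by_field is an assumption made for this example.

```python
from collections import Counter

# Hypothetical normalized records; these are NOT the rows of FIG. 2.
records = [
    {"brand": "apple", "color": "green", "memory": "128g"},
    {"brand": "huawei", "color": "black", "memory": "256g"},
    {"brand": "apple", "color": "black", "memory": "128g"},
]


def count_by_field(rows, field_name):
    """Count how many records carry each value of one classification field."""
    return Counter(row[field_name] for row in rows if field_name in row)


brand_counts = count_by_field(records, "brand")
color_counts = count_by_field(records, "color")
distinct_brands = len(brand_counts)
```

Records that lack a given field are skipped rather than counted under a placeholder, which is one possible design choice among several.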
Further, the step S11 determines a target data type corresponding to at least one piece of original text data to be processed, specifically including:
acquiring at least one piece of original text data to be processed;
and prejudging extractable fields of the at least one piece of original text data, and determining a target data type corresponding to the at least one piece of original text data.
For example, in order to normalize one or more pieces of original text data to be processed, at least one piece of original text data to be processed is first acquired, for example the three pieces of original text data in the first three rows of FIG. 2. It is then preliminarily judged, from the three rows themselves, which fields can be extracted from them; for example, the preliminary judgment finds that the extractable fields are the brand, model, color, memory, network type, remark, and the like. The target data type corresponding to the three rows of original text data can then be determined from the category to which these extractable fields belong. In this way the target data type is determined at the same time as the extractable fields are prejudged, which limits the type range of the original text data to be processed; for example, the three pieces of original text data in the first three rows of FIG. 2 are defined as original text data of the mobile phone type.
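The text does not specify how the prejudgment is carried out. One possible reading, sketched below purely as an assumption, is to test each preset dictionary against the text and pick the data type whose fields are hit most often. The names extractable_fields and determine_target_type, and the sample dictionaries and texts, are hypothetical.

```python
def extractable_fields(text, dictionary):
    """Return the classification fields with at least one preset term in the text."""
    lowered = text.lower()
    return {name for name, terms in dictionary.items()
            if any(term in lowered for term in terms)}


def determine_target_type(texts, dictionaries):
    """Pick the data type whose dictionary yields the most extractable fields.

    Returns None when no dictionary matches any text.
    """
    best_type, best_score = None, 0
    for data_type, dictionary in dictionaries.items():
        score = sum(len(extractable_fields(t, dictionary)) for t in texts)
        if score > best_score:
            best_type, best_score = data_type, score
    return best_type


# Hypothetical dictionaries: field -> list of terms that signal the field.
DICTIONARIES = {
    "mobile_phone": {
        "brand": ["iphone", "huawei"],
        "memory": ["128g", "256g"],
        "color": ["green", "black"],
    },
    "paper": {"venue": ["journal", "conference"], "section": ["abstract"]},
}

texts = ["iPhone 12 128G green", "HUAWEI P40 256G black"]
target_type = determine_target_type(texts, DICTIONARIES)
```

Substring matching is deliberately crude here; a real system might prejudge fields with rules or a classifier instead.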
Further, the text data-based standardization processing method provided by the embodiment of the present application further includes:
and presetting a data dictionary corresponding to different data types, wherein the data dictionary comprises at least one field and one or more preset data objects corresponding to each field.
For example, in an actual application scenario, the acquired original text data to be processed may be of different data types, such as the types of data corresponding to various application fields: smart terminals such as mobile phones and computers, sales, papers, finance, and performance. To allow the original text data of each data type to be standardized in a targeted way, in the embodiment of the present application a data dictionary needs to be preset for each data type. Each data dictionary includes at least one field found in the original text data to be processed and one or more preset data objects corresponding to each field. The data dictionaries for different data types differ, so that original text data of each data type is processed with its own data dictionary, which achieves targeted standardization of the original text data of different data types.
Further, the step S13 normalizes the at least one piece of original text data based on the target data dictionary to obtain normalized data structure data corresponding to the at least one piece of original text data, where the normalized data structure data includes one or more classification fields in the at least one classification field and one or more target data objects corresponding to each classification field in the one or more classification fields, and specifically includes:
carrying out case conversion, space removal and special character filtering on each piece of original text data in the at least one piece of original text data to obtain at least one piece of preprocessed original text data;
performing word segmentation processing on the preprocessed at least one piece of original text data in sequence based on the target data dictionary to obtain one or more classification fields corresponding to the at least one piece of original text data and one or more preset data objects corresponding to each classification field in the one or more classification fields;
and labeling field information of one or more preset data objects corresponding to each classification field in the one or more classification fields based on the target data dictionary to obtain one or more target data objects corresponding to each classification field in the one or more classification fields so as to obtain standardized data structure data corresponding to the at least one piece of original text data.
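The preprocessing sub-step above can be sketched in a few lines. This is an illustrative sketch only: the text does not state the direction of the case conversion or which characters count as special, so the example assumes conversion to lowercase and treats every character other than a letter or digit as special. The function name preprocess is hypothetical.

```python
import re

# Assumption: a "special character" is anything that is not a letter or digit.
# \W matches non-word characters; the underscore is removed explicitly as well.
_SPECIAL = re.compile(r"[\W_]")
_WHITESPACE = re.compile(r"\s+")


def preprocess(text):
    """Case conversion (to lowercase), space removal, special character filtering."""
    lowered = text.lower()
    without_spaces = _WHITESPACE.sub("", lowered)
    return _SPECIAL.sub("", without_spaces)


cleaned = preprocess("  iPhone 12 【128G】 Green! ")
```

Because spaces are removed, adjacent tokens run together (here "12" and "128g"), so the later word segmentation has to rely on the data dictionary rather than on whitespace. Python's \W is Unicode-aware for str patterns, so Chinese characters are kept.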
For example, before the three pieces of original text data in the text data example shown in FIG. 3 are normalized with the target data dictionary corresponding to their target data type, each of the three pieces of original text data needs to undergo case conversion, space removal, and special character filtering, so that each piece of preprocessed original text data can be mapped against the data dictionary. Then, the target data dictionary corresponding to the mobile phone class, to which the three pieces of original text data belong, is used to perform word segmentation on each piece of preprocessed original text data, to obtain the classification fields corresponding to the three pieces of original text data and one or more preset data objects corresponding to each classification field, for example the table contents to the right of the word segmentation storage column shown in FIG. 3. This extracts the classification fields of the three pieces of original text data, namely the brand, model, color, memory, network type, remark, and the like, together with one or more preset data objects corresponding to each classification field; for example, the preset data objects corresponding to the classification field "brand" are 100, 200, and the like, and the preset data objects corresponding to the classification field "model" are 10000, 10001, and the like. One or more preset data objects are thereby obtained for every classification field extracted from the three pieces of original text data. Finally, based on the target data dictionary corresponding to the mobile phone class, field information is labeled on the one or more preset data objects corresponding to each of the classification fields shown in FIG. 3, to obtain one or more target data objects corresponding to each classification field, for example the table contents to the right of the standardized storage column in FIG. 3. This labels and maps the actual field information of each classification field; for example, the classification field "brand" includes Apple, Huawei, and Xiaomi, and the classification field "color" includes green, ice crystal pink, iceberg black, and the like. One or more actual target data objects are thus obtained for each classification field, which completes the standardization of the three pieces of original text data in FIG. 3 and yields the standardized data structure data shown in that table.
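The segmentation and labeling sub-steps can be illustrated with forward maximum matching, a common dictionary-based segmentation technique. The text does not name the segmentation algorithm, so this choice is an assumption, as are the function names, the memory and color codes, and the decision to collect unmatched characters as a remark. The brand codes (100, 200) and model codes (10000, 10001) follow the example values given above.

```python
def build_term_index(dictionary):
    """Flatten {field: {term: code}} into {term: (field, code)}."""
    index = {}
    for field_name, terms in dictionary.items():
        for term, code in terms.items():
            index[term] = (field_name, code)
    return index


def segment(text, index):
    """Forward maximum matching: at each position take the longest dictionary term.

    Returns (hits, remark): hits is a list of (term, field, code) tuples and
    remark holds the characters no dictionary term covered.
    """
    max_len = max(map(len, index), default=0)
    pos, hits, unmatched = 0, [], []
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in index:
                hits.append((piece, *index[piece]))
                pos += size
                break
        else:  # no term starts here: keep the character as remark text
            unmatched.append(text[pos])
            pos += 1
    return hits, "".join(unmatched)


def label(hits, code_names):
    """Map each preset data object (code) to its target data object (name)."""
    record = {}
    for _term, field_name, code in hits:
        record.setdefault(field_name, []).append(code_names[code])
    return record


# Hypothetical mobile phone dictionary; memory and color codes are illustrative.
PHONE_DICT = {
    "brand": {"apple": 100, "iphone": 100, "huawei": 200},
    "model": {"12": 10000, "p40": 10001},
    "memory": {"128g": 400},
    "color": {"green": 300},
}
CODE_NAMES = {100: "Apple", 200: "Huawei", 10000: "12", 10001: "P40",
              400: "128G", 300: "green"}

hits, remark = segment("iphone12128ggreen", build_term_index(PHONE_DICT))
record = label(hits, CODE_NAMES)
```

Greedy longest-match can mis-split when one term is a prefix of unrelated text, so a production system would likely need conflict handling that this sketch omits.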
In an actual application scenario of the text data-based standardized processing method provided in an embodiment of the present application, a specific processing flow may include the following steps:
In the first step, the class of the original text data to be processed is defined. For example, the data type of the text data example shown in FIG. 3 is the mobile phone class, so the three pieces of original text data in FIG. 3 are defined as the mobile phone class.
In the second step, the classification information of the one or more fields that can be extracted from the three pieces of original text data is preliminarily judged from the three pieces of original text data, to obtain the classification fields shown in FIG. 3, namely the brand, model, color, memory, network type, and the like.
In the third step, dictionary data is compiled for each of the classification fields identified in the second step. For example, for the classification field "brand", all brand information for mobile phones is collected, and both the Chinese and the English brand names are put into a dictionary whose data format is a set of key-value pairs, such as the Chinese name of Apple: 100; iphone: 100; the Chinese name of Huawei: 200; and huawei: 200, which gives the preset data objects corresponding to the classification field. Dictionary data for the other fields is compiled in the same way.
And fourthly, combining all the classified fields and the preset data objects thereof, and then arranging the combined classified fields and the preset data objects into a data dictionary corresponding to the mobile phone class.
Fifthly, performing word segmentation operation on the three original text data in the graph 3, wherein the original text data needs to be subjected to capital and lower case conversion, empty case removal, special character filtering and other operations before word segmentation, then performing word segmentation operation on the three processed original text data based on a target data dictionary corresponding to a mobile phone class to obtain word segmentation results shown in the graph 3, and then labeling field information on the data arranged in each line according to the word segmentation results to obtain standardized data structure data.
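The third and fourth steps, compiling key-value pairs per classification field and combining them into one dictionary for the data type, can be sketched as follows. The helper names build_field_dictionary and merge_fields, the alias lists, and the code spacing are assumptions made for illustration; only the brand codes 100 and 200 and the model codes 10000 and 10001 come from the example values in the text.

```python
def build_field_dictionary(alias_groups, start_code, step=100):
    """Assign one code per value so that every alias of a value shares that code.

    alias_groups maps a canonical name to its spellings (for example Chinese and
    English brand names). Returns (alias -> code, code -> canonical name).
    """
    mapping, canonical = {}, {}
    code = start_code
    for canonical_name, aliases in alias_groups.items():
        canonical[code] = canonical_name
        for alias in aliases:
            key = alias.lower()
            if key in mapping and mapping[key] != code:
                raise ValueError(f"alias {alias!r} is assigned to two codes")
            mapping[key] = code
        code += step
    return mapping, canonical


def merge_fields(field_dictionaries):
    """Combine per-field dictionaries into one dictionary for the data type."""
    return {name: dict(mapping) for name, mapping in field_dictionaries.items()}


# Hypothetical alias lists for two classification fields.
brand_groups = {"Apple": ["苹果", "apple", "iPhone"], "Huawei": ["华为", "huawei"]}
model_groups = {"12": ["12"], "P40": ["p40"]}

brand_map, brand_names = build_field_dictionary(brand_groups, start_code=100)
model_map, model_names = build_field_dictionary(model_groups, start_code=10000, step=1)
phone_dictionary = merge_fields({"brand": brand_map, "model": model_map})
```

Keeping the code-to-name table alongside the alias table gives the later labeling step a direct way to turn a preset data object back into a readable target data object.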
According to another aspect of the present application, there is also provided a non-volatile storage medium having computer-readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the text data based normalization processing method as described above.
According to another aspect of the present application, there is also provided a text data-based normalization processing apparatus, wherein the apparatus includes:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a method of standardized processing based on textual data as described above.
Here, for details of each embodiment of the text data-based normalization processing device, reference may be specifically made to corresponding parts of the above text data-based normalization processing method, and details are not described here again.
In summary, the present application determines a target data type corresponding to at least one piece of original text data to be processed; then, calling a target data dictionary corresponding to the target data type, wherein the target data dictionary comprises at least one classification field and one or more preset data objects corresponding to each classification field; and finally, standardizing the at least one piece of original text data based on the target data dictionary to obtain standardized data structure data corresponding to the at least one piece of original text data, wherein the standardized data structure data comprises one or more classification fields in the at least one classification field and one or more target data objects corresponding to each classification field in the one or more classification fields, so that the text data is standardized to facilitate the subsequent deep mining of the value of the data assets.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.