Method, device and equipment for generating noise-containing voice data in vehicle
1. A method for generating in-vehicle noisy speech data, characterized by comprising the following steps:
pre-creating a voice material library, wherein the voice material library comprises pure human voice audio data and real-vehicle pure noise audio data;
receiving and parsing a data requirement input by a user;
matching, according to the parsed data requirement, pure human voice audio data and real-vehicle pure noise audio data corresponding to the user requirement from the voice material library, respectively;
and mixing the matched pure human voice audio data with the matched real-vehicle pure noise audio data to generate in-vehicle noisy speech data.
2. The method for generating in-vehicle noisy speech data according to claim 1, wherein said pre-creating a voice material library comprises:
pre-recording different pure human voice source materials, and labeling each pure human voice source material with human voice source information to obtain the pure human voice audio data;
pre-recording real-vehicle pure noise materials in different scenes, and labeling each real-vehicle pure noise material with scene noise information to obtain the real-vehicle pure noise audio data.
3. The method for generating in-vehicle noisy speech data according to claim 2, wherein said parsing a data requirement input by a user comprises:
obtaining the parsed data requirement according to the data requirement input by the user, the pure human voice audio data labeled with the human voice source information, the real-vehicle pure noise audio data labeled with the scene noise information, and a pre-trained prediction model based on semantic analysis.
4. The method for generating in-vehicle noisy speech data according to claim 2, wherein the parsed data requirement includes the following requirement information:
voice characteristics of the main speaker and real-vehicle noise scene information, and/or a proportional relationship between the human voice source energy and the real-vehicle noise energy.
5. The method for generating in-vehicle noisy speech data according to claim 4, wherein said matching pure human voice audio data and real-vehicle pure noise audio data from the voice material library according to the parsed data requirement comprises:
matching optimal pure human voice audio data from the voice material library based on the requirement information and the labeled human voice source information;
and matching optimal real-vehicle pure noise audio data from the voice material library based on the requirement information and the labeled scene noise information.
6. The method for generating in-vehicle noisy speech data according to any one of claims 1 to 5, characterized by further comprising:
extracting voice primitives of the current human voice from the matched pure human voice audio data;
synthesizing a batch of pure human voice audio data from the voice primitives and a plurality of preset in-vehicle interaction texts;
and mixing each item of synthesized pure human voice audio data with the matched real-vehicle pure noise audio data, one by one, to obtain a batch of in-vehicle noisy speech data.
7. An apparatus for generating in-vehicle noisy speech data, comprising:
a voice material library creating module, configured to create a voice material library in advance, the voice material library comprising pure human voice audio data and real-vehicle pure noise audio data;
a data requirement acquisition module, configured to receive and parse a data requirement input by a user;
a requirement matching module, configured to match, according to the parsed data requirement, pure human voice audio data and real-vehicle pure noise audio data corresponding to the user requirement from the voice material library, respectively;
and a target data generation module, configured to mix the matched pure human voice audio data with the matched real-vehicle pure noise audio data to generate in-vehicle noisy speech data.
8. The apparatus for generating in-vehicle noisy speech data according to claim 7, said apparatus further comprising:
a voice primitive extraction module, configured to extract voice primitives of the current human voice from the matched pure human voice audio data;
a speech synthesis module, configured to synthesize a batch of pure human voice audio data from the voice primitives and a plurality of preset in-vehicle interaction texts;
and a target data amplification module, configured to mix each item of synthesized pure human voice audio data with the matched real-vehicle pure noise audio data, one by one, to obtain a batch of in-vehicle noisy speech data.
9. An electronic device, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the device, cause the device to perform the method for generating in-vehicle noisy speech data of any one of claims 1 to 6.
10. A computer data storage medium having a computer program stored therein, which when run on a computer causes the computer to execute the in-vehicle noise-containing voice data generation method according to any one of claims 1 to 6.
Background
With the rise of artificial intelligence, AI technology has permeated fields such as in-vehicle systems, education and healthcare, and intelligent voice is widely used across these fields as an important means of human-computer interaction. The in-vehicle scenario differs from scenarios such as smart home, healthcare and customer service: its noise environment is diverse and unstable, personal safety is at stake, and the requirements on speech recognition and similar functions are high. Optimizing speech processing such as speech recognition therefore requires a large amount of noisy and noise-free test speech audio data as support.
Building an in-vehicle noisy speech data set usually requires repeated on-site collection and recording in different locations and under different environmental conditions; the production requirements are demanding, so the overall cost is high. In addition, existing methods for producing in-vehicle noisy speech data are usually suitable only for evaluating a single target effect: once in-vehicle noisy speech data has been collected repeatedly in a given scene, the data set is difficult to reuse across scenes, and if the collected data is expected to cover all scenes, a great deal of time and labor must be spent on tedious work such as screening, selecting and classifying large volumes of in-vehicle noisy speech data. Moreover, with in-vehicle noisy speech data currently collected on site in a real vehicle, it is difficult to compare a single, stable factor during effect evaluation, so the accuracy required by actual testing cannot be met.
Therefore, to optimize in-vehicle speech processing such as interactive speech recognition at low cost, the ability to produce in-vehicle noisy speech data quickly and inexpensively for different noise scenes, for use in interaction testing, model training and the like, is an urgent need in the current field of in-vehicle artificial intelligence.
Disclosure of Invention
In view of the foregoing, the present invention aims to provide a method, an apparatus and a device for generating in-vehicle noisy speech data, and correspondingly a computer data storage medium and a computer program product, so that in-vehicle noisy speech data can be produced conveniently, simply, efficiently and at low cost.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a method for generating in-vehicle noisy speech data, including:
pre-creating a voice material library, wherein the voice material library comprises pure human voice audio data and real-vehicle pure noise audio data;
receiving and parsing a data requirement input by a user;
matching, according to the parsed data requirement, pure human voice audio data and real-vehicle pure noise audio data corresponding to the user requirement from the voice material library, respectively;
and mixing the matched pure human voice audio data with the matched real-vehicle pure noise audio data to generate in-vehicle noisy speech data.
In at least one possible implementation, the pre-creating a voice material library includes:
pre-recording different pure human voice source materials, and labeling each pure human voice source material with human voice source information to obtain the pure human voice audio data;
pre-recording real-vehicle pure noise materials in different scenes, and labeling each real-vehicle pure noise material with scene noise information to obtain the real-vehicle pure noise audio data.
In at least one possible implementation, the parsing a data requirement input by a user includes:
obtaining the parsed data requirement according to the data requirement input by the user, the pure human voice audio data labeled with the human voice source information, the real-vehicle pure noise audio data labeled with the scene noise information, and a pre-trained prediction model based on semantic analysis.
In at least one possible implementation, the parsed data requirement includes the following requirement information:
voice characteristics of the main speaker and real-vehicle noise scene information, and/or a proportional relationship between the human voice source energy and the real-vehicle noise energy.
In at least one possible implementation, the matching pure human voice audio data and real-vehicle pure noise audio data from the voice material library according to the parsed data requirement includes:
matching optimal pure human voice audio data from the voice material library based on the requirement information and the labeled human voice source information;
and matching optimal real-vehicle pure noise audio data from the voice material library based on the requirement information and the labeled scene noise information.
In at least one possible implementation, the method further includes:
extracting voice primitives of the current human voice from the matched pure human voice audio data;
synthesizing a batch of pure human voice audio data from the voice primitives and a plurality of preset in-vehicle interaction texts;
and mixing each item of synthesized pure human voice audio data with the matched real-vehicle pure noise audio data, one by one, to obtain a batch of in-vehicle noisy speech data.
In a second aspect, the present invention provides an apparatus for generating in-vehicle noisy speech data, including:
a voice material library creating module, configured to create a voice material library in advance, the voice material library comprising pure human voice audio data and real-vehicle pure noise audio data;
a data requirement acquisition module, configured to receive and parse a data requirement input by a user;
a requirement matching module, configured to match, according to the parsed data requirement, pure human voice audio data and real-vehicle pure noise audio data corresponding to the user requirement from the voice material library, respectively;
and a target data generation module, configured to mix the matched pure human voice audio data with the matched real-vehicle pure noise audio data to generate in-vehicle noisy speech data.
In at least one possible implementation, the voice material library creating module includes:
a pure human voice audio data preparation unit, configured to pre-record different pure human voice source materials and label each pure human voice source material with human voice source information to obtain the pure human voice audio data;
and a real-vehicle pure noise audio data preparation unit, configured to pre-record real-vehicle pure noise materials in different scenes and label each real-vehicle pure noise material with scene noise information to obtain the real-vehicle pure noise audio data.
In at least one possible implementation, the data requirement acquisition module is specifically configured to:
obtain the parsed data requirement according to the data requirement input by the user, the pure human voice audio data labeled with the human voice source information, the real-vehicle pure noise audio data labeled with the scene noise information, and a pre-trained prediction model based on semantic analysis.
In at least one possible implementation, the parsed data requirement includes the following requirement information:
voice characteristics of the main speaker and real-vehicle noise scene information, and/or a proportional relationship between the human voice source energy and the real-vehicle noise energy.
In at least one possible implementation, the requirement matching module includes:
a pure human voice audio matching unit, configured to match optimal pure human voice audio data from the voice material library based on the requirement information and the labeled human voice source information;
and a real-vehicle pure noise audio matching unit, configured to match optimal real-vehicle pure noise audio data from the voice material library based on the requirement information and the labeled scene noise information.
In at least one possible implementation, the apparatus further includes:
a voice primitive extraction module, configured to extract voice primitives of the current human voice from the matched pure human voice audio data;
a speech synthesis module, configured to synthesize a batch of pure human voice audio data from the voice primitives and a plurality of preset in-vehicle interaction texts;
and a target data amplification module, configured to mix each item of synthesized pure human voice audio data with the matched real-vehicle pure noise audio data, one by one, to obtain a batch of in-vehicle noisy speech data.
In a third aspect, the present invention provides an electronic device, comprising:
one or more processors, a memory (which may be a non-volatile storage medium), and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the device, cause the device to perform the method of the first aspect or of any possible implementation of the first aspect.
In a fourth aspect, the present invention provides a computer data storage medium having a computer program stored thereon, which, when run on a computer, causes the computer to perform at least the method as described in the first aspect or any of its possible implementations.
In a fifth aspect, the present invention also provides a computer program product for performing at least the method of the first aspect or any of its possible implementations, when the computer program product is executed by a computer.
In at least one possible implementation of the fifth aspect, the program associated with the product may be stored in whole or in part in a memory packaged with the processor, or in part or in whole on a storage medium not packaged with the processor.
The concept of the invention is to collect pure real-scene vehicle noise and pure human voice sources in advance to build an in-vehicle voice material library. The audio resources in this library are combined with the requirement provided by the user to obtain, separately, the two independent elements of in-vehicle noisy speech data that meet the user's requirement: a pure human voice source and pure real-scene vehicle noise. The matched pure human voice source and pure real-scene vehicle noise are then fused by channel to generate the target in-vehicle noisy speech data required by the user. On the one hand, the method does not require repeated, demanding collection of noisy speech in a real vehicle, and the scale of the recorded voice material resources can be significantly smaller than that of existing real-vehicle noisy speech collection. On the other hand, because the two elements that make up a data sample are prepared and matched independently, when in-vehicle noisy speech data is needed in batches for the same or a similar noise scene, it is only necessary to expand the voice material resources with mature speech synthesis technology, rather than re-record data on site as in the prior art. The pre-established voice material resources are thus reused, the drawbacks of the current manual real-vehicle recording approach (time-consuming, labor-intensive and costly) are overcome, and the efficiency of producing in-vehicle interactive speech data samples is effectively improved.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of a method for generating noisy speech data in a vehicle according to the present invention;
FIG. 2 is a flowchart of an embodiment of a voice data expansion method provided by the present invention;
FIG. 3 is a schematic diagram of an embodiment of an in-vehicle noisy speech data generating apparatus according to the present invention;
FIG. 4 is a schematic diagram of an embodiment of an electronic device provided by the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
Before the specific scheme of the invention is described, the existing process for preparing in-vehicle noisy speech data is introduced. In practice, given the complexity of the in-vehicle environment, most existing methods require speech data to be collected and recorded manually and directly in a real vehicle in a real scene. This is costly, time-consuming and labor-intensive, and it is difficult to do rigorously, so comparisons of speech recognition performance between scenes carry large errors. For example, when different air conditioner gears are compared at the same vehicle speed, noise data may be recorded twice in the vehicle on the same stretch of road; if a bus passes and generates noise during the first recording but no bus passes during the second, the environmental variables of the two recordings differ, and an accurate comparison on a single, stable factor is impossible. In addition, because the existing method directly records speech data that already contains noise, the in-vehicle human voice and the vehicle noise cannot be reliably separated. That is, if the environmental scene information or the human voice source information changes in any way, the data must be collected again; the collected speech data cannot be reused, which adds further cost.
To overcome the drawbacks of the existing manual real-vehicle, real-scene collection and recording of noisy speech data, the invention provides at least one embodiment of a method for generating in-vehicle noisy speech data, as shown in FIG. 1, which may specifically include:
Step S1, creating a voice material library in advance.
The voice material library may store at least two types of audio data, namely pure human voice audio data and real-vehicle pure noise audio data; equivalently, the voice material library may be understood as including at least two sub-libraries that store pure human voice audio data and real-vehicle pure noise audio data respectively.
Specifically, noisy speech data for in-vehicle interaction is mainly composed of a main speaker sound source, environmental noise and interference sources. The main speaker sound source is the object of speech processing and is mainly used for operations such as recognition, wake-up, intent analysis and command control. Environmental noise mainly refers to wind noise and tire noise during driving, noise from outside the vehicle, air conditioner noise, sound output by the in-vehicle player, other noise and the like; part of the environmental noise depends on the geographical location of the real vehicle, such as a garage, an urban area, a highway or a rural area, and varies accordingly. Interference sources mainly refer to other speakers in the vehicle: if the main speaker sound source is the driver, the interference sources may be human voice sources such as the front passenger, the seat behind the driver and the seat behind the front passenger; if the main speaker sound source is the front passenger, the interference sources may be human voice sources such as the driver, the seat behind the driver and the seat behind the front passenger. The main speaker sound source may be referred to as "pure human voice", and the environmental noise and the interference sources may together be referred to as "real-vehicle pure noise".
To produce in-vehicle noisy audio data according to the present invention, pure human voice audio data and real-vehicle pure noise audio data need to be prepared in advance. In implementation, both can be obtained by pre-recording, and the following examples are given for reference:
(1) Different pure human voice source materials are recorded in advance, and each pure human voice source material is labeled with human voice source information to obtain pure human voice audio data.
Regarding the recording of the main speaker sound source material, the pure human voice template can be customized according to actual requirements. Specifically, the voice may be recorded by a real person in a dedicated silent room, or a real person's voice may be simulated by speech synthesis technology. The following recording modes are given as examples (actual operation is not limited thereto):
Preferably, after the pure human voice source data corresponding to the main speaker has been collected, the material may be labeled according to voice characteristics. For example, the speaker role, gender, language (Chinese/foreign language/dialect), volume and tone (sweet/gentle/stiff/quiet, etc.) of each pure human voice audio item may be labeled; a given piece of main speaker source data may be labeled, for instance, as: driver-male-50 dB.
(2) Real-vehicle pure noise materials in different scenes are recorded in advance, and each real-vehicle pure noise material is labeled with scene noise information to obtain real-vehicle pure noise audio data.
Regarding the recording of real-vehicle pure noise material, the conditions and scenes in which users actually use the vehicle need to be considered, and the various kinds of vehicle-use information need to be collected by category. Specifically, collection and recording may be carried out by real-vehicle simulation: for example, a recording device is fixed to the driver's seat and simulated human voice playback devices are fixed to the driver's and front passenger's seats (these playback devices are used to record the aforementioned interference sources). Noise in different scenes may then be recorded using the control variable method. For example, to record noise at different vehicle speeds: at the same location, with no other artificial external sound source in the vehicle, the vehicle is driven at different speeds and the corresponding noise values are recorded. To record the conversation volume of the driver/front passenger: at the same location and the same speed, with no other artificial external sound source in the vehicle, the simulated human voice playback devices at the driver's/front passenger's seats are used to simulate conversation, and the driver/front passenger conversation volume is recorded. In the same way, real-vehicle pure noise can be recorded for different real-vehicle scenes, different air conditioner gears and different in-vehicle player volumes. The objects of real-vehicle pure noise collection and recording may include, but are not limited to, the following:
Preferably, after the real-vehicle pure noise has been recorded, the material may be labeled according to the noise scene information, for example by labeling the collected scene information by category. Because the vehicle noise produced at different speeds differs (mainly wind noise, tire noise, engine noise, vibration of related parts and the like), the vehicle speeds commonly used by users are taken to be 0 km/h, 20 km/h, 40 km/h, 60 km/h, 80 km/h, 100 km/h and 120 km/h, and a unified classification label is applied in combination with each speed, such as: vehicle speed 40 km/h-5 dB. The air conditioner gears commonly used by users are taken to be off (gear 0), low, medium and high, and a unified classification label is likewise applied in combination with each gear, such as: air conditioner medium-3 dB. The locations where users typically drive are taken to be garages, areas near residential communities, downtown areas and highways, and a similar label may be applied, such as: garage-3 dB. Similarly, the in-vehicle player volume may be labeled as: in-vehicle music-30 dB, and an interference source may be labeled as: front passenger-30 dB. An example of a real-vehicle pure noise label for a scene that combines several factors is: 10 dB + 3 dB + 20 dB + air conditioner medium gear in a downtown area.
It should be noted that the pre-labeling of the foregoing preferred embodiment can effectively increase the reuse rate of the voice material library.
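By way of illustration only, the labeled voice material library of step S1 may be sketched as a pair of record types. This is a minimal sketch and not part of the disclosed scheme: the class names, field names and file paths below are hypothetical, chosen to mirror the label categories mentioned above (speaker role, gender, language, volume; location, vehicle speed, air conditioner gear, noise level).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class VoiceMaterial:
    """One pure human voice recording with its label (hypothetical fields)."""
    path: str          # hypothetical file location
    role: str          # e.g. "driver", "front_passenger"
    gender: str
    language: str
    volume_db: float


@dataclass(frozen=True)
class NoiseMaterial:
    """One real-vehicle pure noise recording with its scene label."""
    path: str          # hypothetical file location
    location: str      # e.g. "garage", "downtown", "highway"
    speed_kmh: int
    ac_gear: str       # "off", "low", "medium", "high"
    noise_db: float


@dataclass
class MaterialLibrary:
    """Two sub-libraries: pure human voice and real-vehicle pure noise."""
    voices: List[VoiceMaterial] = field(default_factory=list)
    noises: List[NoiseMaterial] = field(default_factory=list)


library = MaterialLibrary()
library.voices.append(
    VoiceMaterial("voice/driver_male_01.wav", "driver", "male", "mandarin", 50.0)
)
library.noises.append(
    NoiseMaterial("noise/downtown_40_medium.wav", "downtown", 40, "medium", 10.0)
)
```

Keeping the two sub-libraries separate reflects the point made above that the two elements of a data sample are prepared and labeled independently.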
Step S2, receiving and parsing the data requirement input by the user.
The user is, for example and without limitation, an in-vehicle interactive voice tester. To test in-vehicle speech recognition more accurately in different scenes, a tester often needs to customize a batch of in-vehicle noisy speech data under different vehicle noise conditions as a test set. Common data requirement information may therefore be: voice characteristics of the main speaker and real-vehicle noise scene information, and/or a proportional relationship between the human voice source energy and the real-vehicle noise energy.
Moreover, the requirement input by the user may be relatively explicit, for example: the main speaker sound source is male-Mandarin Chinese-50 dB, and the ratio of the main speaker sound source energy to the vehicle noise energy is 5:. Given such a specific and unambiguous data requirement, the subsequent step S3 may be performed directly. The requirement may also be relatively fuzzy; for such a fuzzy data requirement instruction, the parsing may preferably take into account the data requirement input by the user, the pure human voice audio data labeled with human voice source information, the real-vehicle pure noise audio data labeled with scene noise information, and a pre-trained prediction model based on semantic analysis, to obtain the parsed data requirement.
In practice, for such cases a scene information prediction model built on a neural network architecture may be trained in advance on data sets from the voice material library. The training inputs may include, for example and without limitation, the main speaker sound source label text M, the ratio Y of main speaker sound source energy to vehicle noise weight, and a large number of fuzzy requirement texts (or, of course, speech) collected from users. After semantic analysis, the model outputs the specific main speaker sound source information M and the optimal pure vehicle noise scene information S. Thus, in the actual operation stage, when a relatively fuzzy data requirement currently input by the user is supplied as the model input, the scene information prediction model computes and outputs the main speaker sound source information M (such as male-Mandarin Chinese-50 dB) and the optimal pure vehicle noise scene information S (such as downtown S1-air conditioner high gear S2-interference source S3, where S1 is 10 dB, S2 is 5 dB and S3 is 15 dB). For model selection, training and the like, reference may be made to existing mature techniques, which are neither repeated nor limited in the present invention.
Step S3, matching, according to the parsed data requirement, pure human voice audio data and real-vehicle pure noise audio data corresponding to the user requirement from the voice material library, respectively.
Whether the user's data requirement information is parsed directly or predicted by the model, the pure human voice and the real-vehicle pure noise can be matched and extracted from the voice material library based on the parsed requirement information. Preferably, optimal pure human voice audio data is matched from the voice material library based on the requirement information and the human voice source information labeled in the library, and optimal real-vehicle pure noise audio data is matched from the voice material library based on the requirement information and the scene noise information labeled in the library.
The matching process may draw on mature data matching techniques, which are neither repeated nor limited in the present invention. It should be further pointed out, however, what "optimal" means here: since the parsed requirement information may not match an item in the library exactly, a matching deviation range may be specified, several results that satisfy the requirement information within that range are obtained during matching, and the optimal pure human voice audio data and real-vehicle pure noise audio data are then selected from those results.
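The deviation-range matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the matching means of the invention: the function name `match_best`, the dictionary-based labels and the rule "categorical labels must match exactly, the numeric label must fall within the tolerance, and the closest candidate wins" are all hypothetical choices made for the example.

```python
def match_best(candidates, wanted, numeric_key, tolerance):
    """Pick the candidate closest to the requirement within a deviation range.

    candidates  -- list of label dicts from the material library
    wanted      -- requirement dict; every key except numeric_key is categorical
    numeric_key -- name of the numeric label compared with a tolerance
    tolerance   -- maximum allowed absolute deviation on numeric_key
    Returns the best candidate, or None when nothing falls inside the range.
    """
    target = wanted[numeric_key]
    categorical = {k: v for k, v in wanted.items() if k != numeric_key}

    # Keep every candidate that satisfies the requirement within the range.
    eligible = [
        c for c in candidates
        if all(c.get(k) == v for k, v in categorical.items())
        and abs(c[numeric_key] - target) <= tolerance
    ]
    if not eligible:
        return None
    # Among the eligible results, the smallest deviation is taken as optimal.
    return min(eligible, key=lambda c: abs(c[numeric_key] - target))


voices = [
    {"id": "v1", "gender": "male", "language": "mandarin", "volume_db": 47.0},
    {"id": "v2", "gender": "male", "language": "mandarin", "volume_db": 52.0},
    {"id": "v3", "gender": "female", "language": "mandarin", "volume_db": 50.0},
]
requirement = {"gender": "male", "language": "mandarin", "volume_db": 50.0}
best = match_best(voices, requirement, "volume_db", tolerance=5.0)
```

The same function can be applied to the noise sub-library with, for instance, a noise level as the numeric label; other selection criteria for "optimal" are equally possible.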
Step S4, mixing the matched pure human voice audio data with the matched real-vehicle pure noise audio data to generate in-vehicle noisy speech data.
Finally, the matched optimal main speaker sound source and the optimal pure vehicle noise source are acoustically fused to produce the target in-vehicle noisy speech data that meets the user's requirement. The acoustic fusion described here may also follow known methods, such as superimposing the two single-channel signals, the main speaker sound source and the pure vehicle noise source, into single-channel mixed data, which is the noisy speech data.
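As a hedged illustration of such single-channel superposition, the sketch below scales the noise so that the ratio of voice energy to noise energy equals a requested value and then adds the two signals sample by sample. The function names, the looping of the noise to cover the voice duration, and the ratio value of 4.0 are assumptions made for the example; the invention does not prescribe a particular mixing formula.

```python
import math


def energy(samples):
    """Sum of squared sample values."""
    return sum(s * s for s in samples)


def mix_at_energy_ratio(voice, noise, ratio):
    """Superimpose noise on voice so that energy(voice) / energy(noise) == ratio.

    voice, noise -- single-channel signals as lists of floats
    ratio        -- requested voice-to-noise energy ratio (must be positive)
    The noise is looped and truncated to the voice length (an assumption).
    """
    if not voice or not noise:
        raise ValueError("voice and noise must be non-empty")
    if ratio <= 0:
        raise ValueError("ratio must be positive")

    repeats = -(-len(voice) // len(noise))        # ceiling division
    tiled = (noise * repeats)[:len(voice)]

    noise_energy = energy(tiled)
    if noise_energy == 0:
        raise ValueError("noise is silent; cannot scale to a ratio")

    # Gain that brings the noise energy to energy(voice) / ratio.
    gain = math.sqrt(energy(voice) / (ratio * noise_energy))
    return [v + gain * n for v, n in zip(voice, tiled)]


voice = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, -0.1]
mixed = mix_at_energy_ratio(voice, noise, ratio=4.0)  # 4.0 is illustrative only
```

A production implementation would additionally need to align sampling rates and guard against clipping, which are omitted here for brevity.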
On the basis of the foregoing embodiments and their preferred schemes, it may further be considered that, in order to expand the amount of in-vehicle noise-containing voice data required, the matched pure human voice material that meets the user's current requirement may be used for expansion. Specifically, reference may be made to the example voice data expansion scheme shown in fig. 2:
step S10, extracting the voice primitives of the current human voice from the matched pure human voice audio data;
step S20, synthesizing a batch of pure human voice audio data by using the voice primitives and a plurality of preset vehicle interaction texts;
and step S30, mixing the synthesized pure human voice audio data with the matched real-vehicle pure noise audio data one by one to obtain batches of in-vehicle noise-containing voice data.
Specifically, the idea of this voice data expansion scheme is to perform speech synthesis using the main speaker's sound source material. The specific process includes, but is not limited to, the following: a plurality of preset texts to be synthesized are given, such as vehicle interaction texts, for example "how many restrooms are there along the way", "answer the call", "what is the current vehicle speed", "open the window", "navigate to the company" and the like; the voice primitives of the human voice are extracted from the matched pure human voice data of the main speaker (usually, but not exclusively, minimal phoneme samples such as initials and finals can be used as the extracted voice primitives); and the prosodic features of these voice primitives are adjusted and modified using mature speech synthesis technology and combined with the texts to be synthesized to obtain voice data that meets the user's requirement. In this way, when the user needs a large batch of in-vehicle noise-containing voice data, batch speech synthesis can be performed conveniently and quickly, and the pure vehicle noise audio resource matched to the current requirement is then uniformly superimposed, so that in-vehicle noise-containing voice data meeting the user's requirement can be produced at scale.
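The batch flow of steps S20 and S30 can be illustrated with the following sketch. The speech synthesis itself is outside its scope: `toy_synthesize` is a hypothetical placeholder that merely returns dummy samples, standing in for whatever mature speech synthesis engine is actually used, and `superimpose` and `expand_batch` are likewise illustrative names.

```python
from itertools import cycle, islice


def toy_synthesize(text):
    """Placeholder for a real synthesis engine: 4 dummy samples per character."""
    return [0.1] * (4 * len(text))


def superimpose(voice, noise):
    """Sample-wise sum, looping the noise to the length of the voice."""
    looped = islice(cycle(noise), len(voice))
    return [v + n for v, n in zip(voice, looped)]


def expand_batch(texts, synthesize, noise, mix):
    """Synthesize one utterance per text and mix each with the same noise."""
    batch = []
    for text in texts:
        voice = synthesize(text)                  # step S20
        batch.append((text, mix(voice, noise)))   # step S30, one by one
    return batch


texts = ["open the window", "answer the call", "navigate to the company"]
matched_noise = [0.01, -0.01]
batch = expand_batch(texts, toy_synthesize, matched_noise, superimpose)
```

Passing the synthesizer and the mixer in as parameters keeps the batch loop independent of the particular synthesis and mixing techniques chosen.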
In summary, the idea of the present invention is to collect real-vehicle pure noise and pure human voice sound sources in advance to build a vehicle voice material library. The audio resources in this library are combined with the requirement provided by the user to obtain, respectively, the two independent elements of in-vehicle noise-containing voice data that meet the user's expectations, namely the pure human voice sound source and the real-vehicle pure noise; the matched pure human voice sound source and real-vehicle pure noise are then subjected to channel fusion to generate the target in-vehicle noise-containing voice data required by the user. On the one hand, the method does not require repeated, demanding collection of noisy speech in real vehicles, and the scale of the recorded voice material resources can be significantly smaller than that of existing real-vehicle noisy speech collection. On the other hand, the two elements forming a data sample are prepared and matched independently, so that when batches of in-vehicle noise-containing voice data are required for the same or similar noise scenes, only the voice material resources need to be expanded using mature speech synthesis technology, and data does not need to be recorded on site repeatedly as in the prior art. The pre-built voice material resources are thus reused, the drawbacks of the current manual real-vehicle recording approach, such as being time-consuming, labor-intensive and costly, are overcome, and the efficiency of producing interactive voice data samples for vehicles is effectively improved.
Corresponding to the above embodiments and preferred schemes, the present invention further provides an embodiment of an in-vehicle noisy speech data generating apparatus, as shown in fig. 3, which may specifically include the following components:
the voice material library creating module 1 is used for creating a voice material library in advance, wherein the voice material library comprises pure human voice audio data and real-vehicle pure noise audio data;
the data requirement acquisition module 2 is used for receiving and analyzing the data requirement input by the user;
the demand matching module 3 is used for respectively matching pure human voice audio data and real vehicle pure noise audio data corresponding to user demands from the voice material library according to the analyzed data demands;
and the target data generation module 4 is used for carrying out sound mixing processing on the matched pure human voice audio data and the matched real-vehicle pure noise audio data to generate in-vehicle noise-containing voice data.
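Purely as an illustration of how the four components might be wired together in software, the following sketch treats each module as an interchangeable callable. The class name `NoisySpeechGenerator`, the toy library contents and the "speaker, scene" input format are assumptions made for the example and do not reflect any required implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Samples = List[float]


@dataclass
class NoisySpeechGenerator:
    """Wires the four logical modules of fig. 3 into one call."""
    library: Dict[str, Dict[str, Samples]]                  # built by module 1
    parse: Callable[[str], Dict[str, str]]                  # module 2
    match: Callable[[dict, dict], Tuple[Samples, Samples]]  # module 3
    mix: Callable[[Samples, Samples], Samples]              # module 4

    def generate(self, user_input: str) -> Samples:
        requirement = self.parse(user_input)
        voice, noise = self.match(self.library, requirement)
        return self.mix(voice, noise)


# Toy stand-ins, for illustration only.
library = {
    "voice": {"female": [0.2, 0.4], "male": [0.1, 0.3]},
    "noise": {"highway": [0.05, -0.05], "city": [0.02, -0.02]},
}


def parse(user_input):
    # Expects "speaker, scene"; a real parser may use a trained model instead.
    speaker, scene = (part.strip() for part in user_input.split(","))
    return {"speaker": speaker, "scene": scene}


def match(lib, req):
    return lib["voice"][req["speaker"]], lib["noise"][req["scene"]]


def mix(voice, noise):
    return [v + n for v, n in zip(voice, noise)]


generator = NoisySpeechGenerator(library, parse, match, mix)
mixed = generator.generate("female, highway")
```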
In at least one possible implementation manner, the speech material library creating module includes:
the pure human voice audio data preparation unit is used for prerecording different pure human voice sound source materials and marking human voice sound source information on each pure human voice sound source material to obtain pure human voice audio data;
and the real vehicle pure noise audio data preparation unit is used for prerecording real vehicle pure noise materials under different scenes, marking scene noise information on each real vehicle pure noise material and obtaining real vehicle pure noise audio data.
In at least one possible implementation manner, the data requirement acquisition module is specifically configured to:
obtain the parsed data requirement according to the data requirement input by the user, the pure human voice audio data marked with the human voice sound source information, the real-vehicle pure noise audio data marked with the scene noise information, and a pre-trained prediction model based on semantic analysis.
In at least one possible implementation manner, the parsed data requirement includes the following requirement information:
the human voice characteristics of the main speakers and the real vehicle noise scene information and/or the proportional relation between the human voice sound source energy and the real vehicle noise energy.
In at least one possible implementation manner, the requirement matching module includes:
the pure human voice audio matching unit is used for matching optimal pure human voice audio data from the voice material library based on the demand information and the marked human voice sound source information;
and the real vehicle pure noise audio matching unit is used for matching optimal real vehicle pure noise audio data from the voice material library based on the demand information and the marked scene noise information.
In at least one possible implementation manner, the apparatus further includes:
the voice primitive extraction module is used for extracting the voice primitives of the current human voice from the matched pure human voice audio data;
the voice synthesis module is used for synthesizing a batch of pure human voice audio data by using the voice primitives and a plurality of preset vehicle interaction texts;
and the target data amplification module is used for carrying out sound mixing processing on the synthesized pure human voice audio data and the matched real-vehicle pure noise audio data one by one to obtain batches of in-vehicle noise-containing voice data.
It should be understood that the above division of the components of the in-vehicle noisy speech data generating apparatus shown in fig. 3 is merely a logical division; in actual implementation the components may be wholly or partially integrated into one physical entity or may be physically separate. These components may all be implemented as software invoked by a processing element, or may all be implemented in hardware; alternatively, some components may be implemented as software invoked by a processing element and others in hardware. For example, a certain module may be a separate processing element, or may be integrated into a chip of the electronic device. Other components are implemented similarly. In addition, all or some of the components can be integrated together or implemented independently. In implementation, each step of the above method or each of the above components may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), one or more Field Programmable Gate Arrays (FPGAs), etc. For another example, these components may be integrated together and implemented in the form of a System-on-a-Chip (SOC).
In view of the foregoing examples and preferred embodiments thereof, it will be appreciated by those skilled in the art that, in practice, the technical idea underlying the present invention may be applied in a variety of embodiments, which the present invention illustrates schematically by the following carriers:
(1) An electronic device. The device may specifically include: one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the device, cause the device to perform the steps/functions of the foregoing embodiments or an equivalent implementation.
The electronic device may specifically be an electronic device related to a computer, such as but not limited to various interactive terminals and electronic products, for example, a specific vehicle-mounted intelligent terminal, or a computer of a vehicle-mounted interactive testing platform.
Fig. 4 is a schematic structural diagram of an embodiment of an electronic device provided by the present invention. Specifically, the electronic device 900 includes a processor 910 and a memory 930. The processor 910 and the memory 930 can communicate with each other and transfer control and/or data signals through an internal connection path; the memory 930 is used for storing a computer program, and the processor 910 is used for calling and running the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device or, more commonly, may be components independent of each other, and the processor 910 is configured to execute the program code stored in the memory 930 to implement the functions described above. In particular implementations, the memory 930 may be integrated with the processor 910 or may be separate from the processor 910.
In addition, to further enhance the functionality of the electronic device 900, the device 900 may further include one or more of an input unit 960, a display unit 970, an audio circuit 980, a camera 990, a sensor 901, and the like, wherein the audio circuit may further include a speaker 982, a microphone 984, and the like, and the display unit 970 may include a display screen.
Further, the apparatus 900 may also include a power supply 950 for providing power to various devices or circuits within the apparatus 900.
It should be understood that the operation and/or function of the various components of the apparatus 900 can be referred to in the foregoing description with respect to the method, system, etc., and the detailed description is omitted here as appropriate to avoid repetition.
It should be understood that the processor 910 in the electronic device 900 shown in fig. 4 may be a system on chip (SOC), and the processor 910 may include a Central Processing Unit (CPU) and may further include other types of processors, such as a Graphics Processing Unit (GPU), which will be described in detail later.
In summary, various portions of the processors or processing units within the processor 910 may cooperate to implement the foregoing method flows, and corresponding software programs for the various portions of the processors or processing units may be stored in the memory 930.
(2) A computer data storage medium having stored thereon a computer program or the above apparatus which, when executed, causes a computer to perform the steps/functions of the preceding embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any of the functions, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, certain aspects of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, as described below.
In particular, it should be noted that the storage medium may refer to a server or a similar computer device, and specifically, the aforementioned computer program or the aforementioned apparatus is stored in a storage device in the server or the similar computer device.
(3) A computer program product (which may include the above-described apparatus) that, when run on a terminal device, causes the terminal device to execute the in-vehicle noisy speech data generation method of the foregoing embodiment or an equivalent embodiment.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps in the above implementation method can be implemented by software plus a necessary general hardware platform. With this understanding, the above-described computer program product may include, but is not limited to, an application (APP).
In the foregoing, the device/terminal may be a computer device, and the hardware structure of the computer device may further specifically include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus. The processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP) or a microcontroller, and may further include a GPU, an embedded neural network processor (NPU) and an Image Signal Processor (ISP); the processor may further include an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. In addition, the processor may have the function of running one or more software programs, which may be stored in a storage medium such as the memory. The aforementioned memory/storage medium may include: non-volatile memories, such as non-removable magnetic disks, USB flash drives, removable hard disks and optical disks, as well as Read-Only Memory (ROM), Random Access Memory (RAM), and the like.
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and means that three relationships are possible; for example, A and/or B may mean that A exists alone, that A and B exist simultaneously, or that B exists alone, where A and B can each be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c may represent: a, b, c, a and b, a and c, b and c, or a and b and c, where a, b and c can each be single or multiple.
Those of skill in the art will appreciate that the various modules, units, and method steps described in the embodiments disclosed in this specification can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In addition, modules, units, etc. described herein as separate components may or may not be physically separate, i.e., they may be located in one place or may be distributed across multiple places, e.g., nodes of a system network. Some or all of the modules and units can be selected according to actual needs to achieve the purpose of the above-mentioned embodiments. Those skilled in the art can understand and implement this without inventive effort.
The structure, features and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings. The above embodiments are, however, merely preferred embodiments of the present invention, and it should be understood that those skilled in the art can reasonably combine and configure the technical features of the above embodiments and their preferred modes into various equivalent schemes without departing from or changing the design idea and technical effects of the present invention. Therefore, the invention is not limited to the embodiments shown in the drawings, and all modifications and equivalent embodiments made according to the idea of the invention fall within the scope of the invention as long as they do not go beyond the spirit covered by the description and the drawings.