HIV subtype classification system and classification method
1. An HIV subtype classification system, characterized in that it comprises:
a database pool comprising first-generation sequencing sequences and second-generation sequencing data of HIV from an open public database;
a database management module comprising a database pool construction and integration module and a data update module, wherein,
the database pool construction and integration module processes an input second-generation sequencing BAM file into a consensus sequence Reads.fasta, records HIV sequences that pass the quality check into the database pool, and records newly added sequences from the open public database into the database pool,
the data update module is used for automatically downloading public database sequences at regular intervals;
a typing module comprising the following typing sub-modules:
an HIV second-generation sequencing data typing submodule used for counting the breakpoint coverage of the consensus sequences in the database pool over all HIV subtypes, calculating and comparing the breakpoint coverage rates corresponding to the different HIV subtypes, comparing them with the breakpoint coverage rate of the sample to be typed, and outputting a typing result for the sample to be typed,
an HIV first-generation sequencing data typing submodule used for directly BLAST-aligning the consensus sequence of the sample to be typed against the HIV first-generation sequencing sequences in the database pool and outputting a sequence similarity comparison result,
and a recombinant and mixed subtype HIV typing submodule used for aligning second-generation sequencing reads of the sample to be typed against the second-generation sequencing data of HIV in the database pool, counting and comparing the proportions of reads assigned to the different subtypes, and assisting in the determination of recombinant and mixed subtypes.
2. The HIV subtype classification system according to claim 1, wherein the HIV second-generation sequencing data typing submodule performs typing by the following steps:
S1, inputting a sequence: inputting the constructed consensus sequence;
S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary, uncorrected difference value;
S3, filtering out the amino acids associated with published drug-resistance surveillance sites to obtain a corrected difference value;
S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,
S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype,
S4.2, when the corrected difference value is less than or equal to 11%,
when the sample best matches a simple subtype sequence, the difference value relative to the simple subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype; if the difference value is greater than the upper limit, the sample is still judged to be that simple subtype and a warning is reported,
and when the best match is a complex subtype sequence, the difference value relative to the complex subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that complex subtype; if the difference value is greater than the upper limit, the best-matching parent is scored, wherein if the difference between the best-match parent score of the sample and the difference value for the circulating recombinant subtype is less than or equal to 1%, the parent subtype is reported, and otherwise the sample is judged to be a unique recombinant subtype.
3. The HIV subtype classification system according to claim 1, wherein the recombinant and mixed subtype HIV classification submodule performs the following steps to assist in the determination of recombinant and mixed subtypes:
aligning the second-generation sequencing reads of the sample to be typed against the second-generation sequencing data of HIV in the database pool, and counting and comparing the proportions of reads assigned to the different subtypes, wherein,
if the typing result of the HIV second-generation sequencing data typing submodule is a non-URF pure subtype, the read-alignment typing result of the recombinant and mixed subtype HIV typing submodule must be the same as the result of the HIV second-generation sequencing data typing submodule, with a read proportion of not lower than 60%;
if the typing result of the HIV second-generation sequencing data typing submodule is a URF pure subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have parental subtypes different from those of the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 ranked results are not higher than 60%;
if the typing result of the recombinant and mixed subtype HIV typing submodule is a mixed subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have the same parental subtypes as the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 results are not higher than 60%.
4. A method for classifying HIV subtypes, the method comprising the steps of:
sequencing a segment spanning the PR and RT regions of the pol gene of the HIV sample to be typed;
collecting HIV sequences from a public database and newly added sequence data of the HIV pol region, collecting second-generation sequencing data of HIV, and constructing a database pool;
processing data, namely processing the second-generation sequencing files in the database pool into a consensus sequence Reads.fasta, recording HIV sequences that pass the quality check into the database pool, and recording newly added sequences from the public database into the database pool; and
counting the breakpoint coverage over HIV subtypes of the consensus sequences converted from the second-generation sequencing data of HIV in the database pool, calculating the breakpoint coverage rates corresponding to the different HIV subtypes, comparing the breakpoint coverage rate of the sample to be typed with those of the known HIV subtypes, determining the type of the sample to be typed, and outputting the typing result.
5. The HIV subtype classification method according to claim 4, characterized in that it further comprises the step of: directly BLAST-aligning the consensus sequence against the HIV first-generation sequencing sequences in the database pool, and outputting a sequence similarity comparison result.
6. The HIV subtype classification method according to claim 4, further comprising a step of determining recombinant and mixed subtypes of HIV, wherein second-generation sequencing reads of the sample to be typed are aligned against the second-generation sequencing data of HIV in the database pool, and the proportions of reads assigned to the different subtypes are counted and compared, wherein,
if the typing result of the HIV second-generation sequencing data typing submodule is a non-URF pure subtype, the read-alignment typing result of the recombinant and mixed subtype HIV typing submodule must be the same as the result of the HIV second-generation sequencing data typing submodule, with a read proportion of not lower than 60%;
if the typing result of the HIV second-generation sequencing data typing submodule is a URF pure subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have parental subtypes different from those of the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 ranked results are not higher than 60%;
if the typing result of the recombinant and mixed subtype HIV typing submodule is a mixed subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have the same parental subtypes as the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 results are not higher than 60%.
7. The HIV subtype classification method according to claim 4, characterized in that the classification of the sample to be tested is determined by the following steps:
S1, inputting a sequence: inputting the constructed consensus sequence;
S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary, uncorrected difference value;
S3, filtering out the amino acids associated with published drug-resistance surveillance sites to obtain a corrected difference value;
S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,
S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype,
S4.2, when the corrected difference value is less than or equal to 11%,
when the sample best matches a simple subtype sequence, the difference value relative to the simple subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype; if the difference value is greater than the upper limit, the sample is still judged to be that simple subtype and a warning is reported,
and when the best match is a complex subtype sequence, the difference value relative to the complex subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that complex subtype; if the difference value is greater than the upper limit, the best-matching parent is scored, wherein if the difference between the best-match parent score of the sample and the difference value for the circulating recombinant subtype is less than or equal to 1%, the parent subtype is reported, and otherwise the sample is judged to be a unique recombinant subtype.
8. The HIV subtype classification method according to claim 4, characterized in that a 1 kb sequence spanning the PR and RT regions of the pol gene of the HIV sample to be typed is sequenced.
9. The HIV subtype classification method according to claim 4, characterized in that the typing result of the sample to be typed and the nucleic-acid-level similarity between the sample and the best typing result are output.
10. The HIV subtype classification method according to claim 4, characterized in that the typing result of the sample to be typed and the amino-acid-level similarity between the sample and the best typing result are output.
Background
HIV includes subtypes A, B, C, D, F, G, H, J and K, and the overall proportion of recombinant forms continues to increase over time. HIV diversity is complex and still evolving, which makes it a major challenge in HIV vaccine development. Monitoring the global molecular epidemiology of HIV types remains crucial to the design, testing and implementation of HIV vaccines.
HIV typing guides the interpretation of drug resistance test results and the formulation of individualized treatment regimens for infected patients. Subtype-specific genetic barriers can influence the emergence and development of drug-resistance mutations, and the influence of other resistance sites on the major resistance sites differs between subtypes; both affect the direction and speed of evolution of the different subtypes. Drug-resistance mutation sites and their frequencies differ among subtypes, new resistance mutation sites continue to be reported, and some unexplained drug susceptibility also complicates the interpretation of genotypic resistance test results. Evaluating the subtype specificity of drug-resistance mutations and the differences in their characteristics therefore has important reference value when designing an ART regimen for a patient.
While existing HIV bioinformatics databases have helped researchers and medical personnel carry out their work, difficulties and risks remain when using them in practice:
1. Existing public databases draw on scattered information sources, and most of their HIV sequence information is based on first-generation sequencing results, so sequence quality cannot be guaranteed.
2. Genotyping annotation of HIV second-generation sequencing results by public databases is still in the beta testing stage, for example HGS-Beta of HIVDB. Meanwhile, only a small number of databases offer second-generation sequencing annotation tools that can integrate a user's own database, and the flexibility and efficiency with which those tools integrate user data are low.
3. Most existing public database annotation tools execute tasks in single-threaded mode and are ill-suited to mainstream data analysis tasks based on cluster computing and big data.
Disclosure of Invention
In view of the above problems, it is an object of the present invention to provide an HIV subtype classification system.
It is yet another object of the present invention to provide a method for classifying HIV subtypes.
The HIV subtype classification system according to the present invention comprises:
a database pool comprising first-generation sequencing sequences and second-generation sequencing data of HIV from an open public database;
a database management module comprising a database pool construction and integration module and a data update module, wherein,
the database pool construction and integration module processes an input second-generation sequencing BAM file into a consensus sequence Reads.fasta, records HIV sequences that pass the quality check into the database pool, and records newly added sequences from the open public database into the database pool,
the data update module is used for automatically downloading public database sequences at regular intervals;
a typing module comprising the following typing sub-modules:
an HIV second-generation sequencing data typing submodule used for counting the breakpoint coverage of the consensus sequences in the database pool over all HIV subtypes, calculating and comparing the breakpoint coverage rates corresponding to the different HIV subtypes, comparing them with the breakpoint coverage rate of the sample to be typed, and outputting a typing result for the sample to be typed,
an HIV first-generation sequencing data typing submodule used for directly BLAST-aligning the consensus sequence of the sample to be typed against the HIV first-generation sequencing sequences in the database pool and outputting a sequence similarity comparison result,
and a recombinant and mixed subtype HIV typing submodule used for aligning second-generation sequencing reads of the sample to be typed against the second-generation sequencing data of HIV in the database pool, counting and comparing the proportions of reads assigned to the different subtypes, and assisting in the determination of recombinant and mixed subtypes.
The HIV subtype classification system according to the present invention, wherein the HIV second-generation sequencing data typing submodule performs typing by the following steps:
S1, inputting a sequence: inputting the constructed consensus sequence;
S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary, uncorrected difference value;
S3, filtering out the amino acids associated with published drug-resistance surveillance sites to obtain a corrected difference value;
S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,
S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype;
S4.2, when the corrected difference value is less than or equal to 11%,
when the sample best matches a simple subtype sequence, the difference value relative to the simple subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype; if the difference value is greater than the upper limit, the sample is still judged to be that simple subtype and a warning is reported,
and when the best match is a complex subtype sequence, the difference value relative to the complex subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that complex subtype; if the difference value is greater than the upper limit, the best-matching parent is scored, wherein if the difference between the best-match parent score of the sample and the difference value for the circulating recombinant subtype is less than or equal to 1%, the parent subtype is reported, and otherwise the sample is judged to be a unique recombinant subtype.
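The branching in step S4 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the implementation of the invention: the `Match` record, its field names and the `classify` function are hypothetical, all values are expressed as fractions, and the 1% rule is read here as the absolute gap between the best-match parent score and the difference value for the recombinant reference.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

UNKNOWN_CUTOFF = 0.11  # S4.1: corrected difference above 11% gives "unknown subtype"
PARENT_MARGIN = 0.01   # 1% margin used in the complex-subtype branch


@dataclass
class Match:
    """Hypothetical record describing the best match of a sample."""
    subtype: str                          # label of the best-matching reference
    is_complex: bool                      # True if the reference is a complex (recombinant) subtype
    corrected_diff: float                 # corrected difference value from S3
    upper_limit: float                    # upper limit of the typing breakpoint coverage rate
    parent_subtype: Optional[str] = None  # best-matching parent subtype
    parent_diff: Optional[float] = None   # best-match parent score (assumed to be a difference)
    recombinant_diff: Optional[float] = None  # difference value for the recombinant reference


def classify(m: Match) -> Tuple[str, bool]:
    """Return (call, warning) following the S4.1 / S4.2 branches."""
    if m.corrected_diff > UNKNOWN_CUTOFF:
        return "unknown subtype", False
    if not m.is_complex:
        # A simple subtype is reported either way; exceeding the limit adds a warning.
        return m.subtype, m.corrected_diff > m.upper_limit
    if m.corrected_diff <= m.upper_limit:
        return m.subtype, False
    # Complex subtype above the limit: score the best-matching parent.
    has_parent = None not in (m.parent_subtype, m.parent_diff, m.recombinant_diff)
    if has_parent and abs(m.parent_diff - m.recombinant_diff) <= PARENT_MARGIN:
        return m.parent_subtype, False
    return "unique recombinant subtype", False
```

For example, `classify(Match("B", False, 0.08, 0.05))` returns the simple subtype `"B"` together with a warning flag, because the difference value exceeds the upper limit but stays below the 11% cutoff.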
The HIV subtype classification system according to the present invention, wherein the recombinant and mixed subtype HIV classification submodule performs the following steps to assist in the determination of recombinant and mixed subtypes:
aligning the second-generation sequencing reads of the sample to be typed against the second-generation sequencing data of HIV in the database pool, and counting and comparing the proportions of reads assigned to the different subtypes, wherein,
if the typing result of the HIV second-generation sequencing data typing submodule is a non-URF pure subtype, the read-alignment typing result of the recombinant and mixed subtype HIV typing submodule must be the same as the result of the HIV second-generation sequencing data typing submodule, with a read proportion of not lower than 60%;
if the typing result of the HIV second-generation sequencing data typing submodule is a URF pure subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have parental subtypes different from those of the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 ranked results are not higher than 60%;
if the typing result of the recombinant and mixed subtype HIV typing submodule is a mixed subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have the same parental subtypes as the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 results are not higher than 60%.
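The read-proportion checks above can be sketched in Python as follows. The sketch assumes that each read has already been assigned a subtype label by alignment; the function names are hypothetical, and the 60% rule for the top 10 results is interpreted here as applying to each result individually, which is one possible reading of the text.

```python
from collections import Counter

PURE_THRESHOLD = 0.60  # the 60% read proportion named in the text


def read_proportions(read_subtypes, top_n=10):
    """Return the top_n (subtype, fraction of reads) pairs, most frequent first."""
    counts = Counter(read_subtypes)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(subtype, n / total) for subtype, n in counts.most_common(top_n)]


def supports_pure_call(consensus_call, read_subtypes):
    """Non-URF pure subtype: the top read-based call must agree, at >= 60% of reads."""
    props = read_proportions(read_subtypes)
    if not props:
        return False
    top_subtype, top_fraction = props[0]
    return top_subtype == consensus_call and top_fraction >= PURE_THRESHOLD


def no_dominant_subtype(read_subtypes):
    """Recombinant or mixed case: none of the top 10 results exceeds 60% of reads."""
    props = read_proportions(read_subtypes)
    return bool(props) and all(fraction <= PURE_THRESHOLD for _, fraction in props)
```

Comparing parental subtypes, which the URF and mixed-subtype rules also require, would need a mapping from each recombinant label to its parents; that mapping is not specified here and is left out of the sketch.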
The HIV subtype classification method according to the present invention comprises the following steps:
sequencing a section of sequence spanning PR and RT regions on the pol gene of the HIV sample to be typed;
collecting HIV sequences from a public database and newly added sequence data of the HIV pol region, collecting second-generation sequencing data of HIV, and constructing a database pool;
processing data, namely processing the second-generation sequencing files in the database pool into a consensus sequence Reads.fasta, recording HIV sequences that pass the quality check into the database pool, and recording newly added sequences from the public database into the database pool; and
counting the breakpoint coverage over HIV subtypes of the consensus sequences converted from the second-generation sequencing data of HIV in the database pool, calculating the breakpoint coverage rates corresponding to the different HIV subtypes, comparing the breakpoint coverage rate of the sample to be typed with those of the known HIV subtypes, determining the type of the sample to be typed, and outputting the typing result.
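The quality check in the data-processing step is not spelled out at this point. As a minimal sketch, assuming the check rejects any sequence that contains degenerate (non-A/C/G/T) bases, it could look like the following Python; the function names and the in-memory FASTA parsing are illustrative assumptions only.

```python
UNAMBIGUOUS_BASES = set("ACGT")


def has_degenerate_bases(seq):
    """True if the sequence contains anything other than A, C, G or T."""
    return any(base not in UNAMBIGUOUS_BASES for base in seq.upper())


def parse_fasta(text):
    """Parse FASTA text into {record name: sequence}; a minimal parser for illustration."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def passing_records(fasta_text):
    """Keep only non-empty records without degenerate bases (assumed criterion)."""
    return {
        name: seq
        for name, seq in parse_fasta(fasta_text).items()
        if seq and not has_degenerate_bases(seq)
    }
```

Only the records returned by `passing_records` would then be written to the database pool; a real check might also consider sequencing depth or sequence length.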
The HIV subtype classification method according to the present invention, wherein said method further comprises the step of: directly BLAST-aligning the consensus sequence against the HIV first-generation sequencing sequences in the database pool, and outputting a sequence similarity comparison result.
The HIV subtype classification method according to the present invention, wherein said method further comprises the step of determining recombinant and mixed subtypes of HIV, wherein second-generation sequencing reads of the sample to be typed are aligned against the second-generation sequencing data of HIV in the database pool, and the proportions of reads assigned to the different subtypes are counted and compared, wherein,
if the typing result of the HIV second-generation sequencing data typing submodule is a non-URF pure subtype, the read-alignment typing result of the recombinant and mixed subtype HIV typing submodule must be the same as the result of the HIV second-generation sequencing data typing submodule, with a read proportion of not lower than 60%;
if the typing result of the HIV second-generation sequencing data typing submodule is a URF pure subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have parental subtypes different from those of the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 ranked results are not higher than 60%;
if the typing result of the recombinant and mixed subtype HIV typing submodule is a mixed subtype, the top 10 read-alignment typing results of the recombinant and mixed subtype HIV typing submodule have the same parental subtypes as the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all of the top 10 results are not higher than 60%.
According to the HIV subtype classification method of the present invention, in step S4, the classification of the sample to be tested is determined by:
S1, inputting a sequence: inputting the constructed consensus sequence;
S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary, uncorrected difference value;
S3, filtering out the amino acids associated with published drug-resistance surveillance sites to obtain a corrected difference value;
S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,
S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype;
S4.2, when the corrected difference value is less than or equal to 11%,
when the sample best matches a simple subtype sequence, the difference value relative to the simple subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype; if the difference value is greater than the upper limit, the sample is still judged to be that simple subtype and a warning is reported,
and when the best match is a complex subtype sequence, the difference value relative to the complex subtype sequence is compared with the upper limit of the typing breakpoint coverage rate: if the difference value is less than or equal to the upper limit, the sample is judged to be that complex subtype; if the difference value is greater than the upper limit, the best-matching parent is scored, wherein if the difference between the best-match parent score of the sample and the difference value for the circulating recombinant subtype is less than or equal to 1%, the parent subtype is reported, and otherwise the sample is judged to be a unique recombinant subtype.
According to the HIV subtype classification method of the present invention, a 1 kb sequence spanning the PR and RT regions of the pol gene of the HIV sample to be typed is sequenced.
According to the HIV subtype classification method of the present invention, the typing result of the sample to be typed and the nucleic-acid-level similarity between the sample and the best typing result are output.
According to the HIV subtype classification method of the present invention, the typing result of the sample to be typed and the amino-acid-level similarity between the sample and the best typing result are output.
The HIV subtype classification method according to the present invention further comprises the step of: directly performing a BLAST alignment with the consensus sequence of the sample to be typed to obtain the ten alignment results with the highest sequence similarity, for use in auxiliary typing judgment.
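Selecting the ten most similar hits can be sketched as below. The sketch assumes the BLAST results are available as standard 12-column tabular text (BLAST+ `-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Ranking by percent identity with bit score as tie-breaker is an assumption; the text only says that results are ranked by sequence similarity.

```python
def top_hits(blast_tsv, n=10):
    """Return the n hits with the highest percent identity from tabular BLAST output."""
    hits = []
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column record
        hits.append({
            "subject": fields[1],
            "identity": float(fields[2]),
            "length": int(fields[3]),
            "bitscore": float(fields[11]),
        })
    # Highest identity first; bit score breaks ties (an assumed tie-break rule).
    hits.sort(key=lambda h: (h["identity"], h["bitscore"]), reverse=True)
    return hits[:n]
```

The subtype annotation of each returned subject sequence would then be looked up in the database pool to support the typing judgment.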
According to the HIV subtype classification method of the invention, HIV sequences newly added to the public NCBI database are used to supplement the database pool regularly.
According to the HIV subtype classification method, if a sample is judged to be a new subtype after typing, the data of that sample are added to the database pool. After the database pool construction script collects the sequence data, the sequences are stored in the database pool according to their source, thereby completing the database expansion and generating a hash table of database sample information that the typing module calls during operation.
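A sample-information hash table of this kind could be organized as in the following Python sketch. The record keys (`id`, `source`, `subtype`) and the function name are hypothetical; the text does not specify what the table stores beyond sample information grouped by source.

```python
from collections import defaultdict


def build_sample_index(records):
    """Build lookup tables from an iterable of sample records.

    Each record is assumed to be a dict with hypothetical keys
    "id", "source" and "subtype". Returns (by_id, by_source).
    """
    by_id = {}
    by_source = defaultdict(list)
    for rec in records:
        if rec["id"] in by_id:
            continue  # keep the first record seen for a duplicated id
        by_id[rec["id"]] = rec
        by_source[rec["source"]].append(rec["id"])
    return by_id, dict(by_source)
```

With such a table, the typing module can resolve the subtype of any matched reference in constant time, for example `by_id["ref_0001"]["subtype"]`, where `ref_0001` is a made-up identifier.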
At least one technical solution adopted in the embodiments of the present application can achieve the following beneficial effects:
1. The HIV subtype classification system constructed by the invention comprises a database pool, a typing module and a database management module, wherein the database pool contains known HIV sequences of all genotypes and subtypes. The database management module can automatically download public database data at regular intervals, and automatically construct and expand the alignment database and integrate the database pool.
2. By introducing the database pool and the three typing submodules, the accuracy and efficiency of HIV typing are greatly improved. In addition, the typing module greatly improves the performance of the typing tool by using parallel computing packages in R.
3. The user only needs to input the HIV sequencing result; the database system automatically completes data standardization and sequence typing, and the user can continue to record newly obtained standardized sequences into the database pool as needed. In development and testing, the genotyping function of this database was found to be more mature and complete than that of the public databases. The database can screen, integrate and record the sequence information that users upload for each analysis, thereby expanding its own capacity. Public databases require uploaded data to be in a specified format, such as a .codfreq or .aavf file. However, most second-generation sequencing data produced by the sequencing platforms currently on the market are in BAM or FASTA format, so users must first convert the data manually with third-party software before they can use the public databases for genotyping, which greatly limits the efficiency of data analysis. The present database can genotype BAM or FASTA files submitted by the user directly, without manual preprocessing, which improves the efficiency of data analysis and gives the database greater flexibility.
4. The sequences recorded in existing public databases are of uneven quality, and many contain degenerate bases. The HIV typing reference sequences adopted by the invention are screened so that they contain no degenerate bases. They are derived from second-generation sequencing data with a sequencing depth of more than 1000×, and the data quality is good.
5. Most existing databases were set up to serve scientific research and are affected by geographic and temporal distribution and by network conditions, so they struggle with centralized analysis of massive data. The annotation tools of these databases mostly run in single-threaded mode, that is, one sample is analyzed before the next one begins. The typing tool of the present database executes analysis tasks in multithreaded mode and can analyze up to 10 samples at the same time; when faced with a large number of samples, it can greatly improve efficiency and save time compared with existing databases. This is one of the advantages of the database of the present invention.
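The idea of analyzing up to 10 samples at the same time can be illustrated with a small worker pool. The text states that the typing module uses parallel computing packages in R; the Python sketch below swaps in the standard-library `concurrent.futures` module purely for illustration, and `type_sample` is a hypothetical placeholder for the real per-sample analysis.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_SAMPLES = 10  # upper bound on concurrently analyzed samples, as stated in the text


def type_sample(sample_id):
    """Placeholder for the per-sample typing step (hypothetical)."""
    return sample_id, "typed"


def type_all(sample_ids, worker=type_sample):
    """Run the worker over all samples, at most MAX_PARALLEL_SAMPLES at a time."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_SAMPLES) as pool:
        # pool.map preserves input order and yields (sample_id, result) pairs.
        return dict(pool.map(worker, sample_ids))
```

For CPU-bound typing work in Python, a process pool (`ProcessPoolExecutor`) would usually be the better fit; threads are used here only to keep the sketch short.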
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a method for HIV subtype classification according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an HIV subtype classification system architecture according to an embodiment of the present application;
FIG. 3 shows the typing principle of the HIV second-generation sequencing data typing submodule;
FIG. 4 is an output page of the typing results of sample HIV-ZD-6 i-2;
FIG. 5 is an output page of the results of reads vs. typing top10 for sample HIV-ZD-6 i-2;
FIG. 6 is an output page of statistical distribution of reads vs. typing top10 results for sample HIV-ZD-6 i-2;
FIG. 7 is an output page of the results of reads vs. typing top10 for sample 65;
FIG. 8 is an output page of the results of the typing of sample 65;
FIG. 9 is an output page of statistical distribution of reads versus typing top10 results for sample 65;
FIG. 10 is an output page of the sample typing end result;
FIG. 11 is an output page showing the 10 database sequences most similar to the test sample, together with their typing results;
FIG. 12 is an output page of the 10 best alignment results obtained by comparing the consensus sequence against the public database;
FIG. 13 is an output page of the results of comparing sequencing reads to HIVdb and statistically comparing reads of different subtypes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The HIV subtype classification system according to the present invention comprises:
a database pool comprising first-generation sequencing sequences and second-generation sequencing data of HIV from an open public database;
a database management module comprising a database pool construction and integration module and a data update module, wherein,
the database pool construction and integration module processes an input second-generation sequencing BAM file into a consensus sequence Reads.fasta, records HIV sequences that have passed quality checking into the database pool, and records newly added sequences from the open public database into the database pool,
the data updating module is used for automatically downloading public database sequences at regular intervals;
a typing module comprising the following typing sub-modules:
an HIV second-generation sequencing data typing submodule, used for counting the breakpoint coverage of the consensus sequences in the database pool across all HIV subtypes, calculating and comparing the breakpoint coverage rates corresponding to the different HIV subtypes, comparing them against the breakpoint coverage rate of the sample to be typed, and outputting the typing result of that sample,
an HIV first-generation sequencing data typing submodule, used for directly BLAST-aligning the consensus sequence of the sample to be typed against the HIV first-generation sequencing sequences in the database pool and outputting the sequence similarity comparison result,
and a recombinant and mixed subtype HIV typing submodule, used for aligning the second-generation sequencing reads of the sample to be typed against the second-generation sequencing data of HIV in the database pool, counting and comparing the proportions of reads assigned to the different subtypes, and assisting in identifying recombinant and mixed subtypes.
The HIV subtype classification system according to the present invention, wherein the HIV second-generation sequencing data typing submodule performs typing by the following steps:
S1, sequence input: inputting the constructed consensus sequence;
S2, performing multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain an initial, uncorrected difference value;
S3, masking the amino acids associated with published surveillance drug-resistance mutation sites to obtain a corrected difference value;
S4, typing by combining the corrected difference value with the upper limit of breakpoint coverage, wherein,
S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype;
S4.2, when the corrected difference value is less than or equal to 11%,
when the best match is a simple subtype sequence, the difference value relative to that sequence is compared with the upper limit of typing breakpoint coverage: if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype; if it is greater than the upper limit, the sample is still judged to be that simple subtype, but a warning is reported,
and when the best match is a complex subtype sequence, the difference value relative to that sequence is compared with the upper limit of typing breakpoint coverage: if the difference value is less than or equal to the upper limit, the sample is judged to be that complex subtype; if it is greater than the upper limit, the best-matching parent score is evaluated, wherein if the difference between the sample's best-matching parent score and its difference value relative to the circulating recombinant subtype is less than or equal to 1%, the parent subtype is reported, and otherwise the sample is judged to be a unique recombinant subtype.
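As an illustration of steps S2 and S3 above, the following minimal sketch computes a difference value between two already-aligned sequences, first without and then with masking. The specification does not give the exact formula, so the definition used here (fraction of differing positions among the compared columns) and the function name `difference` are assumptions, and the masked positions in any real run would come from the published drug-resistance site list rather than from this example.

```python
def difference(sample: str, reference: str,
               masked: frozenset = frozenset()) -> float:
    """Fraction of differing positions between two aligned sequences.

    Positions listed in `masked` (0-based) and columns with a gap in
    either sequence are skipped. Assumed definition, not the patent's.
    """
    if len(sample) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for i, (a, b) in enumerate(zip(sample, reference)):
        if i in masked or a == "-" or b == "-":
            continue
        compared += 1
        mismatches += a != b
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared
```

With the toy sequences `"MKVLAT"` and `"MRVLGT"`, the uncorrected value is 2/6; masking position 1 leaves one mismatch in five compared positions, giving a corrected value of 1/5.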
The HIV subtype classification system according to the present invention, wherein the recombinant and mixed subtype HIV typing submodule performs the following steps to assist in identifying recombinant and mixed subtypes:
aligning the second-generation sequencing reads of the sample to be typed against the second-generation sequencing data of HIV in the database pool, and counting and comparing the proportions of reads assigned to the different subtypes, wherein,
if the typing result of the HIV second-generation sequencing data typing submodule is a non-URF pure subtype, the best reads-alignment typing result in the recombinant and mixed subtype HIV typing submodule must be the same as the result of the HIV second-generation sequencing data typing submodule, with a proportion of not less than 60%;
if the typing result of the HIV second-generation sequencing data typing submodule is a URF pure subtype, the top-10 reads-alignment typing results in the recombinant and mixed subtype HIV typing submodule include parental subtypes different from those of the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all top-10 results are not higher than 60%;
if the typing results of the recombinant and mixed subtype HIV typing submodule are mixed subtypes, the top-10 reads-alignment typing results in the recombinant and mixed subtype HIV typing submodule have the same parental subtypes as the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all top-10 results are not higher than 60%.
The HIV subtype classification method according to the present invention comprises the following steps:
sequencing a segment spanning the PR and RT regions of the pol gene of the HIV sample to be typed;
collecting HIV sequences from a public database together with newly added sequence data of the HIV pol region, collecting second-generation sequencing data of HIV, and constructing a database pool;
data processing, namely processing the second-generation sequencing files in the database pool into a consensus sequence Reads.fasta, recording HIV sequences that have passed quality checking into the database pool, and recording newly added sequences from the public database into the database pool; and
counting the breakpoint coverage of the HIV subtypes by the consensus sequences converted from the second-generation sequencing data of HIV in the database pool, calculating the breakpoint coverage rates corresponding to the different HIV subtypes, comparing the breakpoint coverage rate of the sample to be typed with those of the known HIV subtypes, determining the type of the sample, and outputting its typing result.
The HIV subtype classification method according to the present invention, wherein said method further comprises the step of: directly BLAST-aligning the consensus sequence against the HIV first-generation sequencing sequences in the database pool, and outputting the sequence similarity comparison result.
The HIV subtype classification method according to the present invention, wherein said method further comprises the step of identifying recombinant and mixed subtypes of HIV, wherein the second-generation sequencing reads of the sample to be typed are aligned against the second-generation sequencing data of HIV in the database pool, and the proportions of reads assigned to the different subtypes are counted and compared, wherein,
if the typing result of the HIV second-generation sequencing data typing submodule is a non-URF pure subtype, the best reads-alignment typing result in the recombinant and mixed subtype HIV typing submodule must be the same as the result of the HIV second-generation sequencing data typing submodule, with a proportion of not less than 60%;
if the typing result of the HIV second-generation sequencing data typing submodule is a URF pure subtype, the top-10 reads-alignment typing results in the recombinant and mixed subtype HIV typing submodule include parental subtypes different from those of the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all top-10 results are not higher than 60%;
if the typing results of the recombinant and mixed subtype HIV typing submodule are mixed subtypes, the top-10 reads-alignment typing results in the recombinant and mixed subtype HIV typing submodule have the same parental subtypes as the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all top-10 results are not higher than 60%.
The HIV subtype classification method according to the present invention, wherein the classification of a sample to be tested is determined by:
S1, sequence input: inputting the constructed consensus sequence;
S2, performing multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain an initial, uncorrected difference value;
S3, masking the amino acids associated with published surveillance drug-resistance mutation sites to obtain a corrected difference value;
S4, typing by combining the corrected difference value with the upper limit of breakpoint coverage, wherein,
S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype;
S4.2, when the corrected difference value is less than or equal to 11%,
when the best match is a simple subtype sequence, the difference value relative to that sequence is compared with the upper limit of typing breakpoint coverage: if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype; if it is greater than the upper limit, the sample is still judged to be that simple subtype, but a warning is reported,
and when the best match is a complex subtype sequence, the difference value relative to that sequence is compared with the upper limit of typing breakpoint coverage: if the difference value is less than or equal to the upper limit, the sample is judged to be that complex subtype; if it is greater than the upper limit, the best-matching parent score is evaluated, wherein if the difference between the sample's best-matching parent score and its difference value relative to the circulating recombinant subtype is less than or equal to 1%, the parent subtype is reported, and otherwise the sample is judged to be a unique recombinant subtype.
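The decision rules of step S4 above can be written down as a short function. This is a sketch under stated assumptions, not the patented implementation: the `Match` record and its field names are hypothetical, a simple subtype is assumed to be one whose parent subtype equals its own name, and the 1% test is assumed to apply to the absolute difference between the parent score and the distance to the recombinant reference. Only the 11% and 1% cutoffs come from the text.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN_CUTOFF = 0.11  # corrected difference above 11% -> unknown subtype
PARENT_MARGIN = 0.01   # 1% margin for reporting the parent subtype


@dataclass
class Match:
    name: str            # subtype name of the best-matching reference
    parent_subtype: str  # assumed equal to name for a simple subtype
    distance: float      # corrected difference to that reference
    upper_limit: float   # per-subtype breakpoint-coverage upper limit


def assign_subtype(best: Match, parent_distance: Optional[float] = None):
    """Return (call, warning) following steps S4.1 and S4.2."""
    if best.distance > UNKNOWN_CUTOFF:
        return "unknown subtype", None
    if best.parent_subtype == best.name:  # simple subtype
        warning = None
        if best.distance > best.upper_limit:
            warning = "distance exceeds upper limit"
        return best.name, warning
    # complex (recombinant) subtype
    if best.distance <= best.upper_limit:
        return best.name, None
    if parent_distance is None:
        raise ValueError("parent_distance is required for this branch")
    # assumption: the 1% test uses the absolute difference
    if abs(parent_distance - best.distance) <= PARENT_MARGIN:
        return best.parent_subtype, None
    return "URF", None
```

Returning the warning alongside the call keeps the simple-subtype case, where the call is unchanged but flagged, distinct from a clean call.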
First-generation sequencing: also known as Sanger sequencing (multi-molecule, monoclonal), pioneered by Sanger et al. in 1975. The principle is as follows: a certain proportion of labeled ddNTP (ddATP, ddCTP, ddGTP or ddTTP) is added to each of four DNA synthesis reaction systems (each containing dNTPs); after gel electrophoresis and autoradiography, the DNA sequence of the molecule under test is determined from the positions of the electrophoretic bands. Because a ddNTP lacks the hydroxyl groups at the 2' and 3' positions, it cannot form a phosphodiester bond with the next nucleotide during DNA synthesis and therefore terminates the synthesis reaction.
Second-generation sequencing: NGS technology (multi-molecule, polyclonal). First-generation Sanger sequencing has the advantages of long read length and high accuracy, but its high cost and low throughput make applications such as de novo sequencing and transcriptome sequencing difficult to popularize. Through continuous technical development and improvement, the Ion Torrent second-generation sequencing technology of Thermo Fisher emerged and entered the second-generation sequencing market. The principle of Ion Torrent is: emulsion (water-in-oil) PCR + flowing the four dNTPs over the chip in turn + microelectrode pH detection. The main steps are: (1) DNA library preparation; (2) emulsion PCR; (3) microelectrode pH detection.
Advantages and disadvantages: the main distinguishing feature of Ion Torrent sequencing is that it does not require expensive physical imaging equipment, so the cost is relatively low, the instrument is small, operation is simpler, and an entire sequencing run can be completed within 2-3.5 hours (excluding library construction time).
Because of the difference in sequencing principle, second-generation sequencing greatly reduces sequencing cost and sequencing time compared with first-generation sequencing while maintaining high accuracy, but its read length is much shorter than that of first-generation sequencing.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the HIV subtype classification method according to the present invention includes the following steps:
sequencing a segment spanning the PR and RT regions of the pol gene of the HIV sample to be typed;
collecting HIV sequences from a public database together with newly added sequence data of the HIV pol region, collecting second-generation sequencing data of HIV, and constructing a database pool;
data processing, namely processing the second-generation sequencing files in the database pool into a consensus sequence Reads.fasta, recording HIV sequences that have passed quality checking into the database pool, and recording newly added sequences from the public database into the database pool; and
counting the breakpoint coverage of the HIV subtypes by the consensus sequences converted from the second-generation sequencing data of HIV in the database pool, calculating the breakpoint coverage rates corresponding to the different HIV subtypes, comparing the breakpoint coverage rate of the sample to be typed with those of the known HIV subtypes, determining the type of the sample, and outputting its typing result.
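The data-processing step above turns aligned second-generation reads into a consensus sequence. The specification does not say how the consensus is called, so the following is only a minimal sketch assuming a per-position majority vote over gap-free reads that have already been placed on reference coordinates; reading the BAM file itself is outside the sketch, and the names `consensus` and `min_depth` are hypothetical.

```python
from collections import Counter


def consensus(reads: list[tuple[int, str]], length: int,
              min_depth: int = 1) -> str:
    """Majority-vote consensus from gap-free reads placed on a reference.

    Each read is (0-based start position, bases). Positions covered by
    fewer than `min_depth` reads are written as 'N'.
    """
    columns = [Counter() for _ in range(length)]
    for start, bases in reads:
        for offset, base in enumerate(bases):
            pos = start + offset
            if 0 <= pos < length:
                columns[pos][base] += 1
    out = []
    for counts in columns:
        if not counts or sum(counts.values()) < min_depth:
            out.append("N")
        else:
            out.append(counts.most_common(1)[0][0])
    return "".join(out)
```

Raising `min_depth` trades coverage for confidence: thinly covered positions become `N` instead of being called from a single read.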
As shown in fig. 2, the HIV subtype classification system according to the present invention includes:
I. A database pool comprising first-generation sequencing sequences and second-generation sequencing data of HIV from an open public database.
II. A database management module comprising a database pool construction and integration module and a data update module, wherein,
the data update module is used for automatically downloading public database sequences at regular intervals,
the database pool construction and integration module processes the collected second-generation sequencing BAM files into a consensus sequence Reads.fasta, records HIV sequences that have passed quality checking into the database pool, and records newly added sequences from the open public database into the database pool.
III. A typing module comprising three typing submodules:
(3-1) an HIV second-generation sequencing data typing submodule, referred to as typing submodule 1 in fig. 2, used for counting the breakpoint coverage of the consensus sequences in the database pool across all HIV subtypes, calculating and comparing the breakpoint coverage rates corresponding to the different HIV subtypes, comparing them against the breakpoint coverage rate of the sample to be typed, and outputting the typing result of that sample. As shown in fig. 3, the specific typing process is as follows:
S1, sequence input: inputting the constructed consensus sequence;
S2, performing multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain an initial, uncorrected difference value, wherein the HIV-1 Subtype Database typing list is shown in Table 1;
S3, masking the amino acids associated with the published surveillance drug-resistance mutation sites (SDRM: Surveillance Drug Resistance Mutation) to obtain a corrected difference value;
S4, typing by combining the corrected difference value with the upper limit of breakpoint coverage (upper-limit of breakpoint); the parent subtype (ParentSubtype) and the upper limit of subtype breakpoint coverage (Distance upper-limit of breakpoint) are shown in Table 2, wherein,
S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype;
S4.2, when the corrected difference value (at this point called the Parent Distance) is less than or equal to 11%,
when the best match is a simple subtype sequence (namely a sequence in Table 2 whose ParentSubtype is equal to its Name), the difference value relative to that sequence is compared with the upper limit of typing breakpoint coverage: if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype; if it is greater than the upper limit, the sample is still judged to be that simple subtype, but a warning is reported,
when the best match is a complex subtype sequence (namely a sequence whose ParentSubtype is not equal to its Name), the difference value relative to that sequence (called the CRF Distance) is compared with the upper limit of typing breakpoint coverage: if the CRF Distance is less than or equal to the upper limit, the sample is judged to be that complex subtype; if it is greater than the upper limit, the best-matching parent score (Parent Distance) is evaluated, wherein if the difference between the sample's best-matching Parent Distance and the CRF Distance is less than or equal to 1%, the parent subtype (ParentSubtype) is reported, and otherwise the sample is judged to be a unique recombinant form (URF) subtype, which refers to an HIV-1 subtype that has been found but has not been reported clinically.
(3-2) an HIV first-generation sequencing data typing submodule, referred to as typing submodule 2 in fig. 2, used for directly BLAST-aligning the consensus sequence of the second-generation typing sample against the first-generation HIV sequencing sequences in the database pool and outputting the sequence similarity comparison result.
(3-3) a recombinant and mixed subtype HIV typing submodule, referred to as typing submodule 3 in fig. 2, used for aligning the second-generation sequencing reads of the sample to be typed against the second-generation HIV sequencing data in the database pool, counting and comparing the proportions of reads assigned to the different subtypes, and assisting in identifying recombinant and mixed subtypes.
Recombinant subtypes are treated as pure subtypes. A CRF subtype is an epidemic (circulating) recombinant subtype, i.e. an HIV-1 intra-subtype or inter-subtype recombinant formed by recombination of viral genomes following dual or multiple infection during the HIV-1 epidemic; its reads cluster to only a few subtype categories, each with a high proportion, so the typing results are concentrated. Mixed subtypes (e.g., B + C) contain sequences of multiple strains; their reads are spread more evenly across the individual subtypes, so the typing results are dispersed. The thresholds of the recombinant and mixed subtype HIV typing submodule must be applied in combination with the results of the first two submodules.
As shown in fig. 4 to 9, if the typing result of module 1 (the HIV second-generation sequencing data typing submodule) is a non-URF pure subtype, the best reads-alignment typing result in the recombinant and mixed subtype HIV typing submodule must be the same as the result of module 1, with a proportion of not less than 60%; if the typing result of module 1 is a URF pure subtype, the top-10 reads-alignment typing results in the recombinant and mixed subtype HIV typing submodule include parental subtypes (ParentSubtype) different from those of the typing result of module 1, and the proportions of all top-10 results are not higher than 60%; if the typing result of module 1 is a mixed subtype, the top-10 reads-alignment typing results in the recombinant and mixed subtype HIV typing submodule have the same parental subtype (ParentSubtype) as the typing result of module 1, and the proportions of all top-10 results are not higher than 60%.
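The read-proportion checks above can be illustrated with a small sketch. It assumes each read has already been assigned a subtype label by alignment, and it reads "the proportions of all top-10 results are not higher than 60%" as a condition on each individual result, which is an interpretation rather than something the text states. The parental-subtype comparisons are left out because they depend on Table 2, and all function names are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.60  # 60% proportion cutoff from the text


def top_proportions(read_calls: list[str],
                    n: int = 10) -> list[tuple[str, float]]:
    """Per-subtype share of reads, highest first, limited to the top n."""
    counts = Counter(read_calls)
    total = sum(counts.values())
    return [(subtype, c / total) for subtype, c in counts.most_common(n)]


def confirms_pure_call(call: str, top: list[tuple[str, float]]) -> bool:
    """Non-URF pure subtype: best read result matches with share >= 60%."""
    return bool(top) and top[0][0] == call and top[0][1] >= THRESHOLD


def no_dominant_subtype(top: list[tuple[str, float]]) -> bool:
    """URF and mixed cases: no top-n result exceeds 60% (assumed reading)."""
    return all(share <= THRESHOLD for _, share in top)
```

A sample whose reads are 70% one subtype would satisfy `confirms_pure_call` for that subtype, while an even split between two subtypes would satisfy `no_dominant_subtype` instead.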
The HIV subtype classification method according to the present invention comprises the following steps:
sequencing a segment (approximately 1 kb) spanning the PR and RT regions of the pol gene of the HIV sample to be typed. Absolute typing of HIV requires determining the full length of the HIV sequence (about 9 kb) and then constructing a phylogenetic tree; the pol gene is the target region for HIV drug-resistance detection, and according to the technical scheme of the present application, typing can be carried out by sequencing only a segment (about 1 kb) spanning the PR and RT regions of the pol gene, in which the breakpoints of the various HIV subtypes are distributed;
constructing a database pool comprising first-generation sequencing sequences from a public database and second-generation sequencing data of HIV;
data processing, namely processing an input second-generation sequencing file into a consensus sequence Reads.fasta, recording HIV sequences that have passed quality checking into the database pool, and recording newly added sequences from the public database into the database pool; and
counting the breakpoint coverage of the HIV subtypes by the consensus sequences converted from the second-generation sequencing data of HIV in the database pool, calculating the breakpoint coverage rates corresponding to the different HIV subtypes, comparing the breakpoint coverage rate of the sample to be typed with those of the known HIV subtypes, and outputting the typing result of the sample.
The HIV subtype classification method according to the present invention further comprises the step of: directly BLAST-aligning the consensus sequence against the HIV first-generation sequencing sequences in the database pool, and outputting the sequence similarity comparison result.
The HIV subtype classification method according to the present invention further comprises the step of: aligning the second-generation sequencing reads of the sample to be typed against the second-generation sequencing data of HIV in the database pool, and counting and comparing the proportions of reads assigned to the different subtypes, to assist in identifying recombinant and mixed subtypes.
As shown in fig. 10 and 11, according to the HIV subtype classification method of the present invention, the typing result of the sample to be typed and its nucleic acid level similarity to the best typing result are output after alignment; the similarity ranges from 0 to 1, and a larger value indicates higher similarity. Also output after alignment are the amino acid level similarity and the nucleic acid level similarity between the sample and the compared sequence, and the length of the consensus sequence used for typing.
As shown in fig. 12, according to the HIV subtype classification method of the present invention, the database can process samples in batches (upper limit of 10 samples per run) when there are many samples, and a typing result is reported separately for each sample.
As shown in fig. 13, according to the HIV subtype classification method of the present invention, the results of aligning the consensus sequence against the public database are output after alignment, including the length of the query sequence and the start and end positions of its alignment, the length of the matched sequence and the start and end positions of its alignment, the number of mismatched bases, the alignment similarity, the alignment score, and the like.
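The alignment fields listed above resemble a tabular BLAST report. The specification does not state which output format is used, so the sketch below assumes the default 12-column tab-separated layout of BLAST+ `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), which does not carry the sequence lengths. The `Hit` record and `top_hits` function are hypothetical names.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity
    length: int      # alignment length
    mismatches: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    bitscore: float


def top_hits(tabular: str, n: int = 10) -> list[Hit]:
    """Parse 12-column tabular BLAST text and keep the n best-scoring hits."""
    hits = []
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[11])))
    return sorted(hits, key=lambda h: h.bitscore, reverse=True)[:n]
```

Sorting by bit score and keeping ten hits mirrors the "10 optimal alignment results" shown on the output page.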
According to a specific embodiment of the present invention, the HIV in the test sample is typed as subtype B with a nucleic acid level similarity of 98.1%, and all three typing schemes show that the strain is highly similar to known subtype B HIV.
According to the technical scheme of the invention, up to 98 HIV subtypes can be typed, including 11 pure HIV strains and 87 epidemic recombinant HIV subtypes; each subtype and its reference strain are shown in Table 1 below, the HIV-1 Subtype Database typing list, in which entries 88-96 are simple subtypes and the remaining CRFs are epidemic recombinant subtypes.
TABLE 1 HIV-1 Subtype Database typing list
TABLE 2 Parental subtypes of the 98 subtypes and breakpoint coverage determination thresholds (Distance upper-limit of breakpoint)
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.