With the purpose of releasing an updated annotation of the PN40024.T2T genome, whilst conserving previous manual curation efforts made on the PN40024.v4 genome, we developed two tools: TITAN (The Integrative Transcript ANnotator) and AEGIS (Annotation Extraction and Gene Integration Suite). We present here the V5.1 genome annotation, with a total of 41,766 protein-coding genes and a BUSCO completeness score of 99.4%. A parallel pipeline was used to annotate 7,934 non-coding genes.
BUSCO analysis served as a primary tool for evaluating the completeness and quality of the annotations. Using the eudicots set of conserved orthologs relevant to Vitis vinifera, we observed an improved overall BUSCO score in the new annotation compared to the original. This increase in the number of complete and single-copy orthologs suggests a higher coverage of expected gene content in our updated annotation. Specifically, the percentage of “Complete” BUSCOs (both “Complete and Single-Copy” and “Complete and Duplicated”) increased, while the “Fragmented” and “Missing” categories showed a marked decrease. These improvements highlight the enhanced accuracy of the pipeline in capturing the core, evolutionarily conserved genes expected in Vitis vinifera.

Beyond genome completeness metrics, we conducted an expression analysis of the gene models unique to each annotation set, using RNA-seq data from PN40024, the cultivar from which the genome was assembled. In the original reference annotation, a substantial number of unique gene models were identified (11,508 genes), many of which lacked transcriptomic support (most with 0 counts) and may have been the product of ab initio predictions without substantial transcriptional validation. In contrast, the gene models unique to the updated annotation (17,208 genes) show considerable expression levels across the various Vitis vinifera tissue types represented in the RNA-seq data, indicating that these genes are biologically relevant and actively transcribed. This distinction suggests that the previously annotated unique genes may have largely been false positives, potentially arising from over-annotation and algorithmic noise, rather than true functional genes.
Histograms in the figure below display the distributions of expression levels (read counts) for the genes unique to each annotation version, providing insight into the transcriptional activity of each unique gene set. The reannotation not only includes a more comprehensive set of unique genes but also emphasises transcriptionally active ones.

More on TITAN - The Integrative Transcript Annotator

This evidence-based pipeline integrates data from long- and short-read RNA sequencing (RNA-seq) to produce high-confidence annotations for both protein-coding genes and long non-coding RNAs (lncRNAs). The aim of this pipeline is to construct an accurate and complete gene annotation set, optimising the selection of gene models and ensuring the correct identification of all genomic features. This process is implemented in TITAN (The Integrative Transcript ANnotator), a pipeline designed for broad applicability across multiple species.
The pipeline consists of several stages, each equipped with specific bioinformatics tools, filtering criteria, and quality checks to refine the final gene set. Here we show a detailed explanation of each stage of the workflow, including the rationale for each step, the methods applied, and the anticipated outcomes.
- Transcriptome Assembly
The initial stage involves the assembly of transcriptomes from RNA-seq data, encompassing both long and short reads. This assembly is achieved using tools such as PsiCLASS and StringTie. The use of multiple assembly tools ensures that the pipeline captures comprehensive transcript diversity, including both high-confidence transcripts from short-read RNA-seq and full-length transcripts from long-read RNA-seq (e.g., Iso-seq). These assemblies serve as the basis for annotating gene structures, allowing for the identification of the complete transcripts of the genes, including their UTR regions.
Read alignment for short-read RNA-seq data is performed using STAR, a widely-used spliced aligner for high-throughput RNA sequencing data, whereas for long-read RNA-seq data, Minimap2 is used, which is optimised for aligning long error-prone reads, making it suitable for technologies like PacBio and Oxford Nanopore.
Once reads are aligned, the pipeline uses assembly tools to reconstruct transcriptomes. StringTie is applied to the aligned data to assemble transcripts from both short- and long-read data. PsiCLASS is used only for short-read RNA-seq data, helping to improve the accuracy with which all possible transcript structures are identified.
This integrated approach allows for the assembly of a high-quality transcriptome by combining the strengths of both short- and long-read sequencing data.
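As an illustration of how these steps can be chained, the sketch below drives the aligners and assemblers through subprocess calls. File names, sample labels, thread counts and index paths are placeholders, and the flags shown are common invocations that should be checked against the installed versions of each tool rather than taken as TITAN's exact command lines.

```python
import subprocess

# Placeholder inputs -- adjust to the actual dataset and index locations.
GENOME = "PN40024_T2T.fa"
STAR_INDEX = "star_index/"

def run(cmd):
    """Run one pipeline command and stop on the first failure."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1) Short reads: splice-aware alignment with STAR, then StringTie and PsiCLASS.
run(["STAR", "--runThreadN", "16", "--genomeDir", STAR_INDEX,
     "--readFilesIn", "leaf_R1.fq.gz", "leaf_R2.fq.gz",
     "--readFilesCommand", "zcat",
     "--outSAMtype", "BAM", "SortedByCoordinate",
     "--outFileNamePrefix", "leaf_"])
run(["stringtie", "leaf_Aligned.sortedByCoord.out.bam", "-p", "16",
     "-o", "leaf_stringtie.gtf"])
run(["psiclass", "-b", "leaf_Aligned.sortedByCoord.out.bam", "-o", "leaf_psiclass"])

# 2) Long reads (e.g. Iso-seq): minimap2 in spliced mode, then StringTie in long-read mode.
run(["minimap2", "-t", "16", "-ax", "splice", GENOME, "isoseq.fq.gz", "-o", "isoseq.sam"])
run(["samtools", "sort", "-o", "isoseq.bam", "isoseq.sam"])
run(["stringtie", "-L", "isoseq.bam", "-o", "isoseq_stringtie.gtf"])
```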
- ab initio predictions and annotation transfers
For ab initio prediction, BRAKER3 is used as the primary tool; it integrates the AUGUSTUS and GeneMark gene predictors, which identify potential coding sequences (CDS) from the genome sequence itself rather than from aligned transcript evidence. BRAKER3 takes the available protein and RNA-seq datasets as input to train these predictors and outputs AUGUSTUS and GeneMark annotations that improve the accuracy of gene structure predictions. Both outputs are used for genes where the transcriptomic data cannot reconstruct complete transcript models.
When available, previous gene annotations on older assemblies of the same species can serve as an additional source of gene models independent of the new transcriptomic data. Using the tool Liftoff, these annotations are mapped onto the newly assembled genome: Liftoff aligns the gene features from the previous assembly and projects them onto the new genome, facilitating the transfer of structural annotation information. Here, AEGIS is used to read GFF files from different sources despite their varying internal format conventions, making it a robust component that TITAN relies on. In addition, we developed a system for quantifying the similarity between two gene models based on their coordinate overlaps at different levels (gene, transcript, exon), which is also part of AEGIS. This approach effectively combines existing knowledge with the ability to identify novel genes in newly assembled genomes.
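The AEGIS implementation itself is not reproduced here, but a minimal sketch of the underlying idea, scoring two gene models by the overlap of their merged exon coordinates as a Jaccard-style ratio, could look like the following (the interval representation is an assumption):

```python
def merged_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals, 1-based inclusive."""
    out = []
    for start, end in sorted(intervals):
        if out and start <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out

def covered(intervals):
    """Total number of bases covered by a merged interval set."""
    return sum(end - start + 1 for start, end in intervals)

def overlap_length(a, b):
    """Number of bases shared by two merged, sorted interval sets."""
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            total += hi - lo + 1
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def exon_jaccard(exons_a, exons_b):
    """Jaccard similarity of the exonic space of two gene models on the same strand."""
    a, b = merged_intervals(exons_a), merged_intervals(exons_b)
    shared = overlap_length(a, b)
    union = covered(a) + covered(b) - shared
    return shared / union if union else 0.0

# Example: a lifted-over model compared with a transcriptome-assembled model.
print(exon_jaccard([(100, 250), (400, 600)], [(120, 250), (400, 580), (700, 760)]))
```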
- Calculation of Transcript Masking
To enhance the quality of gene predictions and reduce false positives, the pipeline applies a transcript masking step. Masking is based on repetitive elements annotated by EDTA (Extensive De Novo TE Annotator), which detects and annotates repetitive regions, including transposable elements. Masking repetitive sequences is crucial, as it minimises interference in subsequent gene prediction by focusing on gene-containing regions.
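As an illustration, the sketch below quantifies how much of a transcript's exonic space falls inside EDTA-annotated repeat intervals; the 0.5 cutoff used for flagging is only an example, not the pipeline's actual threshold.

```python
def repeat_fraction(exons, repeats):
    """Fraction of exonic bases covered by repeat intervals (1-based, inclusive).

    Repeat intervals are assumed to be non-overlapping, e.g. a merged EDTA annotation.
    """
    exonic, masked = 0, 0
    for ex_start, ex_end in exons:
        exonic += ex_end - ex_start + 1
        for rep_start, rep_end in repeats:
            lo, hi = max(ex_start, rep_start), min(ex_end, rep_end)
            if lo <= hi:
                masked += hi - lo + 1
    return masked / exonic if exonic else 0.0

exons = [(1000, 1500), (2000, 2400)]
repeats = [(1200, 1600), (2100, 2150)]
frac = repeat_fraction(exons, repeats)
print(frac, frac > 0.5)  # flag the transcript if most of its exonic space is repeat-derived
```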
- Filtering redundant genes
Based on all the different sources previously described, a custom in-house script was developed to eliminate redundant information from the datasets and retrieve the best possible protein-coding gene models. This script follows ten key steps:
4.1 Filtering out the noise
The pipeline first ensures that transcripts overlapping with repetitive elements are removed or flagged, thereby reducing noise and improving the reliability of the gene models generated in subsequent stages. This step also flags genes matching any of the following criteria (a minimal sketch of the flagging logic follows the list):
- Intron size ≥ 100,000 bp
- Protein size ≤ 50 aa
- Non-coding genes
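A minimal sketch of this flagging logic, assuming simple per-gene attributes (the attribute names and the repeat-overlap cutoff are illustrative, not the actual TITAN/AEGIS interface):

```python
def noise_flags(gene, repeat_overlap_fraction):
    """Return the flags marking a gene model as potentially spurious."""
    flags = []
    if repeat_overlap_fraction > 0.5:                          # example cutoff only
        flags.append("repeat_overlap")
    if gene["max_intron_bp"] >= 100_000:                       # oversized intron
        flags.append("long_intron")
    if gene["is_coding"] and gene["protein_length_aa"] <= 50:  # very short protein
        flags.append("short_protein")
    if not gene["is_coding"]:
        flags.append("non_coding")
    return flags

print(noise_flags({"max_intron_bp": 120_000, "protein_length_aa": 230, "is_coding": True}, 0.1))
# ['long_intron']
```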
4.2 Marking Transcriptomically Supported Genes
This stage identifies and highlights genes that are supported by transcriptomic evidence, specifically genes whose exon regions overlap with those observed in at least one transcriptome assembly. This overlap ensures that selected genes are not isolated predictions but rather have corroborating evidence from the transcriptome, increasing confidence in their functional relevance.
4.3 Calculating Reliable CDS Evidence Scores
To further filter and validate gene models, a reliable CDS evidence score is calculated. This score quantifies the number of transcripts that share the same CDS across the different gene models from all annotation sources. Genes that receive higher scores are prioritised for annotation, as the consistency across models suggests a high likelihood of accurate CDS prediction.
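One way to compute such a score is to key every transcript by its ordered CDS coordinates and count how many transcripts, across all annotation sources, share exactly that structure. The sketch below assumes a simple dictionary layout for the input rather than the pipeline's real data model:

```python
from collections import Counter

def cds_key(transcript):
    """Canonical key for a transcript's CDS: chromosome, strand and sorted CDS segments."""
    return (transcript["chrom"], transcript["strand"], tuple(sorted(transcript["cds"])))

def cds_evidence_scores(transcripts_by_source):
    """Count how many transcripts across all sources share each distinct CDS structure."""
    counts = Counter()
    for transcripts in transcripts_by_source.values():
        for tr in transcripts:
            counts[cds_key(tr)] += 1
    return counts

sources = {
    "stringtie": [{"chrom": "chr1", "strand": "+", "cds": [(120, 250), (400, 580)]}],
    "braker3":   [{"chrom": "chr1", "strand": "+", "cds": [(120, 250), (400, 580)]}],
    "liftoff":   [{"chrom": "chr1", "strand": "+", "cds": [(120, 250), (400, 600)]}],
}
print(cds_evidence_scores(sources).most_common())
# The CDS shared by StringTie and BRAKER3 scores 2; the Liftoff variant scores 1.
```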
4.4 Finding the best gene model
The pipeline employs a multi-stage selection process to choose the optimal gene model for each locus. The steps include:
- Model 1 Selection: for each group of gene models overlapping at the CDS level, the highest-scoring model is selected based on its CDS evidence score.
- Model 2 Selection: an additional step “rescues” gene models with transcriptomic or ab initio evidence (weighted by the importance of their BLAST hits) that were not previously selected, provided they do not overlap an already-selected gene. This ensures that biologically relevant models are not omitted merely because of lower initial scores (see the sketch after this list).
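A greatly simplified sketch of this two-stage logic, assuming per-model CDS evidence scores and a generic "supported" flag for the rescue stage, is shown below; the overlap test and the data layout are illustrative only.

```python
def overlaps(a, b):
    """True if two models overlap on the same chromosome (span-level test)."""
    return a["chrom"] == b["chrom"] and a["start"] <= b["end"] and b["start"] <= a["end"]

def select_models(candidates):
    selected = []
    # Stage 1 -- among scored models, greedily keep the highest-scoring one per overlapping cluster.
    scored = [m for m in candidates if m["cds_score"] > 0]
    for model in sorted(scored, key=lambda m: m["cds_score"], reverse=True):
        if not any(overlaps(model, s) for s in selected):
            selected.append(model)
    # Stage 2 -- rescue unscored models with transcriptomic / ab initio / BLAST support,
    # provided they do not overlap an already-selected gene.
    for model in candidates:
        if model["cds_score"] > 0 or not model.get("supported", False):
            continue
        if not any(overlaps(model, s) for s in selected):
            selected.append(model)
    return selected

candidates = [
    {"chrom": "chr1", "start": 100,  "end": 900,  "cds_score": 5, "supported": True},
    {"chrom": "chr1", "start": 150,  "end": 800,  "cds_score": 2, "supported": True},
    {"chrom": "chr1", "start": 2000, "end": 2600, "cds_score": 0, "supported": True},
]
print(len(select_models(candidates)))  # 2: the best of the overlapping pair, plus the rescued model
```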
4.5 Adding additional ab initio models based on BLAST Hits
This step is crucial in cases where there is a conflict between ab initio predictions and transcriptomic data. If a better BLAST hit is detected for an ab initio gene model that was not selected in the primary stages, the pipeline adds this model as an additional transcript variant of the selected gene model, strengthening the evidence that the locus encodes a functional product. Both versions are retained as independent transcripts, providing multiple interpretations for complex loci.
4.6 Adding additional ab initio models based on longer CDS
Similarly to step 4.5, for each locus where an ab initio model holds a longer in-frame coding sequence, that model is added as an additional transcript variant of the shorter selected model. This step emphasises completeness by favouring transcripts with extended coding regions, which are more likely to represent the full functional gene, and adds robustness to the annotation by correcting potentially incomplete models.
4.7 Removing Intron Nested Genes
Models nested within introns of other genes are removed unless they exceed a 450 bp CDS length or have a coding ratio (CDS length/transcript length) greater than 0.4 on their mRNA. This step filters out potentially spurious nested models from read misalignments or incomplete intron retentions.
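Expressed as code, this rule is a small check (CDS and transcript lengths are assumed to be precomputed for each nested model):

```python
def keep_nested_model(cds_length, transcript_length):
    """Retain an intron-nested model only if its CDS exceeds 450 bp
    or its coding ratio (CDS length / transcript length) exceeds 0.4."""
    coding_ratio = cds_length / transcript_length if transcript_length else 0.0
    return cds_length > 450 or coding_ratio > 0.4

print(keep_nested_model(cds_length=300, transcript_length=1200))  # False -> removed
print(keep_nested_model(cds_length=600, transcript_length=1200))  # True  -> kept
```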
4.8 Selecting the Best Possible Non-overlapping UTR
Non-overlapping UTR regions are added if they do not conflict with CDS or exon regions of other proximal selected genes, enhancing the accuracy of the gene model boundaries.
4.9 Removing UTRs from Overlapping Genes
In cases where genes still overlap after previous steps, UTR regions from both overlapping genes are removed, focusing on the core CDS.
4.10 Adding Remaining Catalogue Genes
The pipeline is able to add genes already described in the literature (if available from reference catalogues) and include them in regions where no other gene models have been selected.
- Long Non-Coding Gene Annotation
At this stage, the aim is to systematically identify and annotate lncRNAs from RNA-seq data. Each step of the stage is crucial for improving the specificity and quality of the lncRNA candidates, resulting in a reliable dataset for downstream functional analyses.
Starting with HISAT2 for read alignment and proceeding through transcriptome assembly, positional comparison, and rigorous coding potential assessment, the pipeline ensures high specificity in lncRNA detection for the transcriptomic input data. The assembled transcripts are filtered to retain only those with a length greater than 200 nucleotides and an FPKM (Fragments Per Kilobase of exon per Million reads) value above 0.5. This filtering reduces noise and improves the reliability of the predicted models in favor of significant, longer transcripts.
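The two cutoffs translate directly into a pre-filter along these lines (the field names are assumptions, not the pipeline's actual schema):

```python
def passes_lncrna_prefilter(transcript):
    """Keep assembled transcripts longer than 200 nt with an FPKM above 0.5."""
    return transcript["length_nt"] > 200 and transcript["fpkm"] > 0.5

assembled = [
    {"id": "tx1", "length_nt": 1340, "fpkm": 2.7},
    {"id": "tx2", "length_nt": 150,  "fpkm": 8.0},   # too short
    {"id": "tx3", "length_nt": 820,  "fpkm": 0.2},   # expression too low
]
print([t["id"] for t in assembled if passes_lncrna_prefilter(t)])  # ['tx1']
```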
The coding potential of each transcript is evaluated using a combination of specialised software tools: CPC2, FEELnc, and CPAT. These tools, along with BLAST hits against protein databases, predict the likelihood that a transcript codes for a protein. Based on these predictions, a lncRNA confidence level (high, medium, or low) is assigned to each transcript. Only transcripts with a high or medium confidence level are considered for the subsequent stages.
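How the individual predictions are combined into a confidence level is only summarised above; one plausible voting scheme, sketched below, counts how many of CPC2, FEELnc and CPAT call a transcript non-coding and treats a protein BLAST hit as evidence against a lncRNA call. The exact rule used by the pipeline may differ.

```python
def lncrna_confidence(cpc2_noncoding, feelnc_noncoding, cpat_noncoding, has_protein_blast_hit):
    """Illustrative voting scheme: three non-coding calls and no protein hit -> high;
    two non-coding calls and no protein hit -> medium; anything else -> low."""
    votes = sum([cpc2_noncoding, feelnc_noncoding, cpat_noncoding])
    if votes == 3 and not has_protein_blast_hit:
        return "high"
    if votes >= 2 and not has_protein_blast_hit:
        return "medium"
    return "low"

print(lncrna_confidence(True, True, True, False))   # high
print(lncrna_confidence(True, True, False, False))  # medium
print(lncrna_confidence(True, False, False, True))  # low
```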
To remove transcripts that are unlikely to be lncRNAs, the next step uses BLAST to filter out potential non-lncRNAs, including known miRNA precursors and other types of RNAs (rRNAs, tRNAs, snRNAs, and snoRNAs).
Finally, smaller lncRNAs overlapping with a larger model are removed, so that only the most complete transcript is retained and potential partial assemblies are discarded. The retained models are then classified into categories according to how they overlap protein-coding genes (a simplified classifier is sketched after the list):
- lincRNAs: intergenic lncRNAs
- NAT-lncRNAs: natural antisense transcript lncRNAs
- int-lncRNAs: intronic lncRNAs
- SOT-lncRNAs: sense overlapping transcript lncRNAs
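A simplified classifier reproducing these four categories from a lncRNA's position relative to a single protein-coding gene could look as follows; real strand- and isoform-aware classification is more involved, and the rules here are only a sketch.

```python
def classify_lncrna(lnc, gene):
    """Classify one lncRNA relative to one protein-coding gene.

    `lnc` and `gene` carry start/end/strand; `gene` also lists its exon intervals.
    Returns one of: lincRNA, NAT-lncRNA, int-lncRNA, SOT-lncRNA.
    """
    overlap = lnc["start"] <= gene["end"] and gene["start"] <= lnc["end"]
    if not overlap:
        return "lincRNA"                      # intergenic
    if lnc["strand"] != gene["strand"]:
        return "NAT-lncRNA"                   # natural antisense transcript
    exon_overlap = any(lnc["start"] <= e_end and e_start <= lnc["end"]
                       for e_start, e_end in gene["exons"])
    if exon_overlap:
        return "SOT-lncRNA"                   # sense overlapping transcript
    return "int-lncRNA"                       # contained in an intron, sense strand

gene = {"start": 1000, "end": 5000, "strand": "+", "exons": [(1000, 1500), (4500, 5000)]}
print(classify_lncrna({"start": 2000, "end": 2600, "strand": "+"}, gene))  # int-lncRNA
print(classify_lncrna({"start": 2000, "end": 2600, "strand": "-"}, gene))  # NAT-lncRNA
```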
These models are incorporated along with the protein-coding genes into the final annotation file.
Antonio Santiago, David Navarro-Payá, Pascual Villalba-Bermell, Gustavo G. Gomez, Iñigo De Martín Agirre, Amandine Velt, Marco Moretto, Hua Xiao, Yongfeng Zhou, Camille Rustenholz & José Tomás Matus. (2024) A comprehensive and accurate annotation for the grapevine T2T genome. Open-GPB2024. Logroño, Spain
Funding Agencies Involved


