STSM reports - Grapedia

Implementing latest PN40024 genome version in Grapedia

COST Beneficiary: Marco Moretto – Fondazione Edmund Mach – San Michele all’Adige, Trento, Italy

Host: INRAe Colmar, France

Dates: 01/10/2023 to 15/10/2023

During my STSM, which took place at the Centre INRAe Grand-Est in Colmar, France, under the supervision of Camille Rustenholz, we worked on the development of two pivotal workflows for data integration. These workflows will be used within Grapedia and made available to the community.

The first workflow extends the use of a tool named SNPlift, broadening its capabilities to work with markers beyond SNPs alone. It enables the mapping of genetic markers from one genome to another, whether they are different versions of the same genome or entirely different genotypes. This tool will empower Grapedia users to explore markers of interest in the various genomes provided by the platform. Furthermore, it will be accessible as a graphical tool for users to search for specific markers within Grapedia’s genomes and as a downloadable Nextflow workflow from the Grapedia GitHub repository.
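The idea of moving a marker between assemblies can be illustrated with a toy sketch: take the sequence flanking the marker in the source genome and look for it in the target genome. This is only an illustration under strong assumptions (exact, unique matches on the forward strand; in-memory strings); it is not the SNPlift implementation, and the function name `lift_marker` and its parameters are hypothetical.

```python
def lift_marker(old_genome, new_genome, chrom, pos, flank=10):
    """Toy lift-over of a 1-based marker position using its flanking window.

    old_genome / new_genome: dict mapping sequence name -> sequence string.
    Returns (sequence_name, new_1_based_position), or None when the window
    is absent or matches more than once in the target genome.
    """
    seq = old_genome[chrom]
    start = max(0, pos - 1 - flank)      # window start (0-based)
    window = seq[start:pos + flank]      # marker base plus flanks
    offset = (pos - 1) - start           # marker index inside the window

    hits = []
    for name, target in new_genome.items():
        idx = target.find(window)
        while idx != -1:
            hits.append((name, idx + offset + 1))
            idx = target.find(window, idx + 1)

    # Only an unambiguous placement is reported.
    return hits[0] if len(hits) == 1 else None
```

A real workflow would align the flanking sequences with a read mapper so that mismatches, indels and reverse-strand placements are tolerated.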

The second workflow is an RNA-seq alignment workflow. It streamlines the process of going from a raw FASTQ file to a count matrix (in CSV format) for subsequent analysis. Users can use this tool to align their own FASTQ files or experiments from the NCBI SRA against genomes available in Grapedia. This workflow will be instrumental in integrating transcriptomic data for data exploration and network reconstruction within Grapedia.
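As an illustration of the final step only, the sketch below merges per-sample gene counts into a single CSV count matrix. The input structure, the `gene_id` column name and the function name are assumptions made for this example; the layout produced by the actual workflow may differ.

```python
import csv
import io


def build_count_matrix(per_sample_counts):
    """Merge per-sample counts into CSV text with one row per gene.

    per_sample_counts: dict mapping sample name -> {gene_id: count}.
    Genes missing from a sample are written as 0.
    """
    samples = sorted(per_sample_counts)
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["gene_id"] + samples)          # header row
    for gene in genes:
        row = [per_sample_counts[s].get(gene, 0) for s in samples]
        writer.writerow([gene] + row)
    return buf.getvalue()
```

Sorting samples and genes keeps the output stable between runs, which makes matrices easier to compare.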

During the STSM, I successfully achieved all of the planned goals and also:

● Introduced Grapedia in a dissemination seminar held at INRAe Grand-Est.

● Outlined the development roadmap in collaboration with our technological partner, Sequentia.

● Worked on organizing an upcoming Grapedia event, the QTL-hackathon, scheduled to be held in Verona.

Implementing the PN40024 reference genome in the GRAPEDIA database and portal

COST Beneficiary: Antonio Santiago Pajuelo – I2Sysbio – Spain

Host: INRAe Colmar, France

Dates:

First period: 01/06/2023 to 31/07/2023

Second period: 01/10/2023 to 23/10/2023

The main objective of this STSM was to create a new reference genome for Vitis vinifera that integrates the best features of two existing assemblies: the T2T assembly, which has high completeness and accuracy, and the PN40024.v4 assembly, which has a rich annotation and is compatible with other grapevine resources. This new reference genome and annotation would enhance the quality and usability of grapevine genomic data and derived resources in the Grapedia Portal. During these two months, I collaborated with Camille Rustenholz to design the workflow and interpret the results obtained. I focused on validating the new T2T assembly and improving the previous annotation pipeline used for the PN40024.v4 assembly.

The STSM involved verifying the phasing of the T2T assembly by mapping the parental data from Schiava Grossa and Pinot Noir, as well as assessing the completeness of the reference genome. The validation and evaluation of the quality and consistency of the consensus assembly and its annotation revealed some areas that require further improvement or refinement, such as resolving some structural variations or correcting some misannotations.

During the STSM, I also worked on parallelizing the bottleneck step of the annotation pipeline, updating the software version and containerizing it. This improves the reproducibility and flexibility of the workflow ahead of its future inclusion as a workflow shared with the community through the Grapedia portal.

Main Achievements: I evaluated and validated the T2T assembly for its integration into the Grapedia Portal. The annotation pipeline is still under development and the final annotation is not yet available, but I have made significant progress in several steps of both structural and functional annotations. This work also fostered a strong collaboration between myself and Camille Rustenholz, a leading expert in grapevine genomics. We worked together on how to assess the quality of the different assemblies obtained and how to optimize the workflow for structural and functional annotation of genes. This allowed us to define more clearly the steps to follow to achieve these objectives.

Follow-Up Plans: I am committed to continuing with the pipeline, in order to obtain a final annotation for the T2T assembly as soon as possible. This annotation will be a valuable resource for the Grapedia portal and for the grapevine research community.

Conclusion: The STSM partially achieved its goals, as it validated the new T2T genome assembly and advanced the annotation pipeline for this genome. The annotation workflow will be available on the Grapedia portal soon, where it can potentially be applied to other cultivars.

Integration of the ChromoMap R Software Suite in GRAPEDIA

Lakshay Anand – Department of Horticulture, University of Kentucky, KY, USA (Non-beneficiary of COST STSM)

Host: Fondazione Edmund Mach – San Michele all’Adige, Trento, Italy

Dates: 19 June - 7 July 2023

Integration of the Vitviz Software Suite in GRAPEDIA

Luis Orduña – I2Sysbio – Spain (Non-beneficiary of COST STSM)

Host: Fondazione Edmund Mach – San Michele all’Adige, Trento, Italy

Dates: 1 May - 31 July 2023

The objective of my stay at the receiving center was to continue the work carried out by David Navarro-Payá on integrating the Vitviz platform into the Grapedia initiative. To achieve this objective, I worked with Dr. Marco Moretto, creator of the Vespucci platform (https://vespucci.readthedocs.io/en/latest/#) and a member of the Grapedia team, who has extensive experience in building web resources for the Vitis vinifera L. scientific community. During the stay, the following activities were carried out:

1- Presentation of the IT structure of the Vitviz platform to Dr. Marco Moretto: this activity was carried out during the first week of the stay. It served to explain to Dr. Marco Moretto how the Vitviz platform works internally, its characteristics, components and the functionalities offered therein.

2- Sharing of the best procedure to integrate Vitviz data into the Grapedia website: In this activity, the best practices were established to integrate the data generated by Vitviz into the Grapedia website, taking into account factors such as cybersecurity and user experience. Thanks to this activity, the structure of Grapedia’s backend was defined.

3- Learning and working with Dr. Marco Moretto to build an API in Python: Once the Grapedia backend structure was defined, I learned how to use the API created by Dr. Marco Moretto to access the Vitviz platform’s data on the backend. The API, based on GraphQL, allows fast and efficient access to the data structure decided in point 2, guaranteeing secure access to the Vitviz data deposited in Grapedia. As part of this activity, the API was also tested to find errors or problems in its use, and all errors found were corrected.

4- Development of a user-friendly interface: this activity consisted of developing a graphical user interface (GUI) similar to that of the Vitviz platform so that Grapedia users can comfortably and easily access the data contained in the platform. As a result, three interactive React-based dashboards (Gene Cards dashboard, Multiple genes dashboard and Networks dashboard) were developed, which allow users to access all the contents of the Grapedia backend in a friendly and interactive way. The “Gene Cards dashboard” takes the gene symbol or gene ID of interest to the user and, through the API described in point 3, extracts from the backend all the information available about that gene and displays it interactively. This information includes all the data available for that gene in the Grape Gene Reference Catalog, gene expression data across different tissues (displayed in an interactive boxplot), and a list of the 420 genes (1% of the total genes in the Vitis vinifera genome) that are most co-expressed with the gene of interest across different tissues. The “Multiple genes dashboard” takes a list of genes given by the user and, through the API, extracts from the backend all the expression data available for each gene, generating interactive expression heatmaps that can be filtered by tissue. Finally, the “Networks dashboard” takes a list of genes given by the user and, through the API, extracts from the backend all the co-expression relationships detected between those genes in different co-expression networks. The three dashboards will be offered on the Grapedia portal when it is made public, which is expected shortly.

5- Tinkering with GRAPEDIA Training School: This activity involved organizing, together with Dr. Marco Moretto, the Training School “Tinkering with GRAPEDIA”, scheduled for 26-28 June 2023. This event made it possible to present a first version of Grapedia to the Vitis vinifera L. scientific community and to train participants in the use of the new platform, through presentations, examples of use and feedback sessions where participants could raise suggestions and problems to help improve the GRAPEDIA platform. In addition to actively participating in the organization, I took part in the event as a speaker, presenting how the dashboards work.
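The API access described in point 3 and the co-expression listing described in point 4 can be sketched as follows. The GraphQL field names (`gene`, `symbol`, `expression`, `tissue`, `value`) and both function names are hypothetical, since the Grapedia schema is not given here; only the request shape (a JSON body with `query` and `variables`) follows the usual GraphQL-over-HTTP convention. The 1% cut-off comes from the text: 420 genes at 1% implies a total of about 42,000 genes.

```python
import json


def build_gene_query(gene_id):
    """Build the JSON body of a GraphQL request for one gene.

    The schema used in the query string is a placeholder, not Grapedia's.
    """
    query = """
    query GeneExpression($geneId: String!) {
      gene(id: $geneId) {
        symbol
        expression { tissue value }
      }
    }
    """
    return json.dumps({"query": query, "variables": {"geneId": gene_id}})


def top_coexpressed(scores, fraction=0.01):
    """Return the genes with the highest co-expression scores.

    scores: dict mapping gene_id -> score against the gene of interest
    (higher means more strongly co-expressed). At least one gene is kept.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]
```

Passing the gene identifier as a GraphQL variable, rather than formatting it into the query string, avoids quoting problems with user-supplied input.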

Integration of the Vitviz Software Suite in GRAPEDIA

COST Beneficiary: David Navarro-Payá – I2Sysbio – Spain

Host: Fondazione Edmund Mach – San Michele all’Adige, Trento, Italy

Dates: 20 March - 1 April 2023

The STSM aimed to integrate VitViz, a bioinformatics tool, into Grapedia, a centralised hub for grapevine genomic data, to improve the platform’s usability. The integration would enable easy access to VitViz tools and resources within Grapedia, making it a one-stop shop for all grapevine genomic data and resources. During these two weeks I worked closely with Marco Moretto to define how the backend and frontend are currently planned to work. I personally focused on implementing frontend applications for Grapedia that mimic the functionality currently available in VitViz.

The STSM involved evaluating VitViz software compatibility with Grapedia and designing a plan to integrate VitViz into Grapedia, taking into account backend and frontend development. Using Python, I implemented backend components of Grapedia to extract and retrieve data from VitViz and transfer it into the Grapedia database. I then used the GraphiQL query system to access the same data and parse it for frontend applications such as interactive boxplots built with the Plotly libraries.

During the STSM, I worked on improving the visualization and presentation of data in the portal, using JavaScript and React on the frontend. The aim was to create a user-friendly interface that allows researchers to browse, search and visualise different grapevine genomic data. After the integration, I tested the system to ensure its functionality and usability. In the process I obtained a global view of the full-stack architecture of Grapedia, which allowed me to design the frontend applications to fit the backend. This also involved designing the backend database to allow for maximal versatility in the frontend applications and to facilitate the coding process itself.

Main Achievements: The integration of VitViz into Grapedia was successful, and the new VitViz-enabled features of Grapedia were tested. The new features enabled users to analyze and visualize grapevine genomic data from a gene-centric input which includes for now reference catalogue data and FPKM expression in an interactive boxplot across more than 4000 SRA runs (each dot in the plot can be interactively selected).

The integration of VitViz into Grapedia also strengthened collaborative ties between myself and Marco Moretto. By working together to integrate VitViz into Grapedia, we were able to learn from each other and to identify potential technical challenges and integration points that can be addressed in future development phases.

Follow-Up Plans: The STSM was successful in achieving its objective, and we are currently working on the development of Grapedia as a final product, ready to be accessible. We are now collaborating closely and remotely over a Slack channel to facilitate communication. I am determined to carry on with the frontend development until the VitViz functionalities are available within Grapedia, along with all the improvements to visualisation discussed with Marco during the STSM.

In the short-term future we also plan to update the user documentation and tutorials to reflect any future changes or updates to the system. This will ensure that users are always up-to-date with the latest developments on the platform. Finally, we plan to continue collaborating with other institutions to improve the platform’s functionality and usability. By working together, we can continue to make improvements to the platform and to make it a valuable resource for grapevine genomic data.

Conclusion: The STSM achieved its objective of integrating VitViz software into Grapedia, improving the platform’s usability. The integration will enable easier access to VitViz tools and resources within Grapedia, hopefully making it a key hub for all grapevine genomic data and resources. By working together, we can continue to improve the platform and to make it a valuable resource for researchers around the world.

Integrating Community resources into Grapedia’s database with Airbyte

Victor García-Carpintero – IBMCP – Valencia, Spain (Non-beneficiary of COST STSM grant)

Host: Sequentia Biotech SL, Barcelona, Spain

Dates: 14-25 November 2022

Over the past years, the grapevine scientific community has generated abundant, high-quality resources and knowledge about grapevine: genetics, omics, phenotyping…

• INTEGRAPE has contributed to the standardization of these resources following the principles of findability, accessibility, interoperability, and reusability (FAIR).

• Grapevine resources remain scattered, though. One of Grapedia’s main goals is to centralize all these resources on an open portal, among other services. To achieve this, all data shared by the community must be integrated into a unified database.

Following Airbyte’s EL(T) process, resources shared by the community can be replicated into Grapedia’s unified database in a seamless way, regardless of their origin or format.

What is a connector?

A connector is a Docker image built inside Airbyte with the instructions and specifications required to extract or load data.

• Airbyte provides its own generator for the core of a connector, so you don’t have to write it from scratch.

• Connectors can be coded using Java or Python (or using Airbyte’s web interface).

• Both Source and Destination connectors need to define how users are going to interact with the data stored. These definitions are called streams.

• Connectors can be configured to overwrite existing records or to do incremental updates to the database.
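The difference between overwriting and incrementally updating can be illustrated with a conceptual sketch. It does not use Airbyte’s API and does not model its cursors or per-stream configuration; the destination is just a dict keyed by a primary key, and the names `sync`, `key` and `mode` are hypothetical.

```python
def sync(destination, records, key, mode="incremental"):
    """Apply source records to a destination table held as a dict.

    mode="overwrite": the destination is emptied, then reloaded.
    mode="incremental": existing rows are kept; incoming rows are
    inserted, or replace the row that has the same primary key.
    """
    if mode == "overwrite":
        destination.clear()
    elif mode != "incremental":
        raise ValueError(f"unknown sync mode: {mode}")

    for record in records:
        destination[record[key]] = dict(record)   # copy to avoid aliasing
    return destination
```

With `mode="overwrite"` records that disappeared from the source also disappear from the destination, whereas the incremental mode keeps them until they are removed explicitly.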

Connectors development

For this project, connectors were developed to integrate the following data:

• FASTA file format

• GFF file format

• GO Terms file format

• INTEGRAPE reference gene catalog

• OneGenE output file format
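To make the extraction step concrete for one of these formats, the sketch below turns a single GFF3 line into a record of the kind a source connector could emit. It relies only on the standard nine tab-separated GFF3 columns; the function name and the record layout are assumptions for this example, not the code of the Grapedia connectors.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for
    comment, directive or blank lines."""
    if not line.strip() or line.startswith("#"):
        return None

    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")

    record = dict(zip(GFF3_COLUMNS, cols[:8]))
    record["start"] = int(record["start"])    # 1-based, inclusive
    record["end"] = int(record["end"])

    # Column 9 holds semicolon-separated tag=value pairs.
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attributes[tag.strip()] = value.strip()
    record["attributes"] = attributes
    return record
```

Each feature line becomes one record, which is consistent with a single GFF file producing a very large number of database rows.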

The quality of the Grapedia connectors was assessed in two ways during development:

• Airbyte’s checks: Airbyte can check if connectors can communicate with the data storage and check if streams are properly configured with shell commands.

• Tests with sample data: all connectors were tested with sample data to check that the ELT process was working as intended.

Some considerations

Airbyte provides a simple solution to integrate community resources into Grapedia’s database while keeping the ability to update previous data or add more, but:

• Speed could be a problem for some volumes of data if constant updates are expected: loading GFF or GO Terms files takes approximately 10 minutes on a laptop, but OneGenE data can take more than 30 hours.

• Airbyte stores all records as raw data even if these records are going to be normalized, and the raw data needs to be removed manually (as far as I know). This can quickly consume the available space if proper database maintenance is not done.

• Choosing a suitable database system for the integrated data should be a major concern. For example, adding a GFF file adds more than 500 thousand records to the database. If more data is added in the future, this could slow database queries and degrade the user’s experience. Non-relational database systems should be considered if this is the case.

Grapedia architecture definition and first prototype implementation

COST Beneficiary: Marco Moretto – Fondazione Edmund Mach – San Michele all’Adige, Trento, Italy

Host: Sequentia Biotech SL, Barcelona, Spain

Dates: 14-25 November 2022

During my STSM held at Sequentia Biotech in Barcelona (Spain), under the supervision of Walter Sanseverino and Marco Di Marsico, we established the foundation of the main GRAPEDIA architecture in terms of the technologies to be used (Airbyte for ETL, a PostgreSQL database, Python FastAPI and Strawberry for the GraphQL interface, React and Acorn for front-end development, and Docker and Nextflow for workflow development).

Together with the Sequentia team, we decided on the development and documentation workflow by establishing different repositories on Atlassian BitBucket for the different parts of the project. Different team members are devoted to different development activities: backend, frontend and workflow development. During the STSM, I carried out backend development for the Airbyte ETL and the GraphQL interface and coordinated the work for the future activities. Moreover, we presented the overall project and infrastructure to a broad audience during our planned online Kickoff Meeting.

During the STSM, most of the planned goals were carried out:

• assessment of possible technologies to be adopted during the development of the GRAPEDIA platform;

• laying the foundations for the prototype implementation.

Active development was initiated in specific repositories for the different layers of the infrastructure, with different solutions being actively tested. Development of the unified data model that will be accessible and queryable from the GraphQL interface was also started.

Follow-up work will focus on developing the planned working prototype. Together with the other partners, we decided on the steps to complete in the coming weeks, with periodic meetings to assess the current state of the work.