
High throughput Python pipeline to identify Horizontal Gene Transfer

Submission Number: 165
Submission ID: 3769
Submission UUID: af0b1e88-3357-4b39-943a-c850fd242402
Submission URI: /form/project

Created: Tue, 06/13/2023 - 08:10
Completed: Tue, 06/13/2023 - 08:10
Changed: Fri, 04/19/2024 - 14:12

Remote IP address: 173.59.116.37
Submitted by: Vinayak Mathur
Language: English

Is draft: No
Webform: Project
CAREERS
{Empty}
bioinformatics (277), biology (515), data-wrangling (6), genomics (537), github (490), python (69), workflow (365)
Halted

Project Leader

Vinayak Mathur
7324214925
{Empty}

Project Personnel

Simon Delattre
Kendrick Key
{Empty}

Project Information

Project Description: This project investigates horizontal gene transfer (HGT), specifically HGT arising from interactions between bacteriophages and their host bacteria. Biologically, this type of HGT occurs when a bacteriophage attaches to a bacterial cell and injects its genetic material, which can integrate into the host genome and redirect the bacterium to produce copies of the phage. The main aim of the project is to develop an analysis pipeline in Python that takes a list of phage accession numbers as input and automatically generates a list of the corresponding bacterial accession numbers. The current program uses BLAST to create this list.
In the pipeline, each phage accession number in the input list is submitted as a BLAST query against the NCBI database of bacterial genes. The top bacterial result for each phage query ID is stored and then used as a query against the database of bacteriophage genes. If that reverse search returns the original phage query ID, the match indicates the presence of horizontal gene transfer. Running this analysis in an HPC environment accessed over SSH could significantly speed up data collection compared with the current pipeline or with manual searches on the NCBI BLAST website.
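The reciprocal check described above can be sketched independently of how BLAST itself is run. The example below is a minimal illustration, not the code in the CSPpipeline repository: the function name, the two dictionaries, and the accession numbers are all hypothetical, and it assumes the top hit of each search has already been collected.

```python
def find_reciprocal_hits(phage_to_bacteria, bacteria_to_phage):
    """Return (phage_id, bacterial_id) pairs whose top hits point back at each other.

    phage_to_bacteria: top bacterial hit for each phage query ID (forward search).
    bacteria_to_phage: top phage hit for each bacterial accession (reverse search).
    Both arguments are assumed inputs for this sketch.
    """
    candidates = []
    for phage_id, bacterial_id in phage_to_bacteria.items():
        # .get() returns None when the reverse search produced no hit,
        # so such pairs are never reported.
        if bacteria_to_phage.get(bacterial_id) == phage_id:
            candidates.append((phage_id, bacterial_id))
    return candidates


# Made-up accession numbers, for illustration only.
forward = {"PHG_001": "BAC_100", "PHG_002": "BAC_200", "PHG_003": "BAC_300"}
reverse = {"BAC_100": "PHG_001", "BAC_200": "PHG_999"}

hgt_candidates = find_reciprocal_hits(forward, reverse)
print(hgt_candidates)
```

Only the first pair is reported here: the second bacterial hit points back to a different phage, and the third has no reverse hit at all.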

The current version of the pipeline is available here: https://github.com/genomesolver/CSPpipeline

Research goals: This research project has three major goals:
1) Identify instances of HGT in a large dataset of bacteriophage proteins: The list produced by the program enables more in-depth analysis of bacteriophage-mediated horizontal gene transfer.
2) Predict likelihood of HGT: By developing a probabilistic classifier, we can attempt to predict the likelihood that a given clade of bacteria is affected by horizontal gene transfer, given the HGT status of the other members of the clade. This model could help establish the statistical significance of HGT occurrences among bacterial relatives and help identify cellular features specific to those groups of bacteria that could explain their vulnerability to phage infection.
3) Functional analysis: A Gene Ontology (GO) enrichment analysis is a further research aim, intended to draw meaningful conclusions from these data. Since the current version of the pipeline generates a list of bacterial accession numbers that correspond to phage query IDs, that list can be processed to find GO terms in groups of genes regulated by the integration of bacteriophage nucleic acids. This analysis would help visualize and clarify how phage infections disrupt the genetic network of the bacteria.
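As a starting point for the second goal above, one very simple probabilistic estimate is a Beta-Bernoulli posterior mean computed from the other members of a clade. This is a swapped-in technique chosen for brevity, not the classifier the project intends to build; the function name, the prior pseudo-counts, and the accession numbers are all assumptions made for illustration.

```python
def hgt_probability(other_member_statuses, prior_hits=1.0, prior_misses=1.0):
    """Posterior mean probability of HGT for one clade member, given the others.

    other_member_statuses: iterable of booleans, True where HGT was detected.
    prior_hits / prior_misses: Beta prior pseudo-counts (1 and 1 give a uniform prior).
    """
    statuses = list(other_member_statuses)
    hits = sum(statuses)  # True counts as 1, False as 0
    return (hits + prior_hits) / (len(statuses) + prior_hits + prior_misses)


# Hypothetical clade: accession -> whether the pipeline flagged HGT.
clade = {"BAC_100": True, "BAC_101": True, "BAC_102": False, "BAC_103": True}

target = "BAC_102"
others = [status for accession, status in clade.items() if accession != target]
p = hgt_probability(others)
print(f"{target}: estimated HGT probability {p:.2f}")
```

The prior keeps the estimate away from 0 or 1 for small clades; with no other members at all, the uniform prior returns 0.5. A real classifier would likely also use phylogenetic distance and genomic features, which this sketch ignores.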

Project Information Subsection

The goals of the project are:
1) Fine-tune the existing Python pipeline so that it can analyze larger datasets
2) Run the analysis against an offline copy of the NCBI database
3) Develop a model to predict the likelihood of HGT
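For the offline-database goal above, a search against a local BLAST+ database might be assembled as follows. This is a sketch under stated assumptions: the file paths, the choice of `blastp`, the e-value cutoff, and the thread count are hypothetical, and the local database is assumed to have been built beforehand (for example with `makeblastdb`). The code only builds the command and parses tabular output; it does not run BLAST.

```python
import shlex


def build_blast_command(query_fasta, db_path, out_path, program="blastp", threads=4):
    """Build a BLAST+ command line that searches a local database.

    All argument values are placeholders chosen for this sketch.
    """
    return [
        program,
        "-query", query_fasta,
        "-db", db_path,
        "-outfmt", "6",            # tab-separated output, one hit per line
        "-max_target_seqs", "5",
        "-evalue", "1e-5",         # assumed cutoff, not taken from the project
        "-num_threads", str(threads),
        "-out", out_path,
    ]


def top_hits(tabular_lines):
    """Map each query ID to the first subject ID listed for it.

    Tabular output normally lists the hits for a query from most to least
    significant, so the first row per query is treated as the top hit.
    """
    hits = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query_id, subject_id = fields[0], fields[1]
        hits.setdefault(query_id, subject_id)  # keep only the first hit seen
    return hits


cmd = build_blast_command("phage_queries.fasta", "db/bacteria", "forward_hits.tsv")
print(shlex.join(cmd))

# Made-up rows in the default 12-column tabular layout.
sample = [
    "PHG_001\tBAC_100\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410",
    "PHG_001\tBAC_150\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-60\t220",
    "PHG_002\tBAC_200\t91.0\t150\t13\t0\t1\t150\t5\t154\t1e-80\t290",
]
forward_hits = top_hits(sample)
print(forward_hits)
```

On a cluster, the command list could be passed to `subprocess.run(cmd, check=True)` inside a batch job, which avoids the network round trip of each remote query.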
{Empty}
Some hands-on experience
{Empty}
Cabrini University
610 King of Prussia Road
IAD 224
Radnor, Pennsylvania. 19087
CR-Penn State
{Empty}
No
Already behind
3
Start date is flexible
4
{Empty}
  • Milestone Title: Improvement of Current Pipeline
    Milestone Description: Improve the run time of the current Python pipeline, possibly by downloading the NCBI database needed for comparison to local server storage
    Completion Date Goal: 2023-08-15
  • Milestone Title: Develop Classifier
    Milestone Description: Develop the probabilistic classifier for the pipeline, to predict likelihood of HGT in different bacterial clades
    Completion Date Goal: 2023-09-30
  • Milestone Title: Functional Analysis Function
    Milestone Description: Create a functional analysis function that compares results against existing Gene Ontology databases. Begin writing a manuscript to publish the results.
    Completion Date Goal: 2023-10-31
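For the functional analysis milestone, GO enrichment is commonly assessed with a one-sided hypergeometric test on each GO term. The sketch below shows that calculation using only the standard library; the function name and all counts are hypothetical, and it is an illustration of the statistic rather than the project's planned implementation.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, sample_annotated):
    """One-sided hypergeometric p-value for over-representation of a GO term.

    population:       number of genes in the background set
    annotated:        background genes carrying the GO term
    sample:           genes in the HGT-associated list
    sample_annotated: genes in that list carrying the GO term
    Returns P(X >= sample_annotated) when `sample` genes are drawn at random.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(sample_annotated, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 background genes, 50 with the term,
# 20 genes in the HGT list, 5 of which carry the term.
p_value = enrichment_p_value(1000, 50, 20, 5)
print(f"p = {p_value:.4g}")
```

Because one test is run per GO term, the resulting p-values would normally need a multiple-testing correction before any term is reported as enriched.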
{Empty}
Plan to publish the manuscript in the journal: https://iubmb.onlinelibrary.wiley.com/journal/15393429
{Empty}

Final Report

{Empty}