Practical 1 week 3 Manual
Bioinformatics and transcriptomics
/ MSc Biomedical Science
College of Science and Technology
Practical 1 Week 3: Bioinformatics and transcript“omics”
Background information
Computational methodologies such as systems biology and bioinformatics are novel approaches that allow molecular, mathematical and statistical analysis of the large datasets obtained through molecular biology and other “omics”-driven approaches.
“Omics” technologies are fields of biology that study families or pools of molecules rather than a single molecule. The suffix “omics” is added to the object of study, for example genomics, proteomics, metabolomics, transcriptomics, phosphoproteomics, pharmacogenomics, toxicogenomics and many more. Whilst this has generated large datasets that advance science, analysing this information is challenging. Systems biology and bioinformatics approaches allow these large datasets to be integrated for investigation and ‘easier’, less biased analysis. These approaches aim to relieve the current bottleneck in biomedical research: large datasets are generated at an exponential rate, yet translation of this knowledge into medical treatment is still slow.
Useful definitions:
· Genes are functional regions of the DNA that carry the instructions for making specific proteins. Close to each gene are "regulatory" sequences of DNA that can turn the gene "on" or "off".
· The TATA box is a DNA sequence found upstream of, and close to, the transcription start site; it is usually bound by the TATA-binding protein (TBP). These elements help orient the polymerase and bind general transcription factors.
· Distal transcriptional regulatory elements:
· Enhancers: DNA elements that function as binding sites for transcription factors that activate transcription
· Silencers: DNA elements that function as binding sites for transcription factors that repress transcription
· Databases: “a structured set of data held in a computer, especially one that is accessible in various ways.” (Oxford Dictionaries)
· Systems biology modeling techniques include ordinary differential equation (ODE) modeling techniques, Petri nets, cellular automata (CA), agent-based modeling (ABM) techniques, hybrid approaches, Boolean modeling and other techniques.
· Logical modeling techniques will be used in this practical
· Mathematical relationships and graphs are usually used to generate complex networks. In a graph, nodes represent genes/proteins and edges represent the interactions between them. A node usually indicates a connection, redistribution or communication point.
· Feedback loops could be negative or positive. Negative feedback loops contribute to the stability of the network, while positive loops drive the pathway.
Boolean network
The Boolean network (BN) is a modeling technique frequently used to analyse biological molecular networks. The advantage of Boolean networks is that they need fewer defined parameters, so they are easier to construct and faster to run than continuous models such as ODE models. BNs allow the behavior of biological systems to be determined systematically. They can capture the essential dynamic properties of complex biological phenomena and have predictive capabilities. For example, when gene expression is represented by Boolean modeling, the two states (on/off) represent the gene being active or inactive, respectively. For more information see:
http://rsif.royalsocietypublishing.org/content/5/Suppl_1/S85.full
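To make the on/off idea concrete, here is a minimal sketch of a two-node Boolean network in Python. The wiring is an assumption made purely for illustration (p53 switches MDM2 on, MDM2 switches p53 off, and a hypothetical constant "damage" input is needed for p53 to switch on); it is not the model built later in this practical.

```python
# Minimal synchronous Boolean network: a p53-MDM2 negative feedback loop.
# The update rules below are illustrative assumptions, not a validated model.

def step(state, damage=True):
    """Return the next (p53, mdm2) state from the current one."""
    p53, mdm2 = state
    next_p53 = damage and not mdm2  # p53 is ON if damage is present and MDM2 is OFF
    next_mdm2 = p53                 # MDM2 is ON if p53 was ON in the previous step
    return (next_p53, next_mdm2)


def simulate(initial, n_steps, damage=True):
    """Apply the update rules n_steps times and return every visited state."""
    trajectory = [initial]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], damage))
    return trajectory


if __name__ == "__main__":
    for t, (p53, mdm2) in enumerate(simulate((False, False), 6)):
        print(t, "p53 =", int(p53), "MDM2 =", int(mdm2))
```

Under these assumed rules the network cycles through four states while damage is present, and stays switched off when damage is absent.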
Aims and objectives
In this practical we will
· Search for transcription factor binding motifs in the genome
· Search the STRING database to extract information about interactants of the p53 tumor suppressor
· Demonstrate how to build a small scale logical model to represent biological processes
Procedure
Task 1
High-throughput biological techniques and our increased understanding of how genes work at the molecular level have led to the development of a large number of online databases. A wide variety of databases is available; some examples are listed below:
1. Gene sequence databases such as NCBI Nucleotide. These databases provide the DNA sequences of millions of genes, both coding-only and full-length sequences. They may be useful when you wish to design primers for PCR, and also to compare the evolutionary history of different species (assessed through sequence homology).
2. Protein sequence databases such as NCBI Protein. Similar to the gene sequence databases, these list the protein sequences of your genes of interest. They often provide information on and separate sequences for splicing isoforms of the protein.
3. Specialist databases such as FlyBase (a database specific to Drosophila). Though the information on these databases may well be contained in more general ones such as Nucleotide, they may provide extra specialist information or reduce waiting times (as catering for one species as opposed to hundreds reduces server strain).
During Task 1, you will use a series of online bioinformatics resources. In the first protocol you will use a unique gene identifier on NCBI Gene to locate a specific gene of interest. Following this, you will use other online databases such as MotifMap to find out which transcription factors have binding sites in or near a gene of interest.
PROTOCOL 1.
Gene name retrieval from NCBI.
1. Type the following address into your browser:
2. Select Gene from the drop-down menu as shown below. Search for the Gene ID 7157.
3. Fill out the Table below based on the information available on the page:
| Official Symbol | |
| Official Full Name | |
| Alternative Names | |
| Species of Origin | |
| Chromosomal Location | |
| Brief Summary of Function | |
Complete the table above and include it in your write-up. From filling out this table, it should be clear that this gene is hugely important in terms of disease progression and normal cellular function. As such, you would expect it to be involved in the regulation of several other genes. Protocol 2 will introduce you to a series of databases that allow you to find transcription factor binding sites. Knowing which genes are regulated by others allows you to plan experiments such as chromatin immunoprecipitation, thus linking the bioinformatics approach to the wet laboratory.
N.B. Keep the webpage on NCBI Gene open throughout the course of Task 1 – you will need it again!
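The same gene record can also be requested programmatically. The sketch below assumes NCBI's E-utilities `esummary` endpoint with JSON output; the function names are our own, and the fields in the returned record should be checked against the NCBI documentation rather than taken from this example. It is optional and does not replace filling in the table by hand.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: NCBI E-utilities ESummary.
EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_gene_summary_url(gene_id):
    """Build the ESummary URL for a record in the NCBI Gene database."""
    query = urlencode({"db": "gene", "id": str(gene_id), "retmode": "json"})
    return EUTILS_ESUMMARY + "?" + query


def fetch_gene_summary(gene_id):
    """Download and parse the summary record (requires internet access)."""
    with urlopen(build_gene_summary_url(gene_id)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_gene_summary(7157) to download the record.
    print(build_gene_summary_url(7157))
```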
PROTOCOL 2
Identification of transcription factor binding sites
Several online databases hold information on transcription factor (TF) binding sites, and they work in a variety of ways. In some, you enter a Gene ID and the database lists all the TFs that regulate that gene, including the chromosomal location where each TF binds. In others, you enter the name of a gene, or a consensus sequence where a transcription factor binds, and the database shows you the genes it regulates.
Transcription factors will typically recognise particular patterns/sequences in the genes they regulate. However, these sequences may vary slightly from one regulated gene to the next. A consensus sequence describes the ideal sequence that represents the most common base in each position.
Understanding the notation used to describe consensus sequences is highly important. Notation may vary from one website or organisation to the next, but a common scheme is represented below:
GCARYNNNTA
Where:
G = Guanine
C = Cytosine
T = Thymine
A = Adenine
R = Any Purine (Adenine or Guanine)
Y = Any Pyrimidine (Cytosine or Thymine)
N = Any Nucleotide
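The notation above can be turned into a pattern search. A minimal sketch in Python, assuming only the seven codes listed above (the full IUPAC alphabet has more) and using a made-up DNA sequence; the function names are our own.

```python
import re

# The codes listed above, mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # any purine
    "Y": "[CT]",    # any pyrimidine
    "N": "[ACGT]",  # any nucleotide
}


def consensus_to_regex(consensus):
    """Translate a consensus sequence into a regular expression string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead (?=...) lets matches overlap.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    sequence = "TTGCAGCAAATAGG"  # invented sequence, for illustration only
    print(consensus_to_regex("GCARYNNNTA"))
    print(find_sites("GCARYNNNTA", sequence))
```

This searches one strand only; a real binding-site search would normally also scan the reverse complement.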
Examples of databases that link TFs to the genes they regulate:
MotifMap:
Utilises the official Gene Symbol from NCBI Gene to provide a list of TFs that regulate that gene, including whether each binding site is upstream or downstream of the Transcription Start Site.
Champion ChIP Transcription Factor Search Portal:
A text-mining tool based on SABiosciences' proprietary database known as DECODE. Utilises the official Gene Symbol from NCBI Gene and provides a list of TFs that potentially regulate the gene.
Transcriptional Regulatory Element Database (TRED):
Utilises the Gene Name/Symbol from NCBI Gene and provides a list of genes that the gene of interest regulates.
p21 (also known as CDKN1A) is a gene known to be regulated by the gene you identified in Protocol 1. The first task of Protocol 2 is to use the three databases described above to find the number of transcription factors that regulate p21, and to compare the findings.
To get the Gene ID for p21, first go to NCBI Gene and find the gene. Type CDKN1A into the search bar. The gene you want should be the first result. Be sure to select only the human (Homo sapiens) gene.
From this page you can then get the official Gene Symbol and ID to use in the databases. For guidelines on how to use each database, see supporting information in the Appendix at the end of this task.
| Database | Number of TFs that regulate p21 or (in the case of TRED) the number of genes p21 regulates |
| MotifMap | |
| Champion ChIP | |
| TRED | |
Complete this table and include it in your write-up.
As stated previously, p21 is a gene regulated by the gene you identified in Protocol 1. From the above databases, you should now know the genomic location where this protein binds to regulate p21. The chromosomal location is:
It should be clear from the results in the above table that databases vary a great deal in their content and in the results they return. However, differences are not always caused by the way a database calculates its results or obtains its information. It is also important to understand the effect of changing the search parameters. A clear example is the comparison between the default settings of MotifMap and those of Champion ChIP. Champion ChIP checks for TF binding sites 20kb upstream and 10kb downstream of the target gene. MotifMap, however, by default looks only 1000 bases upstream and downstream of the target gene’s transcription start site. This difference in search window accounts for much of the discrepancy between the two databases.
Try adjusting the parameters of MotifMap to match those of Champion ChIP, or to be as close as possible. Do they now align more closely? Which database do you think searches more exhaustively?
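The effect of the search window can be illustrated with a short Python sketch. The site names and positions below are invented for illustration and do not come from MotifMap or Champion ChIP; positions are in base pairs relative to the transcription start site (negative = upstream).

```python
# Each site is (TF name, position relative to the TSS in bp).
# Negative = upstream, positive = downstream. All values are invented.
SITES = [
    ("TF_A", -15000),
    ("TF_B", -800),
    ("TF_C", 250),
    ("TF_D", 6000),
    ("TF_E", -19500),
]


def sites_in_window(sites, upstream_bp, downstream_bp):
    """Return the names of sites lying within the given window around the TSS."""
    return [name for name, pos in sites if -upstream_bp <= pos <= downstream_bp]


if __name__ == "__main__":
    narrow = sites_in_window(SITES, 1000, 1000)   # 1000 bases either side
    wide = sites_in_window(SITES, 20000, 10000)   # 20kb upstream, 10kb downstream
    print(len(narrow), narrow)
    print(len(wide), wide)
```

The same underlying set of sites gives a different count purely because the window changes, which is why parameters should be matched before two databases are compared.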
The final task for Protocol 2 is to identify p53 binding sites in a variety of genes. For this task, use MotifMap only. Set upstream and downstream to the maximum values (10,000) so that you detect as many binding sites as possible. The names given in the table are the official Gene Symbols and should be sufficient for MotifMap. If multiple options appear after you enter the gene name, simply select the first one.
| Gene Symbol | Is p53 a TF for the gene? (Y/N) | Chromosome Number where p53 binds | Upstream or downstream from Transcription Start Site? |
| CDKN1A | | | |
| IL6 | | | |
| BCL6 | | | |
| BAK1 | | | |
| BTG2 | | | |
Complete this table and include it in your write-up.
Appendix
Appendix Protocol 1 – Using MotifMap
1. Type the following address into your browser:
2. Click on the “Gene Search” function, and leave the settings as default, then click “Next”.
3. Type in the Gene Symbol, and wait for an option to come up. Click on it, then click “Next”.
4. It is possible to adjust the parameters once on the page. Among others, you can change the reading distance from TSS and the number of results per page.
5. If you are looking for a specific transcription factor, using the browser's search function (CTRL+F) may be easier than scrolling through the full list.
Appendix Protocol 2 – Using Champion ChIP
1. Type the following address into your browser:
http://www.sabiosciences.com/chipqpcrsearch.php?app=TFBS
2. Stay on the default “Transcription Factor”, enter the Official Symbol of your gene of interest from NCBI Gene and click “Search”.
3. Scroll down to the bottom of the page and click “View More Transcription Factors” to get a full list of TFs (by default Champion ChIP shows only the most relevant).
4. You will now have a full list of TFs that regulate your gene of interest.
Appendix Protocol 3 – Using TRED
1. Type the following address into your browser:
https://cb.utdallas.edu/cgi-bin/TRED/tred.cgi?process=home
2. Click on the “Search TF Target Genes” link.
3. Type in the Official Gene Symbol from NCBI Gene into the search bar and press “Search”.
4. If the gene of interest regulates any genes (at least as identified by TRED), they will be shown on the following page.
Task 2: Extract p53 interactants from the STRING database
The STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database comprises known and predicted protein interactions from various organisms. Interactions include both functional and physical associations. These are derived from 4 major sources: high-throughput data, genomic context, literature and conserved co-expression.
1. Using STRING, extract 10 p53 interactants with confidence scores over 0.7.
To do this, go to STRING at http://string-db.org
Type p53 into the search box and press GO! (Fig.1). Ensure that you select Homo sapiens as the organism. Save the image as a PNG file to the desktop. (You will need this for your report.)
Fig. 1 STRING
2. Determine the interaction type between p53 and each interacting protein (‘activation’ or ‘inhibition’) and the specific interaction (‘binding’, ‘post-translational’, ‘expression’ etc.)
This can be achieved by using the text mining evidence in STRING. Click on the interaction line(s) between p53 and the interacting node. The interaction evidence should appear as shown in Fig.3. Click on the relevant evidence. This should direct you to PubMed sources where you can determine the correct interaction between the two genes. For example, in Fig.4 the evidence for p53 and MDM2 shows that p53 induces (activates) MDM2. Repeat this for the remaining p53 interactions. Save all references, confirmed interactions and interaction types. (You will need these for your report.)
Fig.3 Interaction evidence for p53 and MDM2
Fig.4 Predicted evidence from STRING text mining application for p53-MDM2
3. Enter the source node, interaction type and target node for every interaction in tab-delimited format in an Excel or Notepad document.
Refer to Fig.5 for an example of the format. Please note: delete the first row (the headers) before proceeding to the next step. Save the file to the desktop.
Fig.5 Excel sheet of interaction data
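If you prefer to generate the file rather than type it, the layout described above can be sketched in Python as follows. The first row follows the p53/MDM2 example in the text; the second row and the file name are placeholders, and the column order (source, interaction type, target) is the one used when importing into Cytoscape in Task 3.

```python
import csv

# (source, interaction type, target). Only the first row comes from the text;
# GENE_X is a placeholder to be replaced with your own confirmed interactions.
INTERACTIONS = [
    ("TP53", "activation", "MDM2"),
    ("GENE_X", "inhibition", "TP53"),
]


def write_interactions(rows, path):
    """Write rows as tab-delimited text with no header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


if __name__ == "__main__":
    write_interactions(INTERACTIONS, "p53_interactions.txt")  # hypothetical file name
```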
Task 3 (demonstration only): Construct a mini p53 network that only contains a subset of interactants
Cytoscape is an open-source platform for viewing molecular interaction networks and biological pathways, and it can be used alongside Matlab. Cytoscape allows these networks to be integrated with annotations, gene expression and other state data. In this practical, you will use Cytoscape (v.2.8) to visualise a mini p53 network built from your STRING evidence.
1. Cytoscape has already been downloaded to your desktop. Open Cytoscape software.
2. Click File > Import > Network from Table. See Fig.6 for example window
3. Upload your Excel file of interactions
4. Select column 1 for source interaction, column 2 for interaction type and column 3 for target.
5. Tick advanced options to ensure the delimiter is ‘Tab’ and that ‘Transfer first line as attribute names’ is unchecked. Import this network. Fig.7 shows an example imported network in the default layout.
Fig. 6 Importing p53 network into Cytoscape
6. To represent interaction type visually, open VizMapper, located on the left-hand side of the window (circled in Fig.7). Select target arrow shape to define the shape used for each interaction type. For example, you can use one arrow shape to represent activation and a different one to represent inhibition.
Using VizMapper, you can also change colours to represent interaction type, and choose the colour and layout of your model as you wish.
7. Save your p53 model to the desktop as a PNG file.
File > Export > Current network view as graphics
Fig.7 Default layout of example imported network
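The import step can be mirrored in a few lines of Python, which is a quick way to check your file before loading it into Cytoscape. This is a sketch only: the function name and the example rows are our own, and it assumes the three-column tab-delimited layout (source, interaction type, target) with no header row.

```python
import csv
import io


def load_network(tsv_text):
    """Parse 'source<TAB>interaction<TAB>target' lines into nodes and edges."""
    edges = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        source, interaction, target = (cell.strip() for cell in row)
        edges.append((source, interaction, target))
    # Every gene/protein that appears as a source or a target is a node.
    nodes = sorted({name for s, _, t in edges for name in (s, t)})
    return nodes, edges


if __name__ == "__main__":
    # GENE_X is a placeholder; only the TP53/MDM2 row follows the text.
    example = "TP53\tactivation\tMDM2\nGENE_X\tinhibition\tTP53\n"
    nodes, edges = load_network(example)
    print(nodes)
    print(edges)
```

Each edge keeps its interaction type, which is the attribute VizMapper uses to choose the arrow shape.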