Search

Content of this page:

Visualization is a helpful component in the process of interpreting results from high-throughput experiments, and can be indispensable when working with large data-sets. Within the GO, the "natural" visualization format is the Direct Acyclic Graph of a group of annotated sequences. In the DAG, each node represents a GO term. Arcs represent the relationships between the biological concepts. A problem when visualizing GO functional information of genomic data-sets is that these graphs can become extremely large and difficult to navigate when the number of represented sequences is high.

One of the functions of Blast2GO is the ability to display the annotation result of one or several sequences in the same GO graph. Within Blast2GO these graphs are called "Combined Graphs". This function generates joined GO DAGs to create overviews of the functional context of groups of annotations and sequences. Combined Graph nodes are highlighted through a color scale proportional to their number of sequences annotated to a given term. This confluence score (from now on denoted "node-Score") takes into account the number of sequences converging at one GO term and at the same time penalizes by the distance to the term where each sequence was actually annotated. Assigned sequences and scores can be displayed at the terms level.

Node Score

The node score is calculated for each GO term in the DAG and takes into account the topology of the ontology and the number of sequences belonging (i.e. annotated) to a given node (i.e. GO term). The score is the sum of sequences directly or indirectly associated to a given GO term weighted by the distance of the term to the term of "direct annotation" i.e. the GO term the sequence is originally annotated to. This weighting is achieved by multiplying the sequence number by a factor alpha [0, infinity] to the power of the distance between the term and the term of direct annotation (equation below). In this way, the node score is accumulative and the information of lower-level GO-terms is considered, but the influence of more distant information (i.e. annotations) is suppressed/decreased depending on the value of alpha. This compensates for the drawback of the earlier described method of simply counting the number of different sequences assigned to each GO-term. The alpha parameter allows this behavior to be further adjusted. A value of zero means no propagation of information and can be increased by raising alpha.

Graph Term Filtering

Combined graphs can become extremely large and difficult to navigate when the number of visualized sequences is high. Additionally, the relevant information in these cases is frequently concentrated in a relatively small subset of terms. We have introduced graph-pruning functions to simplify DAG structures to display only the most relevant information. In the case of the Combined Graph function, a cutoff on the number of sequences or the node-score value can be set to filter out GO terms. In this case the size of a graph is reduced without loosing the important information (i.e hiding tip and intermediate low informative nodes).

This approach of graph-filtering and trimming is based on a combination of different scoring schemes. On the one hand, graph filtering can be based on the number of sequences assigned to each node, and on the other hand, a graph can be "thinned out" by removing intermediatealpha nodes that are below a given cutoff. The latter approach allows a certain level of details to be maintained while drastically reducing the size of the graph by removing "unimportant" intermediate graph elements. In this way, any large GO graph can be reduced by abundance and information content instead of simply "cutting through" the Gene Ontology at a certain hierarchical level or by the use of GO-Slim definitions.

Below the molecular functions of 1000 sequences are visualized in 3 different ways. The first graph is unfiltered, the second graph shows the functional information after having applied a Go-Slim reduction. The third graph is filtered and thinned according to the number of sequences belonging to each GO-term and the node-score. All GO terms with less than 10 sequences were removed (tip nodes) and all the nodes with a node-score smaller than 12, applying an alpha of 0.4, were removed (intermediate nodes). This strategy allows the removal of terms that are less significant to a particular data-set while at the same time it maintains frequently present terms at lower levels of specificity.

Figure 1: Unfiltered Graph

Figure 2: Filtered Graph 1

Figure 3: Filtered Graph 2

Combined Graph Editor Sidepanel

General

This section controls the graph visualization within its area.

Zoom
Collapse All: The nodes will collapse and only the root will be visible.
Expand All: The graph will expand to its entire size.
Re-Layout: The whole graph will be re-scaled to adjust to the visualization area.

Find

Allows to search for GO IDs/ Terms/ Description in the Graph.

Node Info

Adjust the visible information in each node.

GO ID: If checked the GO ID will be included in the node.
GO Name: The GO Names are shown in the node.
GO Description: The GO Description will be included in the node.
Node score: The node score will be shown in the node.
Sequence Names: The names of the sequences annotated annotated with this GO are shown (max. 15).
Sequences: The number of sequences annotated with that particular GO will be displayed in the node.

Layout

Edge Labels: When checked, the labels at the edges will be shown.
Expand/Collapse Icon: If checked the ions that represent expand/collapse on the node are displayed.
Only is a Relations: Only the is a relations between nodes will be displayed if the box is checked.
Color
- Ontology: All nodes will be colored according to the ontology category, Biological Process - green; Molecular Function - blue; Cellular Component - yellow.
- White: The nodes will turn white.
- By Node score: The higher the node score, the more intense is the color (orange).
- By Sequence Count: Node color intensity will be proportional to the number of contributing sequences at the node.

Options

Sequence Filter: The minimal number of sequences, a GO node must have assigned, to be displayed. This filter is used to control the number of nodes present in the graph. Depending on the result, adjust this value until you obtain a satisfactory graph. Start with 10% of your total number of annotated sequences.
Node score Filter: The minimal node score that is necessary for a node to be displayed.
Score alpha. The value for parameter alpha in the Score formula Node Score Filter. Only nodes with a Score value higher than the Filter will be shown. Use this parameter to thin out the GO-DAG for low informative nodes.
Restore Defaults: All filters will be set to the default values.

Charts

Analysis of GO term associations in a set of sequences can also be done with pie/bar charts. Once the graph is visible, the Charts area allows the creation of 4 different charts.

Cuts through the graph at a specific level and generates a pie representation of the number of sequences per GO node.
Allows to select a minimum filter value in order to include only GO nodes with a higher Node-Score or sequence count in the resulting pie chart.
Same as the first one but in bar chart style.
Will show a bar chart with the number of sequences that have been annotated with a specific GO Term.