Help page for Browse literature or sequence neighbours.
This tool allows you to explore in detail the literature or sequence associations between a set of genes or proteins or the combination of the two. The literature networks are based on indexing all MEDLINE records (title and abstract) for gene and protein names, using a Natural Language Processing (NLP) regimen. In this process, all known alternative symbols and names are used for each gene or protein, and the occurrences of each symbol or name are collected and verified before being combined into a single set of occurrences for that gene or protein. The sequence networks are based on sequence homology computed with the ParAlign sequence homology program, a fast and accurate implementation of the Smith-Waterman algorithm.
For literature networks, there are two ways to determine the strength of an association derived from co-occurrences of two genes or proteins or the combination of the two. The basic method is to use the actual number of co-occurrences as the strength. Thus, for example, a pair of genes found together in 100 articles will be deemed more interesting than a pair of genes found together in 10 articles. This method, however, does not consider how many times each gene or protein in a pair has been found alone. Therefore, as an alternative, the probabilistic strength is offered as an option. With this method, the strength of a co-occurrence association is calculated by also considering the number of times each gene or protein has been found alone (or, more correctly, independently of the other member of the pair). The calculated strength is proportional to the probability of observing at least this many co-occurrences of the pair, given the number of occurrences of each term and under the assumption that the two terms in the pair are unrelated. Thus, the smaller the probabilistic literature association weight, the less likely a pair of genes or proteins or the combination of the two is to have co-occurred by chance. As a consequence, the probabilistic literature association weight can be interpreted in a similar way to the E-value of a sequence homology search.
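To make the difference between the two strengths concrete, the sketch below computes a chance probability for a co-occurrence count. This help page does not state which statistic PubGene actually uses, so the hypergeometric model, the function name `cooccurrence_p_value`, and the parameter names are assumptions chosen for illustration only.

```python
from math import comb


def cooccurrence_p_value(n_records, n_a, n_b, n_ab):
    """Probability of seeing at least n_ab co-occurrences by chance.

    Assumed model (not confirmed by the help page): term A occurs in n_a of
    n_records records, and the n_b records mentioning term B are drawn
    without regard to A, i.e. a hypergeometric upper tail.
    """
    if not (0 <= n_ab <= min(n_a, n_b) <= max(n_a, n_b) <= n_records):
        raise ValueError("inconsistent counts")
    total = comb(n_records, n_b)
    # Sum the probabilities of n_ab, n_ab + 1, ... shared records.
    tail = sum(
        comb(n_a, k) * comb(n_records - n_a, n_b - k)
        for k in range(n_ab, min(n_a, n_b) + 1)
    )
    return tail / total
```

With exact integer arithmetic this is only practical for small counts; it is meant to show why a pair of rarely mentioned terms that always appear together can receive a smaller (stronger) weight than a pair of frequently mentioned terms with a higher raw count.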
The sequence networks have been generated by performing an all-to-all sequence alignment between all proteins for each organism, using a cut-off of E-04 (0.0001). Thus alignments that are less significant than E-04 have not been included in the database. The gene sequence networks are derived from the protein sequence networks by mapping proteins to the corresponding gene(s).
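The derivation of a gene sequence network from protein alignments can be sketched as follows. The data layout, the function name `gene_network`, and the choice to keep the lowest E-value when several protein pairs map to the same gene pair are assumptions; only the E-04 cut-off and the protein-to-gene mapping step come from the description above.

```python
E_VALUE_CUTOFF = 1e-4  # the E-04 cut-off stated above


def gene_network(protein_hits, protein_to_genes, cutoff=E_VALUE_CUTOFF):
    """Build gene-level edges from protein alignments.

    protein_hits: iterable of (protein_a, protein_b, e_value) tuples.
    protein_to_genes: dict mapping a protein to a list of gene symbols.
    Keeping the lowest E-value per gene pair is an assumption.
    """
    edges = {}
    for prot_a, prot_b, e_value in protein_hits:
        if e_value > cutoff:
            continue  # less significant than the cut-off: not stored
        for gene_a in protein_to_genes.get(prot_a, []):
            for gene_b in protein_to_genes.get(prot_b, []):
                if gene_a == gene_b:
                    continue  # skip self-edges
                pair = tuple(sorted((gene_a, gene_b)))
                edges[pair] = min(e_value, edges.get(pair, e_value))
    return edges
```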
The Network Browser tool shows a graphical representation of the literature or sequence neighbourhood for a given set of genes or proteins or the combination of the two. Genes or proteins in the query subset are shown with bright red font in the graph. Direct neighbours are shown with a darker red font, and neighbours of neighbours are shown with black font. Nodes representing a degree of separation from terms queried in the shortest path network are colored light grey. A color code legend can be viewed by clicking on the "Graph Color Code" icon.
In addition to the graphical representation of the neighborhood, brief textual information about the genes or proteins or the combination of the two in the displayed graph is also included in the output from a query.
Explanation of input fields and parameter settings:
Input data
- Organism
- - use this to select the organism relevant for your data. Possible
organisms are:
- human (Homo sapiens)
- All organisms
- mouse (Mus musculus)
- rat (Rattus norvegicus)
- cow (Bos taurus)
- pig (Sus scrofa)
- dog (Canis familiaris)
- chicken (Gallus gallus)
- zebrafish (Danio rerio)
- fugu (Takifugu rubripes)
- fly (Drosophila melanogaster)
- worm (Caenorhabditis elegans)
- rice (Oryza sativa)
- arabidopsis (Arabidopsis thaliana)
- yeast (Saccharomyces cerevisiae)
- yeast (Schizosaccharomyces pombe)
- e.coli (Escherichia coli)
- anthrax (Bacillus anthracis)
- Bacillus cereus
- Streptococcus pneumoniae
- Staphylococcus aureus
- hiv (Human immunodeficiency virus 1)
- Network (association) type
- - use this to select the type of association network to use for your
analysis. The type may be:
- literature: co-occurrence count that corresponds to the number of MEDLINE records (title/abstracts) where the search term has been found with any other term (of query term type) in the database
- literature (prob): probabilistic (or statistical) literature association that will report a statistical strength linking the search term with other terms (of query term type) in the database.
- sequence: sequence homology that will report the probability (E-value) of the alignment between the search term and other terms.
- shortest path: will report the co-occurrence count that corresponds to the shortest path from term1 to term2 for a list of any submitted terms.
- Query term type
- - use this to select how to interpret the identifiers used in your input
data, i.e., the type of biological entity to work with. The
query term type may be:
- gene(s),
- protein(s) or
- gene and protein(s).
- Search terms
- - use this field to input one or more search (query) terms. The submitted search terms must correspond to the selected term type. Multiple search terms should be separated by comma or semi-colon.
- File to upload
- - use this field to select the terms file to upload. Note, if your input file contains empty lines, these will be kept in the output listing.
- Index for ID column
- - use this field to set the terms file column index for the input identifiers.
- Number of header rows to skip
- - use this field to set the number of rows to skip from the top of the terms file input.
- Column header row?
- - check this box to preserve the first (remaining) terms file input row as column headers (and thereby also cut from analysis input).
- Keyword
- - use this field to enter a search keyword to explore a network of genes or proteins or the combination of the two associated with that keyword. The keyword may be any annotation (or simply part of one) from MeSH, GO or Chemicals & Compounds and may include regular expression characters.
This tool uses an automatic term mapping facility that will map any of gene symbols (primary or alias), protein symbols (primary or alias), Affymetrix probe set ids, UniGene cluster ids, GenBank accession number, or IMAGE clone ids to either primary gene or protein symbols.
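The "Search terms" and terms file options described above can be pictured with the following sketch. It is not the tool's own parser: the function names, the tab separator, and the reading of "Index for ID column" as 1-based are all assumptions.

```python
import re


def split_terms(text):
    """Split a search-term string on commas or semi-colons."""
    return [t.strip() for t in re.split(r"[;,]", text) if t.strip()]


def read_terms_file(lines, id_column=1, skip_rows=0, has_header=False, sep="\t"):
    """Return (header, ids) from the lines of a terms file.

    id_column is treated as 1-based and sep as a tab; both are assumptions.
    Empty lines are kept as empty strings so they survive into the output.
    """
    rows = list(lines)[skip_rows:]          # "Number of header rows to skip"
    header = None
    if has_header and rows:                 # "Column header row?"
        header = rows[0].rstrip("\n").split(sep)
        rows = rows[1:]                     # the header row is not analysed
    ids = []
    for row in rows:
        fields = row.rstrip("\n").split(sep)
        ids.append(fields[id_column - 1].strip() if len(fields) >= id_column else "")
    return header, ids
```

For example, a file with one comment line, one header line, and the identifiers in the second column would be read with `skip_rows=1`, `has_header=True` and `id_column=2`.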
Parameters determining output format
- Edge weight limit
- - use this to set the minimum or maximum weight (strength) for including
an association in a network. The behaviour depends on the network (association)
type as follows:
- literature (co-occurrence count): for this network type, the limit will be a minimum, i.e., no edge with a lower co-occurrence count will be included.
- literature (probabilistic) or sequence homology: for these network types, the limit will be a maximum, i.e., no edge that is more probable to be random will be included.
- Max nodes
- - use this to set the maximum number of nodes to include in a network.
- Max depth
- - use this to set the maximum depth at which to include nodes displayed in a network. The input search term(s) are defined to have depth 1. Thus, depth 2 corresponds to neighbours of the input search term(s), and depth 3 corresponds to neighbours of neighbours, and so on.
- Max branch
- - use this to set the maximum number of neighbours of input search term(s) to be included. Note, this is a soft limit in the sense that in case of ties (when a node has several neighbours with the same association strength), the algorithm will try to add up to 3 more neighbours on the same level if they have the same association strength to the current node.
- Max branch (2)
- - use this to set the maximum number of neighbours of non-input search term(s) to be included. Note, this is a soft limit in the sense that in case of ties (when a node has several neighbours with the same association strength), the algorithm will try to add up to 3 more neighbours on the same level if they have the same association strength to the current node.
- Primary Keyword genes or proteins or the combination of the two
- - use this field (if you have entered a Keyword) to select the number of primary genes or proteins or the combination of the two associated with the keyword. The PubGene program will then explore the literature for neighbours of these primary genes or proteins or the combination of the two, based on the "Max depth" you have chosen. Each gene or protein neighbour that is selected to be on the network will have a literature relationship to the keyword. The primary genes or proteins or the combination of the two will be colored red on your network.
- Annotate Graph
- - use this function to allow the PubGene program to assign a biological annotation to each of the genes or proteins or the combination of the two on your network. The types of annotations you can choose from are as follows:
- Chemical & Compounds
- Biological Functions from the "Gene Ontology" (GO)
- Biological Processes from the "Gene Ontology" (GO)
- Biological Components from the "Gene Ontology" (GO)
- Diseases from the "Medical Subject Headings" (MeSH)
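The interplay of "Edge weight limit", "Max nodes", "Max depth", "Max branch" and "Max branch (2)" can be illustrated with a small breadth-first expansion. This is a sketch under stated assumptions, not the PubGene algorithm: the graph layout (a dict of dicts), the function names, the default values, and the decision to rank only neighbours not yet in the network are all hypothetical. The direction of the limit and the tie allowance of up to 3 extra neighbours follow the descriptions above.

```python
def passes_limit(weight, limit, network_type):
    """Edge weight limit: a minimum for counts, a maximum otherwise."""
    if network_type == "literature":
        return weight >= limit
    return weight <= limit  # "literature (prob)" or "sequence"


def expand(graph, query, network_type, limit,
           max_nodes=50, max_depth=3, max_branch=5, max_branch2=3, tie_extra=3):
    """Breadth-first expansion returning {node: depth}; query terms have depth 1."""
    stronger_first = network_type == "literature"
    depth = {term: 1 for term in query if term in graph}
    frontier = list(depth)
    while frontier and len(depth) < max_nodes:
        next_frontier = []
        for node in frontier:
            if depth[node] >= max_depth:
                continue
            # "Max branch" applies to input terms, "Max branch (2)" to the rest.
            branch = max_branch if depth[node] == 1 else max_branch2
            ranked = sorted(
                ((w, n) for n, w in graph.get(node, {}).items()
                 if n not in depth and passes_limit(w, limit, network_type)),
                key=lambda item: (-item[0] if stronger_first else item[0], item[1]),
            )
            kept = ranked[:branch]
            # Soft limit: up to tie_extra more neighbours tied with the last kept one.
            for w, n in ranked[branch:branch + tie_extra]:
                if kept and w == kept[-1][0]:
                    kept.append((w, n))
            for _, n in kept:
                if len(depth) < max_nodes:
                    depth[n] = depth[node] + 1
                    next_frontier.append(n)
        frontier = next_frontier
    return depth
```

In this sketch a co-occurrence network ranks higher counts first, while probabilistic and sequence networks rank lower values first, mirroring the minimum versus maximum behaviour of the edge weight limit.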
Explanation of output
Features of the displayed graph
The displayed graph shows the subset network and possibly additional literature or sequence neighbors of the genes or proteins or the combination of the two in the submitted set with edges reflecting the associations. The numbers next to the links correspond to the number of literature co-occurrences or the probability of the sequence homology match between the linked genes or proteins or the combination of the two. The node names are color-coded to reflect literature or sequence closeness to the submitted set of genes or proteins or the combination of the two. For the genes or proteins or the combination of the two in the submitted set, a bright red color is used. Direct neighbors of genes or proteins or the combination of the two in the submitted set have a darker red color, while second-level neighbors have been colored in black.
Additional information
To the right hand side of the graph, brief information about the displayed genes or proteins or the combination of the two is shown. The columns in this table are:
- ID: The PubGene ID for the gene or protein, i.e., Primary Symbol.
- D: The depth, relative to the submitted set of genes or proteins or the combination of the two, at which the gene or protein was found.
- Name: The official full name of the gene or protein.
- DBs: The nomenclature source database(s) where this information can be found.
Nomenclature Note
The PubGene tools use case-sensitive primary symbols to identify genes and proteins. The correct syntax depends on the organism and biological entity. In general, the identifier (primary symbol) for a gene is not the same as the identifier for the corresponding protein(s), although in many cases the two are similar. Moreover, for many genes or proteins or the combination of the two, the identifier may differ across organisms. The difference, however, is often only in case (lower versus upper).
PubGene tools utilize an automatic lookup to find the correct primary symbol (identifier) for an input query term. This allows the user to input a gene or protein alias and/or a symbol with non-standard capitalization. For a given input term type, the lookup will try to find the best match for a given input query term in the following way:
- As primary symbol: Does the input string correspond to a primary symbol?
- As case translation of a primary symbol: Does the input string correspond to a primary symbol when disregarding capitalization?
- As alias symbol: Does the input string correspond to an alias symbol?
- As case translation of an alias symbol: Does the input string correspond to an alias symbol when disregarding capitalization?
- As primary symbol for the corresponding protein if the query term type is gene and vice versa.
- As an Affymetrix probeset ID.
- As a UniGene cluster ID: Does the input string match a UniGene cluster ID of the selected organism; note, the UniGene ID must include the two-letter organism code and the period (dot) between the organism code and the number.
- As an IMAGE clone ID; only all-numeric input strings may match.
- As a GenBank Accession number.
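The lookup order listed above can be sketched as a simple cascade that stops at the first match. The function name `resolve_symbol`, the in-memory tables, and the `fallbacks` callables (standing in for the cross-type, Affymetrix, UniGene, IMAGE and GenBank lookups) are hypothetical; the real lookup tables are not described on this page.

```python
def resolve_symbol(term, primary, aliases, fallbacks=()):
    """Return the primary symbol for term, or None if nothing matches.

    primary: set of primary symbols for the selected term type.
    aliases: dict mapping alias symbol -> primary symbol.
    fallbacks: ordered callables (term -> symbol or None), a stand-in for
    the remaining identifier types in the list above.
    """
    if term in primary:                      # exact primary symbol
        return term
    folded = term.lower()
    for symbol in sorted(primary):           # primary symbol, ignoring case
        if symbol.lower() == folded:
            return symbol
    if term in aliases:                      # exact alias symbol
        return aliases[term]
    for alias in sorted(aliases):            # alias symbol, ignoring case
        if alias.lower() == folded:
            return aliases[alias]
    for lookup in fallbacks:                 # remaining identifier types, in order
        symbol = lookup(term)
        if symbol is not None:
            return symbol
    return None
```

For instance, with `primary = {"TP53", "MDM2"}` and `aliases = {"p53": "TP53"}`, the inputs `tp53` and `p53` would both resolve to `TP53`, the first by case translation and the second as an alias.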
As gene identifiers, PubGene generally uses the official gene symbol from the official nomenclature committee(s) for the various organisms.
As protein identifiers, PubGene generally uses the corresponding Swiss-Prot identifier without the _ORGANISM string.
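As a small illustration of the protein identifier convention, the sketch below strips the organism suffix from a Swiss-Prot entry name. The function name `protein_symbol` is hypothetical, and because PubGene only "generally" follows this rule, the result should be treated as a likely rather than guaranteed primary symbol.

```python
def protein_symbol(entry_name):
    """Drop the trailing _ORGANISM part of a Swiss-Prot entry name."""
    # rsplit keeps any earlier underscores; names without one are unchanged.
    return entry_name.rsplit("_", 1)[0]
```

Under this convention, `P53_HUMAN` becomes `P53` and `CASP3_MOUSE` becomes `CASP3`.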
When PubGene creates association networks or associates Chemical & Compound, MeSH and GO terms to genes and proteins, PubGene uses all known gene and protein aliases and then combines information from all aliases for each gene and protein.