Project Name This is an arbitrary name of the user's choosing (no spaces or non-alphanumeric characters allowed). It identifies the folder where the sequence data and its associated results are stored. A project can store multiple sequence files (input separately) and their associated results. The contents of the project folder can be retrieved in UNIX tar format.
Sequence file Name This is an arbitrary name of the user's choosing (Project Name restrictions apply) that identifies the current set of sequences being input; e.g., the regulatory sequences for all the genes controlled by a common set of factors should reside in one sequence file. A sequence file (and its associated results) resides in a project.
Uploading Data In various places the user may upload data from their computer (e.g., fasta sequence files, gff files of genomic regions, or binding site motif files). In each case there are formats specific to the file type, and an overriding restriction that the files be text only (e.g., *.txt), such as those produced by the plain-text save options in MS Word or Mac TextEdit, or by the Linux editors vi and emacs. Limited format checking is done when files are uploaded, and bad characters are flagged.
Multiple Species Whether the user wants to work with sequence from a single species or from multiple species. This option defines how sequence is extracted from the fly genomes or how uploaded fasta files are interpreted. With the 'multiple' option, a fasta file is interpreted as
>reference_species_gene1
>secondary_species_gene1
>reference_species_gene2
>secondary_species_gene2
When inputting genomic coordinate information, a second species must be selected if the 'multiple' option has been used.
Format for ortholog extraction transcript If you specified a set of loci (sequence coordinates) and asked for their orthologous sequences to be retrieved, a transcript of the ortholog extraction procedure is printed, in the following format. There is a header line, such as ">2R_5047275_5048074", for each sequence extracted, followed by a single line of description, such as "2R:5047275:5048074 ---> 2R_5010459_5049682_Contig3083_Contig4177_808609_845405". This means that the sequence "2R:5047275:5048074" was extracted from the larger D. melanogaster context "2R:5010459..5049682", which is orthologous to the secondary species sequence "Contig_3083_Contig_4177:808609..845405". The ortholog for the query sequence was extracted from this larger orthologous context.
Extracted peaks format Each extracted "peak", i.e., a window that scores higher than all overlapping windows, and satisfies certain user-specified criteria, is reported in the following format:
"Position: " is followed by the offset of the starting position of the window in the input sequence. In case of Drosophila sequences extracted using the "Sequence Upload" page, the absolute position on the chromosome is indicated. "Nuc: " is followed by the length of the window. "Word_av_length: " is to be ignored. "Free energy: " is followed by the free energy (score) of this window, representing a log-likelihood score; the higher, the better.
The header line is followed by a description of the composition of the window. The first column is the motif name, and the second column is the probability associated with the indicated motif or background. These probabilities sum to 1 and are proportional to the third column. The third column is the average number of sites of the indicated motif in the window, the average being taken over all possible segmentations of the window into motifs and background.
Parameters for extracting peaks Stubb slides a fixed length window over the input sequence, and computes the score (free energy) and composition (estimated numbers and types of binding sites) for each window. You may then extract the most interesting predictions in the form of "peaks" in the free energy profile -- windows that have a higher score than all overlapping windows -- that satisfy certain additional criteria that you specify:
1. Minimum number of factors: There should be at least this many motifs with site strength greater than some threshold, specified by the next parameter.
2. Factor min occ: The total site strength (average count) of a motif must be above this threshold in order to count in the previous criterion. (A real number > 0.)
3. Free energy cutoff: The free energy (score) of a window must be above this threshold. Note that a threshold with the same semantics was already specified in an earlier stage of Stubb input. The new threshold has to be stronger (higher) than that threshold.
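The three criteria above can be sketched as a filter over scored windows. This is only an illustration, not Stubb's own code: the `Window` record, the function names, and the rule that two windows "overlap" when they share at least one position are all assumptions. The `occupancy` dictionary is assumed to hold motif entries only (no background row).

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    start: int            # offset of the window in the input sequence
    length: int           # window length in bp
    free_energy: float    # window score (higher is better)
    occupancy: dict = field(default_factory=dict)  # motif name -> average site count


def overlaps(a: Window, b: Window) -> bool:
    """True if the two windows share at least one position (assumed definition)."""
    return a.start < b.start + b.length and b.start < a.start + a.length


def extract_peaks(windows, min_factors, factor_min_occ, free_energy_cutoff):
    """Return windows that are local maxima and pass the three user criteria."""
    peaks = []
    for w in windows:
        # Criterion 3: the window score must be above the cutoff.
        if w.free_energy <= free_energy_cutoff:
            continue
        # Criteria 1 and 2: enough motifs with site strength above the threshold.
        strong = [m for m, occ in w.occupancy.items() if occ > factor_min_occ]
        if len(strong) < min_factors:
            continue
        # A peak scores strictly higher than every other overlapping window.
        if all(w.free_energy > o.free_energy
               for o in windows if o is not w and overlaps(w, o)):
            peaks.append(w)
    return peaks
```

Note that the local-maximum test is made against all windows, not only those that pass the other criteria, following the definition of a peak given above.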
Raw Output This button leads you to the TXT format of the original Stubb result. Some of the files may be empty because none of the results is above the threshold you set.
Specify the sequences We allow multiple sequences to be specified together in a sequence file. These can either be created directly by the user as a fasta file or generated by the web site relative to a known 'reference' genome (currently only D.melanogaster). When the sequences are extracted from a reference genome, there is the option to use stored alignments to create the 'secondary' species sequence. At most one secondary species can be specified. Currently supported are
| Reference | Secondary |
| D.melanogaster Release 3.1 | D.yakuba, D.ananassae, D.virilis, D.mojavensis, D.pseudoobscura |
Fruitfly Genomic Coordinates You may specify genomic intervals in the Fruitfly genome as your input sequence(s). The coordinates are with reference to Release 3.1 of the fruitfly genome. If you do not know the coordinates of your regions of interest, there is a link to Gbrowse on the sequence input page that allows you to search by gene name and other landmarks.
Using Fly Gbrowse to get coordinates You may use the genome browser to search for features (e.g., genes) and create a plain text GFF format file on your computer, to upload and extract the sequences. You will have to read off the start and end coordinates from the browser and create a name for each sequence in the GFF format file. GFF format is explained here.
GFF This is a tab-separated file format with each line representing a genomic interval.
Currently we use only the first five fields of the file:
chromosome source name start end
The "name" field is used in creating the Fasta id of the extracted sequence.
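A GFF file with these five fields can be written by hand in a text editor, or generated with a few lines of code. The sketch below is an assumption-laden illustration: the function names and the placeholder source value "user" are hypothetical, and the site may accept or ignore other values in the source field.

```python
def gff_line(chromosome, name, start, end, source="user"):
    """Build one tab-separated line: chromosome, source, name, start, end."""
    if start > end:
        raise ValueError("start must not exceed end")
    return "\t".join([chromosome, source, name, str(start), str(end)])


def write_gff(path, intervals):
    """Write (chromosome, name, start, end) tuples to a plain-text GFF file."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for chromosome, name, start, end in intervals:
            handle.write(gff_line(chromosome, name, start, end) + "\n")
```

For example, `write_gff("regions.gff", [("2R", "region1", 5047275, 5048074)])` produces a one-line file whose "name" field, region1, would be used in the Fasta id of the extracted sequence.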
Select matching contig This option defines how the matching (orthologous) secondary species sequence is chosen for each reference sequence interval, in case the latter has multiple matches on the secondary species genome.
0. Use the contig/scaffold with the best match to the reference sequence. There is no interaction with the user, and the entire reference sequence may not be matched.
1. Not supported currently. For each query, display all plausible alignments that match the reference sequence, and allow the user to choose one or more interactively.
2. Follows Option 1 (use all plausible alignments that match the reference sequence), but for nested alignments (e.g., as one might get from repetitive sequences) keep only the best one. No user input allowed.
Why is the number of returned regions more than what I input? If a specified sequence in the reference species has more than one good match to contigs in the secondary genome, all matching contigs will be considered, leading to multiple matching sequences being returned. The default option of 'Select matching contig' from the previous page is option 2; this option keeps all plausible alignments for each reference sequence. If you just want the best match, you could go back and choose option 0 from 'Select matching contig'.
Augment query interval This option allows the user to increase the size of both the reference and secondary sequence extracted to coincide with a high confidence ungapped block in the alignment. The augmentation applies to both the 5' and 3' ends of the query.
0. Use the ref coordinates exactly as input. Define the limits of the sec sequence by homology if they fall in a block; otherwise, make the distance from the nearest interior block identical to that in ref.
1. If a reference coordinate falls interior to a block, take the entire block.
2. If a reference coordinate falls between blocks, go to the edge of the next exterior block (if possible).
3. Include all of the next exterior block (if possible).
DNA sequences in Fasta format You may upload your own sequences in Fasta format. If the multiple species option has been selected, a fasta file is interpreted as
>reference_species_gene1
>secondary_species_gene1
>reference_species_gene2
>secondary_species_gene2
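The alternating reference/secondary layout can be checked before uploading. The following is a minimal sketch, not the server's parser: the function names are hypothetical, and it assumes that records strictly alternate between the reference and the secondary species, as described above.

```python
def read_fasta(text):
    """Parse Fasta text into a list of (header, sequence) tuples, in file order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pair_species(records):
    """Group alternating records into (reference, secondary) pairs."""
    if len(records) % 2 != 0:
        raise ValueError("multiple-species input needs an even number of records")
    return [(records[i], records[i + 1]) for i in range(0, len(records), 2)]
```

An odd number of records raises an error, which is a quick way to spot a reference sequence that is missing its secondary-species partner.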
Motifs We use the Stubb format for these, e.g.,
>bicoid 11 PSEUDO_COUNT 1.0
2 10 10 8
5 12 4 9
3 0 0 27
27 0 3 0
30 0 0 0
0 0 7 23
1 29 0 0
3 13 3 11
4 15 11 0
2 9 13 6
6 14 6 4
<
The columns correspond to ACGT. The pseudocount is optional and overrides the program default of 0.5. Multiple motifs can be specified in one file by replicating blocks of lines with this format. If a motif file is uploaded, the predefined motifs are not used.
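A motif file in this layout can be sanity-checked before upload. The sketch below is an illustration under assumptions, not the parser used by Stubb: it assumes a header of the form ">name length [PSEUDO_COUNT value]", one row of four counts (A, C, G, T) per motif position, and an optional "<" line closing each block. The function name and the returned dictionary layout are hypothetical.

```python
DEFAULT_PSEUDO_COUNT = 0.5  # program default quoted in the text above


def parse_motifs(text):
    """Parse Stubb-format motif text into a dict keyed by motif name."""
    motifs = {}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            raise ValueError(f"expected a '>' header, got: {lines[i]!r}")
        fields = lines[i][1:].split()
        name, length = fields[0], int(fields[1])
        pseudo = DEFAULT_PSEUDO_COUNT
        if len(fields) >= 4 and fields[2] == "PSEUDO_COUNT":
            pseudo = float(fields[3])
        # One row per motif position; columns are A, C, G, T.
        rows = [[float(x) for x in ln.split()]
                for ln in lines[i + 1:i + 1 + length]]
        if len(rows) != length or any(len(r) != 4 for r in rows):
            raise ValueError(f"motif {name}: expected {length} rows of 4 counts")
        motifs[name] = {"pseudo_count": pseudo, "counts": rows}
        i += 1 + length
        if i < len(lines) and lines[i] == "<":  # optional block terminator
            i += 1
    return motifs
```

A motif whose declared length disagrees with the number of rows raises a `ValueError`, which catches the most common hand-editing mistake.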
Fixed Transition Probabilities Stubb and Windowfit learn the transition probabilities of each motif from the sequence. This option forces the programs to use a fixed transition probability (0.0025) for all non-background motifs. This is for consistency with GenomeSurveyor.
Default Background Sequence The background sequence is supposed to represent the typical nucleotide composition of non-coding sequence, against which the input sequences are contrasted. The default is a set of non-coding (upstream and downstream) sequences near core genes of the segmentation pathway of the fly embryo.
Upload background sequence If you wish to use your own set of Fasta sequences as background, upload them here. The size limit is the same as for the input sequences: 100 Kbp for Stubb; 100 Kbp total and 10 Kbp per sequence for Windowfit.
Markov Order This parameter specifies how the background sequence model captures adjacent nucleotide dependencies. A value of k means that (k+1)-mers are counted in the background sequences, and a kth order Markov model constructed, to be used as the background model.
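To make the role of k concrete, the following sketch counts (k+1)-mers and turns them into the conditional probabilities of a kth order Markov model. It is a simplified illustration only: the function name is hypothetical, no pseudocounts or smoothing are applied, and words containing symbols other than A, C, G, T are skipped, all of which may differ from what Stubb actually does.

```python
from collections import Counter, defaultdict


def markov_background(sequences, k):
    """Estimate P(last base | previous k bases) from (k+1)-mer counts.

    Returns a dict mapping each observed (k+1)-mer to the probability of its
    final base given its first k bases.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k):
            word = seq[i:i + k + 1]
            if set(word) <= set("ACGT"):  # skip words with N or other symbols
                counts[word] += 1
    # Total count of each k-base context, summed over the possible next bases.
    totals = defaultdict(int)
    for word, n in counts.items():
        totals[word[:k]] += n
    return {word: n / totals[word[:k]] for word, n in counts.items()}
```

With k = 0 this reduces to single-nucleotide frequencies; with k = 1 each probability is conditioned on the preceding base, and so on.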
Phylogeny (mu) This is relevant only when working with two species. It is a real number between 0 and 1, representing the neutral substitution probability between the two species. For fruitfly, a value of 0.5 is recommended. A value of 0 means that the two species are very close or identical, and a value of 1 means that the two species are very distant.
Window Shift and Length Stubb slides a fixed length window over the input sequences, scoring each window, and outputs the top scoring windows. The (fixed) length of the sliding window is called the "Window Length", and the shift (slide) in its starting position, in each iteration, is called the "Window Shift". The default (and recommended) value of the Window Length parameter is 500 bp; a typical cis-regulatory module has length 300-1000 bp. The default window shift of 50 bp is recommended; lower values will lead to longer execution times.
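The effect of the two parameters on the number of windows scored can be seen with a short calculation. This sketch assumes that only full-length windows are scored and that the first window starts at offset 0; how Stubb treats a trailing partial window is not stated here, and the function name is hypothetical.

```python
def window_starts(seq_length, window_length=500, window_shift=50):
    """Return the start offsets of all full-length windows over a sequence."""
    if window_length <= 0 or window_shift <= 0:
        raise ValueError("window length and shift must be positive")
    # The last valid start is seq_length - window_length; the range is empty
    # when the sequence is shorter than one window.
    return list(range(0, seq_length - window_length + 1, window_shift))
```

Under these assumptions a 1000 bp sequence yields 11 windows at the default shift of 50 bp and 21 windows at a shift of 25 bp, which is why lower shift values lengthen the run.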
Minimum probability for a binding site to be reported A real value in the range 0-1. Stubb does a probabilistic segmentation of an input sequence (window), and hence computes a posterior probability of a binding site being located at any (start) position, for all positions of the sequence, and for each type of binding site. This information is then reported, but after removing all binding site occurrences with posterior probability below a certain threshold. This parameter allows the user to set the threshold for reporting binding site occurrences. A value of 1 means only very strong sites will be reported, while a value of 0 is the most permissive.
Minimum score for a module to be reported A real value greater than 0. Stubb computes the log likelihood ratio score for each position of the sliding window on the input sequence, and outputs detailed information about each window that scores above a certain threshold, which is this parameter.
Minimum z-score of motif plotted in information content plots Apart from plotting Stubb's output graphically, Windowfit also plots the occurrences of input motifs as determined by a "likelihood" score, which is simply the negative log probability of sampling a site from a motif weight matrix. All putative sites with likelihood score significantly greater than random expectation are plotted. This parameter is the minimum z-score of likelihood score for which sites are plotted.
Lagan Parameters In order to run two-species Stubb, the input homologous sequences must first be run through the Lagan alignment tool to identify the highly conserved blocks that will be treated by Stubb as regions of common evolutionary descent. The Lagan run requires four parameters, similar to the parameters of a Smith-Waterman pairwise algorithm with affine gap penalty:
match score (-mt) > 0
mismatch penalty (-mis) < 0
gap start penalty (-gs) <= 0
gap continue penalty (-gc) <= 0
Once Lagan has been run to perform a global alignment of the input pair of sequences, a post-processing tool is run to extract highly conserved blocks -- these are ungapped blocks of a certain minimum length (minimum block length) and a certain minimum percentage identity (minimum block percent identity).
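The block extraction step can be illustrated with a simplified sketch. This is not the site's post-processing tool: it assumes the alignment is given as two equal-length gapped strings with "-" for gaps, treats each maximal gap-free run as a candidate block, and keeps a run only if it meets both thresholds. The real tool may, for example, trim runs to find high-identity sub-blocks. All names are hypothetical.

```python
def conserved_blocks(ref_aln, sec_aln, min_length, min_identity):
    """Find ungapped alignment runs meeting length and percent-identity cutoffs.

    Returns (ref_start, ref_end, sec_start, sec_end, percent_identity) tuples
    with 0-based, end-exclusive offsets into the ungapped sequences.
    """
    if len(ref_aln) != len(sec_aln):
        raise ValueError("aligned sequences must have equal length")
    blocks = []
    ref_pos = sec_pos = 0      # offsets in the ungapped sequences
    run_len = matches = 0      # length and match count of the current run
    run_ref = run_sec = 0      # start offsets of the current run

    def close_run():
        # Keep the finished run only if it passes both thresholds.
        if run_len >= min_length:
            identity = 100.0 * matches / run_len
            if identity >= min_identity:
                blocks.append((run_ref, run_ref + run_len,
                               run_sec, run_sec + run_len, identity))

    for a, b in zip(ref_aln.upper(), sec_aln.upper()):
        if a == "-" or b == "-":
            close_run()
            run_len = matches = 0
        else:
            if run_len == 0:
                run_ref, run_sec = ref_pos, sec_pos
            run_len += 1
            matches += a == b
        ref_pos += a != "-"
        sec_pos += b != "-"
    close_run()
    return blocks
```

Raising the minimum block length or the minimum percent identity can only shrink the list of blocks returned by this sketch.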
Format explanation for Extracted Blocks file:
I got the error Could not fetch FILENAME.ann:Not found
Downloading a project
A project is simply a directory containing one or more fasta files
(input by the user), and their associated results, accumulated during
the user's activities on the project. The user may download the entire
contents of the project directory by clicking the "download project"
link, and "untar"-ing the downloaded tar file. This will create a local
copy of the project directory, on the user's machine. The user should then
locate and view the "log.html" file in this directory, through a browser.
This will have logs of all user activity on the project, and links to
all the associated results. All links to locally stored files (part of the
downloaded project directory) will be static links requiring no
internet connectivity. However, there will also be links to dynamic pages
that create content online, and such links will require internet connectivity.
New extracted local peaks result
user id A user's machine is assigned a unique identifier when the user
first accesses the web site from that machine. This identifier is displayed
at the top of the web page, and is also stored on the user's machine as a
"cookie". If the user plans to access their data from other machines, or plans
to delete the machine's cookies in the future, this identifier should be
noted down. To present their user id to the site, and thus gain access to all
the data and results associated with that user id, the user must choose the
"authenticate" link on the welcome page, and enter this user id. The user id is
of the form '0.dddddddddddddd', i.e., 14 digits preceded by '0.', and only ids of this
form are recognized by the system.
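If you keep a record of your user id, a quick check against the stated form can catch copying mistakes before you try to authenticate. This is a minimal sketch based only on the description above ('0.' followed by exactly 14 digits); the function name is hypothetical, and the server may apply further checks.

```python
import re

# '0.' followed by exactly 14 decimal digits, as described above.
USER_ID_PATTERN = re.compile(r"0\.[0-9]{14}")


def is_valid_user_id(user_id: str) -> bool:
    """True if the string has the form 0.dddddddddddddd (14 digits)."""
    return USER_ID_PATTERN.fullmatch(user_id.strip()) is not None
```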
How do I find the coordinates of my gene?
How do I access a project/file that I have worked with earlier on the
web server?
Is my data private?
Should I record my file names for reuse later?
How do I display multiple tracks of Stubb free energy profiles on gbrowse?
Why doesn't Gbrowse show me my region?
What is the meaning of the windowfit plots?
What are all those files in the download directory?
(x1, y1) = (x2,y2) p
indicates an ungapped locally aligned block at offset x1...y1 in the reference sequence, aligned to offset x2...y2 in secondary sequence, with percent identity p.
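Lines in this layout can be read into numbers with a small parser. The sketch below assumes only the layout shown above, "(x1, y1) = (x2,y2) p", with integer offsets and a numeric percent identity; the exact spacing in real files is not known here, so the pattern tolerates optional spaces. The names are hypothetical.

```python
import re

# Matches "(x1, y1) = (x2,y2) p" with flexible spacing.
BLOCK_LINE = re.compile(
    r"\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s+([0-9.]+)"
)


def parse_block_line(line):
    """Parse one Extracted Blocks line into offsets and percent identity."""
    match = BLOCK_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a block line: {line!r}")
    x1, y1, x2, y2 = (int(g) for g in match.groups()[:4])
    return {"ref": (x1, y1), "sec": (x2, y2), "identity": float(match.group(5))}
```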
You need to explicitly save the free energy track for the current Stubb run before you can view it.
The result can be displayed by clicking the link "Each Window Scoring above Free Energy Threshold (text)" on the same page.
It reflects the new parameters you chose to define the significant peaks. Changing parameters may not result in a change in the selected peaks.
>2R_430000_449724 Mel
>scaffold_337119_363192
>2R_445454_460000 Mel
>scaffold_24363_42591
The first and third sequences are from D.melanogaster, and the second and fourth sequences are their orthologs from D.mojavensis. The input region in D.melanogaster was broken into two fragments, each fragment matching a separate scaffold from the D.mojavensis assembly.
Bring up the Gbrowse fly genome browser from the sequence input page and search. You will then have to create a file in GFF format on your computer to upload and extract the actual sequence(s) from our database.
Go to the Stubb web page or the Windowfit web page, and simply type in
the name of your project and sequence file name in the appropriate text box
fields. You do not have to upload sequences in such a case; the server will
locate the file name in the specified directory and use it for further analysis.
A random number is created the first time you access stubb from your computer and appended to your project name and saved in a cookie by your browser. Other users who access stubb from the same machine will get a new number. To learn more about the authentication, click here.
No, the 'manage and view projects' page will have a log of the programs you have run listed by the file on which they operated.
On the Stubb results page the first or second box allows you to preview the
current Stubb run, and the third box to save the track in a cumulative file under the
sequence name. The track should be given a unique name (eg based on the matrices
or second species) at that time. After saving a project directory the *.ann files
can be uploaded by hand to a local version of gbrowse, and managed that way.
To view a specific genomic region in Gbrowse, simply type in the coordinates of the region in the text field named "Landmark or Region" on the Gbrowse web page. (The format for this is, e.g., 2L:80000..100000.)
The first panel(s), titled "Plot for..", are just a graphical representation of the Stubb output.
There are colored bars indicating the position for each binding factor. The bar height encodes the
occupancy (profile value). There are additional lines of bars created if binding sites overlap. The
combined occupancy of all overlapping factors cannot exceed 1. The total occupancy for each factor
in the window is listed next to the factor's name.
For the two species plot, there are
also colored line segments (colors cycled modulo 3) that depict corresponding LAGAN
aligned blocks between the species. Binding sites overlapping these blocks are fit with an evolution
model and the occupancy is then the same in the two species. The actual sequences of the binding
sites can be found in the text plot window.
(Options on the sequence extraction page will extend both the reference and secondary species sequence to include the next block upstream and downstream.) The free energy score is for the entire window and, when parameters correspond, will be similar to the free energy tracks output by Stubb for Gbrowse. (Differences are caused by the context dependence of LAGAN alignments.)
Below the stubb generated plots, there are Information Content Plots, which are just scans of individual weight matrices against the sequence. Positions scoring better than a minimal z-score (number of standard deviations beyond random) are shown.
To navigate through the download folder, start with the log.html file, which lists by sequence name the programs that were run on that sequence, with local links to the results within the downloaded directory where feasible. For two-species Stubb runs, the ref. sequences/coordinates input may be accessed in pieces due to the interspecies alignments.
These files and those derived from them have names beginning with the location
(contig_beg_end) in the ref. and sec. species.
File extensions:
| .mfa, .mfasta, .fa | : various fasta format sequence files |
| .prof, .dict, .fen, .parameter | : raw Stubb output |
| .align | : alignment blocks output by LAGAN |
| .parse | : ranked list of free energy peaks from Stubb run |
| .ann | : the free energy tracks in gbrowse format |
| .html, .png, .map | : html format files for browser viewing, and associated images. |