Welcome to Scotty!

Scotty is used to plan RNA-Seq experiments that measure differential gene expression.

At the start of every experiment, someone must ask the question, "How many reads do we need to sequence?"

If you are new to RNA-Seq you may find it helpful to read our tutorial, Thinking About RNA Seq Experimental Design. The tutorial is intended for people who do not have an extensive background in statistics and computational biology, for example, a first-year graduate student in wet lab biology.

Input Data

A prototype dataset is required for power calculations. The prototype dataset is used to measure how much variance is present between replicates. This largely determines how many replicates are required. This data is also used to quantify how many genes are detected as more reads are sequenced. This largely determines how deeply each replicate needs to be sequenced.

There are two options for obtaining prototype data. The first is to use an existing dataset from other people's experiments. Several publicly available datasets are preloaded. Additional datasets will be added upon user request. Please send an email to busbym at bc dot edu with a link to the dataset you would like loaded. We can only load datasets that include biological replicates and may not be able to load datasets that are subject to embargo.

The second option is to use pilot data generated within your own lab. Pilot data is preferable. We have found that there is a great deal of inter-experiment variability in both the variation between replicates and the rate of gene discovery, even between samples using the same species and tissue type.

However, when it is not feasible to generate pilot data, existing experiments with a very similar protocol may serve as an acceptable substitute, and will of course be better than no data at all.

Generating Pilot Data

The most accurate approach

Scotty requires, at a minimum, two replicates of the same condition sequenced to a read depth where a reasonable portion of genes are measured. This does not have to be as deep as the final expected sequencing run, and reasonable results may be found with fairly low depths.

However, more data will lead to improved predictions. Scotty will generate the most accurate results if you have two replicates of both the control and the test condition. This allows Scotty and the user to see how variance differs between conditions. However, we recognize that this is not always feasible. If replicates are not available for one of the conditions Scotty assumes that both conditions have the same variance.

What type of replicates should I use?

There are two types of replicates: technical and biological.
In RNA-Seq, technical replicates are replicates where the biological material is the same in each replicate but the technical steps used to measure gene expression are performed separately. For example, a single culture of S. cerevisiae is grown and two samples are drawn from it and sequenced separately. In true technical replicates, all steps are replicated, including library preparation. Biological replicates consist of different biological samples that are processed through the sequencing process separately, for example, two cultures of S. cerevisiae are grown separately and sequenced separately.

Bio versus tech reps
Deciding whether to use biological or technical replicates is fairly straightforward. You simply have to decide whether or not you have biological variance in your system. For example, if you want to identify the genes that are differentially expressed between two strains of yeast then you will most likely grow each of the two strains in different flasks. Growing the strains in different flasks of course introduces some type of biological variance. This can be seen because if you grew two flasks of the same yeast strain then the expression would still differ between the flasks. These differences are caused by biological variance. Thus, biological replicates are required for this experiment.

There may be cases where there is not biological variance. For example, if you are interested in the differences between tumor tissue and matched normal tissue in a single patient then it could be argued that you only have technical variance. In this case there may be only one tissue in each control and test condition. Thus, technical replicates could be used. However, you cannot say anything about cancer in general from this experiment because it only looked at one patient. You can only limit your findings to that one patient, and specifically about the two sections of tissue that were involved. However, we expect that in most differential expression experiments there will be some biological variance and biological replicates will be required.

If this explanation is unclear to you then you may find it helpful to read our tutorial,
Thinking About RNA Seq Experimental Design. If you are still confused then you should probably e-mail us at busbym at bc dot edu because a proper design cannot be devised without understanding these principles.

Once pilot samples are sequenced, the reads will need to be aligned to a reference genome or transcriptome. Then, the number of reads which are aligned to each gene must be tabulated. The alignment and quantification process is currently outside the scope of Scotty, but we can recommend our lab's gkno Project tools. If you have difficulty getting started please contact us for assistance at busbym at bc dot edu and we will be happy to help.

How deeply should I sequence pilot data?

Pilot data does not have to be sequenced as deeply as the final sequencing run. However, Scotty will calculate the total power based on the number of genes that are observed in the pilot data with a read count of at least 1. That is, if 10,000 genes are quantified in the pilot data and Scotty expects you to be able to find 5,000 of these genes differentially expressed, the total power is reported as 50%. The reason Scotty calculates the power this way is that it is difficult to accurately estimate the number of genes that are expressed but unobserved in the data. These unobserved genes are expected to have substantially lower power than the observed genes, and an incorrect estimate of their number can skew the results. In general, it is reasonable to assume that there will be fairly low power to detect differentially expressed genes that were not observed in either replicate of the pilot samples. To illustrate this, the power of unobserved genes is shown in the final output chart, but Scotty does not attempt to predict the number of genes in that class. The rarefaction output chart in Scotty gives an idea of how quickly new genes are quantified. For some species it is practical to sequence even pilot data to saturation, that is, until all of the genes that can be observed have been observed.

How do I make sure my pilot data looks like my experimental data?

Control the environment in the same way that you will control the environment in the actual experiment. Environmental effects, such as growing samples in separate batches, can significantly alter the power of experiments. Further, we recommend using the same protocol that you will use in the actual experiment. We observed a lot of variability in how many reads were required to observe a fixed number of genes, and expect the results to vary with the library preparation protocol used.

Upload File Format

Pilot data should be uploaded as a tab-delimited text file in the following format:

Gene Control_1 Control_2 Control_3 Test_1 Test_2
Gene1_ID 153 225 16 7 0
Gene2_ID 20 15 16 13 12
Gene3_ID 1 0 2 3 2
Gene4_ID 15 4 10 3 11

To ensure a reliable upload, the gene names should contain only letters and numbers (no spaces). If you have both technical replicates (lanes or libraries from the same biological sample) and biological replicates (lanes or libraries from different biological samples), all of the technical replicates from each biological sample should be added together.

Thus, if you have samples from four different cultures sequenced with two lanes each your pilot data should have four columns, not eight. For paired end data count each pair once.
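As a sketch, the lane-summing step might look like this in Python (the gene IDs, lane names, and counts are made up for illustration):

```python
# Collapse technical replicates (lanes) into one column per biological
# sample before uploading to Scotty. All names and counts are illustrative.

# Per-lane read counts: two lanes for each of two cultures.
lanes = {
    "Culture1_lane1": {"Gene1": 80, "Gene2": 10},
    "Culture1_lane2": {"Gene1": 73, "Gene2": 5},
    "Culture2_lane1": {"Gene1": 120, "Gene2": 8},
    "Culture2_lane2": {"Gene1": 105, "Gene2": 7},
}

# Map each biological sample to its lanes.
samples = {
    "Control_1": ["Culture1_lane1", "Culture1_lane2"],
    "Control_2": ["Culture2_lane1", "Culture2_lane2"],
}

genes = sorted({g for counts in lanes.values() for g in counts})

# Sum the lane counts within each biological sample.
summed = {
    sample: {g: sum(lanes[lane].get(g, 0) for lane in lane_ids) for g in genes}
    for sample, lane_ids in samples.items()
}

# Build the tab-delimited table Scotty expects; save this as a .txt file.
header = "Gene\t" + "\t".join(summed)
rows = [g + "\t" + "\t".join(str(summed[s][g]) for s in summed) for g in genes]
print("\n".join([header] + rows))
```

Note that the output has one column per biological sample (two here), not one per lane.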

Selecting Preloaded Data

A less accurate approach
Caution! In the Scotty manuscript we speak at length about the potential shortcomings of using existing data to plan future experiments.

In short, we looked at several different experiments and found that the two factors that determine the power of an experiment (how many reads it took to measure new genes and how much variance there was between replicates) did not reproduce well between experiments, even when the experiments were nominally "the same", that is, measuring the same type of sample. We emphasize that we do not believe this is a shortcoming of the studies we examined. Rather, this happens for a variety of reasons and is to be expected in the normal course of laboratory work. For example, it would be expected to be more difficult to control the biological variability in an experiment with a large number of replicates, and thus these samples would be expected to have higher variance compared to samples in smaller experiments.

Further, the number of reads that is required to measure new genes will vary between libraries based on several factors that can influence library complexity. For example, if there is a low quantity of input RNA available in one of the experiments you may see a higher rate of read duplicates (an artifact) and the loss of certain RNA species. Thus, there will be a requirement for deeper sequencing to observe the same number of genes, and some genes may not be observed at all. Certain library preparation protocols may be better at producing complex libraries than others. Because protocols are always improving, we extend the caveat that the data in existing experiments may not be an ideal model for your experiment.

Thus, choosing a good model of pilot data for your experiment is tricky and the results of this analysis will certainly be less accurate than using your own pilot data. However, we believe that pre-loaded data will provide a better informed design than no data, and it is useful when it is not possible to generate pilot data, at the early stages of experiment planning, and after pilot data has been run to serve as a benchmark for measuring how well your own libraries are replicating.

Using Quantifications From Other Programs

The Scotty statistical model assumes that the read counts will represent the number of reads that align uniquely to a single spot in the reference. The reference can be the genome, in which case Scotty will estimate the power to quantify gene expression, or the transcriptome, in which case Scotty will estimate the power to quantify whether each transcript is differentially expressed.

In the case of a transcriptome alignment, users can expect that for complex species such as human the majority of reads that align to a transcript will not align uniquely. That is, the reads are likely to align to exons that are present in more than one isoform originating from the same gene. Therefore, substantially deeper sequencing is required to quantify differential expression at the transcript level.

Several strategies have been published that consider reads mapping to multiple isoforms and call transcript abundance based on both the uniquely and multiply-mapped reads. Scotty can be used with the output of these programs with the following considerations and cautions:

To illustrate how Scotty works with the output of these programs, we will first consider an example using a very simple reallocation method. We will consider an idealized example of a gene with two alternative transcripts of the same length (A and B) and completely even read coverage. Reads can align uniquely to one transcript or can map to both transcript A and transcript B. In this example we have:

Transcript A 6 unique reads
Transcript B 3 unique reads
Both Transcripts: 90 reads

Under a basic reallocation approach, the 90 reads which map to both transcripts will be divided proportionally, with 60 reads assigned to transcript A and 30 reads going to transcript B. While this will increase the nominal count of the reads for both Transcript A and Transcript B, there will not be a simple corresponding reduction in counting noise as you would see if the measurements were actually higher due to deeper sequencing. This occurs because the reallocation itself is based on counts which have a level of noise and is thus itself noisy.
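The proportional split in this idealized example works out as follows (a sketch of the basic reallocation arithmetic only, not of any particular published method):

```python
# Basic proportional reallocation: reads mapping to both transcripts are
# split in proportion to the unique-read counts.

unique_a, unique_b = 6, 3   # reads unique to transcript A / B
shared = 90                 # reads mapping to both transcripts

total_unique = unique_a + unique_b
alloc_a = shared * unique_a / total_unique  # 90 * 6/9 = 60.0
alloc_b = shared * unique_b / total_unique  # 90 * 3/9 = 30.0

count_a = unique_a + alloc_a  # nominal count for A: 66.0
count_b = unique_b + alloc_b  # nominal count for B: 33.0
print(count_a, count_b)
```

The nominal counts are much larger than the unique counts, but the split itself inherits the noise of the small unique counts (6 and 3) on which it is based.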

While the reallocation of multiply-mapping reads has been demonstrated to improve the absolute quantification of transcript abundance, as measured for instance by the correlation with microarray expression values, an accurate absolute quantification is not required for differential expression calls. Differential expression instead requires an accurate relative expression measurement when one compares two conditions, and an accurate relative measurement can be attained from the unique reads alone. While more complex algorithms exist than this simple approach, these methods can only impute the origin of multiply-mapped reads from the observed uniquely mapped reads and will all be fundamentally limited by the information content in the uniquely mapped reads. The question then becomes whether including the multiply-mapped reads decreases the variance across replicates (and increases power) or increases the variance across replicates (and decreases power).

We considered both the strengths and the limitations of Scotty in estimating power based on the output of these programs and offer this guidance to our users:

We do not believe that Scotty will have any inaccuracies when extrapolating the power from two pilot replicates to many replicates when the sequencing depth remains unchanged. The increase or decrease of variance that is introduced by the reallocation method will be observed in the total variance that is measured across the replicates. Thus, when the sequencing depth in the projected experiment is the same as the sequencing depth of the pilot data then the estimation of power using an increased number of replicates will be correct. However, the amount of error that is introduced by the reallocation of multiply-mapped reads is dependent upon the level of counting noise present in the original read counts. Thus, it is not possible for Scotty to model the change in variance that will occur with increased sequencing depth from the data that we collect because the error in the reallocation is both correlated with sequencing depth and specific to the statistical model that is used to reallocate the reads. Further, it is likely to be experiment-specific in regards to factors such as how well the data conforms to the model's assumptions of parameters such as evenness of coverage.

Thus, Scotty cannot completely address this inherent characteristic of the data, but we do offer guidance to users who wish to analyze transcripts quantified by programs that consider multiply-mapped read counts. We offer two suggestions: the first is to use Scotty with counts generated by the transcript-quantifying program, with the caveat that the extrapolation of power to higher sequencing depth will be subject to a certain degree of error. The second option is to simply quantify the transcript abundances using uniquely mapped reads. We believe this is the better option, as most of the information that is useful for differential expression calls is contained in the uniquely-mapped reads, and these results are likely to provide a reasonable approximation of the power that will be attained if you also include the multiply-mapped reads, which contain relatively less information.

This assumption is easy for the user to test if they process their pilot data using both the transcript caller and only the uniquely mapped reads. If the power observed at the sequencing depth of the pilot data is the same for both approaches then the extrapolation based on the uniquely aligned reads will provide a good estimate of the power attained by the transcript caller at higher read depths.

Cost Data

Cost data should be entered as follows:
Cost per replicate, excluding reads - This is the cost of producing each replicate, for instance the cost of obtaining the biological material and preparing the library.

Cost per million reads that are aligned to genes - A certain portion of the reads that are sequenced will contain too many errors to align to the reference. The portion that is not alignable will vary by experiment. Scotty only considers the reads that are aligned to genes. Therefore, the cost per read should reflect how much each aligned read costs. For example, if one million reads cost $1,000 but only half of those reads align to genes then you should enter $2,000 as this cost.
The currency does not matter. We just put $ because that is what is on our keyboard.
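The conversion from sequencing cost to aligned-read cost can be sketched as a one-line function (the function name and example numbers are ours, matching the example above):

```python
# Convert cost per million sequenced reads into cost per million reads
# aligned to genes, which is the number Scotty asks for.

def cost_per_million_aligned(cost_per_million_sequenced, fraction_aligned_to_genes):
    """If only a fraction of reads align to genes, each aligned read
    effectively costs more than each sequenced read."""
    return cost_per_million_sequenced / fraction_aligned_to_genes

print(cost_per_million_aligned(1000, 0.5))  # 2000.0
```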

Constraints for Power Optimization

Scotty will calculate the power, cost, and measurement bias for a number of experimental configurations having different read depths and replicate number:

The first two inputs specify which experimental configurations to test:
Maximum number of replicates per condition - Scotty will test experimental configurations that use between two and the maximum number of biological replicates specified. Each experiment that is tested contains the same number of replicates in the control and test condition. While we display the power observed for only two replicates, we generally recommend that the minimum number of replicates used is 3. Gene variance is very poorly measured with only two replicates and it is difficult to detect an outlier (bad) replicate if there are only two samples.

Assess the power of sequencing depths between (entries) - For each replicate number, Scotty will test 10 different read depths between the minimum and maximum values that are specified. The read depth is not the number of reads sequenced. It is the number of reads that align to genes. Therefore, if you expect that about half your reads will align to genes, you will need to sequence twice as many reads as Scotty calculates to get to the correct read depth.

The next inputs specify the constraints that are on the experimental design:
Detect at least (a certain percentage) of genes that are differentially expressed by (a fold change) at a (p-value) - Differential expression is defined as a fold change relative to the control condition. For example, in a 2X fold change the test condition will have expression that is twice as high as the control condition.

The percentage of genes detected is defined as the percentage of genes with a true fold change of the magnitude specified in the test condition relative to the control condition. Genes that are expressed at levels that are too low to be observed in the prototype data are not included in this calculation. Use the full percentage, e.g. 80 not 0.8, which we will interpret as 0.8%.

The p-value is the significance threshold that will be applied to the p-values calculated in a t-test. See Scotty's accompanying paper for details of the statistical test and comparisons of power between the t-test and a negative binomial test.

Experiment will cost no more than (budget) - Scotty adds up the cost of the replicates and the cost of the reads and tells the user which experiments are feasible. This does not include data analysis costs. Processing a large number of reads will require computational power beyond what is available on normal desktop computers. These resources may be available at, for example, university computer clusters. Alternatively, processing power can be obtained through cloud services but should be budgeted for.

Limit measurement bias by measuring at least (percentage) of genes with at least (percentage) of maximum power - This is the most complicated of Scotty's inputs. RNA-Seq measurements quantify gene expression with read counts. Because of the count nature of the data, genes which are measured with low read counts (particularly below 10) have imprecise measurements. The imprecision occurs because you can be more certain that counts of 100 and 200 represent true differences than you can be that counts of 1 and 2 do. (See our tutorial
Thinking About RNA Seq Experimental Design for a longer explanation of this effect, which is called Poisson variance.) Because of this effect, there will be substantially lower power to detect differentially expressed genes measured with low counts than there would have been if the expression of these genes had been measured on a continuous scale (e.g. a continuous measurement of 3.5 reads versus the forced binning of measurements into 3 or 4 reads).
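The scale of this counting noise is easy to see numerically: for a Poisson count of n reads, the relative noise is 1/sqrt(n). A small sketch (our illustration, not part of Scotty itself):

```python
# For a Poisson-distributed count n, the standard deviation is sqrt(n),
# so the relative counting noise is sqrt(n)/n = 1/sqrt(n). It shrinks
# as counts grow, which is why high-count genes are measured precisely.
import math

for n in [1, 10, 100, 1000]:
    rel_noise = 1 / math.sqrt(n)
    print(f"count={n:>4}  relative Poisson noise={rel_noise:.2f}")
# A gene counted ~100 times has ~10% counting noise; at 1 read it is 100%.
```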

However, the amount of uncertainty that is introduced into the measurements because of their count nature can be easily calculated. Thus, it is possible to calculate how much power the measurement would have had if the gene expression had been measured with a continuous measurement. We define the power that would have been calculated with continuous measurements as the maximum power. As the read count increases, bins between the counts become smaller relative to the count, and the power that is achieved using counts asymptotically approaches the maximum power:

Power Bias
In some experiments, it is undesirable to have a large number of genes measured with less power than other genes because it adds complexity to downstream analyses. This is particularly true in cases where findings of functional enrichment, for example, are in and of themselves experimental results. In such cases, it is important to account for measurement biases in the analysis to make sure that enrichment for individual biological categories is observed because genes in these categories are more likely to be differentially expressed, rather than because they are more likely to be detected if they are differentially expressed. This might be the case for genes which are measured with a high number of reads. While this analysis can be done, it is more difficult and will add some expense to downstream analyses. Thus, it may be worth considering whether spending more to obtain less biased data may be paid for by the cost savings of a simpler analysis, particularly in species with smaller transcriptomes such as yeast where sequencing samples beyond Poisson noise is economically feasible. In other experiments, such as when the primary findings are target identification, measurement biases will matter less.


Here we show some examples of output data. Most of the data shown is taken from the human liver samples from Blekhman et al. 2010.


We charted the reads per gene for each replicate against the first replicate in the condition (i.e. Control Replicate 2 versus Control Replicate 1).

This is an example of a fairly typical pair of biological replicates:

If your pilot data is more dispersed than this then it could indicate a problem. If it is truly a mess then we would recommend first checking whether your data file is corrupt. If not, then there might be a problem with the library prep or the samples may not be replicating properly.

Control Versus Test Data

If you entered both control and test data, Scotty charts the reads per gene for the control data versus the test data. You can use this chart to see how different the control and test samples are. The black line going through the center of the data points is the slope line. The slope line is used to normalize the test and control samples. This line should go through the center of the data. If the line seems far off, this could mean that you have a lot of differential expression in your samples, or that the library complexity of the control and the test data is very different. If this is the case, you may want to investigate including spike-ins in your library prep to get more accurate sample normalization.


Scotty clusters the data to identify any problems, such as outlier samples and sample swaps. In general, you should expect samples from the control condition to cluster in one branch, and samples from the test condition to cluster in another branch. This, however, will not be the case if the groups are similar. For instance, if you are comparing gene expression between two HapMap groups then we expect the groups to share branches because, while the expression of individual genes may differ, the global expression is close enough that there is no clear delineation between the groups using this clustering (based on what we have observed in existing datasets). However, this clustering has proven useful for us in the past to identify an instance where one of our samples was mislabelled. For this cluster we used a Spearman rank correlation as the distance metric. We excluded genes expressed with fewer than 10 reads to avoid the chart being dominated by Poisson noise.
Cluster Example
In the example above, we see that the samples labeled Female1 and Male2 had the most closely related expression. If there were large differences between male and female expression you would expect the male samples to cluster together, and the female samples to cluster together separately. However, that effect was not observed.
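The Spearman-based distance used for this clustering can be sketched as follows. The counts, sample names, and the exact way the fewer-than-10-reads filter is applied are our illustrative assumptions, not Scotty's exact implementation:

```python
# Spearman rank correlation as a distance metric between samples,
# computed after dropping low-count genes. All numbers are made up.

def ranks(values):
    """Rank values from 1, averaging tied groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman_distance(x, y):
    """1 - Spearman rho: 0 for identical rankings, 2 for reversed."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return 1 - cov / (sx * sy)

samples = {  # per-gene read counts for five genes (illustrative)
    "Control_1": [120, 45, 300, 12, 80],
    "Control_2": [110, 50, 280, 15, 75],
    "Test_1":    [40, 200, 90, 160, 3],
}

# Keep only genes with at least 10 reads in every sample, so Poisson
# noise does not dominate the correlations (drops the last gene here).
n_genes = 5
keep = [i for i in range(n_genes) if all(c[i] >= 10 for c in samples.values())]
filtered = {s: [c[i] for i in keep] for s, c in samples.items()}

names = list(filtered)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = spearman_distance(filtered[names[i]], filtered[names[j]])
        print(f"{names[i]} vs {names[j]}: distance {d:.2f}")
```

With these counts the two control samples rank the genes identically (distance 0), so they would merge first in the dendrogram.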

Genes Detected as a Function of Sequencing Depth

As sequencing depth increases, a greater number of genes will be measured. This chart shows how many genes are measured by at least 10 reads as a function of sequencing depth. We chose 10 as the cutoff because measurements with fewer than 10 reads tend to be imprecise due to the effects of Poisson counting noise. Samples that are nominally measuring the same thing within the same experiment (i.e. all human liver samples) would be expected to have similar rarefaction curves. This example shows human liver data taken from six different individuals.
Rarefaction Example
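A rarefaction curve of this kind can be approximated by scaling each gene's count by a subsampling fraction and counting how many genes remain at or above the 10-read cutoff. This is a crude sketch on simulated counts, not Scotty's own procedure:

```python
# Crude rarefaction sketch: deterministically thin counts by a fraction
# and count genes still detected with >=10 reads. Counts are simulated
# from a skewed distribution: a few high-count genes, many low-count.
import random

random.seed(1)
counts = [int(random.expovariate(1 / 50)) for _ in range(5000)]
total = sum(counts)

detected_at = []
for fraction in [0.1, 0.25, 0.5, 1.0]:
    detected = sum(1 for c in counts if c * fraction >= 10)
    detected_at.append(detected)
    depth = total * fraction
    print(f"depth {depth / 1e3:7.1f}k reads: {detected} genes with >=10 reads")
```

The curve rises quickly at first and flattens as it approaches saturation, which is the behavior the chart above illustrates with real data.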


Results are presented in the following format:

Power Optimization

The power optimization chart shows how the power to detect differentially expressed genes increases as the number of replicates and sequencing depth increases.
Power Example
Genes that are expressed at levels that are too low to be observed in the prototype data are not included in this calculation. The reason that we excluded these genes is as follows: It is easy to imagine that with an infinite number of samples and infinite sequencing all of the annotated genes would eventually be detected as expressed. However, as a practical matter, many of these genes appear to be expressed at very low levels (below one transcript per cell). These genes are not likely to either be of interest to the user or quantifiable using current RNA-Seq technology at practical costs. Therefore, Scotty assumes that the genes that are present in the prototype data are the genes that are of interest to the user.

It is worth noting that in datasets where many genes are measured with fewer than ~10 reads, as is the case in most human datasets, the calculation of total power may not be the most informative metric. In these cases, when there is a sufficiently high replicate number, the total power will be driven primarily by the mixture of read depths. For genes measured with high counts, almost all of the differentially expressed genes will be detected, while for genes measured with low counts almost none of the differentially expressed genes will be detected. The total power, therefore, becomes primarily a function of how many genes have high versus low counts. The total power metric is therefore best used in conjunction with the final output chart in Scotty, which shows the power by read depth for allowed experiments. (Note that this chart does not appear if there are no allowed configurations.)

Cost Optimization

The cost optimization chart shows how the cost of the experiment will increase as the number of reads sequenced and the number of replicates increase.

Note that Scotty's models are based on the cost per read that is aligned to a gene. We use this model to simplify user inputs and because alignment rates differ between experiments and between sequencing centers. A fraction of sequenced reads will not align to genes due to errors in the reads and alignment to non-gene regions. Therefore, the cost per aligned read is higher than the cost per sequenced read. For the pre-loaded datasets, the percentage of sequenced reads that were aligned to genes is available in the documentation for the pre-loaded datasets.
Cost Example

Detection Differences Due to Poisson Noise

The measurement bias metric is described in detail above, under Constraints for Power Optimization.
Power Bias Example

Excluded Experimental Configurations

This plot shows which experimental configurations meet the user's requirements.
Power Bias Example
In this experiment, we see that the experiments in the top right hand corner are too expensive. Those on the bottom have insufficient power. This leaves ten of the original 90 configurations that meet the user's requirements. The reasons for the exclusions are given in the key.

Allowed Experimental Configurations

This plot shows the optimal experimental configurations in a simpler format.
Power Bias Example
All experimental configurations plotted with white boxes meet the user-defined criteria. In this experiment, the cheapest of these allowed configurations uses 9 replicates of each condition, sequenced with 10 million reads each. The most powerful has 10 replicates with 20 million reads per replicate.

Predicted Statistical Power

In these plots we show the statistical power to identify a 3X, 2X, and a 1.5X change in expression.
Power Bias Example
In the example above, notice how much more cheerful this plot is than the previous power plots. While the overall power of the experiment is poor (around 45%), in this case this is mostly due to the existence of a large class of genes which are measured with few reads and so have low power. The number of these genes can be seen in the lines defining the quartiles of expression. Among the genes that are measured with counts that are high enough not to be affected very much by Poisson noise (~10 or more), the power is much higher. This chart indicates that under this experimental configuration most changes in expression in genes with low read counts will be missed, but most changes in expression among higher read count genes will be found. The total power of the experiment then depends on what this coverage mix is.

Why Does Scotty Use a T-Test?

Our selection of a t-test as our statistical method was made after considerable deliberation. The primary reason we used a t-test is that we think it is a very good test.

For example, in the Scotty manuscript we used simulations to compare the t-test's ability to identify differentially expressed genes to a widely-used benchmark method that is based on the negative binomial distribution and uses an information-sharing strategy to calculate variance. We found that the t-test does identify fewer differentially expressed genes when the experiment contains only two replicates. This is to be expected because variance measured with only two replicates is inaccurate. However, when there were at least three replicates the t-test performed nearly as well as the benchmark test, and with 4 to 6 replicates the t-test performed comparably. At all read depths above one read per gene the false positive rate was closely matched by the p-value, despite the fact that RNA-Seq data is count-based. We found that this performance and the accuracy of the p-values were maintained whether expression was modeled as being normally or lognormally distributed. Additionally, the t-test does not introduce any biases into the findings, which can occur when information-sharing approaches are used to calculate the variance. This, we think, makes downstream analysis easier because findings of, for example, enrichment in biological categories are less likely to be influenced by biases in the statistical test. Formulas are also readily available, which improves Scotty's performance. We use the formulas described in
Chow, Shao, and Wang 2002, which are more accurate at low sample sizes than the ones commonly found in textbooks.
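To make the test concrete, here is a sketch of a per-gene two-sample Welch's t-test. The counts, the log2 transform, and the +1 pseudocount are our illustrative choices, not necessarily Scotty's exact preprocessing or the Chow, Shao, and Wang formulas:

```python
# Per-gene two-sample Welch's t-test on log-transformed read counts.
import math

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

control = [153, 225, 190]  # read counts for one gene, 3 replicates
test = [310, 290, 355]

log_c = [math.log2(c + 1) for c in control]
log_t = [math.log2(c + 1) for c in test]
t, df = welch_t(log_c, log_t)
print(f"t = {t:.2f}, df = {df:.1f}")  # look up p in a t distribution
```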

We would generally recommend using at least three replicates. An experiment that uses only two is likely to be underpowered, and it is also difficult to tell if one of your replicates is an outlier.

Beyond Differential Expression

The power calculated by Scotty is only predictive of the power to detect differential gene expression between control and test conditions. As a general rule, other purposes, such as variant detection, require much deeper coverage. Some tools for splice detection, for example, only perform well when there are hundreds of reads aligned to the spliced gene. Accurate single nucleotide variant detection requires several reads to align across each individual variant.

For comparison, most of the genes detected as expressed in the earlier Solexa sequencing runs of human contained fewer than 10 reads across the entire gene, which may stretch for thousands of base pairs. These measurements, therefore, do not contain enough information to detect much besides expression level for the majority of the genes that were expressed in the sample.

To determine how deeply samples need to be sequenced for other purposes, we recommend reading the papers that describe tools for processing the data and seeing how many reads are used in the performance benchmarks. The figures under the optimization charts can be used to give you an idea of how many genes will be quantified at a given read depth.

Citations Keep the Money Rolling

If you use Scotty to design your experiment, we hope you will have awesome, publishable results! If you do, please don't forget us! Please reference our paper in Bioinformatics:

Scotty: A Web Tool For Designing RNA-Seq Experiments to Measure Differential Gene Expression. M.A. Busby; C. Stewart; C. Miller; K. Grzeda; G. Marth Bioinformatics 2013; doi: 10.1093/bioinformatics/btt015

Thank you!

Contact Us

We love hearing from our users! Contact Michele Busby with any questions at busbym at bc dot edu.

Good Luck!

Scotty is dedicated to Hillary St. Pierre. May all your experiments be as powerful as Hillary.