Getting started

This tutorial assumes you know how to load data into an RStudio session, view a dataframe and explore columns/rows of a dataframe in R. Knowing how to visualize data as scatterplots will also be helpful, though not essential.

We will be using two packages - reshape2 and ggpubr.
- reshape2 is a package with 2 main functions, melt and dcast. It is helpful for flexibly reshaping your data.
- ggpubr is a graphing package, that lets you create publication ready ggplots, and automatically add significance levels to your figures.
Let us also load up ggplot2 into our current environment, just in case we want to make pretty plots.

# If you don't have a package installed already install.packages('packagename')
# Otherwise, load it into the environment
library(ggplot2)
library(ggpubr)
library(reshape2)

R provides excellent support for statistical analysis. The data we will be working with is cell-line expression data from the LINCS1000 dataset. I have adapted this dataset for our use, which is available at the same spot you found this tutorial. You can also download the original data from here.

Cleaning my data

## Load the data
data_df <- read.table("data_sp_scaled.txt", sep = "\t", stringsAsFactors = F, header = T)
## and the covariate information
metadata_df <- read.table("metadata.txt", sep = "\t", stringsAsFactors = F, header = T)

As always, let us start by figuring out what we are working with. The dim() function prints the dimensions of a dataframe, and head() function shows the first 6 rows of a dataframe. You can also print only the row-count (or column-count) with the commands nrow() and ncol().
The command dim(data_df) tells us that the data has 35 rows and 33842 columns. Similarly, the row count and column count values for the metadata dataframe, metadata_df, are 35, 5, respectively.

Notice that if we try to print the first 6 rows of data_df, the output is immense. Thus, instead of head(data_df), we will print the first 6 rows, with the first 10 columns. We can select these columns with data_df[,1:10].

	TSPAN6	TNMD	DPM1	SCYL3	C1orf112	FGR	CFH	FUCA2	GCLC	NFYA
HCC1806	3.505880	0	NA	NA	3.534551	NA	2.8251175	5.727614	NA	4.600950
MCF10A	4.027427	0	5.549405	2.482252	3.963324	0.0000000	5.2812262	5.292880	6.521847	4.172004
SKBR3	2.684550	0	6.747064	3.199656	4.189142	0.1014213	0.0096204	5.240616	4.322539	3.518300
HS578T	NA	0	NA	1.424340	3.363768	0.0369093	5.2759113	5.841921	3.851815	4.461398
MDAMB231	4.287758	NA	5.693864	2.217807	4.619957	0.0000000	0.9724045	5.677292	4.046207	5.442562
BT20	3.335776	0	6.602087	2.648148	3.904788	0.0000000	0.1888233	4.781958	5.623793	4.138710

The column names correspond to genes, and the rows represent samples. These sample names correspond to the column cl_id in metadata_df (can you quickly verify this using head?).

We can also quickly ensure we don’t have random unexpected values (such as characters or alphabets where we expect numbers) using the query is.numeric. You can quickly check what this function does, by entering ?is.numeric in your R prompt.

table(sapply(data_df, is.numeric))

## 
##  TRUE 
## 33842

Looks like all our columns are numeric!

Take a quick look at the output from the head() function a couple of lines ago. It looks like we have some missing values in our data! Before we try and figure out a fix for this, let us calculate how many genes have missing values, or if the problem is only in a single sample.
R has a handy command, complete.cases, for checking if there are any rows containing missing values. It returns a TRUE/FALSE value for every row. We can summarize the results of this list in tabular form, using the function table().

table(complete.cases(data_df))

Var1	Freq
FALSE	5
TRUE	30

It appears 5 samples have atleast 1 gene with a missing value. We can redo this test for the genes, after transposing our data. This is done using the function t().

table(complete.cases(t(data_df)))

Var1	Freq
FALSE	18941
TRUE	14901

Dealing with missing data

Over 50% of the genes across 5 samples are missing. We can deal with this either using imputation strategies, or by discarding the problematic samples. As imputation strategies are an entire discussion by themselves, we will not into dive into them today (additional resources available at end of tutorial). Instead, we will take the easy way out and remove the samples with NAs. Good thing we have already covered a quick way to unselect these samples!.

data_df_clean <- data_df[complete.cases(data_df), ]

If you were perusing the previous tutorial, you would have noticed us using na.omit to find rows in a dataframe that contain any NA value (in any column). These two commands are functionally the same, but complete.cases can be used on a subset of columns instead of the entire dataframe as well. For example, if we wanted only to remove the samples where 1 or more of certain genes were missing, we could have chosen them with data_df[complete.cases(data_df[,c(“myFavGene1”, “myFavGene2”,….,“myfavGeneN”)]), ]. Don’t forget the comma after the row-selection!

After this, we have 30 samples and 33842 genes. We can do a quick ‘smell-test’ on this data, by using the dataframe function summary(). This function calculates summary statistics for each column in the dataframe. We can transpose the dataframe so that the samples become columns (instead of rows).

summary(t(data_df_clean))

BT20	MCF7	PDX1258	PDX1328	BT549	HCC38	HCC70	MDAMB134	MDAMB157	MDAMB361	MDAMB436	MDAMB453	MDAMB468	PDXHCI002	HCC1143	HCC1395	HCC1419	HCC1500	HCC1937	HCC1954	CAL120	CAL51	CAL851	SUM149	SUM159	SUM1315	T47D	CAMA1	HME1	HCC1428
Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 1.000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000	Min. : 0.0000
1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 1.000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000	1st Qu.: 0.0000
Median : 0.1565	Median : 0.2083	Median : 0.3089	Median : 1.123	Median : 0.1346	Median : 0.2004	Median : 0.1219	Median : 0.4786	Median : 0.3232	Median : 0.2085	Median : 0.2637	Median : 0.1186	Median : 0.204	Median : 0.1926	Median : 0.1836	Median : 0.1885	Median : 0.1543	Median : 0.1977	Median : 0.2719	Median : 0.2176	Median : 0.1677	Median : 0.1725	Median : 0.2389	Median : 0.1493	Median : 0.1041	Median : 0.1418	Median : 0.2078	Median : 0.1108	Median : 0.1323	Median : 0.2588
Mean : 1.6150	Mean : 1.6330	Mean : 1.7137	Mean : 3.433	Mean : 1.5792	Mean : 1.6433	Mean : 1.6028	Mean : 1.7914	Mean : 1.7008	Mean : 1.6365	Mean : 1.6796	Mean : 1.5859	Mean : 1.670	Mean : 1.6045	Mean : 1.6385	Mean : 1.6241	Mean : 1.5863	Mean : 1.6684	Mean : 1.7110	Mean : 1.6560	Mean : 1.6060	Mean : 1.6329	Mean : 1.6847	Mean : 1.6150	Mean : 1.5660	Mean : 1.5528	Mean : 1.6730	Mean : 1.6005	Mean : 1.5523	Mean : 1.6649
3rd Qu.: 3.2142	3rd Qu.: 3.2164	3rd Qu.: 3.2897	3rd Qu.: 3.648	3rd Qu.: 3.0660	3rd Qu.: 3.1915	3rd Qu.: 3.1620	3rd Qu.: 3.4679	3rd Qu.: 3.2967	3rd Qu.: 3.2017	3rd Qu.: 3.2644	3rd Qu.: 3.1528	3rd Qu.: 3.322	3rd Qu.: 3.1102	3rd Qu.: 3.1878	3rd Qu.: 3.1407	3rd Qu.: 3.0975	3rd Qu.: 3.3240	3rd Qu.: 3.3607	3rd Qu.: 3.2790	3rd Qu.: 3.1178	3rd Qu.: 3.2700	3rd Qu.: 3.3469	3rd Qu.: 3.1577	3rd Qu.: 3.1224	3rd Qu.: 3.0169	3rd Qu.: 3.3095	3rd Qu.: 3.2538	3rd Qu.: 3.0697	3rd Qu.: 3.2264
Max. :13.9749	Max. :13.5385	Max. :400.7026	Max. :310.614	Max. :12.7921	Max. :12.9747	Max. :13.5201	Max. :13.7050	Max. :13.4885	Max. :13.5720	Max. :13.4842	Max. :13.4478	Max. :13.215	Max. :13.4067	Max. :13.3953	Max. :13.2711	Max. :14.2527	Max. :13.9911	Max. :12.8188	Max. :13.2294	Max. :12.5078	Max. :12.8472	Max. :13.3497	Max. :13.9438	Max. :13.1575	Max. :13.2789	Max. :13.9904	Max. :13.7174	Max. :12.5431	Max. :13.7117

Well, its still hard to read! Enter ggplot! However, we need to set up our data such that we can pass in a column with the sample name, and a column with the values being plotted.

For this, we will use the melt function from the reshape2 package. The melt function is helpful in converting your data from the long to wide format. A similar function, cast(), can be used when you wish to calculate summary statistics on your data.

data_compact_df = melt(t(data_df_clean))
colnames(data_compact_df) = c("Gene", "Sample", "Expression")

Dealing with outliers

ggplot has a handy geom_object (remember these from the ggplot tutorial?) for summary statistics. The stat_summary() (or geom_summary()) method allows us to plot a pointrange plot showing the mean and 2 x standard deviation.

## stat_summary() defaults to the categories defined on the x axis, and summarizes
## the numeric spread on the y-axis We use coord_flip() to switch the categories
## to the y-axis, to improve readability
ggplot(data_compact_df, aes(x = Sample, y = Expression)) + stat_summary() + theme_bw(base_size = 14) + 
    coord_flip()

## stat_summary with 1 Standard Deviation around mean
ggplot(data_compact_df, aes(x = Sample, y = Expression)) + stat_summary(fun.args = list(mult = 1)) + 
    theme_bw(base_size = 14) + coord_flip()

Segue: Another ggplot method is stat_smooth() (or geom_smooth()). This is helpful for plotting a line of best fit on your data. When you may want to compare this with a standard straight line, geom_abline is quite helpful.

## geom_point to visualize the scatterplot stat_smooth to fit the blue line with
## confidence intervals in grey geom_abline fits an x=y line by default
## (intercept=0, slope=1).
ggplot(data_df_clean, aes(x = BRCA1, y = PIK3CA)) + geom_point() + stat_smooth() + 
    geom_abline(colour = "red", size = 2) + theme_bw(base_size = 14)

End of Segue: Looking back at our stat_summary figure, we have an anomalous sample! The sample PDX1328 has readings that lie extremely outside the range of the rest of the samples. We can see this more clearly with a boxplot version of the datapoints, plotted using geom_boxplot().

ggplot(data_compact_df, aes(x = Sample, y = Expression)) + geom_boxplot() + theme_bw(base_size = 14) + 
    coord_flip()

Outliers come in many different flavors. There can be single datapoints (point outliers), noise in the data (contextual outliers), and an entire divergence in the observed values (collective outliers). In this case, we have a point outlier, which is lying far away from the rest of the observations. It may possibly have arisen from measurement or data entry errors.

Could we have made this detection analytical? We can calculate the Z-score of each observation. A Z-score is a standardized score, which tells you how many standard deviations away from the mean a data-point is. We can calculate the score using the scale function, which is applied to each column by default.

z_data = scale(data_df_clean, center = TRUE, scale = TRUE)
z_data_avgSample = rowMeans(z_data)

Let us see which sample has the maximum z-score.

print(sort(z_data_avgSample))

	x
HME1	-0.2219838
SUM159	-0.1999847
HCC70	-0.1934084
SUM1315	-0.1893011
BT549	-0.1768763
SUM149	-0.1566986
HCC1419	-0.1534749
MDAMB453	-0.1447130
CAMA1	-0.1394069
HCC1143	-0.1391422
PDXHCI002	-0.1363119
CAL120	-0.1353164
BT20	-0.1281040
MDAMB361	-0.1168243
HCC1395	-0.1133929
HCC38	-0.1064774
HCC1954	-0.0957087
MCF7	-0.0944671
CAL51	-0.0903028
CAL851	-0.0835159
MDAMB468	-0.0716014
HCC1428	-0.0480493
T47D	-0.0476032
HCC1500	-0.0466890
MDAMB436	-0.0288659
HCC1937	-0.0228481
PDX1258	-0.0215070
MDAMB157	0.0151131
MDAMB134	0.1772559
PDX1328	2.9102061

We can remove the outlier sample using the following command (note that we are making changes to the sample x gene dataframe, not the melted version).

data_df_clean2 = data_df_clean[!(rownames(data_df_clean) %in% c("PDX1328")), ]

Alright, so what does the data look like after that?

# First we melt our data
data_compact_df = melt(t(data_df_clean2))
colnames(data_compact_df) = c("Gene", "Sample", "Expression")
# Then we plot it
ggplot(data_compact_df, aes(x = Sample, y = Expression)) + geom_boxplot() + theme_bw(base_size = 14) + 
    coord_flip()

Also note that while this was easy to do for a small set of samples, you may not be able to visually identify outliers in large datasets. You can calculate z-scores for each sample, and identify samples that lie a few deviations away. You can generate PCA decompositions of your data, and plot the first 2 principal components. If you see a sample sitting further away from the rest, that’s an outlier! There are also more sophisticated approaches for dealing with outliers, explained nicely at this blogpost

It looks like PDX1258 has an outlier gene. We can easily print out the gene from our melted dataframe, with the command data_compact_df[data_compact_df$Expression > 300, "Gene"]. This returns TSPAN6. We can either remove this gene entirely, or replace it with the mean value. Sample metadata information can come in handy at this point. Let us see what information the metadata dataframe can provide:

print(metadata_df)

cl_id	cl_provider_name	cl_provider_catalog_id	cl_cell_type	cl_disease_detail
CAL51	Leibniz Institute	ACC-302	epithelial-like	breast carcinoma
MCF7	ATCC	HTB-22	epithelial	breast adenocarcinoma
HME1	ATCC	CRL-4010	epithelial	normal
SKBR3	ATCC	HTB-30	epithelial	breast adenocarcinoma
MDAMB231	ATCC	HTB-26	epithelial	breast adenocarcinoma
BT20	ATCC	HTB-19		breast ductal carcinoma
BT549	ATCC	HTB-122		breast ductal carcinoma
CAMA1	ATCC	HTB-21		breast adenocarcinoma
HC1143	ATCC	CRL-2321		breast ductal carcinoma
HCC1395	ATCC	CRL-2324		breast ductal carcinoma
HCC1419	ATCC	CRL-2326		breast ductal carcinoma
HCC1428	ATCC	CRL-2327		breast adenocarcinoma
HCC1806	ATCC	CRL-2335		squamous cell carcinoma
HCC1937	ATCC	CRL-2336		breast ductal carcinoma
HCC1954	ATCC	CRL-2338		breast ductal carcinoma
HCC38	ATCC	CRL-2314		breast ductal carcinoma
HCC70	ATCC	CRL-2315		breast ductal carcinoma
HS578T	ATCC	HTB-126		breast ductal carcinoma
MDAMB134	ATCC	HTB-23		breast ductal carcinoma
MDAMB157	ATCC	HTB-24		breast medullary carcinoma
MDAMB361	ATCC	HTB-27		breast adenocarcinoma
MDAMB436	ATCC	HTB-130		breast adenocarcinoma
MDAMB453	ATCC	HTB-131		breast carcinoma
MDAMB468	ATCC	HTB-132		breast adenocarcinoma
T47D	ATCC	HTB-133		breast carcinoma
HCC1500	ATCC	CRL-2329	epithelial	breast ductal carcinoma
MCF10A	ATCC	CRL-10317	epithelial	breast fibrocystic disease
SUM1315	Asterand	SUM-1315MO2		unknown
SUM149	Asterand	SUM-149PT		unknown
SUM159	Asterand	SUM-159PT		unknown
CAL120	Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures	ACC-459		breast adenocarcinoma
CAL851	Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures	ACC-440	epithelial	breast adenocarcinoma
PDX1258	Dan Stover (Harvard Medical School)		epithelial	breast carcinoma
PDX1328	Caitlin Mills (Harvard Medical School)		epithelial	breast carcinoma
PDXHCI002	Dan Stover (Harvard Medical School)		epithelial	breast medullary carcinoma

PDX1258 is a breast carcinoma. We can see the different disease categories by summarizing the contents of the column cl_disease_detail. We can also sort the table while we are at it….

sort(table(metadata_df$cl_disease_detail))

Var1	Freq
breast fibrocystic disease	1
normal	1
squamous cell carcinoma	1
breast medullary carcinoma	2
unknown	3
breast carcinoma	5
breast adenocarcinoma	10
breast ductal carcinoma	12

As there are 5 breast carcinomas in this dataset, we can potentially set the value of PDX1258 to the average value of the gene in other breast carcinomas. If we connect our metadata with our expression data, it will be easy to select the gene and samples of interest. For this we use the merge function. Merging requires a column that has the same values in the 2 different dataframes we are joining. Note that we can specify the column using by="shared column" if the column has the same name in the 2 dataframes.

## Merge expression and metadata
data_merged = merge(data_compact_df, metadata_df, by.x = "Sample", by.y = "cl_id")
## Select PDX1258's outlier gene
brca_tspan_df = data_merged[data_merged$Gene == "TSPAN6" & data_merged$cl_disease_detail == 
    "breast carcinoma", ]

	Sample	Gene	Expression	cl_provider_name	cl_provider_catalog_id	cl_cell_type	cl_disease_detail
101527	CAL51	TSPAN6	5.6334571	Leibniz Institute	ACC-302	epithelial-like	breast carcinoma
679083	MDAMB453	TSPAN6	0.2958427	ATCC	HTB-131		breast carcinoma
750048	PDX1258	TSPAN6	400.7026020	Dan Stover (Harvard Medical School)		epithelial	breast carcinoma
913735	T47D	TSPAN6	3.9403834	ATCC	HTB-133		breast carcinoma

Notice that there are only 4 samples here, and CAL51 is similar to PDX1258 as both are epithelial-like cell lines. We can set the TSPAN6 value for PDX1258 the same as sample CAL51, or the average of the 3 breast carcinomas.

## Firstly we update data_df_clean2 Notice how we select the row with the sample
## name, and gene with the gene name
data_df_clean2["PDX1258", "TSPAN6"] = mean(brca_tspan_df[brca_tspan_df$Sample != 
    "PDX1258", "Expression"])
## Then we recalculate the melted version of this dataframe
data_compact_df = melt(t(data_df_clean2))
colnames(data_compact_df) = c("Gene", "Sample", "Expression")

Comparing groups (and plotting the significance values)

For the last bit, we will use an in-built dataset, airquality. You can load it into your current environment by typing data(airquality).

data(airquality)

Ozone	Solar.R	Wind	Temp	Month	Day
41	190	7.4	67	5	1
36	118	8.0	72	5	2
12	149	12.6	74	5	3
18	313	11.5	62	5	4
NA	NA	14.3	56	5	5
28	NA	14.9	66	5	6

We will use the package ggpubr. This package is quite similar to ggplot, but it has additional methods that make it easy to create publication-ready figures in R. One of these methods is stat_compare_means().

ggplot(airquality[airquality$Month %in% c(5, 6), ], aes(x = Month, y = Temp)) + geom_boxplot() + 
    stat_compare_means() + theme_bw(base_size = 14)

You have probably run into an error message as you run the code above.
Warning message: Continuous x aesthetic – did you forget aes(group=…)? .
This is because the categories we are passing to perform the paired test for significance are numeric (hence ‘continuous’). We can overcome this by treating the category column’s values (Month) as strings.

ggplot(airquality[airquality$Month %in% c(5, 6), ], aes(x = as.character(Month), 
    y = Temp)) + geom_boxplot() + stat_compare_means() + theme_bw(base_size = 14)

You can change the type of test that is performed. For example, try updating stat_compare_means() to stat_compare_means(method="t.test")

We can visualize the spread of data in the different categories using other geometric objects. A violin plot is an extension of a box-plot that shows the kernel density distributions of the data points, in addition to the median value and spread.

ggplot(airquality[airquality$Month %in% c(5, 6), ], aes(x = as.character(Month), 
    y = Temp)) + geom_violin() + stat_compare_means() + theme_bw(base_size = 14)

We can also extend the comparison to more than two groups. This, however, requires a bit of work. We first need to define the various pairwise comparisons we wish to perform. Subsequently we pass this list of comparisons to stat_compare_means.

my_comparisons <- list(c("5", "6"), c("7", "8"), c("6", "8"), c("7", "9"))
## Plot p-values for specified comparisons
ggplot(airquality, aes(x = as.character(Month), y = Temp)) + geom_violin() + geom_point(alpha = 0.5) + 
    stat_compare_means(comparisons = my_comparisons, method = "t.test") + theme_bw(base_size = 14)

You can calculate the significance of difference in means between all groups relative to a reference, like so:

ggplot(airquality, aes(x = as.character(Month), y = Temp)) + geom_violin() + geom_point(alpha = 0.5) + 
    stat_compare_means(method = "t.test", ref.group = "5") + theme_bw(base_size = 14)

ggpubr magic

ggpubr’s methods theme_pubclean and theme_pubr shift the focus of your plot to your data.

my_comparisons <- list(c("5", "6"), c("7", "8"), c("6", "8"), c("7", "9"))

ggplot(airquality, aes(x = as.character(Month), y = Temp)) + geom_violin() + geom_point(alpha = 0.5) + 
    stat_compare_means(comparisons = my_comparisons, method = "t.test") + labs(x = "Month", 
    y = "Temperature") + theme_pubclean(base_size = 14)

Additional Resources

Understanding reshape2, wide and long formats
Outlier detection with R
Understanding Outliers and their relevance Detailed lecture on data cleanup with R
Using ggpubr to calculate significance

Take-aways

Basic smell-tests on your data
Removing cases with missing data
Identifying outliers
Descriptive statistics
Using ggpubr to create publication-ready figures

Extra

Pairs plots for expression data

Given a few samples (observations) with a large number of genes (variables), we can quickly evaluate if certain samples are outliers, simply by comparing the pair-wise scatterplots for all the samples

## We can also plot pairwise scatterplots
brca_df = data_merged[data_merged$cl_disease_detail %in% c("normal", "breast medullary carcinoma"), 
    ]
## Reverse the melt step
brca_df_recast = dcast(brca_df[, c("Sample", "Gene", "Expression")], Sample ~ Gene)
rownames(brca_df_recast) = brca_df_recast$Sample
brca_df_recast$Sample <- NULL
## Remove the outlier gene
brca_df_recast = brca_df_recast[, !(colnames(brca_df_recast) %in% c("TSPAN6"))]
pairs(t(brca_df_recast), panel = function(...) smoothScatter(..., add = TRUE))

Filtering genes by average value or standard deviation

We firstly identify genes that vary within disease types. We will compare breast adenocarcinomas and breast ductal carcinomas.

## Select the samples from the metadata dataframe
samples_brca = metadata_df[metadata_df$cl_disease_detail %in% c("breast adenocarcinoma", 
    "breast ductal carcinoma"), ]
## Filter our dataframe based on this list
brca_cohorts_df = data_df_clean2[rownames(data_df_clean2) %in% samples_brca$cl_id, 
    ]
### Compare this dataframe to what you get with the following command
brca_cohorts_testdf = data_df_clean2[samples_brca$cl_id, ]

We will do some filtering to identify the most variable genes. Bioconductor’s package genefilter also has some of these pre-implemented.

# Calculate mean of each gene
avg_genes = sapply(brca_cohorts_df, mean)
# Calculate standard deviation of each gene
sd_genes = sapply(brca_cohorts_df, sd)
# Filter dataframe based on an SD threshold of your choosing
brca_cohorts_filt = brca_cohorts_df[, names(sd_genes[sd_genes == max(sd_genes)]), 
    drop = FALSE]
brca_cohorts_filt$cl_id = rownames(brca_cohorts_filt)
brca_cohorts_filt = merge(brca_cohorts_filt, metadata_df)

ggplot(brca_cohorts_filt, aes(x = cl_disease_detail, y = TFF1)) + geom_boxplot() + 
    theme_bw(base_size = 16) + stat_compare_means(method = "t.test")

Data cleanup and summary statistics with R

Jasleen Grewal

2019-02-21