Tutorial 2
Cheatsheets
If you forget ggplot2 syntax, here is a cheatsheet for your reference: ggplot2 Cheatsheet
Question Set 1
This question set builds on Question Set 4 from Tutorial 1. If you haven’t loaded the data yet, follow the instructions in Q1 and Q2. Otherwise, proceed directly to Q3.
- Q1: On your Desktop, create a subfolder named `R_exercises` and download the file containing the differentially expressed genes results, Diff_Expression_results.tsv (or view here), to this folder. Check your current working directory using the `getwd()` function and change it to the newly created folder. Hint: Use `setwd("<path to R_exercises>")` to change the working directory, replacing `<path to R_exercises>` with the actual path to the folder, or use the `Session` menu in RStudio to set the working directory.
- Q2: Use the `read.table()` function to load the downloaded file into a data frame named `df_DEG`. Hint: Use the full path, or just the file name if the file is in the current working directory. Make sure `header=TRUE` is set as an argument. To find help for a function, use `?function_name` in R.
- Q3: Use the `summary()` function to get an overview of the `df_DEG` data frame.
- Q4: Create a histogram of the `log2FoldChange` column in `df_DEG` using both the basic plotting function `hist()` and the `ggplot2` package. For the ggplot2 plot, add the figure title `Histogram of log2FoldChange`.
- Q5: Create a scatter plot with `log10(baseMean)` on the x-axis and `log2FoldChange` on the y-axis from the `df_DEG` data frame. Use both the basic `plot()` function and the `ggplot2` package.
- Q6: Manipulate the data frame by adding two columns:
  - Add a column `log2FC_clip` that clips `log2FoldChange` to the range `[-5, +5]`.
  - Add a column `is_DE` to indicate whether `padj < 0.05`.
- Q7: Use the `table()` function to count the values in the `is_DE` column.
- Q8: Using ggplot2, create a scatter plot colored by the newly added column `is_DE`. To make the figure clear, set the color legend title to `Is differentially expressed` and map the labels from `TRUE` to `Yes` and `FALSE` to `No`.
- Q9: For the scatter plot from Q8, reverse the color order so that `TRUE (Yes)` is blue and `FALSE (No)` is red. How do you adjust the code? Hint: ggplot uses the order of factor levels to assign the colors. Use `factor()` and set the `levels` parameter to control the order of categories.
- Q10: Save the scatter plot from Q8 or Q9 into a file named “My_DEG_results.pdf”.
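The column manipulation in Q6 and the factor-level hint in Q9 can be sketched on a small made-up data frame. This is only an illustration under stated assumptions, not the reference solution: `toy`, its values, the extra column `is_DE_label`, and the output file name are all hypothetical, and `pmin()`/`pmax()` is just one of several ways to clip a numeric vector.

```r
library(ggplot2)

# Toy stand-in for df_DEG; the values are invented for illustration only.
toy <- data.frame(
  baseMean       = c(10, 250, 1200, 35, 980),
  log2FoldChange = c(-7.2, 0.4, 5.9, -1.1, 2.3),
  padj           = c(0.001, 0.80, 0.03, 0.20, 0.04)
)

# Clip to [-5, 5]: pmin() caps values above 5, pmax() raises values below -5.
toy$log2FC_clip <- pmax(pmin(toy$log2FoldChange, 5), -5)

# Logical flag; an NA in padj would give NA here rather than FALSE.
toy$is_DE <- toy$padj < 0.05

# Count how many rows fall into each category.
print(table(toy$is_DE))

# Convert the flag to a factor with an explicit level order and readable labels.
# The order given in `levels` sets the legend order and the order in which
# ggplot2 hands out its default colors.
toy$is_DE_label <- factor(toy$is_DE, levels = c(TRUE, FALSE), labels = c("Yes", "No"))

# Scatter plot colored by the labelled factor, with a custom legend title.
p <- ggplot(toy, aes(x = log10(baseMean), y = log2FC_clip, color = is_DE_label)) +
  geom_point() +
  labs(color = "Is differentially expressed")

# Save to a temporary location (hypothetical file name) so nothing is overwritten.
out_file <- file.path(tempdir(), "toy_scatter.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Swapping the order of the values passed to `levels` (and the matching `labels`) is enough to swap which category receives which default color; `scale_color_manual()` is an alternative when specific colors need to be fixed by name.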
Question Set 2
Import and inspect the `iris` dataset. This is a built-in dataset in R that is available by default. It records the widths and lengths of petals and sepals (parts of the flower) for different iris species.
- Q1: Plot a boxplot showing the petal length distribution among different species in the `iris` dataset, with the following requirements:
  - Place `Species` on the y-axis.
  - Label the x-axis as `Petal Length (cm)`.
  - Label the y-axis as `Species`.
- Q2: Create a plot showing the association between `Petal.Width` and `Sepal.Width` among different `Species` in the `iris` dataset, with the following requirements:
  - Place `Petal.Width` on the x-axis.
  - Place `Sepal.Width` on the y-axis.
  - Label the x-axis as `Petal Width (cm)`.
  - Label the y-axis as `Sepal Width (cm)`.
  - Assign colors to differentiate `Species`.
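The two sets of requirements above map fairly directly onto ggplot2 aesthetics and labels. The following is a minimal sketch rather than the only valid answer; the object names `p_box` and `p_scatter` are hypothetical, and placing the categorical variable on the y-axis directly in `aes()` assumes ggplot2 3.3.0 or later (older versions would need `coord_flip()` instead).

```r
library(ggplot2)

# Horizontal boxplot: the numeric variable goes on x, the grouping variable on y.
p_box <- ggplot(iris, aes(x = Petal.Length, y = Species)) +
  geom_boxplot() +
  labs(x = "Petal Length (cm)", y = "Species")

# Scatter plot of the two widths, with one color per species.
p_scatter <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(x = "Petal Width (cm)", y = "Sepal Width (cm)")
```

Typing `p_box` or `p_scatter` at the console (or calling `print()` on them) displays the figure.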
Question Set 3
This question set is adapted from the NIH Bioinformatics Training and Education Program. The original question is slightly more complex. If you are interested in more challenging questions, visit the website.
Let’s load the Titanic dataset.
| Column | Description |
|---|---|
| Survived | 0 (No) / 1 (Yes) |
| Pclass | 1 / 2 / 3 (Passenger Class) |
| Name | Passenger Name |
| Sex | Male / Female |
| Age | Age |
| Siblings.Spouses.Aboard | Number of siblings/spouses aboard |
| Parents.Children.Aboard | Number of parents/children aboard |
| Fare | Fare Price |
- Q1: Inspect the data using `str()`, `summary()`, and `head()`.
- Q2: Explore the relationship between the passenger’s age and fare. What type of plot should you use? Write the code using `ggplot2`.
- Q3: Color the points from Q2 by `Pclass`. Remember that `Pclass` is a proxy for socioeconomic status. Although its values are read in as numeric when the data is loaded, they represent categories and should be treated as such. You will need to coerce `Pclass` into a categorical (factor) variable using `factor()` or `as.factor()`.
- Q4: Modify the plot from Q3 to update the legend labels to (1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class).
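The coercion described in Q3 and the relabelling in Q4 can be illustrated on a few made-up rows. This is a sketch under stated assumptions: `toy_titanic`, its values, the helper column `Pclass_f`, and the legend title are hypothetical, and only the column names `Age`, `Fare`, and `Pclass` come from the table above.

```r
library(ggplot2)

# Hypothetical stand-in rows; the real Titanic data is not reproduced here.
toy_titanic <- data.frame(
  Age    = c(22, 38, 26, 35, 54, 2),
  Fare   = c(7.25, 71.28, 7.92, 53.10, 51.86, 21.08),
  Pclass = c(3, 1, 3, 1, 2, 3)
)

# Pclass is numeric after loading; coerce it to a factor so ggplot2 uses a
# discrete color scale (one color per class) instead of a continuous gradient.
toy_titanic$Pclass_f <- factor(toy_titanic$Pclass, levels = c(1, 2, 3))

# Scatter plot of fare against age, colored by class, with custom legend labels.
# The labels are listed in the same order as the factor levels.
p <- ggplot(toy_titanic, aes(x = Age, y = Fare, color = Pclass_f)) +
  geom_point() +
  scale_color_discrete(
    name   = "Passenger class",
    labels = c("1st Class", "2nd Class", "3rd Class")
  )
```

An alternative is to set the labels when creating the factor, via the `labels` argument of `factor()`, in which case no scale adjustment is needed.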