Tutorial 2

Author

Yu Cheng Hsu, BBMS1021 teaching team

Published

September 15, 2025

Cheatsheets

If you forget ggplot2 syntax, here is a cheatsheet for your reference: ggplot2 Cheatsheet

Question Set 1

This question set builds on Question Set 4 from Tutorial 1. If you haven’t loaded the data yet, follow the instructions in Q1 and Q2. Otherwise, proceed directly to Q3.

Q1: On your Desktop, create a subfolder named R_exercises and download the differential expression results file Diff_Expression_results.tsv (or view here) into it. Check your current working directory with the getwd() function, then change it to the newly created folder. Hint: Use setwd("<path to R_exercises>"), replacing <path to R_exercises> with the actual path to the folder, or use the Session menu in RStudio to set the working directory.

Q2: Use the read.table() function to load the downloaded file into a data frame named df_DEG. Hint: Give the full path to the file, or just the file name if the file is in the current working directory. Make sure to pass header=TRUE as an argument. To open the help page for a function, run ?function_name in R.
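The loading step in Q2 can be sketched on a small stand-in file. This is only an illustration: the file contents are made up, the gene column is a hypothetical name, and sep = "\t" is an assumption based on the .tsv extension.

```r
# Toy stand-in for Diff_Expression_results.tsv. The values are made up and the
# "gene" column name is hypothetical; only baseMean, log2FoldChange and padj
# are column names mentioned in this tutorial.
toy_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "gene\tbaseMean\tlog2FoldChange\tpadj",
  "geneA\t120.5\t2.1\t0.001",
  "geneB\t15.2\t-0.4\t0.60",
  "geneC\t980.0\t-6.3\t0.02"
), toy_file)

# header = TRUE uses the first line as column names;
# sep = "\t" splits fields on tabs (assumed from the .tsv extension).
df_toy <- read.table(toy_file, header = TRUE, sep = "\t")

# Quick check of what was loaded.
str(df_toy)
```

For the real exercise, replace toy_file with the path to Diff_Expression_results.tsv and store the result in df_DEG.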

Q3: Use the summary() function to get an overview of the df_DEG data frame.

Q4: Create a histogram of the log2FoldChange column in df_DEG using both the base R function hist() and the ggplot2 package. For the ggplot2 plot, add the figure title Histogram of log2FoldChange.
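As a rough sketch of the two plotting styles in Q4, the example below uses simulated values in place of df_DEG. The data frame df_toy and the choice of 30 bins are assumptions made for illustration.

```r
library(ggplot2)

# Toy stand-in for df_DEG: simulated values, not the real results.
set.seed(1)
df_toy <- data.frame(log2FoldChange = rnorm(200, mean = 0, sd = 2))

# Base R: hist() draws immediately on the active graphics device.
hist(df_toy$log2FoldChange,
     main = "Histogram of log2FoldChange",
     xlab = "log2FoldChange")

# ggplot2: build a plot object layer by layer, then print it to draw.
p_hist <- ggplot(df_toy, aes(x = log2FoldChange)) +
  geom_histogram(bins = 30) +   # bins = 30 is an arbitrary choice
  ggtitle("Histogram of log2FoldChange")
print(p_hist)
```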

Q5: Create a scatter plot with log10(baseMean) on the x-axis and log2FoldChange on the y-axis from the df_DEG data frame. Use both the base R plot() function and the ggplot2 package.

Q6: Manipulate the data frame by adding two columns:
- Add a column log2FC_clip for clipping log2FoldChange to the range [-5, +5].
- Add a column is_DE to indicate if padj < 0.05.
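One possible way to build the two columns, shown on a toy data frame with made-up values. Using pmin() and pmax() for clipping is one option among several, not necessarily the intended solution.

```r
# Toy stand-in for df_DEG with made-up values, including a missing padj.
df_toy <- data.frame(
  log2FoldChange = c(2.1, -0.4, -6.3, 7.8),
  padj = c(0.001, 0.60, 0.02, NA)
)

# Clip to [-5, +5]: pmin() caps values at 5, then pmax() raises values below -5.
df_toy$log2FC_clip <- pmax(pmin(df_toy$log2FoldChange, 5), -5)

# Logical column: TRUE where padj < 0.05. A missing padj gives NA, not FALSE.
df_toy$is_DE <- df_toy$padj < 0.05

print(df_toy)
```

If the real padj column contains missing values, table(df_toy$is_DE, useNA = "ifany") also reports how many rows are NA.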

Q7: Use the table() function to count the values in the is_DE column.

Q8: Using ggplot2, recreate the scatter plot from Q5 with the points colored by the newly added column is_DE. To make the figure clearer, set the color legend title to Is differentially expressed and relabel TRUE as Yes and FALSE as No.
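A minimal sketch of the coloring and legend changes, assuming the scatter plot uses the same axes as Q5. The data are made up, and scale_color_discrete() is just one of several scale functions that can rename the legend.

```r
library(ggplot2)

# Toy stand-in for df_DEG with made-up values.
set.seed(1)
df_toy <- data.frame(
  baseMean = 10^runif(100, min = 0, max = 4),
  log2FoldChange = rnorm(100, sd = 2),
  padj = rep(c(0.01, 0.5), times = 50)
)
df_toy$is_DE <- df_toy$padj < 0.05

# Map is_DE to color. For a logical column the legend entries are ordered
# FALSE then TRUE, so the labels are given in that order.
p_de <- ggplot(df_toy, aes(x = log10(baseMean), y = log2FoldChange, color = is_DE)) +
  geom_point() +
  scale_color_discrete(name = "Is differentially expressed",
                       labels = c("No", "Yes"))
print(p_de)
```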

Q9: For the scatter plot from Q8, reverse the color order so that TRUE (Yes) is blue and FALSE (No) is red. How do you adjust the code? Hint: ggplot2 assigns colors following the order of the factor levels. Use factor() with its levels argument to control the order of the categories.
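The effect of the levels argument mentioned in the hint can be seen without any plotting. The vector below is made up for illustration.

```r
# Made-up logical vector standing in for the is_DE column.
is_DE <- c(TRUE, FALSE, TRUE, FALSE)

# Default: levels are sorted, so FALSE comes before TRUE.
default_factor <- factor(is_DE)
levels(default_factor)    # "FALSE" "TRUE"

# Explicit levels: TRUE now comes first, so it is the first category
# a plot would assign a color to.
reordered_factor <- factor(is_DE, levels = c(TRUE, FALSE))
levels(reordered_factor)  # "TRUE" "FALSE"
```

Mapping the reordered factor to color in aes() instead of the raw logical column changes the order in which ggplot2 hands out colors and legend entries.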

Q10: Save the scatter plot from Q8 or Q9 into a file named “My_DEG_results.pdf”.
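Saving a ggplot2 figure can be sketched with ggsave(). The plot here is a placeholder, the width and height (in inches) are arbitrary, and the file is written to a temporary folder only so the example does not clutter the working directory.

```r
library(ggplot2)

# Placeholder plot; in the exercise this would be the plot from Q8 or Q9.
df_toy <- data.frame(x = 1:10, y = (1:10)^2)
p <- ggplot(df_toy, aes(x = x, y = y)) +
  geom_point()

# ggsave() picks the output format from the file extension (.pdf here).
# tempdir() is used for illustration; a plain file name would save into
# the current working directory instead.
out_file <- file.path(tempdir(), "My_DEG_results.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```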

Question Set 2

Import and inspect the iris dataset, which is built into R and available by default. It records the lengths and widths of the sepals and petals (parts of the flower) for three iris species.

Q1: Plot a boxplot showing the petal length distribution among different species in the iris dataset, with the following requirements:
1. Place Species on the y-axis.
2. Label the x-axis as Petal Length (cm).
3. Label the y-axis as Species.
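One way the three requirements might be combined is sketched below. It assumes a ggplot2 version that accepts a categorical variable on the y-axis for geom_boxplot() (3.3.0 or later).

```r
library(ggplot2)

# Species on the y-axis gives horizontal boxes; labs() sets both axis labels.
p_box <- ggplot(iris, aes(x = Petal.Length, y = Species)) +
  geom_boxplot() +
  labs(x = "Petal Length (cm)", y = "Species")
print(p_box)
```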

Q2: Create a plot showing the association between Petal.Width and Sepal.Width among different Species in the iris dataset, with the following requirements:
1. Place Petal.Width on the x-axis.
2. Place Sepal.Width on the y-axis.
3. Label the x-axis as Petal Width (cm).
4. Label the y-axis as Sepal Width (cm).
5. Assign colors to differentiate Species.
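The requirements above could be met with a scatter plot, sketched here as one possible reading of the question. Other plot types may also be acceptable.

```r
library(ggplot2)

# Scatter plot of the two widths, with one color per species.
p_iris <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(x = "Petal Width (cm)", y = "Sepal Width (cm)")
print(p_iris)
```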

Question Set 3

This question set is adapted from the NIH Bioinformatics Training and Education Program. The original question is slightly more complex. If you are interested in more challenging questions, visit the website.

Let’s load the Titanic dataset.

Column	Description
Survived	0 (No) / 1 (Yes)
Pclass	1 / 2 / 3 (Passenger Class)
Name	Passenger Name
Sex	Male / Female
Age	Age
Siblings.Spouses.Aboard	Number of siblings/spouses aboard
Parents.Children.Aboard	Number of parents/children aboard
Fare	Fare Price

Q1: Inspect the data using str(), summary(), and head().

Q2: Explore the relationship between the passenger’s age and fare. What type of plot should you use? Write the code using ggplot2.

Q3: Color the points from Q2 by Pclass. Remember that Pclass is a proxy for socioeconomic status. While the values are treated as numeric upon loading, they are categorical and should be treated as such. You will need to coerce Pclass into a categorical (factor) variable using factor() or as.factor().
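The coercion step can be illustrated on a short made-up vector of class codes, independent of the Titanic data.

```r
# Made-up passenger class codes, stored as numbers.
pclass_raw <- c(3, 1, 2, 3, 1)
is.numeric(pclass_raw)   # TRUE

# factor() turns the numbers into categories with levels "1", "2", "3".
pclass_cat <- factor(pclass_raw)
is.factor(pclass_cat)    # TRUE
levels(pclass_cat)       # "1" "2" "3"

# Counts per category.
table(pclass_cat)
```

When a numeric column is mapped to color, ggplot2 draws a continuous gradient; a factor gives one distinct color per level.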

Q4: Modify the plot from Q3 to update the legend labels to (1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class).
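A sketch of the relabeling on a tiny made-up data frame. The values are invented, the legend title Passenger class is an assumption, and scale_color_discrete() is one of several ways to set the labels.

```r
library(ggplot2)

# Toy stand-in for the Titanic data: made-up values, same column names.
toy_titanic <- data.frame(
  Age = c(22, 38, 26, 35, 54, 2),
  Fare = c(7.25, 71.28, 13.00, 53.10, 26.00, 21.08),
  Pclass = c(3, 1, 2, 1, 2, 3)
)

# Treat Pclass as categorical; the levels are ordered "1", "2", "3".
toy_titanic$Pclass <- factor(toy_titanic$Pclass)

# Labels are given in the same order as the factor levels.
p_class <- ggplot(toy_titanic, aes(x = Age, y = Fare, color = Pclass)) +
  geom_point() +
  scale_color_discrete(name = "Passenger class",
                       labels = c("1st Class", "2nd Class", "3rd Class"))
print(p_class)
```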