COMP 5960/6960 - 090 Programming for BioMedical Data Science (R)

University of Utah

Semester: Spring 2026
Time: Self-Paced
Location: Online

Instructor: Tingying He (tingyinghe@sci.utah.edu)
Faculty Coordinator: Jeff Phillips (jeff.m.phillips@utah.edu)

Office Hours: Tuesdays, 3:00–4:00 PM, WEB 2821

Zoom meetings are available by appointment only, for students who have questions but cannot attend in person. To schedule a virtual session during office hours, please email the instructor at least 2 hours in advance. If the instructor is unable to hold in-person office hours, an announcement will be made in advance.

General Information

This course provides an introduction to programming in R, with topics and pace designed for biomedical students interested in data science. Prior programming experience is not required. Students will learn how to write code for handling data, focusing on data frame representations. Using these common representations, students will learn to prepare data for analysis starting from various formats, visualize its contents, and perform basic analysis to evaluate the veracity of the data. This course is structured as a series of stackable short courses; students need to complete four short courses in the semester to fulfill the requirements for this credit-earning course.

This course is 100% self-paced and online.

Short courses

This is a composite course consisting of the four self-paced online short courses listed below.

You should complete all four of these courses sequentially, starting with Introduction to R for Data Analysis, and ending with Statistics for Data Analysis.

If you finish a course early, you are welcome to start the next course (you do not need to wait until after the previous course’s due date).

1. Introduction to R for Data Analysis

Length: 5.5 hours of pre-recorded lectures, 3 projects

Due date: Monday, February 9, 2026

Description: This course introduces the R programming language and is designed for beginners who are new to R and coding. Specific topics include using RStudio and writing documents with Quarto; basic coding principles; defining variables, vectors, and data frames; pipes; data manipulation with dplyr; and data visualization with ggplot2.

Course Learning Objectives:

Perform operations with character, numeric, and logical (Boolean) objects in R
Use the dplyr library functions select, mutate, filter, summarize, and group_by to manipulate and summarize tabular data
Use the ggplot2 library to create customizable data visualizations, including histograms, bar charts, and scatterplots

2. Advanced R for Data Analysis

Length: 4 hours of pre-recorded lectures, 4 projects

Due date: Monday, March 2, 2026

Description: You’ve learned the basics of R and the tidyverse, and now you’re ready to conduct more sophisticated data manipulation and analysis. This course is designed to elevate your R programming expertise, building on the foundations laid in the Introduction to R for Data Analysis course. You’ll learn advanced techniques and powerful tools that will transform your data workflows and analytical capabilities, such as creating custom R functions to automate your data science pipeline, reducing code redundancy with map functions, refactoring column variables, joining multiple data frames, and reshaping data. This course utilizes the R programming language, and it is assumed that learners have taken our “Introduction to R for Data Analysis” course or have equivalent experience.

Course Learning Objectives:

Write your own custom R functions
Reduce redundancy in your code using iteration techniques with the purrr package
Refactor column variables
Reshape your data frames
Join multiple data frames together

3. Data Cleaning with R

Length: 6 hours of pre-recorded lectures, 6 projects

Due date: Monday, March 30, 2026

Description: Real-world data can never perfectly capture the real world, and it rarely arrives in a ready-for-analysis format. Before your data is ready for analysis, it is critical that it is formatted appropriately and that you have ensured, to the greatest extent possible, that it represents the reality it was originally designed to capture. The process of molding a dataset into a format that satisfies these criteria is known as “data cleaning”. While it may seem like a boring process, data cleaning is arguably the most important stage of the entire data science life cycle, since it ensures that you have a clear understanding of how real-world information is represented in your data, as well as of the data’s limitations. Since every dataset is “messy” in its own way, the process of data cleaning will be unique to every dataset. This course introduces a series of steps that help you understand your data and create a custom data cleaning procedure for any dataset, with a focus on biomedical data such as electronic health records and health survey data. Specific topics include identifying and addressing missing values; handling data quality issues such as invalid and inconsistent values; reshaping data into a “tidy” format; and creating a custom function in R that provides a reproducible and modifiable data cleaning pipeline for your project. This course utilizes the R programming language, and it is assumed that learners have taken our “Introduction to R for Data Analysis” course or have equivalent experience. It is recommended that students have also taken our “Advanced R for Data Analysis” course, but this is not a requirement.

Course Learning Objectives:

Understanding data collection procedures and data dictionaries
Loading and pre-formatting data in R
Identifying and handling missing values
Identifying and handling invalid and inconsistent values
Converting data to a tidy data format
Creating and implementing a reproducible data cleaning pipeline

4. Statistics for Data Analysis

Length: 7.5 hours of pre-recorded lectures, 4 projects

Due date: Monday, April 20, 2026

Description: This course is designed to introduce you to the main statistical techniques you’ll need for analyzing real-world data using R. Statistics involves learning about a population by examining data from a sample drawn from that population. This course will focus on applying common statistical methods to biomedical data. For example, you might use the skills from this course to analyze patient information from hospital records to understand health trends in the broader population, or to evaluate the effects of a new drug on a small group of patients to assess its impact on all potential patients. It is recommended that students have taken our “Advanced R for Data Analysis” and “Data Cleaning with R” courses, but these are not requirements.

Course Learning Objectives:

Apply statistical thinking
Perform statistical inference
Calculate and interpret confidence intervals
Conduct hypothesis testing
Implement and interpret linear regression models

Grading

Grade: Each short course contains a number of projects, which appear at the end of each “module”.

Your grade for each short course will be based on the average of your individual project scores.

Your grade for the entire composite course will correspond to the average score you receive across the four short courses.
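
As an illustration only, the averaging described above could be sketched in R as follows. The project scores are hypothetical, the variable names are made up for this example, and the sketch assumes a plain unweighted average at both levels; it does not model the late policy.

```r
# Hypothetical project scores (in percent) for each short course.
# These numbers are invented for illustration; they are not real grades.
project_scores <- list(
  intro_r       = c(90, 80, 100),
  advanced_r    = c(85, 95, 75, 85),
  data_cleaning = c(100, 90, 80, 70, 90, 80),
  statistics    = c(88, 92, 96, 84)
)

# Each short-course grade is the average of that course's project scores.
short_course_grades <- sapply(project_scores, mean)

# The composite grade is the average of the four short-course grades
# (assumed unweighted for this sketch).
composite_grade <- mean(short_course_grades)

print(short_course_grades)
print(composite_grade)
```

Under this reading, each short course counts equally toward the composite grade regardless of how many projects it contains.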

Late policy: If you do not complete a short course by its due date (see above), you will lose 2% of your grade for that short course for each day it is late, unless instructor permission is granted.

The due dates shown on individual Canvas assignment pages are suggested planning dates only to help you pace your work. The official deadlines for all assignments in each short course are the short-course due dates listed on this syllabus page.

Final exam: There is no final exam for this course.

Contact

Please email Tingying He (tingyinghe@sci.utah.edu) and Jeff Phillips (jeff.m.phillips@utah.edu) with questions or for more information.