7 Combining Echo360 data with other sources of data with tidyverse
7.1 Introduction
So far, we have focused on analysing Echo360 data in isolation. However, you might be interested in combining data from multiple sources to explore if there are relationships between lecture capture and engagement/attainment data.
In this tutorial, we will demonstrate how you can combine data from multiple sources and the kind of identifiers you will need to match people across data sets. In addition to Echo360 data we have used so far, we also have two types of data from the virtual learning environment (VLE) Moodle, based on an introductory research methods course (RM1). We have mock checklist data for how many tasks people complete each week and mock attainment data for scores on a multiple choice quiz (MCQ). We applied the same synthetic data process and created anonymous names and emails to provide consistent information across files to join.
To follow the tutorial, please download the Echo360 and Moodle data from this .zip file..
library(tidyverse) # Package of packages for plotting and wrangling
library(plotly) # Creates interactive plots
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
7.2 Reading in the data
The first step is to read in the data we need for the tutorial. For the first example, we will read in data from one lecture. For later examples, we will combine data from multiple lectures. We only have three lectures of Echo360 data for this course, but to reduce repetition and scale better if you are working with more files, we will still read in data by listing files instead of assigning each file individually to an object. We are using relatives paths for the tutorial and assume you have a folder called data
in your working directory and two subfolders called RM1_duplicate
and Moodle_duplicate
. If you saved the data in another way, you will need to edit the paths.
# One lecture for an example
lecture_01 <- read_csv("data/RM1_duplicate/Lecture01.csv")
# Reading the Echo360 data from three lectures into one data object
# Obtain list of files from directory
files <- list.files(path="data/RM1_duplicate/",
pattern=".csv",
full.names = TRUE)
# Read in all the files to one object and label each file name
Echo360_data <- read_csv(files,
id="file_name")
# Preview the columns and data
glimpse(Echo360_data)
## Rows: 145
## Columns: 16
## $ file_name <chr> "data/RM1_duplicate/Lecture01.csv", "data/RM1_duplic…
## $ media_id <chr> "28dcfb20-a198-438b-8dbe-1b5725514231", "28dcfb20-a1…
## $ media_name <chr> "RM1 Lecture 1", "RM1 Lecture 1", "RM1 Lecture 1", "…
## $ create_date <chr> "09/23/2022", "09/23/2022", "09/23/2022", "09/23/202…
## $ duration <time> 01:00:00, 01:00:00, 01:00:00, 01:00:00, 01:00:00, 0…
## $ owner_name <chr> "James Bartlett", "James Bartlett", "James Bartlett"…
## $ course <chr> "Research Methods 1 (PGT Conv)", "Research Methods 1…
## $ user_name <chr> "Yaw, Collin", "Viaan, Jordan", NA, NA, "Sunny, Hali…
## $ email_address <chr> "214488@university.ac.uk", "211717@university.ac.uk"…
## $ total_views <dbl> 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 1, 2, 1, 3, 3, 1, 1, 5…
## $ total_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:04:30, 00:01:00, 0…
## $ average_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:01:30, 00:01:00, 0…
## $ on_demand_views <dbl> 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 1, 2, 1, 3, 3, 1, 1, 5…
## $ live_view_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ downloads <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ last_viewed <chr> "11/04/2022", "09/30/2022", "02/10/2023", "10/21/202…
We only have individual files for the checklist and MCQ data, so we can assign those to individual objects.
7.3 Joining data using dplyr
The dplyr
package in tidyverse
has several functions that merge or "join" data. There are several versions of these functions depending on how you want to join the files. For a detailed overview, please refer to the dplyr reference page online.
There are two main types of joins which add and match columns from two data frames. There is inner_join()
which only retains observations from data frame 1 that also has a matching case in data frame 2. There are then different types of outer joins which retain observations that appear in at least one data frame, such as left_join()
which retains all observations in data frame 1.
To illustrate these functions, we will consider data from Echo360 together with data from a checklist which records the percentage of tasks that student have marked as complete and also with data containing the results of a multiple choice quiz. In both the checklist and MCQ data objects, each student only appears once.
7.3.1 Merging data where each student only appears once in each data object
If you are interested in combining Echo360 data from just one lecture (and thus where each student will appear once) together with the MCQ data we could follow the following steps.
First, select the variables of interest from each data object, where lecture_01
contains the Echo360 data from the first lecture.
# Manually select which columns to retain
lecture_01_merge <- lecture_01 %>%
select(user_name,
email_address,
total_views,
total_view_time,
average_view_time,
last_viewed)
MCQ_merge <- MCQ %>%
select(first_name,
surname,
id_number,
email_address,
quiz_mcq_summative_assessment_real,
quiz_mcq_summative_assessment_perc)
Second, specify the type of 'join' to apply. For this tutorial, we will demonstrate the different types of join to show how the resulting objects differ. Echo360 and Moodle store data differently, so we will need an identifier to use as a key to link people from both files. The participants' email is the only consistent identifier, so we will use that to link participants across files. Other identifying information like the name and student ID number are not consistent across files.
7.3.1.1 left_join()
The first type is left_join()
which retains all observations in data frame 1, meaning we keep everything in lecture_01_merge
and add information from MCQ_merge
. We state email_address
as the shared variable to act as the key.
lecture_01_MCQ_left <- left_join(lecture_01_merge,
MCQ_merge,
by = "email_address")
It is important you explore the resulting data frame to make sure everything is as expected.
glimpse(lecture_01_MCQ_left)
## Rows: 67
## Columns: 11
## $ user_name <chr> "Yaw, Collin", "Viaan, Jordan", NA,…
## $ email_address <chr> "214488@university.ac.uk", "211717@…
## $ total_views <dbl> 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 1, 2,…
## $ total_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:0…
## $ average_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:0…
## $ last_viewed <chr> "11/04/2022", "09/30/2022", "02/10/…
## $ first_name <chr> "Collin", "Jordan", NA, NA, "Hali",…
## $ surname <chr> "Yaw", "Viaan", NA, NA, "Sunny", "T…
## $ id_number <int> 214488, 211717, NA, NA, 278359, 272…
## $ quiz_mcq_summative_assessment_real <int> 17, 22, NA, NA, 19, 19, 22, NA, 20,…
## $ quiz_mcq_summative_assessment_perc <dbl> 77.27273, 100.00000, NA, NA, 86.363…
We started with 67 rows for the Echo360 data and 243 rows for the MCQ data, but because we only wanted to retain all Echo360 observations, we ended up with 67 rows.
If you do not enter a column to act as the key, dplyr
will do its best to identify if there are two columns with the same name. It will warn you about this to make sure you double check it is using the right variables. In the first example, the variable also has the same name which makes things easier. However, you might find the same information has a different name, so you can tell dplyr
which two columns you want to use as the joining key. See what these alternatives look like below.
# Warning which column it is joining by
lecture_01_MCQ_left <- left_join(lecture_01_merge,
MCQ_merge)
## Joining with `by = join_by(email_address)`
7.3.1.2 right_join()
In contrast, right_join()
retains all observations in data frame 2, which is the MCQ data in this example.
lecture_01_MCQ_right <- right_join(lecture_01_merge,
MCQ_merge,
by = "email_address")
It is important you explore the resulting data frame to make sure everything is as expected.
glimpse(lecture_01_MCQ_right)
## Rows: 243
## Columns: 11
## $ user_name <chr> "Yaw, Collin", "Viaan, Jordan", "Su…
## $ email_address <chr> "214488@university.ac.uk", "211717@…
## $ total_views <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 3,…
## $ total_view_time <time> 00:00:30, 00:01:00, 00:01:00, 00:0…
## $ average_view_time <time> 00:00:30, 00:01:00, 00:01:00, 00:0…
## $ last_viewed <chr> "11/04/2022", "09/30/2022", "10/08/…
## $ first_name <chr> "Collin", "Jordan", "Hali", "Franki…
## $ surname <chr> "Yaw", "Viaan", "Sunny", "Tzivia", …
## $ id_number <int> 214488, 211717, 278359, 272061, 275…
## $ quiz_mcq_summative_assessment_real <int> 17, 22, 19, 19, 22, NA, 20, 21, 18,…
## $ quiz_mcq_summative_assessment_perc <dbl> 77.27273, 100.00000, 86.36364, 86.3…
We started with 67 rows for the Echo360 data and 243 rows for the MCQ data, but because we wanted to retain all MCQ observations, we ended up with 243 rows.
7.3.1.3 full_join()
Finally, right_join()
retains all observations in data frame 1 and data frame 2.
lecture_01_MCQ_full <- full_join(lecture_01_merge,
MCQ_merge,
by = "email_address")
It is important you explore the resulting data frame to make sure everything is as expected.
glimpse(lecture_01_MCQ_full)
## Rows: 245
## Columns: 11
## $ user_name <chr> "Yaw, Collin", "Viaan, Jordan", NA,…
## $ email_address <chr> "214488@university.ac.uk", "211717@…
## $ total_views <dbl> 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 1, 2,…
## $ total_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:0…
## $ average_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:0…
## $ last_viewed <chr> "11/04/2022", "09/30/2022", "02/10/…
## $ first_name <chr> "Collin", "Jordan", NA, NA, "Hali",…
## $ surname <chr> "Yaw", "Viaan", NA, NA, "Sunny", "T…
## $ id_number <int> 214488, 211717, NA, NA, 278359, 272…
## $ quiz_mcq_summative_assessment_real <int> 17, 22, NA, NA, 19, 19, 22, NA, 20,…
## $ quiz_mcq_summative_assessment_perc <dbl> 77.27273, 100.00000, NA, NA, 86.363…
We started with 67 rows for the Echo360 data and 243 rows for the MCQ data, but because we wanted to retain all MCQ observations, we ended up with 245 rows.
To summarise:
The
left
join retains allnrow(lecture_01_MCQ_left)
student email addresses listed in the Echo360lecture_01_merge
object;Tthe
right
join retains allnrow(lecture_01_MCQ_right)
student email addresses listed in theMCQ_merge
object;The
full
join retains allnrow(lecture_01_MCQ_full)
unique student email addresses listed in bothlecture_01_merge
andMCQ_merge
.
On inspection, there are two observations in lecture_01_merge
that have missing values (NA
) for email_address
and these correspond to the two "extra" observations in lecture_01_MCQ_full
compared to lecture_01_MCQ_right
.
7.4 Merging data where each student may appear more than once
If we are interested in combining Echo360 data from multiple lectures (and thus where students may appear more than once) together with another data object where the students appear only once (such as the checklist and MCQ data described above), then we can still use the left_join()
function. Despite the function including the argument multiple
for the "handling of rows in x with multiple matches in y", this argument is not relevant to our scenario since if the Echo360 data is the x
argument and has multiple rows for each student it can be merged with another data source as the y
argument (within which each student only appears once) in one of the ways shown previously.
First, select the variables of interest from each data object, where Echo360_data
contains the data from the first three lectures.
Echo360_data_merge <- Echo360_data %>%
select(user_name,
email_address,
media_name,
total_views,
total_view_time,
average_view_time,
last_viewed)
MCQ_merge <- MCQ %>%
select(first_name,
surname,
id_number,
email_address,
quiz_mcq_summative_assessment_real,
quiz_mcq_summative_assessment_perc)
Where students may appear more than once, its necessary to include a variable that distinguishes between the repeated appearances. In this case its the variable media_name
which identifies the lecture that the Echo360 data is from.
Second, specify the type of 'join', in this case there is only one.
Echo360_data_MCQ_left <- left_join(Echo360_data_merge,
MCQ_merge,
by = "email_address")
Third, check that the input and output data objects are as we expect.
# Dimensions of original echo 360 data
glimpse(Echo360_data_merge)
# Dimensions of echo 360 data joined with MCQ data
glimpse(Echo360_data_MCQ_left)
## Rows: 145
## Columns: 7
## $ user_name <chr> "Yaw, Collin", "Viaan, Jordan", NA, NA, "Sunny, Hali…
## $ email_address <chr> "214488@university.ac.uk", "211717@university.ac.uk"…
## $ media_name <chr> "RM1 Lecture 1", "RM1 Lecture 1", "RM1 Lecture 1", "…
## $ total_views <dbl> 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 1, 2, 1, 3, 3, 1, 1, 5…
## $ total_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:04:30, 00:01:00, 0…
## $ average_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:01:30, 00:01:00, 0…
## $ last_viewed <chr> "11/04/2022", "09/30/2022", "02/10/2023", "10/21/202…
## Rows: 145
## Columns: 12
## $ user_name <chr> "Yaw, Collin", "Viaan, Jordan", NA,…
## $ email_address <chr> "214488@university.ac.uk", "211717@…
## $ media_name <chr> "RM1 Lecture 1", "RM1 Lecture 1", "…
## $ total_views <dbl> 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 1, 2,…
## $ total_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:0…
## $ average_view_time <time> 00:00:30, 00:01:00, 00:00:08, 00:0…
## $ last_viewed <chr> "11/04/2022", "09/30/2022", "02/10/…
## $ first_name <chr> "Collin", "Jordan", NA, NA, "Hali",…
## $ surname <chr> "Yaw", "Viaan", NA, NA, "Sunny", "T…
## $ id_number <int> 214488, 211717, NA, NA, 278359, 272…
## $ quiz_mcq_summative_assessment_real <int> 17, 22, NA, NA, 19, 19, 22, NA, 20,…
## $ quiz_mcq_summative_assessment_perc <dbl> 77.27273, 100.00000, NA, NA, 86.363…
The left
join retains all 145 observations listed in the Echo360_data_merge
object and combines them with the (single) corresponding observations in MCQ_merge
.
7.5 Visualizing merged data
Merging data enables the exploration of how video engagement (recorded in the Echo360 data) may be related to student performance. For example, the number of times each student views video content can be combined with their performance in an assessment (such as contained in the the MCQ data above).
First, we need some brief wrangling to summarise the total number of video views per student and repeat the joining process from before.
student_data <- Echo360_data %>%
group_by(email_address) %>%
summarise(total_views = sum(total_views)) %>%
mutate(email_address = factor(email_address)) %>%
ungroup()
student_MCQ <- left_join(student_data,
MCQ_merge,
by = "email_address") %>%
rename(MCQ_perc = quiz_mcq_summative_assessment_perc) # Rename super long name
We now have Echo360 and student attainment data that we can use to explore for potential patterns. For example, we can look at the relationship between these two variables in a scatterplot. There is a lot of overlap, so we have added a little jitter to the data points so its easier to see the density. This is why some of the points look like they extend beyond 100%.
fig.no_viewed.mcq <- student_MCQ %>%
ggplot(aes(x = total_views, y = MCQ_perc)) +
geom_jitter() +
labs(title= "Number of Video Views vs. Percentage Score in an MCQ") +
ylab("Percentage Score in MCQ (%)") +
xlab("No. of Video Views") +
theme_bw()
fig.no_viewed.mcq
From this plot, it does not look like there is a clear relationship as there are far more video views between 0 and 5, and clustered in the top left to show very good performance above 80%.
As in the previous chapter, you might want to identify some of these data points in an interactive version of the plot, so you could convert it using plotly
:
ggplotly(fig.no_viewed.mcq)
The purpose of our tutorials has not been to introduce you to inferential statistics, but having the data available means you can explore in different ways. The scatterplot did not suggest there was a clear relationship visually between total video views and MCQ attainment, but maybe there is a subtle statistical relationship.
Here, we can apply a Spearman correlation since the data may not meet parametric assumptions given the distribution of each variable.
cor.test(student_MCQ$total_views,
student_MCQ$MCQ_perc,
method = "spearman")
## Warning in cor.test.default(student_MCQ$total_views, student_MCQ$MCQ_perc, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: student_MCQ$total_views and student_MCQ$MCQ_perc
## S = 133575, p-value = 0.2004
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1370056
The correlation coefficient is slightly negative suggesting as the number of video views increases, the MCQ percentage tends to decrease, but the relationship is not statistically significant.
This might inspire you to explore additional relationships and variables to see if there patterns worth investigating further to answer research questions you might have.