tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are that each column is a variable and each row is an observation.

Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases). Here’s an example of how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:

library(tidyr)

We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:

messy %>%

Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position). Take this example from Stack Overflow (modified slightly for brevity). We have some measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.

Next we use separate() to split the key into location and time, using a regular expression to describe the character that separates them.
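As a rough end-to-end sketch of the gather() and separate() steps described in this post, the following builds two small data frames and tidies them. All values are invented for illustration. The names phone, tidy_hr, tidier and tidy_phone, the placeholder person names, and the minutes value column are assumptions, not the original example's data.

```r
library(tidyr)

# Made-up heart-rate data: one row per person, one column per drug (a, b).
messy <- data.frame(
  name = c("P1", "P2", "P3"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

# Gather the a and b columns into key-value pairs of drug and heartrate.
tidy_hr <- gather(messy, drug, heartrate, a:b)

# Made-up phone-use data: location and time are clumped into column names.
phone <- data.frame(
  id = 1:4,
  trt = c("control", "treatment", "treatment", "control"),
  work.T1 = c(10, 20, 30, 40),
  home.T1 = c(15, 25, 35, 45),
  work.T2 = c(12, 22, 32, 42),
  home.T2 = c(18, 28, 38, 48)
)

# Gather every measurement column (all except id and trt) into key/minutes.
tidier <- gather(phone, key, minutes, -id, -trt)

# Split key on the literal "." into location and time; sep is a regular
# expression, so the dot has to be escaped.
tidy_phone <- separate(tidier, key, into = c("location", "time"), sep = "\\.")
```

Here tidy_hr should have one row per person and drug, and tidy_phone one row per person, location and time. The value column is called minutes only to avoid clashing with the time column that separate() creates.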