Introduction to Quarto in RStudio

Welcome! This is a tutorial for using R and RStudio in Posit Cloud. Here you will learn the basics of using these tools to do data analysis.

The document you are currently reading is a Quarto document that has been “rendered” to html, so that you can read it in any browser.

A Quarto document combines both word processing AND data science tools, all in one. It is very convenient! You can do all of your work in one place. Then, when you are done, you can convert the document into a reader-friendly version, presenting only what you want to, and leaving the rest “under the hood”.

Create a Posit Cloud account

Go to the Posit Cloud site and click on “Get Started”.

Now click on “Cloud Free” on the upper left side (just above “$0 / forever”), and then click “Sign Up” on the next screen.

You can use whatever email you wish!

Starting a new project

Once you have a Posit Cloud account, start a new RStudio project by clicking on “New Project”, in the upper right corner, and then clicking “New RStudio Project”:

Creating a new Quarto document

To get started with Quarto, open a new Quarto document like this:

And then give it a title and hit “Create”:

To get started using Quarto documents, you will need to install the R “package” rmarkdown - just click “Install” in the yellow banner:

Before you move on, remember to save this new document to your general project folder:

Getting acquainted with your document

There are two ways to interact with your Quarto document: “Visual” and “Source” - you can see both toward the top left of the document. The “Visual” style is more like a typical word processing document, because text is rendered automatically to look like it will in the final document. The “Source” style shows the underlying source code as plain text. You can use either - whichever is more comfortable!

You will see that Quarto gives you some examples for how it is used. There are a few things to notice right away.

At the top is a “YAML” block - this is used to control overall options for the document, like whether it should be rendered to html or pdf. We won’t use it much.

Below that is the document content. You will see two different kinds of content: (1) text and (2) R code.

To write text, you just start typing anywhere: Quarto knows that it should be text. To format it, you can use the tools at the top (e.g., bold, italics, etc.). If you are using “Source”, there are other ways to change text format. The most common are (1) headers using #, ##, and ### (e.g., # This is a big header, with smaller headers as you add more #); (2) italics by wrapping words in * (e.g., *these words will be italicized*); and (3) boldface by wrapping words in ** (e.g., **these words will be boldface**).
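
For example, here is how those same formatting marks look as a few lines in the “Source” view:

# This is a big header
## This is a smaller header
Here are some *italicized* words and some **boldface** words.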

To write code in the R language, you need to put it in executable cells. Quarto will only know you are writing code if it is in one of these cells! Anything outside of them, it will assume is text. To start a new executable cell, you can either click “Insert/Executable Cell/R” or (more easily) type “Control-Alt-I”.

Give it a try: you will see a new cell appear. It has a label {r} to tell you it expects R code (if you had chosen “Python”, it would expect Python code, and so on).
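
If you switch to the “Source” view, that same cell appears as a fenced block that looks like this:

```{r}
# your R code goes here
```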

Click into your new code cell, and try out some math, to see how this works. In the cell, type 10 * 37, then hit the little play button in the upper-right corner of the cell (if you are using the “Source” view, you can type “Control-Shift-Enter” whenever your cursor is within the cell). Magic!
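
If it worked, you should see the answer printed just below the cell, like this:

10 * 37
[1] 370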

The last thing to know about your document is how to render it into a final product. This is easy: just click “Render” at the top. Once you do that, it will translate your Quarto document into an html document (it will open in a new window).

Getting your data into your RStudio project

You are going to be working with data. To do so, you first need to upload it into your project. This is easy! Just click “Upload” in the bottom right area of the screen and then click “Choose File” on the pop-up window. Navigate your files to wherever your data file is located, and then double-click it. Hit “OK” on the pop-up screen to continue.

Reading in your data

To work with your data, you need to read it into your document. Here is how (remember to write the code below [or any other code] in an executable cell!):

# read in the lectures_beps.csv data and store it in an object called `df` for data frame
df <- read.csv("lectures_beps.csv")

TIP: An important thing to know is that R will ignore anything in a code block that comes after a # on the same line. This is an easy way to leave notes and tell the person reading your code what you are doing. It is good practice to make a note for every chunk of code you write. I assure you, you will forget what you were trying to do! And your reader will be very confused if you don’t.
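
Here is a quick illustration (the object name answer is made up just for this example):

# this whole line is a note, and R will skip it
answer <- 10 * 37  # notes can also go after code on the same line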

If you want to see what is in your data, it is useful to summarize the variables:

# see what is in the data by summarizing all the variables
summary(df)
     vote                age        economic.cond.national
 Length:1525        Min.   :24.00   Min.   :1.000         
 Class :character   1st Qu.:41.00   1st Qu.:3.000         
 Mode  :character   Median :53.00   Median :3.000         
                    Mean   :54.18   Mean   :3.246         
                    3rd Qu.:67.00   3rd Qu.:4.000         
                    Max.   :93.00   Max.   :5.000         
 economic.cond.household     Blair           Hague          Kennedy     
 Min.   :1.00            Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.00            1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
 Median :3.00            Median :4.000   Median :2.000   Median :3.000  
 Mean   :3.14            Mean   :3.334   Mean   :2.747   Mean   :3.135  
 3rd Qu.:4.00            3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.00            Max.   :5.000   Max.   :5.000   Max.   :5.000  
     Europe       political.knowledge    gender         
 Min.   : 1.000   Min.   :0.000       Length:1525       
 1st Qu.: 4.000   1st Qu.:0.000       Class :character  
 Median : 6.000   Median :2.000       Mode  :character  
 Mean   : 6.729   Mean   :1.542                         
 3rd Qu.:10.000   3rd Qu.:2.000                         
 Max.   :11.000   Max.   :3.000                         

You can also see the data directly, in spreadsheet format:

# take a look at the data itself in spreadsheet format (opens a new tab in RStudio)
View(df)

# alternatively, use "head" to see the first few rows
head(df)
              vote age economic.cond.national economic.cond.household Blair
1 Liberal Democrat  43                      3                       3     4
2           Labour  36                      4                       4     4
3           Labour  35                      4                       4     5
4           Labour  24                      4                       2     2
5           Labour  41                      2                       2     1
6           Labour  47                      3                       4     4
  Hague Kennedy Europe political.knowledge gender
1     1       4      2                   2 female
2     4       4      5                   2   male
3     2       3      3                   2   male
4     1       3      4                   0 female
5     1       4      6                   2   male
6     4       2      4                   2   male

Data cleaning / making variables

A first step in data analysis is typically “cleaning” your data. This means: get your data and your variables into a format that can be analyzed. For example, some variables might need to be “recoded” - e.g., a variable that has text values of “Democrat” or “Republican” might need to be recoded to have numeric values of either 0 or 1. Here are some examples.

Let’s start by looking at the variable gender in our dataset df. To reference a variable in a data frame, you use the $, like this:

# look at a table that lists all values of gender and how many people are in each category
table(df$gender)

female   male 
   812    713 

So gender is currently defined using text values. Let’s recode it to be numeric, with female equal to 1 and male equal to 0:

# create a new variable named gender_01 that is 1 if gender is female and 0 otherwise
df$gender_01 <- ifelse(df$gender == "female", 1, 0)

# let's see how it looks
table(original = df$gender, new = df$gender_01)
        new
original   0   1
  female   0 812
  male   713   0

Success! Let’s try another. There are two variables in our data measuring people’s views on the economy: economic.cond.national (views on the national economy) and economic.cond.household (views on household economic well-being). Let’s say we want to combine these into a single measure of people’s overall economic views. We can do that by (1) making sure they are on the same scale and (2) averaging them together:

# let's see what scale each variable is on
table(natl = df$economic.cond.national, HH = df$economic.cond.household)
    HH
natl   1   2   3   4   5
   1  15  10   7   4   1
   2  13 105  91  43   5
   3  23 101 321 136  26
   4  12  57 200 234  39
   5   2   7  29  23  21

Good! They are on the same scale. Now let’s create a new variable, econ_views, that is the average of the two:

# create new var that is average of the two econ variables
df$econ_views <- (df$economic.cond.national + df$economic.cond.household) / 2

# let's see what it looks like
table(df$econ_views)

  1 1.5   2 2.5   3 3.5   4 4.5   5 
 15  23 135 208 424 348 289  62  21 

Success! If you want to see what the variable looks like graphically, you can use hist for a histogram plot:

# create a histogram for econ_views
hist(df$econ_views)

Let’s try one more example. The variable Europe is coded from 1 to 11. But let’s say you need it to be on a scale that ranges from 0 to 1, with zero being the minimum value and one being the maximum. Here is how we could do this:

# create a new variable, Europe_01, that has minimum value 0 and maximum 1
df$Europe_01 <- (df$Europe - 1) / 10

# did it work?
table(df$Europe_01)

  0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 
109  79 129 127 124 209  86 112 111 101 338 

Success! But why did that work? First, we subtracted the amount needed to make the minimum equal to zero; since the original minimum was 1, we just needed to subtract one. That made the new maximum equal to 10, but we want it to be 1, so we then divided the new variable by ten.
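
If you did not want to look up the minimum and maximum by hand, you could let R find them for you - here is a sketch that does the same rescaling using the built-in min and max functions (add na.rm = TRUE inside each if your variable has missing values):

# rescale Europe to run from 0 to 1, letting R find the min and max
df$Europe_01 <- (df$Europe - min(df$Europe)) / (max(df$Europe) - min(df$Europe))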

Some basic data analysis

There are lots of things you might want to do with your data once it is cleaned. In some cases, we will likely have to help you with analyses that are specific to your project. But there are many cases that rely on similar kinds of analysis techniques, and we will show you some basic ones here.

One of the simplest things you might want to do is to calculate correlations between variables. A correlation tells us about both the strength and the direction of the relationship between two variables. Correlations range from negative one to one. The further a correlation is from zero, the stronger it is. Negative correlations mean: as one variable gets bigger (smaller), the other gets smaller (bigger). Positive correlations mean: as one variable gets bigger (smaller) the other variable gets bigger (smaller).

Correlations

Here is an example. Let’s correlate national and household economic views (the option use = "complete.obs" is needed if either variable has missing values):

# calculate correlation of national and household economic views
cor(df$economic.cond.national, df$economic.cond.household, use = "complete.obs")
[1] 0.3463033

So the correlation is about 0.35. This means that the two variables are moderately positively correlated. People who think the national economy is doing well are more likely to think their own household is doing well, and vice versa.

Now let’s look at the correlation between opposition to European integration (Europe) and feelings about Prime Minister Tony Blair (Blair):

# calculate correlation of opposition to European integration and evals of Blair
cor(df$Europe, df$Blair, use = "complete.obs")
[1] -0.2961623

So these two variables are negatively correlated, at about -0.30. This means that people who are more opposed to European integration (have higher values on Europe) feel less positive about Tony Blair. Similarly, people who feel more positive about Tony Blair are less opposed to European integration.

We can see this relationship graphically using a figure. Here, we will plot the average evaluations of Tony Blair for each value of Europe. First, we can calculate the average value of Blair at each value of Europe:

# calculate the mean of Blair at each value of Europe
blair_means <- aggregate(Blair ~ Europe, data = df, FUN = mean)

# let's see what it looks like
blair_means
   Europe    Blair
1       1 3.825688
2       2 3.949367
3       3 3.744186
4       4 3.700787
5       5 3.451613
6       6 3.464115
7       7 3.186047
8       8 3.196429
9       9 3.009009
10     10 3.079208
11     11 2.881657

Great! Now let’s plot this in a bar plot:

# create barplot
barplot(height = blair_means$Blair,
        names.arg = blair_means$Europe,
        col = "lightblue",
        xlab = "Opposition to European Integration",
        ylab = "Feelings about Tony Blair",
        main = "Support for Tony Blair by Opposition to European Integration",
        ylim = c(0,5))

Mean differences

It is often the case that we expect the average value of one variable to be different across categories of some other variable. This is true for most experiments: we assign people to either a treatment or control condition, and then we look at whether some variable is different for treatment versus control. This kind of analysis requires us to calculate the averages in each category, and then say something about how confident we can be that those averages are really different from one another.

Let’s look at differences in opposition to European integration by gender. Let’s start by calculating the average opposition to integration for each gender category:

# calculate the mean of Europe at each value of gender
Europe_means <- aggregate(Europe ~ gender, data = df, FUN = mean)

# let's see what it looks like
Europe_means
  gender  Europe
1 female 6.96798
2   male 6.45582

So there is a difference in means: people who identify as female are more opposed to European integration than people who identify as male. But is this difference “real”, or is it due to random sampling error? Notice that it is not that big (about half a point on an 11-point scale), so maybe it is just random error. Let’s conduct a statistical test to see:

t.test(Europe ~ gender, data = df)

    Welch Two Sample t-test

data:  Europe by gender
t = 3.0246, df = 1475.4, p-value = 0.002533
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 0.1800055 0.8443142
sample estimates:
mean in group female   mean in group male 
             6.96798              6.45582 

This test - called a “t-test” - tells us whether we can, or cannot, be confident that the true difference between the two groups is NOT zero. There are two things to look at.

First, the p-value: when this value is less than 0.05, we say the difference between groups is “statistically significant”, and we feel confident it is a true difference (i.e., not due to random sampling error). In our case, the p-value is 0.003, which is much less than 0.05, so we feel confident the difference between males and females is a real one.

Second, the 95 percent confidence interval: this tells us the range of values within which the true difference between groups is likely to fall. Our confidence interval is 0.18 to 0.84: this means that the true difference between males and females is likely to be somewhere between 0.18 and 0.84. We cannot be certain of the exact value of the true difference, but we feel confident it is somewhere in that range!
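
By the way, if you want to pull these numbers out of the test directly, rather than reading them off the printout, you can store the test in an object first - a minimal sketch:

# store the test result, then extract the p-value and confidence interval
tt <- t.test(Europe ~ gender, data = df)
tt$p.value
tt$conf.int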

Regression

Regression is a method for estimating how two variables are related to one another, while “controlling” for other variables that might also matter. Why do we need to control for other variables at all? Consider an example. Imagine you are trying to see if exercise reduces the risk of a heart attack. You might find that people who exercise more are also less likely to have heart attacks. But consider that young people are both more likely to exercise AND less likely to have heart attacks (for reasons unrelated to exercise)! If you don’t “control” or “adjust” for age differences, you will overstate the degree to which exercise is associated with heart attacks. The same basic logic applies in lots of cases. For example, you might find that income is positively correlated with political knowledge, but is that because of income itself? An alternative hypothesis is that more educated people have both higher incomes and more knowledge of politics. To make sure you don’t overestimate the relationship of income to knowledge, you need to control for education.

Regression analysis allows you to estimate relationships while controlling for other variables. Let’s look at a simple example using the same data as above. Specifically, we want to know whether people’s beliefs about their own economic circumstances (economic.cond.household) are related to support for the sitting Prime Minister, Tony Blair. We might expect that people who feel more positive about their household finances evaluate Blair more positively. Let’s see if that’s true by estimating a simple regression of feelings toward Blair on evaluations of household finances:

# estimate regression of Blair on household econ and store it in an object called m1
m1 <- lm(Blair ~ economic.cond.household, data = df)

# use the summary function to print the results
summary(m1)

Call:
lm(formula = Blair ~ economic.cond.household, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8402 -1.2963  0.4318  0.7037  2.2477 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2.48039    0.10353  23.958   <2e-16 ***
economic.cond.household  0.27196    0.03161   8.603   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.148 on 1523 degrees of freedom
Multiple R-squared:  0.04634,   Adjusted R-squared:  0.04572 
F-statistic: 74.01 on 1 and 1523 DF,  p-value: < 2.2e-16

There are a couple things to look at here. The first is the “Estimate” for economic.cond.household: this tells us the expected change in Blair for a 1-unit increase in economic.cond.household. Just as we expected, a 1-point increase in positive evaluations of one’s household finances is associated with a 0.27-point increase in positive feelings toward Tony Blair. This makes sense! The second thing to look at is the column labeled “Pr(>|t|)” for economic.cond.household: this tells us the “p-value” for the relationship between the two variables. Just like for mean differences above, if this number is less than 0.05 (indicated by having one or more “*” next to it), then we say the relationship is “statistically significant”, and we are confident it is real and not due to sampling error. Our p-value has three stars: it is much less than 0.05, and we conclude that there is a real relationship between evaluations of household finances and feelings toward Tony Blair.
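
To see what that Estimate means in practice, you can use the model’s coefficients to compute predicted values by hand - a sketch using coef (the object name b is just for this example):

# predicted feelings toward Blair at the lowest and highest household ratings
b <- coef(m1)
b["(Intercept)"] + b["economic.cond.household"] * 1  # about 2.75
b["(Intercept)"] + b["economic.cond.household"] * 5  # about 3.84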

But is it really household economic assessments that matter? What if, instead, it is national economic assessments that really shape evaluations of the Prime Minister, and people who have positive national assessments have both positive household assessments AND positive feelings toward the Prime Minister? How can we know? Well, we can “control” or “adjust” for this other variable, economic.cond.national. Here is how we do it:

# estimate regression of Blair on household econ AND national econ and store it in an object called m2
m2 <- lm(Blair ~ economic.cond.household + economic.cond.national, data = df)

# use the summary function to print the results
summary(m2)

Call:
lm(formula = Blair ~ economic.cond.household + economic.cond.national, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8952 -1.0733  0.3978  0.7802  2.3983 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              1.63323    0.12513  13.053  < 2e-16 ***
economic.cond.household  0.14652    0.03240   4.522 6.61e-06 ***
economic.cond.national   0.38235    0.03421  11.178  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.104 on 1522 degrees of freedom
Multiple R-squared:  0.1187,    Adjusted R-squared:  0.1175 
F-statistic: 102.5 on 2 and 1522 DF,  p-value: < 2.2e-16

Notice a few things about the new results. First, we now have two different “Estimates”: one for economic.cond.household and one for economic.cond.national. Second, the Estimate for national assessments is larger than the one for household assessments. This seems to confirm our worry! Third, after we control for national assessments, the size of the relationship between household assessments and evaluations of Tony Blair gets smaller; that is, closer to zero. So we had good reason to worry, and it does seem like controlling for economic.cond.national was a good idea. Having said that, the relationship between household assessments and feelings toward Blair remains positive and statistically significant, so it still matters, just not as much as national economic assessments.
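
If you want to see that change directly, you can place the two household estimates side by side:

# compare the household estimate before and after controlling for national assessments
coef(m1)["economic.cond.household"]  # about 0.27
coef(m2)["economic.cond.household"]  # about 0.15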