by Charles Leech
Overview and Set Up
R Markdown is used for a variety of purposes across many disciplines; political scientists use it to interpret data, as do professionals in marketing, advertising, and the social sciences. On a more basic level, R Markdown can create files and edit text much as Google Docs or Word can. A working knowledge of R Markdown helps a person stand out in job applications and might even be the deciding factor for employers when the choice between two candidates is close. In this article, I will give you a better understanding of R Markdown and teach you commands that will help you analyze data and show your findings.
Installation and Opening Files
Before delving into the specifics of R Markdown, you must first have R and RStudio downloaded on your computer.
For R, go to the following website: https://www.r-project.org.
For RStudio use https://rstudio.com/products/rstudio/. Make sure to download the free Desktop version of the app.
There won’t be any need to spend exorbitant amounts of money for what this article will review. Keep in mind that you will need a Windows or Mac computer of some kind; Chromebooks are unable to operate R.
Once R and RStudio are installed on your computer, open RStudio. There, you will need a little more setup before truly beginning. You’ll want to open a new R Markdown file. You may have noticed that I have been using RStudio and R Markdown interchangeably; for clarification, RStudio is the program we’ll run, and R Markdown is the type of file that we will be creating.
To begin, there is a menu bar in the top left corner of the screen. There, you will find an icon depicting a white sheet of paper with a green plus sign. Click it, then select “R Markdown…” from the Menu.
Once you have completed that task, a menu will pop up asking for a) the title of the document, b) your name, and c) the type of document you would like to create. You can explore these options on your own; for our purposes you can name the document anything you like, but select Word for the Default Output Format.
Now we’re almost ready to begin. First, you’ll need to install a few packages. This is done by clicking on the Tools menu in the menu bar, up toward the top center of the screen. Select “Install Packages…” from the dropdown menu and a window will pop up and your screen will look like this:
Type “tidyverse” into the bar and click install. Now we are finally able to begin in earnest.
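The same installation can also be done by typing in the Console instead of using the menu. This is a minimal sketch of that alternative; the requireNamespace() check is an optional extra, not part of the menu steps above, and simply skips the download when tidyverse is already installed.

```r
# Console alternative to Tools > "Install Packages…"
# The if() guard is an assumption added for convenience:
# it only downloads tidyverse when it is not installed yet.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```

Installing only needs to happen once per computer; loading the package is a separate step that happens in every new session.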
Data and Graphing
First, we need to select the data we wish to analyze. This could be in the form of an Excel document, Google Sheets, or a CSV file. For this example, I will use a spreadsheet I’ve made as a political science major that deals with the number of women in civil conflict, as well as their involvement in the political process afterwards. You may also use this sheet if you would like, but if you have another set of data you would like to work with, that will work as well.
In order to select the data we are using, we will have to 1) find where the data is saved on our computer, and 2) set that area as the working directory. I will show you an easy way to do this: first, once you find where your data is (for me this is in the “Quan” folder on my computer), go over to the left side of the screen under “Files” and select the folder.
Remember that you will have saved your data in a different place than I have, so your files will look different from those displayed above.
After selecting your data, click on the blue gear labeled “More.” Another dropdown menu will appear that lets you select the option “Set as Working Directory.” Notice that in the Console in the bottom left corner, the code setwd("~/WITS") appears (your own folder name will appear in place of WITS). Copy and paste that line of code into your markdown file and then run it. To run code, simply place your cursor on the line of code and press Ctrl + Enter (Cmd + Return on a Mac).
Library and Data Loading
Next, you will want to load in your libraries and your data. To do this, you will type the following code:
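A minimal sketch of the library-loading step, assuming the tidyverse package installed earlier is the only library this walkthrough needs:

```r
# Loads the tidyverse collection, which includes readr (read_csv),
# dplyr (group_by, summarise, %>%), and ggplot2 (ggplot).
library(tidyverse)
```

R may print a list of attached packages and some "conflicts" when this runs; those messages are informational and do not mean the command failed.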
You should then determine what kind of file you will be working with. I’ll be demonstrating with a CSV file. If you are working with a different file type, you can convert it within Excel by going to File → Export → Change File Type, then selecting CSV. Next, enter and run the following code:
data <- read_csv("The name of your dataset.csv")
Make sure that the quotation marks are present within the parentheses and that the file name is typed out exactly, matching case and including the “.csv”. Please note that the words inside the quotation marks should be the actual name of your data file rather than the example above. If the command was executed correctly, there should be a new item beneath the Global Environment header located in the top right corner of the screen. Keep in mind that when running code, you will always click on the line of code you want to run in order to move your text cursor to that line, and then press the ‘Ctrl’ and ‘Enter’ keys at the same time. Additionally, you do not have to name the object you create “data” as I did, and the name of your dataset will differ from mine if you are using another data frame. In this particular case, mine is titled IFPJ data.
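Besides looking at the Environment pane, you can confirm the load from code. The sketch below is only an illustration: it writes a tiny, made-up CSV to a temporary folder so that it runs on any computer, and the file name example_data.csv and its values are hypothetical, not the article's dataset.

```r
library(readr)

# Hypothetical stand-in file with made-up values, written to a temporary
# folder; in practice your CSV already sits in the working directory.
csv_path <- file.path(tempdir(), "example_data.csv")
writeLines(c("per_womenLeg,Noncombat_D", "12.5,1", "30,0"), csv_path)

# show_col_types = FALSE just silences the column-type summary
data <- read_csv(csv_path, show_col_types = FALSE)

# Quick checks that the load worked
names(data)  # column names, spelled exactly as in the file
nrow(data)   # number of rows that were read
```

Checking names(data) is worthwhile because later commands must match the column names exactly, including capitalization.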
Start to Code
Now that your setup is done, you can begin to code! One thing that can always be helpful is knowing the measures of central tendency and the distribution of your variables. So, let’s play around with the data. If you are using different data than mine, feel free to use your own variables and adapt my instructions to your needs. For those of you who are using the same data as I am, we will be working with the percent of women in a given country’s legislature post-conflict; this variable is denoted as per_womenLeg in our data.
Firstly, you will want to open a new code chunk. To do so, type three backticks (`) followed by a pair of curly braces with the letter ‘r’ inside, {r}; then press the ‘Enter’ key a few times and end the code chunk with another three backticks, as follows:
Now it’s time to learn your first code for interpreting data. When interpreting data, information such as the median, mean, range, and standard deviation is very helpful; it will give you a place to start and help determine what kind of analysis you will need to proceed with.
There are some commands you could enter, one by one, that will eventually give you the information you need, but I’ll give you a shortcut that comes in the form of the quantile() command. This will generate the values at different quartiles of the data. If you want more information than provided here on quantiles, I would recommend taking a stats class, where the topic will be explored in more depth.
In order to use the quantile() command, type the following (replacing “per_womenLeg” if necessary):
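A minimal sketch of the call, assuming the column is pulled out of the data frame with the $ operator. The small data frame here is a made-up stand-in so the example runs on its own; with your real data you would skip that line.

```r
# Hypothetical stand-in for the article's dataset; the values are made up.
data <- data.frame(per_womenLeg = c(5, 10, 15, 20, 40))

# data$per_womenLeg selects the column from the data frame.
# na.rm = TRUE ignores missing values instead of stopping with an error.
quantile(data$per_womenLeg, na.rm = TRUE)
```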
Once you run the code, you should get five numbers with the percentages 0, 25, 50, 75, and 100 above them. Without getting bogged down in how we find these numbers or what some of their other applications are, I will tell you which numbers concern us for the moment: the number corresponding to 0% is the minimum, and the number under 100% is the maximum. Finally, the number with 50% above it is the median of the data, that is, the middle value; this is not the same as the mean (the average), which you can get with the mean() command. You can find the range of this variable by simply subtracting the minimum from the maximum.
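The arithmetic described above can also be done in code. This sketch uses made-up values, and the object names (q, minimum, value_range, and so on) are hypothetical; it also computes the mean separately to show that the 50% value is the median and can differ from the mean.

```r
# Hypothetical values for illustration only
data <- data.frame(per_womenLeg = c(5, 10, 15, 20, 40))

q <- quantile(data$per_womenLeg, na.rm = TRUE)

minimum      <- q[["0%"]]     # smallest value
median_value <- q[["50%"]]    # middle value
maximum      <- q[["100%"]]   # largest value

# Range = maximum minus minimum
value_range <- maximum - minimum

# The mean is computed separately; it need not equal the median
mean_value <- mean(data$per_womenLeg, na.rm = TRUE)
```

With these made-up numbers the median is 15 while the mean is 18, because the single large value pulls the mean upward.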
Now that we have some information about one of the variables, try this out with your own data. Try to find the minimum, maximum, median, and range for the dummy variables of women in noncombat roles, combat roles, and leadership roles. If a group has a zero in one category, there aren’t women present in that role, and if there is a one, there are women present. If you can’t remember what to do, just enter the code below:
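A sketch of what this could look like, reusing the same quantile() pattern on the three dummy variables. The column names come from the graphing code later in the article; the values are made up so that the example runs on its own.

```r
# Hypothetical stand-in data: 1 = women present, 0 = not present
data <- data.frame(
  Noncombat_D = c(1, 1, 1, 0, 1),
  Combat_D    = c(1, 0, 1, 0, 0),
  Leader_D    = c(0, 0, 1, 0, 0)
)

quantile(data$Noncombat_D, na.rm = TRUE)
quantile(data$Combat_D, na.rm = TRUE)
quantile(data$Leader_D, na.rm = TRUE)
```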
You may notice that when you run the code for these variables, many of the numbers under each of the respective percentages are the same. This is because these are dummy variables, meaning that the only values that will ever be present are one and zero. When you look at the distribution, women in leadership are less common than the other two groups, and women in combat roles are less common than women in noncombat roles.
Next, let’s run some code that will help us visualize these findings.
In this section, we will graph each of these variables on their own. You’ll need to enter the following command:
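A minimal sketch of a single-variable graph for per_womenLeg. The choice of geom_histogram() is an assumption here (a density curve would be another reasonable option for a continuous variable), and the data frame is a made-up stand-in.

```r
library(ggplot2)

# Hypothetical stand-in for the article's dataset; the values are made up.
data <- data.frame(per_womenLeg = c(5, 10, 15, 20, 40))

# ggplot2 may print a note about the default number of bins; that is not an error.
p <- ggplot(data, aes(per_womenLeg)) +
  geom_histogram()

p  # typing the object's name displays the graph
```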
Once entered, a graph should appear. Please note that when the code is split across lines, the “+” must sit at the end of a line; the code will not work if the “+” starts the next line instead. That being said, depending on whether you are using your own data, your code may look slightly different, but it must use the same format. Let’s break it down a little.
In this code, the first thing you put in the parentheses of ggplot() is the name of your dataframe (in my case, data). Then, you will have a comma and the code aes(). As another example:
Within those parentheses, you will put the variable you are trying to measure. Then you will exit the parentheses and add a “+”, then press enter to bring the code down a line. Afterward, determine what kind of graph you will need. You will always enter “geom_”, but what you put next determines what kind of graph you will get. geom_bar will make a bar graph, geom_density will make a density curve, and geom_histogram will create a histogram (naturally).
There are many kinds of graphs you can make, but for our first round of graphing, these will primarily be what we look at. Now, using the information you have been given, try to graph the other variables on your own. Please note that these other variables are dummy variables rather than continuous ones, so they will need a different kind of graph. If you need help, the code is below:
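To illustrate how only the geom_ part changes, here is a sketch that draws the three kinds of graph named above. The data values and the object names (hist_plot, density_plot, bar_plot) are hypothetical, and bins = 5 is just a setting chosen for this tiny example.

```r
library(ggplot2)

# Hypothetical stand-in data; the values are made up.
data <- data.frame(
  per_womenLeg = c(5, 10, 15, 20, 40),
  Noncombat_D  = c(1, 1, 1, 0, 1)
)

# Same ggplot() + aes() start, different geom_ ending.
# Note that each "+" ends a line rather than starting the next one.
hist_plot <- ggplot(data, aes(per_womenLeg)) +
  geom_histogram(bins = 5)

density_plot <- ggplot(data, aes(per_womenLeg)) +
  geom_density()

bar_plot <- ggplot(data, aes(Noncombat_D)) +
  geom_bar()
```

Typing the name of any of these objects on its own line displays that graph.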
Presence in Noncombat Roles
ggplot(data, aes(Noncombat_D)) +
  geom_bar()
Presence in Combat Roles
ggplot(data, aes(Combat_D)) +
  geom_bar()
Presence in Leadership Roles
ggplot(data, aes(Leader_D)) +
  geom_bar()
If properly executed, the graphs should look as follows.
As you can see through the graphing of the data, groups that have women in leadership and combat roles are rare, while groups that have women in noncombat roles are much more common. Researchers will know what these graphs mean, but if we are going to communicate our findings effectively, we need to make our graphs look better to the general eye. The following picture is going to show you what you need to add to the code graphing the percent of women in a legislature.
The command labs() adds a title and labels for the x-axis and the y-axis. Because the other three variables we are looking at are a different kind of variable, I will demonstrate again what to do with these kinds of variables before sending you off on your own.
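A sketch of labs() applied to the percent-of-women graph. The label wording, the histogram geom, and the data values are all illustrative assumptions, not the exact code from the picture.

```r
library(ggplot2)

# Hypothetical stand-in data; the values are made up.
data <- data.frame(per_womenLeg = c(5, 10, 15, 20, 40))

p <- ggplot(data, aes(per_womenLeg)) +
  geom_histogram(bins = 5) +
  # title, x, and y each take the text to display, in quotation marks
  labs(title = "Distribution of Women in Post-Conflict Legislatures",
       x = "Percent of Women in Legislature",
       y = "Count")

p
```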
As you can see from the above picture, there are two primary changes that we will have to make in order to add some labels to the graph. First, go into the aes(Noncombat_D) command and wrap the name of the variable in the command as.factor(). In the end, the new line of code should read aes(as.factor(Noncombat_D)). This enables the individual bars to be named. Next, you will want to actually add labels to each section of the graph; this is done by adding a plus sign after the geom_bar() command, and then adding the code as seen in the picture. If you are unable to see the picture or would otherwise want it to be written out, here it is:
labs(title = "Distribution of Groups with Women Present in Noncombat Roles",
     x = "Are Women Present in Noncombat Roles?",
     y = "Number of Groups") +
scale_x_discrete(labels = c("No", "Yes"))
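Put together with the ggplot() and geom_bar() lines described above, the full command might look like the sketch below. The data frame is a made-up stand-in so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in data: 1 = women present, 0 = not present
data <- data.frame(Noncombat_D = c(1, 1, 1, 0, 1))

p <- ggplot(data, aes(as.factor(Noncombat_D))) +
  geom_bar() +
  labs(title = "Distribution of Groups with Women Present in Noncombat Roles",
       x = "Are Women Present in Noncombat Roles?",
       y = "Number of Groups") +
  # Labels are applied in order: "No" replaces 0 and "Yes" replaces 1
  scale_x_discrete(labels = c("No", "Yes"))

p
```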
Now that you have a feel for how the code should look, try it on your own. The correct code is shown in the picture below.
There is one last step until you have some of the most important skills that you will need to navigate R Markdown. For this last step, you will learn to graph the relationship between two different variables. Before we start graphing, however, we have to do some coding. This step is only needed if you are graphing a categorical measure versus a continuous measure.
In our case, our dependent variable, the percent of women in legislature, is continuous, and our independent variables are categorical. If you are confused about what these different kinds of variables are, feel free to reach out to me and I will explain them to the best of my ability. To begin, enter and run the following command.
graphNoncm <- data %>%
  group_by(Noncombat_D) %>%
  summarise(mean_womenLeg = mean(per_womenLeg, na.rm = TRUE))
This creates a dataset that will accurately represent the data that you wish to graph. This process will need to be repeated for each independent variable, which can be done by keeping the exact same code and switching the group_by(Noncombat_D) to group_by(Combat_D) and so on. Additionally, the first part of the code will need to change as well; instead of graphNoncm it could be graphCom or graphLeader. As you are creating a new dataset, potentially with your own data, it can be named whatever makes sense to you.
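As a sketch of that repetition, here is the same pattern applied to the combat and leadership variables, using the names graphCom and graphLeader suggested above. The data values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in data; the values are made up.
data <- data.frame(
  per_womenLeg = c(5, 10, 15, 20, 40),
  Combat_D     = c(0, 0, 1, 1, 1),
  Leader_D     = c(0, 0, 0, 1, 1)
)

# One row per group (0 and 1), holding that group's mean of per_womenLeg
graphCom <- data %>%
  group_by(Combat_D) %>%
  summarise(mean_womenLeg = mean(per_womenLeg, na.rm = TRUE))

graphLeader <- data %>%
  group_by(Leader_D) %>%
  summarise(mean_womenLeg = mean(per_womenLeg, na.rm = TRUE))
```

Each new dataset has only two rows, one for groups without women in that role and one for groups with them.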
Once the new dataset is created, the graphing process remains fairly similar to the process executed when we graphed a single variable. In the picture below, you will see the bare-bones code for graphing groups that have women in Noncombat roles against the percent of women in legislature, after the conflict is over. There are a few particular aspects of these new graphs that I want to bring your attention to, so I circled them in red.
The large chunk of code is the necessary step I mentioned earlier, one that is crucial to creating an accurate graph and that will affect the code that follows it.
If you look above, you’ll see that the second circle shows our first deviation from the earlier graphs. The first bit of code after ggplot is where the data for the graph will come from. When graphing our singular variables, we got our information from the data dataset. However, as we needed to create another dataset to get an accurate reading, we need to change the code to the name of the new dataset.
The third circle is also part of the changes created by making the new dataset. We want to graph the percent of women in legislature on the y-axis; previously, that variable was called per_womenLeg, but now that we’ve created a new dataset, the name of the variable has changed to mean_womenLeg and will need to be changed in our code as well.
Our last circle deals with cosmetic changes. If you have managed to make it this far without the assistance of my pictures, you may have noticed that certain titles for these graphs are too long and thus aren’t completely visible. The \n that I have included here moves the text onto another line, allowing the viewer to completely read the title of your graph. I would also like to note that the backslash in \n is located above the Enter key; avoid using the regular “/” slash, otherwise you’ll run into numerous errors.
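For readers without the picture, the points above can be sketched as follows. This is an approximation under stated assumptions: geom_col() is used here to draw one bar per group mean, which may differ from the geom in the original picture, and the label wording and data values are made up.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data; the values are made up.
data <- data.frame(
  per_womenLeg = c(5, 10, 15, 20, 40),
  Noncombat_D  = c(0, 0, 1, 1, 1)
)

# Step 1: the summary dataset
graphNoncm <- data %>%
  group_by(Noncombat_D) %>%
  summarise(mean_womenLeg = mean(per_womenLeg, na.rm = TRUE))

# Step 2: graph from graphNoncm (not data), with mean_womenLeg on the y-axis
p <- ggplot(graphNoncm, aes(as.factor(Noncombat_D), mean_womenLeg)) +
  geom_col() +  # assumption: bar height equals the value in mean_womenLeg
  # \n inside the quotation marks breaks the long title onto a second line
  labs(title = "Average Percent of Women in Legislature\nby Presence of Women in Noncombat Roles",
       x = "Are Women Present in Noncombat Roles?",
       y = "Average Percent of Women in Legislature") +
  scale_x_discrete(labels = c("No", "Yes"))

p
```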
Taking the knowledge I have given you, I would like you to try and recreate this process for the other graphs. Again, if you get stuck, I will include pictures of the correct code below.
By looking at the graphs created by the code you just ran, you can see a possible connection between groups that have women in different roles in their revolution and whether women are brought into the power structure after combat is over.
In reading this article, you have equipped yourself with some knowledge that will put you ahead of many of your peers when it comes to coding and graphing. This article simply covers some of the basics of R, but nonetheless, even a little knowledge of how R Markdown functions will make you an appealing candidate in job interviews, serving as an impressive mark on your resume.
If this article piqued your interest, there are classes here at NWU that concern the subject, or you can reach out to me for any other questions you may have. Additionally, there is a wealth of knowledge on the internet. I hope this has helped you feel more confident concerning R Markdown and some basic coding. Good luck, and happy graphing!
For any questions about R Markdown, please contact Charles Leech at firstname.lastname@example.org.