mydt[code to filter rows, code to select or create columns, code to group data]
A lot of data.table will feel familiar to you if you know SQL. For more on data.table, check out the package website.
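As a rough illustration of that three-part bracket pattern, here is a minimal sketch using the built-in mtcars data. It assumes the data.table package is installed; the names mydt, avg_hp and n_cars are made up for this example.

```r
library(data.table)   # assumes data.table is installed

# Convert the built-in mtcars data frame, keeping car names in a "car" column
mydt <- as.data.table(mtcars, keep.rownames = "car")

# First slot: keep rows where mpg > 20
# Second slot: compute summary columns
# Third slot: group by number of cylinders
result <- mydt[mpg > 20, .(avg_hp = mean(hp), n_cars = .N), by = cyl]
print(result)
```

Read as SQL, that is roughly a WHERE clause, a SELECT list and a GROUP BY, in that order.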
When working with a basic data frame, you can think of each row as similar to a database record and each column like a database field. There are lots of useful functions you can apply to data frames, such as base R's summary() and the dplyr package's glimpse().
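A quick sketch of a few such functions on the built-in mtcars sample data frame. The glimpse() call is commented out because it needs dplyr to be installed.

```r
# mtcars ships with R, so no setup is needed
dim(mtcars)            # number of rows (records) and columns (fields)
summary(mtcars$mpg)    # minimum, quartiles, mean and maximum for one column
head(mtcars, 3)        # first three rows

# dplyr::glimpse(mtcars)   # compact column-by-column view; requires dplyr
```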
Back to base R quirks: There are several ways to find an object's underlying data type, but not all of them return the same value. For example, class() and str() will report data.frame on a data frame object, but mode() returns the more generic list.
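You can see the difference for yourself with the built-in mtcars data frame. This small sketch also calls typeof(), which isn't mentioned above but behaves like mode() here.

```r
class(mtcars)    # "data.frame"
mode(mtcars)     # "list"
typeof(mtcars)   # "list", because a data frame is stored as a list of columns
str(mtcars)      # prints a compact description that starts with 'data.frame'
```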
If you'd like to learn more details about data types in R, you can watch this video lecture by Roger Peng, associate professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health.
One more useful concept to wrap up this section (hang in there, we're almost done): factors. These represent categories in your data. So, if you've got a data frame with employees, their departments and their salaries, salaries would be numerical data and employees would be characters (strings in many other languages); but you might want department to be a factor, a category you may want to group or model your data by. Factors can be unordered, such as department, or ordered, such as "poor," "fair," "good," and "excellent."
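Here is a minimal sketch of both kinds of factor in base R. The employee names, departments and salaries are invented purely for illustration.

```r
# Hypothetical employee data, made up for this example
staff <- data.frame(
  employee   = c("Ana", "Ben", "Cho", "Dev"),
  department = c("Sales", "IT", "Sales", "HR"),
  salary     = c(52000, 61000, 58000, 49000),
  stringsAsFactors = FALSE
)

# Unordered factor: departments are categories with no ranking
staff$department <- factor(staff$department)
levels(staff$department)   # "HR" "IT" "Sales"

# Use the factor to group: average salary per department
tapply(staff$salary, staff$department, mean)

# Ordered factor: the levels have a meaningful sequence
rating <- factor(c("good", "poor", "excellent", "fair"),
                 levels = c("poor", "fair", "good", "excellent"),
                 ordered = TRUE)
rating[1] > rating[2]      # TRUE, because "good" ranks above "poor"
```

Comparisons such as > only work on ordered factors; on an unordered factor R returns NA with a warning.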
R command line differs from the Unix shell
When you start working in the R environment, it looks quite similar to a Unix shell. In fact, some R command-line actions behave as you'd expect if you come from a Unix environment, but others don't.
Want to cycle through your last few commands? The up arrow works in R just as it does in Unix -- keep hitting it to see prior commands.
The list function, ls(), will give you a list, but not of files as in Unix. Rather, it will provide a list of objects in your current R session.
Want to see your current working directory? pwd, which you'd use in Unix, just throws an error; what you want is getwd().
rm(my_variable) will delete a variable from your current session.
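Put together, a short session using these commands might look like the sketch below. The object name my_variable is just a placeholder.

```r
my_variable <- 42      # create an object (placeholder name)
ls()                   # lists objects in the session, including "my_variable"
getwd()                # shows the current working directory
rm(my_variable)        # removes the object from the session
exists("my_variable")  # FALSE now that it has been removed
```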
R does include a Unix-like grep() function. For more on using grep in R, see this brief writeup on Regular Expressions with The R Language at regular-expressions.info. If you want to work with regexps in R, you may also be interested in the tidyverse stringr package - see Matching patterns in regular expressions in R for Data Science by Hadley Wickham and Garrett Grolemund.
R's syntax for regular expressions is a bit different from that of most languages. For example, identifying the first matched "group" is typically $1 or \1 in other languages; in R, it's \\1.
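The doubled backslash is needed because R string literals treat a single backslash as an escape character, so "\\1" is what actually hands \1 to the regex engine. A small base R sketch, with sample strings made up for illustration:

```r
x <- "hello world"

# Two capture groups, swapped in the replacement with \\2 and \\1
sub("(\\w+) (\\w+)", "\\2 \\1", x)        # "world hello"

# grep() returns the positions of matching elements, not lines of a file
grep("o", c("one", "two", "three"))       # 1 2

# value = TRUE returns the matching elements themselves
grep("o", c("one", "two", "three"), value = TRUE)   # "one" "two"
```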
Terminating your R expressions
R doesn't need semicolons to end a line of code (while it's possible to put multiple commands on a single line separated by semicolons, you don't see that very often). Instead, R uses line breaks (i.e., new line characters) to determine when an expression has ended.
What if you want one expression to go across multiple lines? The R interpreter tries to guess if you mean for it to continue to the next line: If you obviously haven't finished a command on one line, it will assume you want to continue instead of throwing an error. Open some parentheses without closing them, use an open quote without a closing one or end a line with an operator like + or - and R will wait to execute your command until it comes across the expected closing character and the command otherwise looks finished.
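A short sketch of how this plays out in practice. The variable names are arbitrary, and the middle example shows a common pitfall: an operator placed at the start of the next line, rather than the end of the current one, does not continue the expression.

```r
# A trailing + tells R the expression isn't finished yet
total <- 1 +
  2 +
  3
total            # 6

# Pitfall: the first line is already complete, so R runs it on its own
partial <- 1
+ 2              # evaluated as a separate expression; partial is still 1

# An open parenthesis also keeps the expression going
long_call <- paste("a",
                   "b",
                   sep = "-")
long_call        # "a-b"

# Semicolons are allowed, just uncommon
a <- 1; b <- 2
```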
Syntax cheating: Run SQL queries in R
If you've got SQL experience and you're not yet comfortable in R -- especially when you're trying to figure out how to get a subset of data with proper R syntax -- you might start longing for the ability to run a quick SQL SELECT command to query your data set.
You can.
The add-on package sqldf lets you run SQL queries on an R data frame (there are separate packages allowing you to connect R with a local database). Install and load sqldf, and then you can issue commands such as:
sqldf("select * from mtcars where mpg > 20 order by mpg desc")
This will find all rows in the mtcars sample data frame that have an mpg greater than 20, ordered from highest to lowest mpg.
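For comparison, here is one way to get the same rows in base R without any add-on package. It's only a sketch, and the name over20 is made up.

```r
# WHERE mpg > 20: a logical test in the row position of the brackets
over20 <- mtcars[mtcars$mpg > 20, ]

# ORDER BY mpg DESC: reorder the rows with order()
over20 <- over20[order(over20$mpg, decreasing = TRUE), ]

head(over20, 3)   # the three highest-mpg cars
```

Seeing the two side by side can make the bracket syntax easier to pick up if you already think in SQL.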
Examine and edit data with a GUI
And speaking of cheating, if you don't want to use the command line to examine and edit your data, R has a couple of options. The edit() function brings up an editor where you can look at and edit an R object, such as:
edit(mtcars)
This can be useful if you've got a data set with a lot of columns that are wrapping in the small command-line window. However, since there's no way to save your work as you go along — changes are saved only when you close the editing window — and there's no command-history record of what you've done, the edit window probably isn't your best choice for editing data in a project where it's important to repeat/reproduce your work.
In RStudio you can also examine a data object (although not edit it) by clicking on it in the workspace tab in the upper right window.
Saving and exporting your data
In addition to saving your entire R workspace with the save.image() function, saving plots to image files, and saving R objects to your hard disk as R objects (save() and saveRDS()), you can save individual objects for use in other software. The rio package is a great way to export - and import - a data frame to and from a lot of different data file types.
You just need to remember two functions - export(mydf, "myfilename") and import("myfilename") - and rio determines what to do based on the file name extension.
For example, if you've got a data frame and want to export it as a CSV file, it's:
export(mydf, "myfile.csv")
Want an Excel file instead?
export(mydf, "myfile.xlsx")
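Base R can also write delimited text files on its own. For a tab-delimited file from a data frame called myData:
write.table(myData, "testfile.txt", sep="\t")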
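If you'd rather stay in base R, a round trip might look like the sketch below. It writes to temporary files so nothing lands in your working directory; the names rds_file, csv_file, restored and from_csv are made up for this example.

```r
# saveRDS()/readRDS() keep a single object in R's own format
rds_file <- tempfile(fileext = ".rds")
saveRDS(mtcars, rds_file)
restored <- readRDS(rds_file)

# write.csv()/read.csv() produce a file other software can open
csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file)
from_csv <- read.csv(csv_file, row.names = 1)   # first column holds the car names

dim(from_csv)   # same number of rows and columns as mtcars
```

The RDS route preserves R-specific details such as column types and factors, while CSV trades that for portability.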
This article was originally published at Computerworld.com.