More on data structures
parent
6f2f9b858d
commit
abcf1e38a4
|
@ -98,11 +98,13 @@ typeof(x)
|
|||
|
||||
You learned how to manipulate these vectors in [strings].
|
||||
|
||||
## Molecular vectors
|
||||
## Subsetting
|
||||
|
||||
There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these molecular vectors, to torture the chemistry metaphor a little further. The chief difference between atomic and molecular vectors is that molecular vectors also have __attributes__.
|
||||
|
||||
Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
|
||||
|
||||
## Augmented vectors
|
||||
|
||||
There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
|
||||
|
||||
```{r}
|
||||
x <- 1:10
|
||||
|
@ -112,7 +114,34 @@ attr(x, "farewell") <- "Bye!"
|
|||
attributes(x)
|
||||
```
|
||||
|
||||
The most important use of attributes in R is to implement the S3 object oriented system. S3 objects have a "class" attribute, which works with __generic functions__ to implement behaviour that differs based on the class of the object. A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
|
||||
There are three very important attributes that are used to implement fundamental parts of R:
|
||||
|
||||
* "names" are used to name the elements of a vector.
|
||||
* "dim" (short for dimensions) makes a vector behave like a matrix or array.
|
||||
* "class" is used to implement the S3 object oriented system.
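
A brief sketch of how these three attributes behave, using only base R. The object names (`v`, `m`, `d`) are illustrative and not part of the original text; note that the dimension attribute is spelled `"dim"` in R.

```{r}
# "names" labels the elements of a vector
v <- c(a = 1, b = 2, c = 3)
attr(v, "names")

# Setting "dim" makes a plain vector behave like a matrix
m <- 1:6
attr(m, "dim") <- c(2, 3)
is.matrix(m)

# "class" is what S3 uses to choose a method
d <- as.Date("2016-01-01")
attr(d, "class")
```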
|
||||
|
||||
Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to OO in R. Here's what a typical generic function looks like:
|
||||
|
||||
```{r}
|
||||
as.Date
|
||||
```
|
||||
|
||||
The call to `UseMethod()` means that this is a generic function, and it will call a specific __method__ based on the class of the first argument. You can list all the methods for a generic with `methods()`:
|
||||
|
||||
```{r}
|
||||
methods("as.Date")
|
||||
```
|
||||
|
||||
And you can see the specific implementation of a method with `getS3method()`:
|
||||
|
||||
```{r}
|
||||
getS3method("as.Date", "default")
|
||||
getS3method("as.Date", "numeric")
|
||||
```
|
||||
|
||||
The most important S3 generic is `print()`: it controls how the object is printed when you type its name at the console.
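
To make this concrete, here is a minimal sketch of a custom print method. The class name `"temperature"` and the object `temp` are hypothetical and exist only for this illustration.

```{r}
# A hypothetical class, "temperature", used only for illustration
temp <- structure(c(21.5, 19), class = "temperature")

# Defining print.temperature() adds a method to the print() generic
print.temperature <- function(x, ...) {
  cat(paste(format(unclass(x)), "degrees C"), sep = "\n")
  invisible(x)
}

# Typing the name at the console calls print(), which dispatches on the class
temp
```

Removing the class with `unclass(temp)` falls back to the default printing of a numeric vector.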
|
||||
|
||||
A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
|
||||
|
||||
### Factors
|
||||
|
||||
|
@ -126,7 +155,13 @@ attributes(x)
|
|||
|
||||
Historically, factors were much easier to work with than characters, so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow".
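
The idea of knowing every possible value in advance can be sketched with the `levels` argument of `factor()`. The colour values reuse the example from the text; the variable names are illustrative.

```{r}
# Declare every possible colour up front, including one not yet observed
colours <- c("red", "blue", "yellow", "green")
seen <- factor(c("red", "blue", "red", "yellow"), levels = colours)

levels(seen)  # "green" is a known level even though it never occurs
table(seen)   # so it still appears in the table, with a count of zero
```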
|
||||
|
||||
The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can eliminate it. Often there will be a `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can use `as.character()` to explicitly turn it back into a character vector.
|
||||
The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be a `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn it back into a character vector.
|
||||
|
||||
```{r}
|
||||
x <- factor(letters[1:5])
|
||||
is.factor(x)
|
||||
as.factor(letters[1:5])
|
||||
```
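
A short sketch of the two ways out of a factor described above: converting an existing factor with `as.character()`, and asking `data.frame()` not to create one. The object names `f` and `chars` are illustrative.

```{r}
f <- factor(c("b", "a", "c", "a"))

# as.character() recovers the original strings
as.character(f)

# as.integer() returns the underlying level codes instead, which is
# rarely what you want
as.integer(f)

# stringsAsFactors = FALSE keeps the column as a character vector
chars <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
is.character(chars$x)
```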
|
||||
|
||||
### Dates
|
||||
|
||||
|
@ -166,7 +201,7 @@ As far as I know there is no case in which you need POSIXlt. If you find you hav
|
|||
|
||||
## Recursive vectors (lists)
|
||||
|
||||
Lists are the data structure R uses for hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because, unlike atomic vectors, a list can contain other lists.
|
||||
Lists are the data structure R uses for hierarchical objects. Lists extend atomic vectors to model objects that are like trees. You can create a hierarchical structure with a list because, unlike atomic vectors, a list can contain other lists.
|
||||
|
||||
You create a list with `list()`:
|
||||
|
||||
|
@ -296,16 +331,32 @@ knitr::include_graphics("images/pepper-3.jpg")
|
|||
1. What happens if you subset a data frame as if you're subsetting a list?
|
||||
What are the key differences between a list and a data frame?
|
||||
|
||||
|
||||
## Data frames
|
||||
|
||||
## Matrices
|
||||
Data frames are augmented lists.
|
||||
|
||||
## Subsetting
|
||||
```{r}
|
||||
df <- data.frame(x = 1:5, y = 5:1)
|
||||
typeof(df)
|
||||
attributes(df)
|
||||
```
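
Because a data frame is built on a list, the usual list tools apply to it. A small sketch, assuming a data frame like the one above (named `df2` here so it does not overwrite `df`):

```{r}
df2 <- data.frame(x = 1:5, y = 5:1)

is.list(df2)        # a data frame is a list ...
length(df2)         # ... whose elements are the columns
df2[["y"]]          # so list-style subsetting extracts a column
sapply(df2, class)  # and list tools iterate over the columns
```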
|
||||
|
||||
Not sure where else this should be covered.
|
||||
Generally, I prefer using `dplyr::data_frame()` instead of `data.frame()`. It creates an object that is very similar:
|
||||
|
||||
```{r}
|
||||
df <- dplyr::data_frame(x = 1:5, y = 5:1)
|
||||
typeof(df)
|
||||
attributes(df)
|
||||
```
|
||||
|
||||
* It doesn't convert variable types or variable names. It never uses character
|
||||
row names.
|
||||
|
||||
* It adds additional classes, including `tbl_df`, to give better printing and subsetting
|
||||
behaviour.
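
For contrast, here is a sketch of the name conversion that base `data.frame()` performs by default. It uses only base R and does not show `data_frame()` itself; the column name `my var` is made up for the illustration.

```{r}
# Base data.frame() rewrites names that are not syntactically valid ...
names(data.frame(`my var` = 1:2))

# ... unless you opt out with check.names = FALSE
names(data.frame(`my var` = 1:2, check.names = FALSE))
```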
|
||||
|
||||
## Predicates
|
||||
|
||||
### Predicates
|
||||
| | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()` | x | | | | | |
|
||||
|
|