Summarizing Data Part 1

DATA 606 - Statistics & Probability for Data Analytics

Jason Bryer, Ph.D. and Angela Lui, Ph.D.

February 8, 2023

One Minute Paper Results

What was the most important thing you learned during this class?

What important question remains unanswered for you?

Workflow

Data Science Workflow

Source: Wickham & Grolemund, 2017

Tidy Data

See Wickham (2014) Tidy data.

Types of Data

  • Numerical (quantitative)
    • Continuous
    • Discrete
  • Categorical (qualitative)
    • Regular categorical
    • Ordinal

Data Types in R
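
As a rough illustration of how the four data types from the previous slide are commonly represented in R, here is a minimal sketch. The vectors and their values are hypothetical and are not part of the course data.

```r
# Hypothetical toy values, one vector per data type.
height_cm  <- c(170.2, 165.5, 181.0)              # continuous  -> numeric (double)
n_siblings <- c(0L, 2L, 1L)                       # discrete    -> integer
eye_color  <- factor(c("brown", "blue", "brown")) # categorical -> factor
rating     <- factor(c("low", "high", "medium"),
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)              # ordinal     -> ordered factor

class(height_cm)   # "numeric"
class(n_siblings)  # "integer"
class(eye_color)   # "factor"
class(rating)      # "ordered" "factor"

# Ordered factors support comparisons against their levels.
rating < "high"    # TRUE FALSE TRUE
```

Storing ordinal data as an ordered factor keeps the level order explicit, which matters for tables and plots.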

Data Types / Descriptives / Visualizations

| Data Type | Descriptive Stats | Visualization |
|---|---|---|
| Continuous | mean, median, mode, standard deviation, IQR | histogram, density, box plot |
| Discrete | contingency table, proportional table, median | bar plot |
| Categorical | contingency table, proportional table | bar plot |
| Ordinal | contingency table, proportional table, median | bar plot |
| Two quantitative | correlation | scatter plot |
| Two qualitative | contingency table, chi-squared | mosaic plot, bar plot |
| Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot |
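
To make the pairing of data types and descriptive statistics concrete, the following sketch uses R's built-in mtcars dataset (an assumption for illustration; the slides themselves use legosets). Treating cyl and am as categorical is also a choice made only for this example.

```r
data(mtcars)  # built-in dataset, used here only for illustration

# Continuous: center and spread
mean(mtcars$mpg); median(mtcars$mpg)
sd(mtcars$mpg); IQR(mtcars$mpg)

# Discrete / categorical: contingency and proportional tables
cyl_tab <- table(mtcars$cyl)
cyl_tab
prop.table(cyl_tab)

# Two quantitative variables: correlation
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)

# Two qualitative variables: two-way contingency table
table(mtcars$cyl, mtcars$am)

# Quantitative & qualitative: grouped summary
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl
```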

Variance

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

Consider a dataset with five values (black points in the figure). For the largest value, the deviance is represented by the blue line ($x_i - \bar{x}$).

See also: https://shiny.rit.albany.edu/stat/visualizess/
https://github.com/jbryer/VisualStats/

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

In the numerator, we square each of these deviances. We can conceptualize this as a square. Here, we add the deviance in the y direction.

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

We end up with a square.

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

We can plot the squared deviance for all the data points. That is, each component in the numerator is the area of one of these squares.

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

The variance is therefore the average of the area of all these squares, here represented by the orange square.

Population versus Sample Variance

Typically we want the sample variance. The difference is that we divide by $n - 1$ to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by $n$.

Population Variance (yellow): $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

Sample Variance (green): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
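
The two formulas can be checked by hand in R. This is a minimal sketch with five hypothetical values (not the ones plotted in the figures); the variable names are illustrative only.

```r
x <- c(1, 3, 4, 6, 11)           # hypothetical data, mean is 5
n <- length(x)

squared_dev <- (x - mean(x))^2   # area of each square: 16 4 1 1 36

pop_var  <- sum(squared_dev) / n        # divide by n:     11.6
samp_var <- sum(squared_dev) / (n - 1)  # divide by n - 1: 14.5

pop_var
samp_var
var(x)   # R's var() uses the n - 1 denominator, so it matches samp_var
```

Because base R's var() and sd() both use the n - 1 denominator, the population version has to be computed explicitly when it is needed.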

Robust Statistics

Consider the following data, randomly drawn from a normal distribution:

set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
## [1] 103.1934
## [1] 16.8945
median(x); IQR(x)
## [1] 103.9947
## [1] 25.68004

Let's add an extreme value:

x <- c(x, 1000)

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread

  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
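
The effect of a single extreme value can be sketched with a small deterministic example. The data and the helper name center_spread are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical helper: the four summary statistics compared on this slide.
center_spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

y     <- c(1, 3, 4, 6, 11)   # hypothetical data
y_out <- c(y, 1000)          # same data plus one extreme value

before <- center_spread(y)
after  <- center_spread(y_out)

# One row per dataset; compare how far each column moves.
rbind(original = before, with_outlier = after)
```

In this sketch the mean and SD change by orders of magnitude, while the median and IQR move only slightly.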

About legosets

To install the brickset package:

remotes::install_github('jbryer/brickset')

To load the legosets dataset:

data('legosets', package = 'brickset')

The legosets data has 16355 observations of 34 variables.

names(legosets)
## [1] "setID" "name" "year"
## [4] "theme" "themeGroup" "subtheme"
## [7] "category" "released" "pieces"
## [10] "minifigs" "bricksetURL" "rating"
## [13] "reviewCount" "packagingType" "availability"
## [16] "agerange_min" "US_retailPrice" "US_dateFirstAvailable"
## [19] "US_dateLastAvailable" "UK_retailPrice" "UK_dateFirstAvailable"
## [22] "UK_dateLastAvailable" "CA_retailPrice" "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable" "DE_retailPrice" "DE_dateFirstAvailable"
## [28] "DE_dateLastAvailable" "height" "width"
## [31] "depth" "weight" "thumbnailURL"
## [34] "imageURL"

Structure (str)

str(legosets)
## 'data.frame': 16355 obs. of 34 variables:
## $ setID : int 7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
## $ name : chr "Small house set" "Medium house set" "Medium house set" "Large house set" ...
## $ year : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
## $ theme : chr "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
## $ themeGroup : chr "Vintage" "Vintage" "Vintage" "Vintage" ...
## $ subtheme : chr NA NA NA NA ...
## $ category : chr "Normal" "Normal" "Normal" "Normal" ...
## $ released : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ pieces : int 67 109 158 233 NA 1 1 60 65 NA ...
## $ minifigs : int NA NA NA NA NA NA NA NA NA NA ...
## $ bricksetURL : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
## $ rating : num 0 0 0 0 0 0 0 0 0 0 ...
## $ reviewCount : int 0 0 1 0 0 0 0 1 0 0 ...
## $ packagingType : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ availability : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ agerange_min : int NA NA NA NA NA NA NA NA NA NA ...
## $ US_retailPrice : num NA NA NA NA NA 1.99 NA NA 4.99 NA ...
## $ US_dateFirstAvailable: Date, format: NA NA ...
## $ US_dateLastAvailable : Date, format: NA NA ...
## $ UK_retailPrice : num NA NA NA NA NA NA NA NA NA NA ...
## $ UK_dateFirstAvailable: Date, format: NA NA ...
## $ UK_dateLastAvailable : Date, format: NA NA ...
## $ CA_retailPrice : num NA NA NA NA NA NA NA NA NA NA ...
## $ CA_dateFirstAvailable: Date, format: NA NA ...
## $ CA_dateLastAvailable : Date, format: NA NA ...
## $ DE_retailPrice : num NA NA NA NA NA NA NA NA NA NA ...
## $ DE_dateFirstAvailable: Date, format: NA NA ...
## $ DE_dateLastAvailable : Date, format: NA NA ...
## $ height : num NA NA NA NA NA ...
## $ width : num NA NA NA NA NA ...
## $ depth : num NA NA NA NA NA NA NA NA 5.08 NA ...
## $ weight : num NA NA NA NA NA NA NA NA NA NA ...
## $ thumbnailURL : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
## $ imageURL : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...

The RStudio Environment tab can help

Data Wrangling Cheat Sheet

Tidyverse vs Base R

Pipes %>% and |>

The pipe operator (%>%), introduced by the magrittr R package, allows R operations to be chained. Base R has since added its own native pipe operator (|>), available from R 4.1.0. Both take the output of the left-hand side and pass it as the first argument to the function on the right-hand side.

You can do this in two steps:

tab_out <- table(legosets$category)
prop.table(tab_out)

Or as nested function calls.

prop.table(table(legosets$category))

Using the pipe operator (|>), we can chain these calls in what is arguably a more readable format:

table(legosets$category) |> prop.table()

##
##        Book  Collection    Extended        Gear      Normal       Other
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749
##      Random
## 0.002873739
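
To confirm that the three styles are interchangeable, here is a small sketch using the built-in mtcars data (an assumption, so that it runs without the brickset package). It requires R 4.1.0 or later for the native pipe.

```r
cyl <- mtcars$cyl   # built-in data, used only for illustration

# Two steps
tab_out  <- table(cyl)
two_step <- prop.table(tab_out)

# Nested function calls
nested <- prop.table(table(cyl))

# Native pipe: the left-hand side becomes the first argument
piped <- table(cyl) |> prop.table()

identical(two_step, nested)  # TRUE
identical(nested, piped)     # TRUE

# Further arguments are supplied as usual after the piped-in first argument.
piped_rounded <- table(cyl) |> prop.table() |> round(digits = 2)
piped_rounded
```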

Logical Operators

  • !a - TRUE if a is FALSE
  • a == b - TRUE if a and b are equal
  • a != b - TRUE if a and b are not equal
  • a > b - TRUE if a is strictly greater than b
  • a >= b - TRUE if a is greater than or equal to b
  • a < b - TRUE if a is strictly less than b
  • a <= b - TRUE if a is less than or equal to b
  • a %in% b - TRUE if a is in b where b is a vector
which( letters %in% c('a','e','i','o','u') )
## [1] 1 5 9 15 21
  • a | b - TRUE if a or b (or both) is TRUE
  • a & b - TRUE if both a and b are TRUE
  • isTRUE(a) - TRUE only if a is a single TRUE value
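
These operators are vectorized and propagate missing values, which matters once they are used for filtering. A minimal sketch with hypothetical values:

```r
a <- c(1, 5, 10, NA)   # hypothetical vector with one missing value
b <- 5

a == b               # FALSE  TRUE FALSE    NA
a > b                # FALSE FALSE  TRUE    NA
!(a >= b)            #  TRUE FALSE FALSE    NA
(a > 1) & (a < 10)   # FALSE  TRUE FALSE    NA
(a < 2) | (a > 8)    #  TRUE FALSE  TRUE    NA

# %in% never returns NA: a missing value is simply not found in the set.
a %in% c(5, 10)      # FALSE  TRUE  TRUE FALSE

# isTRUE() returns a single TRUE or FALSE, never NA.
isTRUE(a[4] == b)    # FALSE
```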

Filter

dplyr

library(dplyr)
mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)

Base R

# which() drops rows where the condition is NA, matching filter()
mylego <- legosets[which(legosets$themeGroup == 'Educational' & legosets$year > 2015),]

nrow(mylego)
## [1] 61
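
One difference worth knowing when filtering in base R: dplyr's filter() drops rows where the condition evaluates to NA, whereas logical indexing with [ returns a row of NAs for each of them. The sketch below uses a hypothetical toy data frame to show the behavior and two base R ways to avoid it.

```r
# Hypothetical toy data with one missing year.
toy <- data.frame(id = 1:4, year = c(2014, 2016, NA, 2018))

with_na_row <- toy[toy$year > 2015, ]         # NA condition yields an all-NA row
with_which  <- toy[which(toy$year > 2015), ]  # which() keeps only TRUE positions
with_subset <- subset(toy, year > 2015)       # subset() also drops NA conditions

nrow(with_na_row)   # 3
nrow(with_which)    # 2
nrow(with_subset)   # 2
```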

Select

dplyr

mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)

Base R

mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]

head(mylego, n = 4)
## setID pieces theme availability US_retailPrice minifigs
## 1 26803 103 Education {Not specified} NA 6
## 2 26689 142 Education {Not specified} NA 4
## 3 26804 98 Education {Not specified} NA 6
## 4 26277 188 Education Educational 78.95 NA

Relocate

dplyr

mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3)
## theme availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803 103 NA 6
## 2 Education {Not specified} 26689 142 NA 4
## 3 Education {Not specified} 26804 98 NA 6

Base R

mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
## theme availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803 103 NA 6
## 2 Education {Not specified} 26689 142 NA 4
## 3 Education {Not specified} 26804 98 NA 6
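
The base R version above lists every column by hand. A programmatic alternative is sketched below on a hypothetical toy data frame. It places all character columns first, which gives the same order as the relocate() call only when every remaining column is numeric.

```r
# Hypothetical toy data frame mixing numeric and character columns.
toy <- data.frame(id     = 1:3,
                  theme  = c("A", "B", "C"),
                  pieces = c(10L, 20L, 30L),
                  status = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

is_chr <- sapply(toy, is.character)   # TRUE for character columns

# Character columns first, then everything else, each in original order.
toy2 <- toy[, c(names(toy)[is_chr], names(toy)[!is_chr])]
names(toy2)   # "theme" "status" "id" "pieces"
```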

Rename

dplyr

mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3)
## setID pieces theme availability USD minifigs
## 1 26803 103 Education {Not specified} NA 6
## 2 26689 142 Education {Not specified} NA 4
## 3 26804 98 Education {Not specified} NA 6

Base R

names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
## theme availability setID pieces USD minifigs
## 1 Education {Not specified} 26803 103 NA 6
## 2 Education {Not specified} 26689 142 NA 4
## 3 Education {Not specified} 26804 98 NA 6
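
Renaming by position, as in the base R example above, breaks silently if the column order changes. A slightly more defensive sketch looks the column up by its current name. The toy data frame is hypothetical.

```r
# Hypothetical toy data frame.
toy <- data.frame(setID = 1:2, US_retailPrice = c(9.99, 19.99))

# Rename by matching the old name instead of hard-coding position 2.
names(toy)[names(toy) == "US_retailPrice"] <- "USD"
names(toy)   # "setID" "USD"
```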

Mutate

dplyr

mylego %>%
  filter(!is.na(pieces) & !is.na(US_retailPrice)) %>%
  mutate(Price_per_piece = US_retailPrice / pieces) %>%
  head(n = 3)
## setID pieces theme availability US_retailPrice minifigs Price_per_piece
## 1 26277 188 Education Educational 78.95 NA 0.4199468
## 2 25949 280 Education Educational 224.95 NA 0.8033929
## 3 25954 1 Education Educational 14.95 NA 14.9500000

Base R

# Keep rows where both pieces and price are known, then compute price per piece
mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
# Same rows and values as the dplyr result; base R subsetting keeps the original row names
head(mylego2, n = 3)
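
Base R also offers transform(), which reads closer to mutate() because columns can be referred to by bare name. A minimal sketch on hypothetical toy data (the column names are illustrative):

```r
# Hypothetical toy data with missing values in both columns.
toy <- data.frame(pieces = c(100L, NA, 50L),
                  price  = c(10, 5, NA))

# Keep only rows where both values are known.
complete <- toy[!is.na(toy$pieces) & !is.na(toy$price), ]

# Add the derived column; bare column names are evaluated inside the data frame.
complete <- transform(complete, price_per_piece = price / pieces)
complete
```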

Group By and Summarize

legosets %>%
  group_by(themeGroup) %>%
  summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
            sd_price = sd(US_retailPrice, na.rm = TRUE),
            median_price = median(US_retailPrice, na.rm = TRUE),
            n = n(),
            missing = sum(is.na(US_retailPrice)))
## # A tibble: 15 × 6
## themeGroup mean_price sd_price median_price n missing
## <chr> <dbl> <dbl> <dbl> <int> <int>
## 1 Action/Adventure 31.3 29.9 20.0 1280 462
## 2 Basic 13.1 12.8 7.99 843 473
## 3 Constraction 15.1 14.0 9.99 501 125
## 4 Educational 89.0 107. 59.7 452 294
## 5 Girls 23.4 22.6 15.0 677 225
## 6 Historical 25.5 27.7 15.0 473 125
## 7 Junior 18.6 13.2 17.8 228 93
## 8 Licensed 42.9 58.3 25.0 2060 467
## 9 Miscellaneous 14.3 20.8 6.99 4925 2117
## 10 Model making 52.8 65.1 30.0 582 166
## 11 Modern day 31.2 33.7 20.0 1723 763
## 12 Pre-school 23.8 19.4 20.0 1487 699
## 13 Racing 24.8 30.2 10 270 59
## 14 Technical 60.8 68.1 40.0 550 137
## 15 Vintage 9.71 9.56 7.50 304 264
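
For comparison, a grouped summary of the same shape can be sketched in base R with split() and sapply(). The toy data frame and column names are hypothetical. This is one of several base R approaches (aggregate() and tapply() are others), not a drop-in replacement for the dplyr code.

```r
# Hypothetical toy data: two groups, one missing price.
toy <- data.frame(group = c("a", "a", "b", "b", "b"),
                  price = c(10, 20, 5, NA, 15))

by_group <- split(toy$price, toy$group)   # list of price vectors, one per group

summary_df <- data.frame(
  group        = names(by_group),
  mean_price   = sapply(by_group, mean, na.rm = TRUE),
  median_price = sapply(by_group, median, na.rm = TRUE),
  n            = sapply(by_group, length),                    # counts NA rows too, like n()
  missing      = sapply(by_group, function(p) sum(is.na(p)))
)
rownames(summary_df) <- NULL   # drop the group names that sapply() carried over
summary_df
```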

Describe and Describe By

library(psych)
describe(legosets$US_retailPrice)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 9886 28.52 42 14.99 20.14 14.83 0 799.99 799.99 5.62 58.91 0.42
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
## item group1 vars n mean sd min max range se
## X11 1 {Not specified} 1 3197 24.24484 36.282072 0.60 789.99 789.39 0.6416833
## X12 2 Educational 1 9 140.95000 86.358265 14.95 244.95 230.00 28.7860885
## X13 3 LEGO exclusive 1 1066 28.79797 70.954538 0.00 799.99 799.99 2.1732094
## X14 4 LEGOLAND exclusive 1 7 12.70429 6.447591 4.99 19.99 15.00 2.4369603
## X15 5 Not sold 1 1 12.99000 NA 12.99 12.99 0.00 NA
## X16 6 Promotional 1 167 9.19485 23.667555 0.00 249.99 249.99 1.8314504
## X17 7 Promotional (Airline) 1 11 15.79455 6.614819 5.00 28.00 23.00 1.9944429
## X18 8 Retail 1 4824 29.82030 33.270049 1.95 399.99 398.04 0.4790158
## X19 9 Retail - limited 1 600 44.64837 57.391438 0.40 379.99 379.59 2.3429956
## X110 10 Unknown 1 4 2.24750 1.253671 1.00 3.99 2.99 0.6268356
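
Several of the columns reported by describe() can be reproduced with base R, which helps when interpreting them. The sketch below assumes that trimmed is a 10% trimmed mean, that mad is the scaled median absolute deviation returned by stats::mad(), and that se is the standard error of the mean. Check the psych documentation before relying on these assumptions. The data are hypothetical.

```r
# Hypothetical prices with one extreme value and one missing value.
prices <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100, NA)
v <- prices[!is.na(prices)]   # missing values are dropped before summarizing

stats <- c(n       = length(v),
           mean    = mean(v),
           sd      = sd(v),
           median  = median(v),
           trimmed = mean(v, trim = 0.1),      # drops the lowest and highest 10%
           mad     = mad(v),                   # scaled median absolute deviation
           min     = min(v),
           max     = max(v),
           range   = max(v) - min(v),
           se      = sd(v) / sqrt(length(v)))  # standard error of the mean
round(stats, 2)
```

In this sketch the trimmed mean stays close to the median because the single extreme value is discarded, while the ordinary mean is pulled upward.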

Additional Resources

For data wrangling:

One Minute Paper

Complete the one minute paper: https://forms.gle/p9xcKcTbGiyYSz368

  1. What was the most important thing you learned during this class?

  2. What important question remains unanswered for you?
