1. Context of this project

This project performs exploratory data analysis (EDA) and sampling on ramen ratings, using the data set available at https://www.kaggle.com/residentmario/ramen-ratings.

2. Content

The data set records ramen review scores. It has 7 columns (review ID, brand, variety, style, country of production, rating, and top-ten ranking) and 2580 rows.

3. Data preparation & preprocessing

3.1 Data Preparation

  1. Download the ramen-ratings data set to your local computer. URL: https://www.kaggle.com/residentmario/ramen-ratings

  2. Import the data set into R. Note: Set the working directory first.

ramen.rating <- read.csv("ramen-ratings.csv",
                         header=TRUE)
head(ramen.rating)
##   Review..          Brand
## 1     2580      New Touch
## 2     2579       Just Way
## 3     2578         Nissin
## 4     2577        Wei Lih
## 5     2576 Ching's Secret
## 6     2575  Samyang Foods
##                                                       Variety Style     Country
## 1                                   T's Restaurant Tantanmen    Cup       Japan
## 2 Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles  Pack      Taiwan
## 3                               Cup Noodles Chicken Vegetable   Cup         USA
## 4                               GGE Ramen Snack Tomato Flavor  Pack      Taiwan
## 5                                             Singapore Curry  Pack       India
## 6                                      Kimchi song Song Ramen  Pack South Korea
##   Stars Top.Ten
## 1  3.75        
## 2     1        
## 3  2.25        
## 4  2.75        
## 5  3.75        
## 6  4.75
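
The column names `Review..` and `Top.Ten` in the output above come from `read.csv` making the header names syntactically valid. The sketch below illustrates this on a small inline CSV. The header names `Review #` and `Top Ten`, and the third row, are assumptions made for illustration and are not taken from the real file.

```r
# Toy CSV text; the header names "Review #" and "Top Ten" are assumed,
# and the last row is hypothetical
csv_text <- "Review #,Brand,Stars,Top Ten
2580,New Touch,3.75,
2579,Just Way,1,
1,Hypothetical Brand,Unrated,"

toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

names(toy)        # spaces and "#" are replaced by "."
class(toy$Stars)  # a single "Unrated" entry makes the whole column text
```

Because one non-numeric entry is enough to make `Stars` a text column, the ratings have to be converted to numbers before any numerical analysis.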

3.2 Processing Data

Rows with missing information are removed: those with an empty ramen style and those whose rating is "Unrated". This leaves 2575 of the 2580 rows.

data<-ramen.rating[!(ramen.rating$Style=="")&!(ramen.rating$Stars=="Unrated"), ]
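
As a rough illustration of the filter above, the sketch below applies the same two conditions to a small hypothetical data frame (the rows are invented, not taken from the data set) and counts how many rows are dropped.

```r
# Hypothetical rows: one has an empty style, one is "Unrated"
toy <- data.frame(
  Style = c("Pack", "", "Cup", "Bowl"),
  Stars = c("3.75", "4", "Unrated", "5"),
  stringsAsFactors = FALSE
)

# Keep rows that have a style and a real rating
keep <- toy$Style != "" & toy$Stars != "Unrated"
clean <- toy[keep, ]

nrow(toy) - nrow(clean)  # number of rows dropped
```

Running the same subtraction on `ramen.rating` and `data` is a quick way to confirm how many rows the filter removed.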

4. Analysis of categorical Variable

Categorical data is qualitative: it describes an observation with labels rather than numbers. The frequency table and pie chart below show how the ramen styles are distributed in the data. Pack, Bowl and Cup are the predominant packaging styles, and about 60% of the ramen are sold as Pack.

table(data$Style)
## 
##  Bar Bowl  Box  Can  Cup Pack Tray 
##    1  481    6    1  450 1528  108
s<-data.frame(table(data$Style))
colnames(s)<-c("Style","Freq")

library(plotly)
fig <- plot_ly(s, labels = ~Style, values = ~Freq, type = 'pie')
fig <- fig %>% layout(title = 'Ramen Package-Styles Distribution Under Data',
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

fig

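The "about 60%" figure can be checked directly from the style counts printed by `table(data$Style)`. The sketch below re-enters those counts by hand (the vector name `style_counts` is just illustrative) and converts them to percentages with `prop.table`.

```r
# Counts copied from the table(data$Style) output in this section
style_counts <- c(Bar = 1, Bowl = 481, Box = 6, Can = 1,
                  Cup = 450, Pack = 1528, Tray = 108)

# Share of each style in percent, rounded to one decimal place
style_share <- round(100 * prop.table(style_counts), 1)
sort(style_share, decreasing = TRUE)
```

With the real data frame, `prop.table(table(data$Style))` gives the same shares without retyping the counts.
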
5. Analysis of numerical Variable

A numerical variable is one whose values have numerical meaning, such as a measurement or a count. The ramen rating (Stars) is such a variable. The frequency table and chart below show how the ratings are distributed. Ratings range from 0 to 5 and are not restricted to integers. The table lists some values more than once (for example 3, 3.0 and 3.00) because the ratings are still stored as text at this point. I use a scatter plot of frequency against rating to show the distribution. As the graph shows, most ratings lie between 3 and 4.

### One numerical variable
table(data$Stars)
## 
##     0   0.1  0.25   0.5  0.75   0.9     1   1.1  1.25   1.5  1.75   1.8     2 
##    26     1    11    14     1     1    26     2    10    37    27     1    68 
##   2.1 2.125  2.25   2.3   2.5  2.75   2.8  2.85   2.9     3   3.0  3.00   3.1 
##     1     1    21     2    67    85     2     1     2   172     2     1     2 
## 3.125   3.2  3.25   3.3   3.4   3.5  3.50   3.6  3.65   3.7  3.75   3.8     4 
##     1     1   170     1     1   326     9     1     1     1   349     3   384 
##   4.0  4.00 4.125  4.25   4.3   4.5  4.50  4.75     5   5.0  5.00 
##     3     6     2   143     4   132     3    64   369    10     7
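# Convert the ratings to numeric before tabulating, so that "3", "3.0" and "3.00" are counted as one value
stf<-data.frame(table(as.numeric(as.character(data$Stars))))
colnames(stf)<-c("Rate","Freq")

# table() stores the rates as a factor; convert them back to numeric
stf$Rate<-as.numeric(as.character(stf$Rate))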

#graph for numerical variables (trace type and mode set explicitly)
p <- plot_ly(stf, x = ~Rate, y = ~Freq,
               type = 'scatter', mode = 'markers',
               marker = list(size = 10,
                             color = 'rgba(255, 182, 193, .9)',
                             line = list(color = 'rgba(152, 0, 0, .8)',
                                         width = 2)))
p <- p %>% layout(title = 'Rates Scatter',
                      yaxis = list(zeroline = FALSE),
                      xaxis = list(zeroline = FALSE))
p

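The frequency table in this section shows the same rating under several labels, such as `3`, `3.0` and `3.00`. A minimal sketch with a hypothetical vector of text ratings shows why this happens and how converting to numeric merges the duplicates.

```r
# Hypothetical ratings stored as text, as they are after read.csv
stars_chr <- c("3", "3.0", "3.00", "3.5", "3.50", "5")

table(stars_chr)  # every distinct spelling gets its own cell

# After conversion, equal numbers fall into the same cell
stars_num <- as.numeric(stars_chr)
table(stars_num)
```

Tabulating the numeric values therefore gives one frequency per rating, which is what a frequency-against-rating plot needs.
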
6. Analysis of Multivariate data

The relationship between ramen packaging style and top-ten ranking is examined here as bivariate data. In statistics, bivariate data pairs each value of one variable with a value of another, and the usual interest is in whether the two variables are associated.

From the contingency table and the 2D histogram below, most top-ten ramen are Pack style: Pack accounts for 33 of the 37 ranked entries and Tray for the other 4. Bowl and Cup appear only under a stray "\n" value in Top.Ten, which is not an actual ranking.

library(tidyverse)
d<-as_tibble(data)
d.top.ten<-d[!(!is.na(d$Top.Ten) & d$Top.Ten==""), ]
table(d.top.ten$Style,d.top.ten$Top.Ten)
##       
##        \n 2012 #1 2012 #10 2012 #2 2012 #3 2012 #4 2012 #5 2012 #6 2012 #7
##   Bowl  1       0        0       0       0       0       0       0       0
##   Cup   1       0        0       0       0       0       0       0       0
##   Pack  2       1        1       1       0       1       1       1       1
##   Tray  0       0        0       0       1       0       0       0       0
##       
##        2012 #9 2013 #1 2013 #10 2013 #2 2013 #3 2013 #4 2013 #6 2013 #9 2014 #1
##   Bowl       0       0        0       0       0       0       0       0       0
##   Cup        0       0        0       0       0       0       0       0       0
##   Pack       1       1        1       1       1       1       0       1       1
##   Tray       0       0        0       0       0       0       1       0       0
##       
##        2014 #10 2014 #4 2014 #5 2014 #6 2014 #7 2014 #8 2014 #9 2015 #1
##   Bowl        0       0       0       0       0       0       0       0
##   Cup         0       0       0       0       0       0       0       0
##   Pack        1       0       1       1       1       1       1       1
##   Tray        0       1       0       0       0       0       0       0
##       
##        2015 #10 2015 #4 2015 #6 2015 #7 2015 #8 2015 #9 2016 #1 2016 #10
##   Bowl        0       0       0       0       0       0       0        0
##   Cup         0       0       0       0       0       0       0        0
##   Pack        1       1       1       1       1       0       1        1
##   Tray        0       0       0       0       0       1       0        0
##       
##        2016 #5 2016 #7 2016 #8 2016 #9
##   Bowl       0       0       0       0
##   Cup        0       0       0       0
##   Pack       1       1       1       1
##   Tray       0       0       0       0
#-----------
d.top.ten<-data.frame(d.top.ten)
colnames(d.top.ten)<-c("Num","Brand","Variety","Style","Country","Stars","Top.Ten")
p <- plot_ly(d.top.ten, x = ~Style, y = ~Top.Ten, type = 'histogram2d')
p

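The `Top.Ten` values in the table above follow a "year #rank" pattern, apart from the stray `"\n"` entry. The sketch below is one possible way to keep only well-formed entries and split them into a year and a rank. The input vector is hypothetical and the regular expression is an assumption about the format, based on the column labels shown in the table.

```r
# Hypothetical Top.Ten values, including an empty string and a stray newline
top_ten <- c("2012 #1", "2013 #10", "\n", "", "2016 #5")

# Keep entries that look like "<4-digit year> #<rank>"
is_rank <- grepl("^[0-9]{4} #[0-9]+$", top_ten)
ranked <- top_ten[is_rank]

year <- as.integer(sub(" #.*$", "", ranked))  # text before " #"
rank <- as.integer(sub("^.*#", "", ranked))   # text after "#"
data.frame(year, rank)
```

Filtering on the pattern rather than on `!= ""` also drops placeholder values such as `"\n"`, so they no longer show up as a column in the contingency table.
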
7. Examining the distribution of numerical data

This section examines the distribution of the ratings. According to the graph, most ratings are above 3, on the right-hand side of the plot, so the distribution is left-skewed: it has a long left tail.

d.stars<-table(data$Stars)
plot(d.stars,xlab="Stars",ylab="Frequencies",main="Distribution of Ramen Stars")

# Left Skewed distribution
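
Left skew can also be checked numerically rather than only by eye. The sketch below uses a small hypothetical vector of ratings (not the ramen data) and a hand-written moment-based `skewness` helper, which is an illustrative function and not part of base R. A negative value, or a mean below the median, points to a long left tail.

```r
# Hypothetical ratings with one low outlier
x <- c(1, 3, 3.5, 3.75, 4, 4, 4.5, 5, 5)

# Moment-based sample skewness: third central moment over SD cubed
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skewness(x)          # negative for a left-skewed sample
mean(x) < median(x)  # the mean is pulled towards the long left tail
```

Applied to the numeric ratings, the same two checks would give a quick confirmation of what the frequency plot suggests.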

8. Random sampling of the data: Central Limit Theorem

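The central limit theorem states that if a population has mean μ and standard deviation σ, and sufficiently large random samples are drawn from it with replacement, then the sample means are approximately normally distributed. Sample sizes of 10, 20, 30, 40, 50 and 60 are used here, with 3000 samples drawn for each size. The code below samples without replacement; since each sample is at most 60 of the 2575 ratings, this differs little from sampling with replacement. According to the histograms and the printed statistics, the distribution of the sample means becomes more symmetric and more tightly concentrated around the mean as the sample size grows (the standard deviation falls from about 0.32 to about 0.13), which is consistent with the central limit theorem.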

library(sampling)
data$Stars<-as.numeric(paste(data$Stars))
x <- data$Stars
par(mfrow=c(1,1))
hist(x, prob = TRUE, 
     xlim=c(0,6), ylim = c(0, 4),main="Distribution of Stars",xlab="stars")

par(mfrow = c(3,2))
samples <- 3000
xbar <- numeric(samples)
for (size in c(10, 20, 30, 40, 50, 60)) {
  for (i in 1:samples) {
    xbar[i] <- mean(sample(x, size, replace = FALSE))
  }
  
  hist(xbar, prob = TRUE, 
       xlim=c(0,6), ylim = c(0, 4),
       main = paste("Sample Size =", size),xlab="Stars")
  
  cat("Sample Size = ", size, " Mean = ", mean(xbar),
      " SD = ", sd(xbar), "\n")
}
## Sample Size =  10  Mean =  3.657043  SD =  0.3178566
## Sample Size =  20  Mean =  3.658541  SD =  0.2230601
## Sample Size =  30  Mean =  3.650431  SD =  0.1852735
## Sample Size =  40  Mean =  3.656216  SD =  0.1637549
## Sample Size =  50  Mean =  3.65289  SD =  0.1404125
## Sample Size =  60  Mean =  3.655275  SD =  0.1324599
par(mfrow = c(1,1))
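
The shrinking SD values printed above are what the theory predicts: the standard deviation of the sample mean is roughly σ/√n. In the output, 0.318 × √10 and 0.132 × √60 are both close to 1. The sketch below checks the same relation on a hypothetical discrete population (not the ramen data). The helper name `sim_sd` and the seed are illustrative choices.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical discrete population of ratings (not the ramen data)
pop <- rep(c(1, 2, 3, 3.5, 4, 5), times = c(5, 10, 20, 25, 25, 15))
sigma <- sqrt(mean((pop - mean(pop))^2))  # population standard deviation

# Standard deviation of `reps` sample means, each from a sample of size n
sim_sd <- function(n, reps = 3000) {
  sd(replicate(reps, mean(sample(pop, n, replace = TRUE))))
}

for (n in c(10, 60)) {
  cat("n =", n,
      " simulated SD =", round(sim_sd(n), 3),
      " sigma/sqrt(n) =", round(sigma / sqrt(n), 3), "\n")
}
```

The two columns should agree closely for each n, up to simulation noise.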

9. Sampling Methods

Two sampling methods are used here. The first is simple random sampling without replacement; the resulting sample shows that the rating distribution is still left-skewed. The second is systematic random sampling, in which members of a larger population are selected from a random starting point at a fixed, periodic interval.

Simple random and systematic sampling are less labor-intensive than analyzing the full data set. Although they introduce some sampling error, they make it easy to obtain a rough assessment of trends in a large population, provided the sample is large enough.

For this data set in particular, both methods can approximate the overall distribution of ratings when the sample size is large enough, which reduces the analyst's workload. Simple random sampling is the simplest method to apply. Systematic sampling takes rows at a fixed interval, so the sample is spread evenly across the whole data set, which may give more stable estimates.

# simple random sampling

library(sampling)
set.seed(100)
s<-srswor(100,nrow(data))
sample.2<-data[s!=0,]
table(sample.2$Stars)
## 
##    0    1  1.1  1.5 1.75    2  2.5 2.75    3 3.25  3.5 3.75    4 4.25  4.3  4.5 
##    1    1    1    3    1    1    6    5    4    6   14   13   16    7    1    6 
## 4.75    5 
##    5    9
plot(table(sample.2$Stars),xlab="Stars",ylim=c(0,16),ylab="Frequencies",main="Simple Random Sampling For Ramen Rating")

# Systematic sampling
set.seed(100)

N <- nrow(data)

n <- 70   #sample size=70
k <- ceiling(N / n)     #sampling interval, rounded up
r <- sample(k, 1)       #random starting point between 1 and k
s <- seq(r, by = k, length.out = n)
s <- s[s <= N]          #drop any index beyond the last row
sample.1 <- data[s, ]
table(sample.1$Stars)
## 
##  1.5 1.75    2  2.5 2.75    3 3.25  3.5 3.75    4 4.25  4.5 4.75    5 
##    1    1    2    5    5    5    2    5   10   10    3    6    3   12
plot(table(sample.1$Stars),xlab="Stars",ylab="Frequencies",ylim=c(0,16),main="Systematic Random Sampling For Ramen Rating")
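
The systematic sampling steps above can be wrapped in a small helper. The function name `systematic_sample` is hypothetical, and the sketch is only one way to arrange the same logic. Because the interval is rounded up, a late starting point can push the last indices past the end of the data, so the helper drops any index greater than N.

```r
# Return row indices for a systematic sample of (up to) n rows out of N
systematic_sample <- function(N, n) {
  k <- ceiling(N / n)                    # sampling interval, rounded up
  r <- sample(k, 1)                      # random start between 1 and k
  idx <- seq(r, by = k, length.out = n)  # every k-th index from r
  idx[idx <= N]                          # drop indices past the last row
}

set.seed(100)
idx <- systematic_sample(N = 2575, n = 70)
range(idx)
length(idx)
```

With a data frame of 2575 rows, `data[idx, ]` would then give the sample, which may hold slightly fewer than 70 rows when the start is late.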

10. Summary

About 60% of the ramen are Pack style, the clear majority. Pack also dominates the top-ten rankings, which suggests it is the style reviewers prefer.