
[R programming basics] Checking basic summary statistics with summaryBy()

호두밥 2019. 7. 16. 21:52

Analyzing online ad click counts

 

Loading the base data

 data1 <- read.csv(url('http://stat.columbia.edu/~rachel/datasets/nyt1.csv'))

 

Exploring the data

> head(data1)
  Age Gender Impressions Clicks Signed_In
1  36      0           3      0         1
2  73      1           3      0         1
3  30      0           3      0         1
4  49      1           3      0         1
5  47      1          11      0         1
6  47      0          11      1         1

# Split the data into age brackets (0~18, 18~24, ...); the new column is named agecut because the later steps refer to it by that name
> data1$agecut <- cut(data1$Age, c(-Inf, 0,18,24,34,44,54,64, Inf))
> summary(data1)
      Age             Gender       Impressions         Clicks          Signed_In           agecut      
 Min.   :  0.00   Min.   :0.000   Min.   : 0.000   Min.   :0.00000   Min.   :0.0000   (-Inf,0]:137106  
 1st Qu.:  0.00   1st Qu.:0.000   1st Qu.: 3.000   1st Qu.:0.00000   1st Qu.:0.0000   (34,44] : 70860  
 Median : 31.00   Median :0.000   Median : 5.000   Median :0.00000   Median :1.0000   (44,54] : 64288  
 Mean   : 29.48   Mean   :0.367   Mean   : 5.007   Mean   :0.09259   Mean   :0.7009   (24,34] : 58174  
 3rd Qu.: 48.00   3rd Qu.:1.000   3rd Qu.: 6.000   3rd Qu.:0.00000   3rd Qu.:1.0000   (54,64] : 44738  
 Max.   :108.00   Max.   :1.000   Max.   :20.000   Max.   :4.00000   Max.   :1.0000   (18,24] : 35270  
                                                                                      (Other) : 48005  
                                                                                      
> str(data1)
'data.frame':   458441 obs. of  6 variables:
 $ Age        : int  36 73 30 49 47 47 0 46 16 52 ...
 $ Gender     : int  0 1 0 1 1 0 0 0 0 0 ...
 $ Impressions: int  3 3 3 3 11 11 7 5 3 4 ...
 $ Clicks     : int  0 0 0 0 0 1 1 0 0 0 ...
 $ Signed_In  : int  1 1 1 1 1 1 0 1 1 1 ...
 $ agecut     : Factor w/ 8 levels "(-Inf,0]","(0,18]",..: 5 8 4 6 6 6 1 6 2 6 ...
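
To see how cut() assigns ages to brackets, here is a small sketch on a handful of made-up ages (the vector `ages` and the name `age_bin` are only for illustration and are not part of the NYT data). By default the intervals are closed on the right, so an age of exactly 18 lands in (0,18] and 19 lands in (18,24].

```r
# Made-up ages, chosen to sit on and around the bracket boundaries.
ages <- c(0, 7, 18, 19, 24, 25, 65, 108)

# Same break points as in the post.
breaks <- c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf)

# cut() returns a factor with one level per interval (8 levels here).
age_bin <- cut(ages, breaks)

# Show each age next to the bracket it was assigned to.
print(data.frame(age = ages, bin = age_bin))

# Count how many ages fall into each bracket (empty brackets show 0).
print(table(age_bin))
```

Age 0 gets its own bracket, (-Inf,0], which is why that level is so large in the summary above: it collects every record whose age was recorded as 0.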

Installing the doBy package and using summaryBy (computing summary/descriptive statistics)

 summaryBy(formula, data, FUN) : applies the given function(s) to each group and shows the results

 summaryBy(formula, data) : if no function is given, the mean of each group is shown.

 # formula : variables to summarize ~ variables to group by
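
To make the formula notation concrete, here is a minimal sketch on a made-up data frame (`toy` and its columns are hypothetical). It uses base R's aggregate(), which reads a `response ~ group` formula the same way, so it runs even without doBy installed; the matching summaryBy() call is shown as a comment for comparison.

```r
# Made-up data: two groups with a numeric column to summarize.
toy <- data.frame(
  group  = c("a", "a", "b", "b", "b"),
  clicks = c(0, 2, 1, 1, 4)
)

# Left of ~ : the variable to summarize. Right of ~ : the grouping variable.
by_group <- aggregate(clicks ~ group, data = toy, FUN = mean)
print(by_group)

# With doBy loaded, the analogous call would be:
#   summaryBy(clicks ~ group, data = toy)   # FUN defaults to mean
```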

> install.packages('doBy')
> library('doBy')

> summaryBy(Gender+Signed_In+Impressions+Clicks ~ agecut, data=data1)
# Mean of Gender, Signed_In, Impressions, and Clicks for each age bracket (grouped by agecut)
     agecut Gender.mean Signed_In.mean Impressions.mean Clicks.mean
1  (-Inf,0]   0.0000000              0         4.999657  0.14207985
2    (0,18]   0.6421151              1         4.998961  0.13105132
3   (18,24]   0.5338531              1         5.006635  0.04845478
4   (24,34]   0.5321621              1         4.993829  0.05048647
5   (34,44]   0.5316963              1         5.021507  0.05167937
6   (44,54]   0.5289790              1         5.010406  0.05027377
7   (54,64]   0.5361885              1         5.022308  0.10183736
8 (64, Inf]   0.3632664              1         5.012347  0.15128856

> siterange <- function(x){c(length(x),min(x),mean(x),max(x))}
> summaryBy(Age~agecut,data=data1,FUN=siterange)
# For each age bracket (agecut), show the count, minimum, mean, and maximum of Age
     agecut Age.FUN1 Age.FUN2 Age.FUN3 Age.FUN4
1  (-Inf,0]   137106        0  0.00000        0
2    (0,18]    19252        7 16.03350       18
3   (18,24]    35270       19 21.26904       24
4   (24,34]    58174       25 29.50335       34
5   (34,44]    70860       35 39.49468       44
6   (44,54]    64288       45 49.49258       54
7   (54,64]    44738       55 59.49819       64
8 (64, Inf]    28753       65 72.98870      108
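
The column names Age.FUN1 to Age.FUN4 appear because siterange returns an unnamed vector. A sketch of one way to get readable names is to return a named vector instead. The example below uses base R's aggregate() on made-up ages so that it runs without doBy (`siterange_named` and `toy` are hypothetical names); doBy's documentation describes summaryBy() as using the names of a named return value in a similar way, but treat that as something to check against your installed version.

```r
# Same four statistics as siterange, but each element is named.
siterange_named <- function(x) {
  c(n = length(x), min = min(x), mean = mean(x), max = max(x))
}

# Made-up ages, bracketed with a shortened set of break points.
toy <- data.frame(Age = c(0, 0, 7, 16, 18, 19, 24))
toy$agecut <- cut(toy$Age, c(-Inf, 0, 18, 24))

# aggregate() stores the four statistics as a matrix column called Age.
res <- aggregate(Age ~ agecut, data = toy, FUN = siterange_named)

# Flatten the matrix column into ordinary columns Age.n, Age.min, ...
res <- do.call(data.frame, res)
print(res)
```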

 

Counting records by click status and ad impressions

> data1$hasimps <- cut(data1$Impressions, c(-Inf,0,Inf))
> data1$scode[data1$Impressions==0] <- 'NoImps'
> data1$scode[data1$Impressions>0] <- 'Imps'
> data1$scode[data1$Clicks > 0] <-'Clicks'
> # data1$scode <- factor(data1$scode)   # optional: convert scode to a factor
> head(data1)
  Age Gender Impressions Clicks Signed_In    agecut  hasimps  scode
1  36      0           3      0         1   (34,44] (0, Inf]   Imps
2  73      1           3      0         1 (64, Inf] (0, Inf]   Imps
3  30      0           3      0         1   (24,34] (0, Inf]   Imps
4  49      1           3      0         1   (44,54] (0, Inf]   Imps
5  47      1          11      0         1   (44,54] (0, Inf]   Imps
6  47      0          11      1         1   (44,54] (0, Inf] Clicks
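
Note that the three scode assignments run in order, so a later one overwrites an earlier one: a row with impressions and at least one click is first labeled 'Imps' and then relabeled 'Clicks'. The sketch below, on a made-up data frame (`toy` is hypothetical), repeats the sequential version and compares it with a nested ifelse() that spells out the same priority.

```r
# Made-up rows: no impressions, impressions only, impressions with a click.
toy <- data.frame(Impressions = c(0, 3, 5, 2), Clicks = c(0, 0, 1, 0))

# Sequential version, as in the post (column created first for clarity).
toy$scode_seq <- NA_character_
toy$scode_seq[toy$Impressions == 0] <- "NoImps"
toy$scode_seq[toy$Impressions > 0]  <- "Imps"
toy$scode_seq[toy$Clicks > 0]       <- "Clicks"   # overrides "Imps"

# Equivalent nested ifelse(): check clicks first, then impressions.
toy$scode <- ifelse(toy$Clicks > 0, "Clicks",
                    ifelse(toy$Impressions > 0, "Imps", "NoImps"))

print(toy)
```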
> clean <- function(x){c(length(x))}
> etable <- summaryBy(Impressions ~ scode+Gender+agecut,data=data1, FUN=clean)
> etable
    scode Gender    agecut Impressions.clean
1  Clicks      0  (-Inf,0]             17776
2  Clicks      0    (0,18]               846
3  Clicks      0   (18,24]               779
4  Clicks      0   (24,34]              1361
5  Clicks      0   (34,44]              1675
6  Clicks      0   (44,54]              1494
7  Clicks      0   (54,64]              2006
8  Clicks      0 (64, Inf]              2598
9  Clicks      1    (0,18]              1525
10 Clicks      1   (18,24]               890
11 Clicks      1   (24,34]              1509
12 Clicks      1   (34,44]              1917
13 Clicks      1   (44,54]              1645
14 Clicks      1   (54,64]              2331
15 Clicks      1 (64, Inf]              1486
16   Imps      0  (-Inf,0]            118401
17   Imps      0    (0,18]              6001
18   Imps      0   (18,24]             15538
19   Imps      0   (24,34]             25690
20   Imps      0   (34,44]             31290
21   Imps      0   (44,54]             28563
22   Imps      0   (54,64]             18626
23   Imps      0 (64, Inf]             15585
24   Imps      1    (0,18]             10754
25   Imps      1   (18,24]             17807
26   Imps      1   (24,34]             29241
27   Imps      1   (34,44]             35512
28   Imps      1   (44,54]             32143
29   Imps      1   (54,64]             21499
30   Imps      1 (64, Inf]              8887
31 NoImps      0  (-Inf,0]               929
32 NoImps      0    (0,18]                43
33 NoImps      0   (18,24]               124
34 NoImps      0   (24,34]               165
35 NoImps      0   (34,44]               219
36 NoImps      0   (44,54]               224
37 NoImps      0   (54,64]               118
38 NoImps      0 (64, Inf]               125
39 NoImps      1    (0,18]                83
40 NoImps      1   (18,24]               132
41 NoImps      1   (24,34]               208
42 NoImps      1   (34,44]               247
43 NoImps      1   (44,54]               219
44 NoImps      1   (54,64]               158
45 NoImps      1 (64, Inf]                72
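
Because clean() only returns length(x), etable is really a frequency table: each row is the number of records in one scode/Gender/agecut combination. As a cross-check, the same kind of count can be sketched with base R's table() on made-up data (`toy` is hypothetical). One difference is that table() also lists combinations with zero records, so they are filtered out here to mirror the output above, where, for example, Gender 1 never appears with (-Inf,0].

```r
# Made-up records with two of the grouping columns.
toy <- data.frame(
  scode  = c("Imps", "Imps", "Clicks", "NoImps", "Imps"),
  Gender = c(0, 1, 0, 0, 0)
)

# Cross-tabulate, then turn the table into one row per combination.
counts <- as.data.frame(table(scode = toy$scode, Gender = toy$Gender))

# Drop combinations that never occur.
counts <- counts[counts$Freq > 0, ]
print(counts)
```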