Analyzing Online Ad Click Counts
Loading the base data
data1 <- read.csv(url('http://stat.columbia.edu/~rachel/datasets/nyt1.csv'))
Inspecting the data
> head(data1)
Age Gender Impressions Clicks Signed_In
1 36 0 3 0 1
2 73 1 3 0 1
3 30 0 3 0 1
4 49 1 3 0 1
5 47 1 11 0 1
6 47 0 11 1 1
# Split the data into age brackets: 0 or below, (0,18], (18,24], ...
> data1$agecut <- cut(data1$Age, c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf))
> summary(data1)
      Age             Gender       Impressions         Clicks          Signed_In          agecut
 Min.   :  0.00   Min.   :0.000   Min.   : 0.000   Min.   :0.00000   Min.   :0.0000   (-Inf,0]:137106
 1st Qu.:  0.00   1st Qu.:0.000   1st Qu.: 3.000   1st Qu.:0.00000   1st Qu.:0.0000   (34,44] : 70860
 Median : 31.00   Median :0.000   Median : 5.000   Median :0.00000   Median :1.0000   (44,54] : 64288
 Mean   : 29.48   Mean   :0.367   Mean   : 5.007   Mean   :0.09259   Mean   :0.7009   (24,34] : 58174
 3rd Qu.: 48.00   3rd Qu.:1.000   3rd Qu.: 6.000   3rd Qu.:0.00000   3rd Qu.:1.0000   (54,64] : 44738
 Max.   :108.00   Max.   :1.000   Max.   :20.000   Max.   :4.00000   Max.   :1.0000   (18,24] : 35270
                                                                                      (Other) : 48005
> str(data1)
'data.frame': 458441 obs. of 6 variables:
 $ Age        : int 36 73 30 49 47 47 0 46 16 52 ...
 $ Gender     : int 0 1 0 1 1 0 0 0 0 0 ...
 $ Impressions: int 3 3 3 3 11 11 7 5 3 4 ...
 $ Clicks     : int 0 0 0 0 0 1 1 0 0 0 ...
 $ Signed_In  : int 1 1 1 1 1 1 0 1 1 1 ...
 $ agecut     : Factor w/ 8 levels "(-Inf,0]","(0,18]",..: 5 8 4 6 6 6 1 6 2 6 ...
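The bracket boundaries are easy to misread, so here is a small sketch of how cut() treats values that sit exactly on a break. The ages are made-up values chosen for illustration; only the break vector comes from the post. By default cut() builds right-closed intervals, so 18 belongs to (0,18] and 19 to (18,24].

```r
# Toy ages chosen to sit on or next to the bracket boundaries (made-up values).
ages <- c(0, 7, 18, 19, 24, 25, 64, 65, 108)
breaks <- c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf)

# right = TRUE is the default: each interval excludes its lower bound
# and includes its upper bound.
agecut <- cut(ages, breaks)

# Show each age next to the bracket it was assigned to.
print(data.frame(age = ages, agecut = agecut))

# Count how many toy ages fall in each of the 8 brackets.
print(table(agecut))
```

If left-closed brackets such as [18,24) were wanted instead, cut() accepts right = FALSE.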
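Installing the doBy package and using summaryBy (summary/descriptive statistics)
summaryBy(formula, data, FUN) : applies the given function to each group and reports the result
summaryBy(formula, data) : with no function given, reports the mean of each group, since FUN defaults to mean
# formula: variables to summarize ~ variables to group by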
> install.packages('doBy')
> library('doBy')
> summaryBy(Gender+Signed_In+Impressions+Clicks ~ agecut, data=data1)
# Mean of Gender, Signed_In, Impressions, and Clicks for each age bracket (grouped by agecut)
agecut Gender.mean Signed_In.mean Impressions.mean Clicks.mean
1 (-Inf,0] 0.0000000 0 4.999657 0.14207985
2 (0,18] 0.6421151 1 4.998961 0.13105132
3 (18,24] 0.5338531 1 5.006635 0.04845478
4 (24,34] 0.5321621 1 4.993829 0.05048647
5 (34,44] 0.5316963 1 5.021507 0.05167937
6 (44,54] 0.5289790 1 5.010406 0.05027377
7 (54,64] 0.5361885 1 5.022308 0.10183736
8 (64, Inf] 0.3632664 1 5.012347 0.15128856
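As a cross-check that does not need doBy, the same kind of grouped mean can be sketched with base R's aggregate(). This swaps in a different function from the one the post uses, and the toy data frame below is made up; only the column names follow the post.

```r
# Made-up miniature of data1; only the column names follow the post.
toy <- data.frame(
  agecut      = factor(c("(0,18]", "(0,18]", "(18,24]", "(18,24]")),
  Impressions = c(4, 6, 5, 7),
  Clicks      = c(0, 1, 0, 0)
)

# cbind() on the left-hand side plays the role of Impressions + Clicks
# in the summaryBy formula; FUN = mean matches summaryBy's default.
by_age <- aggregate(cbind(Impressions, Clicks) ~ agecut, data = toy, FUN = mean)
print(by_age)
```

The result has one row per age bracket and one column per summarized variable, which is the same shape as the summaryBy table above, apart from the ".mean" suffix in the column names.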
> siterange <- function(x){c(length(x),min(x),mean(x),max(x))}
> summaryBy(Age~agecut,data=data1,FUN=siterange)
# For Age within each age bracket (grouped by agecut): count, minimum, mean, and maximum
agecut Age.FUN1 Age.FUN2 Age.FUN3 Age.FUN4
1 (-Inf,0] 137106 0 0.00000 0
2 (0,18] 19252 7 16.03350 18
3 (18,24] 35270 19 21.26904 24
4 (24,34] 58174 25 29.50335 34
5 (34,44] 70860 35 39.49468 44
6 (44,54] 64288 45 49.49258 54
7 (54,64] 44738 55 59.49819 64
8 (64, Inf] 28753 65 72.98870 108
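In the table above, FUN1 to FUN4 correspond to length, min, mean, and max, in the order they appear inside siterange. A base R sketch that carries the labels along is shown below; siterange_named is a hypothetical variant introduced here for illustration, and the toy ages are made up.

```r
# Hypothetical named variant of siterange(); the element names are my own choice.
siterange_named <- function(x) {
  c(n = length(x), min = min(x), mean = mean(x), max = max(x))
}

# Made-up ages spread over two of the post's brackets.
toy <- data.frame(
  Age    = c(16, 18, 19, 21, 24),
  agecut = factor(c("(0,18]", "(0,18]", "(18,24]", "(18,24]", "(18,24]"))
)

# split() groups Age by bracket, sapply() applies the function to each group,
# and t() puts one bracket per row.
res <- t(sapply(split(toy$Age, toy$agecut), siterange_named))
print(res)
```

This uses split() and sapply() rather than summaryBy, so it only illustrates the idea of labeling each statistic; it is not a claim about how doBy names its output columns.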
Counting records by click and impression status
> data1$hasimps <- cut(data1$Impressions, c(-Inf,0,Inf))
> data1$scode[data1$Impressions==0] <- 'NoImps'
> data1$scode[data1$Impressions>0] <- 'Imps'
> data1$scode[data1$Clicks > 0] <-'Clicks'
> # data1$scode <- factor(data1$scode)
> head(data1)
Age Gender Impressions Clicks Signed_In agecut hasimps scode
1 36 0 3 0 1 (34,44] (0, Inf] Imps
2 73 1 3 0 1 (64, Inf] (0, Inf] Imps
3 30 0 3 0 1 (24,34] (0, Inf] Imps
4 49 1 3 0 1 (44,54] (0, Inf] Imps
5 47 1 11 0 1 (44,54] (0, Inf] Imps
6 47 0 11 1 1 (44,54] (0, Inf] Clicks
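The order of the three scode assignments matters: rows with at least one click are first labeled 'Imps' and then overwritten with 'Clicks'. A minimal sketch of the same labeling in a single expression, assuming nested ifelse() is acceptable, looks like this. The toy data frame is made up.

```r
# Made-up rows covering the three cases: no impressions, impressions only, clicks.
toy <- data.frame(
  Impressions = c(0, 3, 11, 5),
  Clicks      = c(0, 0, 1, 2)
)

# Check Clicks first so it takes priority, mirroring the fact that the
# post's last assignment overwrites the earlier 'Imps' label.
toy$scode <- ifelse(toy$Clicks > 0, "Clicks",
             ifelse(toy$Impressions > 0, "Imps", "NoImps"))
print(toy)
```

Writing the priority explicitly makes it harder to reorder the conditions by accident, which with the three separate assignments would silently change the labels.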
> clean <- function(x){c(length(x))}
> etable <- summaryBy(Impressions ~ scode+Gender+agecut,data=data1, FUN=clean)
> etable
scode Gender agecut Impressions.clean
1 Clicks 0 (-Inf,0] 17776
2 Clicks 0 (0,18] 846
3 Clicks 0 (18,24] 779
4 Clicks 0 (24,34] 1361
5 Clicks 0 (34,44] 1675
6 Clicks 0 (44,54] 1494
7 Clicks 0 (54,64] 2006
8 Clicks 0 (64, Inf] 2598
9 Clicks 1 (0,18] 1525
10 Clicks 1 (18,24] 890
11 Clicks 1 (24,34] 1509
12 Clicks 1 (34,44] 1917
13 Clicks 1 (44,54] 1645
14 Clicks 1 (54,64] 2331
15 Clicks 1 (64, Inf] 1486
16 Imps 0 (-Inf,0] 118401
17 Imps 0 (0,18] 6001
18 Imps 0 (18,24] 15538
19 Imps 0 (24,34] 25690
20 Imps 0 (34,44] 31290
21 Imps 0 (44,54] 28563
22 Imps 0 (54,64] 18626
23 Imps 0 (64, Inf] 15585
24 Imps 1 (0,18] 10754
25 Imps 1 (18,24] 17807
26 Imps 1 (24,34] 29241
27 Imps 1 (34,44] 35512
28 Imps 1 (44,54] 32143
29 Imps 1 (54,64] 21499
30 Imps 1 (64, Inf] 8887
31 NoImps 0 (-Inf,0] 929
32 NoImps 0 (0,18] 43
33 NoImps 0 (18,24] 124
34 NoImps 0 (24,34] 165
35 NoImps 0 (34,44] 219
36 NoImps 0 (44,54] 224
37 NoImps 0 (54,64] 118
38 NoImps 0 (64, Inf] 125
39 NoImps 1 (0,18] 83
40 NoImps 1 (18,24] 132
41 NoImps 1 (24,34] 208
42 NoImps 1 (34,44] 247
43 NoImps 1 (44,54] 219
44 NoImps 1 (54,64] 158
45 NoImps 1 (64, Inf] 72
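Since clean only returns length(x), etable is a count of rows per combination of scode, Gender, and agecut. A rough base R equivalent using table() is sketched below on made-up data, with agecut left out to keep the example short.

```r
# Made-up rows; only the column names follow the post.
toy <- data.frame(
  scode  = c("Imps", "Imps", "Clicks", "NoImps", "Imps"),
  Gender = c(0, 1, 1, 0, 0)
)

# table() counts rows per combination; as.data.frame() returns it in long
# form with the count in a column named Freq.
counts <- as.data.frame(table(scode = toy$scode, Gender = toy$Gender))

# Drop combinations that never occur, as in the etable output above, which
# has no rows for Gender 1 in the (-Inf,0] bracket.
counts <- counts[counts$Freq > 0, ]
print(counts)
```

Note that as.data.frame() on a table turns the grouping columns into factors, so Gender comes back as the labels "0" and "1" rather than as integers.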