data science certification course


Tejaalex2182

Uploaded on Aug 28, 2018

Category Education

ExcelR is considered to be among the best Data Science training institutes in Noida, offering a gamut of services from training to placement as part of the program. Faculty is our forte: all our trainers work as Data Scientists with over 15 years of professional experience. They are qualified, certified, experienced, and passionate about teaching. The majority of the trainers are alumni of premier institutes such as IIT, IIM, and the Indian School of Business (ISB), and a few are Ph.D. qualified. Participants who register for classroom training can also attend instructor-led online training and get access to self-paced e-learning videos. This blended model of training ensures continuous learning, so that participants can absorb and assimilate the concepts thoroughly. ExcelR is the official training delivery partner for more than 30 universities and colleges across the globe, which attests to the quality of our courses and faculty. ExcelR holds one of the highest placement records in the Data Science space, owing to its tie-ups with organizations that recruit the participants trained through us.


Dimension Reduction Using Principal Components Analysis (PCA)

Applications of dimension reduction
• Computational advantage for other algorithms
• Face recognition: image data (pixels) projected along new axes works better for recognizing faces
• Image compression

Example: data for 25 undergraduate programs at business schools in US universities in 1995 (Source: US News & World Report, Sept 18, 1995). Use PCA to:
1) Reduce the number of columns
Additional benefits:
2) Identify relationships between the columns
3) Visualize the universities in 2D

Univ          SAT   Top10  Accept  SFRatio  Expenses  GradRate
Brown         1310   89     22      13      22,704     94
CalTech       1415  100     25       6      63,575     81
CMU           1260   62     59       9      25,026     72
Columbia      1310   76     24      12      31,510     88
Cornell       1280   83     33      13      21,864     90
Dartmouth     1340   89     23      10      32,162     95
Duke          1315   90     30      12      31,585     95
Georgetown    1255   74     24      12      20,126     92
Harvard       1400   91     14      11      39,525     97
JohnsHopkins  1305   75     44       7      58,691     87
MIT           1380   94     30      10      34,870     91
Northwestern  1260   85     39      11      28,052     89
NotreDame     1255   81     42      13      15,122     94
PennState     1081   38     54      18      10,185     80
Princeton     1375   91     14       8      30,220     95
Purdue        1005   28     90      19       9,066     69
Stanford      1360   90     20      12      36,450     93
TexasA&M      1075   49     67      25       8,704     67
UCBerkeley    1240   95     40      17      15,140     78
UChicago      1290   75     50      13      38,380     87
UMichigan     1180   65     68      16      15,470     85
UPenn         1285   80     36      11      27,553     90
UVA           1225   77     44      14      13,349     92
UWisconsin    1085   40     69      15      11,857     71
Yale          1375   95     19      11      43,514     96

Input → Output: the input is the table above with its six measurement columns; the output appends six new columns, PC1 through PC6, holding each university's score on each principal component. The hope is that a few of the new columns capture most of the information in the original dataset.

PCA: The Primitive Idea (Intuition First)
How do we compress the data while losing the least amount of information?
Input:
• p measurements / original columns
• Correlated
Output:
• p principal components (= p weighted averages of the original measurements)
• Uncorrelated
• Ordered by variance
• Keep the top principal components; drop the rest

PCA Mechanism
The i-th principal component is a weighted average of the original measurements/columns:

$PC_i = a_{i1} X_1 + a_{i2} X_2 + \dots + a_{ip} X_p$

The weights $a_{ij}$ are chosen such that:
1. PCs are ordered by their variance (PC1 has the largest variance, followed by PC2, PC3, and so on)
2. Pairs of PCs have correlation = 0
3. For each PC, the sum of squared weights = 1

Demystifying weight computation
• Main idea: high variance = lots of information
• Goal: find weights $a_{ij}$ that maximize $Var(PC_i)$ while keeping $PC_i$ uncorrelated with the other PCs.
• The covariance matrix of the X's is needed:

$Var(PC_i) = a_{i1}^2 Var(X_1) + a_{i2}^2 Var(X_2) + \dots + a_{ip}^2 Var(X_p) + 2 a_{i1} a_{i2} Cov(X_1, X_2) + \dots + 2 a_{i,p-1} a_{ip} Cov(X_{p-1}, X_p)$

• Also want $Cov(PC_i, PC_j) = 0$ when $i \neq j$.
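To make the mechanism concrete, here is a minimal R sketch (in the spirit of the optional R code later in the deck) that fits PCA on the standardized university data and checks the three properties of the weights listed above. The data frame name univ is an assumption; it stands for the six numeric columns of the table, SAT through GradRate.

OPTIONAL R code (illustrative sketch):

## Assumes 'univ' (hypothetical name) holds the six numeric columns
## SAT, Top10, Accept, SFRatio, Expenses, GradRate from the table above.
Z   <- scale(univ)                 ## standardize each column
pca <- princomp(Z)                 ## fit PCA on the standardized data
w   <- pca$loadings[, 1]           ## weights a_1j for PC1
sum(w^2)                           ## property 3: sum of squared weights = 1
pc1 <- Z %*% w                     ## PC1 = weighted average of the X's
var(pc1)                           ## property 1: largest variance of all PCs
pc2 <- Z %*% pca$loadings[, 2]
cor(pc1, pc2)                      ## property 2: PCs are uncorrelated (~0)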
Standardize the inputs
Why? Variables with large variances will have a bigger influence on the result.
Solution: standardize before applying PCA.
Excel: =STANDARDIZE(cell, AVERAGE(column), STDEV(column))

Univ          Z_SAT    Z_Top10  Z_Accept  Z_SFRatio  Z_Expenses  Z_GradRate
Brown          0.4020   0.6442  -0.8719    0.0688    -0.3247      0.8037
CalTech        1.3710   1.2103  -0.7198   -1.6522     2.5087     -0.6315
CMU           -0.0594  -0.7451   1.0037   -0.9146    -0.1637     -1.6251
Columbia       0.4020  -0.0247  -0.7705   -0.1770     0.2858      0.1413
Cornell        0.1251   0.3355  -0.3143    0.0688    -0.3829      0.3621
Dartmouth      0.6788   0.6442  -0.8212   -0.6687     0.3310      0.9141
Duke           0.4481   0.6957  -0.4664   -0.1770     0.2910      0.9141
Georgetown    -0.1056  -0.1276  -0.7705   -0.1770    -0.5034      0.5829
Harvard        1.2326   0.7471  -1.2774   -0.4229     0.8414      1.1349
JohnsHopkins   0.3559  -0.0762   0.2433   -1.4063     2.1701      0.0309
MIT            1.0480   0.9015  -0.4664   -0.6687     0.5187      0.4725
Northwestern  -0.0594   0.4384  -0.0101   -0.4229     0.0460      0.2517
NotreDame     -0.1056   0.2326   0.1419    0.0688    -0.8503      0.8037
PennState     -1.7113  -1.9800   0.7502    1.2981    -1.1926     -0.7419
Princeton      1.0018   0.7471  -1.2774   -1.1605     0.1963      0.9141
Purdue        -2.4127  -2.4946   2.5751    1.5440    -1.2702     -1.9563
Stanford       0.8634   0.6957  -0.9733   -0.1770     0.6282      0.6933
TexasA&M      -1.7667  -1.4140   1.4092    3.0192    -1.2953     -2.1771
UCBerkeley    -0.2440   0.9530   0.0406    1.0523    -0.8491     -0.9627
UChicago       0.2174  -0.0762   0.5475    0.0688     0.7620      0.0309
UMichigan     -0.7977  -0.5907   1.4599    0.8064    -0.8262     -0.1899
UPenn          0.1713   0.1811  -0.1622   -0.4229     0.0114      0.3621
UVA           -0.3824   0.0268   0.2433    0.3147    -0.9732      0.5829
UWisconsin    -1.6744  -1.8771   1.5106    0.5606    -1.0767     -1.7355
Yale           1.0018   0.9530  -1.0240   -0.4229     1.1179      1.0245

Standardization shortcut for PCA
• Rather than standardizing the data manually, you can use the correlation matrix instead of the covariance matrix as input.
• PCA with and without standardization gives different results!

PCA Transform > Principal Components (the correlation matrix has been used here)
• PCs are uncorrelated
• Var(PC1) > Var(PC2) > …
• $PC_i = a_{i1} X_1 + a_{i2} X_2 + \dots + a_{ip} X_p$
[Output tables: Scaled Data, PC Scores, Principal Components]

Computing principal scores
• For each record, we can compute its score on each PC.
• Multiply each weight $a_{ij}$ by the corresponding $X_{ij}$ and sum.
• Example for Brown University (using the standardized numbers):
PC1 score for Brown = (–0.458)(0.40) + (–0.427)(0.64) + (0.424)(–0.87) + (0.391)(0.07) + (–0.363)(–0.32) + (–0.379)(0.80) = –0.989

R Code for PCA (Assignment)
OPTIONAL R code:

install.packages("gdata")   ## for reading xls files
install.packages("xlsx")    ## for reading xlsx files
library(xlsx)               ## load before calling read.xlsx
mydata <- read.xlsx("University Ranking.xlsx", 1)  ## use read.csv for csv files
mydata                      ## make sure the data is loaded correctly
help(princomp)              ## to understand the API for princomp
pcaObj <- princomp(mydata[1:25, 2:7], cor = TRUE, scores = TRUE, covmat = NULL)
## the first column in mydata has the university names
## princomp(mydata, cor = TRUE) is not the same as prcomp(mydata, scale = TRUE); similar, but different
summary(pcaObj)
loadings(pcaObj)
plot(pcaObj)
biplot(pcaObj)
pcaObj$loadings
pcaObj$scores

Goal #1: Reduce data dimension
• PCs are ordered by their variance (= information)
• Choose the top few PCs and drop the rest!
Example:
• PC1 captures ??% of the information.
• The first 2 PCs capture ??%.
• Data reduction: use only two variables instead of 6.
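Continuing in the same sketch style, the following illustrative R lines show how Goal #1 plays out in practice: inspect the variance explained by each component, then keep only the first two score columns. Keeping exactly two components is an assumption for this example; in the assignment you would read the cutoff off summary(pcaObj).

OPTIONAL R code (illustrative sketch):

summary(pcaObj)                     ## proportion and cumulative variance per PC
screeplot(pcaObj, type = "lines")   ## visual aid for choosing how many PCs to keep
reduced <- pcaObj$scores[, 1:2]     ## keep PC1 and PC2: two columns instead of six
head(reduced)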
Matrix Transpose
OPTIONAL R code:

help(matrix)
A <- matrix(c(1, 2), nrow = 1, ncol = 2, byrow = TRUE)
A
t(A)
B <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
B
t(B)
C <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)
C
t(C)

Matrix Multiplication
OPTIONAL R code:

A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)
A
B <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 2, ncol = 4, byrow = TRUE)
B
C <- A %*% B
D <- t(B) %*% t(A)  ## note: B %*% A is not possible; how does D look?

Matrix Inverse
If $A \times B = I$ (the identity matrix), then $B = A^{-1}$. The identity matrix is

$I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$

OPTIONAL R code:

## How to create an n×n identity matrix?
help(diag)
A <- diag(5)
## find the inverse of a matrix
solve(A)

Data Compression
$[\text{PC Scores}]_{N \times p} = [\text{Scaled Data}]_{N \times p} \times [\text{Principal Components}]_{p \times p}$
$[\text{Scaled Data}]_{N \times p} = [\text{PC Scores}]_{N \times p} \times [\text{Principal Components}]^{-1}_{p \times p} = [\text{PC Scores}]_{N \times p} \times [\text{Principal Components}]^{T}_{p \times p}$
(The principal-components matrix is orthogonal, so its inverse equals its transpose.)
Approximation:
$[\text{Approximated Scaled Data}]_{N \times p} = [\text{PC Scores}]_{N \times c} \times [\text{Principal Components}]^{T}_{c \times p}$
where c = number of components kept, c ≤ p. (An R sketch of this reconstruction follows the questions at the end of the deck.)

Goal #2: Learn relationships with PCA by interpreting the weights
• $a_{i1}, \dots, a_{ip}$ are the coefficients for $PC_i$.
• They describe the role of the original X variables in computing $PC_i$.
• Useful in providing a context-specific interpretation of each PC.
PC1 scores (choose one or more):
1. are approximately a simple average of the 6 variables
2. measure the degree of high Accept & SFRatio, but low Expenses, GradRate, SAT, and Top10

Goal #3: Use PCA for visualization
• The first 2 (or 3) PCs provide a way to project the data from a p-dimensional space onto a 2D (or 3D) space.
[Figure: scatter plot of PC2 scores vs. PC1 scores; a plotting sketch also follows at the end of the deck.]

Monitoring batch processes using PCA
• Multivariate data at different time points
• A historical database of successful batches is used
• Multivariate trajectory data is projected to a low-dimensional space, yielding simple monitoring charts to spot outliers

Your Turn!
1. If we use a subset of the principal components, is this useful for prediction? For explanation?
2. What are the advantages and weaknesses of PCA compared to choosing a subset of the variables?
3. How does PCA compare to clustering?
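To connect the Data Compression slide to code, here is a minimal illustrative R sketch that reconstructs the scaled university data from only the first k components and measures the information lost. It assumes pcaObj was fit as in the assignment code above, and k = 2 is an assumed, illustrative cutoff (k plays the role of c in the slide notation).

OPTIONAL R code (illustrative sketch):

k      <- 2                               ## number of components kept, k <= p
V      <- unclass(pcaObj$loadings)        ## p x p principal-components matrix
scores <- pcaObj$scores                   ## N x p matrix of PC scores
approx <- scores[, 1:k] %*% t(V[, 1:k])   ## N x p approximation of the scaled data
exact  <- scores %*% t(V)                 ## all p components reconstruct the scaled data exactly
mean((exact - approx)^2)                  ## reconstruction error from dropping p - k PCs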
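And for Goal #3, an illustrative sketch of the PC2 vs. PC1 scatter plot, labeling each point with the university name taken from column 1 of mydata (as in the assignment code):

OPTIONAL R code (illustrative sketch):

plot(pcaObj$scores[, 1], pcaObj$scores[, 2],
     xlab = "PC1 score", ylab = "PC2 score",
     main = "Universities projected onto the first two PCs")
text(pcaObj$scores[, 1], pcaObj$scores[, 2],
     labels = mydata[1:25, 1], pos = 3, cex = 0.7)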