Data Science Training In Hyderabad


Uploaded on Sep 25, 2018
Category Education
ExcelR offers Data Science course in Hyderabad, the most comprehensive Data Science course in the market, covering the complete Data Science lifecycle concepts from Data Collection
Data Mining



© 2013 - 2016 ExcelR Solutions. All Rights Reserved

Data Science using R, Minitab & XLMiner

R, Minitab XLMiner for Forecasting


My Introduction

Name: Bharani Kumar

Education: IIT Hyderabad, Indian School of Business

Professional certifications:
PMP: Project Management Professional
PMI-ACP: Agile Certified Practitioner
PMI-RMP: Risk Management Professional
CSM: Certified Scrum Master
LSSGB: Lean Six Sigma Green Belt
LSSBB: Lean Six Sigma Black Belt
SSMBB: Six Sigma Master Black Belt
ITIL: Information Technology Infrastructure Library
Agile PM: Dynamic System Development Methodology Atern

My Introduction

HSBC: Driven using UK policies
ITC Infotech: Driven using Indian policies (SME)
Infosys: Driven using Indian policies under large enterprises
Deloitte: Driven using US policies

RESEARCH in ANALYTICS, DEEP LEARNING & IOT

DATA SCIENTIST

Tuckman Model



AGENDA

Data Visualization using Tableau

Data Mining – Supervised & Unsupervised (Machine Learning)

Text Mining & NLP

What does it take to be a DATA SCIENTIST?

Successful Data Scientist

All Agenda Topics

Domain Knowledge

Practice

Statistical Analysis

Data Mining

Forecasting

Data Visualization

Welcome to the Information Age …

… drowning in data and starving for Knowledge 



500 million tweets every day, 1.3 billion accounts

YouTube users upload 100 hours of video every minute

306 items are purchased every second
26.6 Million transactions per day

100 terabytes of data uploaded daily
http://www.dnaindia.com/scitech/report-facebook-saw-one-billion-simultaneous-users-on-aug-24-2119428

Processing 100 petabytes a day (1 petabyte = 1000 terabytes)

More than 1 million customer transactions every hour

BIG DATA!

https://www.techinasia.com/alibaba-crushes-records-brings-143-billion-singles-day



Why Tableau?



Agenda – Basic Statistics

1. Data Types – Continuous, Discrete, Nominal, Ordinal, Interval, Ratio, Random Variable, Probability, Probability Distribution
2. First, second, third & fourth moment business decisions
3. Graphical representation – Barplot, Histogram, Boxplot, Scatter diagram
4. Simple Linear Regression
5. Hypothesis Testing

Data Types – Continuous & Discrete


Data Types – Preliminaries

Random Variable

Probability

Probability Distribution

Probability Applications

Sampling Funnel

Population → Sampling Frame → SRS → Sample

Measures of Central Tendency

Central Tendency | Population | Sample
Mean / Average | $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ | $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median | Middle value of the data |
Mode | Most occurring value in the data |

“Every American should have above average income, and my Administration is going to see they get it.” – American President

Measures of Dispersion

Dispersion | Population | Sample
Variance | $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ | $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation | $\sigma = \sqrt{\sigma^2}$ | $s = \sqrt{s^2}$
Range | Max – Min |


Expected Value

• For a probability distribution, the mean of the distribution is known as the expected value
• The expected value intuitively refers to what one would find if they repeated the experiment an infinite number of times and took the average of all of the outcomes
• Mathematically, it is calculated as the weighted average of each possible value

The formula for calculating the expected value of a discrete random variable X, denoted by μ, is:

$\mu = E[X] = \sum_i x_i \, p(x_i)$

The variance of a discrete random variable X, denoted by σ², is:

$\sigma^2 = \sum_i (x_i - \mu)^2 \, p(x_i)$
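The weighted-average idea can be sketched in a few lines of Python. The fair six-sided die used here is a hypothetical example, not data from the course, and the function names are illustrative only.

```python
from fractions import Fraction

def expected_value(pmf):
    """Weighted average of the values, the weights being their probabilities."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Probability-weighted average squared distance from the mean."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Hypothetical example: one roll of a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(die))  # 7/2
print(variance(die))        # 35/12
```

Exact fractions are used so the result is not blurred by floating-point rounding.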


Graphical Techniques – Bar Chart

Graphical Techniques – Histogram

A histogram represents the frequency distribution, i.e., how many observations take a value within a certain interval.

Skewness & Kurtosis

Third and Fourth moments

Skewness
• A measure of asymmetry in the distribution
• Mathematically it is given by $E\left[\left(\frac{x-\mu}{\sigma}\right)^3\right]$
• Negative skewness implies the mass of the distribution is concentrated on the right

Kurtosis
• A measure of the “peakedness” of the distribution
• Mathematically it is given by $E\left[\left(\frac{x-\mu}{\sigma}\right)^4\right] - 3$
• For symmetric distributions, negative kurtosis implies a wider peak and thinner tails
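A minimal sketch of the two moment formulas, assuming the population form (dividing by the number of observations, with no small-sample correction). The data sets are made up for illustration.

```python
from statistics import fmean, pstdev

def skewness(data):
    """Third standardized moment: average of ((x - mu) / sigma) ** 3."""
    mu, sigma = fmean(data), pstdev(data)
    return fmean([((x - mu) / sigma) ** 3 for x in data])

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (so a normal curve scores 0)."""
    mu, sigma = fmean(data), pstdev(data)
    return fmean([((x - mu) / sigma) ** 4 for x in data]) - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical, symmetric about 3
left_tailed = [1, 8, 9, 9, 10, 10]   # hypothetical, long tail on the left

print(skewness(symmetric))        # approximately 0
print(skewness(left_tailed))      # negative: mass sits on the right
print(excess_kurtosis(symmetric)) # negative: flatter than a normal curve
```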


Graphical Techniques – Box Plot

Box Plot: This graph shows the distribution of data by dividing the data into four groups with the same number of data points in each group. The box contains the middle 50% of the data points and each of the two whiskers contains 25% of the data points. It displays two common measures of the variability or spread in a data set:

Range: It is represented on a box plot by the distance between the smallest value and the largest value, including any outliers. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers.

Inter-quartile Range (IQR): The middle half of a data set falls within the inter-quartile range.


Normal Distribution

• The probability associated with any single value of a continuous random variable is always zero
• Area under the entire curve is always equal to 1
• The normal random variable takes values from -∞ to +∞


Normal Distribution

Characterized by a bell-shaped curve

Has the following properties:
• 68.27% of values lie within ±1σ from the mean
• 95.45% of values lie within ±2σ from the mean
• 99.73% of values lie within ±3σ from the mean
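These percentages can be checked numerically. The sketch below relies on the identity P(|Z| ≤ k) = erf(k/√2) for a standard normal Z; the function name is illustrative.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k}σ: {prob_within(k):.2%}")
# within ±1σ: 68.27%
# within ±2σ: 95.45%
# within ±3σ: 99.73%
```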


Normal Distribution

Characterized by mean, µ, and standard deviation, σ: X ~ N(µ, σ)


Z scores, Standard Normal Distribution

• For every value (x) of the random variable X, we can calculate a Z score:

$Z = \frac{X - \mu}{\sigma}$

• Interpretation: how many standard deviations away is the value from the mean?


Calculating Probability from Z distribution

Suppose GMAT scores can be reasonably modelled using a normal distribution with µ = 711 and σ = 29

What is P(X ≤ 680)?

Step 1: Calculate the Z score corresponding to 680
- Z = (680 − 711)/29 ≈ −1.07

Step 2: Calculate the probability using Z-tables
- P(Z ≤ −1.07) ≈ 0.14


Calculating Probability from Z distribution
• What is P(697 ≤ X ≤ 740)?

• Step 1: Use P(x1 ≤ X ≤ x2) = P(X ≤ x2) − P(X ≤ x1)

• Step 2: Calculate P(X ≤ x2) and P(X ≤ x1) as before
P(X ≤ 740) = P(Z ≤ 1) = 0.84; P(X ≤ 697) = P(Z ≤ −0.48) ≈ 0.31

• Step 3: Calculate P(697 ≤ X ≤ 740) = 0.84 − 0.31 = 0.53
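Instead of Z-tables, the same probabilities can be obtained from Python's standard library. This is a sketch using the GMAT parameters above (µ = 711, σ = 29); the variable names are illustrative.

```python
from statistics import NormalDist

gmat = NormalDist(mu=711, sigma=29)

# P(X <= 680)
p_below_680 = gmat.cdf(680)

# P(697 <= X <= 740) = P(X <= 740) - P(X <= 697)
p_between = gmat.cdf(740) - gmat.cdf(697)

print(round(p_below_680, 2))  # 0.14
print(round(p_between, 2))    # 0.53
```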


Normal Quantile (Q-Q) Plot

[Figure: normal Q-Q plot; y-axis: Sample Quantiles, x-axis: Theoretical Quantiles]


Sampling variation

• Sample mean can be (and most likely is) different from the population mean
• Sample mean varies from one sample to another
• Sample mean is a random variable


Central Limit Theorem

The distribution of the sample mean
- will be normal when the distribution of data in the population is normal
- will be approximately normal even if the distribution of data in the population is not normal, if the sample size is fairly large

Mean($\bar{X}$) = µ (the same as the population mean of the raw data)

Standard Deviation($\bar{X}$) = $\frac{\sigma}{\sqrt{n}}$, where σ is the population standard deviation and n is the sample size
- This is referred to as the standard error of the mean

The standard error of the mean estimates the variability between samples, whereas the standard deviation measures the variability within a single sample
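A small simulation can illustrate the standard error formula. This is a sketch under assumed settings: the exponential population, the sample size of 50 and the 2000 repetitions are all hypothetical choices, not values from the course.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

n = 50              # size of each sample (hypothetical)
num_samples = 2000  # how many samples to draw (hypothetical)

# Exponential population with rate 1: mean 1, standard deviation 1,
# and clearly not normal (it is strongly right skewed).
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

observed_se = stdev(sample_means)   # spread of the sample means
theoretical_se = 1.0 / sqrt(n)      # sigma / sqrt(n)

print(round(fmean(sample_means), 3))
print(round(observed_se, 3), round(theoretical_se, 3))
```

The mean of the sample means should land near the population mean of 1, and the two standard error figures should be close to each other.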


Sample Size Calculation

A sample size of 30 is considered large enough, but that may or may not be adequate

More precise conditions
- $n > 10\,(K_3)^2$, where $K_3$ is the sample skewness, and
- $n > 10\,(K_4)$, where $K_4$ is the sample kurtosis



Confidence Interval

• What is the Probability of tomorrow’s temperature being 42 degrees ?

Probability is ‘0’

• Can it be between [-50⁰C     &   100⁰C] ?



Case Study: Confidence Interval

• A University with 100,000 alumni is thinking of offering a 
new affinity credit card to its alumni.

• Profitability of the card depends on the average balance 
maintained by the card holders.

• A Market research campaign is launched, in which about 
140 alumni accept the card in a pilot launch.

• Average balance maintained by these is $1990 and the
standard deviation is $2833. Assume that the population
standard deviation is $2500 from previous launches.

• What can we say about the average balance that will be held after a full-fledged market launch?

Interval estimates of parameters

• Based on sample data

− The point estimate for mean balance = $1990

− Can we trust this estimate ?

• What do you think will happen if we took another random sample of 140 alumni ?

• Because of this uncertainty, we prefer to provide the estimate as an interval (range) 
and associate a level of confidence with it

Interval Estimate = Point Estimate ± Margin of Error

Confidence Interval for the Population Mean

Start by choosing a confidence level (1−α)% (e.g. 95%, 99%, 90%)

Then, the population mean will be within

$\bar{X} \pm Z_{1-\alpha}\,\frac{\sigma}{\sqrt{n}}$, where $Z_{1-\alpha}$ satisfies $P(-Z_{1-\alpha} \le Z \le Z_{1-\alpha}) = 1-\alpha$

Margin of error depends on the underlying uncertainty, confidence level and sample size

Interval Estimate = Point Estimate ± Margin of Error

Calculate Z value  - 90%, 95% & 99%
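The Z values for these confidence levels can be computed rather than read from a table. A minimal sketch, assuming a two-sided interval so that α/2 sits in each tail; the function name is illustrative.

```python
from statistics import NormalDist

def z_value(confidence):
    """Z such that P(-Z <= standard normal <= Z) equals the confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_value(level), 3))
# 0.9 1.645
# 0.95 1.96
# 0.99 2.576
```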


Confidence Interval Calculation

• Based on the survey and past data

• Construct a 95% confidence interval for the mean card balance and interpret it

• Construct a 90% confidence interval for the mean card balance and interpret it

− n = 140; σ = $2500; X̄ = $1990
− Standard error: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{2500}{\sqrt{140}} = 211.29$
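The two intervals asked for above can be worked out as follows. This is a sketch using the case-study figures (n = 140, σ = 2500, sample mean 1990); the helper name `z_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(mean, sigma, n, confidence):
    """Two-sided confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)   # margin of error
    return mean - margin, mean + margin

low95, high95 = z_interval(1990, 2500, 140, 0.95)
low90, high90 = z_interval(1990, 2500, 140, 0.90)

print(round(low95), round(high95))  # 1576 2404
print(round(low90), round(high90))  # 1642 2338
```

The 90% interval is narrower because a lower confidence level needs a smaller Z value.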


Confidence Interval Interpretation

Consider the 95% confidence interval for the mean balance: [$1576, $2404]

Does this mean that 
- The mean balance of the population lies in the range ? 

- The mean balance is in this range 95% of the time ? 

- 95% of the alumni have balance in this range ?

Interpretation 1 : Mean of the population has a 95% chance of being in this range for a random 
sample

Interpretation 2 : Mean of the population will be in this range for 95% of the random samples   


What if we don’t know Sigma?
• Suppose that the alumni of this university are very different and hence the population standard deviation from previous launches cannot be used

We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample:

Calculate $T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$

• If the underlying population is normally distributed, T is a random variable distributed according to a t-distribution with n−1 degrees of freedom, $T_{n-1}$

• Research has shown that the t-distribution is fairly robust to deviations of the population from the normal model


Student’s t-distribution

As n → ∞, $t_n \to N(0,1)$

i.e., as the degrees of freedom increase, the t-distribution approaches the standard normal distribution



Confidence Interval for mean with unknown Sigma


Calculating t-value

• Construct a 95% confidence interval for the mean card balance and interpret it

− n = 140; s = $2833; X̄ = $1990

− Standard error: $s_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{2833}{\sqrt{140}} = 239.43$

Calculate $t_{0.95,\,139} = 1.98$

Then the 95% confidence interval for balance is [$1516, $2464]


Hypothesis Testing

Start with Hypothesis about a Population Parameter → Collect Sample Information → Reject/Do Not Reject Hypothesis

Decision | Ho is TRUE | H1 is TRUE
Fail to Reject Ho | Right Decision: Confidence (1-α) | Type II error
Reject Ho | Type I error | Right Decision: Power (1-β)

The factors that affect the power of a test include sample size, effect size, population variability, and α. Power and α are related, as increasing α decreases β. Since power is calculated as 1 minus β, if you increase α, you also increase the power of a test. The maximum power a test can have is 1, whereas the minimum value is 0.
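The link between α and power can be seen in a small calculation. The sketch below assumes a one-sided (upper-tail) Z test with known σ; the effect size, σ and n are hypothetical numbers chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(alpha, effect, sigma, n):
    """Power of an upper-tail Z test when the true mean exceeds the
    hypothesised mean by `effect` (all inputs are assumptions)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)          # rejection cut-off
    return 1 - std_normal.cdf(z_crit - effect * sqrt(n) / sigma)

# Hypothetical setting: true shift of 1 unit, sigma = 4, n = 25.
for alpha in (0.01, 0.05, 0.10):
    print(alpha, round(power_one_sided_z(alpha, effect=1.0, sigma=4.0, n=25), 3))
```

Power rises as α rises, and with no true effect the power collapses to α itself.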



Our quality will not improve 
after the consulting project

We will acquire 8,000 new 
customers if I open a store in 
this area

We will need 400 more 
person hours to finish this 
project

The retail market will grow by 
50% in the next 5 years

Our potential customers do 
not spend more than 60 
minutes on the web every day

Less than 5% clients will default 
on their loans

Hypothesis Testing



1-Sample Z test

The lengths of 25 samples of a fabric are taken at random. Mean and standard deviation from the historic 2 years study are 150 and 4 respectively. Test if the current mean is greater than the historic mean. Assume α to be 0.05

Data: Fabric Data

1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Population Standard Deviation Known or Not
3. 1 Sample Z Test: Stat > Basic Statistics > 1 Sample Z

1-Sample Z test – Write Hypothesis


Y: Fabric Length is continuous
X: Discrete, 1 Population

We are comparing mean with external standard of 150mm

Data was shown to be normal

Population standard deviation is known = 4
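The same upper-tail Z test can be sketched outside Minitab. The fabric data file is not reproduced here, so the sample mean of 151.2 below is a made-up value; µ0 = 150, σ = 4 and n = 25 come from the problem statement.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_upper(sample_mean, mu0, sigma, n):
    """Z statistic and p-value for H0: mu = mu0 against Ha: mu > mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# sample_mean is hypothetical; the other inputs are from the slide.
z, p = one_sample_z_upper(sample_mean=151.2, mu0=150, sigma=4, n=25)

print(round(z, 2), round(p, 4))
print("reject Ho" if p < 0.05 else "fail to reject Ho")
```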


1-Sample t Test

The mean diameter of the bolt manufactured should be 10mm to be able to fit into the nut. 20 samples are taken at random from production line by a quality inspector. Conduct a test to check with 95% confidence that the mean is not different from the specification value.

Data: Bolt Diameter

1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Population Standard Deviation Known or Not
3. 1 Sample t Test: Stat > Basic Statistics > 1 Sample t

1-Sample t Test – Write Hypothesis


Y: Bolt Diameter is continuous
X: Discrete, 1 Population

We are comparing mean with external standard of 10mm

Data was given to be Normal

Population standard deviation is NOT known

1-Sample Sign Test

The scores of 20 students for the statistics exam are provided. Test if the current median is not equal to historic median of 82. Assume α to be 0.05

Data: Student Scores

1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. 1 Sample Sign Test: Stat > Non Parametric > 1 Sample sign
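The sign test only counts how many observations fall above and below the hypothesised median, so it can be sketched with the binomial distribution. The 20 scores below are invented for illustration because the course data file is not included here; dropping ties and doubling the smaller tail are common conventions assumed by this sketch.

```python
from math import comb

def sign_test(values, median0):
    """Two-sided sign test; returns (above, below, p_value)."""
    above = sum(1 for v in values if v > median0)
    below = sum(1 for v in values if v < median0)
    n = above + below               # observations equal to median0 are dropped
    k = min(above, below)
    # P(at most k successes in n fair coin flips), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return above, below, min(1.0, 2 * tail)

# Hypothetical scores, not the course data.
scores = [78, 85, 90, 88, 92, 81, 86, 95, 84, 89,
          91, 83, 87, 79, 93, 94, 80, 96, 97, 82]

above, below, p_value = sign_test(scores, median0=82)
print(above, below, round(p_value, 4))
```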



1-Sample Sign Test – Write Hypothesis


2-Sample t Test

A financial analyst at a financial institute wants to evaluate a recent credit card promotion. After this promotion, 450 cardholders were randomly selected. Half received an ad promoting a full waiver of interest rate on purchases made over the next three months, and half received a standard Christmas advertisement. Did the ad promoting the full interest rate waiver increase purchases?

Data: Marketing Strategy

1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Variance Test: Stat > Basic Statistics > 2 Variance
3. 2 Sample t Test: Stat > Basic Statistics > 2-Sample t


2-Sample t Test – Write Hypothesis

Paired T Test
• This test is used to compare the means of two sets of observations when all the other external conditions are the same

• This is a more powerful test because the variability due to differences between the people or objects sampled is factored out

Example: To find out if medication A lowers blood pressure

Trigger your thoughts!

Comparing the performance of machine A vs. 
machine B by feeding different raw materials to 
each machine

Compare the performance of machine A vs. 
machine B when the same raw material is fed to 
each machine

Compare the power output of a wind mill when you 
use motor A for 1 month and motor B for 1 month

Compare the power output of two wind mills next 
to each other simultaneously when you use 
motor A on one wind mill and motor B on another

Identifying resistor defects and capacitor defects 
in same PCB by collecting such data using 20 PCB 
units

Identifying resistor defects on 20 PCBs and capacitor defects on 20 (different) PCBs

2-Sample t test or Paired T test

Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10
vehicles are chosen randomly and mileage is recorded. In rest of the 10 vehicles,
additive to be tested is added with the fuel and their mileage is recorded. Find if the
mileage increases by adding the fuel additive.

Assume the same data was recorded when only 10 vehicles were chosen and mileage was recorded before and after adding the additive. What method will you choose to find the result?

2-Sample t test 

Paired T test 



Mann-Whitney test

Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In rest of the 10 vehicles, additive to be tested is added with the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive.

Data: Vehicle with & without Additives

1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Mann-Whitney test for Medians: Stat > Non Parametric > Mann Whitney

Mann-Whitney Test – Write Hypothesis


Paired T test

Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In rest of the 10 vehicles, additive to be tested is added with the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive. Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive.

Data: Vehicle with & without Additives

1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Paired T Test: Stat > Basic Statistics > Paired T

• Since the data was not normal, the cause of non-normality was investigated and it was found that the first data point for “with additive” was wrongly entered. This value should have been 20. Now, proceed with the rest of the analysis.

• If the data were truly non-normal our analysis would stop here.



Paired T test – Write Hypothesis


One-Way ANOVA

A marketing organization outsources its back-office operations to three different suppliers. The contracts are up for renewal and the CMO wants to determine whether they should renew contracts with all suppliers or any specific supplier. The CMO wants to renew the contract of the supplier with the least transaction time. The CMO will renew all contracts if the performance of all suppliers is similar

Data: Contract Renewal

1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Variance Test: Stat > ANOVA > Test for Equal Variances
3. ANOVA: Stat > ANOVA > One-Way…

Example: More weight reduction programs

• Suppose the nutrition expert would like to do a comparative evaluation of three diet programs (Atkins, South Beach, GM)

• She randomly assigns an equal number of participants to each of these programs from a common pool of volunteers

• Suppose the average weight losses in each of the groups (arms) of the experiment are 4.5kg, 7kg, 5.3kg

• What can she conclude?

Two kinds of variation matter

• Not every individual in each program will respond identically to the diet program

• Easier to identify variations across programs if variations within programs are smaller

• Hence the method is called Analysis of Variance(ANOVA)

• Within-group variation = Experimental Error

• Between-group variation

Formalizing the intuition behind variations

• It should be obvious that for every observation: $Tot_{ij} = t_i + e_{ij}$

• What is more surprising and useful is: $\sum_{i,j} Tot_{ij}^2 = \sum_{i,j} t_i^2 + \sum_{i,j} e_{ij}^2$, i.e. SST = SSTR + SSE

Statistical test for equality of means

• n subjects equally divided into r groups

• Hypothesis
- H0: μ1 = μ2 = μ3 = … = μr
- H1: Not all μi are equal

• Calculate
- Mean Square Treatment MSTR = SSTR / (r‐1)
- Mean Square Error MSE = SSE / (n‐r)
- The ratio of the two mean squares f = MSTR/MSE = Between-group variation / Within-group variation
- Strength of this evidence: p‐value = Pr(F(r‐1, n‐r) ≥ f)

• Reject the null hypothesis if p‐value < α
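The calculation steps above can be sketched as follows. The weight-loss figures are invented for illustration, and the sketch stops at the F statistic: turning it into a p-value needs the F distribution, which is not in Python's standard library, so a statistics package or F table is assumed for that last step.

```python
from statistics import fmean

def one_way_anova(groups):
    """Return (SSTR, SSE, F) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    n, r = len(all_values), len(groups)
    grand_mean = fmean(all_values)
    # Between-group variation: how far each group mean is from the grand mean.
    sstr = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: how far each observation is from its group mean.
    sse = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    mstr = sstr / (r - 1)
    mse = sse / (n - r)
    return sstr, sse, mstr / mse

# Hypothetical weight losses (kg) for three diet programs.
diets = [
    [4.0, 5.0, 4.5, 4.5],
    [7.5, 6.5, 7.0, 7.0],
    [5.0, 5.5, 5.3, 5.4],
]

sstr, sse, f = one_way_anova(diets)
print(round(sstr, 3), round(sse, 3), round(f, 2))
```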


Analysis of variance (ANOVA)

• ANOVA can be used to test equality of means when there are more than 2 populations

• ANOVA can be used with one or two factors

• If only one factor is varying, then we would use a one-way ANOVA

– Example: We are interested in comparing the mean performance of several departments within a company. Here the only factor is the name of department

– If there are two factors, we would use a two-way ANOVA. Example: One factor is department and the second factor is the shift (day vs. night)


Analysis of variance (ANOVA)

One Way ANOVA

Source of Variation | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | F Test Statistic
Between Treatments | SSFactor | k − 1 | MSFactor = SSFactor / DFFactor | F = MSFactor / MSError
Within Treatment | SSError | N − k | MSError = SSError / DFError |
Total | SSTotal | N − 1 | |

Two Way ANOVA

Source of Variation | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | F Test Statistic
Factor A | SSA | nA − 1 | MSA = SSA / (nA − 1) | FA = MSA / MSE
Factor B | SSB | nB − 1 | MSB = SSB / (nB − 1) | FB = MSB / MSE
Interaction A * B | SSAB | (nA − 1)(nB − 1) | MSAB = SSAB / ((nA − 1)(nB − 1)) | FAB = MSAB / MSE
Error | SSE | n − nA * nB | MSE = SSE / (n − nA * nB) |
Total | SST | n − 1 | |


Dichotomies

2 Sample t-test
• Is the transaction time dependent on whether person A or B processes the transaction?
• Is medicine 1 effective or medicine 2 at reducing heart stroke?
• Is the new branding program more effective in increasing profits?

ANOVA – One Way
• Does the productivity of employees vary depending on the three levels? (Beginner, Intermediate and Advanced)
• Three different sale closing methods were used. Which one is most effective?
• Four types of machines are used. Is the weight of the rugby ball dependent on the type of machine used?


Non-Parametric equivalent to ANOVA

• When the data are not normal, or the data points are too few to figure out if the data are normal, and we have more than 2 populations, we can use the Mood’s Median or Kruskal Wallis test to compare the populations

Ho: All the medians are the same

Ha: At least one of the medians is different

• Mood’s median assigns the data from each population that is higher than the overall median to one group, and all points that are equal or lower to another group. It then uses a Chi-Square test to check if the observed frequencies are close to expected frequencies

• Kruskal Wallis is another test that is a non-parametric equivalent of ANOVA. Kruskal Wallis is the extension of the Mann-Whitney test


Mood’s Median & Kruskal Wallis

Growth is measured for three treatments as shown in the case study. Compare the effect of the three treatments on growth.

Data: Height, Growth

1. Mood’s Median – handles outliers well: Stat > Nonparametric > Mood’s Median
2. Kruskal Wallis – more powerful than Mood’s Median: Stat > Nonparametric > Kruskal Wallis

1-Proportion Test

• A poll is carried out to find the acceptability of a new football coach by the people. It was decided that if the support rate for the coach for the entire population was truly less than 25%, the coach would be fired

• 2000 people participated and 482 people supported the new coach

• Conduct a test to check if the new coach should be fired with 95% level of confidence

Data: Football Coach

1-Proportion Test: Stat > Basic Statistics > 1-Proportion
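The poll figures above are enough to sketch the test by hand. This version assumes the normal approximation to the binomial and a lower-tail alternative (Ho: p ≥ 0.25, Ha: p < 0.25); a statistics package may use an exact binomial method instead and give a slightly different p-value.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_lower(successes, n, p0):
    """Sample proportion, Z statistic and lower-tail p-value for Ha: p < p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return p_hat, z, NormalDist().cdf(z)

p_hat, z, p_value = one_proportion_z_lower(482, 2000, 0.25)

print(p_hat, round(z, 2), round(p_value, 2))
print("fire the coach" if p_value < 0.05 else "not enough evidence to fire the coach")
```

With 24.1% support in the sample, the shortfall from 25% is well within sampling variation, so under these assumptions Ho is not rejected at the 5% level.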


2-Proportion Test

The sales manager of the Johnnie Talkers soft drinks division has been planning to launch a new sales incentive program for their sales executives. The sales executives felt that adults (>40 yrs) won’t buy, children will & hence requested the sales manager not to launch the program. Analyze the data & determine whether there is evidence at 5% significance level to support the hypothesis

Data: Johnnie Talkers

Ho: Proportion A = Proportion B
Ha: Proportion A NOT = Proportion B

Check p-value: if p-value < alpha, we reject Ho


Chi-Square Test

How can you determine whether the distribution of defects in your product or service has changed from the historic distribution over time, or exceeds an industry standard?

• Do you think mean is more significant or variance?

Comparing a population’s variance to a standard value involves calculating the chi-square test statistic

We can also:

• Determine whether one variable is dependent on another
• Compare observed & expected frequencies where variance is unknown. This is called a goodness-of-fit test
• Compare multiple proportions



Chi-Square Goodness-of-fit test

Goodness-of-fit test is to test assumptions about the distributions that fit the
process data

Are observed frequencies (O) same or different from historical, expected or
theoretical frequencies (E)?

If there’s a difference between them, this suggests that the distribution
model expressed by the expected frequencies does not fit the data



Chi-Square Test 

• A city has a newly opened nuclear plant, and there are families staying
dangerously close to the plant. A health safety officer wants to take this case
up to provide relocation for the families that live in the surrounding area. To
make a strong case, he wants to prove with numbers that an exposure to
radiation levels is leading to an increase in diseased population. He
formulates a contingency table of exposure and disease.

• Does the data suggest an association between the disease and exposure?

Exposure | Disease: Yes | Disease: No | Total
Yes | 37 | 13 | 50
No | 17 | 53 | 70
Total | 54 | 66 | 120

Chi-Square Test 

Calculate the number of individuals of exposed and unexposed groups 
expected in each disease category (yes and no) if the probabilities were 
the same

If there were no effect of exposure, the probabilities should be same and 
the chi-squared statistic would have a very low value. 

Proportion of population exposed = (50/120) ≈ 0.417
Proportion of population not exposed = (70/120) ≈ 0.583

Thus, expected values:
Population with disease = 54
Exposure Yes: 54 × (50/120) = 22.5
Exposure No: 54 × (70/120) = 31.5
Population without disease = 66
Exposure Yes: 66 × (50/120) = 27.5
Exposure No: 66 × (70/120) = 38.5

Chi-Square Test
• Calculate the Chi-squared statistic

$\chi^2 = \sum \frac{(O - E)^2}{E} = 29.1$

• Calculate the degrees of freedom:

(Number of rows – 1) × (Number of columns – 1)

df = (2 – 1) × (2 – 1) = 1

• Calculate the p-value from the Chi-squared table

For chi-squared value 29.1 and degrees of freedom = 1, from the table, p-value is < 0.001

• Interpretation: There is less than a 0.001 chance of obtaining such discrepancies between expected and observed values if there is no association

• Conclusion: There is an association between the exposure and disease
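The whole calculation for the exposure and disease table can be reproduced in a short sketch. The p-value line uses a shortcut that is valid only for 1 degree of freedom, P(χ² > x) = erfc(√(x/2)); larger tables would need a chi-squared table or a statistics package.

```python
from math import erfc, sqrt

# Observed counts: rows are exposure (yes, no), columns are disease (yes, no).
observed = [[37, 13],
            [17, 53]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count in each cell if exposure and disease were unrelated.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Valid for 1 degree of freedom only.
p_value = erfc(sqrt(chi2 / 2))

print(expected)        # [[22.5, 27.5], [31.5, 38.5]]
print(round(chi2, 1))  # 29.1
print(p_value < 0.001) # True
```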



Chi-Square Test 

Bahamantech Research Company uses 4 regional centers in South Asia
(India, China, Srilanka and Bangladesh) to input data of questionnaire
responses. They audit a certain % of the questionnaire responses versus
data entry. Any error in data entry renders it defective. The chief data
scientist wants to check whether the defective % varies by country.
Analyze the data at 5% significance level and help the manager draw
appropriate inferences. [‘1’ means not defectives & ‘0’ means defective]

Data: Bahaman Research

Ho: All proportions are equal
Ha: Not all proportions are equal

Check p-value: if p-value < alpha, we reject Ho

Non-Parametric Tests

• Referred to as “distribution free”, as they do not assume that the data
follow any particular distribution

• They have lower power than parametric tests when the parametric assumptions
hold, and hence are generally the second choice after parametric tests

• These tests typically focus on the median rather than the mean

• They involve straightforward procedures like counting and ordering



Thank You



Probability Distributions
Lognormal:

• Fits many kinds of failure data
• Used for reliability analysis, cycles-to-failure, loading variables & fatigue stress
• Tensile strength of fibers & breaking strength of concrete
• Environment data such as random quantities of pollutants in water or air
• Economic variables such as per capita income

• Taking logarithms compresses extreme values & brings the data closer to normal
• μ, σ are the mean & standard deviation of the natural logarithms of the data

Data    Log transformed (natural log)
12      2.48
28      3.33
87      4.47
143     4.96
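
The log-transformed column can be reproduced with R's natural logarithm. This short sketch also illustrates the link between lognormal and normal probabilities; the parameters `mu` and `sigma` are hypothetical values chosen for illustration.

```r
# The four data values from the slide.
x <- c(12, 28, 87, 143)
log_x <- log(x)          # log() is the natural logarithm in R
round(log_x, 2)

# If T is lognormal(meanlog = mu, sdlog = sigma), then log(T) is
# normal(mu, sigma), so the two probabilities below are equal.
mu    <- 4               # hypothetical mean of log(T)
sigma <- 0.5             # hypothetical sd of log(T)
p_lognormal <- plnorm(100, meanlog = mu, sdlog = sigma)
p_normal    <- pnorm(log(100), mean = mu, sd = sigma)
```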



Probability Distributions
Lognormal:

• This distribution is right skewed
• Skewness increases as value of σ increases
• Pdf starts at zero, increases to its mode, and then decreases
• If time-to-failure has a lognormal distribution, then the logarithm of time-to-failure has a normal
distribution



Probability Distributions
Exponential:

• Length of time between check-ins at a reception desk, calls at a call center, customers at a cashier
• Used when events occur continuously & independently at a constant average rate
• Used to model the time until a change occurs, given a constant rate of change
• How long equipment will keep working with proper maintenance & part replacement
• Used to model the behavior of independent events that occur at a constant rate
• The number of occurrences is described by a Poisson distribution, but the times between occurrences
are described by an Exponential distribution
• If the number of events per unit time is Poisson distributed with rate λ, then the time between
consecutive events is exponentially distributed with mean 1/λ
• # of arrivals at a checkout counter, # of product failures over time – Poisson
• Length of time between events, i.e., one arrival or failure & the next – Exponential distribution
• Exponential distribution can model the interval between random events

• λ = failure rate; θ = mean = 1/λ; x = random variable
• Used to model mean time between occurrences
• In an exponential population, 63% of observations are below the mean &
37% are above
• Uses constant failure rate
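
The share of an exponential population that falls below its mean can be checked numerically. In this sketch the rate `lambda` is a hypothetical value; the result does not depend on it, because P(X < θ) = 1 - exp(-1) for any rate.

```r
lambda <- 0.5            # hypothetical failure rate (events per hour)
theta  <- 1 / lambda     # mean time between events

below_mean <- pexp(theta, rate = lambda)   # P(X < mean) = 1 - exp(-1)
above_mean <- 1 - below_mean
round(c(below = below_mean, above = above_mean), 3)

# Simulation check: draw exponential gaps and compare with theory.
set.seed(1)
gaps <- rexp(10000, rate = lambda)
mean(gaps)               # close to theta
mean(gaps < theta)       # close to the theoretical share below the mean
```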



Probability Distributions
Weibull:

• Models the failure rate when the rate is not constant

• Models time to failure, time to repair & material strength

• Used when a system/item ages & its failure rate increases or decreases

• Can take the shape of different distributions because it has shape, scale & location parameters

• Can mimic Lognormal, Exponential & many other distributions

• Widely used in reliability & statistical applications

• Weibull & Lognormal are from the same family & both can be used to assess a dataset that
contains close-to-average values (not too high / low)

• However, Weibull is a better fit when the majority of the data falls on the higher side

• Lognormal is a better fit when the majority of the data falls on the lower side



Probability Distributions
Weibull:
• β is the shape parameter, also called the slope; it determines the shape of the distribution

◦ When beta = 1, shape of distribution = exponential distribution
◦ When beta is between 3 and 4, shape of distribution ≈ normal distribution
◦ Several beta values can approximate the lognormal distribution

• η (eta) is the scale parameter; it determines the spread or width of the distribution

• γ (gamma) is the location parameter: the point below which there are no failures. Changing its value
moves the distribution to the right or left
◦ Gamma > 0: there is a period when no failures occur
◦ Gamma < 0: failures have occurred before time equals zero,
e.g., defective raw materials or failure during transportation
◦ When Gamma = 0, eta is called the characteristic life

• Regardless of the specific value of beta, 63.2% of values fall below the characteristic life
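
The 63.2% property and the beta = 1 special case can both be verified with R's built-in Weibull functions. This is a small sketch with gamma = 0; the value of `eta` and the list of shape values are hypothetical choices for illustration.

```r
eta   <- 1000                  # hypothetical characteristic life (hours)
betas <- c(0.5, 1, 2, 3.5)     # a range of shape values

# P(T <= eta) = 1 - exp(-(eta / eta)^beta) = 1 - exp(-1) for every beta.
frac_below <- sapply(betas, function(b) pweibull(eta, shape = b, scale = eta))
round(frac_below, 3)

# With beta = 1 the Weibull reduces to the exponential with mean eta.
t <- c(100, 500, 2000)
same_as_exponential <- all.equal(pweibull(t, shape = 1, scale = eta),
                                 pexp(t, rate = 1 / eta))
same_as_exponential
```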



Probability Distributions
Bivariate Normal Distribution:

• Used when 2 variables are each normally distributed & may be totally independent or correlated to
some degree

• A joint distribution of two variables that simultaneously
& jointly cross-classifies the data

• A bivariate distribution in general can be discrete or continuous; the bivariate normal is continuous
• 3D plot like mountain terrain
• X & Y axes represent the two variables
• Z axis shows either

◦ frequency for discrete data
◦ probability density for continuous data

• The maximum or peak occurs when X1 = Mu1 & X2 = Mu2. You can take a
“slice” anywhere along the distribution by fixing one of the variables. This
is known as a conditional distribution



Probability Distributions
Bivariate Normal Distribution:

• Can help determine items of critical importance:
◦ Causality – examine the joint frequencies to investigate if the second variable changes in a
systematic way when the first variable changes
◦ Predictions – reviewing outcomes from one variable as the other changes
◦ Importance – if two variables are causally related they should have a statistically significant impact



Scatter Diagram
• Scatter diagrams (or plots) provide a graphical representation of the relationship between two
continuous variables

• Be careful: correlation does not guarantee causation. Correlation by itself does not
imply a cause-and-effect relationship!

• Judge the strength of the relationship by the width or tightness of the scatter

• Determine the direction of the relationship, e.g., if X increases and Y decreases, it is a negative
correlation; similarly, if X increases and Y increases, it is a positive correlation



Correlation Analysis
• Correlation Analysis measures the degree of linear relationship between two variables

• Range of correlation coefficient: -1 to +1
◦ Perfect positive relationship: +1
◦ Perfect negative relationship: -1
◦ No linear relationship: 0

• If the absolute value of the correlation coefficient is greater than 0.85, then we say there is a
good relationship
◦ Example: r = 0.87, r = -0.9, r = 0.9, r = -0.87 describe a good relationship
◦ Example: r = 0.5, r = -0.5, r = 0.28 describe a poor relationship

• Correlation values of -1 or 1 imply an exact linear relationship. However, the real value of
correlation is in quantifying less than perfect relationships

• We can perform regression analysis, which attempts to further describe this type of
relationship, if the correlation between the 2 variables is good
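
A short sketch of the calculation in R, using the first ten (Waist, AT) rows of the Waist Circumference – Adipose Tissue data set that appears later in this deck. The 0.85 cut-off is the rule of thumb from the slide; the variable names are illustrative.

```r
# First ten rows of the Waist / AT data set listed later in this deck.
waist <- c(74.75, 72.6, 81.8, 83.95, 74.65, 71.85, 80.9, 83.4, 63.5, 73.2)
at    <- c(25.72, 25.89, 42.6, 42.8, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22)

r <- cor(waist, at)      # Pearson correlation coefficient

# Same value from the definition: covariance / (sd_x * sd_y).
r_manual <- cov(waist, at) / (sd(waist) * sd(at))

# Rule of thumb from the slide: |r| > 0.85 counts as a good relationship.
is_good <- abs(r) > 0.85
c(r = r, r_manual = r_manual)
```

Because ten rows are only a subset, the value will differ from the correlation over all 109 observations.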



Linear Regression Model

The equation that describes how a dependent variable y is related to an independent variable x
and an error term is called a regression model

y = β0 + β1x + ε

Where β0 (the intercept) and β1 (the slope) are called the parameters of the model,

ε is a random variable called the error term.



Linear Regression Model

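[Figure: the straight line defined by the equation y = β0 + β1x, plotted with X on the horizontal axis
and Y on the vertical axis. Labels: β0 (y intercept), β1 (slope), the mean value of y when x equals x0,
an observed value of y when x equals x0, and the error term (the gap between the two).
x0 = a specific value of x, the independent variable.]

Fitting a straight line by least squares

Ŷ = b̂0 + b̂1X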
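
The least-squares estimates can be computed by hand and compared with `lm()`. This is a minimal sketch that uses the first ten (Waist, AT) rows of the data set listed later in this deck; the closed-form expressions are the standard simple-regression formulas, which the slides do not spell out.

```r
# First ten rows of the Waist / AT data set listed later in this deck.
x <- c(74.75, 72.6, 81.8, 83.95, 74.65, 71.85, 80.9, 83.4, 63.5, 73.2)   # Waist
y <- c(25.72, 25.89, 42.6, 42.8, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22) # AT

# Least squares: slope = Sxy / Sxx, intercept = ybar - slope * xbar.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() minimises the same sum of squared errors.
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = unname(coef(fit)))
```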



Regression Analysis

• R-squared, also known as the coefficient of determination, represents the % of variation in the
output (dependent variable) explained by the input variable(s), i.e., the percentage of response
variable variation that is explained by its relationship with one or more predictor variables

• The higher the R^2, the better the model fits your data

• R^2 is always between 0 and 100%

• R squared between 0.65 and 0.8 => moderate correlation

• R squared greater than 0.8 => strong correlation



Regression Analysis

• Prediction intervals and confidence intervals are two types of interval used for
predictions in regression and other linear models

• Prediction interval: represents a range in which a single new observation is likely to fall, given
specified settings of the predictors

• Confidence interval of the prediction: represents a range in which the mean response is likely
to fall, given specified settings of the predictors

• The prediction interval is always wider than the corresponding confidence interval because
of the added uncertainty involved in predicting a single response versus the mean response
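
The sketch below illustrates R-squared and the two interval types on a small fit. It uses the first ten (Waist, AT) rows of the data set listed later in this deck; the new waist value of 78 is a hypothetical input chosen for illustration.

```r
# First ten (Waist, AT) rows of the data set listed later in this deck.
wc <- data.frame(
  Waist = c(74.75, 72.6, 81.8, 83.95, 74.65, 71.85, 80.9, 83.4, 63.5, 73.2),
  AT    = c(25.72, 25.89, 42.6, 42.8, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22))
fit <- lm(AT ~ Waist, data = wc)

# R-squared = 1 - SSE/SST; with one predictor it equals cor(x, y)^2.
sse <- sum(residuals(fit)^2)
sst <- sum((wc$AT - mean(wc$AT))^2)
r2  <- 1 - sse / sst

new_x <- data.frame(Waist = 78)   # hypothetical new waist measurement
conf_int <- predict(fit, new_x, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_x, interval = "prediction", level = 0.95)

# Both intervals share the same centre; the prediction interval is wider.
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```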



Regression Techniques – Simple Linear Regression

Y = Continuous, X = Single & Continuous → Simple Linear Regression

Y = Continuous, X = Single & Discrete → Create Dummy Variable → Simple Linear Regression


Copyright © 2015 ExcelR. All rights reserved.

Simple Linear Regression – Dummy Variable

Gender    Dummy Variable
Male      1
Female    0
Male      1
Female    0
Male      1
Male      1
Female    0
Male      1
Male      1
Female    0
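
The coding in the table above can be produced in one line of R. This is a minimal sketch; the vector names are illustrative.

```r
# Gender column exactly as listed in the table.
gender <- c("Male", "Female", "Male", "Female", "Male",
            "Male", "Female", "Male", "Male", "Female")

# Male -> 1, Female -> 0.
dummy <- ifelse(gender == "Male", 1, 0)
data.frame(Gender = gender, Dummy = dummy)
```

When a factor is passed straight to `lm()`, R builds an equivalent 0/1 column automatically, so the manual step is mainly useful for seeing what the model does.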



Simple Linear Regression – R
A business problem:

The Waist Circumference – Adipose Tissue data

• Studies have shown that individuals with excess Adipose tissue (AT) in the abdominal region have a higher risk of 
cardio-vascular diseases

• Computed Tomography, commonly called the CT Scan is the only technique that allows for the precise and 
reliable measurement of the AT (at any site in the body)

• The problems with using the CT scan are:
◦ Many physicians do not have access to this technology
◦ Irradiation of the patient (suppresses the immune system)
◦ Expensive

• Is there a simpler yet reasonably accurate way to predict the AT area? i.e., one that is:
◦ Easily available
◦ Risk free
◦ Inexpensive

• A group of researchers conducted a study with the aim of predicting abdominal AT area using simple 
anthropometric measurements, i.e., measurements on the human body

• The Waist Circumference – Adipose Tissue data is a part of this study wherein the aim is to study how well waist 
circumference (WC) predicts the AT area



Simple Linear Regression – Data Set
Observation  Waist   AT      Observation  Waist   AT      Observation  Waist   AT
1            74.75   25.72   38           103     129     75           108     217
2            72.6    25.89   39           80      74.02   76           100     140
3            81.8    42.6    40           79      55.48   77           103     109
4            83.95   42.8    41           83.5    73.13   78           104     127
5            74.65   29.84   42           76      50.5    79           106     112
6            71.85   21.68   43           80.5    50.88   80           109     192
7            80.9    29.08   44           86.5    140     81           103.5   132
8            83.4    32.98   45           83      96.54   82           110     126
9            63.5    11.44   46           107.1   118     83           110     153
10           73.2    32.22   47           94.3    107     84           112     158
11           71.9    28.32   48           94.5    123     85           108.5   183
12           75      43.86   49           79.7    65.92   86           104     184
13           73.1    38.21   50           79.3    81.29   87           111     121
14           79      42.48   51           89.8    111     88           108.5   159
15           77      30.96   52           83.8    90.73   89           121     245
16           68.85   55.78   53           85.2    133     90           109     137
17           75.95   43.78   54           75.5    41.9    91           97.5    165
18           74.15   33.41   55           78.4    41.71   92           105.5   152
19           73.8    43.35   56           78.6    58.16   93           98      181
20           75.9    29.31   57           87.8    88.85   94           94.5    80.95
21           76.85   36.6    58           86.3    155     95           97      137
22           80.9    40.25   59           85.5    70.77   96           105     125
23           79.9    35.43   60           83.7    75.08   97           106     241
24           89.2    60.09   61           77.6    57.05   98           99      134
25           82      45.84   62           84.9    99.73   99           91      150
26           92      70.4    63           79.8    27.96   100          102.5   198
27           86.6    83.45   64           108.3   123     101          106     151
28           80.5    84.3    65           119.6   90.41   102          109.1   229
29           86      78.89   66           119.9   106     103          115     253
30           82.5    64.75   67           96.5    144     104          101     188
31           83.5    72.56   68           105.5   121     105          100.1   124
32           88.1    89.31   69           105     97.13   106          93.3    62.2
33           90.8    78.94   70           107     166     107          101.8   133
34           89.4    83.55   71           107     87.99   108          107.9   208
35           102     127     72           101     154     109          108.5   208
36           94.5    121     73           97      100
37           91      107     74           100     123



Simple Linear Regression – Transformation
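# Assumes the Waist and AT columns are already available in the workspace
# (for example, after attaching the data frame that holds the data set above)
reg <- lm(AT ~ Waist)  # Linear Regression
summary(reg)
confint(reg, level = 0.95)
predict(reg, interval = "prediction")

reg_log <- lm(AT ~ log(Waist))  # Regression using Logarithmic Transformation
summary(reg_log)
confint(reg_log, level = 0.95)
predict(reg_log, interval = "prediction")

reg_exp <- lm(log(AT) ~ Waist)  # Regression using Exponential Transformation
summary(reg_exp)
confint(reg_exp, level = 0.95)
exp(predict(reg_exp, interval = "prediction"))  # back-transform from log(AT) to AT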



Regression Techniques – Multiple Linear Regression

Y = Continuous, X = Multiple & Continuous → Multiple Linear Regression

Y = Continuous, X = Multiple & Discrete → Create Dummy Variable → Multiple Linear Regression



Multiple Linear Regression – Dummy Variable

Make of car   Dummy Variable_Petrol   Dummy Variable_Diesel   Dummy Variable_CNG   Dummy Variable_LPG
Petrol        1                       0                       0                    0
Diesel        0                       1                       0                    0
CNG           0                       0                       1                    0
LPG           0                       0                       0                    1
Diesel        0                       1                       0                    0
CNG           0                       0                       1                    0
Petrol        1                       0                       0                    0
LPG           0                       0                       0                    1
Petrol        1                       0                       0                    0
LPG           0                       0                       0                    1
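
For a variable with more than two categories, base R's `model.matrix` can generate the dummy columns shown in the table. This is a sketch; the vector name `fuel` is illustrative, and the values are the ten rows of the table.

```r
# The ten rows of the table, with levels in the table's column order.
fuel <- factor(c("Petrol", "Diesel", "CNG", "LPG", "Diesel",
                 "CNG", "Petrol", "LPG", "Petrol", "LPG"),
               levels = c("Petrol", "Diesel", "CNG", "LPG"))

# "- 1" drops the intercept so every level gets its own 0/1 column.
all_dummies <- model.matrix(~ fuel - 1)
all_dummies

# Default coding keeps an intercept plus k - 1 dummies; the first level
# (Petrol) becomes the baseline.
k_minus_1 <- model.matrix(~ fuel)
colnames(k_minus_1)
```

In a regression that includes an intercept, only k - 1 of the k dummies can be used, since all four columns together sum to the intercept column and would be perfectly collinear with it.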



Multiple Regression Model

DATA : CARS, 81 observations, “cars.csv”

• VOL = cubic feet of cab space

• HP = engine horsepower

• MPG = average miles per gallon 

• SP = top speed, miles per hour

• WT = vehicle weight, hundreds of pounds

Our interest is to model the MPG of a car based on the other variables
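
A sketch of how such a model might be fitted in R is shown below. The data frame is a hypothetical, simulated stand-in for "cars.csv" (the real file is not reproduced here); only the column names and the number of observations come from the slide.

```r
# HYPOTHETICAL stand-in for cars.csv: simulated values, real column names.
set.seed(42)
n <- 81
cars_df <- data.frame(VOL = runif(n, 50, 160),
                      HP  = runif(n, 50, 320),
                      SP  = runif(n, 90, 170),
                      WT  = runif(n, 15, 55))
cars_df$MPG <- 60 - 0.1 * cars_df$HP - 0.3 * cars_df$WT + rnorm(n, sd = 3)

# With the real data this would start from: cars_df <- read.csv("cars.csv")
model <- lm(MPG ~ VOL + HP + SP + WT, data = cars_df)
summary(model)$coefficients   # estimates, standard errors, t and p values
```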



Model and Assumptions

Our Model:

Y = b0 + b1X1 + b2X2 + ...... + bkXk + e

① Linearity (Assumptions about the form of the model):

◦ Linear in parameters

② Assumptions about the errors:

◦ IID Normal (independently & identically distributed)

◦ Zero mean

◦ Constant variance (Homoscedasticity)

◦ If the variance is not constant, it is called a HETEROSCEDASTICITY problem

◦ Independent of each other. If not independent, it is called an AUTOCORRELATION problem

③ Assumptions about the predictors:

◦ Non-random

◦ Measured without error

◦ Linearly independent of each other. If not, it is called a COLLINEARITY problem

④ Assumptions about the observations:

◦ Equally reliable

In short (LINE): Linear, Independent, Normal, Equal Variance
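
One common diagnostic for the collinearity problem is the variance inflation factor (VIF). The slides do not mention it, so the sketch below is offered as an illustration. The predictors are hypothetical simulated data, with `x2` deliberately built to track `x1`.

```r
# HYPOTHETICAL predictors: x2 is constructed to be strongly related to x1.
set.seed(7)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.3)
x3 <- rnorm(100)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors.
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

A VIF near 1 means a predictor is nearly independent of the others; large values flag predictors that are close to linear combinations of the rest.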



Techniques used for Discrete Output

• Logit Analysis

• Probit Analysis

• Logistic Regression



Regression Techniques – Simple Logistic Regression

Y = Discrete, X = Single & Continuous → Simple Logistic Regression

Y = Discrete, X = Single & Discrete → Create Dummy Variable → Simple Logistic Regression



Logistic Regression

• The Logistic Regression model predicts the probability associated with each
category of the dependent variable

How does it do this?

• It finds a linear relationship between the independent variables and a link
function of these probabilities. The link function that provides the
best goodness-of-fit for the given data is then chosen
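
The idea of comparing link functions can be sketched with `glm`. The data are hypothetical and simulated, and AIC is used here as one possible goodness-of-fit measure; the slides do not say which measure they have in mind.

```r
# HYPOTHETICAL data: one continuous predictor x and a 0/1 outcome y.
set.seed(123)
x <- runif(200, 1, 10)
y <- rbinom(200, size = 1, prob = plogis(-2 + 0.5 * x))

# Same linear predictor, two different link functions.
fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

# Lower AIC indicates the better-fitting link for this sample.
c(logit = AIC(fit_logit), probit = AIC(fit_probit))

# Either way, fitted values are probabilities between 0 and 1.
range(fitted(fit_logit))
```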



Logistic Regression

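The Multiple Logistic Regression Model is quite similar to the Multiple
Linear Regression Model: the right-hand side is the same linear combination
β0 + β1X1 + ... + βkXk, but it models the log-odds of the outcome, so the
β coefficients differ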



Logistic Regression Methods



Assumptions in Logistic Regression

1. Only one outcome per event – like pass or fail

2. The outcomes are statistically independent

3. All relevant predictors are in the model

4. One category at a time – mutually exclusive & collectively exhaustive

5. Sample sizes are larger than for linear regression



Steps in Logistic Regression

1. Collect & organize sample data

2. Formulate Logistic Regression Model

3. Check the model’s validity

4. Determine Probabilities using Probability equation

5. Compile the results
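
The five steps can be sketched end to end in R. The sample is hypothetical and simulated (inspection time versus a 0/1 detection flag), and the likelihood-ratio test is just one possible validity check; neither comes from the slides.

```r
# Step 1 - HYPOTHETICAL sample: 55 circuits, inspection time and detection flag.
set.seed(2024)
qa <- data.frame(time = round(runif(55, 1, 10), 1))
qa$detected <- rbinom(55, size = 1, prob = plogis(-2 + 0.5 * qa$time))

# Step 2 - formulate the model: log-odds of detection as a linear function of time.
fit <- glm(detected ~ time, data = qa, family = binomial)

# Step 3 - check validity: likelihood-ratio test against the intercept-only model.
lr_p <- pchisq(fit$null.deviance - fit$deviance,
               df = fit$df.null - fit$df.residual, lower.tail = FALSE)

# Step 4 - probability equation: p = 1 / (1 + exp(-(b0 + b1 * time))).
b <- coef(fit)
p_5min <- unname(1 / (1 + exp(-(b[1] + b[2] * 5))))

# Step 5 - compile the results.
list(coefficients = b, lr_p_value = lr_p, p_detect_at_5_min = p_5min)
```

The hand-written probability equation in step 4 gives the same value as `predict(fit, data.frame(time = 5), type = "response")`.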



Logistic Regression Example

Imagine that you are a Data Scientist at a very-large-scale integration (VLSI) circuit
manufacturing company. You want to know whether or not the time spent
inspecting each product impacts the quality assurance department’s ability to
detect a design error in the circuit

→ Step-1: Collect and organize the sample data

→ Number of Observations

→ Error Identification

→ Inspection Time

Number of Observations: 55 observations of circuits with errors, recording
whether each error was detected by QA