data science training


Arohi

Uploaded on Jun 24, 2019

Category Education

ExcelR is a global leader delivering a wide gamut of management and technical training across more than 40 countries. With over 20 franchise partners worldwide, ExcelR helps individuals and organisations by providing data science courses that combine practical knowledge with theoretical concepts.


data science training

Data Science using R, Minitab & XLMiner
© 2013 - 2016 ExcelR Solutions. All Rights Reserved

R, Minitab, XLMiner for Forecasting

My Introduction
Name: Bharani Kumar
Education: IIT Hyderabad, Indian School of Business
Professional certifications:
- PMP: Project Management Professional
- PMI-ACP: Agile Certified Practitioner
- PMI-RMP: Risk Management Professional
- CSM: Certified Scrum Master
- LSSGB: Lean Six Sigma Green Belt
- LSSBB: Lean Six Sigma Black Belt
- SSMBB: Six Sigma Master Black Belt
- ITIL: Information Technology Infrastructure Library
- Agile PM: Dynamic System Development Methodology Atern

My Introduction
DATA SCIENTIST
- Deloitte: driven using US policies
- Infosys: driven using Indian policies under large enterprises
- ITC Infotech: driven using Indian policies (SME)
- HSBC: driven using UK policies
- Research in analytics, deep learning & IoT

Tuckman Model

AGENDA
- Data Visualization using Tableau
- Data Mining – Supervised & Unsupervised (Machine Learning)
- Text Mining & NLP

What does it take to be a DATA SCIENTIST?
Domain knowledge, statistical analysis, data mining, forecasting and data visualization, plus practice on all agenda topics, make a successful data scientist.

Welcome to the Information Age ...
... drowning in data and starving for knowledge

BIG DATA!
- https://www.techinasia.com/alibaba-crushes-records-brings-143-billion-singles-day
- 500 million tweets every day, 1.3 billion accounts
- YouTube users upload 100 hours of video every minute
- 100 terabytes of data uploaded daily
- http://www.dnaindia.com/scitech/report-facebook-saw-one-billion-simultaneous-users-on-aug-24-2119428
- Processing 100 petabytes a day (1 petabyte = 1000 terabytes)
- More than 1 million customer transactions every hour
- 306 items are purchased every second; 26.6 million transactions per day

Why Tableau?

Agenda – Basic Statistics
- Data Types – Continuous, Discrete, Nominal, Ordinal, Interval, Ratio, Random Variable, Probability, Probability Distribution
- First, second, third & fourth moment business decisions
- Graphical representation – Barplot, Histogram, Boxplot, Scatter diagram
- Simple Linear Regression
- Hypothesis Testing

Data Types – Continuous & Discrete

Data Types – Preliminaries

Random Variable

Probability

Probability Distribution

Probability Applications

Sampling Funnel
Population → Sampling Frame → SRS → Sample
Measures of Central Tendency
“Every American should have above average income, and my Administration is going to see they get it.” – American President

Central Tendency (Population and Sample)
- Mean / Average
- Median: middle value of the data
- Mode: most frequently occurring value in the data

Measures of Dispersion (Population and Sample)
- Variance
- Standard Deviation
- Range: Max – Min

Expected Value
◆ For a probability distribution, the mean of the distribution is known as the expected value
◆ The expected value intuitively refers to what one would find if they repeated the experiment an infinite number of times and took the average of all of the outcomes
◆ Mathematically, it is calculated as the probability-weighted average of each possible value
The formula for calculating the expected value for a discrete random variable X, denoted by μ, is: \( \mu = \sum_i x_i \, p(x_i) \)
The variance of a discrete random variable X, denoted by σ², is: \( \sigma^2 = \sum_i (x_i - \mu)^2 \, p(x_i) \)

Graphical Techniques – Bar Chart

Graphical Techniques – Histogram
A histogram represents the frequency distribution, i.e., how many observations take a value within a certain interval.

Skewness & Kurtosis (third and fourth moments)
Skewness
• A measure of asymmetry in the distribution
• Mathematically it is given by \( E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] \)
• Negative skewness implies the mass of the distribution is concentrated on the right
Kurtosis
• A measure of the “peakedness” of the distribution
• Mathematically it is given by \( E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] - 3 \)
• For symmetric distributions, negative kurtosis implies a wider peak and thinner tails
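The expected value, variance, skewness and kurtosis definitions above can be sketched for a small discrete distribution. This is an illustrative sketch only: the fair six-sided die and the helper names (`expected_value`, `variance`, `standardized_moment`) are assumptions made for the example and do not come from the slides.

```python
import math

# Hypothetical discrete distribution: a fair six-sided die (assumed for illustration).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6


def expected_value(xs, ps):
    """First moment: probability-weighted average of the outcomes."""
    return sum(x * p for x, p in zip(xs, ps))


def variance(xs, ps):
    """Second central moment: weighted average squared deviation from the mean."""
    mu = expected_value(xs, ps)
    return sum((x - mu) ** 2 * p for x, p in zip(xs, ps))


def standardized_moment(xs, ps, k):
    """E[((X - mu) / sigma)^k]; k=3 gives skewness, k=4 gives kurtosis before subtracting 3."""
    mu = expected_value(xs, ps)
    sigma = math.sqrt(variance(xs, ps))
    return sum(((x - mu) / sigma) ** k * p for x, p in zip(xs, ps))


mu = expected_value(values, probs)                        # 3.5
var = variance(values, probs)                             # 35/12, about 2.92
skew = standardized_moment(values, probs, 3)              # about 0: the die is symmetric
excess_kurt = standardized_moment(values, probs, 4) - 3   # negative: flatter than a normal curve

print(f"mean={mu:.2f} variance={var:.3f} skewness={skew:.3f} excess kurtosis={excess_kurt:.3f}")
```

The negative excess kurtosis of the die matches the slide's description of a wider peak and thinner tails for a symmetric distribution.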
Graphical Techniques – Box Plot
Box Plot: This graph shows the distribution of data by dividing the data into four groups with the same number of data points in each group. The box contains the middle 50% of the data points and each of the two whiskers contains 25% of the data points. It displays two common measures of the variability or spread in a data set:
- Range: It is represented on a box plot by the distance between the smallest value and the largest value, including any outliers. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers.
- Inter-quartile Range (IQR): The middle half of a data set falls within the inter-quartile range.

Normal Distribution
Characterized by a bell-shaped curve. Has the following properties:
▪ The normal random variable takes values from -∞ to +∞
▪ The probability associated with any single value of a continuous random variable is always zero
▪ Area under the entire curve is always equal to 1

Normal Distribution
Characterized by mean, μ, and standard deviation, σ: X ~ N(μ, σ)
- 68.27% of the values lie within ±1σ from the mean
- 95.45% of the values lie within ±2σ from the mean
- 99.73% of the values lie within ±3σ from the mean

Z scores, Standard Normal Distribution
• For every value (x) of the random variable X, we can calculate the Z score: \( Z = \frac{X - \mu}{\sigma} \)
• Interpretation − How many standard deviations away is the value from the mean?

Calculating Probability from Z distribution
Suppose GMAT scores can be reasonably modelled using a normal distribution − μ = 711, σ = 29
What is P(X ≤ 680)?
Step 1: Calculate the Z score corresponding to 680 - Z = (680 − 711)/29 = −1.07
Step 2: Calculate the probability using Z tables - P(Z ≤ −1.07) = 0.14

Calculating Probability from Z distribution
• What is P(697 ≤ X ≤ 740)?
• Step 1: Use P(x1 ≤ X ≤ x2) = P(X ≤ x2) − P(X ≤ x1)
• Step 2: Calculate P(X ≤ x2) and P(X ≤ x1) as before: P(X ≤ 740) = P(Z ≤ 1) = 0.84; P(X ≤ 697) = P(Z ≤ −0.5) = 0.31
• Step 3: Calculate P(697 ≤ X ≤ 740) = 0.84 – 0.31 = 0.53

Normal Quantile (Q-Q) Plot
[Plot: Sample Quantiles against Theoretical Quantiles]

Sampling variation
▪ Sample mean varies from one sample to another
▪ Sample mean can be (and most likely is) different from the population mean
▪ Sample mean is a random variable

Central Limit Theorem
The distribution of the sample mean
- will be normal when the distribution of data in the population is normal
- will be approximately normal even if the distribution of data in the population is not normal, if the sample size is fairly large
Mean(\( \bar{X} \)) = μ (the same as the population mean of the raw data)
Standard Deviation(\( \bar{X} \)) = \( \sigma / \sqrt{n} \), where σ is the population standard deviation and n is the sample size
- This is referred to as the standard error of the mean
The standard error of the mean estimates the variability between samples, whereas the standard deviation measures the variability within a single sample

Sample Size Calculation
A sample size of 30 is considered large enough, but that may or may not be adequate
More precise conditions
- n > 10(K3)², where K3 is the sample skewness, and
- n > 10(K4), where K4 is the sample kurtosis

Confidence Interval
• What is the probability of tomorrow’s temperature being 42 degrees?
Probability is ‘0’
• Can it be between [−50 °C and 100 °C]?

Case Study: Confidence Interval
• A university with 100,000 alumni is thinking of offering a new affinity credit card to its alumni.
• Profitability of the card depends on the average balance maintained by the card holders.
• A market research campaign is launched, in which about 140 alumni accept the card in a pilot launch.
• Average balance maintained by these is $1990 and the standard deviation is $2833. Assume that the population standard deviation is $2500 from previous launches.
• What can we say about the average balance that will be held after a full-fledged market launch?

Interval estimates of parameters
• Based on sample data − The point estimate for mean balance = $1990 − Can we trust this estimate?
• What do you think will happen if we took another random sample of 140 alumni?
• Because of this uncertainty, we prefer to provide the estimate as an interval (range) and associate a level of confidence with it
Interval Estimate = Point Estimate ± Margin of Error

Confidence Interval for the Population Mean
Start by choosing a confidence level (1−α)% (e.g. 95%, 99%, 90%)
Then, the population mean will be within \( \bar{X} \pm Z_{1-\alpha} \frac{\sigma}{\sqrt{n}} \), where \( Z_{1-\alpha} \) satisfies \( P(-Z_{1-\alpha} \le Z \le Z_{1-\alpha}) = 1-\alpha \)
Margin of error depends on the underlying uncertainty, confidence level and sample size

Calculate Z value - 90%, 95% & 99%

Confidence Interval Calculation
• Based on the survey and past data − n = 140; σ = $2500; \( \bar{X} \) = $1990 − \( \sigma_{\bar{X}} = \sigma/\sqrt{n} = 2500/\sqrt{140} = 211.29 \)
• Construct a 95% confidence interval for the mean card balance and interpret it?
• Construct a 90% confidence interval for the mean card balance and interpret it?

Confidence Interval Interpretation
Consider the 95% confidence interval for the mean balance: [$1576, $2404]
Does this mean that
- The mean balance of the population lies in the range?
- The mean balance is in this range 95% of the time?
- 95% of the alumni have balance in this range?
Interpretation 1: Mean of the population has a 95% chance of being in this range for a random sample
Interpretation 2: Mean of the population will be in this range for 95% of the random samples

What if we don’t know Sigma?
• Suppose that the alumni of this university are very different and hence the population standard deviation from previous launches cannot be used
• We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample. Calculate \( T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \)
• If the underlying population is normally distributed, T is a random variable distributed according to a t-distribution with n−1 degrees of freedom, \( T_{n-1} \)
• Research has shown that the t-distribution is fairly robust to deviations of the population from the normal model

Student’s t-distribution
As n → ∞, \( t_n \to N(0,1) \), i.e., as the degrees of freedom increase, the t-distribution approaches the standard normal distribution

Confidence Interval for mean with unknown Sigma

Calculating t-value
• Construct a 95% confidence interval for the mean card balance and interpret it?
− n = 140; s = $2833; \( \bar{X} \) = $1990
− \( s_{\bar{X}} = s/\sqrt{n} = 2833/\sqrt{140} = 239.43 \)
Calculate \( t_{0.95,\,139} = 1.98 \)
Then the 95% confidence interval for balance is [$1516, $2464]
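The GMAT probabilities and the credit card confidence intervals worked through in the slides can be checked with Python's standard library. This is a sketch, not the course's own tooling (the slides use Minitab): `z_interval` is a hypothetical helper name, and because the standard library has no t-distribution, the t critical value of 1.98 is taken directly from the slide rather than computed.

```python
from math import sqrt
from statistics import NormalDist

# GMAT example: scores modelled as Normal(mu=711, sigma=29).
gmat = NormalDist(mu=711, sigma=29)
z_680 = (680 - 711) / 29                      # about -1.07
p_below_680 = gmat.cdf(680)                   # about 0.14
p_between = gmat.cdf(740) - gmat.cdf(697)     # about 0.53

# Credit card case study: pilot sample of 140 alumni.
n, x_bar = 140, 1990.0
sigma = 2500.0   # population standard deviation assumed known from previous launches
s = 2833.0       # sample standard deviation


def z_interval(confidence):
    """Two-sided interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


ci_95 = z_interval(0.95)   # roughly (1576, 2404)
ci_90 = z_interval(0.90)

# Sigma unknown: use s and a t critical value (1.98 for 139 df, as given on the slide).
t_crit = 1.98
t_margin = t_crit * s / sqrt(n)
ci_t = (x_bar - t_margin, x_bar + t_margin)   # roughly (1516, 2464)

print(f"P(X <= 680) = {p_below_680:.3f}, P(697 <= X <= 740) = {p_between:.3f}")
print(f"95% z-interval: {ci_95[0]:.0f} to {ci_95[1]:.0f}")
print(f"90% z-interval: {ci_90[0]:.0f} to {ci_90[1]:.0f}")
print(f"95% t-interval: {ci_t[0]:.0f} to {ci_t[1]:.0f}")
```

The t-interval is wider than the z-interval here mainly because the sample standard deviation ($2833) is larger than the assumed population value ($2500).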
Hypothesis Testing
Start with Hypothesis about a Population Parameter → Collect Sample Information → Reject/Do Not Reject Hypothesis

The factors that affect the power of a test include sample size, effect size, population variability, and α. Power and α are related, as increasing α decreases β. Since power is calculated as 1 minus β, if you increase α, you also increase the power of a test. The maximum power a test can have is 1, whereas the minimum value is 0.

- Ho is TRUE, Fail to Reject Ho: Right Decision (Confidence, 1-α)
- Ho is TRUE, Reject Ho: Type I error
- H1 is TRUE, Fail to Reject Ho: Type II error
- H1 is TRUE, Reject Ho: Right Decision (Power, 1-β)

Hypothesis Testing
- Our quality will not improve after the consulting project
- We will acquire 8,000 new customers if I open a store in this area
- Our potential customers do not spend more than 60 minutes on the web every day
- The retail market will grow by 50% in the next 5 years
- Less than 5% clients will default on their loans
- We will need 400 more person hours to finish this project

1-Sample Z test
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Population Standard Deviation Known or Not
3. 1 Sample Z Test: Stat > Basic Statistics > 1 Sample Z
Fabric Data
The lengths of 25 samples of a fabric are taken at random. Mean and standard deviation from the historic 2 years study are 150 and 4 respectively. Test if the current mean is greater than the historic mean. Assume α to be 0.05

1-Sample Z test – Write Hypothesis
- We are comparing mean with external standard of 150mm
- Population standard deviation is known = 4
- Data was shown to be normal
- Y: Fabric Length is continuous
- X: Discrete, 1 Population

1-Sample t Test
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Population Standard Deviation Known or Not
3. 1 Sample t Test: Stat > Basic Statistics > 1 Sample t
Bolt Diameter
The mean diameter of the bolt manufactured should be 10mm to be able to fit into the nut. 20 samples are taken at random from the production line by a quality inspector. Conduct a test to check with 95% confidence that the mean is not different from the specification value.

1-Sample t Test – Write Hypothesis
- Y: Bolt Diameter is continuous
- X: Discrete, 1 Population
- We are comparing mean with external standard of 10mm
- Data was given to be Normal
- Population standard deviation is NOT known

1-Sample Sign Test
1. Normality Test: Stat > Basic Statistics > Graphical Summary
3. 1 Sample Sign Test: Stat > Non Parametric > 1 Sample sign
Student Scores
The scores of 20 students for the statistics exam are provided. Test if the current median is not equal to the historic median of 82. Assume ‘α’ to be 0.05

1-Sample Sign Test – Write Hypothesis

2-Sample t Test
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Variance Test: Stat > Basic Statistics > 2 Variance
Marketing Strategy
A financial analyst at a financial institute wants to evaluate a recent credit card promotion. After this promotion, 450 cardholders were randomly selected. Half received an ad promoting a full waiver of interest rate on purchases made over the next three months, and half received a standard Christmas advertisement. Did the ad promoting the full interest rate waiver increase purchases?
3. 2 Sample t Test: Stat > Basic Statistics > 2-Sample t

2-Sample t Test – Write Hypothesis

Paired T Test
• This test is used to compare the means of two sets of observations when all the other external conditions are the same
• This is a more powerful test, as the variability in the observations that is due to differences between the people or objects sampled is factored out
Example: To find out if medication A lowers blood pressure

Trigger your thoughts!
- Compare the power output of two wind mills next to each other simultaneously when you use motor A on one wind mill and motor B on another
- Comparing the performance of machine A vs. machine B by feeding different raw materials to each machine
- Identifying resistor defects and capacitor defects in the same PCB by collecting such data using 20 PCB units
- Compare the performance of machine A vs. machine B when the same raw material is fed to each machine
- Identifying resistor defects on 20 PCBs and capacitor defects on 20 (different) PCBs
- Compare the power output of a wind mill when you use motor A for 1 month and motor B for 1 month

2-Sample t test or Paired T test
Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In the remaining 10 vehicles, the additive to be tested is added to the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive.
Answer: 2-Sample t test
Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive. What method will you choose to find the result?
Answer: Paired T test
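The difference between the two designs in the fuel additive question can be illustrated numerically. The sketch below uses invented mileage figures (the slides do not include the data), hypothetical helper names, and critical values read from a standard t table instead of Minitab output. With equal group sizes the unpooled statistic computed here equals the pooled one.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical mileage figures (km per litre), invented for illustration only.
without_additive = [18.0, 20.0, 22.0, 19.0, 21.0, 23.0, 17.0, 20.0, 22.0, 18.0]
with_additive = [19.0, 21.5, 22.5, 20.0, 22.0, 24.5, 18.0, 20.5, 23.5, 19.0]


def two_sample_t(x, y):
    """t statistic treating x and y as independent samples (two different sets of vehicles)."""
    se = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(y) - mean(x)) / se


def paired_t(x, y):
    """t statistic on per-vehicle differences (same vehicles before and after)."""
    diffs = [after - before for before, after in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


t_independent = two_sample_t(without_additive, with_additive)
t_paired = paired_t(without_additive, with_additive)

# One-sided critical values at alpha = 0.05 from a standard t table.
T_CRIT_DF18 = 1.734   # two samples of 10, pooled degrees of freedom = 18
T_CRIT_DF9 = 1.833    # 10 paired differences, degrees of freedom = 9

print(f"2-sample t = {t_independent:.2f} (reject H0: {t_independent > T_CRIT_DF18})")
print(f"paired t   = {t_paired:.2f} (reject H0: {t_paired > T_CRIT_DF9})")
```

With these made-up numbers the same measurements are not significant as two independent samples but are clearly significant as pairs, because pairing removes the vehicle-to-vehicle variability the slide mentions.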
Mann-Whitney test
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Mann-Whitney test for Medians: Stat > Non Parametric > Mann Whitney
Vehicle with & without Additives
Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In the remaining 10 vehicles, the additive to be tested is added to the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive.

Mann-Whitney Test – Write Hypothesis

Paired T test
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Paired T Test: Stat > Basic Statistics > Paired T
Vehicle with & without Additives
Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In the remaining 10 vehicles, the additive to be tested is added to the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive. Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive.
• Since the data was not normal, the cause of non-normality was investigated and it was found that the first data point for “with additive” was wrongly entered. This value should have been 20. Now, proceed with the rest of the analysis.
• If the data were truly non-normal our analysis would stop here.

Paired T test – Write Hypothesis

One-Way ANOVA
1. Normality Test: Stat > Basic Statistics > Graphical Summary
2. Variance Test: Stat > ANOVA > Test for Equal Variances
Contract Renewal
A marketing organization outsources their back-office operations to three different suppliers.
The contracts are up for renewal and the CMO wants to determine whether they should renew contracts with all suppliers or any specific supplier. The CMO wants to renew the contract of the supplier with the least transaction time. The CMO will renew all contracts if the performance of all suppliers is similar.
3. ANOVA: Stat > ANOVA > One-Way....

Example: More weight reduction programs
• Suppose the nutrition expert would like to do a comparative evaluation of three diet programs (Atkins, South Beach, GM)
• She randomly assigns an equal number of participants to each of these programs from a common pool of volunteers
• Suppose the average weight losses in each of the groups (arms) of the experiment are 4.5kg, 7kg, 5.3kg
• What can she conclude?

Two kinds of variation matter
• Not every individual in each program will respond identically to the diet program
• Easier to identify variations across programs if variations within programs are smaller
• Hence the method is called Analysis of Variance (ANOVA)
• Within-group variation = Experimental Error
• Between-group variation

Formalizing the intuition behind variations
• It should be obvious that for every observation: \( Tot_{ij} = t_i + e_{ij} \)
• What is more surprising and useful is that the sums of squares add up the same way: \( SS_{Total} = SSTR + SSE \)

Statistically test for equality of means
• n subjects equally divided into r groups
• Hypothesis
- H0: μ1 = μ2 = μ3 = ... = μr
- Ha: Not all μi are equal
• Calculate
- Mean Square Treatment MSTR = SSTR / (r − 1)
- Mean Square Error MSE = SSE / (n − r)
- The ratio of two mean squares f = MSTR/MSE = Between group variation / Within group variation
- Strength of this evidence p-value = Pr(F(r−1, n−r) ≥ f)
• Reject the null hypothesis if p-value < α
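The one-way ANOVA quantities above (SSTR, SSE, MSTR, MSE and f) can be computed by hand in a few lines. In this sketch the individual weight losses are invented so that the group means match the 4.5kg, 7kg and 5.3kg quoted in the example; the function name `one_way_anova` is hypothetical, and the critical value is read from a standard F table, not computed.

```python
# Hypothetical weight losses (kg); only the group means (4.5, 7.0, 5.3) come from the slides.
diet_groups = {
    "Atkins": [4.0, 5.0, 4.5, 3.5, 5.5],
    "South Beach": [6.5, 7.5, 7.0, 6.0, 8.0],
    "GM": [5.0, 5.5, 5.3, 4.7, 6.0],
}


def one_way_anova(samples):
    """Return the sums of squares, mean squares and F ratio for a list of groups."""
    all_values = [v for group in samples for v in group]
    n, r = len(all_values), len(samples)
    grand_mean = sum(all_values) / n
    group_means = [sum(group) / len(group) for group in samples]

    # Between-group variation: how far each group mean sits from the grand mean.
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(samples, group_means))
    # Within-group variation: how far each observation sits from its own group mean.
    sse = sum((v - m) ** 2 for g, m in zip(samples, group_means) for v in g)
    sst = sum((v - grand_mean) ** 2 for v in all_values)

    mstr = sstr / (r - 1)
    mse = sse / (n - r)
    return {"SSTR": sstr, "SSE": sse, "SST": sst, "MSTR": mstr, "MSE": mse, "F": mstr / mse}


result = one_way_anova(list(diet_groups.values()))

F_CRIT_2_12 = 3.89  # F(2, 12) critical value at alpha = 0.05 from a standard F table

print({name: round(value, 3) for name, value in result.items()})
print("Reject H0 (means differ):", result["F"] > F_CRIT_2_12)
```

The `SST` entry is computed independently so that the decomposition SST = SSTR + SSE can be checked rather than assumed.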
Analysis of variance (ANOVA)
• ANOVA can be used to test equality of means when there are more than 2 populations
• ANOVA can be used with one or two factors
• If only one factor is varying, then we would use a one-way ANOVA
– Example: We are interested in comparing the mean performance of several departments within a company. Here the only factor is the name of the department
• If there are two factors, we would use a two-way ANOVA
– Example: One factor is department and the second factor is the shift (day vs. night)

Analysis of variance (ANOVA)
One Way ANOVA
Source of Variation | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | F Test Statistic
Between Treatments | SSFactor | k − 1 | MSFactor = SSFactor / DFFactor | F = MSFactor / MSError
Within Treatment | SSError | N − k | MSError = SSError / DFError |
Total | SSTotal | N − 1 | |

Two Way ANOVA
Source of Variation | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | F Test Statistic
Factor A | SSA | nA − 1 | MSA = SSA / (nA − 1) | FA = MSA / MSE
Factor B | SSB | nB − 1 | MSB = SSB / (nB − 1) | FB = MSB / MSE
Interaction A * B | SSAB | (nA − 1)(nB − 1) | MSAB = SSAB / ((nA − 1)(nB − 1)) | FAB = MSAB / MSE
Error | SSE | n − nA * nB | MSE = SSE / (n − nA * nB) |
Total | SST | n − 1 | |

Dichotomies
2 Sample t-test
- Is the transaction time dependent on whether person A or B processes the transaction?
- Is medicine 1 effective or medicine 2 at reducing heart stroke?
- Is the new branding program more effective in increasing profits?
ANOVA – One Way
- Three different sale closing methods were used. Which one is most effective?
- Does the productivity of employees vary depending on the three levels? (Beginner, Intermediate and Advanced)
- Four types of machines are used. Is the weight of the Rugby ball dependent on the type of machine used?
Non-Parametric equivalent to ANOVA
• When the data are not normal, or the data points are too few to figure out whether the data are normal, and we have more than 2 populations, we can use the Mood’s Median or Kruskal Wallis test to compare the populations
Ho: All the medians are the same
Ha: At least one of the medians is different
• Mood’s median assigns the data from each population that is higher than the overall median to one group, and all points that are equal or lower to another group. It then uses a Chi-Square test to check if the observed frequencies are close to the expected frequencies
• Kruskal Wallis is another test that is a non-parametric equivalent of ANOVA. Kruskal Wallis is the extension of the Mann-Whitney test
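As a rough illustration of how the Kruskal Wallis test works on ranks, the sketch below computes its H statistic for three groups. The transaction times and supplier labels are invented, the helper names are hypothetical, no tie correction is applied, and the critical value is the chi-square table value for 2 degrees of freedom; the chi-square approximation is only rough for groups this small.

```python
# Hypothetical transaction times (minutes) for three suppliers, invented for illustration.
supplier_times = {
    "A": [12.1, 14.3, 11.8, 13.5],
    "B": [15.2, 16.8, 14.9, 17.1],
    "C": [12.9, 13.1, 15.9, 14.0],
}


def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(samples):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), without tie correction."""
    pooled = [v for group in samples for v in group]
    ranks = average_ranks(pooled)
    total = len(pooled)
    h, start = 0.0, 0
    for group in samples:
        rank_sum = sum(ranks[start:start + len(group)])
        h += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (total * (total + 1)) * h - 3 * (total + 1)


h_stat = kruskal_wallis_h(list(supplier_times.values()))

CHI2_CRIT_DF2 = 5.991  # chi-square critical value, 2 degrees of freedom, alpha = 0.05

print(f"H = {h_stat:.3f}; reject Ho: {h_stat > CHI2_CRIT_DF2}")
```

Because only the ranks enter the statistic, a single extreme transaction time would shift H far less than it would shift the F ratio in ANOVA.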