data science in pune


Ravali470d

Uploaded on Oct 10, 2018

Category Education

Data is to a data scientist what oxygen is to human beings. This is also a profession where the statistically adroit work on data


Data Mining
© 2013 ExcelR Solutions. All Rights Reserved

Market Basket Analysis / Affinity Analysis / Relationship Mining / Association Rules

Market Basket Analysis
• Large number of transaction records, collected using bar-code scanners
• Each record = all items purchased in a single purchase transaction

Association Rules
• What item goes with what?
• Are certain groups of items consistently purchased together?
• What business strategies will you devise with this knowledge?

Association Rules
• Product shelf placement: a specific product beside another
• Selling of prominent shelves: slotting fees
• Stocking: supply chain management
• Price bundling: combo offers. How?
Source:
http://www.economist.com/news/business/21654601-supplier-rebates-are-heart-some-supermarket-chains-woes-buying-up-shelves
https://en.wikipedia.org/wiki/Association_rule_learning

Association Rules – Cell phone faceplates
• A store that sells accessories for cellular phones runs a promotion on faceplates
• OFFER! Buy multiple faceplates from a choice of 6 different colors & get a discount
• How would you help store managers devise a strategy to become more profitable?
Association Rules – Cell phone faceplates

List format:

Transaction #   Faceplate colors purchased
1               Red White Green
2               White Orange
3               White Blue
4               Red White Orange
5               Red Blue
6               White Blue
7               White Orange
8               Red White Blue Green
9               Red White Blue
10              Yellow

Binary matrix format:

Transaction #   Red   White   Blue   Orange   Green   Yellow
1               1     1       0      0        1       0
2               0     1       0      1        0       0
3               0     1       1      0        0       0
4               1     1       0      1        0       0
5               1     0       1      0        0       0
6               0     1       1      0        0       0
7               0     1       0      1        0       0
8               1     1       1      0        1       0
9               1     1       1      0        0       0
10              0     0       0      0        0       1

Association rules are probabilistic “if-then” statements.
2 main ideas:
• Examine all possible “if-then” rule formats
• Select the rules that indicate true dependence

Association Rules – Cell phone faceplates
Rules for {Red, White, Green}:
1. If {Red, White} then {Green}
2. If {Red, Green} then {White}
3. If {White, Green} then {Red}
4. If {Red} then {White, Green}
5. If {White} then {Red, Green}
6. If {Green} then {Red, White}
Problem
• Many rules are possible
• How to select the TRUE/GOOD rules from all generated rules?

Association Rules – Terminology
• If {Red, White} then {Green}
• If Red & White phone faceplates are purchased, then a Green faceplate is purchased
  • Antecedent: Red & White
  • Consequent: Green
“IF” part = Antecedent = A
“THEN” part = Consequent = C

Association Rules – Performance Measures
1. Support
2. Confidence
3. Lift

Association Rules – Support
• Consider only combinations that occur with higher frequency in the database
• Support is the criterion based on frequency
• Support = percentage (or number) of transactions in which both the IF/Antecedent and the THEN/Consequent appear in the data
Mathematically:
Support = (# transactions in which A & C appear together) / (total no. of transactions)
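The list format, the binary matrix format and the support measure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the names `TRANSACTIONS`, `COLORS`, `to_binary_matrix`, `support_count` and `support` are made up for this example.

```python
# Faceplate transactions in list format (one set of colors per transaction).
TRANSACTIONS = [
    {"Red", "White", "Green"},
    {"White", "Orange"},
    {"White", "Blue"},
    {"Red", "White", "Orange"},
    {"Red", "Blue"},
    {"White", "Blue"},
    {"White", "Orange"},
    {"Red", "White", "Blue", "Green"},
    {"Red", "White", "Blue"},
    {"Yellow"},
]
COLORS = ["Red", "White", "Blue", "Orange", "Green", "Yellow"]


def to_binary_matrix(transactions, items):
    """Convert list format to binary matrix format (one row per transaction)."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


def support_count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, antecedent, consequent):
    """Fraction of transactions in which antecedent and consequent appear together."""
    both = set(antecedent) | set(consequent)
    return support_count(transactions, both) / len(transactions)


if __name__ == "__main__":
    print(to_binary_matrix(TRANSACTIONS, COLORS)[0])         # row for transaction 1
    print(support(TRANSACTIONS, {"Red", "White"}, {"Green"}))  # rule 1 from the slide
```

Note that support depends only on the combined item set, so swapping antecedent and consequent does not change it.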
Support - Calculation

Transaction #   Faceplate colors purchased
1               Red White Green
2               White Orange
3               White Blue
4               Red White Orange
5               Red Blue
6               White Blue
7               White Orange
8               Red White Blue Green
9               Red White Blue
10              Yellow

• What is the support for “if White then Blue”?
  1. 4
  2. 40%
  3. 2
  4. 90%
• What is the support for “if Blue then White”?
  1. 4
  2. 40%
  3. 2
  4. 90%

Support - Problem
• Generating all possible rules is exponential in the number of distinct items
• Solution: frequent item sets using the Apriori algorithm

Apriori Algorithm
For k products:
1. Set minimum support criteria
2. Generate list of one-item sets that meet the support criterion
3. Use list of one-item sets to generate list of two-item sets that meet support criterion
4. Use list of two-item sets to generate list of three-item sets that meet support criterion
5. Continue up through k-item sets

Support – Criterion = 2
(Same ten transactions as in the table above.)

Item set              Support (Count)
{Red}                 5
{White}               8
{Blue}                5
{Orange}              3
{Green}               2
{Red, White}          4
{Red, Blue}           3
{Red, Green}          2
{White, Blue}         4
{White, Orange}       3
{White, Green}        2
{Red, White, Blue}    2
{Red, White, Green}   2

Create rules from frequent item sets only

Support Criterion Example
Rules for {Red, White, Green}:
1. If {Red, White} then {Green}
2. If {Red, Green} then {White}
3. If {White, Green} then {Red}
4. If {Red} then {White, Green}
5. If {White} then {Red, Green}
6. If {Green} then {Red, White}
How good are these rules beyond the point that they have high support?
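The Apriori steps and the item set table above can be reproduced with a short sketch. This is one simple way to write the level-wise search, assuming support is given as a minimum count; the function name `apriori` and its signature are illustrative and do not come from the slides or from any library.

```python
from itertools import combinations

# Faceplate transactions from the slides.
TRANSACTIONS = [
    {"Red", "White", "Green"},
    {"White", "Orange"},
    {"White", "Blue"},
    {"Red", "White", "Orange"},
    {"Red", "Blue"},
    {"White", "Blue"},
    {"White", "Orange"},
    {"Red", "White", "Blue", "Green"},
    {"Red", "White", "Blue"},
    {"Yellow"},
]


def apriori(transactions, min_count):
    """Return {frozenset(item set): support count} for all frequent item sets."""

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Step 2: one-item sets that meet the support criterion.
    current = {}
    for item in sorted({i for t in transactions for i in t}):
        single = frozenset([item])
        n = count(single)
        if n >= min_count:
            current[single] = n

    frequent = dict(current)
    k = 2
    # Steps 3-5: build k-item sets from frequent (k-1)-item sets.
    while current:
        # Join: unions of two frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune: every (k-1)-item subset must itself be frequent.
            if all(frozenset(s) in current for s in combinations(cand, k - 1)):
                n = count(cand)
                if n >= min_count:
                    next_level[cand] = n
        frequent.update(next_level)
        current = next_level
        k += 1
    return frequent


if __name__ == "__main__":
    for itemset, n in sorted(apriori(TRANSACTIONS, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), n)
```

The pruning step is what keeps the search from enumerating every possible combination: an item set is only counted if all of its smaller subsets already met the support criterion.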
Association Rules – Confidence
• Percentage of IF/Antecedent transactions that also have the THEN/Consequent item set
Mathematically:
P(Consequent | Antecedent) = P(C & A) / P(A)
Confidence = (# transactions in which A & C appear together) / (# transactions with A)

Confidence - Calculation

Transaction #   Faceplate colors purchased
1               Red White Green
2               White Orange
3               White Blue
4               Red White Orange
5               Red Blue
6               White Blue
7               White Orange
8               Red White Blue Green
9               Red White Blue
10              Yellow

• What is the confidence for “if White then Blue”?
  1. 4/5
  2. 5/8
  3. 5/4
  4. 4/8
• What is the confidence for “if Blue then White”?
  1. 4/5
  2. 5/8
  3. 5/4
  4. 4/8

Confidence - Weakness
• If the antecedent and the consequent both have high support, the rule gets a high (biased) confidence

Association Rules – Lift Ratio
Lift ratio = Confidence / Benchmark confidence
Benchmark assumes independence between antecedent & consequent:
P(antecedent & consequent) = P(antecedent) × P(consequent)
Benchmark confidence:
P(C | A) = P(C & A) / P(A) = P(C) × P(A) / P(A) = P(C)
Benchmark confidence = (# transactions with consequent item sets) / (# transactions in database)

Interpreting Lift
• Lift > 1 indicates a rule that is useful in finding consequent item sets
• Such a rule is much better than selecting random transactions

Lift - Calculation
(Same ten transactions as in the table above.)
• What is the lift for “if White then Blue”?
  1. 4/8
  2. 5/10
  3. 4/5
  4. 1
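The confidence and lift definitions above translate directly into code. The sketch below is illustrative only: the helper names `count`, `confidence` and `lift` are assumptions for this example, and exact fractions are used so the results can be compared with the hand calculations.

```python
from fractions import Fraction

# Faceplate transactions from the slides.
TRANSACTIONS = [
    {"Red", "White", "Green"},
    {"White", "Orange"},
    {"White", "Blue"},
    {"Red", "White", "Orange"},
    {"Red", "Blue"},
    {"White", "Blue"},
    {"White", "Orange"},
    {"Red", "White", "Blue", "Green"},
    {"Red", "White", "Blue"},
    {"Yellow"},
]


def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """(# transactions with A & C) / (# transactions with A).

    Assumes the antecedent occurs at least once; otherwise the ratio is undefined.
    """
    both = set(antecedent) | set(consequent)
    return Fraction(count(transactions, both), count(transactions, antecedent))


def lift(transactions, antecedent, consequent):
    """Confidence divided by the benchmark confidence P(C)."""
    benchmark = Fraction(count(transactions, consequent), len(transactions))
    return confidence(transactions, antecedent, consequent) / benchmark


if __name__ == "__main__":
    rule = ({"Green"}, {"Red", "White"})
    print(confidence(TRANSACTIONS, *rule), lift(TRANSACTIONS, *rule))
```

Unlike support, confidence is not symmetric: swapping antecedent and consequent changes the denominator, so the two directions of a rule generally score differently.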
Rules selection process
Generate all rules that meet the specified Support & Confidence:
• Find frequent item sets by applying the minimum support cutoff
• From these item sets, generate rules and keep only those that meet the specified minimum Confidence

Rules

Inputs
Data
# Transactions in Input Data    10
# Columns in Input Data         6
# Items in Input Data           6
# Association Rules             8
Minimum Support                 2
Minimum Confidence              70.00%

List of Rules
Rule: If all Antecedent items are purchased, then with Confidence percentage Consequent items will also be purchased.

Row ID   Confidence %   Antecedent (A)   Consequent (C)   Support for A   Support for C   Support for A & C   Lift Ratio
8        100            green            red & white      2               4               2                   2.5
4        100            green            red              2               5               2                   2
6        100            white & green    red              2               5               2                   2
3        100            orange           white            3               8               3                   1.25
5        100            green            white            2               8               2                   1.25
7        100            red & green      white            2               8               2                   1.25
1        80             red              white            5               8               4                   1
2        80             blue             white            5               8               4                   1

Alarming!
• Random data can generate apparently interesting association rules
• The more rules you produce, the greater the danger
• Rules based on large numbers of records are less subject to this danger

Profusion of rules

Applications
• What if Product & Stores are selected as a tuple for analysis?
• What if the crimes in different geographies for each week are known?
Crime categories shown on the slide: Narcotics, Robbery, Assault, Battery, Public Peace Violation

Recap with an example
• How can you use the information if you know about the purchase history of customers in a specific geography?
• Supermarket database has 100,000 POS transactions
• 2000 transactions include both Strepsils & Orange Juice
• 800 of the above 2000 include Soup purchases
Recap with an example
• What is the support for rule “IF (Orange Juice & Strepsils) are purchased THEN (Soup) is purchased on the same trip”?
  1. 0.8 %
  2. 2 %
  3. 40 %
• What is the confidence for rule “IF (Orange Juice & Strepsils) are purchased THEN (Soup) is purchased on the same trip”?
  1. 0.8 %
  2. 2 %
  3. 40 %

Recap with an example
• What is the lift ratio for rule “IF (Orange Juice & Strepsils) are purchased THEN (Soup) is purchased on the same trip”?

Sequential Pattern Mining
IT IS NOT:
• Purchases / events that occur at the same time
IT IS:
• If person X has taken “Data Mining Unsupervised” training in the 1st quarter, person X has also taken “Data Mining Supervised” training in the 2nd quarter
• Based on the statement above, recommend “Data Mining Supervised” training to those who have enrolled for “Data Mining Unsupervised”

Association Rules vs. Sequential Pattern Mining
• Look for temporal patterns
• Order/sequence of a & b matters for a rule “b follows a”
• However, what happens in between a & b doesn’t matter
• In the phone faceplates dataset:
  • Associations among items that were bought within the same week were discovered
  • How about finding what customers would buy next week or the week after, if they had bought ‘x’ this week?

Applications
• Identify the appropriate basket
• Identify popular taxi routes
  • Sequential patterns from GPS tracks; spatiotemporal records of taxi trajectories
  • First cluster collocated customers

THANK YOU