PerformancePoint Blog

A Blog about PerformancePoint and Microsoft BI technologies. Your host is Russell Christopher

Visualizing Microsoft Market Basket Analysis with Tableau

Posted by Russell on February - 24 - 2011 with 3 Comments

Over the past couple of weeks I’ve shared my thoughts on using Tableau to visualize Microsoft’s Time Series and Clustering algorithms. Today, I’ll finish off with Association Rules.

The Association Rules algorithm focuses on itemsets, which you can think of as a bundle of “things” (attributes) which frequently go together. For example, a popular itemset might be {Peanut Butter, Jelly, Bread}. Itemsets don’t have to be related to shopping, however. A perfectly valid itemset might be {Red Sox Fan, Cretin, Convict}: We’re just creating groups of attributes which go together.

Given many itemsets, the Association Rules algorithm discovers most frequently occurring ones. Then, it creates association rules using those frequently occurring itemsets.  Here’s what a rule looks like:

{Peanut Butter = Existing , Jelly = Existing => Bread = Existing}

If we see Peanut Butter and Jelly in someone’s shopping cart, they are likely to buy Bread, as well.

The concepts of support, probability, and importance are used to tell us how many times we see an item or itemset, the confidence we have in the rule, and the “lift” we might expect for a rule.

If you’d like to understand more about these concepts, read this thread by Jamie Maclennan.

Here’s a basic query to return information from our data mining model:

SELECT * FROM OpenQuery (SSAS,’SELECT NODE_CAPTION as [Rule], NODE_SUPPORT AS [Support], MSOLAP_NODE_SCORE AS [Importance],

NODE_PROBABILITY AS [Probability] from

[Association].content WHERE NODE_TYPE=8′)

Executing this puppy will return all our rules and their support,  probability & importance values. We’re looking for rules with relatively high importance (> 0) which indicate that the existence of peanut butter and jelly will impact bread buying positively:

Original Query

If you open the image above, you’ll see that each rule looks pretty much like a “formula”. To make these formulae useful, we’re going to need to parse them into individual fields that be used as labels on an axis. To do that, we write lots of kludgy string parsing code:

SELECT
    [Rule],
    — Cast NTEXT to nvarchar then determine number of items in itemset by keying on number of = symbols before the -> characters
    — Divide by 2 since DATALENGTH against NVARCHAR will be twice the string length
    (DATALENGTH(LEFT(CAST([rule] as nvarchar(300)),CHARINDEX(‘->’,cast([RULE] as nvarchar(300)))))- DATALENGTH (REPLACE(LEFT(CAST([rule] as NVARCHAR(300)),CHARINDEX(‘->’,[RULE])),’=',”)))/2   as [ItemSet Size],
    [Support],
    [Importance],
    [Probability],
    — Grab first item in itemset
    LEFT(CAST([rule] as nvarchar(300)), CHARINDEX(‘=’, CAST([rule] as nvarchar(300))) -1) as [Basket Item 1],
    — If two items exist in itemset, grab second using kludgy string parsing
    CASE (DATALENGTH(LEFT(CAST([rule] as nvarchar(300)),CHARINDEX(‘->’,cast([RULE] as nvarchar(300)))))- DATALENGTH (REPLACE(LEFT(CAST([rule] as NVARCHAR(300)),CHARINDEX(‘->’,[RULE])),’=',”)))/2
        WHEN 2 Then SUBSTRING(CAST([rule] as NVARCHAR(300)),CHARINDEX(‘,’,CAST([rule] as nvarchar(300)))+2, ((CHARINDEX(‘->’,cast([RULE] as nvarchar(300)))-12) – (CHARINDEX(‘,’,CAST([rule] as nvarchar(300)))+2) ))
    END as [Basket Item 2],
    — Recommended Item
    REPLACE(RIGHT(CAST([rule] as NVARCHAR(300)), (DATALENGTH(CAST([rule] as NVARCHAR(300)))/2) – (CHARINDEX(‘->’,cast([RULE] as nvarchar(300)))+2)),’ = Existing’,”) as [Recommended Item]
FROM
OpenQUery (SSAS,’SELECT NODE_CAPTION as [Rule], NODE_SUPPORT AS [Support], MSOLAP_NODE_SCORE AS [Importance],
NODE_PROBABILITY AS [Probability] from
[Association].content WHERE NODE_TYPE=8′)

Writing the SQL above was by far the ugliest bit. On to the results:
Updated Query

The resultset above adds “Basket Item 1”, “Basket Item 2”, and “Recommended Item” fields by parsing the the rule column.

Now for the fun stuff. First, let’s look at our rules by Importance and Probability:

3 - Association Rules

Next, lets look at the actual products which are being recommended:

4 - Recommendation

Support is being used you determine our bubble size, so we can see below that Mountain Bottle cages are recommended by 5 different rules:

5 - Recommednations

By highlighting a specific item (product), we can see that while we generated lots of rules which recommend it, those rules aren’t very important – I’d probably ignore these.

6 - Highligt

Finally, let’s see which items  tend to go together in the same basket. Here we can see that the Mountain Bottle Cage gets grouped with a lot of other items.

7 -Recommendations per Product

3 Responses so far.

  1. Joe Mako says:

    In regards to your stacked bar chart, you said: “Here we can see that the Mountain Bottle Cage gets grouped with a lot of other items”

    I may be interpreting it incorrectly, as I do not fully understand the business logic, but my impression is that “Sport-100″ gets grouped with a lot of other items because out of the 15 dimension values in “Basket 1″, 11 of them have a “Sport-100″ as a “Basket 2″ item.

    Also I do not see a color for “Mountain Bottle Cage” does it only exist as a “Basket 1″ value? Does it make sense to look at this from just what was in basket 1 and Basket 2, or would it make sense to look at this from any pair, regardless of order, or is order important?

    You provided the example:

    {Peanut Butter = Existing , Jelly = Existing => Bread = Existing}

    is

    {Jelly = Existing, Peanut Butter = Existing => Bread = Existing}

    also true?

    Can you publish your workbook to Tableau Public?

    • Russell says:

      Hey Joe!

      You are 100% correct.
      I somewhat arbitrarily used “Basket Item 1” on the Columns Shelf and “Basket Item 2” on the Color shelf. If we reverse that, we’d see that ”Sport-100 “and “Water Bottles” both were part of many more itemsets than “Mountain Bottle Cage”.
      I chose to stick with my approach because the previous viz showed that rules that included “Mountain Bottle Cage” were generally more important (and that in and of itself is difficult for you to see because Mountain Bottle Cages generally appear in basket item 2, and the level of detail I have in the screenshot is Basket Item 1).
      Itemset order doesn’t make a difference in this case, and so we could indeed look at this in the way you suggest. The challenge however is how to split up the data to make this possible:
      If we choose to ignore position, it becomes difficult to completely “model” a rule using one record.
      Right now I have the luxury of doing this for a resultset with the following fields:
      Resultset shape:
      Basket Item 1
      BasketItem 2
      Itemset Size
      Recommended Item
      Full Rule
      Importance
      Probability
      Support
      ——————————–
      Record:
      Peanut Butter, Jelly, 2, Bread, {PB=Existing, Jelly=Existing => Bread=Existing}, 1.2, .8, 2010
      If we treat all items in the itemset the same regardless of position, I think we’d have to break this single record into two records and do something like lean on the “full rule” {PB=Existing, Jelly=Existing => Bread=Existing}, as a way to group them together later on:

      Peanut Butter, 2, Bread, {PB=Existing, Jelly=Existing => Bread=Existing}, 1.2, .8, 2010
      Jelly, 2, Bread, {PB=Existing, Jelly=Existing => Bread=Existing}, 1.2, .8, 2010
      I’m not sure if this approach would make the viz easier or harder to create, or maybe there’s a smarter way to approach the data shaping of the result than what I can come up with. What do you think?
      I actually attempted to upload this to Public, but one of my datasets is based on SSAS, and evidently Analysis Services isn’t “extractable”. Do you think it would still be worthwhile to do so even if I have to delete the “Clustering” sheets from my previous post?

  2. Jeannine says:

    Love this example of a “valid itemset” {Red Sox Fan, Cretin, Convict}. Sometimes I think the name makes us all forget that association analysis is good for things other than shopping baskets!


  • RSS
  • Delicious
  • Digg
  • Facebook
  • Twitter
  • Linkedin
  • Russell's About.Me Profile