PerformancePoint Blog

A Blog about PerformancePoint and Microsoft BI technologies. Your host is Russell Christopher

The Microsoft Clustering Algorithm, Tableau, and More Cowbell

Posted by Russell on February - 13 - 2011 with 2 Comments

As a follow-up to my work with the Time Series algorithm, I’ve been playing with the SSAS clustering algorithm over several cups of Bald Guy coffee.

One will generally use the clustering algorithm to segment “things” (customers into “customer clusters”, for example) based on similarities between attributes that describe the “things” in question.

The Adventure Works sample cube has a ready-made data mining structure which contains a customer-focused clustering model. It’s called Customer Clusters and takes the following inputs:

  • Commute Distance
  • Education
  • Gender
  • Home Owner
  • Marital Status
  • Number of Cars Owned
  • Number of Children at Home
  • Occupation
  • Total Children
  • Yearly Income

Rather than recreate the wheel, I’ll use the sample.

The first thing we’ll probably want to understand is how different combinations of attributes and attribute values combine to create distinct clusters of customers – otherwise known as a “cluster profile”. SQL Server Analysis Services has a quick but unsupported system stored procedure named GetClusterProfiles() which will give us this information, and since it’s the weekend, I don’t care if it’s unsupported. I’m going to execute this sproc against my data mining model:

1 - System Sproc

Click and view the image above as there’s a lot of good information to be found. The columns 001-010 represent 10 distinct clusters of customers that were discovered based on the data the model was trained on. Attribute Name and Attribute Value show the inputs and potential input values that were found in the data. For each attribute name/value pair, you’ll see a support and probability value. Support represents the number of cases (rows) we found that fit the criteria in question. Probability tells us the probability a particular attribute name/value pair will show up in the customer cluster in question

For example in Customer Cluster 001, there is a ~ 42% chance the customer lives less than a mile away from work, about a 1% chance the person lives 10+ miles away from work, and a ~25% chance they live within 1-2 miles, etc.

As an aside, it’s normal to actual rename clusters from 001, 002, 003 to something more meaningful like “Soccer Moms” , “Yuppies” or “Hipsters” based on our read of the primary drivers of each cluster. I haven’t bothered, so we’ll refer to our clusters as 001-010.

As I mentioned, I initially wanted to use GetClusterProfile()’s output as my data source, but found that I couldn’t use OPENQUERY and the Microsoft MSOLAP provider to execute it. I’m guessing that the fact we return an unknown number of columns (based on the number of clusters we find) could be causing the problem.

Instead, I had to use multiple DMX statements “UNIONed” together. Each DMX SELECT returns cluster attribute name/value pairs for a single attribute. Here’s one for Commute Distance:

1.5 - DMX for Single Attribute

There’s something important here to note, by the way. Notice the WHERE clause where I filter out any attributes with less support than 500 cases. Choosing 500 was an arbitrary decision on my part, but it impacted my results in a big way. In the output above, there are now only 3 Commute Distance nodes returned for customer cluster 001, while there were 5 when I executed GetClusterProfile() in the first screenshot. Is this good or bad? Hard to say – there were only 35 cases supporting the “1% probability of living 10+ miles away” node in cluster 001. Is 35 statistically significant enough to include?

2 - OpenQuery

(By the way, after the whole discussion around SUPPORT vis-à-vis “Customer Commute” I managed to completely FORGET to add it to the final query above! Since this is just an example I decided not to go back and correct my oversight, however – don’t be surprised when you don’t see the Commute Distance attribute in any of the following vizes.)

After dropping our final SELECT statement into a Tableau “Custom SQL” data connection, we’re ready to build a Viz:

  • The Node_Name (The name of the cluster: 001, 002, etc.) pill lands on Columns
  • Attribte_Name and the Probability measure get dropped in Rows
  • Attrbute_Value lives in the Color shelf

You can now easily see how each attribute value drives the composition of different customer clusters (001-010, on columns).

3- Cluster Profile

I found there were mixed text and number values in the Attribute_Value column, and this was painful. For example, I had to deal with values like “Married” and “Single” (Marital Status) and .056365 (Number of Cars Owned in cluster 010) in the same field. I used Groups to simplify things a bit, but I’d probably massage the data model / query a bit more if I were doing this “for real” rather than “for fun”.

After we understand how customer clusters are composed, we can really have some fun.

The Adventure Works sample SSAS database contains a second cube called “Mined Customer”. Mined Customers combines two data mining models (Customer Clusters and an Association Rules model for market basket analysis) with lots of other metrics like sales, profit. The mash up of our “default” metrics and dimensions with the “mining insights” gives us some exciting possibilities for visualization.

In the viz below, we’re able to look at average sales and profit margin by our ten customer clusters. One can see that Cluster 5 drives the highest profit margin with a relatively large number of customers:

4  - Clustered Profit 1

And the fun continues! By dropping one of the “demographic inputs” on the Level of Detail shelf, we can begin exploring within individual clusters. I’m going to drill in by Occupation.

5-Clustered Profit 2

It now becomes apparent that skilled manual workers in Cluster 6 are more profitable than Cluster 5 as a whole. By putting together my Quick Filters and Demographic fields on the Level of Detail shelf I also have the ability to completely explore the sales and profitability of my customer clusters.

More Cowbell

And I have an admission to make: I’m obsessed. Since Gartner released its 2011 Magic Quadrant for Business Intelligence Platforms report a few weeks ago, I’ve had MQ on the brain. I’ve been thinking about MQs non-stop – it’s my cowbell. In fact, I got a fever, and the only prescription…is more cowbell!!”

So, I decided I’d take the previous viz and turn it into a MQ. I essentially zoomed in a tad, and dropped in four annotation areas with no labels and played around with background shading. Voila, more cowbell!

6 - Clustered Profit MQ

Next time, Association Rules – otherwise known as market basket analysis

2 Responses so far.

  1. Tim Roberts says:

    Nice Work– look forward to some more!

  2. [...] of weeks I’ve shared my thoughts on using Tableau to visualize Microsoft’s Time Series and Clustering algorithms. Today, I’ll finish off with Association [...]


  • RSS
  • Delicious
  • Digg
  • Facebook
  • Twitter
  • Linkedin
  • Russell's About.Me Profile