An illustrated guide to data-driven questionable decision making.
A quick transform to summarize and present data.
In Alteryx, there are some little tricks that I find myself reusing time and again. One example is the use of the Cross Tab tool to create an in-Alteryx pivot of the data, usually summarized at a high level with a familiar time segment as the columns, to trend out the data and validate that it's accurate.
Since the goal of this is to make the data human-readable, I almost always also take the time to sort out the segmentation so it appears predictably and according to established precedent. For Seagate, we put the enterprise segments first, followed by the client segments and then the flash and consumer segments. This tends to be true no matter which topic is being discussed, and it's easy enough to do with a quick use of the Switch function in a Formula tool against [Consensus Segment], followed by a Sort tool later on, after the Cross Tab tool has been invoked. I tend to do a little cleanup after the fact, deselecting my index dimension using the Select tool.
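For readers who think in code rather than canvases, here's a rough pandas sketch of the same transform. The column names, segments, and numbers are all hypothetical stand-ins, not the actual Seagate data.

```python
import pandas as pd

# Hypothetical shipment detail by segment and fiscal quarter.
df = pd.DataFrame({
    "Consensus Segment": ["Enterprise", "Client", "Flash", "Consumer", "Enterprise", "Client"],
    "Fiscal Quarter": ["FQ1", "FQ1", "FQ1", "FQ1", "FQ2", "FQ2"],
    "Units": [100, 80, 40, 25, 110, 75],
})

# Rough equivalent of the Formula tool's Switch(): assign each segment a sort key
# so rows land in the established presentation order.
segment_order = {"Enterprise": 1, "Client": 2, "Flash": 3, "Consumer": 4}
df["Segment Sort"] = df["Consensus Segment"].map(segment_order)

# Rough equivalent of the Cross Tab tool: segments as rows, quarters as columns.
pivot = (
    df.pivot_table(index=["Segment Sort", "Consensus Segment"],
                   columns="Fiscal Quarter", values="Units", aggfunc="sum")
      .sort_index()                      # the Sort tool step
      .reset_index()
      .drop(columns="Segment Sort")      # the Select tool cleanup step
)
print(pivot)
```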
Commenting Alteryx: Because Alzheimer's runs in the family...
About 62.5% of my ancestors rocked out to the surreal psychedelic trip that is dementia, likely because they tend to live well into their eighties and nineties. I'm certainly not complaining; I'm just saying I forget things!
Like why I built what I built in Alteryx.
I would guess that a fair share of Alteryx users aren't trained programmers, and so they may not be big on the commenting discipline that's standard in that field of practice. That's particularly likely if you're coming from the business side of things. Folks like me were brought up on Excel, where there really isn't much accommodation made for annotation, and even named variables, an example of self-documenting code, are a relative rarity.
So, here's what I do in the process of building out an Alteryx workflow: I use the Comment tool alongside every single tool to articulate what I'm doing and why I'm doing it. Oftentimes when I'm building a new workflow, the annotations are my starting point rather than an afterthought.
Perhaps it's surprising, but when faced with a blank page, I typically find comments to be an easier and more efficient way to get started. By organizing my thought process in the comments first, then adding tools that align to those steps, I can easily order and reorder my operations without stressing about the details of configuring each tool. Every tool has to be configured correctly before it runs without complaining, whereas I can usually jot down the concepts I'm working through in prose much more quickly.
Of course, this is just one way of approaching a blank page and I do tend to get some tools out on the page pretty quickly.
Bottom line: Comments save a lot of time over the long haul, and commenting may even be a faster way to think out a new workflow. It's like showing your work on a math problem. It's usually worth the extra effort.
Put a fork in it: Duplicating data in Alteryx to work around Tableau's limitations
On occasion, I've been asked to produce a fancy table in Tableau - like putting all the subtotals at the end instead of interspersed, like normal. Crazy, right?
It turns out this is harder than one might think; probably even impossible unless you pre-process the data before bringing it into Tableau, 'cause apparently Tableau has some pretty stern opinions about how tables should look.
The use case here is to produce a table with columns of historical weeks of shipments, followed by future weeks of demand, followed by monthly subtotals, actual and demand quarterly totals, a historical quarterly total, plan totals, and performance to plan and to the prior quarter. Essentially, it's the same numbers, just summarized across different-sized and overlapping segments of time.
In short, midway through that use case, Tableau is like "well, fuck all that" and flips the table and goes home to what I imagine is a pretty awesome mancave to brood about how I just insulted its whole raison d'être.
Enter Alteryx.
In this example, I need to arrange my "Actuals" a few ways:
- first the weekly columns (Column Numbers 1-13),
- then the weekly UCD columns (Column Numbers 21-33),
- then the Monthly Actual Totals (Column Numbers 41-43),
- then the QTD Actuals (Column Number 44),
- and finally the Prior QTD Actuals (45), QTD / Plan (51), Prior Attainment (52), and the UCD Qtr Total (53).
For this I use Formula tools after forking the data, and the same Formula tools simultaneously code the [Column Header] labels. In Tableau, I put the [Column Number] pill on the Columns shelf first, followed by the [Column Header] pill, and then hide the headers for [Column Number] so they don't show up in the viz.
Notice that it doesn't really matter that there are gaps in [Column Number]; all Tableau cares about is that they're numbers, so they sort numerically.
I fork the data and filter each branch where necessary, so that I duplicate only the data I want, and then I union the repeated data back together so that the subtotals can coexist in the same table.
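For what it's worth, here's a minimal pandas sketch of the fork-and-union idea. The groupings and [Column Number] values are simplified stand-ins for the real scheme described above.

```python
import pandas as pd

# Hypothetical weekly shipment detail.
weekly = pd.DataFrame({
    "Fiscal Week": ["Wk1", "Wk2", "Wk3", "Wk4"],
    "Fiscal Month": ["M1", "M1", "M1", "M2"],
    "Units": [10, 12, 9, 11],
})

# Fork 1: one column per week (Column Numbers 1..n).
wk = weekly.copy()
wk["Column Number"] = range(1, len(wk) + 1)
wk["Column Header"] = wk["Fiscal Week"]

# Fork 2: monthly subtotals of the same rows (Column Numbers 41, 42, ...).
mo = weekly.groupby("Fiscal Month", as_index=False)["Units"].sum()
mo["Column Number"] = range(41, 41 + len(mo))
mo["Column Header"] = mo["Fiscal Month"] + " Total"

# Fork 3: a quarter-to-date total (Column Number 44).
qtd = pd.DataFrame({
    "Units": [weekly["Units"].sum()],
    "Column Number": [44],
    "Column Header": ["QTD Actuals"],
})

# Union the forks back together; the gaps in Column Number don't matter,
# they just need to sort numerically once they reach Tableau.
table = pd.concat([wk, mo, qtd], ignore_index=True)[
    ["Column Number", "Column Header", "Units"]
].sort_values("Column Number")
print(table)
```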
Now, there is one important risk to this kind of solution: when I duplicate data, I run the risk of having someone come along, not understand what's going on, and misinterpret the results. In this case, one must invoke either the [Column Header] or [Column Number] dimension to de-dupe the data. While this isn't unfamiliar territory for any analyst who pulls multiple snapshots into one table, it is worth mentioning. To alleviate some of this risk, I put a disclaimer in a special warning calculation at the top of the list of dimensions in the Tableau workbook as a courtesy to my fellow analysts who might open the hood on it.
Working within the constraints of Business Objects
For most folks at Seagate, the official source of truth is SAP Business Objects. It offers a rich semantic business layer on top of our EDW and is well-governed by a talented and responsive team within our IT department.
Answering a typical business question usually entails the simultaneous consideration of multiple datasets originating out of Business Objects. Take, for example, the daily evaluation of our in-quarter shipment performance compared to prior-quarter performance and two predictive data sets: our quarterly financial plan and the latest demand signal. In my world, I'd need to pull three different data sets: Actuals, Plan, and Demand. In any well-groomed dataverse, data sets such as these are maintained so that there's substantial dimensional commonality across all three.
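As a sketch of what that dimensional commonality buys you (with made-up column names and numbers), the three pulls can be stacked into one long table with a label identifying the source, and compared side by side with a simple pivot:

```python
import pandas as pd

# Hypothetical pulls, each sharing the same core dimensions.
actuals = pd.DataFrame({"Product": ["A", "B"], "Fiscal Week": ["Wk1", "Wk1"], "Units": [100, 50]})
plan    = pd.DataFrame({"Product": ["A", "B"], "Fiscal Week": ["Wk1", "Wk1"], "Units": [120, 60]})
demand  = pd.DataFrame({"Product": ["A", "B"], "Fiscal Week": ["Wk1", "Wk1"], "Units": [110, 55]})

# Tag each dataset and union them; the shared dimensions make the
# side-by-side comparison a one-line pivot downstream.
combined = pd.concat(
    [df.assign(**{"Data Set": name}) for name, df in
     [("Actuals", actuals), ("Plan", plan), ("Demand", demand)]],
    ignore_index=True,
)
print(combined.pivot_table(index=["Product", "Fiscal Week"],
                           columns="Data Set", values="Units"))
```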
At Seagate, this is possible from within Business Objects, and for particular use cases I do deliver results directly through an automated Business Objects report that is scheduled, exported, and emailed to users.
That said, Business Objects and Excel reports are rarely my customers' preferred presentation layer and seldom my preferred final data preparation layer:
To the former point, my users often demand features such as mobility, interactivity, or embedding that the Tableau platform offers and our Business Objects environment simply does not.
To the latter point, my team often gets asked variations on the same question, and as a matter of operational efficiency, we'd all prefer to offer the same answer to the same question. We can do that by reusing the same Business Objects queries and storing their results in ephemeral staging tables that we can all reference simultaneously for any particular business case.
Since we're now in the business of building staging tables, we can also use multiple queries within the same subject matter to work around the row-count limitations imposed by both the Business Objects platform and Excel files. We refresh historical quarters using big quarterly queries, prior weeks with weekly queries, and in-week results on an hourly schedule that matches the refresh rate of our EDW platform, which pulls from our transactional platform hourly.
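A rough sketch of how those differently scheduled extracts can be stitched back into one staging table (the tiny inline DataFrames stand in for the .xlsx extracts each schedule writes): stack the tiers and let the fresher, finer-grained extract win wherever they overlap.

```python
import pandas as pd

# Hypothetical extracts from the three refresh tiers (normally read from the
# .xlsx files that the quarterly, weekly, and hourly schedules write out).
quarterly = pd.DataFrame({"Fiscal Week": ["FQ1-Wk01"], "Units": [500]})                  # closed quarters
weekly    = pd.DataFrame({"Fiscal Week": ["FQ2-Wk01", "FQ2-Wk02"], "Units": [90, 95]})   # prior weeks
hourly    = pd.DataFrame({"Fiscal Week": ["FQ2-Wk02", "FQ2-Wk03"], "Units": [110, 40]})  # in-flight week

# Stack the tiers and keep the freshest copy of any overlapping period, so
# stale history never has to be re-queried from Business Objects.
staged = (
    pd.concat([quarterly, weekly, hourly], ignore_index=True)
      .drop_duplicates(subset="Fiscal Week", keep="last")
)
print(staged)
```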
Now, that might seem like a big jump into ominous Shadow IT territory, but consider all the benefits:
- We've drastically reduced unnecessary refreshes of stale data, thus diminishing the resource demands coming from our team on an aging BI infrastructure.
- We've de-duped queries that hitherto were sometimes prone to entropy, especially when seen only from 20,000 ft. This also reduced demand on our BI infrastructure.
- Our team has standardized on augmented data tables, and the remaining queries are seen by myriad sets of eyes and from all kinds of angles, which helps ensure our source of truth's integrity.
- Because the staged data is augmented with a touch of metadata and available at a detailed level, we're ready to quickly answer the next business question from within our preferred presentation layer, circumventing Business Objects queries that can take up to an hour to complete.
While I'm at it, I add a bit of metadata to facilitate the refresh process, along with some quality-of-life improvements for quicker analysis and a cleaner presentation.
I don't have a crystal ball, but I can still predict the future.
In certain circumstances, a running sum is a great way to predict the future. For example, if I take all the products I have on-hand, and add all the products I plan to build in the next few weeks, and subtract out the demand I plan to sell in those weeks, I should end up with my projected on-hand.
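In code terms (with made-up numbers for one part), that projection is just a cumulative sum of builds minus demand on top of the starting on-hand:

```python
import pandas as pd

# Hypothetical weekly plan for one part: builds in, demand out.
plan = pd.DataFrame({
    "Week":   ["Wk1", "Wk2", "Wk3", "Wk4"],
    "Builds": [50, 0, 40, 30],
    "Demand": [20, 0, 60, 25],
})

starting_on_hand = 70
# Projected on-hand = starting units + running sum of (builds - demand).
plan["Projected On-Hand"] = starting_on_hand + (plan["Builds"] - plan["Demand"]).cumsum()
print(plan)   # 100, 100, 80, 85
```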
In most businesses, keeping track of inventory is a really good idea, because it means your business' capital isn't just sitting idle and depreciating in value. We ain't sellin' fine art 'round these parts!
The first-order problem is that there are often a lot of gaps in the data. For a given part number, perhaps we don't plan to sell or build that item in week 2. If there's no Wk 2 record for that part, a running sum tool will simply move on from the Wk 1 record to the Wk 3 record. If I then display the example results by week, there's no record for Wk 2 when, in fact, we would have had 100 units lying around in that week.
The way to fix this is to separately summarize every possible product and every possible time period at the lowest granularity, then append one to the other so you have a complete, empty scaffold of every product-and-time combination. Then join the data set that actually has values by product and time onto that empty scaffold, and make sure you keep all of the empty scaffold records on that side of the join.
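Here's a minimal pandas sketch of that fix, reusing the made-up numbers from the earlier snippet but with the Wk2 row missing entirely:

```python
import pandas as pd

# Same hypothetical plan as before, but the part has no Wk2 activity at all,
# so the Wk2 row simply doesn't exist in the data.
plan = pd.DataFrame({
    "Part":   ["P1", "P1", "P1"],
    "Week":   ["Wk1", "Wk3", "Wk4"],
    "Builds": [50, 40, 30],
    "Demand": [20, 60, 25],
})

# Build the complete, empty scaffold: every part crossed with every week.
parts = plan[["Part"]].drop_duplicates()
weeks = pd.DataFrame({"Week": ["Wk1", "Wk2", "Wk3", "Wk4"]})
scaffold = parts.merge(weeks, how="cross")   # the "append one to the other" step

# Join the real values onto the scaffold, keeping every scaffold row,
# and treat the missing weeks as zero activity.
filled = scaffold.merge(plan, on=["Part", "Week"], how="left").fillna(0)

starting_on_hand = 70
filled["Projected On-Hand"] = (
    starting_on_hand + (filled["Builds"] - filled["Demand"]).cumsum()
)
print(filled)   # Wk2 now shows the 100 units that were really on hand
```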
I wish I could take credit for this one, but it's all thanks to Mike Mixon (@mix_pix); check out his Twitter feed.
Keeping things fresh: Adding [Latest Snapshot Date]
One of the key additions I make to our team's staging tables is a Time attribute called [Latest Snapshot Date], which I append to each subject-matter dataset while converting it from the .xlsx file that originates out of Business Objects into the .yxdb file that serves as an input to our more involved data prep workflows.
In most situations, analysts are keen to cite their sources, all the way down to the snapshot timestamp used. This helps immensely when questions of freshness come up.
Personally, I like to provide a small table that lists each dataset and a timestamp. This is trivial to produce in a presentation layer if each dataset is equipped with something like a [Latest Snapshot Date] attribute. I derive this attribute in one of two ways:
1) Transactional / Live Data
With a transactional or pseudo-transactional dataset, where snapshots aren't provided, I capture the write time on the .xlsx file written by Business Objects and append it as the [Latest Snapshot Date].
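A small sketch of that first approach (the file path is hypothetical): grab the file's last-write time and stamp every row with it.

```python
import pandas as pd
from pathlib import Path

source = Path("actuals_export.xlsx")   # hypothetical Business Objects export
df = pd.read_excel(source)

# Stamp every row with the file's last-write time as its snapshot date.
df["Latest Snapshot Date"] = pd.Timestamp(source.stat().st_mtime, unit="s")
```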
2) Snapshot Data
It's a bit different when a dataset has a series of snapshots. I'll use the Summarize tool to find the maximum value within the snapshot field of the dataset and append that max snapshot as the [Latest Snapshot Date].
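And the second approach, assuming the dataset already carries a snapshot column (the column name and dates here are made up):

```python
import pandas as pd

# Hypothetical snapshot-style dataset with a [Snapshot Date] column.
df = pd.DataFrame({
    "Snapshot Date": pd.to_datetime(["2019-06-24", "2019-07-01", "2019-07-08"]),
    "Units": [100, 105, 98],
})

# Rough equivalent of the Summarize tool's Max: stamp every row with the newest snapshot.
df["Latest Snapshot Date"] = df["Snapshot Date"].max()
print(df)
```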
In some cases, the timestamp is provided as a string; in that situation I'll use a RegEx tool to convert it from a string into the date-time format preferred by Alteryx. Regular expressions are a topic worthy of their own series of articles. More on them in a future post!
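If the timestamp arrives as a string, the cleanup might look something like this (the input format is just an example):

```python
import pandas as pd

# Hypothetical string timestamps as they might arrive in the export.
raw = pd.Series(["Snapshot: 2019-07-01 06:30", "Snapshot: 2019-07-08 06:30"])

# Pull the date/time portion out with a regular expression...
extracted = raw.str.extract(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")[0]

# ...and parse it into a proper datetime that Alteryx and Tableau can sort on.
snapshots = pd.to_datetime(extracted, format="%Y-%m-%d %H:%M")
print(snapshots)
```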
Questionable fashion sense? Find your company's brand guide.
After all, my blog is called "questionabledecisionmaking"... so I'll confess that for almost a year I wore pants cut two inches too high, because I tend to make questionable decisions, particularly in the fashion realm.
However, I know just enough about color theory to know that I probably shouldn't leave the walk-in closet without a consult from my better half. To that end, your corporate overlords have probably sunk a vast sum of money into a Brand Guide, and you should absolutely leverage that investment.
As an example, Seagate's Brand Guide is a publicly available .pdf, and it is a fantastically useful grimoire for those of us at Seagate who don't secretly harbor dreams of being a fashion buyer.
The most immediate value to me is in the color palette and hex codes.
By referencing this graphic from the brand guide, I can make any viz I build look right when it's embedded in the inevitable PPT prezo or frame. Further, it distinguishes my work from that of many other analysts in the company, because most people simply don't take the time to recolor their work from the default palettes, or perhaps aren't aware of this resource.