Friday 23 October 2015

Simplify so my grandmother can understand

It’s always interesting to watch as technologies and business strategies gain momentum and then ebb and flow in popularity. Big Data has been a prime example, with journals, blogs and social media oscillating between hype and stagnation. The result, especially in the intensive, opinion-rich online world, is a mix of views ranging from “10 reasons why big data is failing” (e.g. Forbes) to “the ten most valuable big data stories” (e.g. Information Age). Extremes always grab attention.

When digging a bit deeper, these conflicting views actually highlight some interesting similarities. Firstly, many of the reasons for failure are not big data or analytics-specific issues, but a reflection of poor corporate or enterprise strategy and focus that could relate to any project. I’d include factors such as a lack of clear business objectives, working in silos rather than to an enterprise strategy, a lack of communication, and misalignment of business and IT objectives. None of this is a big data problem.

An area that gets much attention is the skills needed for big data, with a specific focus on Data Science. This is much more of an analytical issue, but again, digging into some depth reveals that the underlying issues are often communication and terminology. With a clear definition of what an organisation is trying to achieve, it becomes easier to understand what skills are needed. From this, like any project, it’s easy to develop a skills matrix and determine which skills are light or missing in the organisation, which gaps can be addressed by training, and which by hiring. Organisations that assume the second step in defining a big data project is to hire a data scientist will be on a futile unicorn quest.

A common theme among the success criteria is focus and clarity: a focus on objectives and a clarity of purpose, but without pre-supposing outcomes. This is a tricky balance and requires an open-minded approach. This is where the real skill necessity emerges:
  • How do you keep an open mind and not set out with a preconception of what a piece of analytics will discover?
  • How do you let the data guide you, reading, analysing and interpreting with the flexibility to move in new directions as the story unfolds?
  • How do you try out new ideas and techniques, and incorporate new data, taking unexpected detours on the journey?
  • How do you read the data while avoiding red herrings, making sensible, reasoned observations and avoiding traps (correlation vs causation being a prime example)?

The most telling sign that this approach has been successfully achieved is in how the final conclusions are stated. Ideally they will simplify and summarise, explaining not just what, but why (and also why not, to show what was tried and found not to be useful). “Simplify, so my grandmother can understand” is how one CEO put it to me.


As a direct consequence the emerging focus is on clearer communication of findings, and on topics like visualisation and storytelling. Many organisations are achieving much more with analytics and big data; their quest now is to expand the horizon through better communication, to ensure that projects become enterprise initiatives.

Monday 28 September 2015

Summary of Hadoop Use Cases

A summary paper is now available from the recent research project into published use cases of Hadoop adoption. The seven-page paper summarises the use cases by identifying the key uses to which Hadoop is being put, the major industries being used as case studies by the Hadoop distributors, and the key benefits identified by organisations that have implemented Hadoop. The paper is available here and requires no registration.

Friday 18 September 2015

Hadoop use cases: research underway to identify common themes

Within the next week or so I should conclude a piece of research that summarises the published customer case studies for Hadoop adoption. It's been a fascinating project to work on. With all the hype around Big Data and technologies like Hadoop, it's often difficult to get a clear, objective view of real usage. The research has examined close to 200 customer case studies to identify common themes in adoption, usage and benefits, the aim being to provide a reference for those looking to adopt Hadoop. The initial findings have shown some interesting insights:

  • whilst much of the focus on Hadoop has been around the perception of it providing a low-cost analytical platform, driven by a combination of its open source foundation and use of commodity hardware, this is not the most referenced benefit
  • instead it is scalability that is most commonly quoted, with almost two-thirds (65%) of documented customer stories highlighting this as a key factor in adopting Hadoop. A common driver is that organisations identify a need to retain more history than they could previously handle, or to explore new data sources with inherently high volumes of data
  • the next most reported benefit, directly related to scalability, was speed of analytics (the time taken to run queries), with 57% of descriptive case studies highlighting this advantage
  • the 'cost driver' comes in third place, with 39% of customers specifically highlighting the savings in adopting Hadoop
What's always interesting when looking at factors like scalability of analytics and speed of queries is to understand 'in comparison to what'. Many of the use cases undertook comparative benchmarks against other technologies, but many migrated up from other platforms, most commonly MySQL or SQL Server (by number of customers). In these latter cases the 'faster' argument is always a bit thin; new commodity hardware is always going to be more performant than older solutions. Another reason why it's so difficult to get objective information on user adoption of technologies like Hadoop.

I'll add another blog post here once the research summary is available, with a link for download. Alternatively drop me an email kevin [at] datafit.co.uk and I'll email it when ready (your email address won't get added to a mailing list or distributed further).

Friday 11 September 2015

The return of Arthur Andersen. What’s that about?

An interesting mix of news this week, with an announcement that Arthur Andersen will, according to The Times, ‘rise from the ashes’ in the form of a new French entity, providing a "unique international business services network integrating an authentic inter-professional dimension". Wow! After the dramatic demise of Arthur Andersen through the Enron scandal, that's a pretty bold move. What's even more interesting is that there is now a dispute between this entity and another in the US over the use of the Andersen brand.

Getting attention is an increasing challenge in a busy, complex, dynamic world, and doing something counter-intuitive is one way to achieve it. Given its high profile, Apple takes a fairly mainstream approach to getting attention. This week, with typical drama, they put on a slick presentation announcing a range of product updates. These included the launch of the iPad Pro, offering 'desktop PC performance' in a bigger-format iPad (with a 12" screen) that is 22x faster than the original iPad yet weighs pretty much the same. Further compression of compute power. This has the potential for another step-change in the way people work, creating further enhancements to personal productivity; it was interesting to see Microsoft present on stage at the iPad Pro launch.

Adding more power, capability and portability to tools like the iPad gives consumers and corporations wider access to information, and the simplicity of such tools further enhances the spread of information access. Most people have looked on in awe at how easily a toddler picks up the iPad interface, and at the puzzlement when 'swiping' a flat-screen TV doesn't have the same effect.

However, adding performance, capability and easier access to more information can also create complexity, and as with many information or analytics problems, getting to a detailed, helpful piece of information is an increasing challenge. This is probably why Gartner positions 'self-service delivery of analytics' at the top of its Hype Cycle for Emerging Technologies [what Gartner calls the 'peak of inflated expectations']. An increasing number of vendors are entering the crowded analytics market claiming to offer the best thing since sliced bread: letting users intuitively gain simplified access to information that allows them to make real decisions. There are a few good technologies out there, but there is also a lot of hype; buyers should beware of claims of magic and avoid the snake oil. There is still a massive challenge in tackling large, complex data sets and transforming them into insightful, actionable information. It's very difficult to self-serve or automate things that are so complex. It'll be interesting to see how this domain unfolds over the next few years; there's no doubt that software capability will be dramatically enhanced to make the most of the data.

And whilst you're waiting for self-service analytics on your shiny new iPad Pro, there's a raft of other things you can use it for; just don't impede others' clarity of vision by taking selfies next time you're 'experiencing' a music festival or gig.

Sunday 6 September 2015

The impact of TV watching on GCSE grades: causation or correlation?

This week a 'new' research study, coinciding with the end of the UK school summer holiday, alerted parents to the alarming perils of TV for their children's exam potential. The study received widespread coverage across the national and local press, with some attention-grabbing headlines, including:
  • Watching TV seriously harms GCSE results, says Cambridge University - The Telegraph
  • Each hour schoolchildren spend watching television sees GCSE results fall by equivalent of two grades, says new research - The Independent
  • An extra hour of TV a day costs two grades at GCSE - The Times
  • Teenagers who watch screens in free time 'do worse in GCSEs' - The Guardian
  • Extra screen time 'hits GCSE grades' - BBC
The majority of articles took the research findings at face value (watching TV results in lower exam grades) and assumed a cause and effect: the common mistake of treating a correlation as a causation. Just because there is a relationship between two things (correlation) does not mean that one causes the other (causation).

A good explanation of the dangers of mistaking correlation for causation, and a related example, can be found in the excellent Freakonomics book by Stephen J. Dubner and Steven D. Levitt. This identified a study which found that children got better exam results if their homes had more books. Whilst there is a connection, it's not likely to be causal: the presence of many books is likely an indicator of the parents' interests, and therefore of their parenting style and approach. Upbringing and parenting are much more likely to have a causal effect. One district took the flawed approach of responding to the original study by sending two books to all homes with children, assuming this would 'fix' the problem.

The same problem applies to this current news coverage (though the research itself was carried out in 2005-2007, so its age may reduce its relevance; much has changed in the last decade), as the headlines assume that the amount of TV watched is the cause of exam pass-rate variation. There is very limited coverage of the extent to which the two are connected, even though the researchers had some awareness of the risk:

The BBC quoted lead author Dr Kirsten Corder: "We followed these students over time so we can be relatively confident of our results and we can cautiously infer that TV viewing may lead to lower GCSE results but we certainly can't be certain. Further research is needed to confirm this effect conclusively, but parents who are concerned about their child's GCSE grade might consider limiting his or her screen time." Dr Corder suggested there could be various reasons for the link, including "substitution of television for other healthier behaviours or behaviours better for academic performance, or perhaps some cognitive mechanisms in the brain".

This is further backed up by detail in the research paper:

Our analyses are prospective therefore allowing cautious inferences about direction of association; however it would be impossible to tell whether reductions in screen time caused an increase in academic performance without a randomised controlled trial.

There is clearly a wide range of potential factors in upbringing that could have influenced the results. For example, the research adjusted for 'deprivation' by using a postcode-based scoring indicator. But as the authors indicate, more work needs to be done to provide greater depth of analysis.

'Screen time', whether TV, internet or games, is clearly a factor to be balanced in children's free time, and few would argue against it having limits. Whilst the news headlines seem to over-emphasise the relationship, it would be interesting to know how many readers shift their behaviour in the coming weeks as their children head back to school, or whether their underlying parenting style, modified from their own upbringing, has the most ongoing impact.

And if you're still not convinced by the risk of mistaking correlation for causation, then take a look at this site of 'Spurious Correlations', my favourite being the year-on-year correlation between the 'Number of Japanese cars sold in the US' and the level of 'Suicides by crashing of motor vehicle'.

Friday 28 August 2015

Don't lose sight of what's important

This blog entry is short, by necessity of the fact that I'm typing (very slowly) with one hand, having broken my arm as a result of a rather silly cycle manoeuvre. However, spending time at the hospital neatly provided me with an observation for today. Healthcare is widely seen as a key area of potential enhancement through the use of data and analytics, and many technology vendors like to showcase examples from healthcare: demonstrating how data can enhance health predictions, how IoT monitors can detect early-stage issues, and how detailed analytics can enhance operational efficiency and the effective use of limited resources. There are many really good use cases for healthcare data.

So I was intrigued to spot a 'dashboard' report in my hospital's A&E department [emergency room], though the contents proved to be somewhat disappointing. The report had four core elements:
  • summary of responses to a questionnaire that asked "how likely would you be to recommend our department to friends & family"; with a bar graph to depict the monthly volume of responses, together with another bar chart and a pie chart to show the split of responses for the last month, and also a table of data. So three depictions of the same data, to highlight that 80% of people would be highly likely to recommend the department. Incidentally 'highly recommend' is the first option in the SMS questionnaire. I'm sure they test the questionnaire by reversing the sequence of responses to ensure there's no bias in how the question is asked.....
  • a list of comments received with the questionnaire responses - there were a handful and included "." and "comment" - [yes seriously]
  • a highlight that the department had received 5 letters praising the service; listing quotes from each letter
  • a highlight that 6 complaints had been received; with a single word bullet for each
I'm not going to name the hospital, as this is just illustrative of some generic weaknesses in using data. Amongst all the discussion of big data, advanced analytics and machine learning, there is often a fundamental failure to focus on core objectives when summarising and communicating data:

What's the key objective?
Why are we producing the dashboard? Hospital A&Es get a lot of attention, due primarily to the cost of the service and the fact that by nature patients need urgent, timely attention. But equally there's been concern that the service is abused by non-urgent cases, and that this impacts the care available for serious cases. Understanding the motives for any analytics or summary is key; without a clear objective there will be incorrect focus. For example, this unit is now called the 'Emergency department': the 'accident' element has been dropped, so it seems clear the hospital is pursuing an approach of ensuring the resources are used only for urgent cases. This is backed up by some other clear graphics that depict what the department should be used for.

Who's the audience?
Understanding the objective leads on to understanding the audience. This was a public dashboard (I expect they have a more detailed internal version), but the public version should focus on the public objectives. Providing the public with a summary of how recommendable the service is seems questionable in purpose and smacks of self-congratulation without any clear objective.

What do they need to know?
I can immediately identify some key things I'm interested in as a patient: what's the typical wait time (for the kind of issue I have)? How does wait time vary by day of week or time of day? (My broken arm needs attention, but I could come earlier or later in the day if it speeds my progress.) And how do these wait times compare to other hospitals? That would be useful and insightful. Furthermore, highlighting the (hopefully declining) proportion of inappropriate cases that would better be dealt with elsewhere, with examples and care paths for the most common of these.

What will they do with the insight?
Insight is only useful with action. Since my injury isn't critical, I could be flexible about what time of day I attend, so knowing the peaks and troughs might help me think about future attendance. Similarly, being aware of alternative options for other non-critical cases would keep me out of A&E in future. If the aim is to reduce non-urgent cases, then more help is required to flag examples and explain alternative routes for medical treatment.

As I mentioned, I don't want to single out this hospital, as this is just one report; it may itself be an outlier from an excellent analytics team. Instead this highlights how any piece of output needs to consider the fundamentals, and if it doesn't address them, it shouldn't be produced. Use that scarce analytical resource on something that will make more difference to strategic objectives.

In the meantime I have a few weeks to increase my one-handed typing speed and accuracy.

Thursday 20 August 2015

Start exploring Open Data (it's more than just maps)

There’s been increasing interest in Open Data in recent years; Google Trends shows a steady increase in searches over the last five years, with a heavy concentration in the UK. In part this interest mirrors the acceleration in the datasets available. The Global Open Data Index keeps track of the status of government Open Data initiatives globally: it identifies 97 places (countries) with Open Data, and monitors the scope of data available across topics such as government spending and budgets, election results, national statistics, legislation, company registers, maps, postcodes etc. The UK is ranked highest for availability.

Information on how extensively such data sets are used is pretty patchy. Sources such as OpenData500 provide some summary information (though surprisingly this doesn’t include the UK) and some good graphical representations of which industry sectors use which government departments’ data sets; for the US, the data/technology sector is just ahead of the financial services sector in usage. Both sectors use the Dept of Commerce heavily, but the FS sector’s leading usage is (not surprisingly) data from the Securities & Exchange Commission.

This kind of high-level view is broadly interesting, but doesn’t actually help organisations get started with exploring what they can (and should) be doing. Equally, much of the media coverage positions the concepts of Open Data via examples that stimulate thought rather than action. This excellent article covers ‘5 ways that Open Data is changing lives’; these programmes are fairly wide-ranging, from global initiatives to local (e.g. Edinburgh’s City Scrapbook). Similarly, this article in the Guardian (probably the most vocal in the UK media regarding the possibilities and opportunities of big data) provides some fascinating examples of the breadth of Open Data usage, with a focus on “How Open Data can help save lives” and topics as diverse as where to locate defibrillators and understanding cycle-safety hotspots.

Some commentators claim that the provision of Open Data has become a tick-box exercise, with government bodies just happy to say they’ve ‘done it’ rather than considering how useful and accessible the data is. Other angles attempt to identify top lists of Open Data (or here); again interesting, but less of a practical help if you have a specific problem you are trying to solve.

What’s very encouraging is the lead being taken in the UK on Open Data across a range of dimensions (collection, release and re-use), driven by an ongoing government commitment to Open Data and initiatives that include increased training and the encouragement of data consumption, assisted by the National Information Infrastructure.

If you haven’t already investigated UK Open Data, then start exploring: take a look at http://data.gov.uk/ and browse or search some of the 27k data sets available, and understand more about the strategy and direction of Open Data at the Open Data Institute: http://opendatainstitute.org/ The ODI kicked off a survey in November last year looking at how commercial organisations are using Open Data; the research findings, published in June this year, provide some clear focus points that could help many organisations contemplating wider analytical sources:

  • The most popular datasets for companies are geospatial/mapping data (57%), transport data (43%) and environment data (42%).
  • 39% of companies innovating with Open Data are over 10 years old, with some more than 25 years old, proving Open Data isn’t just for new digital startups
  • ‘Micro-enterprises’ (businesses with fewer than 10 employees) represented 70% of survey respondents, demonstrating a thriving Open Data start-up scene. These businesses are using it to create services, products and platforms. 8% of respondents were drawn from large companies of 251 or more employees.
  • 70% of companies surveyed use government Open Data, while almost half (49%) of the surveyed companies use Open Data from non-government sources, such as businesses, non-profits and community projects. 39% use a combination of government and non-governmental Open Data.
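For those who want to go beyond browsing, the data.gov.uk catalogue can also be queried programmatically. A minimal sketch that builds a dataset search request (the endpoint path is assumed from the standard CKAN API layout that data.gov.uk exposes; 'transport' is just an example query):

```python
from urllib.parse import urlencode

# Build a dataset search against the CKAN-style catalogue API.
base = "https://data.gov.uk/api/action/package_search"
params = {"q": "transport", "rows": 5}  # first five matching datasets
url = base + "?" + urlencode(params)
print(url)
# Fetching the URL returns JSON; each result carries the dataset's
# title, publisher and resource download links.
```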


Thursday 13 August 2015

Do you have bigger concerns than big data?

If you read much of the fluctuating hype around Big Data, there’s a common theme in the negative camp questioning how much real, tangible value there is in Big Data analytics. Sceptics point to the challenge of putting an ROI on something which is by nature exploratory. A key premise of Big Data’s distinction from traditional analytics is that whilst the latter is about providing answers to known questions, Big Data is about finding the questions you hadn’t thought of. With such ethereal qualities it’s not entirely surprising that the media and analysts find it easy to whip up copy that creates fear amongst those embarking on a big data mission. And just like traditional analytical projects, tales abound of huge investments in projects which were culled after sinking significant amounts of cash, time and people with no significant return. Not surprising, then, that it’s easy to find nebulous articles, and research into business perception (fuelled by such articles), making vague claims about the disappointment of big data without much tangible evidence.

Little wonder that any board contemplating a big data project will be caught between the rock of ‘we have to do this Big Data thing, because everyone else is’ and the hard place that demands every investment satisfies the corporate ROI evaluation hurdles. The “build it and they will come” philosophy never worked for data warehousing, and “trust me, there’s value in the data lake” won’t for big data.

Many organisations have rushed too quickly into Big Data technology evaluation and got their fingers burnt, satisfying the geeks’ desire for hot technologies to feature on their CVs, but not providing business value.

The most viable approach is one of focus: starting with a clear use case, thinking about specific priority business challenges that would show real value ‘if’ data could help to resolve them, or at least provide greater clarity or even marginal improvement. Then, based on a short-list of these ideas, develop a feasible set (small in number) that can be trialled. You may call this a pilot, a proof-of-concept or a proof-of-value, but it means a fairly rapid, low-cost approach to exploring the data and identifying whether there is some potential value.

As the examples below highlight (from this article), there are plenty of tangible cases where big data provides real business value, and there are few ideas in this list that don’t fulfil the following criteria:

  • a focus on a known, significant business problem, that has tangible financial implication
  • an exploration of the data (and that’s all possible data) to see which elements add value, and to discard from the solution, those that don’t

There’s a fair chance that if you don’t find any value in the potential data sets to help resolve one of your most critical business issues, then you’re not going to have a business to worry about for much longer. Your worries are bigger than Big Data.

AIG (Insurance): using a wider data set (including ‘unstructured’ data like handwritten claims notes) to enhance fraud identification
AMEX (credit cards): enhanced predictive models – identifying ¼ of accounts that will close within four months
Delta (Airlines): tackling the frustration of lost baggage – and expense to the airline – by providing customers with access to baggage tracking data via a mobile app
FT.com (media): analyse content preferences to improve personalisation
Huffington Post (Media): use real-time analysis of social media trends, recommendation and moderation to enhance personalisation
Kroger (Retail): understand customer behaviour – to drive loyalty and profitability
Southwest (Airlines): using speech analytics to understand (and improve) interactions between customers and personnel, and analysing online behaviours and actions to improve offers and increase loyalty, revenue and profit.
Red Roof Inn (Hotel): using a range of data to pinpoint travel hot-spots (bad weather, cancelled flights etc) and enhance targeting of 'stranded travelers' with hotel offers
Sprint (Telco): analysing network traffic to improve quality and customer experience
Tesla (Automotive): collecting sensor data, increasingly in near-real-time, to identify performance issues, recommend maintenance schedules and enhance R&D; all to improve customer satisfaction.
UPS (logistics): using data (telematics, routes, idle time) to enhance fleet optimisation – fewer miles, less fuel, lower costs
Walgreens (Healthcare): ensuring patients collect their prescriptions, to help them stay on their medication and prevent future illnesses

Thursday 6 August 2015

How safe is your data?

One of the most frequently covered media topics on analytics is data security, or rather data insecurity. It feels like every week there is a new report of a data breach in the papers. A quick review of the UK national press over the last quarter in fact identifies just short of 120 articles that feature a 'data breach' reference.

I'm covering this topic because my suspicion is that the fear of breach is more widespread than actual breach. Yes, there have been many data breaches, but without the real facts the cynic in me suggests media bias is creating more attention than the nature of the problem warrants. That cynic notes with a wry smile that 60 of the 117 articles appeared in The Daily Mail. (Detail from the Newsdesk service from LexisNexis.)

So I dug a bit further into the facts of recent breaches, and sought out some recent studies.
First stop was the Breach Level Index, which globally tracks publicly disclosed breaches and produces an annual summary report. [I should flag that this is the work of a "leading global provider of digital security solutions".] The 2014 report contains some useful reference points:
  • the report is based on 1,450 breaches in 2014, an increase of 46% on 2013 (though how much of this reflects more breaches, and how much a higher rate of reporting?). This included 117 in the UK for 2014.
  • the report highlights the source of these breaches: around 60% are external, but the significant remainder are either malicious insiders or accidental losses
  • most surprising was that of all these 1,450 global breaches less than 4% involved data that was encrypted in part or full.
Next I studied the latest 'Information Security Breaches Survey' from the UK government Department for Business, Innovation and Skills. This survey was carried out by PwC and took responses from over 1,000 individuals, with some bias towards SMEs.
  • 86% of 'large' organisations had a security breach last year, as did 60% of 'small' businesses. Interestingly, both figures are lower than in 2013.
  • Almost half of the breaches (47%) were caused by staff; next highest was virus impact, at 27% of incidents. External attacks accounted for 16% of incidents.
The Information Commissioner's Office also reports a range of statistics on breaches reported to it. For 2014/15, 1,807 incidents were reported:
  • over half were basic failures; losing paperwork (18%), data sent to the wrong person, by post or email (30%), insecure disposal of paperwork or computer records (5%)
  • but a significant 22% related to a lack of "appropriate technical and organisational measures"
Clearly these are three quite distinct snapshots, without direct comparability; but even so they highlight some common themes:
  • most breaches are failures at a basic, preventable level: organisations should do more to address the basics
  • where breaches are more complex, and especially external, whether direct attacks or due to viruses or malware, improved encryption would be of significant benefit
  • centralisation of data creates a single point of security concern, but most specialists agree that it is also easier to secure; decentralisation creates more potential failure points and greater challenges to manage effectively
Data security will continue to be an issue, and breaches will continue to happen, but organisations can always take some basic steps to reduce their risk and exposure.

Thursday 30 July 2015

HMRC consultation on Big Data approach to address 'hidden economy'

Interesting to read the media coverage of plans by the UK tax agency, HMRC, to "bulk collect" data on internet transactions in a drive to identify businesses selling goods and services that are avoiding tax payments. This 'hidden economy' of undeclared income is estimated at £5.9bn, around 17% of the total tax gap. This coverage, and other similar articles, highlights some interesting angles:
  • HMRC plan to collect names, addresses and revenue of sellers from a range of internet sites [such as Ebay, PayPal, AirBnB, Justeat] and match these to tax returns to identify discrepancies
  • Un-named 'senior tax accountants' warned it could lead to "fishing expeditions" which could see small businesses and modest hobbyists sent "frightening" letters demanding payment of taxes which they do not owe
  • The accountants also warned that the "huge" databases will have "significant security and privacy risks"
Whilst most of the coverage identifies a Treasury forecast that this could raise an additional £860m in taxes, most detail is directed at the risks and downsides of the exercise, though much of this seems to lack fact-based detail. The coverage doesn't really highlight that the initiative is at an evaluation stage and that HMRC have initiated a consultation exercise to allow input on the scope of data gathering, the best approach, and how to minimise the costs of fulfilling data requests.
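The matching exercise at the heart of the plan is conceptually simple: join bulk-collected seller records to declared income and flag the gaps. A minimal sketch of that idea follows; all field names, figures and the discrepancy threshold are hypothetical illustrations, not HMRC's actual approach.

```python
# Illustrative sketch: match bulk-collected platform seller records
# against declared tax-return income and flag large discrepancies.
# Names, amounts and the tolerance are invented for illustration.

platform_data = [
    {"name": "A Trader Ltd", "postcode": "AB1 2CD", "revenue": 48_000},
    {"name": "B Hobbyist",   "postcode": "EF3 4GH", "revenue": 900},
]

# Declared trading income, keyed by (name, postcode);
# B Hobbyist has filed no return at all.
tax_returns = {
    ("A Trader Ltd", "AB1 2CD"): 12_000,
}

def flag_discrepancies(platform_data, tax_returns, tolerance=0.2):
    """Flag sellers whose platform revenue exceeds declared income
    by more than `tolerance` (an arbitrary illustrative threshold)."""
    flagged = []
    for rec in platform_data:
        key = (rec["name"], rec["postcode"])
        declared = tax_returns.get(key, 0)
        if rec["revenue"] > declared * (1 + tolerance):
            flagged.append({**rec, "declared": declared})
    return flagged

for case in flag_discrepancies(platform_data, tax_returns):
    print(case["name"], "platform:", case["revenue"],
          "declared:", case["declared"])
```

Note that even this toy version flags the modest hobbyist alongside the under-declaring trader, which is precisely the "fishing expedition" risk the accountants raise: the value of the exercise depends on where the thresholds are set.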

However this consultation highlights to me a clear example of a solid big-data project:
  • recognition that data can help with a specific business issue - in this case unreported income - something with quantifiable value, even if estimated
  • an understanding that external data can provide significant improvement in tackling this issue, and that bulk-data collection provides a highly efficient approach (see comments below). New data sources are a fundamental way of better exploiting an organisation's internal data. 
  • presumably HMRC have already conducted a pilot exercise, so they have a feel for the practicality of doing this at scale - both in techniques and outcome
The consultation document sets out a range of questions, but includes some fundamental details of the suggested approach, recognising that "data can be particularly powerful" and detailing some specific benefits:
  • efficiency, by collecting information in bulk about a large number of traders
  • minimising the burden on business, by obtaining data in bulk from a few sources rather than through broad-based reporting requirements
  • recognition of the value of third-party data "as an independent check against the data taxpayers themselves report", i.e. a corroborative source
  • the paper highlights HMRC’s ambition to present taxpayers with data to check, rather than forms and tax returns to complete
This whole approach indicates that HMRC is more forward-thinking than many commercial organisations, tackling a practical issue, using data, in an innovative but very practical way.

Thursday 23 July 2015

Cycling Data: Knowledge, Intelligence and Competitive Advantage

Choosing a topic this week to kick off this blog was pretty easy; my passion for cycling and the final week of the Tour de France have combined to provide the ideal subject - the debate over access to cyclists' data as part of the quest to ensure that the sport maintains its vision of being drug- and doping-free.

Chris Froome has been at the centre of the storm this week, not least because of intense speculation after his performance during stage 10, fuelled by an interview on host broadcaster France 2 in which athletics coach and physiologist Pierre Sallet was reported to say that Froome's power output was "abnormally high".

Sallet is well known in cycling, and back in 2010 commented on a proposal to set a theoretical limit on VO2 Max (a measure of the ability to perform sustained exercise) above which it was assumed a rider was cheating. A key component in calculating VO2 Max is a measure of power, and this is where the problem arises; measuring power can be done in two ways: via a power meter, or via an estimated or modelled approach. It is the latter that Sallet uses, and he is confident that, after 10 years of research, his model is within 2% of a directly measured approach. Whilst delving into the numbers, Sallet admitted to using a higher weight estimate for Froome than that quoted by Sky, but was still confident that the adjusted figures were abnormal.
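For context, modelled approaches of this kind rest on basic physics: on a climb, average power is dominated by the work done against gravity, plus smaller rolling-resistance and aerodynamic terms. The sketch below illustrates the general idea only; the coefficients are common illustrative values, not Sallet's model, and wind, drafting and drivetrain losses are ignored.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def estimate_climb_power(total_mass_kg, elevation_gain_m, distance_m,
                         time_s, crr=0.004, cda=0.35, rho=1.0):
    """Rough estimate of average power on a climb from mass, gradient
    and speed. crr (rolling resistance), cda (drag area) and rho (air
    density) are illustrative assumptions."""
    v = distance_m / time_s                                 # average speed, m/s
    p_gravity = total_mass_kg * G * elevation_gain_m / time_s  # lifting rider+bike
    p_rolling = crr * total_mass_kg * G * v                 # rolling resistance
    p_aero = 0.5 * rho * cda * v ** 3                       # aerodynamic drag
    return p_gravity + p_rolling + p_aero

# e.g. 76 kg rider-plus-bike, 600 m gained over 8 km in 22 minutes
print(round(estimate_climb_power(76, 600, 8000, 22 * 60)))  # → 396 (watts)
```

The gravity term scales directly with rider-plus-bike mass, which is why Sallet's choice of a higher weight estimate materially changes the modelled power, and why such estimates are so sensitive to their inputs.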

Amidst all this debate Sky released a 'billion points of data' to the UK Anti-doping agency and Tim Kerrison, Sky's Head of Athlete Performance walked through this data at a special press conference.
The difficulty in assessing these metrics was highlighted by a single phrase from Kerrison: "It is difficult to identify the exact start point of the climb as there is no clear landmark defining the start."

With the debate and with this information we have a combination of factors that ring alarm bells with anyone involved in data and analytics:
  • multiple, inconsistent methods of calculating the same KPI, using a mix of measurements and estimates
  • inconsistency in the periods or distances over which the measures are taken
  • the issue of comparability - identifying what is abnormal (and an indicator of doping) as opposed to exceptional (and an indicator of high performance or capability)
Cycling's historic issues with doping immediately provide fertile ground for suspicion, and unfortunately it's all too easy for media coverage to pick up on this, focus on some isolated measures, or, more dangerously, comment on specific measures and draw conclusions from these alone.

To be of use, analytics requires a clear objective, a goal and a well-thought-out approach to how all the available data can help reach a meaningful conclusion. This takes skill and experience, but above all a thorough understanding of the nature of the available data. This was highlighted by Nicolas Portal, Sporting Director at Sky: "it's all about who can interpret that [the data]... it's better to ask a proper specialist, and don't make some suspicion".

However, amidst all this concern over distortion and misrepresentation, a statement from Tim Kerrison at Sky provided a reality check on why the data is so important: "We have a lot of data on our riders and the way we apply and use it, we see that gives us a competitive advantage. As in most industries, knowledge and intelligence is giving a competitive advantage."

No different to any other 'business', the more you understand what influences your overall performance, the more you can adjust, improve and gain an edge.