Thursday 30 July 2015

HMRC consultation on Big Data approach to address 'hidden economy'

It is interesting to read the media coverage of plans by the UK tax agency, HMRC, to "bulk collect" data on internet transactions as part of a drive to identify businesses selling goods and services that are avoiding tax payments. This 'hidden economy' of undeclared income is estimated at £5.9bn or 17% of GBP. This coverage, and other similar articles, highlight some interesting angles:
  • HMRC plan to collect names, addresses and revenue of sellers from a range of internet sites [such as eBay, PayPal, Airbnb, Just Eat] and match these to tax returns to identify discrepancies
  • Unnamed 'senior tax accountants' warned "it could lead to 'fishing expeditions' which could see small businesses and modest hobbyists sent 'frightening' letters demanding the payment of taxes which they do not owe."
  • The accountants also "warned that the 'huge' databases will have significant security and privacy risks."
Whilst most of the coverage identifies a Treasury forecast that this could raise an additional £860m in taxes, the bulk of the detail is directed at the risks and downsides of the exercise, though most of these claims seem to lack fact-based support. The coverage doesn't really highlight that the initiative is at an evaluation stage and that HMRC have initiated a consultation exercise to allow input on the scope of data gathering, the best approach and how to minimise the costs of fulfilling data requests.

However, to me this consultation is a clear example of a solid big-data project:
  • recognition that data can help with a specific business issue - in this case unreported income - something with quantifiable value, even if estimated
  • an understanding that external data can provide significant improvement in tackling this issue, and that bulk-data collection provides a highly efficient approach (see comments below). New data sources are a fundamental way of better exploiting an organisation's internal data. 
  • presumably HMRC have already conducted a pilot exercise, so they have a feel for the practicality of doing this at scale - both in techniques and outcome
The consultation document sets out a range of questions, but also includes some fundamental details of the suggested approach, recognising that "data can be particularly powerful" and detailing some specific benefits:
  • efficiency by collecting information in bulk about large numbers of traders
  • minimising the burden on business, by obtaining data in bulk from a few sources rather than broad-based reporting requirements.
  • it recognises the value of third party data as an independent check against the data taxpayers themselves report, i.e. a corroborative source
  • the paper highlights HMRC’s ambition to present taxpayers with data to check rather than forms and tax returns to complete.
This whole approach indicates that HMRC is more forward thinking than many commercial organisations, tackling a practical issue with data in an innovative but very practical way.
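
The matching step described above, comparing third-party platform data with what taxpayers themselves report, can be sketched in a few lines. This is only an illustration under stated assumptions, not HMRC's method: the seller identifiers, the figures and the `tolerance` parameter are all invented for the example, and real matching would have to cope with names and addresses that don't line up cleanly.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    seller_id: str
    platform_total: float  # revenue reported by third-party platforms
    declared: float        # income the seller declared on a return

    @property
    def shortfall(self) -> float:
        return self.platform_total - self.declared


def find_discrepancies(platform_records, declared_income, tolerance=0.0):
    """Sum platform revenue per seller and flag sellers whose total
    exceeds their declared income by more than `tolerance`."""
    totals = {}
    for seller_id, revenue in platform_records:
        totals[seller_id] = totals.get(seller_id, 0.0) + revenue

    flagged = []
    for seller_id, total in sorted(totals.items()):
        # A seller with no return on file is treated as declaring nothing.
        declared = declared_income.get(seller_id, 0.0)
        if total - declared > tolerance:
            flagged.append(Discrepancy(seller_id, total, declared))
    return flagged


# Invented data: (seller, revenue) rows as they might arrive from several sites.
platform_records = [("A", 4000.0), ("A", 2500.0), ("B", 900.0), ("C", 12000.0)]
declared_income = {"A": 6500.0, "C": 7000.0}

flagged = find_discrepancies(platform_records, declared_income, tolerance=1000.0)
for d in flagged:
    print(d.seller_id, d.shortfall)
```

The hypothetical `tolerance` is where the 'fishing expedition' concern would be handled: set it too low and seller B, a modest hobbyist in this invented data, would receive a letter too.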

Thursday 23 July 2015

Cycling Data: Knowledge, Intelligence and Competitive Advantage

Choosing a topic this week to kick off this blog was pretty easy; my passion for cycling and the final week of the Tour de France have combined to provide the ideal topic - the debate over access to cyclists' data as part of the quest to ensure that the sport maintains its vision to be drug- and doping-free.

Chris Froome has been at the centre of the storm this week, not least because of intense speculation after his performance during stage 10, fuelled by an interview on the host broadcaster France 2 in which athletics coach and physiologist Pierre Sallet was reported as saying that "Froome's power output was 'abnormally high' ".

Sallet is well known in cycling, and back in 2010 commented on a proposal to set a theoretical limit on VO2 Max (a measure of ability to perform sustained exercise) above which it was assumed a rider was cheating. A key component in calculating VO2 Max is a measure of power, and this is where the problem arises: power can be measured directly, via a power meter, or estimated via a modelled approach. It's the latter that Sallet has used, and he is confident that after 10 years of research his model is within 2% of a directly measured approach. Whilst delving into the numbers Sallet admitted to using a higher weight estimate for Froome than that quoted by Sky, but was still confident that the adjusted figures were abnormal.
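
To show why the weight estimate matters in a modelled approach, here is a minimal sketch of a generic physics-based climbing power estimate. It is not Sallet's model, which isn't described in the coverage; it is the textbook sum of gravity, rolling resistance and air drag in still air, and every default value (`crr`, `cda`, `rho`, `drivetrain_eff`) and every figure in the example is an assumption chosen for illustration.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def climbing_power(rider_kg, bike_kg, speed_ms, gradient,
                   crr=0.004, cda=0.35, rho=1.1, drivetrain_eff=0.975):
    """Estimate power at the pedals (watts) for steady climbing in still air.

    gradient is rise over run (0.08 for an 8% slope); crr, cda, rho and
    drivetrain_eff are assumed values, not measurements.
    """
    total_kg = rider_kg + bike_kg
    theta = math.atan(gradient)
    gravity = total_kg * G * speed_ms * math.sin(theta)
    rolling = total_kg * G * speed_ms * crr * math.cos(theta)
    aero = 0.5 * rho * cda * speed_ms ** 3
    return (gravity + rolling + aero) / drivetrain_eff


# Same invented climb, two different guesses at the rider's weight.
for rider_kg in (67.0, 71.0):
    watts = climbing_power(rider_kg, bike_kg=8.0, speed_ms=6.0, gradient=0.08)
    print(f"{rider_kg:.0f} kg: {watts:.0f} W, {watts / rider_kg:.2f} W/kg")
```

With these assumed inputs, the heavier weight estimate raises the absolute power figure but lowers the power-to-weight figure, so the same ride can look more or less remarkable depending on which number is quoted.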

Amidst all this debate Sky released a 'billion points of data' to the UK Anti-Doping agency, and Tim Kerrison, Sky's Head of Athlete Performance, walked through this data at a special press conference.
The difficulty in assessing these metrics was highlighted by a single phrase from Kerrison: "It is difficult to identify the exact start point of the climb as there is no clear landmark defining the start."

With the debate and with this information we have a combination of factors that ring alarm bells with anyone involved in data and analytics:
  • multiple, inconsistent methods of calculating the same KPI - and using a mix of measures and estimates
  • inconsistency in the periods or distances over which the measures are taken
  • the issue of comparability - identifying what is abnormal (and an indicator of doping) as opposed to abnormal (and an indicator of high performance / capability)
Cycling's historic issues with doping immediately provide fertile ground for suspicion, and unfortunately it's all too easy for media coverage to pick up on this and focus on some isolated measures or, more dangerously, comment on specific measures and draw conclusions from these alone.
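
Kerrison's point about the start of the climb is easy to demonstrate. The sketch below uses an invented power trace (one reading per minute, not real rider data) and shows how the headline average changes when the measurement window starts one sample later.

```python
def average_power(samples, start, end):
    """Mean of samples[start:end]; the window choice is the analyst's call."""
    window = samples[start:end]
    if not window:
        raise ValueError("empty measurement window")
    return sum(window) / len(window)


# Invented trace in watts: an easy approach followed by a hard climb.
trace = [250, 260, 300, 410, 420, 415, 425, 430, 420, 410]

early_start = average_power(trace, 2, 10)  # climb judged to start at minute 2
late_start = average_power(trace, 3, 10)   # climb judged to start at minute 3

print(f"early start: {early_start:.1f} W")
print(f"late start:  {late_start:.1f} W")
```

In this made-up trace, moving the start point by a single minute shifts the average by roughly 15 W, which is the kind of gap that can separate 'high' from 'abnormally high' in a headline.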

To be of use, analytics requires a clear objective and a well-thought-out approach to how all the available data can help to reach a meaningful conclusion. This takes skill and experience, but above all a thorough understanding of the nature of the available data. This was highlighted by Nicolas Portal, Sporting Director at Sky: "it's all about who can interpret that [the data]... it's better to ask a proper specialist, and don't make some suspicion".

However, amid all this concern over distortion and misrepresentation, a statement from Tim Kerrison at Sky provided a reality check on why the data is so important: "We have a lot of data on our riders and the way we apply and use it, we see that gives us a competitive advantage. As in most industries, knowledge and intelligence is giving a competitive advantage."

This is no different to any other 'business': the more you understand what influences your overall performance, the more you can adjust, improve and gain an edge.