DataFit Blog: September 2015

Monday 28 September 2015

Summary of Hadoop Use Cases

A summary paper is now available of the recent research project into published use cases of Hadoop adoption. The seven-page paper summarises the use cases by identifying the key uses to which Hadoop is being put, the major industries that are being used as case studies by the Hadoop distributors, and the key benefits identified by organisations that have implemented Hadoop. The paper is available here and requires no registration.

Friday 18 September 2015

Hadoop use cases: research underway to identify common themes

Within the next week or so I should conclude a piece of research that summarises the published customer case studies for Hadoop adoption. It's been a fascinating project to work on. With all the hype around Big Data and technologies like Hadoop, it's often difficult to get a clear, objective view of real usage. The research has examined close to 200 customer case studies to identify common themes in adoption, usage and benefits, the aim being to provide a reference for those looking to adopt Hadoop. The initial findings have shown some interesting insights:

whilst much of the focus of Hadoop has been around the perception of it providing a low cost analytical platform, driven by a combination of its open source foundation and use of commodity hardware, this is not the most referenced benefit
instead it is scalability that is most commonly quoted, with almost 2/3 [65%] of documented customer stories highlighting this as a key factor in adopting Hadoop. A common driving factor here is that organisations identify a need to retain more history than they could previously handle, or to explore new data sources that had inherently high volumes of data.
the next most reported benefit, and directly related to scalability, was speed of analytics (the time taken to run queries), with 57% of descriptive case studies highlighting this advantage
the 'cost driver' comes in third place, with 39% of customers specifically highlighting the savings in adopting Hadoop

What's always interesting when looking at factors like scalability of analytics and speed of queries is to understand 'in comparison to what'. Many of the use cases undertook comparative benchmarks against other technologies, but many others migrated up from other platforms, most commonly MySQL or SQL Server (by number of customers). In these latter cases, the 'faster' argument is always a bit thin; new commodity hardware is always going to be more performant than older solutions. This is another reason why it's so difficult to get objective information on user adoption of technologies like Hadoop.

I'll add another blog post here once the research summary is available, with a link for download. Alternatively, drop me an email at kevin [at] datafit.co.uk and I'll email it when ready (your email address won't get added to a mailing list or distributed further).

Friday 11 September 2015

The return of Arthur Andersen. What’s that about?

An interesting mix of news this week, with an announcement that Arthur Andersen will, according to The Times, ‘rise from the ashes’ in the form of a new French entity, providing a "unique international business services network integrating an authentic inter-professional dimension". Wow! After the dramatic demise of Arthur Andersen through the Enron scandal, that's a pretty bold move. What's even more interesting is that there is now a dispute between this entity and another in the US over the use of the Andersen brand.

Getting attention is an increasing challenge in a busy, complex, dynamic world, and doing something counterintuitive is one way to manage it. Given its high profile, Apple takes a fairly mainstream approach to getting attention. This week, with typical drama, they put on a slick presentation announcing a range of product updates. These included the launch of the iPad Pro, offering 'desktop PC performance' in a bigger-format iPad (with a 12" screen) that is 22x faster than the original iPad and weighs pretty much the same: a further compression of compute power. This has the potential for another step-change in the way people work, creating further enhancements to personal productivity; it was interesting to see Microsoft present on stage at the iPad Pro launch.

Adding more and more power, capability and portability to tools like the iPad further enables consumers and corporations to get wider access to information. The simplicity of tools like the iPad further enhances the spread of information access. Most people have looked on in awe at how easily a toddler picks up the iPad interface, and at the puzzlement when 'swiping' a flat-screen TV doesn't have the same effect.

However, adding performance, capability and easier access to more information can also create complexity and, as with many information or analytics problems, getting to a detailed, helpful piece of information is an increasing challenge. This is probably why Gartner positions 'self-service delivery of analytics' at the top of its hype cycle for Emerging Technologies (what Gartner calls the 'peak of inflated expectations'). There is an increasing number of vendors entering the crowded analytics market claiming to offer the best thing since sliced bread: allowing users to intuitively gain simplified access to information that lets them make real decisions. There are a few good technologies out there, but there is also a lot of hype, and buyers should beware of claims of magic and avoid the snake oil. There is still a massive challenge in tackling large, complex data sets and transforming them into insightful, actionable information. It's very difficult to self-serve or automate things that are so complex. It'll be interesting to see how this domain unfolds over the next few years; there's no doubt that software capability will be dramatically enhanced to make the most of the data.

And whilst you're waiting for self-service analytics on your shiny new iPad Pro, there's a raft of other things you can use it for. Just don't impede clarity of vision for others by taking selfies next time you're 'experiencing' a music festival or gig.

Sunday 6 September 2015

The impact of TV watching on GCSE grades: causation or correlation?

This week a 'new' research study, coinciding with the end of the UK school summer holiday, alerted parents to the alarming perils of TV for their children's exam potential. The study received widespread coverage across national and local press, with some attention-grabbing headlines, including:

Watching TV seriously harms GCSE results, says Cambridge University - The Telegraph
Each hour schoolchildren spend watching television sees GCSE results fall by equivalent of two grades, says new research - The Independent
An extra hour of TV a day costs two grades at GCSE - The Times
Teenagers who watch screens in free time 'do worse in GCSEs' - The Guardian
Extra screen time 'hits GCSE grades' - BBC

The majority of articles took the research findings at face value - watching TV results in lower exam grades - and assumed a cause and effect: the common mistake of assuming that a correlation is a causation. Just because there is a relationship between two things (correlation) does not mean that one causes the other (causation); a third factor may be driving both.

A good explanation of the dangers of mistaking correlation for causation, and a related example, can be found in the excellent Freakonomics book, by Stephen J. Dubner and Steven D. Levitt. This identified a study that highlighted that children got better exam results if their homes had more books. Whilst there is a connection, it's not likely to be causal - the existence of many books is likely to be an indicator of the interests of the parents, and therefore to reflect the parenting style and approach. Upbringing and parenting are much more likely to have a causal effect. One district took the flawed approach of responding to the original study by sending two books to all homes with children, assuming this would 'fix' the problem.
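The books example can be sketched as a small simulation. This is only an illustration under made-up assumptions, not anything taken from the Freakonomics study: a hidden 'engagement' factor drives both the number of books in a home and the exam score, while the direct effect of books on scores (the hypothetical `book_effect` parameter) is set to zero. All names and coefficients here are invented for the sketch.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def exam_score(engagement, n_books, noise, book_effect=0.0):
    # Assumption: books have NO direct effect on the score (book_effect = 0).
    return 50 + 10 * engagement + book_effect * n_books + noise


def simulate(n=5000, seed=1):
    rng = random.Random(seed)
    books, scores, boosted = [], [], []
    for _ in range(n):
        engagement = rng.gauss(0, 1)  # hidden confounder (parenting style)
        n_books = 100 + 40 * engagement + rng.gauss(0, 20)
        noise = rng.gauss(0, 5)
        books.append(n_books)
        scores.append(exam_score(engagement, n_books, noise))
        # Intervention: post two extra books to every home, nothing else changes.
        boosted.append(exam_score(engagement, n_books + 2, noise))
    return books, scores, boosted


if __name__ == "__main__":
    books, scores, boosted = simulate()
    print("correlation between books and scores:", round(pearson(books, scores), 2))
    print("scores changed by sending books:", scores != boosted)
```

With these coefficients the books and the scores are strongly correlated (about 0.8 in theory), yet sending two books to every home leaves every score exactly as it was, because the only thing that moves scores in this toy model is the hidden factor.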

The same problem applies to this current news coverage (though the research itself was carried out in 2005-2007, so its age may reduce its relevance - much has changed in the last decade), as the headlines assume that the amount of TV watched is the cause of the variation in exam results. There is very limited coverage of the extent to which the two are connected, even though the researchers had some awareness of the risk:

The BBC quoted lead author Dr Kirsten Corder: "We followed these students over time so we can be relatively confident of our results and we can cautiously infer that TV viewing may lead to lower GCSE results but we certainly can't be certain. Further research is needed to confirm this effect conclusively, but parents who are concerned about their child's GCSE grade might consider limiting his or her screen time." Dr Corder suggested there could be various reasons for the link, including "substitution of television for other healthier behaviours or behaviours better for academic performance, or perhaps some cognitive mechanisms in the brain".

This is further backed up by detail in the research paper:

Our analyses are prospective therefore allowing cautious inferences about direction of association; however it would be impossible to tell whether reductions in screen time caused an increase in academic performance without a randomised controlled trial.

There is clearly a wide range of potential factors in upbringing that could have influenced the results. For example, the research adjusted for 'deprivation' by using a postcode-based scoring indicator. But as the authors indicate, more work needs to be done to provide a greater depth of analysis.

'Screen time', whether TV, Internet or games, is clearly a factor to be balanced in children's free time; few would argue that it shouldn't have limits. Whilst the news headlines seem to overemphasise the relationship, it would be interesting to see how many readers shift their behaviour in the coming weeks as their children head back to school, or whether their underlying parenting style, modified from their own upbringing, has the most ongoing impact.

And if you're still not convinced of the risk of mistaking correlation for causation, then take a look at this site of 'Spurious Correlations', my favourite being the year-on-year correlation between the 'Number of Japanese Cars sold in the US' and the level of 'Suicides by crashing of motor vehicle'.