Tuesday, February 26. 2013
Talk tonight - "The Keys to the City of Knowledge" by Conrad Wolfram at Policy Exchange, calling for Computable Data. Once one got over the "Leader of company specialising in computable data search argues for more computable data to search" shocker, there were some interesting points made.
Firstly, he argued that computable data is about presenting the "data behind the data", for what he terms "citizen computation" - or "trying to get the answers, not other people's answers". Now this I agree with, as we note that Gresham's Law is increasingly applying to online data, ie bad data is driving out good. I also buy his argument that discussions about decisions should be about "where the sliders are on data models", not the simple black/white answers that pass for public and media debate. The issue I have with this, though, is the level of maths capability required of everyone - in the UK at any rate, maths and the whole STEM area is relatively lowly valued among the university educated (and relatively lowly paid), never mind teaching much higher levels of maths skills to les autres. Wolfram argues that maths needs to be taught differently in schools, and that Computable Data today is like computers in the Assembler days - we need to get to the "Mac" layer of computable data fast.
But that was all by the by; what really interested me in the talk were three other points he made:
1. The Value Chain of Knowledge - here are some notes I made:
Thus A Guiding Rule: Compositional knowledge >> dead information
2. What Data is most likely to emerge first? Mainly data that is either publicly funded, or largely yours, eg:
Publicly funded R&D - should be computable
Areas that are most likely to be early data sources are:
- Health - biggest gainer due to diagnosis improvement, which is inefficient & labour intensive. Sensor based medicine is coming. Also data on relative hospital performance (Tripadvisor for your bypass op, as Susan Calman may have put it)
He points out it is necessary to unpack simple metric data (exam results, school league tables), as such metrics are both easy to game and not hugely informative. Work off computable data, not metrics.
Corporate information - a lot is missing (very private)
3. National Productivity - the Computable country needs a computable knowledge economy, which requires:
There was a rather fascinating angle on this, during the question phase. Essentially the discussion had moved to getting corporate data out from behind the firewall (he argues for a VRM-like ownership of your own data) but the point was made that privatisation is bad for Big Infrastructure and some other areas, so maybe a Computable Country should re-nationalise some areas as the data creates more value in a public entity than in a private one*.
Now THAT is an interesting argument, if one started to calculate the value of the chained data....the economics of Open Data may move from "Interesting" to "Critical"
(*Somehow I don't think the Policy Exchange would have intended this, being a right wing think tank )
Friday, February 1. 2013
Broadstuff's riff on Prof David De Roure of the OeRC's Web Science slide.
I attended the WebScience Trust event yesterday on Data Observatories, a very "Motley Crew" (as Dame Wendy Hall put it) of people who are active in the space of Web data analytics etc. It was a good session, as it had representation from academics studying the area at Oxford, Warwick, UCL, Cardiff and other universities, a number of non-profits and research groups, and a few companies operating in the area (like us). I took copious notes.
Apart from being fascinating to see how many ways so many other people are attacking this emerging and extremely varied area, my main takeaways of the day were:
One person suggested we needed to derive a "3 Laws of Robotics" for web data collection and analysis companies. Amen to that!
Also, it was interesting to see not just the mix of hard scientists and "soft" scientists, but the segue of hard scientists doing soft science, soft scientists doing hard science, etc - a Motley Crew indeed....
Friday, October 19. 2012
Interesting article in HBR over here, implying that "big set" data analysis reaches limitations to its effectiveness fairly fast:
Firstly, remember the Netflix competition to improve their algorithm:
Five years ago, the company launched a competition to improve on the Cinematch algorithm it had developed over many years. It released a record-large (for 2007) dataset, with about 480,000 anonymized users, 17,770 movies, and user/movie ratings ranging from 1 to 5 (stars). Before the competition, the error of Netflix's own algorithm was about 0.95 (using a root-mean-square error, or RMSE, measure), meaning that its predictions tended to be off by almost a full "star." The Netflix Prize of $1 million would go to the first algorithm to reduce that error by just 10%, to about 0.86.
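For reference, RMSE here is just the square root of the mean squared prediction error over all user/movie ratings. A minimal sketch in Python, using made-up toy ratings (not Netflix data):

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual star ratings."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Toy data: predictions off by roughly a full star land near Cinematch's
# starting point of ~0.95.
print(rmse([3.0, 4.0, 2.5], [4, 5, 2]))  # ~0.87
```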
I recall the guys at a UK Netflix lookalike, LoveFilm, telling me that about 5 factors got the 80/20 prediction, so there was clearly a massive falling off in effectiveness as data analysis complexity increased.
But that is predicting intended demand behaviour - so what about retention, is that any easier? After all, one should have bucketloads of data and lots of historical nous dealing with one's customers. It would appear not:
A study [pdf here] that Brij Masand and I [Gregory Piatetsky-Shapiro] conducted would suggest the answer is no. We looked at some 30 different churn-modeling efforts in banking and telecom, and surprisingly, although the efforts used different data and different modeling algorithms, they had very similar lift curves. The lists of top 1% likely defectors had a typical lift of around 9-11. Lists of top 10% defectors all had a lift of about 3-4. Very similar lift curves have been reported in other work. (See here and here.) All this suggests a limiting factor to prediction accuracy for consumer behavior such as churn.
(Lift is the ratio of churners in the "big data" list vs the average churn rate, so if a "Big Data" algorithm predicts a list of customers of whom 20% are actual churners, vs an average churn of 2%, that is a "Lift" of 20/2 = 10. That still means the list is 80% wrong, though.)
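In code, the lift arithmetic is just a ratio - a minimal sketch using the numbers from the example above:

```python
def lift(churn_rate_in_list, overall_churn_rate):
    """Lift = churn rate within the flagged list / overall churn rate."""
    return churn_rate_in_list / overall_churn_rate

# The worked example above: a list that is 20% actual churners, against
# a 2% average churn rate.
print(lift(0.20, 0.02))  # 10.0 - and yet 80% of the list still won't churn
```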
And how about predicting Ad effectiveness?
The average CTR% [Click Through Rate] for display ads has been reported as low as 0.1-0.2%. Behavioral and targeted advertising have been able to improve on that significantly, with researchers reporting up to seven-fold improvements. But note that a seven-fold improvement from 0.2% amounts to 1.4% — meaning that today's best targeted advertising is ignored 98.6% of the time.
(Actually, 0.1% sounds high to me; I'd think it's almost an order of magnitude lower nowadays.)
Interestingly the article predicts Big Data will help more in the emerging services:
Big data analytics can improve predictions, but the biggest effects of big data will be in creating wholly new areas. Google, for example, can be considered one of the first successes of big data; the fact of its growth suggests how much value can be produced. While analytics may be a small part of its overall code, Google's ability to target ads based on queries is responsible for over 95% of its revenue. Social networks, too, will rely on big data to grow and prosper. The success of Facebook, Twitter, and LinkedIn social networks depends on their scale, and big data tools and analytics will be required for them to keep growing.
Google's reducing profits may be a sign that its advantage is coming to an end, which - if the view here about diminishing returns is right - does not augur well going forward. Also, as they warn:
if you're counting on it to make people much more predictable, you're expecting too much.
Quite. And yet, and yet...one more tweak...
Also, bear in mind there are some big impacts in pivotal areas. A small change in a competitive area like churn can have tremendous impact, especially if one is in a zero-sum game (eg mature mobile phone markets) played over multiple cycles. For example, assume 2 companies with equal market share, both with 20% monthly churn. A very simple simulation will show that if one player can get a sustained reduction of 1% of that 20% churn - ie down to 19.8% monthly churn - over say 36 cycles (3 years), that player ends with 53.5% vs the other's 46.5% share - a shift of 7 percentage points of market share (a sketch of such a simulation is below). Not a bad structural change in any saturated market; in fact shifts like that can drive competitors out.
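For illustration, here is a minimal sketch of one such simulation - my own toy model, not a definitive one: it assumes a closed two-player market in which every churner defects straight to the competitor. The size of the end-state shift is very sensitive to that switching assumption (straight defection converges quickly to a steady state, while models where the retention advantage compounds produce much larger swings), so treat the exact shares it prints as model-dependent.

```python
def market_shares(churn_a, churn_b, cycles=36, share_a=0.5):
    """Closed two-player market: each cycle every churner defects
    straight to the competitor (one possible switching assumption)."""
    for _ in range(cycles):
        share_b = 1.0 - share_a
        # A gains B's churners and loses its own churners to B
        share_a += share_b * churn_b - share_a * churn_a
    return share_a

# 19.8% vs 20% monthly churn, over 36 monthly cycles (3 years)
print(market_shares(0.198, 0.20))
```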
The answer, as always, is to accurately understand the costs vs the benefits.
Friday, May 25. 2012
Facebook is in talks to buy Opera, the company behind the Opera web browser, PocketLint reports. Opera has both a mobile web browsing app and a desktop browser, and it's an alternative to Internet Explorer, Mozilla's Firefox browser, Google's Chrome browser and the Apple Safari browser. Opera says it has more than 270 million users on its browser. Additional sources told The Next Web that it's looking for buyers and currently has a hiring freeze.
It's an interesting strategy for 2 reasons:
(i) Facebook is already doing every sort of datamining it can think of (and probably that its Genetic Algorithms can think of), and apart from your credit card the next best way of knowing all about your intentions is your browser. That is why, by the way, there is no way I'll use Google Chrome (though it seems my privacy concerns are not a worry for the Great Masses out there, with Chrome now having the largest installed base).
I will immediately disconnect Opera from my Smartphone if Facebook get their hooks on it. And if they issue a credit card....
Tuesday, March 13. 2012
Big Data - Rough Cut Valuation
Interesting post from Nic Brisbourne, summarising a GigaOm article on Big Data case studies. Like Nic I am dubious about some areas so applied a classic 2x2 analysis (see diagram above) to parse where I thought value might lie. His summary is below, my comments are in italics:
I was interested when GigaOM put up a post this morning titled 10 ways big data changes everything. I read through the ten ‘case studies’, and summarised them below. I’ve put my opinion on the trend in italics after a summary of the GigaOM case study. There was, in my opinion, a lot of fluff in the examples they chose, and of the ten there were only two that really stood out to me as areas with the depth and breadth to be home to multiple successful startups, and they were business intelligence applications of big data and virtual assistants.
Interesting that it's now suddenly so popular - when we built the data analytic MDE in 2007 there was very little interest in deep analytics of the social media datastream; now there are (I heard at the FT Conference) "50 starving London startups looking for funding for Big Data".
Another point made by Nic is that:
With many years' experience in this game I'd say the potential is there, but often the culture, mindset and business models are too different/difficult for the company to easily assimilate them. Still, there is nothing like a spinoff.....