The third post in my series on standards in Predictive Analytics is on R, a hot topic in analytic circles these days. R is fundamentally an interpreted language for statistical computing and for the graphical display of the results of that analysis. Highly extensible, it is available as free and open source software. The core environment provides standard programming capabilities as well as specialized capabilities for data ingestion, data handling, mathematical analysis and visualization, including support for linear and generalized linear models, nonlinear regression, time series, clustering, smoothing and more. The language has been in development and use since 1997, with the 1.0 release coming in 2000; the core is now at release 3.0. New capabilities can be added by creating packages, typically written in the R language itself, and over 5,000 packages have been contributed through the open source community.
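To give a flavor of what that core distribution covers out of the box, this minimal sketch fits a linear model and runs k-means clustering on the built-in mtcars dataset – no packages beyond base R are needed (the variable names and dataset choice here are just illustrative):

```r
# Base R only: model fitting and clustering on a built-in dataset.
set.seed(42)  # make the k-means initialization reproducible

# Linear model: predict fuel economy from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)  # coefficients, R-squared, residual diagnostics

# k-means clustering on the same (scaled) variables, also in core R
clusters <- kmeans(scale(mtcars[, c("mpg", "wt", "hp")]), centers = 3)
clusters$size  # number of cars assigned to each cluster
```

Everything above ships with the base install; the 5,000-plus contributed packages layer more specialized algorithms on top of these same idioms.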
More than anything, what R offers organizations is the number of people using it – it’s widely used in academic programs, and because it’s free and open source, lots of people learn with it. It’s also becoming increasingly well established in commercial projects. Its usage has risen steadily in the Rexer Analytics survey every year, reaching 70% of respondents in the latest survey (see more on this here). Besides this huge user base, R’s extensibility means that the number of algorithms available for it is huge – over 5,300 packages have been developed. The likelihood is that R will have something that can solve any problem, if not always as well as a commercial product. Finally, R emphasizes cross-vendor compatibility through broad support for the PMML standard (see my previous post on PMML). Using R, then, gives an organization access to a pipeline of research, a huge body of algorithms and an active community.
It can seem sometimes that the future of analytics lies entirely with R – “All your analytic R belong to us,” if you like. In fact, there are still real challenges. Because R is free, projects can get started and become essential without having the support they need to succeed. Parallelism, scalability and performance are an issue with the base algorithms, and tooling is fairly basic, being script-based; script management and reuse are a recurring problem. Commercial implementations of R that provide support, training services and integrated development environments (IDEs) can mitigate these challenges. It can also be challenging to deploy R on a production system, especially when the analytic team does not completely control the deployment environment; the use of PMML can mitigate this. Finally, R tends to be strongest when analytic scores are applied in batch updates. The increasing use of real-time scoring in production puts pressure on base R and requires either the adoption of PMML or the use of commercial real-time scoring services that can be created from R models.
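As a sketch of the PMML route, assuming the open-source `pmml` package (and its `XML` dependency) is installed – neither ships with base R – a fitted model can be exported as an XML document that a separate scoring engine can consume without R being present in production:

```r
# Assumes install.packages("pmml") has been run; pmml and XML are not in base R.
library(pmml)

# Fit a model in the analytic environment...
fit <- lm(mpg ~ wt + hp, data = mtcars)

# ...then export it as a PMML document that a production scoring engine
# can load and apply without needing an R runtime at all.
pmml_model <- pmml(fit)
XML::saveXML(pmml_model, "mpg_model.pmml")
```

The file name and model here are illustrative; the point is that the hand-off to production is a standard XML artifact rather than an R script.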
R is not going to go away, and organizations should make R part of their predictive analytics strategy. This might mean focusing all analytic development on R, but more likely means mixing and matching R with proprietary or commercial environments to get the performance, scalability and/or support required. In addition, adopters should plan on working with a commercial vendor that has a solid strategy for providing scalable implementations of the R algorithms they care about, as well as better development tools and deployment options.