I have just finished presenting at the DataWorks Summit in San Jose, CA, where a partnership between IBM and Hortonworks was announced. Its aim is to help organizations further leverage their Hadoop infrastructures with advanced data science and machine learning capabilities.
When Apache™ Hadoop® first hit the market, there was huge interest in how the technology could be leveraged – from performing complex analytics on huge data sets using clusters of thousands of cheap commodity servers and MapReduce, to predictions that it would replace the enterprise data warehouse. About three years ago, Apache™ Spark™ gained a lot of interest, unleashing a multi-purpose advanced analytics platform to the masses – a platform capable of performing streaming analytics, graph analytics, SQL, and machine learning with a focus on efficiency, speed, and simplicity.
I won’t go into details on the size of the Hadoop market, but many organizations invested heavily for numerous reasons, including, but not limited to, its reputation as an inexpensive way to store massive amounts of data and its ability to perform advanced queries and analytics on large data sets with rapid results, thanks to the MapReduce paradigm. From one perspective, it was a data scientist’s dream: the ability to reveal deeper insights and value from one’s data in ways not previously possible.
Spark represented a different but complementary opportunity, allowing data scientists to apply cognitive techniques – machine learning and other ways of querying data – to data in HDFS™ as well as data stored on native operating systems.
Many organizations, including IBM, made investments in Hadoop- and Spark-based offerings. Customers were enthused because these powerful analytics technologies were all based on open source, representing freedom and low cost. Organizations including IBM participated in initiatives such as ODPi to help ensure interoperability and commonality between their offerings without introducing proprietary code.
Self-Service, Consumable, Cognitive Tools
Frustrated with IT departments not being able to respond fast enough to the needs of the business, departments sought a “platform” that would let them perform “self-service” analytics without having to be die-hard data scientists, engineers, or developers.
The IBM Data Science Experience (DSX) emerged as a tool that helps abstract complexity and unify all aspects of the data science disciplines, regardless of technical ability, so that a single user or multiple personas can collaborate on data science initiatives on cloud, locally (on premises), or while disconnected from the office (desktop). Whether you prefer your favorite Jupyter notebook, RStudio, Python, or Spark, or a rich graphical UI that gives advanced users all the tools they need – and cognitively guides inexperienced users, step by step, through building, training, testing, and deploying a model – DSX helps unify these into an end-to-end experience.
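To make the build-train-test steps concrete, here is a minimal sketch of the kind of workflow a notebook might contain – a tiny logistic regression fitted and scored in plain Python. The data and model are purely illustrative, not DSX internals.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy, purely illustrative data: two normalized features per row,
# label 1 when the feature values are high, 0 when they are low.
data = [
    (0.9, 0.8, 1), (0.8, 0.9, 1), (0.7, 0.9, 1), (0.9, 0.95, 1),
    (0.1, 0.2, 0), (0.2, 0.1, 0), (0.3, 0.2, 0), (0.1, 0.1, 0),
]

# Build/train: fit a tiny logistic regression with batch gradient descent.
w1, w2, b = 0.0, 0.0, 0.0
lr = 0.5
for _ in range(5000):
    g1 = g2 = gb = 0.0
    for x1, x2, y in data:
        err = sigmoid(w1 * x1 + w2 * x2 + b) - y
        g1 += err * x1
        g2 += err * x2
        gb += err
    n = len(data)
    w1 -= lr * g1 / n
    w2 -= lr * g2 / n
    b -= lr * gb / n

# Test: score the fitted model (a real notebook would use a held-out split).
def predict(x1, x2):
    return 1 if sigmoid(w1 * x1 + w2 * x2 + b) >= 0.5 else 0

accuracy = sum(predict(x1, x2) == y for x1, x2, y in data) / len(data)
```

In a tool like DSX, the same train-evaluate loop is what the guided UI walks a less experienced user through; the deploy step then exposes the fitted model for scoring.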
A lot needs to happen for machine learning to be enterprise-ready and robust enough to withstand business-critical situations. Through DSX (see figure #1), advanced machine learning capabilities, statistical methods, and rich visualizations such as Brunel are available. Sophisticated capabilities such as automated data cleansing help ensure models execute against trusted data. Deciding which parts of the data set are key to the predictive model (feature selection) can be a difficult task; fortunately, this capability is automated as part of the machine learning process within DSX. An issue many data scientists face is the potential for predictive models to be impacted by rogue data or sudden changes in the marketplace. IBM machine learning helps address this by keeping the model in its optimal state through a continuous feedback loop that fine-tunes the model's parameters without taking it offline. This allows the model to sense and respond to each interaction (at a level of granularity defined by policy) without any human intervention.
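The continuous feedback loop described above can be sketched as online learning: each scored interaction whose true outcome later arrives nudges the model's parameters while it keeps serving predictions. This is a simplified illustration of the general technique, not IBM's actual implementation; all names and data below are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlineModel:
    """A toy scoring model that refines its weights from feedback
    events while continuing to serve predictions -- no retraining
    downtime. Illustrative only, not the DSX implementation."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def score(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return sigmoid(z)

    def feedback(self, x, outcome):
        # One SGD step per observed outcome: the model stays online
        # and its parameters drift toward the current data.
        err = self.score(x) - outcome
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

model = OnlineModel(n_features=2)
# Stream of (features, true outcome) feedback events.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200
for x, y in stream:
    model.feedback(x, y)
```

A policy layer, as the text describes, would decide how many interactions (every one, every batch, every hour) trigger such an update.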
A Knowledge Universe – Unleashing Cognitive Insights on Hadoop Data Lakes – with Power
The potential of integrating the richness of DSX and its cognitive ML capabilities with all the data residing in HDFS (as well as many other data sources outside of Hadoop) is an exciting proposition for the data science community. It could help unlock deeper insights, increasing an organization’s knowledge about itself, the market, products, competitors, customers, and sentiment at scale, at speeds approaching real time. One of the key features delivered as part of Hadoop 2.0 was YARN (Yet Another Resource Negotiator), which manages the resources involved when queries are submitted to a Hadoop cluster far more efficiently than earlier versions of Hadoop – ideal for managing ever-increasing cognitive workloads.
Simply put, I cannot think of a better opportunity for organizations to leverage their Hadoop investments than now. The combination of Hadoop-based technologies integrated with IBM ML and DSX unleashes cognitive insights to a very large Hadoop install base.
All very promising so far – but there is one more nugget to unleash that will help organizations with their cognitive workloads. IBM just announced HDF 3.0 for IBM Power Systems, bringing the built-for-big-data performance and efficiency of Power Systems with POWER8 to the edge of the data platform for streaming analytics applications. This solution joins the recently launched HDP for Power Systems, which offers a 2.4X price-performance advantage1 versus x86-based deployments.
I’m excited about the possibilities that lie ahead: how data scientists and machine learning experts might leverage and benefit from our offerings and their integration with Hadoop infrastructures, and how they might take it to the next level in ways we’ve not yet imagined as we continue to enrich our offerings with more capabilities.
For more information on how to get started with Machine Learning: datascience.ibm.com
Dinesh Nirmal – Vice President, IBM Analytics Development
Follow me on Twitter: @dineshknirmal
IBM, THE IBM LOGO, IBM.COM, IBM ELASTIC STORAGE SERVER, IBM SPECTRUM SCALE, POWER8 AND POWER SYSTEMS ARE TRADEMARKS OR REGISTERED TRADEMARKS OF INTERNATIONAL BUSINESS MACHINES CORPORATION IN THE UNITED STATES, OTHER COUNTRIES, OR BOTH. IF THESE AND OTHER IBM TRADEMARKED TERMS ARE MARKED ON THEIR FIRST OCCURRENCE IN THIS INFORMATION WITH A TRADEMARK SYMBOL (® OR TM), THESE SYMBOLS INDICATE U.S. REGISTERED OR COMMON LAW TRADEMARKS OWNED BY IBM AT THE TIME THIS INFORMATION WAS PUBLISHED. SUCH TRADEMARKS MAY ALSO BE REGISTERED OR COMMON LAW TRADEMARKS IN OTHER COUNTRIES. A CURRENT LIST OF IBM TRADEMARKS IS AVAILABLE ON THE WEB AT “COPYRIGHT AND TRADEMARK INFORMATION” AT HTTP://WWW.IBM.COM/LEGAL/COPYTRADE.SHTML.
Apache, Apache Spark, Apache Hadoop, Spark, Hadoop, HDFS, and the Spark and Hadoop logos are trademarks of The Apache Software Foundation.
Other company, product or service names may be trademarks or service marks of others.
1 – Based on IBM internal testing of 10 queries (simple, medium, complex) with varying run times, running against a 10TB DB on 10 IBM Power Systems S822LC for Big Data servers (20 C/40 T, 256GB memory, HDP 2.5.3), compared to published Hortonworks results based on the same 10 queries running on 10 AWS d2.8xlarge EC2 nodes (Intel Xeon E5-2676 v3), HDP 2.5. Individual results may vary based on workload size and other conditions. Data as of April 20, 2017; pricing is based on web prices for the Power Systems S822LC for Big Data (https://ibm.biz/BdiuBC) and an HP DL380 (Intel Xeon, 20 C/40 T, 2 x E5-2630 v4, 256 GB) found at marketplace.hpe.com