I've been asked on several occasions by both customers and sales teams to explain why standalone open source is not enough on its own to meet an enterprise's AI needs. I figured I'd just write a blog post clarifying the differences.
First and foremost, I am by no means saying that open source is not required for data science and AI. In fact, it is absolutely required, since the vast majority of the innovation in AI is coming from open source. From open source tools such as Jupyter Notebooks and RStudio, to open source languages such as Python and R, to open source frameworks such as Apache Spark, TensorFlow and XGBoost, and now to open source foundation models contributed to the Hugging Face Hub, open source must be used to harness that innovation. That innovation is happening much faster in open source than it is in commercial software - look at what traditional products such as SAS and SPSS offer and how quickly they have evolved, and then look at the pace of innovation in open source. The comparison is one-sided - open source wins by a landslide. Tools such as PyTorch and XGBoost were experimental not that long ago, and now just a few years later are already powering tremendous advances, all in production environments.
The key question is not whether to use open source, but rather how open source should be delivered within an enterprise - as standalone, 'naked' open source downloaded directly from the web, or embedded as part of a broader enterprise AI platform, such as the product we initially created, IBM Data Science Experience, its evolution as Watson Studio, and the recently announced watsonx. For the various reasons outlined below, our vote is very much on the 'embedded within a platform' side.
Standalone open source is a pain to manually install and manage
This isn't a complaint about open source AI tools specifically, but rather about open source tools in general. They lean heavily on the end user to know how to install and configure the tools on their own systems. Installers make this task a little easier, but not all open source tools ship with one, and those that don't require manual installation and configuration. Maintaining the installation is no small task either, as open source iterates quickly and each iteration requires either an uninstall and reinstall of the latest version, or hopefully an upgrade-in-place if the open source creator had that foresight. This is compounded by the fact that libraries play such an important role in data science, so not only do you need to maintain the different versions of the tools, but also the different versions of the libraries that work with each of them. Now you have to check compatibility as well, making sure that the latest version of the library you want works with the latest version of the tools that you have installed.
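To make the compatibility chore concrete, here is a minimal sketch of the kind of check a data scientist ends up writing by hand. It uses only Python's standard library. The package names and minimum versions in `TESTED_MINIMUMS` are made-up placeholders, not a recommendation, and the version parsing is deliberately simplistic (a real setup would more likely lean on a dedicated package manager or resolver).

```python
from importlib import metadata

# Hypothetical table of minimum versions a team has tested together.
# The names and numbers are illustrative assumptions only.
TESTED_MINIMUMS = {
    "pip": "20.0",
    "some-library-you-depend-on": "1.2.0",
}


def parse_version(text):
    """Turn '1.26.4' into (1, 26, 4), stopping at any non-numeric suffix."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums):
    """Return {package: status} for every package in the table."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if at_least(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = "too old"
    return report


if __name__ == "__main__":
    for package, status in check_environment(TESTED_MINIMUMS).items():
        print(package, "->", status)
```

Even this toy version has to be kept up to date by someone every time a tool or library ships a new release, which is exactly the kind of upkeep described above.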
You're essentially making the data scientist more of a system admin, and given how scarce data science talent is, do you really want that?
Open source is geared towards individual users, not teams or structured processes
Open source naturally tends to work best for individual users. Let's imagine a user who's downloaded and installed Jupyter Notebooks or a model from Hugging Face on their computer. That tool runs locally, using the resources on the user's machine. If the analysis can run within that capacity, then the user is in good shape, even if it will take a little longer. The user will most likely be running the analysis against data in flat files, since they would have to install connectors to access enterprise systems, many of which may not have open source connectors available, and jump through all of the hurdles to get authorization to access the systems. The limits of their available disk storage space and their processor will put natural limits on how big a data set they can process. If they want to share their work with a fellow data scientist, they will have to email files or set up a shared drive to share the data, and then provide step-by-step instructions on how to recreate their set-up environment and the analysis. Lastly, while they would be able to create models, they would not have the ability to deploy those models in any way using standalone open source tools. They'd have to export them and hardcode them within a system, or turn to a 3rd party model deployment tool.
In short, while excellent for individual users, standalone open source does not support enterprise use cases, where data quantities are massive, sharing is required, and deployment is critical.
Open source is missing large areas of required enterprise functionality
Continuing the thread from above, making open source productive within an enterprise requires tremendous amounts of work to close the gaps.
Open source usually comes from a highly skilled individual's pain point and that individual's willingness to invest the time necessary to build a tool to address that pain point. Because it's dependent on individual contributors, it also suffers from a natural bias of those individual contributors to build the 'cool stuff', the new features and the exciting new capabilities, not the nuts and bolts things that enterprises need. No one gets excited about building a SAML connector to enable single sign-on (SSO) on Jupyter Notebooks for example, but there's no shortage of kernels to support new languages (95 at last count, based on this list https://github.com/jupyter/jupyter/wiki/Jupyter-kernels).
Enterprises don't need the 96th Jupyter kernel; they need the nuts and bolts to be in place so that open source can be managed within the enterprise's existing technical environment.
Those basic enterprise needs include the following:
- Single Sign-On: Integration into an enterprise's standard directory and the ability to apply existing security policies
- Hosting: Hosting the software so that users can access a running instance without having to install, configure and manage on their own
- Version Management: Keeping abreast of the rapid iterations in the open source world and making the new versions available in a tested, configured, and most importantly - seamless - way
- Security Patching and Hardening: Keeping abreast of security flaws identified and applying the patches as they are released, or better yet proactively scanning the software to identify and fix or negate security flaws
However, the items listed above are only the most basic of table stakes that are required for an enterprise to feel comfortable with the software and allow you to use it. The most important customer of the tools is not the enterprise, but rather the end user, and for that customer, there still remain many gaps in standalone open source that need to be closed for them to be more productive. Those include:
- Scalability: Flexing processing power up and down as larger or smaller jobs are consumed
- Connectors: The ability to easily connect to enterprise systems to access the data therein
- Collaboration: The ability to share work with others in a seamless and consistent way, without creating additional forks or adding friction to a user's own workflow
- Integration: Linkages to other systems or tools to tap into their power or specific capabilities on an as-needed basis. For example, one of our most popular integrations is how we connect Spark to R and Python running within Jupyter Notebooks, which means that the user can use Spark for just the part of their code that requires parallelization, and not for the entire thing if they don't want to. Previously, adding parallelization to R or Python required rewriting the entire code in SparkR or a similar API; this integration eliminates that.
- Deployment: By far the biggest gap. Without a way to easily deploy an asset to production, there is no way to drive impact in the business, and AI would remain an academic exercise.
As you may be able to infer, we've built a platform that addresses all of these gaps, whether by hardening certain aspects of open source or by creating entirely new capabilities to close the functionality gaps.
What we tell people is that we think of open source as an engine. Very powerful, and highly capable, but not useful without the right setting. For example, you can think of it as a car engine. If someone comes and drops an exotic sports car engine on your front step, you can't take that bad boy and drive it down to the corner store to buy milk. You'd need wheels, tires, a frame, a windshield, a steering wheel, and of course, an awesome radio and some cool shades. Only then would it be fit to ride. We've built that sports car for you.