dotCMS is well aware of both how much attention AI has gotten and what the tools can do. However, instead of just jumping in with a "me too" feature or two, we've spent time digging deep into both the technologies and the dotCMS platform to plan how to help our customers make the best use of these technologies with their own content in dotCMS.
Before we discuss our view of these technologies, including what AI can do for content management and our strategy for incorporating it into the dotCMS product, let's first define a few technical terms.
A (Very) Brief Overview of AI Technologies
Let's agree on the basics
Before we go any further, let's make sure we're speaking the same language, because when people say "artificial intelligence," they can mean a lot of different things.
Artificial Intelligence (AI) covers a wide variety of technologies, which have been around in some form or another for many years. However, what most people are interested in isn't AI in general; it's some specific technologies that have recently become more widely known.
Machine Learning (ML) is the main technology that most people mean when they talk about "AI." ML uses statistical algorithms to "learn" the patterns in a specific set of data. The dataset can be small, such as a proprietary dataset, or extremely large, such as a large portion of the public web. ML technologies can do everything from analyzing the data to generating new data that resembles the existing data.
Large Language Models (LLMs) are the technology that's made the biggest splash lately. LLMs are ML models trained on enormous datasets, such as a large portion of the public web. Tools built on LLMs, such as ChatGPT and Google Bard, let you ask questions in a "chat" format and get powerful answers that often sound like they came from a well-informed human.
Although LLMs have gotten the most attention recently, they're just one kind of ML tool. ML also covers other technologies that are very valuable for content management. For the rest of this post, we'll use the term "ML" to refer to both LLMs and the other types of ML tools that can help you create and manage content.
What is a "Model" and Who Owns It?
People who use AI don't need to know about models, but people who choose AI tools should
In ML technology, the "model" is what you get when a training process is run over a dataset: a set of learned values (often called weights) that capture the patterns in that data. The result depends on both the dataset and the settings used during training, so changing either one changes the model, and each model is unique. In general, the more data a model is trained on, the more powerful it tends to be.
There's an important distinction between public models, which people outside your organization have created from large (often public) datasets, and private models, which are created from your own (often proprietary) dataset.
Tools like ChatGPT are based on public models; they let you benefit from the massive (and expensive) work that someone else has already done on a huge dataset. You can also create your own model based on your own data, if you're so inclined...and willing to spend the time and money.
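As a toy illustration of why each model is unique (this is not how any production LLM is trained), the Python sketch below "trains" a model that consists of a single learned number. The function name, the datasets, and the training settings are all made up for this example.

```python
def train(data, learning_rate=0.01, epochs=200):
    """Fit y ~ w * x by gradient descent and return the learned weight w.

    Hypothetical toy trainer: the "model" here is just the single number w.
    """
    w = 0.0
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w


# Two made-up datasets with different underlying patterns
data_a = [(1, 2), (2, 4), (3, 6)]  # follows y = 2x
data_b = [(1, 3), (2, 6), (3, 9)]  # follows y = 3x

model_a = train(data_a)            # learns a weight close to 2
model_b = train(data_b)            # different data, so a different model
model_c = train(data_a, epochs=5)  # same data, different settings, different model

print(model_a, model_b, model_c)
```

The same idea scales up: a real model has billions of learned values instead of one, which is why training it takes so much data, time, and money.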
Public Models: Powerful and inexpensive, but out of your control
There are three main advantages of public models. First, they've been trained on very large datasets, which makes them quite powerful. Second, since other people have already done the work (and spent the money) to train them, the cost to use them is relatively low. Third, you can use them literally right now - and "right now" is always a pretty important advantage.
Public models have some disadvantages, too. One is that they may not respect the privacy of your data. For example, if you send proprietary information in a prompt, the vendor may add that prompt to the model's training data, and the model may later return that information in responses to people outside your organization. (For more on handling security and privacy concerns, please see below.)
Another disadvantage is that public models don't understand the details that are specific to your product or organization. For example, if you ask ChatGPT for instructions on how to perform some specific action in your product, it will often return very precise-looking step-by-step instructions. Unfortunately, those instructions may be very wrong. That's because ChatGPT doesn't understand your specific product. Instead, it generates a list of steps based on its understanding of many different products of the same type.
Private Models: Proprietary and secure, but less powerful and more expensive
Private models offset the two main disadvantages of public models. In other words, they can deliver responses that are customized for your own organization and products, and they keep your proprietary data under your own control. For example, a private model may be able to generate correct step-by-step instructions on how to perform a particular action with your product, and can even draw on non-public data such as support tickets to do so.
However, private models also invert the advantages of public models. They're typically trained on smaller datasets than public models, limiting their power, and they take time and money to build. The more powerful you want a private model to be, the longer it takes and the more it costs to build it.
Hybrid Models: The Best of Both Worlds
Thankfully, there's a third option that lies between the two. You can start with a public model and build your own "context" on top of it. Adding context essentially means instructing the base model to treat some new information, such as your own product documentation, as a more important source than what it learned during training. This way, when you ask it a question, it can return results that retain the power of the huge underlying model but are customized for your particular purpose.
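To make the idea of "context" a little more concrete, here is a minimal Python sketch of one common approach: placing your own documents in the prompt ahead of the question. This is only an illustration, not a description of how dotCMS implements it. The function name, the prompt wording, and the sample documentation text are all hypothetical, and the sketch stops short of calling any actual model.

```python
def build_prompt(question, context_documents):
    """Combine your own documents and a user question into a single prompt.

    Hypothetical helper: the instruction wording is an assumption, not a
    format required by any particular model or vendor.
    """
    context = "\n\n".join(
        f"[Document {i}]\n{doc}"
        for i, doc in enumerate(context_documents, start=1)
    )
    return (
        "Answer the question using the context below. "
        "If the context and your general knowledge disagree, "
        "prefer the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up product documentation, used only for illustration
docs = ["To publish a page, open it in the editor and choose Publish."]

prompt = build_prompt("How do I publish a page?", docs)
print(prompt)  # this text is what would be sent to the public model
```

The underlying model is unchanged; it simply receives your documentation alongside each question and is told to give that documentation priority.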
Hybrid models preserve most of the advantages of public models, while offsetting most of the disadvantages. And since hybrid models are built on top of public models, they can also be implemented pretty close to "right now."
There are several ways to implement a hybrid model, but we won't go into the details here. Because of these advantages, the hybrid model is the type dotCMS has adopted for using ML technologies within dotCMS.
Security and Privacy of Public and Hybrid Models
AI vendors pose the same risks as any other vendor
The security and privacy of ML tools is another topic that many experts have already written about extensively, so we're not going to cover it in detail. However, a few points are worth highlighting.
Any time you send proprietary data to an external service, there are security and privacy risks. This means that any time you use a public ML model - including if you're using a hybrid model - there are some security and privacy risks.
However, while the application of ML is new to most organizations, the security and privacy issues are not. Before you send any proprietary data to any third party, you need to evaluate how that third party will handle your data.
Public and hybrid ML models are no different. Before you use any public model, you should evaluate that vendor the same way you would any other vendor, such as a web vendor with an app you use to manage your sales or customer data. This includes understanding both how secure their service is, and how they'll use any data you send to them (such as in prompts).