Dipping Your Feet into Large Language Models (LLMs)


If you are new to machine learning, I highly recommend the MIT Introduction to Deep Learning lecture series, available for free on YouTube, as an entry point into what machine learning is, the kinds of machine learning techniques there are, what neural networks are, and the methodologies used to train them.

The Growing Walled Garden of LLMs

If you have played with the likes of OpenAI's ChatGPT and Google's Bard, at some point you will become curious about how you can train and run your own LLMs. As of this writing, OpenAI charges for access to its more advanced GPT-4 model, limits subscribers to 25 queries every 3 hours, and offers "higher" priority access to GPT-3.5. Bard is currently free, and Microsoft annoyingly requires you to download Microsoft Edge for access to Bing Chat.

While advanced chat bots can be immensely useful tools for knowledge-based tasks, basic users are at the whim of the large companies that control these models, which may turn users into subscribers or into products that feed these models for the companies' benefit (much as search is powered by ad revenue).

Open Source and Publicly Available Research

Amazingly, interest from the open source community and machine learning researchers has exploded, and we've seen an abundance of pre-trained models and tooling made available as enthusiasts share their work publicly. If you're interested in learning more, Hugging Face has become a centralized place where trained models and datasets are published, along with tool chains to download, work with, and train language models.

The world of open source LLM research, training, and models is growing quickly, and updates will be provided here as they become available.

Self Hosting and Training LLMs will be Important

Data is the new oil, and we'll start to see more stringent or paid access to sites that hold interesting data. Microsoft, GitHub, and OpenAI are already facing lawsuits over the use of certain "publicly available" data in the training of their machine learning models. What "publicly available" means is contentious, and while this remains a grey zone in terms of "fair use," we will certainly see individuals and organizations become increasingly cautious about what data they make public or are willing to share with third-party entities.

Working with LLMs hosted by OpenAI and Google naturally means that your queries and contextual data are shared with those organizations. Employees of enterprise organizations should already have received notices of caution, or updated policies, instructing them not to use these tools to process proprietary information, as doing so could constitute an external data leak.

While LLMs can clearly become a great tool to augment the efficiency of knowledge workers, it is evident that enterprise organizations will eventually want to self-host and deploy their own LLMs internally to ensure that data does not leak outside the organization.

Additionally, while training foundational LLMs can cost anywhere from hundreds of thousands to millions of dollars, researchers have found that it is possible to fine-tune LLMs on additional data using methods such as Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA) against a pre-trained LLM at marginal additional cost. I expect that corporate entities will move towards fine-tuning models on internal datasets (wikis, FAQs, operational processes, etc.) and use these models to improve operational efficiency.
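To give a sense of why LoRA fine-tuning is so cheap, here is a toy sketch of the underlying idea in plain NumPy. This is an illustration of the math only, not the Hugging Face peft library API, and the dimensions, rank, and scaling factor chosen below are arbitrary assumptions for the example.

```python
import numpy as np

# LoRA's core idea: keep the pre-trained weight matrix W (d x k) frozen,
# and learn two small matrices B (d x r) and A (r x k) with rank r << min(d, k).
# The effective weight used at inference is W + (alpha / r) * B @ A.

d, k, r, alpha = 1024, 1024, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pre-trained weights
A = rng.normal(size=(r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # zero-initialized, so the update starts as a no-op

def lora_forward(x):
    """Forward pass with the low-rank update folded into the frozen weights."""
    return x @ (W + (alpha / r) * B @ A).T

# Only A and B are trained, so the trainable parameter count shrinks dramatically:
full_params = d * k                  # fine-tuning W directly: 1,048,576 parameters
lora_params = d * r + r * k          # LoRA: 16,384 trainable parameters
print(f"trainable: {lora_params} of {full_params} "
      f"({100 * lora_params / full_params:.1f}%)")
```

In this toy configuration the trainable parameters drop to about 1.6% of the full matrix, which is why adapting a large pre-trained model this way costs only a fraction of full fine-tuning.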

With this in mind, I expect there will be a need for technical skills to train and operate LLMs within organizations going forward. There may also be a market for licensing highly capable trained LLMs, though that may not be the case for the time being.