
How to train an AI model: The open-source secrets of the tech industry.

But first, some notes. I am back after being locked out of my website the last month! There's a lesson here... don't load all your Authy logins on a single device and forget the password...

How to train an AI model... It may not be as exciting as training dragons, but it's a rapidly growing field of interest in the tech industry as the democratisation of data and leaked training models opens the playing field.

Any application, program, or plug-in claiming to be the latest in AI has likely been trained using some form of Machine Learning. Machine Learning (ML) is a subfield of AI that involves training algorithms on data sets so they learn to follow specific commands or recognize patterns and similarities within data. These trained algorithms form the foundation of AI programs. Historically, scalable ML for AI was limited to the major players in the tech industry, partly due to the massive data sets needed to train these models. These data sets, as well as the exact methods for training on them, form the building blocks for Large Language Models (LLMs), aka what powers AI-capable chatbots and conversational partners. And they've been hard to get right, taking huge quantities of data, manpower and money to get to the levels we have today. LLMs often employ unsupervised learning methods to learn the pattern, structure, and context of language. These generative AI systems produce new answers based on their previous training and predictions, rather than the pre-set answers of early conversational programs, such as ELIZA or your old Furby.

Machine Learning trains AI algorithms to perform various tasks, with different setups suitable for different model outputs, such as generative text or image creation. Machine Learning is generally divided into two methods: supervised and unsupervised learning. No matter what, Machine Learning relies on training algorithms on massive swaths of data, labelled or not. Every company has its own method of obtaining this data: some scour the internet, Google basically knows the thoughts of everyone on the planet, and so far Reddit has been unable to profit from or stop companies like OpenAI from mining its site for data. Data plays a key role in the development of any AI algorithm; missing attributes, biases, or limited data can hinder AI from learning properly. The source of data is a subject of hot debate. Companies like Google boast access to massive amounts of data, but as revealed by the recent Google Memo Leak, these vast datasets might actually slow them down. Retraining their models for each new application requires time and money, which open-source solutions are starting to address.

Once a company has enough reliable and 'clean' data, it can use it to train algorithms in one of two general ways: supervised or unsupervised. 'Supervised' learning uses data that has been classified or labelled by humans before being fed into the algorithm, so it typically involves pre-assessed or formatted data sets. (This human intervention before training should not be confused with Reinforcement Learning from Human Feedback, RLHF, which comes later.) Examples include regression and classification, along with transfer learning and ensemble methods, which are commonly applied in supervised settings. In contrast, 'unsupervised learning' allows algorithms to determine outputs or trends within unlabeled data. For language models, large amounts of text data are fed into a model without explicit labels or targets. This approach allows unexpected patterns to emerge and reduces the costs associated with human input, making it a much more affordable option for those without the resources to label huge quantities of data themselves. Examples mentioned here include clustering and word embedding, as well as neural networks and deep learning, natural language processing, and reinforcement learning, although the last few are broader than the label suggests: neural networks and natural language processing are also used with labelled data, and reinforcement learning is usually treated as its own category. (Definitions for these terms can be found at the end of the article.)
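To make the difference concrete, here is a minimal Python sketch of my own (a toy, not how any company actually trains its models). The sunshine numbers, the label names, and the function names are all made up for illustration. The supervised half uses the human-written labels; the unsupervised half sees only the raw numbers and has to find the groups itself.

```python
# Toy data: hours of daily sunshine measured in two kinds of places.
# The numbers and label names are made up for illustration.
labelled = [(1.0, "cloudy"), (1.5, "cloudy"), (2.0, "cloudy"),
            (8.0, "sunny"), (8.5, "sunny"), (9.0, "sunny")]


def train_supervised(examples):
    """Supervised: use the human-provided labels to average each class."""
    groups = {}
    for value, label in examples:
        groups.setdefault(label, []).append(value)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


def predict(centroids, value):
    """Return the label whose average is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def cluster_unsupervised(values, rounds=10):
    """Unsupervised: two-group k-means on raw numbers, no labels at all."""
    centres = [min(values), max(values)]  # simple starting guesses
    for _ in range(rounds):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Move each centre to the average of its group (keep it if empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


centroids = train_supervised(labelled)
print(predict(centroids, 7.0))                  # uses the learned labels
unlabelled = [value for value, _ in labelled]   # same data, labels removed
print(cluster_unsupervised(unlabelled))         # groups found without labels
```

Notice that the clustering step recovers the same two groups but has no idea what to call them. Naming things is exactly the part that needs a human.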

When developing Large Language Models (what powers AI chatbots, translations, or any AI program which generates or 'understands' language), the underlying technology is the transformer architecture, which in its original form consists of an encoder for inputs and a decoder for outputs. Both encoder and decoder incorporate a 'multi-head self-attention mechanism,' which allows them to weigh different parts of each sequence to determine the value and context of each word. This works a little like an English class, pushing the AI to use the surrounding context to work out the overall meaning of each block of text. A key element of modern unsupervised learning is the neural network. A neural network is a machine learning method loosely modelled on how the brain learns: it establishes weighted connections between data points and possible outputs, similar to how our brain sends information between neurons, and greatly reduces the time it takes to manually train or analyse data. Deep learning refers to neural networks that stack several 'layers' of these connections between input and output, commonly three or more.
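For the curious, the 'weighing' step of self-attention can be sketched in a few lines of Python. This is a stripped-down, single-head version under loud assumptions: the two-number word vectors are invented, and each vector acts as its own query, key and value, whereas a real transformer first multiplies the vectors by learned weight matrices and runs several heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Simplification: each word vector acts as its own query, key and value.
    A real transformer first multiplies by learned weight matrices.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sequence (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted blend of all the word vectors.
        blended = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                   for i in range(dim)]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-number "embeddings" for a three-word sentence.
sentence = {"river": [1.0, 0.0], "bank": [0.9, 0.1], "loan": [0.0, 1.0]}
outputs, weights = self_attention(list(sentence.values()))
for word, row in zip(sentence, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much attention one word pays to every word in the sentence. With these invented vectors, 'bank' leans towards 'river' rather than 'loan', which is the kind of context-weighing the paragraph above describes.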

Major players in the field, such as OpenAI and Google, currently keep their exact training methods and models under wraps (for now); however, they have released the general strategies behind their LLMs. Google's Bard was trained using a combination of supervised and unsupervised techniques, making the most of Google's vast data sets. The models behind ChatGPT (GPT stands for Generative Pre-trained Transformer) are transformer neural networks trained on unlabelled text. ChatGPT was built on OpenAI's GPT-3.5 large language model, which was pre-trained this way. But the system wasn't perfect. Although GPT-3.5 made huge strides towards natural, generative language, OpenAI employed around 40 contractors to fine-tune the model. This process is part of what has been termed Reinforcement Learning from Human Feedback (RLHF). The supervised set contained roughly 13,000 input/output samples, which were then used to fine-tune the model behind ChatGPT. This combination of supervised and unsupervised methods is part of what makes ChatGPT sound so... 'human'.

Meta's leak of LLaMA may have levelled the playing field, allowing ordinary people to compete. As explained by the infamous Google Memo Leak of May 4th, 2023, "many of the new ideas are from ordinary people. The barrier to entry for training and experimentation has dropped from the total output of a major research organization to one person, an evening, and a beefy laptop." This leak, in which Meta's LLaMA model appeared online, has been hugely disruptive, particularly for open-source developers and for the democratisation of data and learning models. Meta's unintentional disclosure of LLaMA provided a glimpse into the training methodologies and techniques employed by these industry giants. The leak served as a wake-up call for other companies, in particular OpenAI and Google, forcing them to reevaluate their own strategies and research efforts to remain competitive in the rapidly evolving AI landscape. It also gave ordinary users a roadmap for training their own algorithms. Consequently, we have seen a boom in AI, with faster and more efficient programs being developed by start-ups or anyone with 'a beefy laptop.'

Unsupervised AI learning methods and a bid towards democratising access to data sets have resulted in large-scale innovation and awesome new developments in the field of AI. Unsupervised learning techniques enable individuals to explore and discover patterns, structures, and insights within data. Ordinary people can leverage unsupervised learning algorithms to gain a deeper understanding of their data and make interpretations that will change their lives and businesses.

The open-source community plays a significant role in advancing unsupervised learning methods. Libraries, frameworks, and algorithms developed by the community are freely available, which helps individuals leverage and contribute to collective knowledge. This collaboration is changing how we view power and innovation in the tech industry, as the value of secrets and privacy in regard to training AI is thrown into question.

Key Terms (supplied by ChatGPT)


Regression:

Regression is a supervised learning technique in machine learning. It involves training a model to understand the relationship between input variables and a continuous target variable. In the context of this article, regression is mentioned as one of the examples of supervised learning methods used in machine learning.
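A minimal sketch of regression in Python, fitting a straight line with ordinary least squares. The flat sizes, rents, and the function name fit_line are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: y is roughly slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: flat size in square metres vs. monthly rent.
sizes = [20, 30, 40, 50]
rents = [500, 700, 900, 1100]

slope, intercept = fit_line(sizes, rents)
print(slope, intercept)          # 20.0 100.0
print(slope * 35 + intercept)    # 800.0
```

The 'continuous target variable' here is the rent: once the line is fitted, it can estimate a rent for a size it has never seen.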


Classification:

Classification is another supervised learning technique where the model is trained to categorize input data into different classes or categories. It involves training the model on labelled data to learn patterns and make predictions about the class or category of unseen data. It is mentioned as one of the examples of supervised learning methods used in the article.

Transfer Learning:

Transfer learning is a technique where a pre-trained model, trained on a large dataset, is utilized as a starting point for training a new model on a different but related task or domain. By leveraging the knowledge learned from the pre-trained model, transfer learning can speed up the training process and improve performance on the new task. It is mentioned as one of the examples of supervised learning methods used in the article.

Ensemble Methods:

Ensemble methods refer to the approach of combining multiple models, often of different types or trained on different subsets of data, to make predictions or decisions. By aggregating the outputs of individual models, ensemble methods aim to improve the overall accuracy and robustness of predictions. It is mentioned as one of the examples of supervised learning methods used in the article.
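As a rough illustration of the idea, here is a majority-vote ensemble in Python. The three 'models' are hypothetical hand-written rules standing in for separately trained classifiers; only the voting step is the point.

```python
from collections import Counter


# Three deliberately simple "models" that label a message as spam or ham.
# The rules are hypothetical stand-ins for separately trained classifiers.
def model_keywords(text):
    return "spam" if "free" in text.lower() else "ham"


def model_shouting(text):
    return "spam" if text.isupper() else "ham"


def model_punctuation(text):
    return "spam" if text.count("!") >= 2 else "ham"


def ensemble_predict(text, models):
    """Majority vote: return the label most of the models agree on."""
    votes = Counter(model(text) for model in models)
    return votes.most_common(1)[0][0]


models = [model_keywords, model_shouting, model_punctuation]
print(ensemble_predict("Free tickets!! Click now", models))        # spam
print(ensemble_predict("Lunch at noon? It's free today", models))  # ham
```

In the second message the keyword rule is fooled by the word 'free', but the other two models outvote it. That is the robustness the definition refers to.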


Clustering:

Clustering is an unsupervised learning technique where the goal is to group similar data points together based on their inherent characteristics or similarities. It involves finding patterns and structures in unlabeled data without the need for predefined classes or categories. Clustering is mentioned as an example of an unsupervised learning method in the article.

Neural Networks and Deep Learning:

Neural networks are computational models inspired by the structure and function of the human brain. Deep learning refers to the use of neural networks with multiple layers to extract hierarchical representations from data and solve complex problems. In the context of the article, neural networks and deep learning are mentioned alongside the unsupervised learning methods, although they can be trained on labelled data as well.
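To show what 'layers' means in practice, here is a small Python sketch of data flowing through a two-layer network. The weights are hand-picked by me rather than learned (training is the part that normally finds them), and the names layer and forward are my own. Deep networks stack many more layers than this.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums followed by a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]


def forward(inputs, network):
    """Pass the inputs through every layer in turn."""
    activations = inputs
    for weights, biases in network:
        activations = layer(activations, weights, biases)
    return activations


# Hand-picked (not trained) weights: 2 inputs -> 2 hidden units -> 1 output.
# With these values the network behaves like XOR on 0/1 inputs.
network = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),   # hidden layer
    ([[20.0, 20.0]], [-30.0]),                         # output layer
]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(forward([a, b], network)[0]))
```

The hidden layer is what makes this work: no single layer of this kind can compute XOR on its own, which is a small taste of why stacking layers matters.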

Word Embedding:

Word embedding is a technique used in natural language processing to represent words or phrases as vectors in a high-dimensional space. It captures the semantic relationships between words, enabling machines to understand the meaning and context of words in a text. It is mentioned as an example of an unsupervised learning method in the article.
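Here is a toy Python sketch of how 'semantic relationships' show up as geometry. The three-number vectors are hand-written for illustration only; real embeddings are learned from text and usually have hundreds of dimensions.

```python
import math

# Hand-written 3-number vectors for illustration only. Real embeddings are
# learned from text and usually have hundreds of dimensions.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """1.0 means same direction; values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the other word whose vector points in the closest direction."""
    others = (w for w in embeddings if w != word)
    return max(others,
               key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(most_similar("king"))    # queen
print(round(cosine_similarity(embeddings["king"], embeddings["apple"]), 2))
```

Words with related meanings end up pointing in similar directions, so 'king' sits much closer to 'queen' than to 'apple'.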

Natural Language Processing:

Natural Language Processing (NLP) is a subfield of AI that focuses on the interaction between computers and human language. It involves techniques and algorithms to enable machines to understand, interpret, and generate human language, both written and spoken. NLP is mentioned alongside the unsupervised learning methods in the article, although as a field it makes use of supervised techniques too.

Reinforcement Learning:

Reinforcement learning is a type of machine learning where an agent learns to make decisions and take actions in an environment to maximize a reward signal. It involves trial and error learning and feedback mechanisms to learn optimal behaviour. Reinforcement learning is mentioned alongside the unsupervised learning methods in the article, although it is usually treated as a separate category of machine learning.
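A minimal trial-and-error sketch in Python, using the classic 'slot machines' set-up with an epsilon-greedy agent. Everything here is an assumption for illustration: the payout numbers, the noise range, and the function names are invented, and this is far simpler than the RLHF used on chatbots.

```python
import random

# Hypothetical environment: three slot machines with different average payouts.
# The agent does not know these numbers; it only sees the rewards it receives.
AVERAGE_PAYOUT = [0.2, 0.5, 0.8]


def pull(arm, rng):
    """Reward = the arm's average payout plus a little bounded noise."""
    return AVERAGE_PAYOUT[arm] + rng.uniform(-0.1, 0.1)


def run_agent(steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy: mostly exploit the best-looking arm, sometimes explore."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    estimates = [0.0, 0.0, 0.0]
    for step in range(steps):
        if step < len(counts):
            arm = step                                   # try every arm once
        elif rng.random() < epsilon:
            arm = rng.randrange(len(counts))             # explore
        else:
            arm = estimates.index(max(estimates))        # exploit
        reward = pull(arm, rng)
        counts[arm] += 1
        # Running average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_agent()
best_arm = estimates.index(max(estimates))
print(best_arm, [round(e, 2) for e in estimates], counts)
```

Nobody tells the agent which machine is best. It works that out from the rewards alone, which is the 'feedback mechanism' in the definition above.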
