A Machine Learning Model refers to the “model artifact” that is created by the training process. Training data contains the _correct_ answer, which is called a target attribute. For example, when reading customer verbatims, you may want to know whether they are complaining or merely providing feedback. Training data would consist of customer verbatims labeled as either complaint or feedback by a human. These two labels are the possible values of the target attribute we want the model to learn to predict based on what the verbatims contain.
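For illustration (these verbatims and labels are invented for the example, not taken from a real survey), a hand-labeled training set for this task might look like the following, with each verbatim paired with its target attribute:

```python
# Hypothetical labeled training data: each verbatim is paired with a
# human-assigned value of the target attribute ("complaint" or "feedback").
training_data = [
    ("The app crashes every time I try to check out.", "complaint"),
    ("I was double-charged for my last order.", "complaint"),
    ("It would be nice to have a dark mode option.", "feedback"),
    ("Love the new dashboard layout!", "feedback"),
]
```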

The algorithm finds patterns in the data that correlate the verbatims (input data) to the target attributes assigned to them. The model captures these patterns and can then predict target attributes for new data that has not been pre-labeled by a human. This is why machine learning is used where the volume of data makes it infeasible for human analysts to work through it efficiently.
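As a rough sketch of what this looks like in practice, here is a minimal text-classification pipeline in Python using scikit-learn. The library, the TF-IDF features, and the logistic regression classifier are all our assumptions for the example, not a required setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labeled verbatims (the same hypothetical examples as above).
texts = [
    "The app crashes every time I try to check out.",
    "I was double-charged for my last order.",
    "It would be nice to have a dark mode option.",
    "Love the new dashboard layout!",
]
labels = ["complaint", "complaint", "feedback", "feedback"]

# fit() finds patterns (TF-IDF word weights and classifier coefficients)
# that correlate the verbatims with their labels; predict() applies those
# patterns to new, unlabeled text.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict the target attribute for a verbatim the model has never seen.
print(model.predict(["Checkout failed twice and support never replied."]))
```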

Training a model adds value to your analyses because it does the job of several people for you. It is unrealistic to tag or label every single document in a dataset in order to have an additional field to pivot on. It is much more efficient to train a model to predict those labels for you. In order for the model to work, you need to manually label some of the documents, but not all of them.

How many documents do you have to label by hand? The answer is: however many it takes for the model’s accuracy to meet an acceptable threshold. We recommend that a model achieve at least 80% accuracy.
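One way to check whether a model clears that bar is to hold back some hand-labeled documents the model never saw during training and compare its predictions against the human labels. A sketch, again assuming scikit-learn; `model`, `holdout_texts`, and `holdout_labels` are placeholder names (a way to produce the held-out sample is sketched further down):

```python
from sklearn.metrics import accuracy_score

# Compare the model's predictions on held-out verbatims against the
# labels a human assigned to those same verbatims.
predictions = model.predict(holdout_texts)
accuracy = accuracy_score(holdout_labels, predictions)

if accuracy >= 0.80:
    print(f"Accuracy {accuracy:.0%} clears the 80% threshold.")
else:
    print(f"Accuracy {accuracy:.0%}; label more documents and retrain.")
```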

Accuracy depends on the number of labels to be predicted, the volume of data the model will be predicting labels for, and the complexity of that data.

A model that predicts between two labels, a binary model, will reach a given accuracy with less manual labeling of the training dataset than a model that predicts among more than two labels. Predicting multiple labels can still be done, but it may require more training data to be done accurately.

Predicting labels for very large datasets is harder for a model to do accurately because many of the incoming verbatims are likely to diverge significantly from the training dataset. In other words, you want your training dataset to be a good representative sample of the dataset you are trying to predict labels for.
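A simple way to keep a held-out evaluation sample representative is a stratified split, which preserves the label proportions on both sides. A sketch, assuming scikit-learn; `texts` and `labels` here stand for your full hand-labeled collection:

```python
from sklearn.model_selection import train_test_split

# Stratifying on the labels keeps the complaint/feedback proportions the
# same in both halves, so the held-out sample resembles the data the
# model will eventually predict labels for.
train_texts, holdout_texts, train_labels, holdout_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```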

Models predicting labels for datasets that contain complex or lengthy verbatims will also have difficulty predicting accurately, because each verbatim carries many implicit associations that could point toward more than one label. Again, the training dataset needs to be a representative sample of the larger dataset you want to predict labels for.

Once a model is trained to a high degree of accuracy, it can be deployed to any dataset at any time; however, you will want to use your best judgment as to whether a particular model will work well with a given dataset. Again, if the model was trained on data that is materially different from the dataset you want to apply it to, it probably won’t be very accurate in that instance.
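Reusing a trained model later typically means serializing it once and loading it whenever new data arrives. A sketch using joblib (our choice of tool for the example; `new_verbatims` is a placeholder for a batch of unlabeled text):

```python
import joblib

# Save the trained pipeline once...
joblib.dump(model, "complaint_vs_feedback.joblib")

# ...then load and apply it whenever a new batch of verbatims arrives.
model = joblib.load("complaint_vs_feedback.joblib")
predicted_labels = model.predict(new_verbatims)
```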

The value added is equivalent to the time saved by not labeling documents by hand. A model trained to predict “Complaint” vs. “Feedback” on the free-response sections of a survey your organization regularly administers can be used indefinitely as that data stream grows with each future administration of the survey.
