Date: 2024-08-02

Detoxifying large language models (LLMs), or preventing them from generating toxic content, is challenging. The data used to train these models is usually scraped from the internet, which often contains toxic content. Without proper guardrails, a model can learn undesirable properties and, in turn, generate toxic text. Removing toxic samples from training data can be expensive because it usually requires data annotators to manually identify which samples align with human values. Inherent bias in the annotators themselves can also negatively affect the data labeling process. By using the open source project TrustyAI Detoxify in conjunction with Hugging Face's SFTTrainer (Supervised Fine-Tuning Trainer) on Red Hat OpenShift AI, we can help lower the cost of detoxifying LLMs during training.

In this article, we provide step-by-step guidance on using these open source technologies to detoxify a model.