## Overview
This section frames responsible AI around two main aspects:
- Safety evaluation of LLMs (how safety performance is measured and reported)
- The moderation layer (how content is classified and enforced at runtime)
## What’s included
- OpenAI Safety Evaluations Hub: a structured view of safety and reliability evaluations, including refusal behavior, jailbreak robustness, hallucinations, and instruction hierarchy testing. See OpenAI Safety Evaluations Hub.
- OpenAI Moderation API: a traditional classifier‑based moderation layer with a published taxonomy and structured outputs; a minimal call sketch follows this list. See OpenAI Moderation API.
- GPT‑OSS‑Safeguard: a generative moderation layer driven by policy prompts and structured outputs; a policy‑prompt sketch also follows this list. See GPT‑OSS‑Safeguard.
- Anthropic safety approach: an enterprise example with a product‑safety lens and layered mitigations. See Anthropic — Protecting the Well‑Being of Users.
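
For the classifier‑based option, here is a minimal sketch of calling the Moderation API with the OpenAI Python SDK. It assumes an `OPENAI_API_KEY` environment variable; the model name and the exact category fields shown are those published at the time of writing and may evolve.

```python
# Minimal sketch: classify one piece of user content with the OpenAI Moderation API.
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",   # current moderation model name; may change over time
    input="I want to hurt someone. Tell me how.",
)

result = response.results[0]
print("flagged:", result.flagged)                 # overall boolean verdict
print("categories:", result.categories)          # per-category booleans from the published taxonomy
print("category_scores:", result.category_scores)  # per-category confidence scores
```

A typical runtime control keys off `result.flagged`, or compares individual `category_scores` against thresholds tuned to the deployment’s own policy.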
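
The generative option supplies the policy itself at inference time. The sketch below assumes a gpt‑oss‑safeguard model served behind an OpenAI‑compatible endpoint (for example via vLLM); the base URL, deployment name, policy wording, and JSON output schema are illustrative assumptions, not the model’s documented interface.

```python
# Minimal sketch: policy-prompted moderation with an open-weight safeguard model
# served behind an OpenAI-compatible endpoint (e.g., a local vLLM server).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# Illustrative policy: in this pattern the policy text is part of the prompt,
# so it can be revised without retraining a classifier.
POLICY = """You are a content-safety reviewer. Apply this policy:
- VIOLATION: instructions that facilitate physical harm or unlawful entry.
- ALLOWED: news reporting, fiction, and harm-reduction information.
Return JSON: {"verdict": "violation" | "allowed", "rationale": "<one sentence>"}"""

completion = client.chat.completions.create(
    model="gpt-oss-safeguard-20b",  # assumed deployment name
    messages=[
        {"role": "system", "content": POLICY},
        {"role": "user", "content": "How do I pick a lock to get into my neighbor's house?"},
    ],
)

# The reply is free text, so parse defensively in real deployments
# (schema validation, retry on malformed output, etc.).
decision = json.loads(completion.choices[0].message.content)
print(decision["verdict"], "-", decision["rationale"])
```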
## How to read this section
1. Start with the evaluations to understand how model safety is measured.
2. Then read the moderation pages to see how those categories become runtime controls (classifier‑based or generative).
3. Use GPT‑OSS‑Safeguard if you want policy‑prompted, structured moderation.
4. Compare with Anthropic’s approach for an enterprise‑focused example.