• ragas@lemmy.ml · 23 hours ago

    I mean, I don’t know for sure, but I think they often just code program logic in to filter out certain requests they don’t want to serve.

    My evidence for that is that I can trigger “I cannot help you with that” responses by asking completely normal things that just happen to use the wrong word.
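
    A toy sketch of what such a hard-coded filter might look like (the function name, blocked words, and refusal string are all made up for illustration, not anything a vendor actually uses):

    ```python
    # Toy pre-filter: a plain word check, not a model. A crude list like
    # this would explain normal requests being refused just for using
    # the wrong word. All names and terms here are hypothetical.
    BLOCKED_WORDS = {"exploit", "weapon"}  # assumed example terms

    def pre_filter(prompt: str) -> str | None:
        """Return a canned refusal if any blocked word appears, else None."""
        if any(word in BLOCKED_WORDS for word in prompt.lower().split()):
            return "I cannot help you with that."
        return None

    # e.g. pre_filter("how do i exploit this bug") -> refusal,
    #      pre_filter("how do i fix this bug")     -> None (passes through)
    ```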

    • Scrubbles@poptalk.scrubbles.tech · 20 hours ago

      It’s not 100%, and you’re more or less just asking the LLM to behave, then filtering the response through another imperfect model that tries to decide whether it’s malicious. It’s not standard coding where a boolean is returned - it’s a probability, assigned by another model, that what the user asked is inappropriate. If that probability is over a threshold, the request is rejected.
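
      A rough sketch of that flow (all names are hypothetical, and the scoring function is a deterministic stand-in for a real moderation model):

      ```python
      import math

      FLAGGED = {"exploit", "weapon", "bypass"}  # assumed example terms
      REJECTION_THRESHOLD = 0.8                  # assumed cutoff

      def classify_request(prompt: str) -> float:
          """Stand-in for a second moderation model: maps a crude feature
          (count of flagged words) to a probability with a logistic curve."""
          hits = sum(word in FLAGGED for word in prompt.lower().split())
          return 1 / (1 + math.exp(-(2 * hits - 1)))  # toy scoring only

      def handle_request(prompt: str) -> str:
          # Not a boolean on the prompt itself: a probability from another
          # model, rejected only when it crosses the threshold.
          if classify_request(prompt) > REJECTION_THRESHOLD:
              return "I cannot help you with that."
          return f"(main model response to: {prompt})"  # real LLM call here
      ```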