papo1011/fast-detect-gpt.cpp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

██████ ████  █████ ██████    █████  █████ ██████ █████ ████ ██████    █████   ██████ ██████
██    ██  ██ ██      ██      ██  ██ ██      ██   ██   ██      ██     ██       ██  ██   ██
████  ██████ █████   ██      ██  ██ ████    ██   ████ ██      ██     ██   ███ ██████   ██
██    ██  ██    ██   ██      ██  ██ ██      ██   ██   ██      ██     ██    ██ ██       ██
██    ██  ██ █████   ██      █████  █████   ██   █████ ████   ██      █████   ██       ██

What is fast-detect-gpt.cpp?

It is a lightweight CLI app that uses llama.cpp to detect AI-generated text. If you want to dig into the details of how it works, read the Fast-DetectGPT paper on arXiv.

The original paper's implementation uses PyTorch for inference; this project instead uses llama.cpp, which leverages standard consumer hardware for faster performance.

This implementation follows the analytic Fast-DetectGPT approach, which speeds up detection by using a single model for both sampling and scoring. Unlike earlier methods such as DetectGPT, which required separate steps or models, this approach combines the two, so only one model call is needed per check. The core metric is the conditional probability curvature, defined as:

$$d(x, p_\theta) = \frac{\log p_\theta(x) - \tilde{\mu}}{\tilde{\sigma}}$$

Where:

  • $x$ -> the input text (a sequence of tokens)
  • $\log p_\theta(x)$ -> the log likelihood of the input under the scoring model
  • $\tilde{\mu}$ -> the expected log likelihood of alternative samples drawn from the model
  • $\tilde{\sigma}$ -> the standard deviation of those sample log likelihoods

[Figure: probability curvature of machine-generated vs. human-written text, from the DetectGPT paper]

The algorithm assumes that AI-generated text sits at the peaks of the model's probability curvature, as illustrated in the figure above from the DetectGPT paper. Instead of relying on computationally expensive sampling to generate local neighbors, the analytic implementation directly evaluates the model's predictive distribution at each token. By computing the theoretical expectation and variance of the conditional log probabilities, we obtain an exact estimate of the local curvature.

The metric normalizes the difference between the observed log likelihood and its analytical expectation ($\tilde{\mu}$, the negative entropy) by the analytical standard deviation $\tilde{\sigma}$. This makes the metric a robust, deterministic zero-shot detector, able to distinguish human-written from AI-written text with high accuracy and minimal latency.

Before building

The two build steps below are both required.

How to build

cmake -B build .
cmake --build ./build --target fast-detect-gpt -j 6

How to use

  • Create a .env file (you can use .env.sample as a template)
  • Optional, if you don't have a model yet:
  • If you already have a model downloaded, move it into the models/ folder (or symlink it there), then set MODEL_NAME in the .env file to the model's folder name
  • Move your input text file into the inputs/ folder and set INPUT_FILE in the .env file to the file name
  • Run: bash ./run.sh
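As a reference, a minimal .env might look like the following. The values are placeholders; .env.sample in the repository is the authoritative template.

```shell
# Placeholder values -- copy .env.sample and adjust, rather than trusting these
MODEL_NAME=my-model-folder   # name of the model folder inside models/
INPUT_FILE=sample.txt        # name of the text file inside inputs/
```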

Understanding the output

The output is a discrepancy score for each input text; the higher the score, the more likely the text is human-written.

You can compute the threshold that maximizes the F-beta score on your training dataset using this script:

bash train.sh
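Threshold selection of this kind can be sketched as a simple sweep: try each observed score as a candidate threshold and keep the one with the best F-beta on the labelled training data. This is a hypothetical illustration of the idea, not the contents of train.sh; the struct and function names are made up, and it follows the convention above that a score at or above the threshold is predicted human-written.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative sketch of F-beta threshold selection (not the project's code).
struct Sample { double score; bool human; };

// F-beta combines precision and recall; beta > 1 weights recall more heavily.
double fbeta(double precision, double recall, double beta) {
    double b2 = beta * beta;
    double denom = b2 * precision + recall;
    return denom > 0 ? (1 + b2) * precision * recall / denom : 0.0;
}

// Sweep every observed score as a candidate threshold; return the best one.
double best_threshold(std::vector<Sample> samples, double beta) {
    std::sort(samples.begin(), samples.end(),
              [](const Sample& a, const Sample& b) { return a.score < b.score; });
    double best_t = 0.0, best_f = -1.0;
    for (const Sample& cand : samples) {
        double t = cand.score;
        int tp = 0, fp = 0, fn = 0;
        for (const Sample& s : samples) {
            bool pred_human = s.score >= t; // higher score => predicted human
            if (pred_human && s.human)        ++tp;
            else if (pred_human && !s.human)  ++fp;
            else if (!pred_human && s.human)  ++fn;
        }
        double prec = (tp + fp) > 0 ? double(tp) / (tp + fp) : 0.0;
        double rec  = (tp + fn) > 0 ? double(tp) / (tp + fn) : 0.0;
        double f = fbeta(prec, rec, beta);
        if (f > best_f) { best_f = f; best_t = t; }
    }
    return best_t;
}
```

With beta = 1 this reduces to maximizing the ordinary F1 score; raising beta trades precision for recall when missing human-written text is the costlier error.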

About

AI detection on your hardware
