execvpe/ts-detector

Trojan Source Detector

A collection of scripts to detect possible Trojan Source attacks in GitHub repositories.

Written by Martin Weinzierl in 2025 for his bachelor's thesis "Impact of Trojan Source Attacks".

Motivation

According to a 2023 paper by Nicholas Boucher and Ross Anderson, a set of vulnerabilities grouped under the term Trojan Source represents a serious attack vector that academic research has so far left underexplored.

By deliberately using

  1. control characters that manipulate the bidirectional Unicode algorithm (so-called BiDi control characters, which alter text display and reading order),
  2. invisible characters, and
  3. homoglyphs (visually similar characters),

an attacker can make the source code shown on screen differ from the logic that the compiler or interpreter actually processes, without a reviewer immediately noticing.

Usage

All modules accept command-line arguments.
Every Python script supports --help, which prints its documentation.

The software was primarily used with Python 3.13.7.
Compatibility with older and newer Python versions cannot be guaranteed.

1. Clone this repository

Get the local project files to run all analysis tools.

$ git clone <project url>
$ cd ts-detector

2. Obtain a GitHub API token

A token is required for authenticated GitHub API requests, which are subject to a higher rate limit than anonymous ones.

On GitHub, go to Settings > Developer Settings > Personal access tokens.

$ echo 'ghp_abc123...xyz890' > GITHUB_API_TOKEN
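A script can then read the token from that file and attach it to its requests. The following is a minimal sketch, not the code used in this repository: the helper names load_token and auth_headers are hypothetical, and only the file name GITHUB_API_TOKEN is taken from the command above.

```python
from pathlib import Path


def load_token(path="GITHUB_API_TOKEN"):
    """Read the token file and drop the trailing newline that echo appends."""
    return Path(path).read_text(encoding="utf-8").strip()


def auth_headers(token):
    """Build headers for an authenticated GitHub REST API request.

    Hypothetical helper; the scripts in this repository may do this differently.
    """
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Stripping the whitespace matters because echo writes a newline after the token, and a newline inside a header value would make the request invalid.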

3. Large-scale preliminary analysis

Run an initial large-scale screening that scans repositories briefly. The deeper analysis follows in the later steps.

The loop.sh script alternates between calling query.py and detect.py. Together, these scripts perform a coarse pre-filtering, which includes:

  1. detecting unbalanced BiDi control sequences directly in the retrieved file contents,

  2. identifying any non-ASCII Unicode characters that appear outside of comments and string literals, which filters out many false positives early, before the more complex local heuristics are applied.

$ ./loop.sh

The script should be customized to suit your use case.
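The two pre-filtering checks can be illustrated roughly as follows. This is a sketch under stated assumptions, not the logic of detect.py: the function names are hypothetical, the balance check treats each line as one paragraph and also flags stray closing characters, and the second check relies on Python's tokenize module, so it only works for Python source that tokenizes cleanly.

```python
import io
import tokenize

EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF, PDI = "\u202c", "\u2069"                                 # their closers

# Token types whose content is comment or string text (assumption: Python input).
SKIPPED = {tokenize.COMMENT, tokenize.STRING}
# Python 3.12+ splits f-strings into extra token types; skip their literal parts too.
for _name in ("FSTRING_MIDDLE", "TSTRING_MIDDLE"):
    if hasattr(tokenize, _name):
        SKIPPED.add(getattr(tokenize, _name))


def unbalanced_bidi_lines(text):
    """Return 1-based numbers of lines whose BiDi controls are not properly closed."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        embeddings = isolates = 0
        stray_closer = False
        for ch in line:
            if ch in EMBEDDING_OPENERS:
                embeddings += 1
            elif ch in ISOLATE_OPENERS:
                isolates += 1
            elif ch == PDF:
                embeddings -= 1
            elif ch == PDI:
                isolates -= 1
            if embeddings < 0 or isolates < 0:
                stray_closer = True
                break
        if stray_closer or embeddings or isolates:
            flagged.append(number)
    return flagged


def non_ascii_code_tokens(source):
    """Return (line, token) pairs for non-ASCII tokens outside comments and strings."""
    reader = io.StringIO(source).readline
    return [
        (tok.start[0], tok.string)
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in SKIPPED and not tok.string.isascii()
    ]
```

A file that passes both checks contains no unclosed BiDi controls and no non-ASCII characters in executable code, so it can be dropped before the more expensive local analysis.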

4. Download the necessary files for local analysis

Fetch source code files flagged during the preliminary analysis so they can be examined offline.

$ ./download.py

5. Analyse the files locally using heuristics

Perform a deep local analysis of all downloaded files based on heuristic detection methods.

This module scans each file for:

  1. unbalanced BiDi control sequences, which may alter the visual code flow,

  2. attacks involving invisible characters, such as zero-width spaces or joiners, and

  3. homoglyph-based attacks, where visually similar characters are used to alter function identifiers.

$ ./analyse.py
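The second and third heuristics can be approximated with the standard library alone. The sketch below is illustrative and is not the implementation in analyse.py: the function names and the list of invisible characters are assumptions, and the mixed-script test is only a crude stand-in for real homoglyph detection, which would normally rely on the confusables data of Unicode Technical Standard #39. It also does not distinguish code from comments or strings.

```python
import re
import unicodedata

# A small, non-exhaustive selection of invisible characters (assumption).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Unicode-aware identifier pattern: a letter or underscore, then word characters.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def find_invisible(text):
    """Return (line, column, character name) for each invisible character, 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in INVISIBLE:
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


def scripts_of(identifier):
    """Approximate the scripts in an identifier by the first word of each letter's name."""
    return {
        unicodedata.name(ch, "").split(" ", 1)[0]
        for ch in identifier
        if ch.isalpha()
    }


def mixed_script_identifiers(text):
    """Return identifiers that mix letters from more than one script."""
    return [m for m in IDENTIFIER.findall(text) if len(scripts_of(m)) > 1]
```

For example, an identifier such as is_admin written with a Cyrillic "а" (U+0430) in place of the Latin "a" mixes two scripts and is reported by mixed_script_identifiers, while a zero-width space hidden inside a name is reported by find_invisible together with its position.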

6. Create useful statistics

Aggregate the results of the previous steps.

$ ./statistics.py

Related documentation

Function api_search_repo()

Function api_repos_pulls()

Information about BiDi characters

License

Copyright (C) 2025 Martin Weinzierl

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
