Autotracko

Autotracko is a command-line tool designed to automate the process of scanning websites to identify third-party trackers. It helps researchers, developers, and privacy advocates understand the tracking landscape of specific websites.

Features

- Automated Scanning: Scans a list of URLs using Puppeteer.
- Tracker Identification: Identifies network requests matching known tracker domains from lists like DuckDuckGo Tracker Radar.
- Data Collection: Collects information about loaded resources, approximate page size, and takes screenshots.
- Caching: Remembers successfully scanned domains to avoid redundant work (can be disabled).
- Incremental Results: Saves results progressively, so work isn't lost if the process is interrupted.
- Configurable: Options for input/output files, caching, tracker list location, and headless browser mode.
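To illustrate the tracker identification step, here is a minimal sketch of how request URLs could be matched against a tracker list. It is not taken from the Autotracko source: the names `TrackerEntry`, `TrackerMap`, `matchTrackerDomain`, and `findTrackers` are hypothetical, and the sketch assumes the tracker list is available as an object keyed by tracker domain.

```typescript
// Assumed shape of one tracker list entry; only fields relevant to reporting.
interface TrackerEntry {
  owner?: { name?: string };
  prevalence?: number;
}

// Assumption: the tracker list is an object keyed by tracker domain.
type TrackerMap = Record<string, TrackerEntry>;

// Returns the tracker domain that a request URL belongs to, or undefined.
function matchTrackerDomain(requestUrl: string, trackers: TrackerMap): string | undefined {
  let hostname: string;
  try {
    hostname = new URL(requestUrl).hostname.toLowerCase();
  } catch {
    return undefined; // not a parseable absolute URL
  }
  const labels = hostname.split(".");
  // Try the full hostname first, then each parent domain:
  // "a.b.example.com", "b.example.com", "example.com".
  for (let i = 0; i < labels.length - 1; i++) {
    const candidate = labels.slice(i).join(".");
    if (Object.prototype.hasOwnProperty.call(trackers, candidate)) {
      return candidate;
    }
  }
  return undefined;
}

// Returns the sorted, de-duplicated tracker domains seen among the resources.
function findTrackers(resourceUrls: string[], trackers: TrackerMap): string[] {
  const seen: Record<string, true> = {};
  for (const url of resourceUrls) {
    const match = matchTrackerDomain(url, trackers);
    if (match !== undefined) {
      seen[match] = true;
    }
  }
  return Object.keys(seen).sort();
}
```

Walking up the parent domains means a request to a subdomain is attributed to the tracker domain listed for it, which is one plausible way to handle subdomains; the actual matching rules used by the tool may differ.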

High-Level Results

Running Autotracko on a list of websites produces a JSON output file containing detailed results for each scanned site, including:

- Basic Info: Requested URL, final URL (after redirects), domain name, scan timestamp.
- Resources: A list of all URLs loaded by the page during the scan.
- Size: An approximate total size (in bytes) of loaded resources based on content-length headers.
- Screenshot: Path to a full-page screenshot of the rendered website (if enabled).
- Trackers: A list of identified tracker domains found within the loaded resources, along with details about the tracker (owner, prevalence, etc.) sourced from the DuckDuckGo list.
- Errors: Any errors encountered during the scan for a specific site.

This data allows for analysis of tracking prevalence, identification of specific tracking companies, and understanding resource loading patterns across different websites.
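As an example of this kind of analysis, the sketch below counts how many scanned sites each tracker owner appears on. The `ScanResult` and `TrackerHit` interfaces are assumptions modeled on the field list above; the property names in a real results.json may differ, so check the output file before relying on them.

```typescript
// Hypothetical shape of one identified tracker (names are assumptions).
interface TrackerHit {
  domain: string;
  owner?: string;
  prevalence?: number;
}

// Hypothetical shape of one per-site result record (names are assumptions).
interface ScanResult {
  requestedUrl: string;
  finalUrl: string;
  domain: string;
  scannedAt: string;
  resources: string[];
  approxSizeBytes: number;
  screenshotPath?: string;
  trackers: TrackerHit[];
  errors: string[];
}

// Counts, for each tracker owner, the number of sites it was found on.
// An owner is counted once per site even if several of its domains appear.
function sitesPerOwner(results: Pick<ScanResult, "trackers">[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const result of results) {
    const ownersOnSite: Record<string, true> = {};
    for (const tracker of result.trackers) {
      ownersOnSite[tracker.owner ?? "unknown"] = true;
    }
    for (const owner of Object.keys(ownersOnSite)) {
      counts[owner] = (counts[owner] ?? 0) + 1;
    }
  }
  return counts;
}
```

In practice the input would come from parsing the output file with `JSON.parse` and mapping its records onto whatever field names the file actually uses.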

Setup

Prerequisites:
- Node.js (Version 16+)

Clone Repository:
```
git clone https://github.com/Rebreda/autotracko.git
cd autotracko
```

Install Dependencies:
```
npm install
```
Fetch Tracker List:
```
npm run fetch-trackers
```
This pulls down the latest tracker list from GitHub.

Prepare Domains JSON: Create a JSON file (e.g., domains.json in the project root) containing an array of objects, where each object represents a website to scan. You can cp example.domains.json domains.json to get started quickly.

Example domains.json:

```
[
  {
    "url": "https://www.cbc.ca/news",
    "owner": {
      "name": "Canadian Broadcasting Corporation",
      "displayName": "CBC",
      "country": "CA"
    },
    "category": "News",
    "language": "en"
  },
  {
    "url": "https://www.wikipedia.org",
    "owner": {
      "name": "Wikimedia Foundation",
      "displayName": "Wikipedia",
      "country": "US"
    },
    "category": "Reference",
    "language": "mul"
  },
  {
    "url": "developer.mozilla.org",
    "owner": {
      "name": "Mozilla Corporation",
      "displayName": "MDN",
      "country": "US"
    },
    "category": "Technology",
    "language": "en"
  }
]
```

- url (string): Mandatory. The URL to scan. If no scheme (http:// or https://) is provided, https:// will be prepended.
- owner (object, optional): Information about the website owner.
  - name (string): Full name of the owner.
  - displayName (string, optional): Common name or abbreviation.
  - country (string, optional): ISO 3166-1 alpha-2 country code (e.g., "CA", "US").
- category (string, optional): A category for the website (e.g., "News", "E-commerce").
- language (string, optional): Primary language code (e.g., "en", "fr").
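The field rules above can be sketched as a TypeScript type plus a small loader. This is an illustration rather than the tool's actual code: `DomainEntry`, `normalizeUrl`, and `parseDomains` are hypothetical names, and the validation shown covers only the mandatory url field and the https:// default described above.

```typescript
// Mirrors the documented fields of one domains.json entry.
interface DomainEntry {
  url: string;
  owner?: { name: string; displayName?: string; country?: string };
  category?: string;
  language?: string;
}

// Prepends https:// when the URL has no http:// or https:// scheme.
function normalizeUrl(raw: string): string {
  return /^https?:\/\//i.test(raw) ? raw : "https://" + raw;
}

// Parses the file contents, checks the mandatory url field, and
// returns the entries with normalized URLs.
function parseDomains(json: string): DomainEntry[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) {
    throw new Error("domains file must contain a JSON array");
  }
  return data.map((item, index) => {
    if (typeof item !== "object" || item === null || typeof item.url !== "string") {
      throw new Error(`entry ${index} is missing the mandatory "url" string`);
    }
    const entry = item as DomainEntry;
    return { ...entry, url: normalizeUrl(entry.url) };
  });
}
```

With the example file above, the third entry's url would become https://developer.mozilla.org while the other two are left unchanged.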

Usage

Run the scanner using Node.js (via ts-node for development or after building with tsc).

Basic Usage:

```
# Using ts-node (for development)
npx ts-node src/index.ts --domains domains.json --output results.json

# After building (npm run build)
node dist/index.js --domains domains.json --output results.json
```

Using npm/yarn script:

```
# Ensure domains.json exists
npm run scan -- --domains domains.json --output results.json

# or
yarn scan --domains domains.json --output results.json
```

(Note the extra -- when passing arguments via npm run)

Command-Line Options:

-d, --domains <path>: Path to the JSON file listing the sites to scan (default: domains.json).
-o, --output <path>: Path to the output JSON file for results (default: results.json).
-c, --cache <path>: Path to the cache file (default: cache.json). Caching is enabled by default.
--no-cache: Disables reading from or writing to the cache file.
-t, --tracker-list <path>: Path to the tracker list JSON file (default: src/data/extension-mv3-tds.json).
--headless <mode>: Run browser headless ('new', 'true', 'false') (default: 'new'). Use 'false' to see the browser window.
-h, --help: Display help information.
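Because command-line values arrive as strings, the --headless value has to be converted before it can be handed to the browser launcher. The sketch below shows one way that conversion might look; `HeadlessMode` and `parseHeadless` are hypothetical names, and the exact type the launcher accepts depends on the installed Puppeteer version.

```typescript
// The three documented modes, as typed values rather than raw strings.
type HeadlessMode = boolean | "new";

// Converts the --headless argument into a typed mode, rejecting anything
// other than the three documented values.
function parseHeadless(value: string): HeadlessMode {
  switch (value.toLowerCase()) {
    case "new":
      return "new";
    case "true":
      return true;
    case "false":
      return false;
    default:
      throw new Error(`invalid --headless value "${value}"; expected 'new', 'true', or 'false'`);
  }
}
```

Failing fast on an unrecognized value avoids silently launching a visible browser (or a headless one) when the option is mistyped.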

Example with Options:

```
npx ts-node src/index.ts \
  --domains ./input/my_sites.json \
  --output ./output/scan_run_1.json \
  --no-cache \
  --headless false
```

Contributing

Contributions are welcome! If you find a bug or have an idea for an improvement, please:

1. Open an Issue: Discuss the change you wish to make via a GitHub issue.
2. Fork the Repository: Create your own copy of the project.
3. Create a Branch: Make your changes in a dedicated branch (git checkout -b feature/your-feature-name).
4. Commit Changes: Make clear, concise commits.
5. Push Branch: Push your changes to your fork (git push origin feature/your-feature-name).
6. Open a Pull Request: Submit a PR back to the main repository for review.

Please ensure your code adheres to the existing style and includes tests if applicable.

License

The code for Autotracko is licensed under the Mozilla Public License Version 2.0 (MPL-2.0). You can find the full license text in the LICENSE.md file.

Please see the ACKNOWLEDGEMENTS.md file for information regarding the licenses of dependencies and data sources.
