Your task is the following:
- Review each script and identify any indicators of compromise or unusual behavior.
- Summarize what the script does, highlighting any noteworthy findings.
- Think about automation. How would you detect these suspicious attributes programmatically?
- Write a detection mechanism using Rust or TypeScript to automate these detections.
This should include rules to flag suspicious activity and a way to evaluate those rules to ensure effectiveness and minimize false positives.
Homework assignment Detections_Security - Senior Software Engineer (2).pdf
Indicators of compromise:
- window.execScript(text);
- window.eval(text);
This script extracts three strings from a script tag and sends them to "sspapi.zenyou.71360.com". If the request completes with a 200 response, it executes the code returned by the server.
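A rule for the two indicators above could be sketched as a line scanner. This is a minimal sketch, not the project's implementation: the `Finding` type and the `find_dynamic_exec` function are hypothetical names, and plain substring matching stands in for real JavaScript parsing.

```rust
/// A single finding produced by a rule (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct Finding {
    pub rule: &'static str,
    pub line: usize, // 1-based line number in the scanned source
}

/// Call patterns that execute a string as code (the two seen in the reviewed script).
const DYNAMIC_EXEC_SINKS: [&str; 2] = ["eval(", "execScript("];

fn is_ident_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_' || c == '$'
}

/// Flags every line that calls one of the dynamic execution sinks.
pub fn find_dynamic_exec(source: &str) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        for sink in DYNAMIC_EXEC_SINKS {
            for (pos, _) in line.match_indices(sink) {
                // Skip matches that are the tail of a longer identifier, e.g. `retrieval(`.
                let prev = line[..pos].chars().next_back();
                if prev.map_or(true, |c| !is_ident_char(c)) {
                    findings.push(Finding { rule: sink, line: idx + 1 });
                }
            }
        }
    }
    findings
}
```

Because the scanner works on raw text, it also matches occurrences inside comments and string literals, so a hit is better treated as a signal to combine with others than as a verdict.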
It has a virtual machine and a deobfuscation function, _0xfd2f, that maps encoded indices to string values from _0x4720. It collects data from fields like name, phone, and address, URL-encodes the data, encodes the result as Base64, and sends it to https://cdn-report.com/
Indicators of compromise:
- Encoded strings/data
- while (!![])
Another indicator of compromise in this file is that it sends data to a domain different from that of the website where the script was found. This should be mitigated with an appropriate Content Security Policy.
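The obfuscation indicators above can be approximated with two textual checks. This is a sketch under assumptions: the function names and the `min_hex_identifiers` cut-off are hypothetical, and the right cut-off would have to be tuned on real files.

```rust
/// Counts identifiers of the form `_0x` followed by at least one hex digit,
/// such as `_0xfd2f` or `_0x4720`.
pub fn count_hex_identifiers(source: &str) -> usize {
    let mut count = 0;
    for (pos, _) in source.match_indices("_0x") {
        // Ignore `_0x` that sits in the middle of a longer identifier.
        let prev_is_ident = source[..pos]
            .chars()
            .next_back()
            .map_or(false, |c| c.is_ascii_alphanumeric() || c == '_' || c == '$');
        let next_is_hex = source[pos + 3..]
            .chars()
            .next()
            .map_or(false, |c| c.is_ascii_hexdigit());
        if !prev_is_ident && next_is_hex {
            count += 1;
        }
    }
    count
}

/// Returns true if the source contains `while (!![])`, ignoring whitespace.
pub fn has_obfuscated_infinite_loop(source: &str) -> bool {
    let compact: String = source.chars().filter(|c| !c.is_whitespace()).collect();
    compact.contains("while(!![])")
}

/// Combines both checks; `min_hex_identifiers` is an assumed, tunable cut-off.
pub fn looks_obfuscated(source: &str, min_hex_identifiers: usize) -> bool {
    has_obfuscated_infinite_loop(source)
        || count_hex_identifiers(source) >= min_hex_identifiers
}
```

Minified but benign bundles can also contain short generated names, so counting only the `_0x` prefix keeps the rule narrow.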
This script manipulates the DOM by injecting an iframe, mysite-frame. The iframe gets the original website referrer and loads JavaScript files.
Indicators of compromise:
- Creates an iframe that loads scripts
It also attaches event listeners for different page events ("turbo:visit", "turbolinks:visit", "page:before-change", "turbo:before-cache", "turbolinks:before-cache", "turbo:load", "turbolinks:load", "page:change") to control the behavior of the injected script.
This script stores every key pressed and sends the keys to a server on the beforeunload event. It also sends all data collected from forms when they are submitted.
Indicators of compromise:
- Event listeners on form submit and on key presses.
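One way to turn this indicator into a rule is to collect the event names passed to `addEventListener` and flag scripts that listen for key events together with `submit` or `beforeunload`. This is a sketch under assumptions: both function names are hypothetical, only string-literal event names are handled, and the key event names (`keydown`, `keypress`, `keyup`) are the standard DOM ones rather than names taken from the reviewed file.

```rust
/// Extracts the event names passed as string literals to `addEventListener(...)`.
pub fn listened_events(source: &str) -> Vec<String> {
    let mut events = Vec::new();
    for (pos, needle) in source.match_indices("addEventListener(") {
        let rest = source[pos + needle.len()..].trim_start();
        let mut chars = rest.chars();
        // Only quoted event names are handled in this sketch.
        if let Some(quote) = chars.next().filter(|c| *c == '"' || *c == '\'') {
            let name: String = chars.take_while(|c| *c != quote).collect();
            events.push(name);
        }
    }
    events
}

/// Flags scripts that capture key events and also hook a typical exfiltration trigger.
pub fn looks_like_keylogger(source: &str) -> bool {
    let events = listened_events(source);
    let has = |names: &[&str]| {
        events
            .iter()
            .any(|e| names.iter().any(|n| *n == e.as_str()))
    };
    has(&["keydown", "keypress", "keyup"]) && has(&["beforeunload", "submit"])
}
```

Requiring both groups of events keeps ordinary form validation or keyboard shortcuts from triggering the rule on their own.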
In all these files, the main indicator of compromise is that the scripts send information to a domain different from that of the original website. This should be mitigated with an appropriate Content Security Policy.
To detect these suspicious attributes programmatically, the proposed solution uses two mechanisms:
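The domain mismatch can be checked by extracting the hosts a script references and comparing them with the host of the page it was found on. This is a rough sketch: the function names, the `page_host` argument, and the `allowlist` parameter are assumptions for illustration, and only URLs written with an explicit `http://` or `https://` scheme are detected, so a bare host name would be missed.

```rust
use std::collections::BTreeSet;

/// Collects the host part of every `http://` or `https://` URL found in `source`.
pub fn referenced_hosts(source: &str) -> BTreeSet<String> {
    let mut hosts = BTreeSet::new();
    for scheme in ["https://", "http://"] {
        for (pos, _) in source.match_indices(scheme) {
            let host: String = source[pos + scheme.len()..]
                .chars()
                .take_while(|c| c.is_ascii_alphanumeric() || *c == '.' || *c == '-')
                .collect();
            if host.contains('.') {
                hosts.insert(host.to_ascii_lowercase());
            }
        }
    }
    hosts
}

/// Hosts that are neither the page host, a subdomain of it, nor on the allowlist.
pub fn external_hosts(source: &str, page_host: &str, allowlist: &[&str]) -> Vec<String> {
    let subdomain_suffix = format!(".{}", page_host);
    referenced_hosts(source)
        .into_iter()
        .filter(|h| {
            let same_site = h.as_str() == page_host || h.ends_with(&subdomain_suffix);
            let allowed = allowlist.iter().any(|a| *a == h.as_str());
            !same_site && !allowed
        })
        .collect()
}
```

The allowlist plays the same role as the sources permitted by a Content Security Policy: known third parties are listed once, and anything else is reported.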
- File description using Qwen model.
- File embeddings search using all-MiniLM-L6-v2 model.
The Qwen model generates a JSON file with the results of reviewing each file, including the impact of the issues it found. Then we generate embeddings for the threat files we have already detected and store them in the Qdrant vector database. This allows us to detect future threats similar to the ones we already found. Then we can test our system with new files.
The file embeddings search is positive if a new file has a similarity higher than 60% to a threat we already found. The file descriptor is positive if it detects a security concern of high impact.
- File similarity positive and file descriptor positive: it's 100% a threat.
- File similarity negative and file descriptor positive: we have possibly found a new threat file.
- File similarity positive and file descriptor negative: it has a 50% probability of being a threat.
- File similarity negative and file descriptor negative: it's not a threat.
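The four cases above amount to a small decision table. A minimal sketch follows; the `Verdict` enum and `classify` function are hypothetical names, and the similarity is assumed to be a score between 0 and 1.

```rust
/// Outcome of combining the two signals (hypothetical names for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Verdict {
    /// Similarity positive and descriptor positive.
    KnownThreat,
    /// Similarity negative, descriptor positive.
    PossibleNewThreat,
    /// Similarity positive, descriptor negative.
    Uncertain,
    /// Both negative.
    NotAThreat,
}

/// A file counts as similar when its best match scores above 60%.
pub const SIMILARITY_THRESHOLD: f32 = 0.60;

/// `best_similarity` is the score of the closest known threat file;
/// `descriptor_high_impact` is true when the file description reports a
/// high-impact security concern.
pub fn classify(best_similarity: f32, descriptor_high_impact: bool) -> Verdict {
    let similar = best_similarity > SIMILARITY_THRESHOLD;
    match (similar, descriptor_high_impact) {
        (true, true) => Verdict::KnownThreat,
        (false, true) => Verdict::PossibleNewThreat,
        (true, false) => Verdict::Uncertain,
        (false, false) => Verdict::NotAThreat,
    }
}
```

Keeping the threshold in one constant makes it easy to re-tune when the false positive rate is measured on more files.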
- You need to have Rust with Cargo dependency management installed.
- You need to have Docker installed in your system.
- You need to have an NVIDIA GPU in your system; the more memory available, the bigger the model you can run.
To start the vector database with Docker, run:
docker compose up qdrant
thread_files: folder containing the threat files we already found
potential_threads: folder containing the files we want to test
potential_threads_descriptors: folder containing the results of applying Qwen2.5-Coder-32B-Instruct to potential_threads
You can run code_description with the following command:
cargo run --example code_description -- --top-p 0.9 --temperature 0.7 --repeat-penalty 1
It generates a JSON description of the potential threats in a file.
You can get the prompt from get_prompt in examples/code_description/main.rs and use it in the Qwen Demo.
Note: I don't have enough GPU memory to run the "2.5-coder-32b-instruct" model, so I gathered the output from the Qwen Demo. I have seen that the output quality differs between PyTorch and Candle-nn, so there might be a bug in matrix multiplication (the tokens are the same).
Parameters:
cpu: Run the code on the CPU; not recommended because of how long it takes
temperature: Affects the variability and randomness of generated responses. Values close to 0 are more deterministic.
top_p: Nucleus sampling threshold. Sampling is restricted to the smallest set of most probable tokens whose cumulative probability reaches this value.
seed: Seed used for randomness
sample_len: Length of the generated sample
repeat_penalty: Penalty applied for repeating tokens
model: Name of the model to run, e.g. 2.5-coder-32b-instruct
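To make the top_p parameter concrete, the token selection it describes can be sketched as follows. This illustrates nucleus filtering in general and is not necessarily how the sampler used by code_description is implemented; `top_p_indices` is a hypothetical name.

```rust
use std::cmp::Ordering;

/// Returns the indices of the tokens kept by top-p filtering, most probable first.
/// `probs` is assumed to be a probability distribution (non-negative, summing to 1).
pub fn top_p_indices(probs: &[f32], top_p: f32) -> Vec<usize> {
    // Sort token indices by descending probability.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap_or(Ordering::Equal));

    // Keep tokens until their cumulative probability reaches the threshold.
    let mut cumulative = 0.0;
    let mut kept = Vec::new();
    for idx in order {
        kept.push(idx);
        cumulative += probs[idx];
        if cumulative >= top_p {
            break;
        }
    }
    kept
}
```

With a threshold of 0.9, low-probability tail tokens are dropped before sampling, which reduces erratic output while still allowing some variety.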
You can store threat embeddings with the following command:
cargo run --example store_thread_embeddings
It computes embeddings for the threat files and stores them in the database.
Parameters:
cpu: Run the code on the CPU; not recommended because of how long it takes
model_id: model used, default: sentence-transformers/all-MiniLM-L6-v2
revision: model revision used, default: refs/pr/21
You can run detect_new_threads with the following command:
cargo run --example detect_new_threads
It creates embeddings of the files under test and searches for the most similar known threat file. It also checks whether the file description reports any high-impact threats.
Parameters:
cpu: Run the code on the CPU; not recommended because of how long it takes
model_id: model used, default: sentence-transformers/all-MiniLM-L6-v2
revision: model revision used, default: refs/pr/21
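The similarity search that detect_new_threads relies on can be illustrated in memory. In the project the vectors live in Qdrant, so this is only a sketch of the idea: cosine similarity is assumed as the metric, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```rust
use std::cmp::Ordering;

/// Cosine similarity of two vectors; `None` if the lengths differ,
/// the vectors are empty, or either has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Returns the name and score of the stored embedding closest to `query`.
pub fn most_similar<'a>(
    query: &[f32],
    known: &'a [(String, Vec<f32>)],
) -> Option<(&'a str, f32)> {
    known
        .iter()
        .filter_map(|(name, v)| cosine_similarity(query, v).map(|s| (name.as_str(), s)))
        .max_by(|l, r| l.1.partial_cmp(&r.1).unwrap_or(Ordering::Equal))
}
```

The returned score is what gets compared against the 60% threshold; a vector database does the same comparison with an index instead of a linear scan.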
File: ./potential_threads/file6.js - is 100% a thread
File: ./potential_threads/file5.js - is 100% a thread
File: ./potential_threads/file100.js is not a thread
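To evaluate the rules and keep false positives in check, verdicts like the ones above could be compared against ground-truth labels. The sketch below assumes such labels are available as (predicted, actual) pairs; the `Confusion` type and its methods are hypothetical names, not part of the project.

```rust
/// Confusion-matrix counts for a binary "threat / not a threat" decision.
#[derive(Debug, Default, PartialEq)]
pub struct Confusion {
    pub tp: u32,  // flagged, and really a threat
    pub fp: u32,  // flagged, but benign
    pub tn: u32,  // not flagged, and benign
    pub fn_: u32, // not flagged, but really a threat
}

impl Confusion {
    /// Builds the counts from (predicted, actual) pairs.
    pub fn from_pairs(pairs: &[(bool, bool)]) -> Self {
        let mut c = Confusion::default();
        for &(predicted, actual) in pairs {
            match (predicted, actual) {
                (true, true) => c.tp += 1,
                (true, false) => c.fp += 1,
                (false, false) => c.tn += 1,
                (false, true) => c.fn_ += 1,
            }
        }
        c
    }

    /// Share of flagged files that are real threats.
    pub fn precision(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fp)
    }

    /// Share of real threats that were flagged.
    pub fn recall(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fn_)
    }

    /// Share of benign files that were flagged anyway.
    pub fn false_positive_rate(&self) -> Option<f64> {
        ratio(self.fp, self.fp + self.tn)
    }
}

/// Division that returns `None` instead of dividing by zero.
fn ratio(num: u32, den: u32) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(f64::from(num) / f64::from(den))
    }
}
```

Tracking the false positive rate while varying the similarity threshold shows how strict the detection can be before benign files start being flagged.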