# appsec_analysis.py

> **Note:** This README was AI-generated; some details may be inaccurate.
This is a proof-of-concept (POC) Python script for automated security analysis of application code, configurations, API specifications, documentation, and other text-based files. It leverages DSPy (a framework for programming with language models) and AI-driven "expert" modules to recursively chunk and analyze content, identify potential security vulnerabilities, and generate consolidated reports.

The tool breaks large files into manageable chunks, applies specialized AI agents (e.g., for code, APIs, and configs) recursively, and synthesizes the findings into readable Markdown reports and detailed JSON outputs. It is designed for static application security testing (SAST) with a focus on the OWASP Top 10, API security, misconfigurations, and more.
**Note:** This is a POC and its results depend on the quality of the underlying language model (LLM). Findings should be verified by security experts; the tool does not replace a professional security audit.
## Features

- **Recursive Analysis**: Handles large files by chunking and subdividing content recursively, with configurable depth, chunk sizes, and subdivision factors (see the sketch after this list).
- **Multi-Expert Modules**: Uses DSPy signatures for specialized analysis:
  - **General Overview**: High-level content assessment.
  - **Code Module**: Scans source code for vulnerabilities (e.g., injection flaws, insecure data handling).
  - **API Module**: Analyzes API specs (e.g., OpenAPI, Postman) against the OWASP API Security Top 10.
  - **Configuration Module**: Checks configs for misconfigurations (e.g., default credentials).
  - **Documentation Module**: Reviews docs for security gaps.
  - **Threat Modeler**: Identifies potential threats and attack vectors.
  - **Compliance Checker**: Checks against standards such as OWASP and data protection principles.
- **Directory Support**: Processes entire directories, ignoring specified patterns (e.g., logs, caches).
- **Consolidated Reports**: Aggregates findings across files and chunks into a single Markdown report, with optional per-file details appended.
- **Output Formats**: Markdown for human-readable reports; JSON for detailed, structured results.
- **Content Detection**: Automatically detects file types (e.g., Python code, JSON configs, API specs).
- **Customization**: Configurable chunking, recursion limits, LLM settings, and more via command-line arguments.
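For illustration, here is a minimal sketch of the kind of overlapping chunker the recursive analysis relies on. It is not the script's actual implementation; the function name `chunk_text` is hypothetical, and the defaults mirror the CLI defaults documented below (12,000-character chunks with 500 characters of overlap).

```python
def chunk_text(text: str, chunk_size: int = 12000, overlap: int = 500) -> list[str]:
    """Split text into overlapping chunks (hypothetical helper, not the script's exact code).

    Each chunk starts `chunk_size - overlap` characters after the previous one,
    so adjacent chunks share `overlap` characters of context across the boundary.
    """
    if len(text) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

In the script, chunk lists that are still too large are subdivided again, up to the limit set by `--rec_max_depth`.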
## Requirements

- Python 3.8+ (tested on 3.12.3)
- **DSPy**: for defining and running the AI expert modules.
- **LiteLLM**: for LLM API interactions (supports OpenAI-compatible models).
- Everything else comes from the Python standard library: `argparse`, `json`, `os`, `fnmatch`, `collections`, `typing` (no extra installs needed).

Install the third-party dependencies:

```bash
pip install dspy-ai litellm
```

You also need access to an LLM API (e.g., OpenAI, or any OpenAI-compatible endpoint). Set your API key via an environment variable or a command-line argument.
## Installation

1. Clone the repository.
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   (Create a `requirements.txt` containing `dspy-ai` and `litellm` if needed.)
3. Ensure your LLM API is accessible (e.g., set the `API_KEY` environment variable); a quick connectivity check is sketched below.
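Before kicking off a long (and potentially expensive) run, you can verify the endpoint responds by configuring DSPy's LM wrapper directly. This is a hedged sketch, not part of the script; it assumes a recent DSPy version exposing `dspy.LM`, and the model name and base URL are placeholders.

```python
import os
import dspy

# Placeholder model and endpoint; substitute your own values.
lm = dspy.LM(
    "openai/gpt-4o",
    api_base="https://api.openai.com/v1",
    api_key=os.environ["API_KEY"],
)
dspy.configure(lm=lm)

# A trivial round-trip to confirm the endpoint is reachable.
print(lm("Reply with the single word: ok"))
```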
## Usage

Run the script with Python:

```bash
python appsec_analysis.py <input_path> [options]
```

- `<input_path>`: Path to a file or directory to analyze.
- `--model`: LLM model name (default: empty; set to e.g. `gpt-4o`).
- `--api_base_url`: LLM API base URL (default: empty; e.g. `https://api.openai.com/v1`).
- `--api_key`: LLM API key (or use the `API_KEY` environment variable).
- `--initial_chunk_size`: Size of initial chunks (default: 12000 characters).
- `--initial_chunk_overlap`: Overlap between chunks (default: 500 characters).
- `--max_chars_no_initial_chunk`: Maximum file size before chunking kicks in (default: 600000 characters).
- `--rec_min_chunks_subdivide`: Minimum number of chunks before recursive subdivision (default: 4).
- `--rec_max_depth`: Maximum recursion depth (default: 3).
- `--rec_max_chunks_leaf`: Maximum chunks per leaf analysis (default: 5).
- `--rec_subdivision_factor`: Subdivision factor for recursion (default: 2).
- `--max_output_tokens`: Maximum tokens for LLM responses (default: 60000).
- `--max_consolidator_input_chars`: Maximum characters fed to the findings consolidator (default: 750000).
- `--output_report_file`: Markdown report filename (default: `security_analysis_report.md`).
- `--output_json_file`: JSON results filename (default: `security_analysis_details.json`).
- `--output_dir`: Directory for outputs (default: current directory).
- `--ignore_paths`: Patterns to ignore (e.g., `*.log __pycache__ temp/*`); the matching semantics are illustrated below.
- `--append_detailed_reports`: Append per-file reports to the directory-level consolidated report.
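The `--ignore_paths` patterns are shell-style globs (the script imports `fnmatch` from the standard library). The following is a rough illustration of the matching semantics; the script's actual filtering logic may differ, and `is_ignored` is a hypothetical helper.

```python
import fnmatch

ignore_patterns = ["*.log", "__pycache__", "temp/*"]

def is_ignored(path: str) -> bool:
    """Return True if any ignore pattern matches the path or its basename."""
    name = path.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatch(path, pat) or fnmatch.fnmatch(name, pat)
        for pat in ignore_patterns
    )

print(is_ignored("app/debug.log"))    # True  (*.log matches the basename)
print(is_ignored("src/__pycache__"))  # True
print(is_ignored("src/main.py"))      # False
```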
## Examples

Analyze a single file:

```bash
python appsec_analysis.py path/to/example.py --model gpt-4o --api_key your-api-key
```

Outputs: `security_analysis_report.md` and `security_analysis_details.json`.

Analyze a directory, ignoring certain patterns:

```bash
python appsec_analysis.py path/to/project_dir --model gpt-4o --api_key your-api-key --ignore_paths "*.log" "__pycache__" --append_detailed_reports
```

This generates a consolidated report for the directory, with optional per-file details appended.
## How It Works

1. **File Reading & Chunking**: Reads each file, detects its content type, and chunks large content.
2. **Recursive Analysis**: Subdivides chunks into sections and applies the expert modules at leaf nodes.
3. **Expert Modules**: Each module (e.g., `CodeAnalysisSignature`) uses DSPy to generate findings via LLM prompts; a sketch of such a signature follows this list.
4. **Consolidation**: Aggregates raw findings, de-duplicates them, and synthesizes a cohesive report.
5. **Outputs**: Markdown for summaries; JSON for full analysis trees and raw data.
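For readers unfamiliar with DSPy, a signature declares the typed inputs and outputs of an LLM call. Below is a minimal sketch of what an expert module like `CodeAnalysisSignature` might look like; the field names and docstring are assumptions, not the script's actual definition.

```python
import dspy

class CodeAnalysisSignature(dspy.Signature):
    """Analyze a chunk of source code for security vulnerabilities."""

    code_chunk: str = dspy.InputField(desc="Source code to analyze")
    content_type: str = dspy.InputField(desc="Detected content type, e.g. 'python'")
    findings: str = dspy.OutputField(desc="Potential vulnerabilities with severity and remediation notes")

# A signature is executed through a DSPy module such as ChainOfThought:
analyze = dspy.ChainOfThought(CodeAnalysisSignature)
# result = analyze(code_chunk=chunk, content_type="python"); result.findings
```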
## Limitations

- **POC nature**: Outputs depend on LLM accuracy; expect false positives and false negatives.
- **No execution**: Static analysis only; no runtime testing.
- **LLM dependency**: Requires a capable LLM for best results (e.g., GPT-4 or equivalent).
- **Performance**: Large directories or deep recursion may be slow and expensive due to the volume of LLM calls.
- **Encoding**: Handles UTF-8 and Latin-1; other encodings may fail. A sketch of this fallback is shown below.
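For reference, the UTF-8/Latin-1 fallback mentioned above typically looks like this (a sketch of the assumed behavior, not the script's exact code; `read_text` is a hypothetical helper):

```python
def read_text(path: str) -> str:
    """Read a file as UTF-8, falling back to Latin-1 (assumed fallback order)."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this read cannot raise
        # UnicodeDecodeError -- though it may silently mangle other encodings.
        with open(path, encoding="latin-1") as f:
            return f.read()
```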
## Contributing

Contributions are welcome! Open an issue for bugs or feature requests, or submit a pull request:

1. Fork the repo.
2. Create a branch: `git checkout -b feature-branch`.
3. Commit changes: `git commit -m "Add feature"`.
4. Push: `git push origin feature-branch`.
5. Open a PR.
## License

This project is licensed under the MIT License. See `LICENSE` for details.