Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
89 changes: 89 additions & 0 deletions HybridMSA/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# HybridMSA

HybridMSA is a versatile C++ tool for generating customized multiple sequence alignments (MSAs) for proteins — especially those produced via deep learning-driven motif scaffolding pipelines, though it supports a range of other use cases including fusion/chimeric proteins, tagged proteins, and more!

## How HybridMSA works

![HybridMSA](assets/HybridMSA.png)

HybridMSA processes MSAs (e.g. from AlphaFold2 or other sources) and generates HybridMSAs that:

- Retain evolutionary information where available

- Insert masked tokens elsewhere

- Support downstream applications (e.g., structure prediction using ColabFold's implementation of AlphaFold2)

Importantly, this enables confident and accurate structure prediction for engineered proteins that would otherwise be unsupported or poorly modeled by AlphaFold2.

## Why use HybridMSA?

Tool's relying on MSAs (e.g. AlphaFold2) often fail or perform poorly when:

- A protein includes *de novo* designed regions,

- The protein is a chimera or fusion,

- Tags (e.g., purification tags) are added,

- Genetic searches return no meaningful hits.

HybridMSA solves these issues by seperating genetic searching and MSA generation, enabling the combination of meaningful evolutionary content with masked regions such that the performance of AlphaFold2 increases tremendously.

## Benchmarking HybridMSA

![HybridMSA](assets/Benchmarking.png)

In a benchmarking case of 1000 motif scaffolds generated via a deep learning-driven protein design pipeline, HybridMSA outperformed ColabFold's implementation of AlphaFold2 under default settings (CF) and single-sequence mode (CFss) with regard to accuracy and confidence of predicted models. We attribute this to HybridMSA’s ability to preserve evolutionary context while avoiding noisy or misleading genetic searches for artificial regions. In addition, HybridMSA enables the bypassing of time-consuming genetic and structural database searches, decreasing structure predicton time significantly compared to CF.

## Installation
Follow the steps below to install dependencies and compile the source code.

1. Download reformat.pl

HybridMSA relies on HH-suite’s [reformat.pl](https://github.com/soedinglab/hh-suite/blob/master/scripts/reformat.pl) to standardize MSA formats.

2. Compile the three C++ utility programs

g++ -std=c++17 src/parse_msa_bfd_only.cpp -o parse_msa_bfd_only

g++ -std=c++17 src/prepare_msa_A.cpp -o prepare_msa_A -lstdc++fs

g++ -std=c++17 src/prepare_msa_B.cpp -o prepare_msa_B

## Usage

**parse_msa_bfd_only** processes the BFD MSA (in A3M format) from AlphaFold2 genetic searching to extract specific contiguous regions (contigs) from each aligned sequence; these contigs are specified by exact substring matches to the query (first) sequence, and the resulting sub-alignments are saved for downstream use. While it is optimized for BFD-style A3M inputs, the program is format-flexible and supports any input readable by reformat.pl, including: fas, a2m, a3m, sto, psi, clu.

Usage: ./parse_msa_bfd_only /path/to/msas/directory contig1 [contig2] ...

- <reformat.pl>: Path to the reformat.pl script from HH-suite

- <msa_dir>: Path to the directory containing bfd_uniclust_hits.a3m

- <contigX>: Substring(s) to extract from aligned sequences

**prepare_msa_A** generates a HybridMSA for the first sequence in an input FASTA file by aligning each specified contig exactly to its corresponding position within the full input sequence and then filling all other (non-contig) positions with masked tokens to indicate missing or unaligned regions. This approach enables rapid construction of MSAs even when contigs occur in arbitrary positions or in different orders within the input sequences, by preserving positional accuracy and marking gaps clearly. Each sequence within the input FASTA file should be the same length and contigs should be in the same positions (e.g. output FASTA file from ProteinMPNN).

Usage: ./prepare_msa_A input_file.fasta /path/to/output/of/run_divide_msa_bfd_only/directory contig1 [contig2] ...

- <input_fasta>: Path to the input FASTA file containing full-length sequences

- <msa_dir>: Directory containing sub-alignments for each contig (e.g., contig1.txt, contig2.txt)

- <contigX>: One or more substring(s) defining contigs to align and assemble into a HybridMSA

**prepare_msa_B** generates a HybridMSA for each sequence in an input FASTA file with proper formatting for ColabFold input by prepending sequence-specific headers to the A3M file output from prepare_msa_A. It supports different cardinalities (e.g., "monomer" or "homotrimer") and formats output filenames accordingly.

Usage: ./prepare_msa_B input_file.fasta fasta_file a3m_file cardinality

- <input_fasta>: Path to the input FASTA file containing full-length sequences

- <a3m_file>: Path to the output A3M file from prepare_msa_A corresponding to the input FASTA file

- <cardinality>: Specifies assembly type, e.g., "monomer" or "homotrimer"

## Citation
This method is described in a recent publication: Kriews and Agostino, et al. (2025)

Please cite this work if you use HybridMSA in your research!
Binary file added HybridMSA/assets/Benchmarking.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added HybridMSA/assets/HybridMSA.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
210 changes: 210 additions & 0 deletions HybridMSA/src/parse_msa_bfd_only.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <algorithm>
#include <filesystem>
#include <regex>
#include <cstdlib>
#include <sstream>

namespace fs = std::filesystem;

std::string exec(const char* cmd) {
std::array<char, 128> buffer;
std::string result;
std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(cmd, "r"), pclose);
if (!pipe) {
throw std::runtime_error("popen() failed!");
}
while (fgets(buffer.data(), buffer.size(), pipe.get()) != nullptr) {
result += buffer.data();
}
return result;
}

std::vector<std::string> split(const std::string& s, char delimiter) {
std::vector<std::string> tokens;
std::string token;
std::istringstream tokenStream(s);
while (std::getline(tokenStream, token, delimiter)) {
tokens.push_back(token);
}
return tokens;
}

int main(int argc, char* argv[]) {
if (argc < 4) { // Minimum arguments: program name, MSA path, and at least one substring
std::cerr << "Usage: " << argv[0] << " <path_to_reformat.pl> <path_to_msas_directory> <contig1> [contig2] ..." << std::endl;
return 1;
}

// Get MSA directory path
std::string reformat = argv[1];

// Get MSA directory path
std::string path = argv[2];

// Get contigs
std::vector<std::string> substrings;
for (int i = 3; i < argc; ++i) {
substrings.push_back(argv[i]);
}

// Print contigs
std::cout << "MSA directory: " << path << std::endl;
std::cout << "Contigs:" << std::endl;
for (const auto& substring : substrings) {
std::cout << " - " << substring << std::endl;
}

// Reformat bfd_uniclust_hits.a3m
std::string cmd = reformat + " a3m a3m " + path + "/bfd_uniclust_hits.a3m bfd_uniclust.a3m";
system(cmd.c_str());

// Reformat bfd_uniclust.a3m
cmd = "awk '/^>/ {if (seq) print seq; printf \"%s\\n\", $0; seq = \"\"; next} {gsub(/[ \\t\\r\\n]/,\"\"); seq = seq $0} END {if (seq) print seq}' bfd_uniclust.a3m > bfd_uniclust_reformatted.a3m";
system(cmd.c_str());

// Concatenate reformatted files
system("cat bfd_uniclust_reformatted.a3m > msa.a3m");

std::string msa = "msa.a3m";
std::string seq;

// Read the second line of msa.a3m
std::ifstream msa_file(msa);
std::string line;
std::getline(msa_file, line); // Skip first line
std::getline(msa_file, seq);

// Calculate start and end positions
std::vector<int> starts, ends;
for (const auto& substring : substrings) {
std::vector<size_t> positions;
size_t pos = seq.find(substring);

while (pos != std::string::npos) {
positions.push_back(pos);
pos = seq.find(substring, pos + 1); // Look for the next occurrence
}

if (positions.empty()) {
std::cout << "Warning: No positions found for substring '" << substring << "'" << std::endl;
} else if (positions.size() > 1) {
std::cout << "Warning: Multiple positions found for substring '" << substring << "'" << std::endl;
std::cout << "Positions: ";
for (const auto& p : positions) {
std::cout << p << " ";
}
std::cout << std::endl;

starts.push_back(positions[0]);
ends.push_back(positions[0] + substring.length() - 1);
} else {
starts.push_back(positions[0]);
ends.push_back(positions[0] + substring.length() - 1);
}
}

// Print starts and ends
for (int start : starts) std::cout << start << " ";
std::cout << std::endl;
for (int end : ends) std::cout << end << " ";
std::cout << std::endl;

// Create output files
std::ifstream input_msa(msa);
std::ofstream target_names("target_names.txt");
std::vector<std::ofstream> substring_files;
for (const auto& substring : substrings) {
substring_files.emplace_back(substring + ".txt");
}

// Populate output files
int index = 0;
std::getline(input_msa, line);
std::getline(input_msa, line);
while (std::getline(input_msa, line)) {
if (index % 2 == 0) {
target_names << line << std::endl;
} else {
int counter = 0;
int saved_position = 0;
int found = 0;

for (int i = 0; i < line.length(); ++i) {
char c = line[i];
if (std::find(starts.begin(), starts.end(), counter) != starts.end() &&
(c == '-' || std::isupper(c))) {
saved_position = i;
int checkSum = i - std::count_if(line.begin(), line.begin() + i,
[](char ch) { return std::islower(ch); });
if (starts[found] != checkSum) {
std::cout << "Element " << starts[found] << " is not equal to " << checkSum << std::endl;
}
} else if (std::find(ends.begin(), ends.end(), counter) != ends.end() &&
(c == '-' || std::isupper(c))) {
substring_files[found] << line.substr(saved_position, i - saved_position + 1) << std::endl;
int checkSum = i - saved_position + 1 -
std::count_if(line.begin() + saved_position, line.begin() + i + 1,
[](char ch) { return std::islower(ch); });
if (substrings[found].length() != checkSum) {
std::cout << "Element " << substrings[found].length() << " is not equal to " << checkSum << std::endl;
}
++found;
}
if (c == '-' || std::isupper(c)) {
++counter;
}
}
}
++index;
}

// Remove lines where all contigs are blank
std::vector<int> lines_to_remove;
int line_count = 0;
std::ifstream count_file("target_names.txt");
while (std::getline(count_file, line)) ++line_count;

for (int i = 1; i <= line_count; ++i) {
bool non_dash = false;
for (const auto& substring : substrings) {
std::ifstream file(substring + ".txt");
std::string line;
for (int j = 1; j < i; ++j) std::getline(file, line);
std::getline(file, line);
if (std::regex_search(line, std::regex("[A-WYZ]"))) {
non_dash = true;
break;
}
}
if (!non_dash) {
lines_to_remove.push_back(i);
std::cout << "rm " << i << std::endl;
}
}

if (!lines_to_remove.empty()) {
std::string sed_command = "sed -i '";
for (int line : lines_to_remove) {
sed_command += std::to_string(line) + "d;";
}
sed_command.pop_back();
sed_command += "' target_names.txt";
system(sed_command.c_str());

for (const auto& substring : substrings) {
sed_command = "sed -i '";
for (int line : lines_to_remove) {
sed_command += std::to_string(line) + "d;";
}
sed_command.pop_back();
sed_command += "' " + substring + ".txt";
system(sed_command.c_str());
}
}

return 0;
}
Loading