antismash
diff --git a/docs/img/knownclusterblast.png b/docs/img/knownclusterblast.png
-7.36 KB
diff --git a/docs/modules/clusterblast.md b/docs/modules/clusterblast.md
Lines changed: 13 additions & 18 deletions
@@ -16,6 +16,15 @@ with a minimum percentage identity between genes of 30%.
 
 It is normal to have multiple genes hitting for some types of genes (e.g. modular systems such as NRPS or Type I PKS clusters).
 
+As gene hits are not required to have 100% identity and query genes may hit multiple reference genes,
+every gene having a match does not guarantee that the region is exactly the same.
+In the case of KnownClusterBlast, this also means that there is no guarantee that the compound(s) recorded for that MIBiG entry will be produced by the region.
+
+Even if 100% of genes have a hit for a reference, it may be less relevant than a reference with fewer genes hit.
+Some cluster types, e.g. NRPS clusters, may only need a few amino acids changed in gene translations to have a completely different product.
+
+In all cases, manual verification is required before assuming that the region produces the same compound as the reference.
+
 ### Ranking system
 
 Reference areas are sorted based on an empirical similarity score and then,
@@ -30,23 +39,9 @@ The emprical similarity score is calculated as `h + H + s + S +
 - `B` is a core gene bonus
 
 If the similarity scores are equal for multiple references, they are then ranked based on
-the cumulative BlastP bit scores between the gene clusters.
-
-### Similarity percentage
-
-Similarity in the description, e.g. `87% of genes show similarity`,
-is the percentage of genes within the reference that have a hit to any genes in the query.
-
-As gene hits are not required to be 100% identity and query genes may hit multiple reference genes,
-this total similarity percentage is no guarantee that the region is exactly the same.
-In the case of KnownClusterBlast, this also means that there is no guarantee that the compound(s) recorded for that MIBiG entry will be produce by the region.
-
-Even if 100% of genes have a hit for a reference, it may be less relevant than a lower similarity.
-Some cluster types, e.g. NRPS clusters, may only need a few aminos changed in gene translations to have a completely different product.
-
-In all cases, manual verification is required before assuming that the region produces the same compound as the reference.
+the cumulative BlastP bit scores of those references.
 
-#### Example 1: low similarity, good match
+#### Example 1: not all genes, but good match
 Reference area `R` has 70% of genes showing similarity to the query region `Q`.
 All genes with hits are very high identity in their hits, at 95% or higher.
 
@@ -57,14 +52,14 @@ but are outside `Q` due to the size of `R` being exceptionally large.
 After manually checking these extra genes and seeing that they're similar to the missing genes,
 it's much, much more likely that the genome matches the reference.
 
-#### Example 2: perfect similarity, poor match
+#### Example 2: all genes, but poor match
 Reference area `R` has 100% of genes showing similarity to the query region `Q`.
 None of the genes have a percentage identity in individual hits greater than 60%.
 
 While it is still possible that `Q` produces the same compound as `R`,
 it will depend a great deal on the type of cluster and exactly which parts of the genes are similar.
 
-#### Example 3: high similarity, poor match
+#### Example 3: most genes, but poor match
 Reference area `R` has very high (but not 100%) similarity, with all but one gene in `R` having similarity to genes in the query region `Q`.
 All of the matching genes have very high identity in their hits.
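
The examples above separate two things: how many genes in the reference have a hit, and how strong those hits are. A minimal sketch of that distinction follows. It is not antiSMASH code; the `Hit` record, the `summarise` function, and the gene names are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One hit between a reference gene and a query gene (hypothetical record)."""
    reference_gene: str
    query_gene: str
    identity: float  # percentage identity of the hit, 0-100


def summarise(reference_genes: list[str], hits: list[Hit]) -> tuple[float, float]:
    """Return (percentage of reference genes with any hit,
    lowest best-hit identity among the reference genes that were hit)."""
    known = set(reference_genes)
    best: dict[str, float] = {}
    for hit in hits:
        if hit.reference_gene not in known:
            continue  # ignore hits to genes outside this reference
        # a reference gene may be hit by several query genes; keep the strongest
        best[hit.reference_gene] = max(best.get(hit.reference_gene, 0.0), hit.identity)
    coverage = 100.0 * len(best) / len(reference_genes)
    weakest = min(best.values()) if best else 0.0
    return coverage, weakest


reference = [f"r{i}" for i in range(10)]

# Like Example 1: only 7 of 10 reference genes are hit, but every hit is strong.
example_1 = [Hit(f"r{i}", f"q{i}", 95.0 + i * 0.5) for i in range(7)]

# Like Example 2: every reference gene is hit, but no hit exceeds 60% identity.
example_2 = [Hit(f"r{i}", f"q{i}", 60.0 - i) for i in range(10)]

print(summarise(reference, example_1))
print(summarise(reference, example_2))
```

Reporting the two numbers side by side makes it harder to mistake "all genes have a hit" for "the region is the same". Neither number replaces the manual verification described above.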