docs/modules/clusterblast.md
+13 -18 (13 additions & 18 deletions)
@@ -16,6 +16,15 @@ with a minimum percentage identity between genes of 30%.
 
 It is normal to have multiple genes hitting for some types of genes (e.g. modular systems such as NRPS or Type I PKS clusters).
 
+As gene hits are not required to be 100% identity and query genes may hit multiple reference genes,
+all genes having a match is not a guarantee that the region is exactly the same.
+In the case of KnownClusterBlast, this also means that there is no guarantee that the compound(s) recorded for that MIBiG entry will be produced by the region.
+
+Even if 100% of genes have a hit for a reference, it may be less relevant than a lower similarity.
+Some cluster types, e.g. NRPS clusters, may only need a few aminos changed in gene translations to have a completely different product.
+
+In all cases, manual verification is required before assuming that the region produces the same compound as the reference.
+
 ### Ranking system
 
 Reference areas are sorted based on an empirical similarity score and then,
@@ -30,23 +39,9 @@ The emprical similarity score is calculated as `h + H + s + S +
 -`B` is a core gene bonus
 
 If the similarity scores are equal for multiple references, they are then ranked based on
-the cumulative BlastP bit scores between the gene clusters.
-
-### Similarity percentage
-
-Similarity in the description, e.g. `87% of genes show similarity`,
-is the percentage of genes within the reference that have a hit to any genes in the query.
-
-As gene hits are not required to be 100% identity and query genes may hit multiple reference genes,
-this total similarity percentage is no guarantee that the region is exactly the same.
-In the case of KnownClusterBlast, this also means that there is no guarantee that the compound(s) recorded for that MIBiG entry will be produce by the region.
-
-Even if 100% of genes have a hit for a reference, it may be less relevant than a lower similarity.
-Some cluster types, e.g. NRPS clusters, may only need a few aminos changed in gene translations to have a completely different product.
-
-In all cases, manual verification is required before assuming that the region produces the same compound as the reference.
+the cumulative BlastP bit scores of those references.
 
-#### Example 1: low similarity, good match
+#### Example 1: not all genes, but good match
 Reference area `R` has 70% of genes showing similarity to the query region `Q`.
 All genes with hits are very high identity in their hits, at 95% or higher.

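The ranking rule described in this hunk (similarity score first, cumulative BlastP bit score as the tie-breaker) can be sketched as follows. This is a minimal illustration and not antiSMASH's own code: the `ReferenceHit` class, its field names, the sample values, and the assumption that higher values rank first are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceHit:
    """Hypothetical container for one reference area's results."""
    name: str
    similarity_score: int       # the empirical similarity score
    cumulative_bitscore: float  # sum of BlastP bit scores for this reference


def rank_references(hits):
    """Sort by similarity score, breaking ties on cumulative bit score.

    Assumes (not confirmed by the docs) that higher is better for both values.
    """
    return sorted(
        hits,
        key=lambda hit: (hit.similarity_score, hit.cumulative_bitscore),
        reverse=True,
    )


# Made-up values: ref_a and ref_c tie on similarity score
hits = [
    ReferenceHit("ref_a", similarity_score=12, cumulative_bitscore=3400.0),
    ReferenceHit("ref_b", similarity_score=15, cumulative_bitscore=2100.0),
    ReferenceHit("ref_c", similarity_score=12, cumulative_bitscore=5200.0),
]
ranked_names = [hit.name for hit in rank_references(hits)]
print(ranked_names)  # ['ref_b', 'ref_c', 'ref_a']
```

Here `ref_b` ranks first on similarity score alone, while the tie between `ref_a` and `ref_c` is resolved by the larger cumulative bit score.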
@@ -57,14 +52,14 @@ but are outside `Q` due to the size of `R` being exceptionally large.
 After manually checking these extra genes and seeing that they're similar to the missing genes,
 it's much, much more likely that the genome matches the reference.
 
-#### Example 2: perfect similarity, poor match
+#### Example 2: all genes, but poor match
 Reference area `R` has 100% of genes showing similarity to the query region `Q`.
 None of the genes have a percentage identity in individual hits greater than 60%.
 
 While it is still possible that `Q` produces the same compound as `R`,
 it will depend a great deal on the type of cluster and exactly which parts of the genes are similar.
 
-#### Example 3: high similarity, poor match
+#### Example 3: most genes, but poor match
 Reference area `R` has very high (but not 100%) similarity, with all but one gene in `R` having similarity to genes in the query region `Q`.
 All of the matching genes have very high identity in their hits.
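
The examples in this diff separate two measurements: the share of reference genes that have any hit, and how strong those hits are. The sketch below computes both for made-up data shaped like Examples 1 and 2. The function name, the input layout (a mapping from reference gene to the percentage identities of its hits), and the gene names are assumptions for illustration only, not part of antiSMASH.

```python
def summarise_reference(reference_genes, hit_identities):
    """Return (percentage of reference genes with a hit, lowest best-hit identity).

    reference_genes: names of all genes in the reference area
    hit_identities: maps a reference gene name to the percentage identities
        of its hits against query genes (genes without hits are absent)
    """
    if not reference_genes:
        raise ValueError("reference area has no genes")
    # Keep only the strongest hit for each reference gene that has any hit
    best = {
        gene: max(hit_identities[gene])
        for gene in reference_genes
        if hit_identities.get(gene)
    }
    percent_with_hits = 100 * len(best) / len(reference_genes)
    lowest_identity = min(best.values()) if best else None
    return percent_with_hits, lowest_identity


genes = [f"gene_{i}" for i in range(10)]

# Shaped like Example 1: 70% of genes hit, all at 95% identity or higher
example_1 = summarise_reference(genes, {g: [96.0] for g in genes[:7]})
# Shaped like Example 2: every gene hit, none above 60% identity
example_2 = summarise_reference(genes, {g: [45.0, 58.0] for g in genes})

print(example_1)  # (70.0, 96.0)
print(example_2)  # (100.0, 58.0)
```

Neither number on its own shows that the region produces the reference compound; as the documentation text states, manual verification is still required.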