Merged
2 changes: 1 addition & 1 deletion dialectid/__init__.py
@@ -20,7 +20,7 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

-__version__ = '0.1.4'
+__version__ = '0.1.5'

# from dialectid.text_repr import BoW, SeqTM
from dialectid.model import DialectId, BoW
5 changes: 3 additions & 2 deletions dialectid/model.py
@@ -103,7 +103,8 @@ def decision_function(self, texts: list):
X = np.c_[-X[:, 0], X[:, 0]]
return X

-def positive(self, texts: list):
+def positive(self, texts: list,
+margin: float=0):
"""Positive classes"""
X = self.transform(texts)
X_df = X
@@ -115,7 +116,7 @@ def positive(self, texts: list):
X = X_df
output = []
labels = self.countries
-for mask, value in zip(X_df > 0, X):
+for mask, value in zip(X_df > margin, X):
_ = {str(k): v for k, v in zip(labels[mask], value[mask])}
output.append(_)
return output
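
The effect of the new `margin` parameter can be illustrated outside the library. The following is a minimal pure-Python sketch, not the code in `dialectid/model.py`: the helper `positive_labels`, the label list, and the score values are all hypothetical, and it assumes the values being returned are the same ones compared against the margin.

```python
def positive_labels(scores, labels, margin=0.0):
    """Return, per text, the labels whose score is strictly greater than margin.

    Hypothetical stand-in for the filtering step; `scores` holds one row of
    decision values per text, aligned with `labels`.
    """
    output = []
    for row in scores:
        # Keep only the (label, value) pairs that clear the threshold.
        output.append({str(label): value
                       for label, value in zip(labels, row)
                       if value > margin})
    return output


# Made-up decision values for a single text; real values come from the model.
labels = ['mx', 'gt', 'ar']
scores = [[1.7, 0.4, -0.9]]

default = positive_labels(scores, labels)            # threshold at 0
strict = positive_labels(scores, labels, margin=1)   # stricter threshold
```

With these made-up values, the default call keeps both labels with positive scores, while `margin=1` keeps only the label whose score exceeds 1. Because the comparison is strict, a score exactly equal to the margin is excluded.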
2 changes: 2 additions & 0 deletions dialectid/tests/test_model.py
@@ -146,6 +146,8 @@ def test_DialectId_positive():
dial = DialectId(lang='es')
output1 = dial.positive(['comiendo unos tacos'])
assert 'mx' in output1[0] and 'gt' in output1[0]
+output2 = dial.positive(['comiendo unos tacos'], margin=1)
+assert len(output1[0]) > len(output2[0])
dial.probability = True
output2 = dial.positive(['comiendo unos tacos'])
assert len(output2) == len(output1)
30 changes: 28 additions & 2 deletions quarto/dialectid.qmd
@@ -437,8 +437,6 @@ detect.predict(['comiendo unos tacos',
```
:::

-## Column
-
::: {.card title='Countries' .flow}
The available dialects for each language can be identified in the attribute `countries`, as seen in the following snippet for Spanish.

@@ -452,6 +450,8 @@ detect.countries
```
:::

+## Column
+
::: {.card title='Decision Function' .flow}
One might be interested in all the countries from which the speaker could come; the `decision_function` method supports this. DialectId uses linear Support Vector Machines (SVM) as classifiers; consequently, a positive value in the `decision_function` output is interpreted as membership in the positive class, i.e., a particular country. The following code exemplifies this idea: the first two lines import and instantiate the `DialectId` class in Spanish. The third line computes the decision-function values; the method returns a two-dimensional array whose first dimension corresponds to the number of texts. In this case, it keeps only the decision-function values, where the positive values indicate the presence of the particular country. The fourth line computes the indices that sort the values in descending order, so the highest value comes first. The fifth line retrieves each country and its associated decision-function value, considering only those countries with positive values.

@@ -468,6 +468,20 @@ index = df.argsort()[::-1]
```
:::

+::: {.card title='Positive class' .flow}
+When only the positive classes are of interest, as described in the previous example, the `DialectId.positive` method retrieves the positive labels for each text in a list, as shown in the following example.
+
+```{python}
+#| echo: true
+#| label: positive-default
+
+from dialectid import DialectId
+detect = DialectId(lang='es')
+pos = detect.positive(['acompañando el asado con un buen vino'])[0]
+pos
+```
+:::
+
::: {.card title='Probability' .flow}
In some situations, one is interested in the probability instead of the decision-function values of a linear SVM. The probability can be computed using the `predict_proba` method. The following code exemplifies this idea: the first line imports the `DialectId` class as in previous examples. The second line differs from the last example in that the parameter `probability` is set to true. The rest of the lines are almost equivalent to the previous example.

@@ -482,6 +496,18 @@ index = prob.argsort()[::-1]
[(detect.countries[i], prob[i])
for i in index[:4]]
```
+
+The `DialectId.positive` method can also be used when one is interested in the probabilities of the positive classes, as shown in the following lines.
+
+```{python}
+#| echo: true
+#| label: positive-probability
+
+from dialectid import DialectId
+detect = DialectId(lang='es', probability=True)
+pos = detect.positive(['acompañando el asado con un buen vino'])[0]
+pos
+```
:::

# Corpora