Commit 04f7baa

filterblock: Document block behavior in more detail
Update the documentation for the parameters to reflect the updated types (strings) after the move to YAML-based block configuration. While we're at it, document a list of operations that make sense to use with this block.

Also include examples for cases that warrant more detail:

- The `contains` operation only works with strings.
- All operations can take multiple candidates for the right side of the operation (filter value). The block checks all of them and treats the result as True if any are true.
- The operand order of every operation is: filter_column operator filter_value

Signed-off-by: Russell Bryant <rbryant@redhat.com>
1 parent 7c5c1c3 commit 04f7baa

1 file changed, +52 -2 lines changed
src/instructlab/sdg/filterblock.py

Lines changed: 52 additions & 2 deletions
@@ -91,11 +91,61 @@ def __init__(
         - block_name (str): An identifier for this block.
         - filter_column (str): The name of the column in the dataset to apply the filter on.
         - filter_value (any or list of any): The value(s) to filter by.
-        - operation (callable): A function that takes two arguments (column value and filter value) and returns a boolean indicating whether the row should be included in the filtered dataset.
-        - convert_dtype (callable, optional): A function to convert the data type of the filter column before applying the filter. Defaults to None.
+        - operation (string): The name of a function provided by the "operator"
+          Python package that takes two arguments (column value and filter value)
+          and returns a boolean indicating whether the row should be included in
+          the filtered dataset.
+        - convert_dtype (string, optional): the name of a Python type to convert
+          the column values to. Supported values are "int", "float", and "bool".
+          Defaults to None.
 
         Returns:
         None
+
+        For supported values of `operation`, see the "operator" package
+        documentation: https://docs.python.org/3/library/operator.html
+
+        Only a subset of the "operator" package is relevant. It has to
+        follow the semantics of taking two parameters and returning a boolean.
+        Some operations that work include:
+        - eq: equal to
+        - ne: not equal to
+        - gt: greater than
+        - ge: greater than or equal to
+        - lt: less than
+        - le: less than or equal to
+        - contains: filter_column contains filter_value (only for string columns)
+
+        Note that the semantics of all operations are:
+        - filter_column operation filter_value
+
+        Example: FilterByValueBlock(ctx, "filter_by_age", "age", 30, "eq", "int")
+            - This block will filter the dataset to only include rows where the
+              "age" column is equal to 30.
+
+        The `contains` operator is only supported for string columns. This is
+        useful if you want to ensure that a string column contains a specific
+        substring.
+
+        Example: FilterByValueBlock(ctx, "filter_by_name", "full_name", "John", "contains")
+            - This block will filter the dataset to only include rows where the
+              "full_name" column contains the substring "John".
+
+        `filter_value` does not have to be a single value. It can also be a list of values.
+        In that case, the operation will be applied to each value in the list. The result is
+        considered True if the operation is True for any of the values in the list.
+
+        Example: FilterByValueBlock(ctx, "filter_by_age", "age", [30, 35], "eq", "int")
+            - This block will filter the dataset to only include rows where the
+              "age" column is equal to 30 or 35.
+
+        Example: FilterByValueBlock(ctx, "filter_by_city", "city", ["boston", "charleston", "dublin", "new york"], "eq")
+            - This block will filter the dataset to only include rows where the
+              "city" column is equal to "boston", "charleston", "dublin", or "new york".
+
+        Example: FilterByValueBlock(ctx, "filter_by_name", "full_name", ["John", "Jane"], "contains")
+            - This block will filter the dataset to only include rows where the
+              "full_name" column contains the substring "John" or "Jane".
         """
         super().__init__(ctx, block_name)
         self.value = filter_value if isinstance(filter_value, list) else [filter_value]
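
The semantics described in the docstring can be illustrated with a small standalone sketch. This is not the block's actual implementation: `row_matches`, `filter_rows`, and `_DTYPES` are hypothetical names introduced here, and the sketch assumes rows are plain dicts and that only the column value (not the filter value) is converted. Only the calls into Python's `operator` module are real APIs.

```python
import operator

# Hypothetical mapping from the documented convert_dtype strings to Python types.
_DTYPES = {"int": int, "float": float, "bool": bool}


def row_matches(column_value, filter_values, operation, convert_dtype=None):
    """Return True if `column_value <operation> v` holds for any v in filter_values."""
    op = getattr(operator, operation)  # e.g. "eq" -> operator.eq
    if convert_dtype is not None:
        # Assumption: only the column value is converted in this sketch.
        column_value = _DTYPES[convert_dtype](column_value)
    # The column value is always the left operand, the filter value the right.
    # For "contains", operator.contains(a, b) evaluates `b in a`.
    return any(op(column_value, v) for v in filter_values)


def filter_rows(rows, filter_column, filter_value, operation, convert_dtype=None):
    """Keep the rows whose filter_column satisfies the operation for any filter value."""
    # A single value is wrapped in a list, mirroring the last line of the diff.
    values = filter_value if isinstance(filter_value, list) else [filter_value]
    return [
        row
        for row in rows
        if row_matches(row[filter_column], values, operation, convert_dtype)
    ]


rows = [
    {"full_name": "John Smith", "age": "30"},
    {"full_name": "Jane Doe", "age": "35"},
    {"full_name": "Alex Roe", "age": "41"},
]

# "age" is stored as a string here, so convert_dtype="int" is needed for "eq" to match.
by_age = filter_rows(rows, "age", [30, 35], "eq", "int")
by_name = filter_rows(rows, "full_name", ["John", "Jane"], "contains")
```

Both calls keep the rows for John Smith and Jane Doe. Looking the function up by name with `getattr(operator, ...)` is what lets the operation be given as a plain string, which suits a YAML-based configuration where callables cannot be expressed directly.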
