ALFD (Active Learning For Design) efficiently generates parametric design data using Batch Active Learning.
ALFD is a package for generating targeted design datasets. It can handle custom constraints on both performance values and design parameter values, and it is compatible with any combination of parameter and performance spaces of two or more dimensions.
All relevant code is contained in the following files:
- AL_for_design/quantized_learner
- AL_for_design/helper.py
- AL_for_design/datasetup.py
This work was motivated by my previous research on medical walker design optimization, where undesired performance values were overrepresented, failure rates were high, and surrogate regressor accuracy suffered:
A Data-driven Recommendation Framework for Optimal Walker Designs:
- Full Paper: https://arxiv.org/pdf/2310.18772
- GitHub: https://github.com/AdvaithN1/Walker-Optimization
- ASME IDETC-CIE: https://idetc.secure-platform.com/a/solicitations/228/sessiongallery/17287/application/145053
Install ALFD with:

```
pip install AL_for_design
```

Then run:
```python
import numpy as np
from AL_for_design.quantized_learner import TargetPerformanceHyperparam, ContinuousDesignBound, CategoricalDesignBound, HyperparamDataSetup, QuantizedActiveLearner

# Toy black-box evaluator: maps each design row to three performance values.
def random_regression_problem(X):
    ret = []
    for row in X:
        param1, param2, param3, param4 = row
        ret.append([param1 * param2,
                    param3 ** np.abs(np.sin(param1)) - param4,
                    np.cos(param1 + param2 + param3 + param4)])
    return np.array(ret)

learner = QuantizedActiveLearner(
    HyperparamDataSetup(
        [
            ContinuousDesignBound(3, 9, "FirstExampleParameter"),
            ContinuousDesignBound(5, 10, "SecondExampleParameter"),
            CategoricalDesignBound(["ExampleCategoryA", "ExampleCategoryB", "ExampleCategoryC"], "ThirdExampleParameter"),
            ContinuousDesignBound(1, 5, "FourthExampleParameter"),
        ],
        [
            TargetPerformanceHyperparam(lambda x: np.ones(x.shape[0]), "FirstExamplePerformanceVal"),
            TargetPerformanceHyperparam(lambda x: np.ones(x.shape[0]), "SecondExamplePerformanceVal"),
            TargetPerformanceHyperparam(lambda x: np.ones(x.shape[0]), "ThirdExamplePerformanceVal"),
        ],
    ),
    DESIGN_SPACE_DENSITY=100000,
    UNCERTAINTY_DROP_THRESHOLD=0.01,
    skip_redundancy=True,
)

num_batches = 5
batch_size = 100
for i in range(num_batches):
    queried = learner.query(batch_size)
    deleted = learner.teach(queried, np.ones(batch_size), random_regression_problem(queried))
```

The following steps outline the querying process in the best-performing Active Learner (the "Quantized Active Learner"):
1. For every point in the pool, we calculate the harmonic mean of the uncertainties across all performance regressors, along with the distance matrix to the labeled points.
2. Using the testing data, we set the proximity weight $w$ accordingly: if the regressors have a high average error, we set $w$ high to maximize exploration, whereas if the average error is low, we set $w$ low to maximize exploitation.
3. We "score" each point based on the predicted error from step 1 using the following experimentally derived formula, then normalize the scores into a probability distribution (a runnable sketch of steps 3-8 follows this list): $\text{score} = (w + (1 - w) \cdot \text{error})^{\frac{1}{w}}$
4. We select a point at random from the pool, weighted by the probability distribution.
5. We create a predicted error interval with a width of $0.2$, centered at the predicted error of the selected point.
6. Using the distance matrix, we choose the point farthest from any labeled point among those whose predicted error falls within the interval.
7. We add the chosen point to the batch, remove it from the pool, and recompute the distance matrix, treating the chosen point as labeled.
8. We repeat steps 4-7 until the batch is full.
9. We return the batch.
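The following is a minimal, runnable sketch of steps 3-8 in plain NumPy, not ALFD's internal implementation. It assumes the pool is already numerically encoded, that predicted errors are normalized to $[0, 1]$, and that the proximity weight lies in $(0, 1]$; `select_batch`, `interval_width`, and the other names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_batch(pool, pred_error, dist_to_labeled, proximity_weight,
                 batch_size, interval_width=0.2):
    """Hypothetical batch selection following steps 3-8 above.

    pool:            (n, d) numerically encoded pool points
    pred_error:      (n,) predicted error per point, normalized to [0, 1]
    dist_to_labeled: (n,) distance from each point to its nearest labeled point
    """
    pred_error = np.asarray(pred_error, dtype=float)
    dist_to_labeled = np.asarray(dist_to_labeled, dtype=float).copy()
    active = np.ones(len(pool), dtype=bool)  # points still in the pool
    w = proximity_weight                     # assumed to lie in (0, 1]
    batch = []
    for _ in range(batch_size):
        idx = np.flatnonzero(active)
        # Step 3: experimentally derived score, normalized to a distribution.
        scores = (w + (1.0 - w) * pred_error[idx]) ** (1.0 / w)
        probs = scores / scores.sum()
        # Step 4: sample one pool point, weighted by the distribution.
        seed = rng.choice(idx, p=probs)
        # Step 5: error interval of width 0.2 centered at the sampled point.
        lo = pred_error[seed] - interval_width / 2
        hi = pred_error[seed] + interval_width / 2
        in_band = active & (pred_error >= lo) & (pred_error <= hi)
        # Step 6: among in-band points, take the one farthest from labeled data.
        band_idx = np.flatnonzero(in_band)
        chosen = band_idx[np.argmax(dist_to_labeled[band_idx])]
        # Step 7: move it into the batch and treat it as labeled from now on.
        batch.append(chosen)
        active[chosen] = False
        dist_to_labeled = np.minimum(
            dist_to_labeled, np.linalg.norm(pool - pool[chosen], axis=1))
    return batch

# Toy usage with random stand-in data.
pool = rng.random((500, 4))
batch = select_batch(pool, rng.random(500), rng.random(500),
                     proximity_weight=0.5, batch_size=10)
```

Sampling a seed point by score and then taking the farthest in-band point couples exploitation (high predicted error) with exploration (distance from labeled data) in a single selection.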
The following steps outline the teaching process for the Active Learner. This involves deleting points from the pool that are confidently invalid.
- We uniformly select 20% of the training data as a held-out testing split, which is used for error estimation.
- We retrain the invalidity classifiers and performance regressors with the training data (not including the testing data).
- We use the following experimentally derived formula to score the probability that a point is valid: $\displaystyle\prod_{i=1}^{n} P_i^{C_i}$, where $n$ is the number of performance values, $P_i$ is the validity probability predicted by the $i$-th performance value's validity classifier, and $C_i$ is a confidence value computed as a function of the distance to the nearest labeled point. Note that the scores of low-confidence points are biased towards 1, which lowers the chance of points being falsely dropped from the pool.
- We drop points whose validity score falls below a certain threshold (a sketch of this scoring follows this list).
- To detect redundant performance values, we train a regressor to predict one performance value from all the others; if the regressor's accuracy is better than a certain threshold, we mark that performance value as redundant. We repeat this for every performance value.
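Here is a minimal sketch of the validity score under stated assumptions. This README does not specify the confidence curve, so the exponential decay below (and its `decay` parameter) is purely a placeholder, as are the example data and threshold:

```python
import numpy as np

def validity_scores(valid_probs, dist_to_labeled, decay=5.0):
    """Hypothetical validity score: prod_i P_i ** C_i.

    valid_probs:     (n, k) predicted probability that each of n points is
                     valid, one column per performance-value classifier
    dist_to_labeled: (n,) normalized distance to the nearest labeled point
    """
    # Placeholder confidence curve: near labeled data -> confidence ~ 1,
    # far away -> confidence ~ 0. The actual curve used by ALFD may differ.
    confidence = np.exp(-decay * dist_to_labeled)
    # P ** C -> 1 as C -> 0, so low-confidence points are biased toward a
    # score of 1 and are unlikely to be falsely dropped from the pool.
    return np.prod(valid_probs ** confidence[:, None], axis=1)

probs = np.array([[0.9, 0.2],   # point near labeled data, one low P_i
                  [0.4, 0.5]])  # point far from labeled data
scores = validity_scores(probs, np.array([0.05, 0.9]))
keep = scores >= 0.1            # hypothetical drop threshold
```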
To estimate the error of a performance value regressor, we calculate the residuals of its predictions on the testing data (the testing split selected during teaching, as described above).
We then normalize the residuals and train a KNN to predict the error for any point in the design space. We use a KNN because its output range is limited to the range of its training targets, which ensures that the predicted errors always stay between 0 and 1.
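A minimal sketch of this error estimator, assuming scikit-learn's `KNeighborsRegressor` and synthetic arrays in place of ALFD's internal data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_test = rng.random((200, 4))                     # held-out test designs (encoded)
y_test = rng.random(200)                          # true performance values
y_pred = y_test + 0.1 * rng.standard_normal(200)  # surrogate predictions

# Residuals of the performance regressor on the test split, scaled to [0, 1].
residuals = np.abs(y_pred - y_test)
norm_residuals = residuals / residuals.max()

# A KNN's prediction is an average of training targets, so its output can
# never leave the [0, 1] range of the normalized residuals.
error_model = KNeighborsRegressor(n_neighbors=5)
error_model.fit(X_test, norm_residuals)

X_pool = rng.random((1000, 4))
pred_error = error_model.predict(X_pool)          # always within [0, 1]
```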
To compute the distance between two points, we first encode each categorical component as a one-hot vector. This ensures that Euclidean distance computations do not depend on category order. However, the distance between two one-hot vectors needs to be normalized, which can be done by dividing by $\sqrt{2}$, the Euclidean distance between any two distinct one-hot vectors.
Using the above distance computation strategy, we can form a normalized distance matrix from every point in the pool to every labeled point.
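The following sketch illustrates this encoding, assuming continuous components pass through unchanged; scaling each one-hot vector by $1/\sqrt{2}$ is equivalent to dividing the resulting categorical distances by $\sqrt{2}$. The `encode` helper and its `categories` argument are illustrative, not ALFD's API:

```python
import numpy as np

def encode(row, categories):
    """Encode one design point: continuous components pass through, each
    categorical component becomes a one-hot vector scaled by 1/sqrt(2)."""
    out = []
    for value, cats in zip(row, categories):
        if cats is None:                 # continuous component
            out.append(float(value))
        else:                            # categorical component
            onehot = np.zeros(len(cats))
            onehot[cats.index(value)] = 1.0
            out.extend(onehot / np.sqrt(2))
    return np.array(out)

# Two distinct one-hot vectors differ in exactly two slots, so their raw
# Euclidean distance is sqrt(2); the 1/sqrt(2) scaling normalizes it to 1.
categories = [None, None, ["A", "B", "C"]]   # two continuous, one categorical
a = encode([3.0, 7.5, "A"], categories)
b = encode([3.0, 7.5, "B"], categories)
print(np.linalg.norm(a - b))                 # 1.0
```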
In the above diagrams, lighter colors indicate higher predicted error. The black dots are points selected for querying, while the red and blue points are labeled. ALFD queries regions of higher predicted error more densely, while still minimizing proximity to labeled points.

We do not rigorously test this query strategy; rather, we propose it as a framework for novel approaches to Active Learning for Design.
Open-sourced under the MIT License. See LICENSE for more information.




