Automated tests for tool metadata quality (tool names and parameter names/descriptions), while not foolproof, would make it much easier to confidently change existing servers without worrying about regressing existing use cases. Such tests could catch issues like a new tool whose description collides with another tool's and confuses LLMs. We should be able to write example tests using, e.g., Mosaic AI Agent Evaluation or open-source eval frameworks; a rough sketch of one such test follows.
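
As a minimal sketch of the collision check described above: the test below compares every pair of tool descriptions with a cheap lexical similarity metric and fails when two descriptions are near-duplicates. The `TOOLS` list and the similarity threshold are hypothetical stand-ins; a real test would pull metadata from the server's actual tool registry, and an embedding- or LLM-judge-based metric (e.g., via an eval framework) would be a stronger signal than string matching.

```python
# Sketch of a metadata-collision test. TOOLS is a hypothetical placeholder;
# replace it with however your server exposes its tool definitions.
from difflib import SequenceMatcher
from itertools import combinations

import pytest

TOOLS = [
    ("list_tables", "List all tables in the given schema."),
    ("run_query", "Execute a SQL query and return the results."),
]

# Tune empirically; highly overlapping descriptions tend to confuse LLMs.
SIMILARITY_THRESHOLD = 0.9


def description_similarity(a: str, b: str) -> float:
    """Cheap lexical similarity; an embedding-based metric would be stronger."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@pytest.mark.parametrize("pair", list(combinations(TOOLS, 2)))
def test_tool_descriptions_do_not_collide(pair):
    (name_a, desc_a), (name_b, desc_b) = pair
    score = description_similarity(desc_a, desc_b)
    assert score < SIMILARITY_THRESHOLD, (
        f"Descriptions for {name_a!r} and {name_b!r} are {score:.0%} similar; "
        "near-duplicate descriptions can confuse tool-calling LLMs."
    )
```

The same pattern extends to other metadata checks (duplicate tool names, empty parameter descriptions), and the assertion could be swapped for an LLM-judged quality score if a framework like Mosaic AI Agent Evaluation is adopted.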