Mis-segmented lines (detected in non-text areas of a page) that produce erroneous OCR output should be deletable as a whole line, not just token by token.