Human Evaluation vs Automated Metrics: Which Works for AI Testing?