Investigate automated evaluation approaches
To complement #11, we also need a way to run automated evaluations of the agent together with the chosen LLM. In this case, we'd have a sequence of user interactions, run them through the agent, collect the outputs, and judge their quality (whether the judging itself is automated or manual is also up for investigation). A rough sketch of such a harness is below.
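
This is only a sketch of the shape such a harness could take: `run_agent`, `judge_response`, and the `Turn` structure are hypothetical placeholders, not existing code in this repo, and the judging step could equally be an LLM-as-judge prompt or a manual review queue.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    user_message: str
    expected_points: str  # what a good answer should cover, for the judge


def run_agent(history: list[dict], user_message: str) -> str:
    """Placeholder: call the agent + chosen LLM and return its reply."""
    return "agent reply goes here"  # replace with the real agent invocation


def judge_response(user_message: str, response: str, expected_points: str) -> float:
    """Placeholder: return a 0-1 quality score (LLM-as-judge or manual)."""
    # Trivial stand-in: reward mentioning the expected points at all.
    return 1.0 if expected_points.lower() in response.lower() else 0.0


def evaluate_conversation(turns: list[Turn]) -> list[float]:
    """Replay a scripted sequence of user interactions and score each reply."""
    history: list[dict] = []
    scores: list[float] = []
    for turn in turns:
        response = run_agent(history, turn.user_message)
        scores.append(judge_response(turn.user_message, response, turn.expected_points))
        history += [
            {"role": "user", "content": turn.user_message},
            {"role": "assistant", "content": response},
        ]
    return scores
```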
We should look at the research literature on this, and also at the capabilities LangFuse offers for it, since we will already be using it for observability.
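
For reference, LangFuse can attach evaluation scores to traces, which could connect the harness above to the observability data we already collect. A minimal sketch, assuming the v2 Python SDK's `score()` call and the usual `LANGFUSE_*` environment variables; the trace id, score name and value are made-up examples:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

# Attach an evaluation score to an existing trace of an agent run.
langfuse.score(
    trace_id="replace-with-trace-id-of-the-agent-run",
    name="answer_quality",
    value=0.8,  # e.g. the output of judge_response() above, or a manual rating
    comment="judged against the expected answer for this interaction",
)
```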