Are we truly ready for large model application development, or are we still stuck in the mindset of “as long as it works”?
Over the past few decades, software engineering has focused on reducing system risk and uncertainty. Methodologies such as TDD, BDD, and DDD each help us minimize project uncertainty and keep systems stable while the business grows quickly. However, the advent of large language models (LLMs) challenges this stability, introducing a new reality: not everything will be stable anymore.
The inherent complexity of LLMs means their behavior is not as stable or predictable as code-based logic. In traditional software engineering, with well-designed code, we expect a deterministic output for a given input. But with LLMs, we can’t be sure that the same input will always yield the same output. This introduces a crack in our traditional software engineering approach: while the logic of your system may be clear, the output can still be random, introducing instability.
This randomness is why development in the LLM application era differs from what we’re used to. We must now focus on model evaluation. Evaluation, once a peripheral concern for algorithm specialists, is now crucial for every LLM application developer. To put it bluntly, evaluation is the business.
Why Evaluate?
As a traditional engineer, I’m used to building systems that solve business problems based on understanding requirements. I once developed an AI document generation tool with my team, initially focusing only on functionality. It wasn’t until my leader asked about business impact and precision/recall that I realized the need for evaluation.
I had no concept of evaluation then, so it wasn’t part of my project flow. We assumed that if things were running and users gave positive feedback, all was well. But I had no idea about actual effectiveness. That experience led me to research evaluation, and it inspired this post. I hope you can avoid the mistakes I made.
Why We Didn’t Need Much Evaluation Before
Evaluation isn’t new, but in the past, we mostly used stable services like databases or third-party APIs. Their behavior was predictable, so we only cared about the function itself. LLMs are different; they are inherently a source of uncertainty.
Evaluation has existed in traditional search and machine learning, but it was often a separate module. We assumed these modules were reliable. Business engineering teams focused on feature availability rather than metrics like recall and precision.
The LLM era changes this. LLMs are no longer external modules; they are embedded within our systems. This shift requires us to integrate evaluation into every development stage rather than treating it as a separate concern. Instead of focusing only on feature availability, we must monitor model performance, data quality, and user experience. Our team composition also changes. If a typical product/engineering/testing ratio was previously 1:5-10:1, it may now become 1:5-10:2. That extra person is there to handle the uncertainty LLMs bring: we need more resources to ensure model performance.
How Do We Evaluate Effectively?
If you agree with the previous points, then we have a common understanding: LLMs are not inherently stable, and we must invest additional effort to ensure their stability within our systems. With this, we can start designing our evaluation systems and plans:
Start with the End: Defining Business and Technical Metrics
The first step in evaluation is defining business metrics and model inputs and outputs. This is where “evaluation is the business” comes into play. If you nail this step, you’ve already captured 80% of the value. Since business metrics are unique, let’s focus on more general technical metrics.
If your system integrates LLMs, focus on:
- Generation Quality: Assess the quality of LLM-generated content. Automated metrics exist, such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), but the most reliable method is human review, evaluating content for effectiveness, fluency, coherence, and relevance.
- Model Efficiency: Evaluate LLM inference speed, throughput, and resource consumption. If using cloud-based APIs, focus on inference speed and throughput. If using local models, also evaluate model size, memory usage, and computational resources. These metrics impact your system architecture and cost.
- Model Safety: Assess the LLM’s ability to handle malicious requests, whether it generates harmful, inappropriate, or biased content, and whether it leaks sensitive data. Without these safeguards, your application could face severe issues and potentially be shut down.
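To make the generation-quality bullet concrete, here is a minimal sketch of a ROUGE-1-style score: clipped unigram overlap between a generated answer and a reference. It is a simplification, not the official ROUGE implementation. The function name `unigram_overlap` and the naive whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1-style scores based on clipped unigram overlap.

    Tokenization is a naive lowercase whitespace split (an assumption);
    real ROUGE tooling adds stemming and other normalization.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = unigram_overlap(
    "the cat sat on the mat",   # model output
    "the cat is on the mat",    # human-written reference
)
print(scores)
```

Scores like this are cheap to compute on every run, but they only measure word overlap, so they complement human review rather than replace it.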
If your system uses RAG with a database, you also need to look into:
- Precision: Assess the percentage of returned documents that are relevant. Low precision means the retrieval model returns too many unrelated documents. To improve it, raise the similarity threshold or refine the retrieval model.
- Recall: Evaluate the percentage of all relevant documents that are actually returned. Low recall means the retrieval model is missing relevant documents. To improve it, lower the similarity threshold or optimize the retrieval model.
- Hit Rate: Assess the fraction of queries, across varied user intents, for which at least one relevant document is retrieved. A low hit rate may indicate gaps in your knowledge base.
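The three retrieval metrics above can be computed from labeled queries roughly as follows. This is a sketch under stated assumptions: document IDs are unique within each result list, relevance labels come from your own annotated dataset, and the names `retrieval_metrics`, `hit_rate`, and `eval_set` are hypothetical.

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict:
    """Precision and recall for a single query, based on document IDs."""
    hits = [doc for doc in retrieved if doc in relevant]
    # Precision: share of returned documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant documents that were returned.
    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall, "hit": bool(hits)}


def hit_rate(results: list[dict]) -> float:
    """Fraction of queries with at least one relevant document retrieved."""
    return sum(r["hit"] for r in results) / len(results) if results else 0.0


# Hypothetical labeled queries: (retrieved IDs, relevant IDs).
eval_set = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}),
    (["d5", "d6"], {"d7"}),
]
per_query = [retrieval_metrics(ret, rel) for ret, rel in eval_set]
print(per_query[0])
print(hit_rate(per_query))
```

In the first query, two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67). The second query retrieves nothing relevant, so the hit rate over both queries is 0.5.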
If you are building code generation tools, also track metrics such as code execution success rate.
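As an illustration of execution success rate, the sketch below runs each generated Python snippet in a separate interpreter process and counts how many exit cleanly. The helper names and the sample snippets are hypothetical. A plain subprocess is not a sandbox, so real model output should be executed in an isolated environment.

```python
import subprocess
import sys


def runs_cleanly(code: str, timeout_s: float = 5.0) -> bool:
    """Return True if the snippet exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def execution_success_rate(snippets: list[str]) -> float:
    """Fraction of snippets that run without error."""
    if not snippets:
        return 0.0
    return sum(runs_cleanly(s) for s in snippets) / len(snippets)


# Hypothetical model outputs, hard-coded for illustration.
generated = [
    "print(sum(range(10)))",
    "print(undefined_name)",          # NameError, non-zero exit
    "def broken(:\n    pass",         # SyntaxError, non-zero exit
    "x = [i * i for i in range(5)]",
]
print(execution_success_rate(generated))
```

A clean exit only shows that the code runs, not that it is correct. Pairing this metric with unit tests for the generated code gives a stronger signal.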
These are only starting points. You must tailor the metrics to your specific business. But in general, once you decide what to measure, you’re close to understanding your business. You’ll then choose suitable models and prompts and integrate them with engineering.
Clean Data: The Key to High-Quality Evaluation
After defining metrics, the next step is preparing high-quality data for evaluation. Different evaluation objectives require different datasets. You will likely need to create your own, using online data, manually curated data, or existing data with annotations.
Data cleaning is time-consuming. You need to remove errors, duplicates, incomplete records, and irrelevant information, then format and normalize the data consistently. If your data contains sensitive information, be sure to anonymize it for privacy.
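A cleaning pass along these lines might look like the sketch below. It assumes each record is a dict with `question` and `answer` fields (a hypothetical schema) and uses a deliberately simple email pattern as a stand-in for real anonymization, which usually needs to cover far more than email addresses.

```python
import re

# Simple email pattern for illustration; not a complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(text) -> str:
    """Collapse whitespace and mask emails; non-strings become ''."""
    if not isinstance(text, str):
        return ""
    text = " ".join(text.split())
    return EMAIL_RE.sub("[EMAIL]", text)


def clean_records(records: list[dict]) -> list[dict]:
    """Drop incomplete rows, normalize text, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        question = normalize(rec.get("question"))
        answer = normalize(rec.get("answer"))
        if not question or not answer:
            continue  # incomplete record
        key = (question.lower(), answer.lower())
        if key in seen:
            continue  # case-insensitive duplicate
        seen.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned


raw = [
    {"question": "How do I reset  my password?", "answer": "Email help@example.com."},
    {"question": "how do I reset my password?", "answer": "Email help@example.com."},
    {"question": "What is the refund policy?", "answer": None},
]
print(clean_records(raw))
```

Here the second record is dropped as a duplicate, the third as incomplete, and the email address in the surviving record is masked.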
Plan data collection and cleaning upfront to reduce pressure later and ensure adequate staffing.
Also, pay attention to your dataset’s quality (representativeness, accuracy, diversity, and completeness), scale, and data bias. Ensure your data doesn’t introduce biases that drive the model in the wrong direction.
Once you have quality data, the following steps are simpler: maintain your data, update it regularly, adapt it to your business, and continuously evaluate your system and model to ensure metrics remain healthy.
Continuous Evaluation: A Standardized Process
Evaluation isn’t a one-off event. LLMs are constantly evolving, especially in cloud API scenarios, and your live data may also change. Therefore, you need continuous evaluation. Perform it regularly (e.g., weekly, monthly, or after significant releases) and integrate it into your project development workflow. This will help you identify issues and iterate quickly. Continuous evaluation isn’t optional; it’s essential for the LLM application era.
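One way to fold continuous evaluation into a development workflow is a regression check that compares the latest metrics against a stored baseline. The sketch below assumes higher is better for every metric. The function name, the 0.02 tolerance, and the sample numbers are all illustrative assumptions, not recommended values.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return metrics that dropped by more than `tolerance` vs. the baseline.

    Assumes higher is better for every metric. A metric missing from
    `current` is also reported as a regression.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or base_value - new_value > tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Hypothetical numbers from two evaluation runs.
baseline = {"precision": 0.82, "recall": 0.75, "hit_rate": 0.91}
current = {"precision": 0.83, "recall": 0.70, "hit_rate": 0.90}

failed = find_regressions(baseline, current)
print(failed)
```

With these numbers only recall is flagged, since it dropped by 0.05 while the other metrics stayed within tolerance. Run on a schedule or after each release, a non-empty result can fail the pipeline or notify the team.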
Evaluation as a Business Imperative
In the LLM era, evaluation is the business; it’s no longer optional, but essential for success. Without evaluation, your product is like a ship without a compass. As LLMs play an increasingly important role, they’re no longer an add-on, but the foundation of success. Thus, product managers and project members must understand how to evaluate the business. A somewhat controversial claim: if evaluation accounts for less than 30% of the business process, that business has at least 50% room for optimization. Without evaluation, you won’t know your current status or limits, and therefore cannot improve.