Blog

  • Field Note: What Would We Assess a System Against?

    In an earlier field note, I explored the idea that testing might be understood as a form of assessment. Testing generates evidence. Assessment uses evidence to reach a judgment. But this raises a more difficult question:

    If we want to assess the quality of a system, what would we assess it against?

    For a capability assessment, the answer is relatively clear. A reference model describes the capability we expect to find. An assessment model explains how evidence should be collected, interpreted and judged.

    For system quality, the equivalent reference model is less obvious. Requirements and acceptance criteria may seem to provide the answer. They describe what the system should do and the conditions under which it can be accepted.

    But are they sufficient as a reference model for system quality? I do not think they are.

    Requirements are not a complete model of quality

    Requirements describe selected expectations. They may address functionality, performance, security, usability and other quality concerns. In a well-developed specification, they can provide a substantial basis for testing and acceptance.

    But requirements are still the result of selection. Some expectations are documented. Others remain implicit. Some stakeholders are represented more strongly than others. Some concerns are recognised early, while others become visible only during operation.

    A system can satisfy its documented requirements and still be difficult to operate, hard to maintain, inaccessible, fragile under unusual conditions, unsuitable for future growth or inappropriate for its business context.

    This does not necessarily mean that the requirements were badly written. It means that requirements describe what has been specified, not necessarily everything that matters.

    Acceptance criteria make selected expectations testable

    Acceptance criteria make expectations concrete. For example:

    Ninety-five per cent of transactions must complete within two seconds.

    This gives us something observable and testable. We can determine whether the system meets the threshold. But it does not answer the wider questions:

    • Why are two seconds acceptable?
    • Which transactions and operating conditions are included?
    • What happens to the slowest five per cent?
    • Will the threshold remain adequate as demand grows?
    • Does it reflect actual user and business needs?

    The criterion defines a threshold. It does not explain the significance of that threshold.
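
    A minimal sketch of that gap, assuming transaction durations are available as a plain list of seconds. The function names and the sample data are hypothetical and only illustrate the point: a sample can pass the criterion while the slowest five per cent remain unexamined.

```python
import math


def meets_criterion(durations_s, threshold_s=2.0, required_percent=95):
    """Return True if at least required_percent of durations are <= threshold_s."""
    if not durations_s:
        raise ValueError("no transactions observed")
    within = sum(1 for d in durations_s if d <= threshold_s)
    # Integer comparison avoids floating-point rounding at the boundary.
    return within * 100 >= required_percent * len(durations_s)


def slowest_tail(durations_s, required_percent=95):
    """Return the slowest durations that the criterion does not constrain."""
    ordered = sorted(durations_s)
    cutoff = math.ceil(len(ordered) * required_percent / 100)
    return ordered[cutoff:]


# Hypothetical sample: 19 fast transactions and one very slow one.
sample = [0.4] * 19 + [45.0]

print(meets_criterion(sample))  # True: 19 of 20 complete within two seconds
print(slowest_tail(sample))     # [45.0]: the criterion is silent about this one
```

    The check passes, yet whether a 45-second transaction is tolerable is a question the threshold alone cannot answer.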

    The same limitation applies to qualities that are difficult to reduce to a pass-or-fail statement. A system may be maintainable for its current team but difficult for another supplier to take over. It may be secure against the threats that were tested but exposed through dependencies outside the test scope. Acceptance criteria can express parts of these expectations. They cannot easily represent the entire quality picture.

    A reference model also reveals what is missing

    Testing against acceptance criteria primarily confirms whether specified expectations have been met. A system-quality assessment must also ask whether the relevant expectations were specified in the first place. A reference model should prompt questions such as:

    • Have the relevant quality characteristics been considered?
    • Are the important stakeholder groups represented?
    • Have operational and lifecycle concerns been included?
    • Are dependencies and consequences of failure understood?
    • Have future changes in scale and use been considered?
    • Which assumptions remain unvalidated?

    This leads to an important distinction:

    Acceptance criteria confirm what has been specified. A reference model also helps reveal what has not been specified.

    That may be one of its most valuable contributions. It enables us to challenge the completeness of the test basis rather than only derive tests from it.

    A taxonomy is not yet a reference model

    Established quality models describe characteristics such as functional suitability, reliability, performance efficiency, usability, security, compatibility, maintainability and portability. These provide a useful starting point.

    But a list of characteristics is not automatically an assessment reference model. For each quality area, a usable model should help us understand what the characteristic means, which questions should be asked, what strong or weak quality might look like, what evidence could support a conclusion and how the operating context affects the judgment.

    Without this, the model tells us where to look but not how to judge what we find. It remains a classification system rather than an assessment foundation.

    Reference model and assessment model

    The distinction between a reference model and an assessment model is important. The reference model describes the quality space against which the system is assessed. The assessment model describes how the assessment is performed.

    The reference model might identify recoverability as an important aspect of reliability. The assessment model would explain how to determine whether recoverability is adequate, perhaps through recovery tests, operational exercises, incident data, architecture analysis and interviews.

    The reference model tells us that recoverability matters. The assessment model tells us how to reach a defensible conclusion about it.

    A possible structure could be:

    • quality characteristic;
    • sub-characteristic;
    • assessment question;
    • evidence requirement;
    • indicator or measure;
    • acceptance criterion.

    For example:

    • Reliability
      • Recoverability
        • Can acceptable service be restored after failure?
          • Evidence from recovery tests and operational exercises
            • Recovery time, data loss and restoration success
              • Service restored within 30 minutes with no more than five minutes of data loss

    The acceptance criterion remains important, but it becomes one element within a wider reasoning framework.
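
    The structure above could be captured as a simple record. This is only a sketch: the class name, field names and the outline helper are assumptions made for illustration, not part of any established model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceEntry:
    """One line of reasoning from quality characteristic to acceptance criterion."""
    characteristic: str
    sub_characteristic: str
    assessment_question: str
    evidence_requirement: str
    indicators: tuple[str, ...]
    acceptance_criterion: str


def outline(entry: ReferenceEntry) -> str:
    """Render the entry as a nested outline, one level per element."""
    levels = [
        entry.characteristic,
        entry.sub_characteristic,
        entry.assessment_question,
        entry.evidence_requirement,
        ", ".join(entry.indicators),
        entry.acceptance_criterion,
    ]
    return "\n".join("  " * depth + text for depth, text in enumerate(levels))


# The recoverability example from the text.
recoverability = ReferenceEntry(
    characteristic="Reliability",
    sub_characteristic="Recoverability",
    assessment_question="Can acceptable service be restored after failure?",
    evidence_requirement="Evidence from recovery tests and operational exercises",
    indicators=("recovery time", "data loss", "restoration success"),
    acceptance_criterion=(
        "Service restored within 30 minutes with no more than "
        "five minutes of data loss"
    ),
)

print(outline(recoverability))
```

    Keeping the criterion as the last field of the record, rather than as a standalone statement, preserves the question and the evidence that give it meaning.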

    System quality is contextual

    A reference model for system quality cannot describe only the intrinsic properties of the system. It must also address fitness for context. The same system may be suitable for one purpose and unsuitable for another. A service may be adequate for an internal pilot but unacceptable for a business-critical process.

    The assessment therefore needs two connected perspectives. The first concerns the qualities the system possesses: functionality, reliability, security, usability, maintainability, operability and performance.

    The second concerns whether those qualities are adequate for the intended context: business criticality, stakeholder needs, user groups, operational environment, expected scale, risk appetite, regulatory obligations, dependencies and consequences of failure.

    A technically strong system can still be unfit for a particular context. A technically modest system can be entirely adequate for limited, low-risk use. Quality cannot be judged independently of purpose.

    The model must support judgment and uncertainty

    Acceptance criteria are commonly propositions that can be verified. A reference model must support a broader judgment.

    It may need to conclude that a quality characteristic is strong, weak, adequate only under certain conditions, acceptable for limited use, insufficiently evidenced or dependent on unverified assumptions.

    It must also distinguish between the quality of the system and confidence in the conclusion.

    A system may appear reliable because no serious failures have been observed. But if it has never been exposed to realistic load or failure conditions, confidence in that conclusion should be low.

    A failed test can provide strong evidence of weakness. A passed test may provide weak evidence of quality when the environment, data or scope were unrepresentative.

    The model must therefore connect quality characteristics not only to criteria, but also to evidence and uncertainty.
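
    One way to keep the two apart is to record them as separate fields. The sketch below is an assumption about how that might look: the rating values come from the list above, while the three confidence levels and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Rating(Enum):
    """Possible conclusions about a quality characteristic."""
    STRONG = "strong"
    WEAK = "weak"
    CONDITIONAL = "adequate only under certain conditions"
    LIMITED_USE = "acceptable for limited use"
    INSUFFICIENT_EVIDENCE = "insufficiently evidenced"
    UNVERIFIED_ASSUMPTIONS = "dependent on unverified assumptions"


class Confidence(Enum):
    """How much the available evidence supports the rating (assumed scale)."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Judgment:
    characteristic: str
    rating: Rating
    confidence: Confidence
    rationale: str

    def summary(self) -> str:
        return (
            f"{self.characteristic}: {self.rating.value} "
            f"(confidence: {self.confidence.value})"
        )


# The reliability example from the text: no failures seen, but little exposure.
reliability = Judgment(
    characteristic="Reliability",
    rating=Rating.STRONG,
    confidence=Confidence.LOW,
    rationale=(
        "No serious failures observed, but the system has never been "
        "exposed to realistic load or failure conditions."
    ),
)

print(reliability.summary())  # Reliability: strong (confidence: low)
```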

    Perhaps the real limitation is not the test process

    When testing is criticised for being too narrow, attention often turns to the process. We conclude that testing should begin earlier, include more automation, involve more stakeholders or extend into production. These may all be valid improvements.

    But perhaps testing often appears narrow because the model against which the system is judged is narrow. When requirements and acceptance criteria define the entire quality horizon, testing can only confirm expectations that have already been made visible.

    A stronger reference model would change the questions we ask before it changes the tests we perform. It would expose missing quality concerns, broaden the evidence we consider and help us distinguish between meeting criteria and being fit for purpose.

    The challenge may therefore not merely be to improve the test process. It may be to develop a sufficiently rich reference model for system quality.

    Because before we can assess whether a system is good, we need a credible answer to a more fundamental question:

    What, exactly, are we assessing it against?

  • Blog Retrospective #1

    What Happened

    I wrote 20 posts for this blog, including 18 field notes. I think it is time to evaluate the process before I continue.

    What Surprised Me?

    The pace of writing turned out to be extremely high. I produced the 18 field notes in two mornings, during some downtime while doing other things.

    Ideas for new field notes kept appearing. With the help of AI, those ideas matured and turned into actual blog posts. The friction between a conceptual idea and a finished post ready for publication all but disappeared. That experience itself became the source of several new field notes.

    I decided to maintain a publishing cadence of one post per day to avoid burning out. The writing pace of those two mornings may not be sustainable, and I want to avoid creating pressure to write merely to maintain the cadence.

    The publishing cadence also creates time between writing a post and committing to it. I can still revise or retract a scheduled post before publication if new insights emerge.

    What Worked Well?

    The process of generating ideas, developing them through dialogue and turning them into publishable posts worked remarkably well.

    The title and description I chose for the blog at the beginning also still seem to fit the material I have created so far. The subjects have expanded, but they remain within the territory of quality and delivery transformations.

    What Did Not Work Well?

    I set up this WordPress blog from scratch, added an “About Marcel” page and immediately started writing posts. I paid very little attention to the layout, theme or functionality of the site.

    I have also not started using categories and tags. The site still uses the default theme, which may not be the most inspiring choice, and it contains links and elements that I do not use and probably never will.

    The content has developed much faster than the structure around it.

    What Will I Change?

    Besides continuing to develop new field notes, I will clean up the site. I will add categories and tags, evaluate alternative themes and remove unused links and elements.

    As the number of field notes grows, I may also create overview pages that connect related posts. These could make the emerging themes more visible and provide other ways into the material beyond the chronological list and the “About Marcel” page.

    For now, I will continue with the next set of 20 posts and then conduct another retrospective.

    The first cycle has shown that generating material is not currently the main challenge. The next challenge is to give that material enough structure, distance and editorial attention as the blog develops.

  • Field Note: Evidence Is Not Interpretation

    One of the most common weaknesses I encounter in assessments is surprisingly simple. Evidence and interpretation become mixed together. At first glance, this may seem harmless. In practice, it makes assessments harder to challenge, harder to validate, and ultimately less trustworthy.

    The Problem

    Consider the statement:

    Test automation is insufficient.

    Is this evidence? Or is it a conclusion? Most people instinctively treat it as an observation. In reality, it is an interpretation. The statement already contains a judgment. It implies a comparison against some expectation, objective, or benchmark.

    The natural response is:

    Insufficient according to whom?

    Compared to what?

    Based on what evidence?

    Without answers to those questions, the statement remains an opinion rather than an assessment finding.

    Separating the Layers

    A stronger assessment makes each step of the reasoning explicit.

    Evidence

    • 12% of regression tests are automated.
    • Automation execution requires manual preparation.
    • Critical business flows are excluded from automation coverage.

    Interpretation

    • The automation capability provides limited support for regression testing.

    Conclusion

    • Release cycles remain highly dependent on manual testing effort.

    Recommendation

    • Increase automation coverage of critical business flows.

    Each layer serves a different purpose. And more importantly, each layer can be challenged independently.

    Why This Matters

    When evidence and interpretation are merged, disagreements become difficult to resolve. A stakeholder may reject a finding without being able to explain why. The assessor may defend a conclusion without clearly identifying the supporting evidence. The discussion becomes subjective.

    When evidence and interpretation are separated, the conversation becomes more productive. A stakeholder can challenge:

    • the evidence,
    • the interpretation,
    • the conclusion,
    • or the recommendation.

    Each disagreement becomes explicit. Each assumption becomes visible.

    Traceability of Reasoning

    One useful way to think about assessment is as a chain of reasoning.

    Evidence leads to interpretation. Interpretation leads to conclusions. Conclusions lead to recommendations. A robust assessment makes this chain traceable.

    If a recommendation is challenged, it should be possible to walk backwards through the reasoning and understand how it was derived. Likewise, if new evidence emerges, its impact on conclusions should be clear. The reasoning should be visible, not hidden.
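
    Walking backwards through the reasoning can be sketched as a small linked structure. The class, its fields and the trace_back function are hypothetical names chosen for illustration. The statements are the test automation example from earlier in this note.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One statement in the chain, linked to the steps it is based on."""
    layer: str
    statement: str
    based_on: list["Step"] = field(default_factory=list)


def trace_back(step: Step) -> list[tuple[str, str]]:
    """Return the step and everything it rests on, nearest layer first."""
    chain = [(step.layer, step.statement)]
    for parent in step.based_on:
        chain.extend(trace_back(parent))
    return chain


evidence = [
    Step("evidence", "12% of regression tests are automated."),
    Step("evidence", "Automation execution requires manual preparation."),
    Step("evidence", "Critical business flows are excluded from automation coverage."),
]
interpretation = Step(
    "interpretation",
    "The automation capability provides limited support for regression testing.",
    based_on=evidence,
)
conclusion = Step(
    "conclusion",
    "Release cycles remain highly dependent on manual testing effort.",
    based_on=[interpretation],
)
recommendation = Step(
    "recommendation",
    "Increase automation coverage of critical business flows.",
    based_on=[conclusion],
)

for layer, statement in trace_back(recommendation):
    print(f"{layer}: {statement}")
```

    A step whose based_on list is empty, but whose layer is not evidence, would be a visible gap in the chain: a judgment with nothing underneath it.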

    Beyond Expert Opinion

    Many assessments rely heavily on expert judgment. There is nothing inherently wrong with expertise. However, expertise becomes significantly more valuable when the reasoning behind it is transparent.

    A strong assessment does not merely say:

    Trust me.

    It says:

    Here is the evidence, here is how I interpreted it, here is the conclusion that follows.

    The assessor’s role is not to replace reasoning. It is to make reasoning explicit.

    A Working Hypothesis

    This leads me to a simple hypothesis:

    Effective assessments require explicit separation of evidence and interpretation, with clear traceability between them.

    The quality of an assessment is not determined solely by the quality of its conclusions. It is determined by the visibility of the reasoning that produced them. After all, an assessment is more than a collection of opinions. It is an argument supported by evidence.

  • Field Note: Who Assesses the Assessor?

    In recent years, I have spent a significant amount of time helping organizations assess their capabilities.

    Quality assurance.

    Testing.

    Data quality.

    Processes.

    Governance.

    Organizations routinely evaluate how well these capabilities perform and where improvement is needed.

    Yet a question occurred to me:

    Who assesses the assessor?

    The Invisible Capability

    When conducting an assessment, we tend to focus on the subject being assessed. We discuss the quality of testing. The maturity of data management. The effectiveness of governance. The assessment itself is often treated as a neutral instrument. A means to an end.

    But what if assessment is not merely an activity? What if assessment is a capability in its own right? If so, it should be subject to the same scrutiny as any other capability.

    Separating the Subject from the Assessment

    A useful distinction is the separation of a reference model from an assessment model. The reference model describes what good looks like. The assessment model describes how we evaluate against it. Most discussions stop there.

    However, there is a third question:

    How do we know the assessment itself is any good?

    A poor assessment can produce misleading conclusions, inappropriate recommendations, and misguided investments. In other words, the quality of improvement initiatives is often limited by the quality of the assessment that precedes them.

    Assessment as a Professional Discipline

    One reason this question is rarely asked may be that assessment is often viewed as an extension of domain expertise. If someone understands testing, they assess testing. If someone understands data quality, they assess data quality.

    Yet experience suggests that understanding a domain and assessing a domain are different capabilities. Domain expertise helps us understand what is happening. Assessment expertise helps us determine what it means. The best assessments typically emerge from collaboration between both forms of expertise.

    What Makes a Good Assessment?

    If assessment is a capability, it should be possible to describe what good assessment looks like.

    Some characteristics immediately come to mind:

    • Clear scope and objectives
    • Structured methodology
    • Evidence-based conclusions
    • Traceability from findings to recommendations
    • Balanced stakeholder engagement
    • Appropriate challenge of assumptions
    • Consideration of alternative explanations
    • Actionable recommendations

    Most professionals would recognize these as indicators of assessment quality. Yet they are rarely evaluated explicitly.

    Learning from Other Disciplines

    Other professions routinely evaluate their own methods. Researchers subject their work to peer review. Auditors perform quality reviews. Architects conduct architecture reviews. Scientists continuously challenge the validity of their findings.

    The assessment profession should be no different. Assessments should be open to challenge, critique, and improvement. In fact, the strongest assessments are often those that have survived the most scrutiny.

    A Working Hypothesis

    This leads me to a simple hypothesis:

    Assessment is a capability in its own right and should therefore be assessed like any other capability.

    Organizations routinely evaluate testing, quality, governance, and architecture. Perhaps it is time to evaluate assessment as well.

    After all, if we believe in assessment as a mechanism for improvement, it seems reasonable to ask:

    Who assesses the assessor?

  • Field Note: Continuity

    A few months ago, my mother passed away. Recently, a question about our family history popped into my head. My immediate reaction was simple: I’ll ask Mom. Then I remembered: I couldn’t. The answer disappeared with her.

    A Library of One

    What struck me was not the loss itself. I had already experienced that. What struck me was the realization that my mother had carried a unique body of knowledge. Stories. Relationships. Explanations. Family history. Context. The answers to questions nobody knew would one day be asked. For years, that knowledge was always available. A simple question was enough.

    Then suddenly it wasn’t. Not because the knowledge lacked value. Because the person who carried it was no longer there.

    The Fragility of Knowledge

    This is not unique to families. Every person carries a library. Years of experience. Lessons learned. Mistakes made. Patterns observed. Insights developed.

    Much of this knowledge exists nowhere else. It remains tacit. Unwritten. Unrecorded. Accessible only through conversation. When a person leaves, part of that library often disappears as well.

    Thinking About My Own Library

    The experience made me reflect on my own situation. Over the course of a career, I have accumulated a large collection of observations, mental models, and insights. Some are personal. Some are professional. Some are useful only to my family. Others may be useful to people I will never meet.

    Like everyone else, I will eventually leave this world. The question is not whether that happens. The question is what happens to the knowledge I carry when it does.

    Two Different Forms of Continuity

    I find myself thinking about continuity in two different ways.

    The first is personal. I would like my son to have access to as much of my experience as possible. Not because I expect him to follow the same path. But because every parent hopes to pass on some of what they have learned.

    The second is broader. I would like useful ideas, observations, and lessons learned to remain available to others. Not because they are mine. But because they might help someone else think, learn, or solve a problem.

    Why Field Notes Matter

    For years, capturing knowledge required significant effort. Books had to be written. Articles had to be published. Most insights never made the journey from intuition to publication.

    Today, the friction is much lower. A thought can become a conversation. A conversation can become a field note. A field note can become part of a growing body of knowledge. Each note is a small act of continuity. A small attempt to ensure that something learned does not simply disappear.

    Beyond Immortality

    People sometimes speak about legacy or immortality. I find myself less interested in those ideas.

    The opposite of mortality is not immortality. It is continuity. I do not need my name to survive. What I hope survives are the useful things. The lessons. The observations. The insights. The things that might still help someone after I am gone.

    A Working Hypothesis

    This leads me to a simple hypothesis:

    Human lives are finite. Continuity is not.

    Every conversation, story, lesson, article, field note, and book is an attempt to carry something forward. Not forever. But long enough to matter.

    Perhaps that is one of the most meaningful things we can do with the knowledge we accumulate during a lifetime.

    Not keep it. Pass it on.