Practice Evaluation and Research
This domain (IIIC) asks two questions: can you evaluate your own work with an individual client, and can you evaluate an agency's programs at the system level? The 2026 outline leans into two areas it used to underplay: the specific instruments and techniques for measuring practice effectiveness (IIIC.2), and the full toolkit for evaluating programs (IIIC.3: needs assessment, formative and summative evaluation, process and outcome evaluation, cost-effectiveness, and cost-benefit). Four sections carry it: EBP and research-design foundations, practice evaluation with individual clients, agency program evaluation, and research ethics and design.
Evidence-Based Practice
Evidence-based practice rests on three equally weighted parts: the best available research, the worker's clinical expertise, and the client's values and preferences. The common misread is to treat it as research applied rigidly. It is the opposite: research used to inform a decision while you account for the particular client in front of you.
EBP & research design foundations
Evaluating Practice Outcomes
Good intentions are not evidence. At some point you have to ask whether your intervention actually worked, and that means measuring it.
A social worker reviews the research literature on PTSD treatment but notes the client expresses strong preferences for a different modality grounded in her cultural and spiritual practices. Under evidence-based practice principles, the MOST appropriate response is to:
QUANTITATIVE research uses numbers and statistical analysis to test hypotheses (experiments, surveys, structured observation). QUALITATIVE research explores meaning through words, themes, and narrative (in-depth interviews, focus groups, ethnography, case studies). MIXED METHODS combines both, leveraging the strengths of each. The exam tests whether you can match design to question: "what is the prevalence of X" calls for a quantitative survey; "how do clients experience X" calls for qualitative interviews; "does this intervention work AND how do clients experience it" calls for mixed methods. A single design fits a single question; complex questions often require mixed approaches.
A research team wants to understand the lived experience of clients completing a substance use recovery program: how clients describe their journey, the meaning they ascribe to setbacks, and what aspects of the program felt most transformative. The MOST appropriate research approach is:
Data Collection and Analysis Methods
If you touch research or program evaluation, you need a working sense of how data gets gathered and what each method costs you:
- Surveys and questionnaires reach large samples but lean on self-report, which can be inaccurate.
- Interviews yield rich qualitative detail but take time and can pick up interviewer bias.
- Direct observation captures real behavior, though the observer's presence can change the very behavior being watched (the Hawthorne effect).
- Record review draws on existing documentation like case files and medical records, where the data may be incomplete or inconsistently kept.
- Focus groups spark group discussion but can be skewed by a dominant participant.
On the analysis side: descriptive statistics (mean, median, mode) summarize data; inferential statistics (t-tests, chi-square) test whether a finding is statistically significant; and qualitative analysis (coding and theme identification) organizes non-numerical data into patterns.
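To make the descriptive side concrete, here is a minimal sketch using Python's standard `statistics` module. The scores are hypothetical numbers invented for illustration, not data from any study, and inferential tests are left out because they need more than a few lines.

```python
import statistics

# Hypothetical depression-scale totals for ten clients (illustrative only).
scores = [4, 7, 7, 9, 10, 12, 12, 12, 15, 22]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value once the scores are sorted
mode_score = statistics.mode(scores)      # most frequently occurring value

print(f"mean={mean_score}, median={median_score}, mode={mode_score}")
```

All three numbers summarize the same ten scores. None of them says whether a difference between groups is statistically significant; that is the job of inferential statistics.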
A researcher notices that participants in a study change their behavior simply because they know they are being observed. This phenomenon is BEST described as:
Evaluating practice with individual clients
Administering a standardized depression scale before a 12-week intervention and again at its conclusion, a social worker is using the evaluation method BEST described as:
Practice evaluation instruments and techniques (IIIC.2)
The outline wants the specific tools workers use to evaluate practice, not just the broad frameworks above. The categories to know:
Rapid Assessment Instruments (RAIs)
Brief, validated self-report tools designed for repeat administration. Used to track change session-by-session or week-by-week. Examples: PHQ-9 (depression), GAD-7 (anxiety), OQ-45 (general distress), URICA (readiness for change). The defining features: short, easy to score, sensitive to change over short periods.
Standardized outcome measures
Validated instruments with established reliability and validity, often used in pre-post designs. Examples: Beck Depression Inventory (BDI-II), Outcome Questionnaire (OQ-45), Children's Depression Inventory (CDI), Trauma Symptom Inventory. Selected to match the target outcome.
Treatment fidelity measures
Track whether the intervention was delivered as intended (also called treatment integrity). Includes checklists of required intervention components, session audio review by an independent rater, and supervisor review of session notes against a model fidelity rubric. Critical for evaluating evidence-based treatments.
Behavioral observation
Structured tracking of observable behaviors over time (frequency counts, duration recording, interval sampling). Often paired with single-system designs. Strength: not vulnerable to self-report bias. Weaknesses: observer effects, plus the need for observer training and inter-rater reliability checks.
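One simple way to run an inter-rater check on interval recording is percent agreement between two observers. The sketch below is an assumption-laden illustration: the observer data are made up, `percent_agreement` is a hypothetical helper name, and percent agreement is only one of several possible indices.

```python
# Two observers code the same ten intervals: 1 = target behavior seen, 0 = not seen.
# Hypothetical data for illustration only.
observer_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
observer_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]


def percent_agreement(a, b):
    """Percentage of intervals on which both observers recorded the same code."""
    if len(a) != len(b) or not a:
        raise ValueError("both records must cover the same, non-empty set of intervals")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * matches / len(a)


frequency_a = sum(observer_a)  # frequency count of the behavior, per observer A
agreement = percent_agreement(observer_a, observer_b)
print(f"Observer A counted {frequency_a} occurrences; agreement = {agreement:.0f}%")
```

Low agreement would suggest the behavior definition or the observer training needs work before the counts are used to judge client change.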
Triangulation across instruments. No single instrument captures the full picture. Best practice combines: a standardized symptom measure (e.g., PHQ-9), a functional measure (e.g., work attendance, sleep), a relational measure (e.g., satisfaction, alliance), and clinician observation. Convergence across sources increases confidence; divergence is itself data.
Treatment fidelity is the most commonly under-recognized concept on this section. When a manualized intervention does not produce expected outcomes, the first question is whether the intervention was actually delivered as designed. Without fidelity data, lack of effect could mean the treatment does not work OR could mean the treatment was not really delivered.
Tracking a single client's anxiety scores weekly for four weeks during a baseline phase, then weekly for eight weeks during a CBT intervention phase, then weekly for four weeks during a withdrawal phase, the social worker's evaluation design is BEST described as:
RELIABILITY = consistency of the measurement (does the tool give similar results when administered repeatedly under similar conditions?). VALIDITY = whether the tool measures what it is intended to measure (does the depression scale actually measure depression, not anxiety or general distress?). The exam tests the asymmetry: a tool can be RELIABLE without being VALID (a broken bathroom scale that always reads five pounds high is highly reliable, not valid). A tool CANNOT be valid without being reliable (a scale that gives wildly different readings cannot be measuring weight accurately). Reliability is necessary but not sufficient for validity. Common reliability types: test-retest, inter-rater, internal consistency (Cronbach's alpha). Common validity types: face, content, construct, criterion (concurrent and predictive).
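The bathroom-scale asymmetry can be put in numbers. In this sketch, spread across repeated readings stands in for reliability and average distance from the true value stands in for accuracy. The weights, readings, and the `describe` helper are all hypothetical, chosen only to illustrate the point.

```python
import statistics

TRUE_WEIGHT = 150.0  # hypothetical true weight in pounds

# Hypothetical repeated readings from two scales (illustrative numbers only).
biased_scale = [155.0, 155.1, 154.9, 155.0, 155.0]   # always about five pounds high
erratic_scale = [141.0, 158.0, 149.0, 163.0, 139.0]  # readings jump around


def describe(readings, true_value):
    """Return (spread, typical_error) for a set of repeated readings.

    spread: standard deviation of the readings (small = consistent).
    typical_error: mean absolute distance from the true value (small = accurate).
    """
    spread = statistics.stdev(readings)
    typical_error = statistics.mean(abs(r - true_value) for r in readings)
    return spread, typical_error


for name, readings in [("biased", biased_scale), ("erratic", erratic_scale)]:
    spread, error = describe(readings, TRUE_WEIGHT)
    print(f"{name}: spread={spread:.2f} lb, typical error={error:.2f} lb")
```

The biased scale shows almost no spread yet misses by about five pounds every time: reliable, not valid. The erratic scale has a large spread, so no single reading can be trusted, which is why it cannot be valid either.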
A measurement tool is BEST described as RELIABLE when it:
Case Recording and Documentation
Documentation is two things at once: a clinical tool and a legal safeguard. Good case recording follows a few principles:
- Timeliness: write it up as soon as you can after the session, while the details are fresh.
- Accuracy: keep facts ("client stated...") separate from clinical impressions ("this suggests...").
- Relevance: include only what bears on service delivery and the treatment goals.
- SOAP format: a common structure, Subjective (the client's report), Objective (observable data), Assessment (your clinical analysis), and Plan (next steps).
- Plain language: records should make sense to other professionals who may read them.
- Client access: records can be reviewed by clients, subpoenaed in court, or audited by regulators.
Working with a client recovering from a traumatic injury, a social worker sets specific individualized goals (returning to work part-time, walking unassisted for ten minutes, sleeping six hours per night) and creates a five-level scale for each goal from much worse than expected to much better than expected. The evaluation method BEST described is:
Interdisciplinary and Intradisciplinary Collaboration
Social workers rarely work alone, and the exam expects you to know the language of collaboration:
- Interdisciplinary teams bring together different disciplines (social work, medicine, nursing, psychology, education) around shared client goals.
- Intradisciplinary collaboration is social workers working with other social workers across settings or specializations.
- Your unique contribution on the team is the person-in-environment perspective; no other discipline brings that lens.
- Shared terminology: you need a basic grasp of legal, medical, and educational vocabulary to communicate well across a team.
- Ethics: your primary obligation stays with the client, even when the team's recommendation runs against the client's wishes, and you advocate for those wishes.
Documenting a session in SOAP format, a social worker writes: 'Client reported sleeping three hours per night and feeling hopeless. Affect flat; client tearful when discussing job loss. Symptoms consistent with major depressive episode; suicide risk low. Will schedule weekly sessions and refer for psychiatric evaluation.' The entry BEST illustrates which SOAP element ordering?
Evaluating agency programs
Needs assessment: the front end of program evaluation (IIIC.3)
Before anyone designs a program, a needs assessment asks what a community actually requires. It answers a chain of questions: who needs what, in what quantity, against what existing supply, leaving what gap?
Standard methods:
- Key informant interviews. Structured conversations with community leaders, service providers, and people with direct knowledge of the population. Strong for context; vulnerable to the informants' perspective.
- Community forums and town halls. Open meetings where community members describe needs. Strengths: voice, transparency, organizing potential. Limitations: dominant voices, attendance bias.
- Surveys. Structured questionnaires distributed to the population. Reach more people; vulnerable to response rates and self-report bias.
- Focus groups. Facilitated discussions with small groups of community members. Generate nuanced data on shared concerns and language.
- Secondary data analysis. Existing data sources: census data, health department statistics, school records, social service utilization, hotline calls. Cheap, broad, lagged.
- Service utilization data. Which services are full, which have waiting lists, who is being turned away, what populations are absent from the caseload despite presumed need.
The IIIC.3 framework distinguishes:
- Normative need. A standard set by experts or professional bodies (e.g., a recommended caseload size, a target screening rate).
- Felt need. What people say they need when asked.
- Expressed need. Demand for services actually used or sought (utilization data, waiting lists).
- Comparative need. Need identified by comparing this community to similar communities with different service levels.
The four types of need rarely agree. A community may have high normative need (per expert standards), low felt need (residents do not name it as a problem), low expressed need (no one is asking for services), but high comparative need (similar communities have much more service). The disagreement itself shapes program design: a program addressing normative need without acknowledging felt need will fail to engage; a program responding only to expressed demand will miss invisible populations.
A community health center plans to launch a new mental health program but is unsure which populations to target, what services to offer, or what existing resources already meet community needs. Before designing the program, the MOST appropriate evaluation activity is to:
Formative vs. summative evaluation
This is the most basic split in program evaluation, and the outline names both. The exam checks whether you can tell which one a scenario is describing.
Formative evaluation
- Purpose: improve the program WHILE it is running
- Timing: early and ongoing
- Audience: program staff and managers
- Output: recommendations for adjustment
- Examples: mid-program staff focus groups, six-month process review, fidelity checks, pilot evaluation
Summative evaluation
- Purpose: judge the program's overall impact
- Timing: at the end (or end of cycle)
- Audience: funders, policymakers, agency leadership
- Output: verdict on continuation, expansion, or termination
- Examples: end-of-grant outcome report, three-year impact study, cost-effectiveness analysis
The exam trap: a program reports outcomes mid-cycle to a board considering early termination. Is this formative or summative? It is summative in PURPOSE (judging continuation) even though it is mid-cycle in TIMING. Purpose drives the classification, not timing alone.
Six months into a two-year pilot program, an evaluation team conducts interviews with staff and clients to identify what is working, what is not, and what should be adjusted before the program continues. This activity is BEST described as:
Process vs. outcome evaluation
The second core split. Process evaluation asks what is happening; outcome evaluation asks what changed.
- Process evaluation. Documents what the program ACTUALLY does. Captures activities, services delivered, populations reached, fidelity to model, dosage of services, dropout patterns. Answers: Are we doing what we said we would do? Are we reaching whom we said we would reach? Are we delivering at the intensity we planned? Process evaluation is also called IMPLEMENTATION evaluation. Critical when interpreting any outcome evaluation: if outcomes are weak but process data show low fidelity, the issue is implementation, not the model.
- Outcome evaluation. Measures the CHANGES the program produced. Short-term outcomes (knowledge, attitudes, skills); intermediate outcomes (behavior, status); long-term outcomes (sustained change, system-level shifts). Often visualized in a LOGIC MODEL: inputs → activities → outputs → outcomes (short, intermediate, long-term) → impact.
- Impact evaluation. A specific kind of outcome evaluation that tries to attribute change to the program (rather than to other factors). Strongest when it uses a comparison group (randomized or matched). Without a comparison, observed outcomes may be due to maturation, history, or other influences, not the program itself.
Inputs → Activities → Outputs → Outcomes: the logic-model vocabulary the exam may use. INPUTS are resources (funding, staff, facilities, partnerships). ACTIVITIES are what the program does (workshops, counseling sessions, case management). OUTPUTS are units of activity (number of clients served, sessions delivered, materials distributed). OUTCOMES are the resulting changes in clients, families, communities. Outputs are NOT outcomes: "we served 500 clients" is an output; "client depression scores decreased by an average of 6 points" is an outcome. This is a frequently-tested distinction.
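The output/outcome distinction shows up clearly once you compute both from the same records. A minimal sketch, assuming hypothetical pre and post depression scores and a made-up session count:

```python
import statistics

# Hypothetical program records: (pre-score, post-score) on a depression scale,
# one pair per client who completed the program. Illustrative numbers only.
client_scores = [(18, 11), (15, 10), (21, 14), (12, 8), (19, 12)]
sessions_delivered = 60  # hypothetical count taken from attendance logs

# OUTPUTS: units of activity the program produced.
clients_served = len(client_scores)

# OUTCOME: change in the clients themselves (negative = symptoms decreased).
mean_change = statistics.mean(post - pre for pre, post in client_scores)

print(f"Outputs: {clients_served} clients served, {sessions_delivered} sessions delivered")
print(f"Outcome: average score change of {mean_change} points")
```

The first line could look impressive even if no client improved; only the second line says anything about change.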
Cost-effectiveness and cost-benefit analysis
Two related economic methods the outline names, close enough to confuse and distinct enough to test.
| Method | What it compares | Output | When to use |
|---|---|---|---|
| Cost-effectiveness | Program cost vs. outcomes in NATURAL units (lives saved, recidivism reduction, depression score drop) | Cost per unit of outcome (e.g., $2,400 per re-arrest avoided) | Comparing programs with the SAME outcome; outcome cannot easily be converted to dollars |
| Cost-benefit | Program cost vs. benefits all converted to DOLLAR terms | Net benefit (benefit minus cost) or benefit-to-cost ratio (e.g., $3.50 returned per $1 invested) | Comparing programs with DIFFERENT outcomes; outcomes can be monetized |
The exam-relevant distinction. Cost-effectiveness uses NATURAL OUTCOME UNITS (cost per X outcome achieved). Cost-benefit converts ALL outcomes TO DOLLARS. A cost-benefit analysis of a suicide prevention program tries to monetize a life saved (e.g., via human-capital or willingness-to-pay methods). Cost-effectiveness avoids that step and just reports cost per life saved, leaving the value judgment outside the analysis.
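The arithmetic behind the two methods is short enough to sketch. The dollar figures below are hypothetical, chosen so the results line up with the table's example values; the function names are illustrative, not a standard library.

```python
def cost_per_outcome(program_cost, outcome_units):
    """Cost-effectiveness: dollars spent per natural unit of outcome."""
    if outcome_units <= 0:
        raise ValueError("outcome_units must be positive")
    return program_cost / outcome_units


def net_benefit(program_cost, monetized_benefit):
    """Cost-benefit: benefits (already converted to dollars) minus cost."""
    return monetized_benefit - program_cost


def benefit_cost_ratio(program_cost, monetized_benefit):
    """Cost-benefit: dollars of benefit returned per dollar of cost."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    return monetized_benefit / program_cost


# Hypothetical reentry program (all figures invented for illustration).
program_cost = 240_000
rearrests_avoided = 100        # outcome in natural units
monetized_benefit = 840_000    # assumed dollar value placed on all benefits

print(cost_per_outcome(program_cost, rearrests_avoided))    # dollars per re-arrest avoided
print(net_benefit(program_cost, monetized_benefit))         # dollars of net benefit
print(benefit_cost_ratio(program_cost, monetized_benefit))  # dollars returned per dollar spent
```

Notice that only the cost-benefit functions need `monetized_benefit`. That input is where the contested value judgments enter; cost-effectiveness never asks for it.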
Caveats both methods share. What counts as a "cost" and a "benefit" depends on perspective (program, government, society). Long-term benefits are often discounted to present value. Many social-work benefits are notoriously hard to quantify (family stability, dignity, community trust). The exam rewards recognizing that economic evaluation is one input among many, not the final word on program value.
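Discounting to present value is a one-line formula: a benefit received t years from now is divided by (1 + r) raised to t, where r is the annual discount rate. The sketch below assumes a 3% rate and a made-up benefit stream purely for illustration; choosing the rate is itself a judgment call.

```python
def present_value(amount, annual_rate, years_from_now):
    """Value today of an amount received some number of years in the future."""
    return amount / (1 + annual_rate) ** years_from_now


def discounted_total(benefits_by_year, annual_rate):
    """Sum a stream of benefits; index 0 is today, index 1 is one year out, and so on."""
    return sum(
        present_value(benefit, annual_rate, year)
        for year, benefit in enumerate(benefits_by_year)
    )


# Hypothetical stream: nothing today, then $50,000 of benefit in each of three years.
benefits = [0, 50_000, 50_000, 50_000]
rate = 0.03  # assumed discount rate

total_pv = discounted_total(benefits, rate)
print(f"Undiscounted: ${sum(benefits):,}; present value: ${total_pv:,.2f}")
```

The further out a benefit arrives, the less it counts today, which is one reason programs with slow-building payoffs can look weaker in an economic evaluation than they are.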
Research foundations & ethics
Research ethics
Research, like practice, runs under ethical rules. The Belmont Report (1979) set out three principles that still anchor every IRB review: respect for persons, beneficence, and justice. In operational terms:
- Informed consent. Participants understand the study purpose, procedures, risks, benefits, alternatives, and right to withdraw, and agree voluntarily. Consent is a PROCESS, not a single signature. Special protections apply to children, prisoners, pregnant participants, and adults with diminished capacity.
- Confidentiality and data security. Identifiable participant data is protected through coded identifiers, secure storage, limited access, and de-identification before publication. Certificates of Confidentiality (federal) can protect researchers from being compelled to disclose identifying information.
- Voluntary participation. Participants may decline initially or withdraw at any time without penalty. This is especially important when researcher and participant share an institutional context (clinician researching their own clients, supervisor researching supervisees) where implicit coercion is a concern.
- IRB (Institutional Review Board) approval. Federal regulations (45 CFR 46, the Common Rule) require IRB review of federally funded or conducted research involving human subjects, and many institutions apply the same standard to all of their research. The IRB assigns each study to exempt, expedited, or full-board review based on risk. Practice evaluation may or may not require IRB review depending on whether findings will be generalized beyond the program.
- Minimization of harm and the beneficence principle. Research should not cause physical, psychological, social, or economic harm to participants. Where some risk is unavoidable, it must be justified by anticipated benefits and minimized through study design.
- Justice in subject selection. Burdens and benefits of research are distributed fairly across populations. Recruiting only marginalized groups for risky studies (or only privileged groups for beneficial ones) is unjust.
The historical anchors for these rules: the Tuskegee Syphilis Study (1932-1972), the use of Henrietta Lacks's cells without her consent (HeLa cells, 1951), the Willowbrook Hepatitis Studies (1956-1970), and the Stanford Prison Experiment (1971). Together, cases like these shaped the regulatory responses that followed (the Belmont Report, the Common Rule, HIPAA, the modern IRB structure). Many of the populations exploited in these studies were Black, disabled, incarcerated, or institutionalized.
A doctoral student social worker wants to interview adults in a domestic violence shelter about their experiences with services. The shelter director, citing client confidentiality, asks the student to obtain only verbal consent and not record identifying information. The MOST appropriate next step is to:
An LMSW with five years of post-licensure experience encounters a clinically complex case involving severe trauma history and active substance use. She seeks the advice of a senior LCSW with extensive trauma expertise to discuss assessment and treatment options. The relationship between the two social workers is BEST described as:
Research design quick reference (IIIC.4)
The exam tests basic recognition of the major designs, not advanced methodology. The categories to know:
- Experimental (randomized controlled trial, RCT). Random assignment to treatment and control groups. The gold standard for establishing causality because random assignment balances unmeasured differences across groups. Often impractical or unethical in social work settings.
- Quasi-experimental. Comparison groups exist but assignment is not random (matched groups, waitlist controls, pre-existing comparison sites). Stronger than no comparison; weaker than an RCT because unmeasured differences between groups may bias results.
- Pre-experimental (one-group pre-post, posttest-only). No comparison group. Cannot rule out alternative explanations (maturation, history, regression to the mean). Common in agency practice; limited for causal claims.
- Descriptive. Documents what exists without testing causal claims. Includes surveys, prevalence studies, case studies. Useful for understanding scope; not designed to test interventions.
- Qualitative designs. Phenomenology (lived experience), grounded theory (theory generation from data), ethnography (cultural immersion), narrative (story analysis), case study (in-depth single case or small set). Strong for meaning-making and theory generation.
Key threats to internal validity the exam may name: HISTORY (events outside the study affect outcomes), MATURATION (natural change over time), TESTING (the pretest itself influences posttest), INSTRUMENTATION (measure changes during the study), REGRESSION TO THE MEAN (extreme scores drift toward average), SELECTION (groups differ at baseline), ATTRITION (differential dropout). Random assignment is the most powerful defense; comparison groups address several; longitudinal repeated-measure designs address others.
Lesson summary
- "The worker should follow the manualized treatment exactly even though the client objects." No: EBP integrates research, clinical expertise, AND client values; rigid manual-following over client objection misapplies the framework.
- "A reliable instrument is also valid." No: reliability is necessary but not sufficient. A consistently wrong measure is reliable but not valid.
- "Tracking outcomes weekly over six months is pre-post testing." No: pre-post is two time points (before and after). Repeated measurement over time is single-system design.
- "The program reported strong outcomes, so the model works." Maybe, maybe not: without fidelity data, you do not know whether the model was actually delivered. Outcomes are not interpretable without process data.
- "The 300 sessions we delivered prove the program is working." No: 300 sessions is an OUTPUT, not an outcome. Outputs measure activity; outcomes measure CHANGE.
- "A mid-program review is always formative." No: TIMING does not classify the evaluation; PURPOSE does. A mid-program review used to decide whether to terminate the program is summative in purpose.
- "The community said they needed X, so the program should address X." Felt need is one of four types; comparing felt need to normative, expressed, and comparative need produces a fuller picture.
- "Cost-benefit analysis is more comprehensive than cost-effectiveness because it dollarizes benefits." Not necessarily: cost-benefit requires monetizing outcomes that often cannot be ethically or accurately converted to dollars (life saved, family stability). Cost-effectiveness avoids that step.
- "Practice evaluation does not require IRB review." Sometimes true (when findings stay internal), sometimes false (when findings will be published or generalized). The defining test is whether the activity meets the federal definition of research, not whether it is called "evaluation."