Medical Affairs · Evidence Generation · Rheumatology

Difficult-to-treat rheumatoid arthritis patient identification: AI-powered mining of unstructured EHR clinical data

Challenge

Many difficult-to-treat RA patients were recorded as 'standard RA' in EHR systems because the evidence needed to classify them sat in unstructured clinical notes rather than coded fields, making them invisible to standard database queries.

Approach

Applied natural language processing to unstructured EHR data — physician notes, treatment records, and lab results — to identify patients meeting difficult-to-treat criteria who had been missed by conventional coding.

Result

Difficult-to-treat RA patients identified from previously inaccessible unstructured data; methodology validated for replication across additional markets and hospital systems.

The challenge

The most complex patients are the least visible in structured data

Difficult-to-treat rheumatoid arthritis is defined by inadequate response to multiple biologic agents, but this definition lives in the clinical narrative, not in the structured fields of most hospital EHR systems. A physician who documents in free text that a patient has 'failed three biologics and remains inadequately controlled' has recorded the key diagnostic information, yet that patient may be coded in the EHR system as simply 'rheumatoid arthritis' with a list of past medications.

Standard database queries cannot identify these patients. The information is there — but it is locked in unstructured text. For a brand with a product specifically indicated for difficult-to-treat RA, the inability to quantify the true patient population was limiting both the commercial case and the access arguments.

The solution required applying language processing technology to clinical text — not as a research exercise, but as a practical tool for patient identification that could be deployed in partnership with hospital rheumatology departments.

In rheumatology, the patients with the most complex disease are often the ones who are hardest to find in a dataset. Their complexity is documented in words, not in codes — and standard analytics cannot read words.

Our approach

What we did

EHR data access and governance

Established data access agreements with 3 hospital rheumatology departments covering over 18,000 RA patient records. Designed a data governance framework to ensure patient privacy and compliance with GDPR requirements.

Clinical feature definition

Worked with rheumatologists to define the specific clinical features and treatment history patterns indicative of difficult-to-treat RA. Translated clinical definitions into NLP target variables.
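
One way to picture this step is as a small data structure holding the target variables, plus a rule that combines them. The sketch below is illustrative only: the field names, the `meets_d2t_criteria` function, and the threshold of two failed biologics are assumptions made for this example, not the definition agreed with the participating rheumatologists.

```python
from dataclasses import dataclass, field


@dataclass
class D2TFeatures:
    """Hypothetical NLP target variables for one patient (names are assumptions)."""
    # Distinct biologic agents with a documented failure or inadequate response.
    failed_biologics: set = field(default_factory=set)
    # True when the notes describe disease that remains active despite treatment.
    inadequately_controlled: bool = False


def meets_d2t_criteria(features: D2TFeatures, min_failed: int = 2) -> bool:
    """Flag a patient when enough distinct biologics failed and disease remains active.

    min_failed=2 is an assumed threshold standing in for "multiple biologic agents".
    """
    enough_failures = len(features.failed_biologics) >= min_failed
    return enough_failures and features.inadequately_controlled


# Example mirroring the free-text phrase 'failed three biologics and remains
# inadequately controlled'; the drug names are placeholders.
patient = D2TFeatures({"adalimumab", "etanercept", "tocilizumab"}, True)
print(meets_d2t_criteria(patient))
```

Keeping the threshold as a parameter lets clinicians adjust the rule without touching the extraction logic.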

NLP model development

Developed and trained an NLP model to extract relevant clinical signals from free-text physician notes: biologic treatment sequences, response assessments, disease activity scores, and failure documentation language.
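
To make the idea of "failure documentation language" concrete, here is a minimal rule-based baseline. It is not the trained model described above; it is a simple keyword-and-regex sketch. The drug list, the failure phrases, and the function name `extract_failed_biologics` are all assumptions chosen for illustration.

```python
import re

# Hypothetical, incomplete lexicon; a real project would use a curated drug dictionary.
BIOLOGICS = ["adalimumab", "etanercept", "infliximab", "tocilizumab", "abatacept", "rituximab"]

# Assumed examples of failure language; real notes use far more varied wording.
FAILURE_TERMS = re.compile(r"fail(?:ed|ure)?|inadequate response|no response")


def extract_failed_biologics(note: str) -> set:
    """Return biologic names that share a sentence with failure language."""
    failed = set()
    # Split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.;!?])\s+", note.lower()):
        if not FAILURE_TERMS.search(sentence):
            continue
        for drug in BIOLOGICS:
            if re.search(rf"\b{drug}\b", sentence):
                failed.add(drug)
    return failed


note = (
    "Patient failed adalimumab in 2019. "
    "Inadequate response to tocilizumab; switched. "
    "Currently on abatacept, tolerating well."
)
print(sorted(extract_failed_biologics(note)))
```

A baseline like this ignores negation ("no failure of etanercept"), misspellings, and brand names, which is part of why a trained model and clinician validation are needed.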

Validation against clinical gold standard

Validated the NLP model against a gold-standard dataset of 400 patient records manually reviewed by rheumatologists. The model achieved a sensitivity of 87% and a specificity of 93% for difficult-to-treat classification.
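
For readers less familiar with these metrics, the sketch below shows how sensitivity and specificity are computed from gold-standard labels and model predictions. The ten labels are toy values invented for the example; they are not the study data, and the function name is an assumption.

```python
def sensitivity_specificity(gold, predicted):
    """Compute (sensitivity, specificity) from parallel lists of booleans.

    True means 'difficult-to-treat'. gold is the clinician label,
    predicted is the model label.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # correctly flagged
    fn = sum(1 for g, p in pairs if g and not p)      # missed by the model
    tn = sum(1 for g, p in pairs if not g and not p)  # correctly not flagged
    fp = sum(1 for g, p in pairs if not g and p)      # wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


# Toy data only: 4 gold-standard positives and 6 negatives.
gold = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(gold, predicted)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sensitivity answers "how many true difficult-to-treat patients did the model find?", while specificity answers "how many other patients did it correctly leave out?".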

Patient identification and characterisation

Applied the validated model to the full EHR dataset. Characterised identified patients by treatment history, disease duration, and clinical management patterns. Prepared a summary report for participating hospital departments.

Result

Measurable impact

The NLP model identified a difficult-to-treat RA cohort 34% larger than the one found through structured data queries alone. Across the 3 participating hospital departments, 127 previously uncategorised patients were identified as meeting difficult-to-treat criteria. The methodology was documented in a technical report reviewed and validated by the participating rheumatologists. Two additional hospital systems in separate markets requested access to the methodology for local application.
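
A comparison of this kind can be expressed as simple set arithmetic on patient identifiers. The case study does not state exactly how the percentage was calculated, so the sketch below shows one plausible calculation under that assumption. The patient IDs are invented and the function name `compare_cohorts` is hypothetical.

```python
def compare_cohorts(structured_ids: set, nlp_ids: set) -> dict:
    """Compare a structured-query cohort with an NLP-identified cohort."""
    if not structured_ids:
        raise ValueError("structured cohort is empty; relative increase is undefined")
    newly_found = nlp_ids - structured_ids  # patients only the NLP step surfaced
    combined = structured_ids | nlp_ids     # everyone found by either method
    return {
        "newly_found": newly_found,
        "combined_size": len(combined),
        # Assumed definition: new patients relative to the structured baseline.
        "relative_increase": len(newly_found) / len(structured_ids),
    }


# Toy identifiers only, not study data.
structured = {"P001", "P002", "P003", "P004", "P005"}
nlp = {"P002", "P003", "P004", "P005", "P006", "P007"}

result = compare_cohorts(structured, nlp)
print(sorted(result["newly_found"]), result["combined_size"], result["relative_increase"])
```

In this toy case one patient is found only by the structured query, which is why the two methods are combined with a union rather than one replacing the other.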

Significantly more patients identified
vs structured data queries alone

Previously uncategorised patients identified
Meeting difficult-to-treat criteria, invisible to standard queries

Validated NLP model
High sensitivity and specificity, validated against clinical gold standard


