Medical Affairs, Evidence Generation, Rheumatology

Difficult-to-treat rheumatoid arthritis patient identification: AI-powered mining of unstructured EHR clinical data

Challenge
A significant proportion of difficult-to-treat RA patients were recorded as 'standard RA' in EHR systems because the evidence that they met difficult-to-treat criteria sat in unstructured clinical notes rather than coded fields, making them invisible to standard database queries.
Approach
Applied natural language processing to unstructured EHR data — physician notes, treatment records, and lab results — to identify patients meeting difficult-to-treat criteria who had been missed by conventional coding.
Result
Difficult-to-treat RA patients identified from previously inaccessible unstructured data; methodology validated for replication across additional markets and hospital systems.
The challenge

The most complex patients are the least visible in structured data

Difficult-to-treat rheumatoid arthritis is defined by inadequate response to multiple biologic agents, but this definition lives in the clinical narrative, not in the structured fields of most hospital EHR systems. A physician who documents in free text that a patient has 'failed three biologics and remains inadequately controlled' has recorded the key diagnostic information, yet that patient may be coded in the EHR system simply as 'rheumatoid arthritis' with a list of past medications.

Standard database queries cannot identify these patients. The information is there — but it is locked in unstructured text. For a brand with a product specifically indicated for difficult-to-treat RA, the inability to quantify the true patient population was limiting both the commercial case and the access arguments.

The solution required applying language processing technology to clinical text — not as a research exercise, but as a practical tool for patient identification that could be deployed in partnership with hospital rheumatology departments.

In rheumatology, the patients with the most complex disease are often the ones who are hardest to find in a dataset. Their complexity is documented in words, not in codes — and standard analytics cannot read words.

Our approach

What we did

1. EHR data access and governance
Established data access agreements with 3 hospital rheumatology departments covering over 18,000 RA patient records. Designed a data governance framework to ensure patient privacy in compliance with GDPR.
2. Clinical feature definition
Worked with rheumatologists to define the specific clinical features and treatment history patterns indicative of difficult-to-treat RA. Translated these clinical definitions into NLP target variables.
3. NLP model development
Developed and trained an NLP model to extract relevant clinical signals from free-text physician notes: biologic treatment sequences, response assessments, disease activity scores, and failure documentation language.
4. Validation against clinical gold standard
Validated the NLP model against a gold standard dataset of 400 patient records manually reviewed by rheumatologists. Achieved sensitivity of 87% and specificity of 93% for difficult-to-treat classification.
5. Patient identification and characterisation
Applied the validated model to the full EHR dataset. Characterised the identified patients by treatment history, disease duration, and clinical management patterns. Prepared a summary report for the participating hospital departments.
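
The extraction and validation steps above can be illustrated with a deliberately simple sketch. It is not the model used in this project, which was a trained NLP model; here a hand-written rule stands in for it. The drug list, the failure cues, the threshold of two failed biologics, the toy notes, and all function names are assumptions made for illustration only.

```python
import re

# Hypothetical, incomplete list of biologic names (illustration only).
BIOLOGICS = ["adalimumab", "etanercept", "infliximab",
             "tocilizumab", "abatacept", "rituximab"]
DRUG = re.compile(r"\b(" + "|".join(BIOLOGICS) + r")\b", re.IGNORECASE)

# Hypothetical phrases that signal treatment failure in a note.
FAILURE_CUE = re.compile(
    r"\b(?:failed|inadequate response|no response)\b", re.IGNORECASE
)


def failed_biologics(note: str) -> set[str]:
    """Return biologic names found in sentences that contain a failure cue."""
    failed = set()
    # Naive sentence split on '.' or ';' followed by whitespace.
    for sentence in re.split(r"(?<=[.;])\s+", note):
        if FAILURE_CUE.search(sentence):
            failed.update(name.lower() for name in DRUG.findall(sentence))
    return failed


def is_difficult_to_treat(note: str, min_failed: int = 2) -> bool:
    """Flag a note when at least `min_failed` distinct biologics have failed."""
    return len(failed_biologics(note)) >= min_failed


def sensitivity_specificity(predicted: list[bool],
                            gold: list[bool]) -> tuple[float, float]:
    """Compare predictions with reviewer labels.

    Assumes the gold standard contains at least one positive and one negative.
    """
    pairs = list(zip(predicted, gold))
    tp = sum(p and g for p, g in pairs)
    tn = sum(not p and not g for p, g in pairs)
    fn = sum(not p and g for p, g in pairs)
    fp = sum(p and not g for p, g in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy notes and toy reviewer labels (made up, not project data).
notes = [
    "Patient failed adalimumab and etanercept. Remains inadequately controlled on tocilizumab.",
    "Stable on methotrexate. No biologic exposure.",
    "Inadequate response to infliximab; switched to abatacept with good effect.",
    "Third biologic stopped for lack of efficacy.",
]
gold = [True, False, False, True]

predicted = [is_difficult_to_treat(note) for note in notes]
print(predicted)                                 # [True, False, False, False]
print(sensitivity_specificity(predicted, gold))  # (0.5, 1.0)
```

The last toy note shows why a fixed rule is only a starting point: the failure is documented without naming a drug or using a listed cue, so the rule misses it and sensitivity drops. Cases like this are what a trained model and a clinician-reviewed gold standard are meant to catch.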
Result

Measurable impact

The NLP model identified a cohort of difficult-to-treat RA patients that was 34% larger than the number identified through structured data queries alone. Across the 3 participating hospital departments, 127 previously uncategorised patients were identified as meeting difficult-to-treat criteria. The methodology was documented in a technical report reviewed and validated by the participating rheumatologists. Two additional hospital systems in separate markets requested access to the methodology for local application.
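
The comparison with structured queries comes down to set arithmetic over patient identifiers. The sketch below uses made-up IDs, not project data. The function name is hypothetical, and it assumes the enlarged cohort is the union of the structured-query cohort and the NLP-identified cohort, which the report summarised here does not spell out.

```python
def cohort_uplift(structured_ids: set[str],
                  nlp_ids: set[str]) -> tuple[set[str], float]:
    """Return patients found only via NLP and the relative cohort increase in %."""
    newly_found = nlp_ids - structured_ids   # missed by structured queries
    combined = structured_ids | nlp_ids      # full cohort after adding NLP hits
    uplift_pct = 100 * (len(combined) - len(structured_ids)) / len(structured_ids)
    return newly_found, uplift_pct


# Toy identifiers for illustration only.
structured = {"P001", "P002", "P003", "P004"}
nlp = {"P002", "P003", "P005"}

newly_found, uplift = cohort_uplift(structured, nlp)
print(newly_found)  # {'P005'}
print(uplift)       # 25.0
```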

34% more patients identified vs structured data queries alone
127 previously uncategorised patients identified as meeting difficult-to-treat criteria, invisible to standard queries
Validated NLP model: 87% sensitivity and 93% specificity against the clinical gold standard
