Patients and intervention
A total of 226 patients aged 18 years or over, fulfilling 2010 ACR/EULAR classification criteria for RA who were eligible for treatment with anti-TNF therapy according to UK National Institute for Health and Care Excellence (NICE) guidelines, i.e. failing or intolerant to conventional synthetic disease-modifying anti-rheumatic drug (csDMARD) therapy were recruited when fulfilling the trial inclusion/exclusion criteria (for the full study protocol and baseline patient characteristics see the primary analysis of the trial)20. Due to issues with drug supply, a separate trial ‘STRAP-EU’ was opened which replicated the STRAP trial in the UK, and recruitment was expanded to four other countries in Europe in 2018. For the analyses, the data from both trials were combined. Briefly, patients underwent a synovial biopsy of a clinically active joint at entry to the trial, performed according to the expertise of the local centre as either an ultrasound-guided or arthroscopic procedure52. Following synovial biopsy, patients were randomised to receive either rituximab as two 1000-mg infusions at an interval of 2 weeks, administered at baseline, tocilizumab 162 mg administered as a weekly subcutaneous injection, or etanercept 50 mg administered as a weekly subcutaneous injection. Patients were followed up every 4 weeks (±1 week) throughout the 48-week trial treatment period, where RA disease activity measurements and safety data were collected. Follow-up of STRAP patients recruited in the UK after 1st January 2019 ended at 24 weeks from baseline, whereas the follow-up of all STRAP-EU patients continued to end at 48 weeks from baseline. Optionally, a repeated synovial biopsy of the same joint sampled at baseline was performed at 16 weeks. The study was conducted in compliance with the Declaration of Helsinki, International Conference on Harmonisation Guidelines for Good Clinical Practice, and local country regulations. The final protocol, amendments and documentation of consent were approved by the institutional review board of each study centre and relevant independent ethics committees: UK ethics committee approval MREC 14/WA/1209 (Wales REC 3); Comité d’Ethique Hospitalo-Facultaire Saint-Luc, Bruxelles, Belgium; Comitato Etico Interaziendale, A.O.U. Maggiore della Carita, Novara, Italy; Comissão de Ética para a Investigaçāo Clínica (CEIC), Portugal; Comité Ético de Investigación con medicamentos (CEIm) del Hospital Clínic de Barcelona, Spain. All patients provided written informed consent. The trial was supported by an unrestricted grant from the Medical Research Council. The study protocol is available at http://www.matura-mrc.whri.qmul.ac.uk/documents.php. These trials are registered with the EU Clinical Trials Register, 2014-003529-16 (STRAP) and 2017-004079-30 (STRAP-EU).
RNA-seq data processing and analysis
A total of 283 paired-end RNA-seq samples from 50 million reads of 150-bp length were mapped to the reference human transcriptome (Gencode v29, GRCh38.p12), and transcripts were then quantified using Salmon version 0.13.153. Tximport version 1.13.10 was used to aggregate transcript-level expression data to genes, then counts were subjected to variance-stabilising transformation (VST) using the DESeq2 version 1.25.9 package54. Following RNA-Seq quality control, with principal component analysis (PCA), five baseline and three follow-up, in total eight samples were excluded due to poor mapping rate that was originated from low RNA quality (Supplementary Fig. 1a). Thus RNA-Seq data from 208 patients were available for subsequent analysis at baseline (65 samples at later time points are not analysed here). Baseline characteristics of patients with available RNA-Seq are shown in Supplementary Table 1.
Histological analysis
A minimum of six synovial biopsies were processed in an Excelsior tissue processor before being paraffin-embedded en masse at Queen Mary University of London Core Pathology department. Tissue sections (3–5-µm thickness) were stained with hematoxylin and eosin and IHC markers CD20 (B cells), CD138 (plasma cells), CD21 (follicular dendritic cells) and CD68 (macrophages) in an automated Ventana Autostainer machine. CD79A (B cells) and CD3 (T-cells) staining was performed in-house on deparaffinized tissue following antigen retrieval (30 min at 95 °C), followed by peroxidase- and protein-blocking steps. Primary antibodies (CD79A (clone JCB117, Dako), CD3 (clone F7.238, Dako), CD20 (clone L26, Dako), CD68 (clone KP1, Dako) and CD138 (clone MI15, Dako)) were used for 60 min at room temperature. Visualisation of antibody binding was achieved by 30-min incubation with Dako EnVisionTM+ before completion by the addition of 3,3’-diaminobenzidine (DAB) + substrate chromogen for 10 s, followed by counterstaining with hematoxylin. Following IHC staining, sections underwent semiquantitative scoring (0–4), by a minimum of two assessors, to determine levels of CD20+ and CD79a+ B cells, CD3 + T-cells, CD138+ plasma cells and CD68+ lining (L) and sublining (SL) macrophages, adapted from a previously described score55 and subsequently validated18. Hematoxylin and eosin-stained slides also underwent evaluation to determine the level of synovitis according to the Krenn synovitis score (0–9)56. Synovial biopsies were classified into synovial histological patterns, also known as pathotypes, according to the following criteria: (1) lymphomyeloid, presence of grade 2–3 CD20+ aggregates, CD20 ≥2 and/or CD138 ≥2; (2) diffuse-myeloid, CD68SL ≥2, CD20 ≤1 and/or CD3 ≥1 and CD138 ≤2; and (3) pauci-immune-fibroid, CD68SL <2 and CD3, CD20 and CD138 <1.
Differential expression, modular and pathway analysis of RNA-Seq data at baseline
Patients classed as responders and non-responders based on their 16-week assessment using ACR20 criteria were compared for each individual treatment group as well as for all treatments combined. The groups showed no significant differences for baseline characteristics, including histological and molecular B-cell status, gender or disease duration (Supplementary Table 1). Low-expressed genes (expressed in fewer than 18 samples with at least a normalised count of 9) were excluded from analysis. Remaining genes were subjected to differential gene expression analysis based on general linear regression models with negative binomial distribution applied to RNA-Seq count data using DESeq2 (version 1.34.0), which uses a Wald test to compare differences between treatment response groups in synovium RNA-Seq samples. To prevent bias of results through potential muscle contamination, a supervised principal component analysis (PCA) was calculated using the prcomp function with the following 17 genes derived from Reactome skeletal muscle module: ACTA1, ACTN2, MYBPC1, MYBPC2, MYH1, MYH2, MYH7, MYH8, MYL1, MYL2, NEB, TCAP, TNNC2, TNNI1, TNNI2, TNNT1 and TNNT3. These genes were checked and confirmed to be highly specific to muscle tissue in the FANTOM5 CAGE-Seq repository. Muscle gene-specific PC1 was employed as a covariate in the DESeq2 analysis to adjust for the presence of small amounts of muscle tissue in a few samples. P values were false discovery rate (FDR) adjusted using Storey’s q value. A cut-off of q < 0.05 was used to identify significantly differentially expressed genes (DEG), illustrated by volcano plots. DEG analysis was repeated with adjustment for biopsy joint size (large = knee, ankle, elbow; small = MCP, PIP, MTP, wrist), but this had negligible impact on DEG analysis and only worsened the number of DEG identified for two of the drugs (Supplementary Fig. 2). Biopsy joint size was not significantly different between responder and non-responder groups (Supplementary Table 2).
DESeq2 outputs from each individual treatment group were used for modular analysis using the Bioconductor package quantitative set analysis for gene expression (QuSAGE, v2.30.0). Weighted gene correlation network analysis (WGCNA) gene modules from ref. 57 were selected for gene set enrichment, and relevant modules were summarised in plots. DESeq2 outputs from the differential expression analysis in all treatments were used for functional enrichment using enrichR (v3.2) with Reactome_2022 as the pathway repository.
2 × 3-way analysis of differentially expressed genes
In order to visualise the specificity of genes for predicting responsiveness to each drug, DEGs identified by the responders vs non-responders contrast in each drug cohort were displayed in three-way polar plots using the volcano3D R package (version 2.0.6) as a 2 × 3-way analysis18. DEGs upregulated in responders were categorised as E+, R+, T+ based on the Wald Chi-squared test p value significance for responder vs non-responder analysis for etanercept, rituximab and tocilizumab, respectively. Genes whose FDR result was significant for two drugs were labelled as mixed categories (E+ R+, E+ T+ or R+ T+). Genes significantly upregulated in responders with all three drugs were excluded. The graphical position of the genes along the three axes of the polar plot was calculated using the estimated log fold change for each gene. The same procedure was applied with directionality reversed to plot genes significantly upregulated in the non-responder group of each drug.
Gene expression integration into cell-specific modules
Modular approaches for gene set enrichment analysis and relative quantification of cell subsets are widely used in recent molecular studies58,59. Here, we integrated gene expression into cell-specific modules to characterise the association of synovial immune cells in RA with multidrug resistance. For the enrichment of 18 single-cell subsets identified in scRNA-Seq of RA synovial tissue21 that are composed of four fibroblast subtypes (SC-F1: CD34+ sublining, SC-F2: HLA+ sublining, SC-F3: DKK3+ sublining and SC-F4: CD55+ lining), four macrophages subtypes (SC-M1: IL1B+ pro-inflammatory, SC-M2: NUPR1+, SC-M3: C1QA+, SC-M4: IFN-activated), six T-cell subtypes (SC-T1: CCR7+ CD4+, SC-T2: FOXP3+ Tregs, SC-T3: PD-1+ Tph/Tfh, SC-T4: GZMK+ CD8+, SC-T5: GNLY+ GZMB+, SC-T6: GZMK+/GZMB+), and four B-cell subtypes (SC-B1: IGHD + CD27- naive, SC-B2: IGHG3+ CD27+ memory, SC-B3: Autoimmune associated, SC-B4: Plasmablasts), we scored each subtype using a modular approach that integrates previously published gene signatures21. The top five exclusively differentially expressed genes (based on AUC scores) were utilised as cell subtype-specific gene sets. Module scores for each subtype were calculated using the AddModuleScore function from the R package Seurat. Linear model method as implemented in R package Limma, was used to detect the differentially abundant cell sub-populations when comparing responders and non-responders.
The main aim of the STRAP clinical trial is to test the utility of analysing synovial B-cell infiltrates as a potential biomarker to guide therapeutic decisions in patients failing DMARD therapy. To stratify patients according to their synovial B-cell infiltrates into B-cell poor/rich pathotypes, we used the B-cell module gene set that contains 71 genes derived from FANTOM5 (see Supplementary Fig. 1c)60. First, we assigned the mean of normalised and scaled gene expression values to this module as previously described and validated18. Then, patients were categorised as B-cell poor or B-cell rich using a predefined cut-off of −0.0413, the median B-cell module values of RA patients recruited in the R4RA clinical trial17.
Unsupervised clustering of baseline gene expressions
To perform unbiased heatmap clustering on the entire baseline population, genes were filtered based on expression levels. For the STRAP cohort, genes were stringently filtered to only highly expressed genes using the edgeR package (version v4.2.0) function filterByExpr with arguments min.count set to 20 and min.total.count set to 5e5. Since R4RA had a smaller sample size (n = 133), min.total.count was reduced to 3e5. Expression levels underwent variance-stabilising transformation (VST) normalisation and Z-score scaling for visualisation with ComplexHeatmap (v2.14.0). Euclidean distance was employed to calculate row distances, while Pearson correlation distance was utilised for column distances. Hierarchical clustering was performed on columns using the complete linkage method, whereas k-means clustering was applied to rows. Genes within each resulting cluster were used for pathway analysis with enrichR (v3.2) using the Reactome_2022 database.
Building classifier models for the prediction of response
Machine learning models were built to predict target DAS28ESR, target DAS28CRP, ACR20 response, CDAI 50% response, EULAR DAS28-ESR good vs moderate/non-response and EULAR DAS28-CRP good vs moderate/non-response to either rituximab, tocilizumab, or etanercept treatment at the primary endpoint (16 weeks).
The model feature space was created using RNA-Seq data restricted to 507 target genes relevant to synovial biology based on the nCounter custom panel (see description of the panel below). Baseline clinical parameters were included to improve response prediction. These included: tender joint count (TJC), swollen joint count (SJC), arthritis activity (patient visual analogue score), rheumatoid factor titre (RF), anti-CCP titre, erythrocyte sedimentation rate (ESR) and C-reactive protein (CRP). TJC, SJC and ESR were square root transformed (sqTJC, sqJSC and sqESR) to be on a Gaussian distribution, and CRP was log-transformed (logCRP). Genes were filtered to those with mean expression ≥6 on the VST scale to remove low-expressed genes, which were less reliably detected by nCounter.
Following processing, data was split into 10 × 10 nested CV folds using the nestedcv R package (version 0.7.9)31. Feature selection was performed within outer CV folds using a t-test filter, with the top n genes by two-tailed t-test p value being retained for fitting of models. An alternative feature selection method, which was used for rituximab response prediction models, was based on fitting a glmnet elastic net regression model to the whole data and selecting genes which were retained in the glmnet model. The number of features selected was chosen to limit model size to between 25 and 40 predictors to design practical models more likely to be feasible in real-life clinical situations. Model hyperparameters were tuned by an inner tenfold cross-validation based on log loss. Overall model performance was determined by tenfold outer cross-validation with 25 repeats to give averaged unbiased estimates of model accuracy. Elastic net penalised regression using the glmnet package was compared against seven machine learning models from the caret package (version 6.0): random forest (RF), least-squares support vector machine (SVM) with radial basis function kernel (svmRadial), least-squares SVM with polynomial kernel (svmPoly), gradient boosted machine (GBM), mixture discriminant analysis (MDA), extreme gradient boosting (xgboost) using trees (xgbTree) or linear regression (xgbLinear). None of these models, with the exception of glmnet, are sparse, which means that they incorporate all predictors during modelling. This leads to models with large numbers of predictors, which have a tendency to fail validation in real clinical situations. Hence, feature selection was used to limit model size to between 25 to 40 predictors to design practical models more likely to be feasible in a real-life clinical situation. Previous studies have shown that with gene expression data, filtering genes with a simple t-test often performs better than more complex feature selection methods61. Thus, for models predicting etanercept or tocilizumab response, we used a t-test filter. However, it was more difficult to obtain a good-quality prediction model for rituximab, so a glmnet-based filter was used. In this two-step process, a LASSO regression model was fitted to the training folds to select optimal features, which are then passed to other models for fitting.
To evaluate overall model performance from each repeat of 10 × 10-fold nested CV, predictions from left-out outer CV test folds were pooled, and performance was determined compared to the ground truth. Multiple metrics were used including area under curve (AUC) from receiver operating characteristic (ROC) curves computed using R package pROC, accuracy and balanced accuracy. Tuning parameters for the final model were determined by a final round of CV on the whole dataset with the final model fitted to the whole dataset.
Performance of each model for predicting response to each drug was measured by AUC (area under ROC curve) with 25 repeats of 10 × 10 nested CV to compare the primary endpoint ACR20 response against five other response endpoints to identify the optimal endpoint (Supplementary Figs. 5–7): DAS28-ESR/CRP <3.2 (target DAS28-ESR/CRP); CDAI 50% response, EULAR good vs moderate/non-response for DAS28-ESR/CRP. This showed that for etanercept and tocilizumab, the best prediction of response was seen for target DAS28-ESR (<3.2). Therefore, this response outcome measure was selected for model optimisation. For etanercept and tocilizumab, models were trained with a binary outcome, with response defined as DAS28-ESR <3.2 at 16 weeks. For the rituximab model, due to difficulties with obtaining a reliable binary classification model, an ordinal outcome was used, namely DAS28-ESR status at 16 weeks which has four levels: high (DAS28-ESR >5.1), moderate (3.2 < DAS28 <5.1) or low disease activity (2.6< DAS28 <3.2) and remission (DAS28 <2.6). This four-level outcome fits alongside the original binary outcome as low disease activity/remission corresponds directly to response, and moderate/high disease activity at 16 weeks corresponds to non-response. Rituximab prediction models were fitted to this ordinal outcome as a regression, then converted to a binary outcome after the final model was fitted and performance calculated for the binary outcome.
Feature importance was measured across outer CV folds as well as the final model to rank predictors in terms of importance and estimate stability of variables in models by the frequency with which variables were selected across outer CV folds and estimate the variance across variable importance across the outer CV folds and final model.
Validation of machine learning models in R4RA
Two of the machine learning models built on STRAP data were validated in the independent R4RA cohort. In the R4RA trial, patients (n = 164) were randomised to tocilizumab or rituximab (not etanercept or anti-TNF)17. Therefore, only these two models were tested in R4RA. Of the 164 patients randomised on entry to the R4RA trial, good-quality RNA-Seq was available on n = 133 post-QC19. R4RA baseline synovial RNA-Seq data (n = 133) was batch corrected against STRAP RNA-Seq data using the ComBat function from the SVA R/Bioconductor package (version 3.52.0). Baseline clinical parameters required for each predictive model underwent identical transformation as in STRAP (square root for TJC, SJC, ESR; log for CRP). The tocilizumab and rituximab model-predicted response was compared against the actual DAS28-ESR response defined as DAS28-ESR <3.2 in individuals treated with tocilizumab (n = 65) or rituximab (n = 68) at the 16-week primary endpoint of the R4RA trial. Model performance for predicting response to each drug was measured by ROC AUC.
Development of the custom synovium nCounter panel
The nCounter platform provides the flexibility to customise assay content with up to 800 user-defined targets to meet specific project demands. Our custom panel, also called the biologic screening panel for arthritis, was created using an iterative development process. The initial panel was designed with five gene sets that contained a total of 798 unique genes related to synovial pathobiology. These gene sets included: (i) genes associated with synovial pathotypes from the PEAC RNA-Seq study18 (258 genes), (ii) protein-protein interaction networks of RA drug targets retrieved from the STRING database (https://www.string-db.org, 166 genes), (iii) gene predictors from previously developed treatment response prediction models (320 genes)19,24,30, cell-type-specific genes from single-cell and spatial transcriptomics datasets (147 genes)21,22,33,62,63, and 20 housekeeping genes for data normalisation identified through analysis of PEAC, R4RA and STRAP RNA-Seq data. Even though nCounter panels are completely customisable, the content of the manufactured panels may differ significantly from the original design because not all genes have probes available. In our case, issues with the probe design led to the exclusion of 33 genes from the panel before the manufacturing of the first cartridge. Then we tested our first panel on 48 samples, including synovium RNA isolates, Universal Human Reference RNA (Thermo Fisher QS0639) and water. Unexpectedly, in all samples, the negative control probes generated signals which were excessively strong for QC, with negative signals apparently higher than signals collected from some valid gene probes. We observed that from the first iteration that, in order to prevent extreme outliers from suppressing the signal of the other probes and reduce the noise signal in synovial tissue, it was optimal for application to synovial tissue to reduce the content of custom panels to around 600 genes with probes within a specific dynamic range of expression levels. Therefore, to refine the gene list, less informative genes which were either excessively highly expressed (an average count of >10,000) or very low expressed (maximum counts <100) were removed from the panel. Selected problematic probes were redesigned by Nanostring scientists, when possible. Probes identified as problematic (‘sticky’) by the manufacturing company were removed, as well as probes located in untranslated regions.
In the second round, we submitted a refined list of 458 genes with an additional gene set that was a collection of genes associated with JAK-STAT, TNF and IFN signalling pathways identified from the KEGG database (206 genes). This resulted in 526 unique genes. Three gene probes were subsequently discarded from the panel as their probes had low specificity and were predicted to predispose to high background signal. The second version was synthesised with 523 unique genes and tested on a further 48 synovial samples. Subsequently, the LYZ gene probe was redesigned and included in the final, third version of the panel. Test runs of version 2 onwards did not report abnormal signals from control probes, and the final version with 524 genes (507 target genes and 17 housekeeping genes, Supplementary Data 5) was manufactured for this study. Expression stability of the housekeeping genes was assessed using data collected from the test runs, and due to less stable expression patterns, three housekeeping genes (ISY1, STK11IP and TFRC) were removed from the data normalisation process. Supplementary Fig. 4 illustrates the entire process and contains all the information regarding the development of the custom panel.
nCounter analysis
The final custom synovium nanoString nCounter panel developed contained 507 genes relevant to synovial pathobiology based on previous studies of synovial gene expression (Supplementary Data 5)18,19,24,30. The panel also contained 17 housekeeping genes (i.e. a total of 524 genes) selected for detectable and stable expression in synovial tissue and technical probes (six positive and eight negative). RNA samples were assayed using the nanoString nCounter Sprint Profiler from 100 ng of synovial tissue RNA, following the manufacturer’s instructions. Raw nCounter counts were extracted from RCC files and normalised using nanoString’s official R package (nanoStringNCTools version 1.6.0). Housekeeping genes and panel standard probes (synthetic oligos) normalisation methods were applied to the raw data to scale and standardise the data.
nCounter pseudo-RNA-Seq conversion and prediction using machine learning models
Linear models for each gene were fitted between variance stabilised transformed bulk RNA-Seq gene counts and log-transformed normalised nCounter gene counts. The nCounter assay data were converted to RNA-Seq scale (referred to as pseudo-RNA-Seq) based on stored regression models for each gene. Using predefined linear regression components (intercept and slope coefficient), nCounter data was converted to pseudo-RNA-Seq and then passed as fresh data input into the finalised machine learning fitted nestedcv package model objects (glmnet model for etanercept; gbm model for tocilizumab; xgbLinear model for rituximab) to determine a predicted probability of response for each sample. Predicted response probability was compared with actual outcome in the STRAP trial to determine confusion matrices of predicted binary response vs actual response, accuracy, balanced accuracy and ROC AUC for the nCounter assay.
Statistical analysis
Statistical analysis of RNA-Seq gene expression data were performed using negative binomial general linear models using the DESeq2 R package. For differential gene expression, p-values were false discovery rate (FDR) adjusted using Storey’s q value. A cut-off of q < 0.05 was considered significant. Statistical analysis of gene modules was performed using QuSAGE with FDR adjustment, and a cut-off of q < 0.05 was considered significant (unless otherwise stated). Two-tailed tests were used throughout.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.