Healthcare expenditure has become a significant component of global spending14, and many health systems face challenges to improve efficiency and provide sustainable care15,16. Various payment incentives have been proposed and experimented with curbing the rising cost in the healthcare system to encourage healthcare providers to plan discreetly on resources17,18, promoting risk-sharing between payers and providers. Under prospective payment models, cost monitoring and management are vital for providers to stay financially competitive and are also essential for them to offer high-quality care. At the same time, precise cost information based on well-represented patient data may also assist payers in adjusting payment policies to further increase efficiency and support new payment initiatives19,20.
In this study, we focused on predicting episode-level DRG and estimating DRG-based expected payment at population level, which could help hospitals learn about their case mix and average cost at an early stage and make decisions accordingly. Existing studies21,22 examined health cost prediction from a payer perspective, such as using claims data to predict cost in subsequent years. Also, studies have shown the promise of machine learning in predicting events like unexpected 30-day readmission8,23 and identifying high-need high-cost patients24 that could preempt avoidable health expenses. Meanwhile, these studies did not focus on active patients to provide support for provider decisions. A prior study7 investigated DRG prediction in early hospital settings to enable better resource allocation, and it showed the machine learning-enabled approach could increase the contribution margin for the hospital. In our study, we developed deep learning-based models to process only routine clinical notes. The end-to-end modeling and the use of only routine patient data enables scalable and timely estimation at a population level to provide cost indicators, especially given that the standard calculation of DRGs involves intensive human effort and lengthy turn-around via manually coding ICDs. This is a major difference of our study on early DRG prediction as compared to the previous work including ICD codes in learning and prediction; ICD codes may not be readily available for all hospitals at early admission—at least not automatically. We also proposed to estimate hospital cost by predicting DRG-based payment at the hospital level instead of adopting hospital-specific evaluation.
Our study showed the feasibility of applying an NLP-based model to estimate the hospital cost based on DRG payment. DRG prediction using clinical text consistently performed better than clinical measurements alone, demonstrating the value of mining clinical text to support active care management. Though the deep learning-based NLP model could extract indicators to infer a reasonable DRG group for a patient stay, especially for diseases of circulatory system (MDC 05) and frequent DRGs, accurately predicting each individual DRG is still a challenging task. Error analysis showed the model correctly predicted many true negatives for each DRG given the class imbalance, thus increasing the average AUC performance, whereas the F1 scores reflect the errors on false positive and false negative predictions requiring future improvement. Meanwhile, at a population level, we showed that the model could achieve promising results when its predictions were translated to payment-reflecting weights and to reflect patient case mix. Population CMI, which averages over a set of patient DRG weights, is adopted by hospitals and payers as a cost indicator to review clinical complexity and efficiency. Our results showed the machine learning-based prediction could contain the CMI error under 8·0% for MS-DRG and 15·0% for APR-DRG on the first day of ICU admission. When calculating CMI, mistakes on individual cases offset each other to drive down the overall error, and we showed such performances were robust when the patient population size changed. Finally, performances in MS-DRG were in general better than APR-DRG, which could be due to APR-DRG having more fine-grained severity stratification and hence more prediction targets (1136 vs. 738).
Practically, early prediction of hospital CMI can provide valuable information on expected resource use, which can lead to better decision making in administration for improved care and managed cost25. Assuming a base payment rate of $6000 for the hospital, the CMI error rate of ~5·0% may lead to ~$472,500 difference in final reimbursement for a simulated cohort of 500 patients with a CMI of 3·15, but it allows the hospital to learn about its CMI at an early stage using only routine data and make decisions accordingly, implementing resource allocation and revenue capture. We observed better CMI performances for the MS-DRG cohort while the errors on APR-DRG were higher. Nevertheless, the model still achieved a mean below 15·0% and was more robust against changes in cohort size. The current examination of MS-DRG and APR-DRG showed the flexibility of the modeling approach on different DRG systems, which may be extended to other DRGs with similar structures of grouping and weight. Furthermore, it is possible to generalize the cost modeling strategy to scenarios other than DRG, where the model skips the DRG classification step and learns to predict patient cost in real numbers directly. We presented the approach in Supplementary Method 3 to apply the model to predict DRG payment weights in regression, where the model achieved mean absolute error (MAE) under 1·51 for both DRG cohorts (see Supplementary Table 1).
We should note that the current experimental results on CMI predictions show promise to provide management support at a hospital/population level rather than an individual/patient level. By forecasting the overall CMI of an in-hospital population, the model prediction may help administrators foresee peaks and troughs in resource usage and better arrange resources such as staffing and operating theaters. The current model performance in predicting individual DRGs is still inadequate to provide decision support on individual cases, which requires future research and additional consideration of ethical concerns.
Since the focus of this work is to examine the feasibility of DRG prediction and DRG-based cost estimation based on clinical notes, we did not examine the NLP methodologies comprehensively and regard such an investigation as future work. Meanwhile, we found CAML to be a simple yet effective baseline even compared against large pretrained models like BERT26 and ClinicalBERT27 (the domain adapted version of BERT), which provided improved AUC but lower F1 scores compared to CAML in both cohorts. Details of our experiments with these models are available in Supplementary Method 4 and results in Supplementary Table 4.
To better understand the challenges of text-based early DRG prediction and provide guidance for future methodological improvements, we selected five discrete hospital stays in the MS-DRG cohort to compare the model prediction with the true DRGs and focused on understanding the modeling errors. Here we leveraged the attention mechanism to identify the most informative tokens (in this case, 5 g) considered by the model in making its prediction to support the analysis (the role of attention to provide interpretability is still under active research28,29), shown in Table 3. We can observe that the model could extract the most indicative text when making the correct predictions in Case 1, including “coronary artery bypass graft” and “cabg”. Meanwhile, the model could fail at differentiating multiple diagnoses when faced with complex patient conditions. For example, in Case 2, the model correctly extracted text on pneumonia, thus classifying the case as DRG 193, but failed to recognize its diagnostic relation to more complicated conditions like sepsis. Finally, we should note that the attention mechanism here aims to support the analysis and its role to provide interpretability of the model is still under active research.
The presence of comorbidity or complication (CC) or major CC (MCC) also introduces challenges to the model in recognizing and prioritizing diagnoses. Case 3 shows an example where the model was confused between DRGs with MCC and with CC.
In addition, the machine learning model could also suffer from a lack of enough training data and make errors on rare DRGs, such as the model predicted the more frequent DRG 023 for Case 4, whose true DRG is in fact 955 (both DRGs are related to craniotomy). Case 5 is an extreme case where DRG 289 only appeared once in the whole cohort, and it was in the test set, creating a zero-shot learning scenario for the model as it had not seen any training sample of the DRG.
Several limitations exist in our study for the early prediction of DRG-based hospital cost. First, we only examined data from one medical center in the United States. Though MIMIC-III contains a large number of patients and the study design is at hospital level, the dataset is still limited in location and patient representativeness. Due to the data de-identification, we were not able to apply DRG weight mapping to the exact fiscal year, so we used the official DRG weights published for fiscal year 2013 given MIMIC-III collected data until 2012. Second, the focus on ICU patients also leaves room for further exploration, including input data and predictor selection. On the one hand, the acute conditions of ICU patients are more dynamic and less predictable, and strong performances on them may indicate the capacity of the approach to model other inpatients, which should be studied. On the other hand, intensive care for ICU patients usually generates more data than other inpatients, providing more signals for a learning model. Sufficient data is vital for proper modeling and hence further investigation is necessary to explore the possible trade-off between disease severity and data availability on new patient populations.
Thirdly, though achieving favorable results on population CMI, the modeling method of clinical notes remains to be improved to make more accurate DRG predictions at an individual level. Early DRG prediction involves several challenges, including extracting diagnostic evidence, identifying major or negligible comorbidities, handling rare or unseen test DRGs, and modeling dynamic patient trajectories. This will require further innovations with consideration of the task and the characteristics of clinical text. As shown in the case analyses, CC and MCC play significant roles in the DRG payment but are challenging to differentiate based on text. The different clinical functions reflected in the texts could be one reason for the difficulty; for example, a radiology report can describe a radiograph for diagnostic purposes or for examination purposes (i.e., if catheterization is performed appropriately). Future work could consider modeling different types of clinical texts with separate modules instead of modeling the concatenated notes. This could also alleviate the impact of input length on Transformer-based models like BERT, which we believe was the main constraint for BERT to outperform the simpler CAML. As we also found domain adaptation improved BERT performances (see ClinicalBERT scores in Supplementary Table 4), a domain-specific efficient BERT, like Longformer30, may achieve much better performance on the task. In addition, the ontological structure of DRG could provide information to handle rare DRGs in few-shot and zero-shot learning scenarios, which are shown to be applicable in automatic ICD coding31. Finally, the efficacy of the NLP approach remains to be investigated in other data sets, including those in languages other than English.










