Authors:
(1) Rasoul Samani, School of Electrical and Computer Engineering, Isfahan University of Technology and this author contributed equally to this work;
(2) Mohammad Dehghani, School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran and this author contributed equally to this work (dehghani.mohammad@ut.ac.ir);
(3) Fahime Shahrokh, School of Electrical and Computer Engineering, Isfahan University of Technology.
3. Methodology
An overview of the proposed method can be observed in Figure 1. Following the preparation and preprocessing of the required data, including the conversion of textual data into numerical vectors, machine learning and deep learning models were designed and implemented to predict readmission.
3.1 Data
For this research, the MIMIC-III dataset (Medical Information Mart for Intensive Care III) [29] was utilized. This freely available dataset encompasses data from over 50,000 patients (with their identifying information removed) admitted to a US hospital between 2001 and 2012. In this study, two key tables from this dataset were employed, namely the admission table and the clinical report table. The admission table contains information regarding patient admissions, while the clinical report table includes written records such as medical history and discharge instructions documented by attending physicians in the patient's file.
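The two tables can be loaded with pandas; the following is a minimal sketch, assuming the standard MIMIC-III v1.4 CSV exports (ADMISSIONS.csv and NOTEEVENTS.csv) and their documented column names:

```python
import pandas as pd

# Load the two MIMIC-III tables used in this study; adjust paths as needed.
admissions = pd.read_csv("ADMISSIONS.csv",
                         parse_dates=["ADMITTIME", "DISCHTIME", "DEATHTIME"])
notes = pd.read_csv("NOTEEVENTS.csv",
                    usecols=["SUBJECT_ID", "HADM_ID", "CATEGORY", "TEXT"])

print(admissions.shape, notes.shape)
```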
3.2 Data preprocessing
Data preprocessing is a crucial step in training a model, as it directly impacts performance. The quantity, quality, and diversity of the data influence the effectiveness of machine learning and deep learning algorithms. To improve the reliability of the algorithms, it is vital to carefully select and preprocess the target data from the original dataset [30]. Efficient data preprocessing techniques such as dimensionality reduction and data transformation play a critical role in extracting valuable insights and knowledge from datasets [31]. We conducted two preprocessing phases on the input data. Initially, general preprocessing techniques were applied to the dataset records and columns, with the objective of improving data quality and structure. Text preprocessing methods were then applied.
3.2.1. General preprocessing
As depicted in Figure 2, the admission table and the clinical report table were preprocessed separately, so that each underwent preprocessing steps tailored to its specific characteristics and requirements.
For the admission table, the following preprocessing steps were employed (a pandas sketch follows this list):
• Admissions recorded as newborn (birth) or ending in death were removed.
• Additional columns were added to the admission table to record the date and type of each patient's next hospital admission. These columns were populated by arranging each patient's admissions in chronological order and copying the information of the subsequent admission; for a patient's final admission these values are unknown and are what the proposed approach aims to predict. Subsequently, only rows in which the admission was elective were retained, excluding cases where the patient was admitted as an emergency.
• A new column was added to store the number of days until the next admission, which serves as the target variable for this approach. The objective is to predict the samples for which this value is less than 30 days, indicating readmission within a short time frame.
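A minimal pandas sketch of these admission-table steps, building on the loading snippet above and assuming the standard MIMIC-III column names (SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME, DEATHTIME, ADMISSION_TYPE); the derived column names are illustrative:

```python
# Sort each patient's admissions chronologically.
adm = admissions.sort_values(["SUBJECT_ID", "ADMITTIME"]).reset_index(drop=True)

# Remove newborn (birth) admissions and admissions that ended in death.
adm = adm[(adm["ADMISSION_TYPE"] != "NEWBORN") & (adm["DEATHTIME"].isnull())]

# For every admission, record the date and type of the patient's next admission.
adm["NEXT_ADMITTIME"] = adm.groupby("SUBJECT_ID")["ADMITTIME"].shift(-1)
adm["NEXT_ADMISSION_TYPE"] = adm.groupby("SUBJECT_ID")["ADMISSION_TYPE"].shift(-1)

# Retain only elective admissions, excluding emergency cases
# (per the filtering step described above).
adm = adm[adm["ADMISSION_TYPE"] == "ELECTIVE"]

# Target variable: days until the next admission, and the <30-day readmission label.
adm["DAYS_TO_NEXT_ADMIT"] = (adm["NEXT_ADMITTIME"] - adm["DISCHTIME"]).dt.days
adm["READMIT_30D"] = (adm["DAYS_TO_NEXT_ADMIT"] < 30).astype(int)
```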
The preprocessing steps applied to the clinical report table were as follows:
• Only notes of the “Discharge summary” category were retained.
• Null texts were removed.
To further refine the dataset and reduce missing values in the considered columns, the admission table and the discharge summary reports were merged on the patient's admission ID.
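These steps and the merge might look as follows; concatenating multiple summaries per admission into a single text is an assumption, since only the merge on the admission ID is stated above:

```python
# Keep only discharge summaries with non-empty text.
discharge = notes[(notes["CATEGORY"] == "Discharge summary") & notes["TEXT"].notnull()]

# Assumption: if an admission has several discharge-summary notes, join them.
discharge = (discharge.groupby("HADM_ID")["TEXT"]
             .apply(" ".join)
             .reset_index())

# Merge admissions with their discharge summaries on the admission ID (HADM_ID).
data = adm.merge(discharge, on="HADM_ID", how="left")
```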
3.2.2. Text preprocessing
Text preprocessing involves cleaning, transforming, and preparing textual data to make it suitable for analysis and modeling [32]. In Figure 3, the text preprocessing steps are illustrated.
Text preprocessing encompasses a range of crucial tasks (a brief sketch follows this list), including:
• Tokenization: Tokenization is the process of breaking down text into individual components such as characters, words, phrases, or other elements known as tokens [33].
• Lemmatization: Groups inflected forms of words together based on their lemma.
• Stop word removal: Stop words, such as “a”, “the”, “in”, “with”, which are often common terms in a language and lack meaningful information [34], are excluded.
• Special words removal: Commonly occurring words and phrases, including titles, unidentified brackets, and drug dosages, are eliminated from all notes during preprocessing.
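A minimal sketch of these text-preprocessing steps using spaCy; the special-word patterns (de-identification brackets, drug dosages, titles) are illustrative approximations, as the exact rules are not listed here:

```python
import re
import spacy

# Small English pipeline; tokenization, lemmatization and the stop-word list come with it.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess_note(text: str) -> str:
    # Special-word removal (illustrative patterns only).
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)                            # de-identification brackets
    text = re.sub(r"\b\d+\s?(mg|ml|mcg|units?)\b", " ", text, flags=re.I)   # drug dosages
    text = re.sub(r"\b(dr|mr|mrs|ms)\.", " ", text, flags=re.I)             # titles

    doc = nlp(text.lower())
    # Tokenization -> lemmatization -> stop-word / punctuation removal.
    tokens = [tok.lemma_ for tok in doc if not tok.is_stop and tok.is_alpha]
    return " ".join(tokens)

data["CLEAN_TEXT"] = data["TEXT"].fillna("").apply(preprocess_note)
```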
In Figure 4 (a), the word cloud of the original data reveals a predominant focus on drug-related terms, dosage information, and titles, which are frequently repeated in the text but may not contribute significantly to predicting readmission. However, certain specific words, such as laboratory results, hold more relevance for conducting analysis. Therefore, we conducted preprocessing steps to filter out less relevant terms and retain more valuable ones. The results of this preprocessing are depicted in Figure 4 (b), where drug-related terms have been removed, and more informative words that are important for both doctors and our predictive model have been retained. This selective filtering enhances the quality of the data input for our analysis, enabling more accurate predictions of readmission classification.
Since models cannot process free text directly, conversion to numerical form is essential for analysis and modeling; raw text data must be transformed into numerical formats before being used for model training [30]. Accordingly, two methods were employed for text vectorization in the subsequent preprocessing phase: TF-IDF was used for the machine learning models, while the Discharge Summary BERT (BDSS) model was applied for the deep learning model.
TF-IDF
TF-IDF (term frequency-inverse document frequency) is a widely used method for weighting terms in text data. It assigns numerical scores to words based on their frequency within a document and their infrequency across the entire corpus, indicating their importance both in the context of the document and in the corpus as a whole [35]. In this study, we set the max-df parameter to 0.8 and the min-df parameter to 5. Max-df is the maximum document frequency threshold, used to remove terms that appear too frequently. Conversely, min-df is the minimum document frequency required for a term to be considered; terms appearing in fewer documents than min-df are excluded from the analysis.
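A minimal scikit-learn sketch with the stated thresholds (the cleaned-text column name is carried over from the preprocessing sketch above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore terms appearing in more than 80% of documents (max_df=0.8)
# or in fewer than 5 documents (min_df=5).
vectorizer = TfidfVectorizer(max_df=0.8, min_df=5)
X_tfidf = vectorizer.fit_transform(data["CLEAN_TEXT"])
print(X_tfidf.shape)  # (n_documents, vocabulary_size)
```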
The resulting vocabulary size of 35097 is very high, posing challenges in terms of the computational resources and processing time required for thorough analysis. To address this issue, dimensionality reduction is indispensable. One such technique is PCA, which reduces the dimensionality of the data while preserving its essential information. In our approach, we applied PCA and retained 50 principal components.
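The reduction to 50 components might look as follows; note that scikit-learn's PCA expects a dense matrix, so the sparse TF-IDF output is densified first (TruncatedSVD is a common sparse-friendly alternative):

```python
from sklearn.decomposition import PCA

# Reduce the TF-IDF representation to 50 principal components.
pca = PCA(n_components=50, random_state=42)
X_reduced = pca.fit_transform(X_tfidf.toarray())
print(X_reduced.shape)  # (n_documents, 50)
```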
Discharge Summary BERT
Trained on clinical text derived from roughly 2 million notes within the MIMIC-III v1.4 database, BDSS focuses solely on discharge summaries, ensuring that its training corpus aligns with the downstream task [36]. We utilized the BDSS model [37], which was trained exclusively on discharge summaries extracted from the MIMIC database, making it well suited to clinical text analysis tasks.
After the preprocessing steps, the maximum text length in the dataset is 7479 tokens, the mean length is 1317 tokens, and the third quartile is 1735 tokens. Selecting an appropriate input size for vectorization is crucial: too large an input increases computational complexity, while too small an input loses valuable information. Given that BDSS's input size is 512 tokens, we chose an input size of 2048, four times BDSS's input size, which accommodates texts up to the third quartile of the length distribution. To ensure a consistent input size, shorter texts are padded and longer texts are truncated, so that all inputs share the same size (2048) while important information is preserved.
The 2048-token input is divided into four segments of 512 tokens each, which serve as inputs to the BDSS model for vectorization. For each segment, BDSS produces an output of dimensions 512 × 768. To obtain a concise representation, we average over the token dimension, yielding one 768-dimensional vector per segment. The four vectors are concatenated into a single 3072-dimensional representation, which is then reduced to 50 features with PCA. Finally, these reduced features are fed into an MLP model. This approach captures the essential information while minimizing computational complexity.
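The segmentation-and-pooling step might be sketched as below with the Hugging Face transformers library; the checkpoint name is an assumption (the BDSS model [36, 37] is cited without naming a checkpoint), and the resulting 3072-dimensional vectors would then be reduced to 50 features with PCA, as in the TF-IDF branch, before being passed to the MLP:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint for the discharge-summary BERT (BDSS);
# swap in the exact model used by the authors if it differs.
MODEL_NAME = "emilyalsentzer/Bio_Discharge_Summary_BERT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME).eval()

MAX_LEN, SEG_LEN = 2048, 512  # total input size and per-segment size

@torch.no_grad()
def embed_note(text: str) -> torch.Tensor:
    # Pad/truncate to 2048 tokens, then split into four 512-token segments.
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=MAX_LEN, return_tensors="pt")
    vectors = []
    for start in range(0, MAX_LEN, SEG_LEN):
        ids = enc["input_ids"][:, start:start + SEG_LEN]
        mask = enc["attention_mask"][:, start:start + SEG_LEN]
        out = bert(input_ids=ids, attention_mask=mask).last_hidden_state  # (1, 512, 768)
        vectors.append(out.mean(dim=1).squeeze(0))  # mean-pool to a 768-d vector
    # Concatenate the four 768-d vectors into one 3072-d representation per note.
    return torch.cat(vectors)
```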
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.