Final Paper
Data Warehousing: Different Usages of Data Warehousing
Abstract
The growing number of medical devices, and the volume of data they produce, creates problems of storage, access, and data management. The databases that hold this information are often poorly managed, and their efficiency suffers as a result. Handling the volume, heterogeneity, and categorization of such data is important for healthcare facilities. For healthcare systems to achieve real synergy among information, doctors, and patients, more effective techniques are needed, and data warehousing is well placed to meet that demand. The importance of data warehousing needs to be highlighted so that decision making in these facilities becomes more efficient and research across multiple healthcare facilities benefits as well. This paper argues that data warehousing should be given a pivotal role in healthcare systems so that facilities get the most out of the data they already hold.
Keywords—Data, Data Warehouse, Medical, Data Storage, Healthcare
I. Introduction
Public healthcare data management is steadily gaining attention among the professionals who work with it, but not fast enough. These data systems need defined steps, definitions, and regulations to be effective, and such improvements are essential given the information and indicators that depend on them. Big data is characterized by five V's: volume, velocity, variety, veracity, and value. Volume is the amount of information that has to be handled; variety is the range of data formats coming from different sources; velocity is the rate at which new data arrives; veracity is the trustworthiness of the data; and value is how important the data is to the work at hand. The data involved in data warehousing comes in three formats: structured, semi-structured, and unstructured. Structured data conforms to a defined form, format, or data type. It follows specific rules and is easy to distinguish and understand. Semi-structured data is organized with minimal structure, mixing the format of structured data with fields whose type or presence can vary. Unstructured data has no predefined form and is hard for machines to read; it has to be given form before its value can be discerned. With so many variables at play, and considering the many kinds of data that healthcare systems naturally produce, developing insights for the different departments of healthcare is clearly needed. The task is formidable, and the challenges are many but not insurmountable. To get the most out of the present data, you need to work through the first four V's of big data to reach the fifth, value, and present it to the users.
Doing so calls for better and faster processing platforms, more efficient methodologies and technologies for data collection and storage, and visualization techniques that help extract useful knowledge and support better decisions in the healthcare sector.
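The three data formats described in the introduction can be illustrated with a minimal sketch. The records below are hypothetical and invented only for illustration; the function names are likewise assumptions, not part of any system discussed in this paper.

```python
import csv
import io
import json

# Hypothetical records, invented purely for illustration.
structured_csv = "patient_id,age,systolic_bp\n101,67,142\n102,54,118\n"
semi_structured_json = (
    '{"patient_id": 101, "device": "wearable", '
    '"readings": [{"hr": 72}, {"hr": 75, "spo2": 97}]}'
)
unstructured_note = (
    "Patient reports chest pain since Monday; possible angina, not confirmed."
)


def load_structured(text):
    """Structured data: every row follows the same fixed columns."""
    return list(csv.DictReader(io.StringIO(text)))


def load_semi_structured(text):
    """Semi-structured data: tagged fields, but records may differ in shape."""
    return json.loads(text)


def describe_unstructured(text):
    """Unstructured data: no schema, so only generic facts are known up front."""
    return {"characters": len(text), "words": len(text.split())}


rows = load_structured(structured_csv)
doc = load_semi_structured(semi_structured_json)
note_info = describe_unstructured(unstructured_note)
```

The structured rows can be queried by column name straight away, the semi-structured document has to be checked field by field because the second reading carries a value the first one lacks, and the free-text note yields nothing clinical until further processing gives it form.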
Big data holds untapped potential that needs to be developed so that it can support medical decisions, the inspection of diseases, and health management. Various branches of health records, such as Electronic Health Records, mHealth, eHealth, and many others, require synchronization and synergy so that decision making for patients can be as efficient as possible. Architectural frameworks support all of these branches, and the data is divided and managed according to the layers of the framework that houses it. The use of such frameworks is a key driver of the growing interest in sources of big data.
The different forms of data, its huge volume, and its unpredictability make data cleaning all the more important, because cleaned data is what yields information that can actually be used. The following literature review covers a number of frameworks and uses related to data warehousing and its importance in healthcare systems. The advantages discussed span multiple aspects of the healthcare system, and the areas that need improvement are also noted.
II. Main Text
(Garcelon, 2018) Electronic Health Records are widely used to collect data, but improved Clinical Data Warehouses (CDWs) are more efficient in this regard: they are designed to let users collect information, use it for research, and apply tools that support many branches of medicine, such as phenome mining and record mining. Despite these advancements, some clinical information is still not stored in CDWs, and most of it is free-text clinical narrative. Free-text clinical narrative is essential and makes up 60-8% of medical reports. The structured nature of CDWs is not meant to handle this type of data, in which doctors record many of their uncertainties and diagnostic hypotheses. The prevalence of rare diseases also plays a part, since for such cases free text is the ideal way for the doctor to document diagnosis and treatment.
Including this data is pivotal if data mining is to be successful, because large portions of data are still being left out of the databases. IT teams need to make the reuse of this data a primary part of any framework for research, teaching, and management.
To tackle such problems, Dr. Warehouse was launched at Necker Enfants Malades Hospital, a French hospital that is a national center for rare and undiagnosed diseases and also houses a research institute for rare diseases. Dr. Warehouse is an open-source, document-oriented data warehouse that takes unstructured data and extracts meaning from it. It works without any coded data. Its biggest feature is that it can be deployed and used for basic functionality regardless of language, so it can record all types of reports in any language. When NLP algorithms are used, Dr. Warehouse offers a faster search engine and better recognition of many diseases.
While most CDWs rely on coded data, Dr. Warehouse provides a dedicated interface for searching and reading the various data it holds. Its translational tools support a patient-centric view of the records. Around 96.6% of users responded positively, reporting faster research.
(Ayyoubzadeh, 2020) Length of stay (LOS) is regarded as an indicator of many things, such as hospital management, patient welfare, and nursing quality. Discharging patients as quickly as possible reflects a hospital's efficiency, yet there is no single notable factor that determines it. Prolonged length of stay causes displeasure among patients, uses more resources per patient, and reduces the number of patients from the general population that a hospital is able to treat. Quick discharge is therefore a paramount factor in rating hospitals. Hospitals need to focus on this aspect so that improvements can be made that maximize the quality of the stay while reducing its duration.
Data mining is essential for identifying the factors that contribute to patients' length of stay. In this study, around 526 patient records were taken from Shahid-Mohammadi Hospital, along with all the recorded factors. The factors related to length of stay were determined using information gain and correlation indices. Nine data mining classifiers were then used to build classification models for length of stay, and the models were compared.
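Information gain, mentioned above, measures how much knowing a factor reduces uncertainty about the outcome. The following is a minimal sketch under stated assumptions: the ward, gender, and length-of-stay values are hypothetical, and the code is a textbook entropy calculation, not the procedure used in the cited study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on one categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four admissions, each labeled "long" or "short" stay.
los = ["long", "long", "short", "short"]
ward = ["ICU", "ICU", "General", "General"]  # separates the labels perfectly
gender = ["F", "M", "F", "M"]                # tells us nothing about the labels

gain_ward = information_gain(ward, los)
gain_gender = information_gain(gender, los)
```

In this toy example the ward has the maximum possible gain of one bit while gender has none, which is the kind of ranking that lets a study decide which factors to feed into its classifiers.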
In terms of accuracy and sensitivity, the Logistic Regression and Naïve Bayes models performed best. These models outperformed the other seven models used to identify the prevailing factors in patients' length of stay.
These models showed that patients' length of stay was affected by the specialty and degree of the doctor, the frequency of counseling, the ward to which the patient was admitted, and the cause of hospitalization. This shows that data mining is a well-suited way to research the causes of prolonged length of stay. With this data, the factors that contribute to LOS can be identified and the performance of these indicators can be improved in the future.
(van Solinge, 2020) Smoking is one of the biggest contributors to lung cancer and many related diseases. Many factors contribute to smoking and determine its after-effects. Identifying smokers is important because they are at higher risk of lung and cardiovascular diseases. This study focuses on the accuracy and quality of Electronic Health Record systems and on how, with the help of data warehouses, the smoking status of a given population can be determined. The study uses both structured and unstructured data, mainly a questionnaire with fixed options and a free-text fragment in which patients explain their condition.
The center used for this study took patients from all departments, so the results reflect a general population group. The study included around 1,661 patients, 14% of whom had missing data. With the help of the EHR and data mining of the databases, the diagnostic accuracy for smoking status was determined. Sensitivity averaged 88% and specificity 92%, while the negative and positive predictive values were 98% and 63%, respectively. The study also found that 55% of the patients had suffered a cardiovascular event, 42% suffered from hypertension, and 19% suffered from diabetes. This gives a useful picture of how these diseases correlate, which helps in examining the effects of smoking on individuals.
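The four accuracy measures reported above are all derived from the same table of true and false positives and negatives. The sketch below shows the standard definitions; the counts in the example are hypothetical and are not the counts from the cited study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from confusion-matrix counts.

    tp: smokers correctly flagged        fp: non-smokers wrongly flagged
    fn: smokers missed                   tn: non-smokers correctly cleared
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true smokers that were found
        "specificity": tn / (tn + fp),  # share of non-smokers correctly cleared
        "ppv": tp / (tp + fp),          # chance a flagged patient really smokes
        "npv": tn / (tn + fn),          # chance a cleared patient really does not
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = diagnostic_metrics(tp=40, fp=10, fn=10, tn=90)
```

A high negative predictive value alongside a lower positive predictive value, as in the study, means that a "non-smoker" result can be trusted more than a "smoker" result.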
The implementation of data warehousing and data mining yielded information on smoking successfully. The study was particularly useful for identifying negative smoking status. Extracting information from both structured and unstructured EHR data was pivotal for this experiment. The short time it took to obtain this result greatly reduces the research effort needed in this field. With data mining, smoking trends can also be determined: studies repeated yearly can show how prevalent smoking is in the general population, and correlations can be drawn with mortality rate, disease occurrence, and mental health problems.
(Syed, 2019) The Internet of Medical Things (IoMT) has entered everyday life. With devices that take measurements to support better healthcare services, recording bodily readings has become easier and smarter, and IoMT has the potential to revolutionize healthcare. IoMT consists of wearable sensors, caregivers, and communication technology. Collecting different health variables and making smart decisions from them is what prompted this study. To that end, Ambient Assisted Living (AAL) integrates multiple technologies for better health-related decisions.
AAL uses wearable sensors on the body to collect data on bodily functions, links it to cloud data, and integrates it into a data analytics layer. Multiple software tools and techniques are used to handle the large volumes of data, which helps determine how healthy the users are. The study focuses on the physical activity of elderly people, so the readings also reflect how active their lifestyle is.
The older an individual is, the more susceptible they are to disease and injury, which increases the challenges for healthcare systems. Older people are more difficult to care for because of their fragile state, and this puts a lot of pressure on caretakers and hospitals. They need a great deal of care to have a chance of recovering from sickness or injury. Health monitoring systems are paramount in caring for the elderly, and that is where AAL comes in.
AAL helps the elderly by thoroughly monitoring their bodily readings and managing their active lifestyle. Real-time monitoring of such patients can help rescue them in an emergency such as a heart attack, stroke, or accident. It also makes the health monitoring system more efficient and helps record data on individuals, which makes it possible to track the activity levels of a population.
The AAL framework promised 97% accuracy in readings of 12 physical activities, which helps keep the elderly fit and healthy. Constant monitoring of such individuals also spares caretakers and hospitals unnecessary visits, improving quality of life for everyone.
(Deist, 2020) Many healthcare institutes do not share their health data because of privacy and regulatory concerns. Retrieving such data brings its own challenges because of the volume of records that hospitals store. Research on different diseases and treatments is only possible if access to these databases is granted. Given the nature of the data, few hospitals are willing to release it unless the request comes from a credible source. This hampers research, because legal agreements and months of legal permissions are needed before work can even begin. The value of the data also matters: if a healthcare institute does not hold adequate information on which research can be conducted, its data is of little use. Hence open access to medical records is a must for medical research. With the aid of data mining and data warehousing, research becomes much quicker and more efficient, so open records that support such methodologies need to be implemented.
This study focuses on freely accessible databases. "The Personal Health Train" (PHT) helps by locating valuable and easily accessible data on which users can run their algorithms and tests. The study primarily used lung cancer data from oncology departments and was able to run an algorithm on it. The datasets came from multiple sources in multiple countries.
The algorithm proved successful, drawing on 23,203 patients from 8 healthcare institutes. A regression model of post-treatment two-year survival was built from data covering two different time intervals. The PHT addresses patient privacy and helps in sharing healthcare data. Because the data comes from multiple countries, results generalize over a wide population sample, which helps predict disease outcomes and reveal trends in treatments and research conducted throughout the world.
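The general idea of learning from several institutions without moving patient records can be sketched as follows. This is only an illustration of sharing aggregates instead of rows; it is not the Personal Health Train implementation, and the site data, field name `survived_two_years`, and function names are all hypothetical.

```python
def local_summary(records):
    """Runs inside one institution and returns only counts, never patient rows."""
    survived = sum(1 for r in records if r["survived_two_years"])
    return {"n": len(records), "survived": survived}


def pooled_survival_rate(summaries):
    """Runs at the coordinator, which sees only the per-site counts."""
    total = sum(s["n"] for s in summaries)
    survived = sum(s["survived"] for s in summaries)
    return survived / total


# Hypothetical patient-level data that stays at each site.
site_a = [{"survived_two_years": True}, {"survived_two_years": False}]
site_b = [
    {"survived_two_years": True},
    {"survived_two_years": True},
    {"survived_two_years": True},
    {"survived_two_years": False},
]

summaries = [local_summary(site_a), local_summary(site_b)]
rate = pooled_survival_rate(summaries)
```

Only the two small dictionaries of counts leave the sites, yet the coordinator obtains the same pooled rate it would have computed from the combined records. A real distributed regression exchanges model updates rather than simple counts, but the privacy argument is the same.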
(Hartmann, 2019) As discussed above, the availability of medical journals is essential. In 2013, the Dahlgren Memorial Library at Georgetown University Medical Center started using text mining software so that clinical informationists could quickly retrieve relevant information from MEDLINE while on patient rounds.
This proved useful because, with a traditional search engine, the clinician would have to read through numerous journal abstracts, which wastes a lot of time. Text mining allows clinical informationists to quickly search and access the medical journals they need to answer physicians' clinical questions.
(David, 2020) Children with rare bone diseases (RBD) are more likely to face emergencies because of the nature of the disease. Many studies point to the vulnerability of children with complex diseases who visit pediatric hospitals.
However, the burden on children with this disease is unknown. It was even seen that children were readmitted to the hospital after 30 days, and that the readmission rate for children with this disease rose from 6.26% in 2010 to 7.02% in 2016. Knowing the trajectory of such children is important: the more familiar doctors are with the condition and its predictability, the more easily they will be able to combat it. The study also showed that 20% of the index visits did not record the child's history of RBD. This points to families' lack of information about the child's condition and also raises questions about how physicians inquire into the child's previous medical record.
A single cohort study was conducted of children under 18 years who visited the pediatric emergency department (PED) in 2017. The 141 visits resulted in 84 healthcare visits, of which 60 were planned and 24 were unplanned. The index PED visits were those associated with children with rare bone disease. The incidence rate ratio for second healthcare visits was high, showing that children with RBD are more likely to enter the emergency room.
Although this study was done in a single research center, it still shows the need for data warehousing in predicting outcomes for rare diseases. With the data stored in the research center, it was easy to check which children with rare bone disease had a higher rate of readmission. Hospitals increasingly need to record this type of data so that studies like this can continue to be conducted on special and rare diseases.
(Wang, 2019) Another aspect of healthcare is the insurance that covers it. Although insurance sits more on the business side, optimizing health insurance agencies helps people on the healthcare side too. Many health organizations operate as non-profit organizations, and government agencies of this kind hold a lot of client data, so their systems need to improve. These systems provide the best service when the management, analysis, and application of data are top-notch, which can only be achieved if data warehousing is implemented. The subject of this study, the National Health Insurance Administration, is a health insurance agency and non-profit organization responsible for managing health insurance for the people of Taiwan. Presenting, interpreting, and visualizing its data helps officials make better decisions and reflect the needs of the public.
In the study, after the data warehouse was implemented, two types of data analysis were used: categorization and association. The results showed that the questions asked by the public could be predicted. With this knowledge, a scenario-based Q&A environment was set up on the website, which helped the public with their health insurance problems and also helped the back-end staff who receive these inquiries.
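Association analysis of the kind mentioned above usually rests on two measures, support and confidence. The sketch below computes both for a hypothetical log of public inquiries; the topic names and sessions are invented for illustration and do not come from the cited study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the share that also
    contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical inquiry sessions: the topics one member of the public asked about.
sessions = [
    {"premium", "payment"},
    {"premium", "payment", "refund"},
    {"premium"},
    {"card_renewal"},
]

s = support(sessions, {"premium", "payment"})
c = confidence(sessions, {"premium"}, {"payment"})
```

Here two of the three sessions that ask about premiums also ask about payment, so a Q&A page could reasonably surface payment answers next to premium answers.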
(Liang, 2019) The focus on mental health has changed drastically in the last two decades, yet research on mental health is still in its infancy and many questions remain unanswered. Mental illness covers a wide range of mental health disorders that affect a person's general wellbeing. Millions of people suffer from mental disorders, and the data related to these disorders is heterogeneous. Treating these disorders is also extremely expensive. Smartphones and sensors that continuously collect data and send it to the cloud for processing can detect the user's mood, and from these variables we can deduce whether the user shows symptoms of a mental disorder and, if so, which one.
With the help of big data, data warehouses, and artificial intelligence, this direction for data warehousing can be pursued. The study draws on big data from sensors, social media, and healthcare systems for phenotyping mental health.
The study finds that many factors affect mental health, but none of the approaches is yet applied clinically. Big data can help, although phenotyping still faces problems for multiple reasons; the desire for improvement is there. Emphasis is placed on data modeling so that heterogeneous data can be addressed together.
III. Background
The healthcare system has had Electronic Health Records for a long time, but they have been ineffective in several ways. There was a constant lack of synergy between the data and the doctors. The records that were accessed were often incomplete or unreadable. Electronic Health Records could not capture free-text data, which makes up the majority of patient diagnoses. This matters because free text is where the need for treatment of rare diseases is often indicated. Electronic Health Records also had limited capacity: they could not handle large amounts of data or support useful research on it.
This is where data warehousing comes in. It is changing many fields because it handles large volumes of data efficiently. It can handle both structured and unstructured data, which makes it possible to include all kinds of data rather than only specific kinds. It is widely accessible to data scientists, and the value of the data is preserved. Data warehousing is well suited to research, since methods and analyses can be applied to get the desired result, and trends can be spotted with the help of data warehouses and their tools. Many powerful data warehouse frameworks are used all over the world, especially in institutes that study and conduct research on rare diseases (Garcelon, 2018). Data warehousing also enables easy data mining and supports the Internet of Things. Handling large amounts of incoming data and processing it from multiple devices can only be achieved with data warehousing tools. These tools categorize and organize the data with little difficulty, and they handle its heterogeneous nature as well. Data warehousing also makes data easy to retrieve: instead of waiting for an old electronic health record search engine to bring up matching results, a data warehouse finds what you are looking for.
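How a warehouse organizes data for analysis can be sketched with a small star schema: one fact table of admissions surrounded by dimension tables. The example uses Python's built-in `sqlite3` module with an in-memory database; the table names, columns, and rows are hypothetical and stand in for what a real clinical warehouse would hold.

```python
import sqlite3


def build_warehouse():
    """Create a tiny in-memory star schema with hypothetical admission data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_patient (patient_id INTEGER PRIMARY KEY, age_group TEXT);
        CREATE TABLE dim_ward (ward_id INTEGER PRIMARY KEY, ward_name TEXT);
        CREATE TABLE fact_admission (
            admission_id INTEGER PRIMARY KEY,
            patient_id INTEGER REFERENCES dim_patient(patient_id),
            ward_id INTEGER REFERENCES dim_ward(ward_id),
            length_of_stay_days INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_patient VALUES (?, ?)",
        [(1, "65+"), (2, "18-64"), (3, "65+")],
    )
    conn.executemany(
        "INSERT INTO dim_ward VALUES (?, ?)",
        [(10, "Cardiology"), (20, "Orthopedics")],
    )
    conn.executemany(
        "INSERT INTO fact_admission VALUES (?, ?, ?, ?)",
        [(100, 1, 10, 6), (101, 2, 10, 2), (102, 3, 20, 8), (103, 1, 20, 4)],
    )
    return conn


def average_stay_by_ward(conn):
    """Join the fact table to the ward dimension and average length of stay."""
    query = """
        SELECT w.ward_name, AVG(f.length_of_stay_days)
        FROM fact_admission AS f
        JOIN dim_ward AS w ON w.ward_id = f.ward_id
        GROUP BY w.ward_name
        ORDER BY w.ward_name
    """
    return conn.execute(query).fetchall()


conn = build_warehouse()
result = average_stay_by_ward(conn)
```

Keeping measurements in the fact table and descriptive attributes in the dimensions is what lets one short query answer a trend question such as average stay per ward, instead of searching individual records.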
IV. Conclusion
Data warehousing is the next step for both health records and research. With so many benefits, it can start the next revolution in medicine. It carries medicine through its next transition with the help of technology by bringing larger, faster, better synchronized, and more organized data. This gives health centers multiple benefits that doctors and clinicians can use to their advantage. Organized data makes it easier to access previous records of matching cases, which helps identify common diseases more easily and even opens the way to identifying rare diseases. In some uses of data warehousing the patient does not even have to be in touch with the doctor; instead, devices monitor the patient's health in real time and decisions are made by surveilling the data. Hospitals all over the world will run more smoothly once data warehouses are widely implemented: patient-doctor relations will improve, research institutes will gain accuracy and efficiency with more heterogeneous data behind their results, and overall data handling will become easier.
REFERENCES
[1] Garcelon, N., Neuraz, A., Salomon, R., Faour, H., Benoit, V., Delapalme, A., Munnich, A., Burgun, A., & Rance, B. (2018). A clinician friendly data warehouse oriented toward narrative reports: Dr. Warehouse. Journal of Biomedical Informatics, 80, 52–63.
[2] Ayyoubzadeh, S. M., Ghazisaeedi, M., Rostam Niakan Kalhori, S., Hassaniazad, M., Baniasadi, T., Maghooli, K., & Kahnouji, K. (2020). A study of factors related to patients' Length of stay using data mining techniques in a general hospital in southern Iran. Health Information Science and Systems, 1.
[3] van Solinge, H. H. de G. B. W. I. S. M. E., Asselbergs, N. de B. B. G. E. de J. L. L. van der K. K. R. V. V. W. F. W. H. M. G.-J. M. L. M. I. M. H. P. A. T. A. T. N. P. L. J. Y. M. M. C. F. L. J. J., Groenhof, T. K. J., Koers, L. R., Blasse, E., de Groot, M., Grobbee, D. E., Bots, M. L., Asselbergs, F. W., Lely, A. T., & Haitjema, S. (2020). Data mining information from electronic health records produced high yield and accuracy for current smoking status. Journal of Clinical Epidemiology, 118, 100–106.
[4] Syed, L., Jabeen, S., S., M., & Alsaeedi, A. (2019). Smart healthcare framework for ambient assisted living using IoMT and big data analytics techniques. Future Generation Computer Systems, 101, 136–151.
[5] Deist, T. M., Dankers, F. J. W. M., Ojha, P., Scott Marshall, M., Janssen, T., Faivre-Finn, C., Masciocchi, C., Valentini, V., Wang, J., Chen, J., Zhang, Z., Spezi, E., Button, M., Jan Nuyttens, J., Vernhout, R., van Soest, J., Jochems, A., Monshouwer, R., Bussink, J., … Dekker, A. (2020). Distributed learning on 20 000+ lung cancer patients – The Personal Health Train. Radiotherapy and Oncology, 144, 189–200.
[6] David Dawei Yang, Geneviève Baujat, Antoine Neuraz, Nicolas Garcelon, Claude Messiaen, Arnaud Sandrin, Gérard Cheron, Anita Burgun, Zagorka Pejin, Valérie Cormier-Daire, & François Angoulvant. (2020). Healthcare trajectory of children with rare bone disease attending pediatric emergency departments. Orphanet Journal of Rare Diseases, 15(1), 1–9.
[7] Hartmann, J., & Van Keuren, L. (2019). Text mining for clinical support. Journal of the Medical Library Association, 107(4), 603–605.
[8] Wang, C.-S., Lin, S.-L., Chou, T.-H., & Li, B.-Y. (2019). An integrated data analytics process to optimize data governance of non-profit organization. Computers in Human Behavior, 101, 495–505.
[9] Liang, Y., Zheng, X., & Zeng, D. D. (2019). A survey on big data-driven digital phenotyping of mental health. Information Fusion, 52, 290–307.