Annotated Bibliography for the Articles Attached Below
EXPLORING INPUT ENHANCEMENTS BIG DATA ANALYSTS NEED TO IMPROVE A CREDIT QUALIFICATION MODEL TO SUPPORT LARGE BANKS IN THEIR
RISK MANAGEMENT OPERATIONS
A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree of
Doctor of Computer Science
By
Tuan Duc Nguyen
Colorado Technical University
March 2020
ProQuest Number: 27830367
All rights reserved
INFORMATION TO ALL USERS
The quality of this reproduction is dependent on the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.
Published by ProQuest LLC (2020). Copyright of the Dissertation is held by the Author.
All Rights Reserved. This work is protected against unauthorized copying under Title 17, United States Code.
Microform Edition © ProQuest LLC.
ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
Committee
Alexa Schmitt, PhD, Chair
James O. Webb, PhD, Committee Member
Cynthia Calongne, PhD, Committee Member
Date Approved: March 24th, 2020
© Tuan Duc Nguyen, 2020
Abstract
This study explored the use of an artificial neural network (ANN) called iQual to improve a
credit qualification model to support large banks in their risk management operations. The
research leveraged the Design Science framework to design and evaluate a web-based
Information Technology artifact, named iQual, to predict the default probability for a list of
credit borrowers. A focus group of five participants, comprising senior data technical experts and
financial institution directors in the Washington, DC metro area, was selected to watch a live
demonstration of the iQual tool in action and to provide expert feedback on the artifact. The
research followed the framework for proof of concept, artifact construction, and artifact
enhancement of the ANN machine learning prototype of the iQual credit qualification application
delivered via the Web. The research method included semi-structured interviews, each consisting
of seven open-ended questions, answered by five expert reviewers with technical expertise and
trade experience in the financial industry. The expert reviewers’ feedback, recorded through
transcription, was compiled and organized into themes of enhanced features. The reviewers
recognized the following enhanced features of the iQual dashboard tool: a) data load module, b)
applicant summary view module, c) setting of credit product qualification standards, d)
prediction execution, and e) assessment of prediction accuracy. The analysis of the interview
transcripts also indicated that additional elements need to be addressed or improved for
real-life application in the banking industry: a) quality control, b) better logs, c) different
loan options, d) interest rate calculation, and e) management of users.
Keywords: artificial neural network; machine learning; credit qualification tool; iQual;
input enhancements; big data analysts
Dedication
I would like to dedicate the success of this study to my supportive wife and our cheerful
daughter. They have always stood beside me throughout this journey, lending both moral and
physical support in my most difficult times. Their affection is the single greatest boost in my life.
Acknowledgements
I would like to express my deepest gratitude to my Research Supervisor, Dr. Alexa
Schmitt. Your dedication, expertise, and persistence have carried me through this
challenging journey.
Table of Contents
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter One
Topic Overview/Background
Problem Statement
Purpose Statement
Research Question
Propositions
Conceptual Framework
Assumptions/Biases
Significance of the Study
Delimitations
Limitations
Definition of Terms
General Overview of the Research Design
Summary of Chapter One
Organization of Dissertation
Chapter Two
Big Data
Risk Management Using Big Data Methods
Conceptual Framework
Figure 1. Automated lending decision system conceptual framework.
Summary of Literature Review
Chapter Three
Research Tradition
Research Question
Research Design
Population and Sample
Sampling Procedure
Instrumentation
Validity
Reliability
Data Collection
Data Analysis
Ethical Considerations
Summary of Chapter Three
Chapter Four
The Original Artificial Neural Network Artifact
Artifact Modification
iQual’s Overall Analytics Goals
iQual’s Architectural Design
Summary of Chapter Four
Chapter Five
Data and Participant Demographics
Chapter Six
Findings and Conclusions
Limitations of the Study
Implications for Practice
Implications of Study and Recommendations for Future Research
Future Study 1
Future Study 2
Conclusion
References
Appendix A
Appendix B
Approved Researcher’s Permission
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
List of Tables
Table 1 Data Attributes Summary
Table 2 Participant Demographics
Table 3 Expert Reviewer Responses to Question 1
Table 4 Expert Reviewer Responses to Question 2
Table 5 Expert Reviewer Responses to Question 3
Table 6 Expert Reviewer Responses to Question 4
Table 7 Expert Reviewer Responses to Question 5
Table 8 Expert Reviewer Responses to Question 6
Table 9 Expert Reviewer Responses to Question 7
List of Figures
Figure 1. Automated lending decision system conceptual framework
CHAPTER ONE
Big data analytics has become increasingly critical in the financial services industry
(Barr, Koziara, Flood, Hero, & Jagadish, 2018), especially for risk managers employed in
financial institutions. The modern era of data-centric information technology has given rise to
an enormous number of large and complex financial data sets, which may yield valuable insights
for risk management. As a result, big data analytics offers opportunities to identify
both inherent and implicit interconnections, relationships, and patterns among the risk
determinant factors associated with diverse credit consumer profiles (Barr et al., 2018).
However, due to the sensitive nature of banks’ data, big data analytics has not become a
mainstream technology platform for the banking industry as a whole (Sinclair, 2017).
The goal of this study was to evaluate the capabilities of existing big data tools for
optimizing risk factors, from the standpoint of the large financial institutions that were still
hesitant to implement the technology. Meanwhile, smaller organizations and enterprises had
experienced much improved credit risk management operations, according to the Economist
Intelligence Unit (“Global Retail Banking Report”, 2018) survey, which showed promising results
in credit card fraud prevention (reported by 31% of respondents) and in the prevention of
defaults by accurately predicting credit repayment risk (reported by 26% of respondents).
Toward this end, optimizing existing big data models for greater capabilities and
functionalities that satisfy the requirements of major banks could be an important first step
to enable industry-wide adoption.
For many decades, the credit lending decision model had been built on the traditional
framework based on the credit scoring system of the three credit bureaus plus information
provided by consumers or public records (Chandrasekhar, 2018). Under this framework, the default
rate remained significant at major financial institutions (“Global Retail Banking Report”, 2018).
With the advent of big data and the social media hubs that created a wealth of client-related
information, technology had produced a possible alternative to the traditional framework. An
automated, big data enabled credit decision model offers effective tools for evaluating a
credit applicant proactively, using information beyond the normal credit profile such as
spending habits, network influence, fraudulent trends, and income earning potential
(Chandrasekhar, 2018). This study focused on investigating the benefits of existing big data
applications for lending qualification, as compared to the traditional credit bureau methods
employed by popular banks in the United States today. Any potential gap between the current
technology and the expected standards of major banks was identified and addressed.
Chapter 1 introduced the background of the research, the motivation behind the study, and
the objectives for the information system design science artifact. After the problem statement,
the purpose statement was presented to show the goal of the study, which was to
emphasize the practicality of a big data instrument in credit qualification processes. The
research question and propositions were stated to establish the core focus of the research,
along with the theoretical perspective that guided this paper. The assumptions and biases section
disclosed the researcher’s bias as a professional working in the financial services industry and
the assumption that the regulatory framework was favorable to the development of the instrument.
The limitations and delimitations of the study were also presented. Highly technical terms were
defined to ease readers into the contents of this paper. Finally, Chapter 1 outlined the
organization of the dissertation.
Topic Overview/Background
Many industry analysts agreed that the 2008 subprime mortgage crisis indicated
that the credit qualification model banks had employed to extend credit and loan products was
no longer working properly, allowing greed and fraud to compromise the integrity of the
honor-based credit application system (Mizen, 2008). Jiang, Liao, Lu, Wang and Xiang (2019)
raised the issue that a few individuals with the know-how to manipulate the system, through
short-term credit score boosts and exaggerated income, often qualified for premier loan
products with attractive terms. Meanwhile, other deserving clients, such as honest
mom-and-pop shop owners, may be passed over or fail to obtain loans that would improve their
businesses and, eventually, stimulate the economy and raise the national Gross Domestic Product
(Jiang et al., 2019). As Diaz (2016) pointed out, many otherwise highly responsible consumers
might not have maintained active credit profiles, or might have limited credit profiles because
they prefer to pay for all transactions in cash. There may come a time when these consumers need
to finance a large investment, and financial institutions may miss out on that business
opportunity if they examine only the regular credit evaluation process based on Fair, Isaac and
Company (FICO) score databases (Cornett, McNutt, Strahan, & Tehranian, 2011).
In this day and age, it is hard to ignore the term big data. The definition of big data is
often vague and varies from source to source; some even call it big data capacity (Hassna &
Lowry, 2016). Most sources define big data solely by the large volume of the dataset.
However, Heripracoyo and Kurniawan (2016) argued that besides large volume (a terabyte or
more), the term big data also refers to the diversity of data as well as data rates. Their article
characterized big data as a dataset made up of an assortment of structured, semi-structured, and
unstructured data that exhibits the three big Vs: (1) Volume, (2) Velocity, and (3) Variety
(Heripracoyo & Kurniawan, 2016). The big data value chain comprises four phases: data
generation, data acquisition, data storage, and data analysis (Chen, Mao, & Liu, 2014). As the
trend of big data gathered great interest in the information technology community, a new
scientific paradigm for addressing the big data problem and its organic growth gained
recognition and came to be called data-intensive scientific discovery (Chen & Zhang, 2014). Chen
and Zhang (2014) asserted that the data-intensive scientific discovery approach would open
avenues for alternative credit decision modeling tools that exhaustively vet the credit
qualification criteria and possibly mine fraudulent trends to reduce the risk of criminal
attempts.
Problem Statement
The problem addressed in the study was that the input enhancements big data analysts need
to improve a credit qualification model to support large banks in their risk management
operations had not been identified (Petropoulos, Siakoulis, Stavroulakis, & Klamargias, 2019).
Although big data terminology and awareness of its methodology have been around since
2005, and the prominent modern theories and applications had been tested extensively, the
technology was still in an intermediate trial stage and not yet mass produced (van Rijmenam,
2019). The majority of the lending industry relied on the credit bureaus’ scoring analytics,
which mainly measured consumers’ behaviors over time. These analytics did not portray shifts in
credit cycles and systemic risk and were therefore deemed irrelevant for high-stakes tactical
risk management decisions (Gandomi & Murtaza, 2015). Such a credit scoring approach contributed
to the gap in the body of knowledge of the financial institutions that this study sought to
address with design science research, via input enhancements to a credit application analytics
tool based on Artificial Neural Network (ANN) machine learning forecasting.
Integrating big data from social media and alternative sources allows financial
institutions to extend their credit qualification metrics to evaluate these special market
segments of otherwise highly qualified borrowers and, consequently, broaden financial
institutions’ revenue streams (Cockcroft & Russell, 2018). An investigation of a credit risk
analytic solution based on an improved big data approach was therefore beneficial to both early
technology adopters and hesitant larger financial institutions (Wu & Birge, 2016). Furthermore,
in their comprehensive review of financial institutions’ use cases and financial performance
data, Skyrius, Giriūnienė, Katin, Kazimianec and Žilinskas (2018) concluded that even though
big banks were exposed to a set of risks similar to that of other businesses, credit risk
management was what financial institutions should focus on as a main priority. Lin, Whinston,
and Fan (2015) expressed a view consistent with Skyrius et al.’s (2018) notion, claiming that
credit risk management is the main issue facing the Internet finance marketplace, where the data
involved in transactions are voluminous and multi-dimensional.
Purpose Statement
The purpose of the design science study was to (a) explore the input enhancements big
data analysts need to improve a credit qualification model to support large banks in their risk
management operations and (b) use the findings to modify the existing machine learning credit
qualification model for default risk prediction. Once the gap between the as-is capabilities of
current technologies and the to-be requirements is identified, improved big data analytic tools
would enable a credit decision model to become an effective evaluation tool. Such a model would
allow banks to examine a credit applicant by means beyond the normal credit profile, such as
spending habits, network influence, fraudulent trends, and income earning potential.
Banks could attract more good borrowers with the improved credit decision model and weed
out clients who appear creditworthy but have questionable spending habits and potential for
credit misuse.
By applying Artificial Neural Network (ANN) machine learning to consumer financial data
sets, big data analysts would be able to profile credit applicants into pools with relevant
ratings and risk classification outputs. Such outputs would then be compared to the thresholds
specifically established by the financial institutions based on their risk appetite and credit
product category. The design science dissertation is part of the new big data focused
implementation of credit risk management efforts aimed at a more exhaustive due diligence
process for vetting credit applications beyond the conventional wisdom of the financial
industry.
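The pooling and threshold comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the iQual implementation: the tier boundaries, tier names, and function names are hypothetical, and a bank would set its own boundaries according to its risk appetite and credit product category.

```python
from bisect import bisect_right

# Hypothetical upper bounds on predicted default probability for each tier.
# These values are illustrative assumptions, not figures from the study.
TIER_BOUNDS = [0.05, 0.15, 0.30]
TIER_NAMES = ["low", "moderate", "elevated", "high"]


def risk_tier(p_default: float) -> str:
    """Map a predicted default probability to a named risk tier."""
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must be a probability in [0, 1]")
    # bisect_right counts how many bounds are <= p_default, which is the tier index.
    return TIER_NAMES[bisect_right(TIER_BOUNDS, p_default)]


def pool_applicants(scores: dict[str, float]) -> dict[str, list[str]]:
    """Group applicant identifiers into pools by risk tier."""
    pools: dict[str, list[str]] = {name: [] for name in TIER_NAMES}
    for applicant_id, p_default in scores.items():
        pools[risk_tier(p_default)].append(applicant_id)
    return pools


def qualifies(p_default: float, product_threshold: float) -> bool:
    """Compare a prediction with a product-specific acceptance threshold."""
    return p_default <= product_threshold
```

For example, `pool_applicants({"A": 0.02, "B": 0.20})` places applicant A in the low pool and applicant B in the elevated pool, and `qualifies(0.20, 0.15)` rejects applicant B for a product whose hypothetical threshold is 0.15.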
Research Question
The study used a qualitative design science approach guided by the central
research question: “What are the input enhancements to an artificial neural network machine
learning algorithm that big data analysts need to improve a credit qualification model to
support large banks in their risk management operations?”
Propositions
Creswell (2014) defined propositions as a researcher’s claims or proposed theories that the
intended research is to examine. This study asserted that the input enhancements to the existing
artificial neural network machine learning algorithm by Yeh and Lien (2009) would introduce a
more applicable and user-friendly tool for banks’ big data analysts to predict the probability
of default among their credit applicants. As Hilbert (2016) pointed out, the major obstacle
to big data adoption among large organizations is the lack of understanding of how their
analytics models work, and the missing links between analytics insights or findings and business
value. To close this gap, this study’s enhanced analytical software application would enable
bank managers to gain a more complete understanding of the big data analytics process, so that
they can adopt big data analytics in their financial risk management process.
Conceptual Framework
Yeh and Lien’s (2009) study provided the initial framework for credit risk
prediction by means of various machine learning approaches, of which the ANN was found to be
the most reliable technique. However, their study did not present a coding exhibit for any of
the machine learning techniques mentioned, nor demonstrate any specific use case on which
financial institutions could build their credit risk analytics applications. As Hilbert (2016)
pointed out, the major obstacle to big data adoption among large organizations is the lack of
understanding of how big data analytics models work, and the missing links between analytics
insights or findings and business value. The goal of the research study was to examine and
modify a qualification analytics application, iQual, based on the artificial neural network
machine learning algorithm by Yeh and Lien (2009), for financial institutions to evaluate
borrowers from an unstructured data set of credit applicants.
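To make the ANN technique concrete, the following minimal sketch shows a single forward pass through a network with one hidden layer that maps applicant features to a default probability. It is an illustration only and is neither the network of Yeh and Lien (2009) nor the iQual code; the function names, the layer sizes, and any weight values supplied by a caller are hypothetical, and no training step is shown.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_default_probability(
    features: list[float],
    hidden_weights: list[list[float]],
    hidden_biases: list[float],
    output_weights: list[float],
    output_bias: float,
) -> float:
    """Forward pass of a one-hidden-layer network with sigmoid units.

    hidden_weights holds one weight vector per hidden unit; all parameter
    values are assumed to come from a training procedure not shown here.
    """
    # Each hidden unit takes a weighted sum of the inputs plus its bias.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(unit_weights, features)) + bias)
        for unit_weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output unit combines the hidden activations into one probability.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
```

With every weight and bias set to zero the function returns 0.5, which reflects an untrained network that has no information about the applicant. A trained network would move the output toward 0 or 1 as the evidence for repayment or default accumulates.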
For the qualification analytics system (iQual) used to facilitate online lending
decisions for financial institutions, the central theoretical framework behind the data warehouse
was the Entity Relationship (ER) model. The ER model was used to implement the
transaction-oriented database within the Online Transaction Processing (OLTP) system, which
contained the complete process for the discovery-driven artificial neural network data mining
algorithm for the data warehouse (Han, Pei, & Kamber, 2011). The star model used includes
consumers’ credit and identification data organized around a central fact table. The data are
collected from various sources, including publicly available sources, peer reporting sources,
and customer-provided information, and are cross-verified through the various tables linked to
the central fact table (Gandomi & Murtaza, 2015).
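The star model described above can be illustrated with a small in-memory database. The sketch below uses Python’s standard sqlite3 module; the table names, column names, and sample rows are hypothetical and are not taken from the iQual data warehouse. It shows one central fact table linked to two surrounding tables, and a query that places the values reported by different sources side by side so they can be cross-verified.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create an in-memory star schema: one fact table, two linked tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_applicant (
            applicant_id INTEGER PRIMARY KEY,
            full_name    TEXT NOT NULL
        );
        CREATE TABLE dim_source (
            source_id   INTEGER PRIMARY KEY,
            source_type TEXT NOT NULL
        );
        CREATE TABLE fact_credit_observation (
            observation_id  INTEGER PRIMARY KEY,
            applicant_id    INTEGER NOT NULL REFERENCES dim_applicant(applicant_id),
            source_id       INTEGER NOT NULL REFERENCES dim_source(source_id),
            reported_income REAL NOT NULL
        );
        """
    )
    # Illustrative rows only: one applicant observed through three source types.
    conn.execute("INSERT INTO dim_applicant VALUES (1, 'Sample Applicant')")
    conn.executemany(
        "INSERT INTO dim_source VALUES (?, ?)",
        [(1, "public"), (2, "peer"), (3, "customer")],
    )
    conn.executemany(
        "INSERT INTO fact_credit_observation (applicant_id, source_id, reported_income) "
        "VALUES (?, ?, ?)",
        [(1, 1, 52000.0), (1, 2, 50000.0), (1, 3, 55000.0)],
    )
    conn.commit()
    return conn


def income_by_source(conn: sqlite3.Connection, applicant_id: int) -> list[tuple[str, float]]:
    """Join the fact table to its source table to compare reported values."""
    cursor = conn.execute(
        """
        SELECT s.source_type, f.reported_income
        FROM fact_credit_observation AS f
        JOIN dim_source AS s ON s.source_id = f.source_id
        WHERE f.applicant_id = ?
        ORDER BY s.source_type
        """,
        (applicant_id,),
    )
    return cursor.fetchall()
```

Calling `income_by_source(build_demo_warehouse(), 1)` lists the income each source reported for the sample applicant. A large disagreement between the customer-provided value and the public or peer values would be a natural trigger for manual review.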
The data was analyzed in the iQual analytical protocol and then drilled through the
knowledge discovery database, established from clusters in a repository of previous findings, to
support the decision, whether made by automated Artificial Intelligence or through a manual
human underwriting process. The Knowledge Discovery in Databases (KDD) process for the data can
be described briefly as: Data --> Information --> Understanding --> Knowledge --> Business
(Process) Intelligence (iQual). The BI system may employ the added data from the knowledge
gained through credit applications and online profiling to make better lending decisions that
are not as overarching or risk averse. The data warehouse, with its centralized fact table and
star ER model, was the founding framework for the BI system, and it became more powerful as it
learned and grew with the volume of consumer lending data passing through the database.
Assumptions/Biases
Assumptions are the critical notions that form part of this dissertation’s underlying
framework and are generally assumed to be either true or plausible in a given credit
qualification scenario using big data methods (Piantanida & Garman, 2009). There were three
assumptions. Assumption 1 presumed a favorable regulatory landscape in which policy makers allow
the sharing and collection of consumers’ social media data. Assumption 2 was that the majority
of borrowers have a wide variety of data sources linkable to their social identity and status.
Assumption 3 was that the data used for evaluation by the study’s instrument had received
full consent for research use, had data of a sensitive nature removed or redacted, or both.
Biases are defined as the tendencies that qualitative researchers may hold from their
own professional experiences or viewpoints and that prevent the unprejudiced consideration of
the research question (Pannucci & Wilkins, 2010). Bias 1 arose from the researcher’s own
professional background, as the researcher had been actively working in the financial services
field. The researcher may hold bias and a certain conservatism regarding automated tools for
sensitive financial matters. Bias 2 was the researcher’s optimistic outlook on big data and
Artificial Intelligence tools as solutions to the issues existing in the financial industry.
However, the literature review in Chapter 2 countered these biases by presenting various
scholarly articles and experiments, with the methods and findings of other researchers working
on the same topics, so as to minimize the bias and assumptions that may be present in the study.
Significance of the Study
The significance of the study lay in the observation that credit lending decision models
are broken, having allowed numerous credit defaults and even the recent housing market
crisis to take place (Diamond & Rajan, 2009). The end result was a significant default rate at
major financial institutions, as people with greed can easily defraud the system for their own
gain. The advent of big data creates a wealth of client-related information, which could produce
a possible alternative to the traditional framework. Currently, the system takes into
consideration only a very low percentage of the available financial, personal profile, and
consumer habit data in the consumer character package. Other vital information that can
determine a borrower’s credit trustworthiness is not measured under the current system. For
example, a borrower who is a bright entrepreneur with great venture ideas and stellar academic
or research credentials should be considered with additional or alternative metrics that
determine creditworthiness, instead of the regular 4C (Character, Capacity, Capital and
Conditions) formula, which would simply look up the negative net worth and limited credit
history (“What are the 4 C’s,” n.d.).
The societal and organizational benefit of this study would be its integration into bank
operations and the advanced technical framework it offers for leveraging big data (mostly in
unstructured formats) from diverse public, peer, and private data sources (knowledge bases).
Such a lending qualification model would minimize the risk of default and potentially offer the
best rates to deserving clients, in turn strengthening banks and stabilizing economies through
smarter and better-informed lending decisions. For large credit-based purchases that affect the
national economy’s bottom line, serious due diligence needs to be conducted by financial
institutions, and that is where a big data platform can assist.
Delimitations
Delimitations are those characteristics that limit the scope of the research problem and
define the boundaries of the research (Simon, 2011). Delimitation 1 was that the study discussed
the benefits of big data within the narrow scope of lending qualification for loan products using
American financial data metrics, albeit with more access to alternative data than a regular
credit due diligence process has without the assistance of a big data platform. Delimitation 2
was that the study examined a limited amount of authorized data through a data set obtained from
publicly available sources, and as such cannot represent the actual outcomes in a financial
institution with large data stores of actual customer profiles. Delimitation 3 was that the
study did not cover fraud prevention in credit qualification, even though such a practice may be
performed concurrently with loan decision models to safeguard banks from criminal attempts.
Delimitation 4 was the criteria used to select the interview participants for the study, as
they may not represent all the differing views of bank managers and big data analysts in the
industry regarding the iQual application. Delimitation 5 was the limited scope of the
referenced sample dataset on which iQual was tested, due to the sensitive nature of users’
credit information and the limited public availability of such data.
Limitations
Limitations in this study are the specific characteristics and influences of the
topic and design methodology that affect the study, especially the interpretation of
the findings (Price & Murnan, 2004). Limitation 1 was the data privacy concerns of
users and regulators across the world, especially after Facebook’s Cambridge Analytica
scandal. Officially, Facebook stated that it banned the use of its social media data for any
monetization-related service (LaForgia, Dance, & Confessore, 2018). However, this did not exclude
the data that users opt in to provide to financial institutions, or publicly available data.
Limitation 2 was the lack of guidance on privacy protection policy for social network data, which
was still under debate among regulators and policy makers and left the technology
discussed in this paper pending or limited in scope for the time being (Zhang, 2018). Its status
depended largely on how federal regulators and lawmakers were trying to curb the use of social
media metadata that used to be available to businesses and researchers. Big data analysts had to
watch the regulatory landscape to determine the future of this technology. Limitation 3 was
that the sociotechnical plan was limited for users with no or limited credit history and no
social media or publicly available data linked to their identity.
Definition of Terms
Big data: a very large and complex dataset that is made up of an assortment of
structured, semi-structured, and unstructured data and exhibits the characteristics of the three
big Vs: (1) Volume, (2) Velocity, and (3) Variety (Heripracoyo & Kurniawan, 2016). The big data
value chain comprises four phases: data generation, data acquisition, data storage, and data
analysis (Chen et al., 2014).
Reproduced with permission of copyright owner. Further reproduction prohibited without permission.