Annotated Bibliography for the Articles Attached Below


EXPLORING INPUT ENHANCEMENTS BIG DATA ANALYSTS NEED TO IMPROVE A CREDIT QUALIFICATION MODEL TO SUPPORT LARGE BANKS IN THEIR RISK MANAGEMENT OPERATIONS

A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree of

Doctor of Computer Science

By

Tuan Duc Nguyen

Colorado Technical University

March 2020

ProQuest Number: 27830367

All rights reserved

INFORMATION TO ALL USERS

The quality of this reproduction is dependent on the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Published by ProQuest LLC (2020). Copyright of the Dissertation is held by the Author.

All Rights Reserved. This work is protected against unauthorized copying under Title 17, United States Code. Microform Edition © ProQuest LLC.

ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346

Committee

Alexa Schmitt, PhD, Chair

James O. Webb, PhD, Committee Member

Cynthia Calongne, PhD, Committee Member

Date Approved: March 24th, 2020


© Tuan Duc Nguyen, 2020


Abstract

This study explored the use of an artificial neural network (ANN) tool called iQual to improve a credit qualification model that supports large banks in their risk management operations. The research leveraged the Design Science framework to design and evaluate a web-based Information Technology artifact, named iQual, that predicts the default probability for a list of credit borrowers. A focus group of five participants, comprising senior data technical experts and financial institution directors in the Washington, DC metro area, was selected in advance to watch a live demonstration of the iQual tool and provide expert feedback on the artifact. The research followed the framework for proof of concept, artifact construction, and artifact enhancement of the ANN machine learning prototype of the web-based iQual credit qualification application. The research method included semi-structured interviews, each consisting of seven open-ended questions, answered by the five expert reviewers, who had technical expertise and trade experience in the financial industry. The feedback compiled from the expert reviewers, recorded through transcription, was then organized into themes of enhanced features. The reviewers recognized the following enhanced features of the iQual dashboard tool: a) data load module, b) applicant summary view module, c) setting of credit product qualification standards, d) prediction execution, and e) assessment of prediction accuracy. The analysis of the interview transcripts also indicated that additional elements need to be addressed or improved for real-life application in the banking industry, such as a) quality control, b) better logs, c) different loan options, d) interest rate calculation, and e) user management.

Keywords: Artificial Neural Network; machine learning; credit qualification tool; iQual; input enhancements; big data analysts


Dedication

I would like to dedicate the success of this study to my supportive wife and our cheerful daughter. They have always stood beside me throughout this journey, lending both moral and physical support in my most difficult times. Their affection is the single greatest boost in my life.


Acknowledgements

I would like to express my deepest gratitude to my Research Supervisor, Dr. Alexa Schmitt. Your dedication, expertise, and persistence have carried me through this challenging journey.


Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter One
    Topic Overview/Background
    Problem Statement
    Purpose Statement
    Research Question
    Propositions
    Conceptual Framework
    Assumptions/Biases
    Significance of the Study
    Delimitations
    Limitations
    Definition of Terms
    General Overview of the Research Design
    Summary of Chapter One
    Organization of Dissertation
Chapter Two
    Big Data
    Risk Management Using Big Data Methods
    Conceptual Framework
    Summary of Literature Review
Chapter Three
    Research Tradition
    Research Question
    Research Design
        Population and Sample
        Sampling Procedure
        Instrumentation
        Validity
        Reliability
        Data Collection
        Data Analysis
        Ethical Considerations
    Summary of Chapter Three
Chapter Four
    The Original Artificial Neural Network Artifact
    Artifact Modification
        iQual’s Overall Analytics Goals
        iQual’s Architectural Design
    Summary of Chapter Four
Chapter Five
    Data and Participant Demographics
Chapter Six
    Findings and Conclusions
    Limitations of the Study
    Implications for Practice
    Implications of Study and Recommendations for Future Research
        Future Study 1
        Future Study 2
    Conclusion
References
Appendix A
Appendix B
Approved Researcher’s Permission
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G

List of Tables

Table 1 Data Attributes Summary
Table 2 Participant Demographics
Table 3 Expert Reviewer Responses to Question 1
Table 4 Expert Reviewer Responses to Question 2
Table 5 Expert Reviewer Responses to Question 3
Table 6 Expert Reviewer Responses to Question 4
Table 7 Expert Reviewer Responses to Question 5
Table 8 Expert Reviewer Responses to Question 6
Table 9 Expert Reviewer Responses to Question 7

List of Figures

Figure 1. Automated lending decision system conceptual framework

CHAPTER ONE

Big data analytics has become increasingly critical in the financial services industry (Barr, Koziara, Flood, Hero, & Jagadish, 2018), especially for risk managers employed at financial institutions. The modern era of data-centric information technology has given birth to an enormous number of large and complex financial data sets, which may yield valuable insights for risk management. As a result, big data analytics brings several opportunities to identify and recognize both inherent and implicit interconnections, relationships, and patterns among the risk determinant factors associated with diverse credit consumer profiles (Barr et al., 2018). However, due to the sensitive nature of banks’ data, big data analytics has not become a mainstream technology platform for the banking industry as a whole (Sinclair, 2017).

The goal of this study was to evaluate the capabilities of existing big data tools with respect to optimizing risk factors, from the standpoint of the big financial institutions that were still hesitant to implement the technology. Meanwhile, smaller organizations and enterprises had experienced much improved credit risk management operations, according to the Economist Intelligence Unit (“Global Retail Banking Report”, 2018) survey, which showed promising results in credit card fraud prevention (reported by 31% of respondents) and in prevention of defaults by accurately predicting credit repayment risk (reported by 26% of respondents). Toward this end, optimizing existing big data models for greater capabilities and functionalities that satisfy the requirements of major banks could be an important first step toward industry-wide adoption.

For many decades, the credit lending decision model had been built on the traditional framework based on the credit scoring system of the three credit bureaus plus information provided by consumers or public records (Chandrasekhar, 2018). Under this framework, the default rate remained significant at major financial institutions (“Global Retail Banking Report”, 2018). With the advent of big data and the social media hubs that created a wealth of client-related information, technology had produced a possible alternative to the traditional framework. An automated, big data enabled credit decision model offers effective tools to evaluate a credit applicant proactively by looking beyond the applicant’s normal credit profile to factors such as spending habits, network influence, fraudulent trends, and income earning potential (Chandrasekhar, 2018). This study focused on investigating the benefits of existing big data applications for lending qualification, as compared to the traditional credit bureau methods employed by popular banks in the United States today. Any potential gap between the current technology and the expected standards of major banks was identified and addressed.

Chapter 1 introduced the background of the research, the motivation behind the study, and the objectives for the information system design science artifact. Once the problem statement had been identified, the purpose statement was presented to show the goal of the study, which was to emphasize the practicality of a big data instrument in credit qualification processes. The research question and propositions were stated to establish the core focus of the research, along with the theoretical perspective that guided this paper. The assumptions and biases section described the bias the researcher held as someone working in the financial services industry, and the assumption that the regulatory framework was favorable to the development of the instrument. Limitations and delimitations of the study were also presented. Highly technical terms were defined to ease readers into the contents of this paper. Finally, Chapter 1 outlined the organization of the dissertation.

Topic Overview/Background

Many industry analysts agreed that the 2008 subprime mortgage crisis indicated that the credit qualification model banks had employed to extend credit and loan products was no longer working properly, allowing greed and fraud to compromise the integrity of the honor-based credit application system (Mizen, 2008). Jiang, Liao, Lu, Wang, and Xiang (2019) raised the issue that a few individuals who knew how to manipulate the system with short-term credit score boosts and exaggerated income often qualified for premier loan products with attractive terms. Meanwhile, other deserving clients, such as honest mom-and-pop shop owners, might be passed over or fail to obtain the loans that would improve their businesses and, eventually, move the economy and raise the national Gross Domestic Product (Jiang et al., 2019). As Diaz (2016) pointed out, many otherwise highly responsible consumers might not have maintained active credit profiles, or might have limited credit profiles because they prefer to pay cash for all transactions. There may come a time when these consumers need to finance a large investment, and financial institutions may miss out on a big business opportunity if they examine only the regular credit evaluation process based on Fair, Isaac and Company (FICO) score databases (Cornett, McNutt, Strahan, & Tehranian, 2011).

In this day and age, it is hard to ignore the term big data. The definition of big data is often vague and varies from source to source; some even call it big data capacity (Hassna & Lowry, 2016). Most sources define big data solely based on the large volume of the dataset. However, Heripracoyo and Kurniawan (2016) argued that besides large volume (a terabyte or more), the term big data also refers to the diversity of data as well as data rates. Their article described big data as a dataset that is made up of an assortment of structured, semi-structured, and unstructured data and that has the characteristics of the three big Vs: (1) Volume, (2) Velocity, and (3) Variety (Heripracoyo & Kurniawan, 2016). The big data value chain also contains four phases: data generation, data acquisition, data storage, and data analysis (Chen, Mao, & Liu, 2014). As the trend of big data gathered great interest in the information technology community, a new scientific paradigm to address the big data problem and its organic growth gained recognition and was called data-intensive scientific discovery (Chen & Zhang, 2014). Chen and Zhang (2014) asserted that the data-intensive scientific discovery approach would open avenues for alternative credit decision modeling tools that exhaustively vet the credit qualification criteria and possibly mine fraudulent trends to reduce the risk of criminal attempts.

Problem Statement

The problem addressed in the study was that the input enhancements big data analysts need to improve a credit qualification model to support large banks in their risk management operations had not been identified (Petropoulos, Siakoulis, Stavroulakis, & Klamargias, 2019). Although big data terminology and awareness of its methodology have been around since 2005, and the prominent modern theories and applications had been tested extensively, the technology was still in an intermediate trial stage and had not yet reached mass adoption (van Rijmenam, 2019). The majority of the lending industry relied on the credit bureaus’ scoring analytics, which mainly measured consumers’ behaviors over time and did not capture shifts in credit cycles and systemic risk; such analytics were therefore deemed irrelevant for high-stakes tactical risk management decisions (Gandomi & Murtaza, 2015). Such a credit scoring approach contributed to the gap in the body of knowledge of financial institutions that this study sought to address through design science research on the input enhancements to a credit application analytics application based on Artificial Neural Network (ANN) machine learning forecasting tools.

Integrating big data from social media and alternative sources allows financial institutions to extend their credit qualification metrics to evaluate these special market segments of otherwise highly qualified borrowers and, consequently, broaden financial institutions’ revenue streams (Cockcroft & Russell, 2018). An investigation of a credit risk analytic solution based on an improved big data approach was therefore beneficial to both early technology adopters and hesitant larger financial institutions (Wu & Birge, 2016). Furthermore, in their comprehensive review of financial institutions’ use cases and financial performance data, Skyrius, Giriūnienė, Katin, Kazimianec, and Žilinskas (2018) concluded that even though big banks were exposed to a set of risks similar to that of other businesses, credit risk management was what financial institutions should focus on as a main priority. Lin, Whinston, and Fan (2015) expressed a view consistent with that of Skyrius et al. (2018), claiming that credit risk management is the main issue facing the Internet finance marketplace, where the data involved in transactions are voluminous and multidimensional.

Purpose Statement

The purpose of the design science study was to (a) explore the input enhancements big data analysts need to improve a credit qualification model to support large banks in their risk management operations and (b) use the findings to modify the existing machine learning credit qualification model for default risk prediction. Once the gap between the as-is capabilities of current technologies and the to-be requirements is identified, the improved big data analytic tools would enable a credit decision model to become an effective evaluation tool. Such a model would allow banks to examine a credit applicant by means beyond the normal credit profile, such as spending habits, network influence, fraudulent trends, and income earning potential. Given the improved credit decision model, banks could attract more good borrowers and weed out clients who appear to be good but have questionable spending habits and potential for credit misuse.

By applying Artificial Neural Network (ANN) machine learning to consumer financial data sets, big data analysts would be able to profile credit applicants into pools with relevant ratings and risk classification outputs. Such outputs would then be compared to the thresholds specifically established by the financial institutions based on their risk appetite and credit product category. The design science dissertation is part of the new big data focused implementation of credit risk management efforts aimed at a more exhaustive due diligence process for vetting credit applications beyond the conventional wisdom of the financial industry.
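The comparison of model outputs against institution-specific thresholds can be illustrated with a short Python sketch. This is not the iQual implementation: the product names, threshold values, pool labels, the half-threshold rule, and the function `assign_pool` are all hypothetical, and the default probabilities are made-up values standing in for ANN outputs.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the maximum predicted default probability a bank
# might accept for each credit product (illustrative values only).
PRODUCT_THRESHOLDS = {
    "credit_card": 0.20,
    "auto_loan": 0.10,
    "mortgage": 0.05,
}


@dataclass
class Applicant:
    applicant_id: str
    default_probability: float  # assumed to come from an ANN; made up here


def assign_pool(applicant: Applicant, product: str) -> str:
    """Place an applicant into a risk pool for the given credit product.

    Assumed rule: at or below half the threshold qualifies outright,
    at or below the threshold goes to manual review, otherwise declined.
    """
    threshold = PRODUCT_THRESHOLDS[product]
    p = applicant.default_probability
    if p <= threshold / 2:
        return "qualified"
    if p <= threshold:
        return "manual_review"
    return "declined"


applicants = [
    Applicant("A-001", 0.02),
    Applicant("A-002", 0.08),
    Applicant("A-003", 0.30),
]

# Pool every applicant against the (hypothetical) auto loan threshold.
pools = {a.applicant_id: assign_pool(a, "auto_loan") for a in applicants}
for applicant_id, pool in pools.items():
    print(applicant_id, pool)
```

Keeping the thresholds in a separate table mirrors the idea that each institution sets its own limits by risk appetite and product category, independent of the prediction model itself.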

Research Question

The study used a qualitative design science approach, focused and guided by the central research question: “What are the input enhancements to an artificial neural network machine learning algorithm big data analysts need to improve a credit qualification model to support large banks in their risk management operations?”

Propositions

Creswell (2014) defined propositions as the researcher’s claims or proposed theories that the intended research is to examine. This study asserted that the input enhancements to the existing artificial neural network machine learning algorithm by Yeh and Lien (2009) would introduce a more applicable tool and a more user-friendly approach for banks’ big data analysts to predict the probability of default among their credit applicants. As Hilbert (2016) pointed out, the major obstacle to big data adoption among large organizations is the lack of understanding of how their analytics models work, and the missing links between analytics insights/findings and business value. To close this gap, this study’s enhanced analytical software application will enable bank managers to gain a more complete understanding of the big data analytics process so that they can adopt big data analytics in their financial risk management process.


Conceptual Framework

Yeh and Lien’s (2009) study provided the initial framework for credit risk prediction by means of various machine learning approaches, of which the ANN was found to be the most reliable technique. However, their study did not present a coding exhibit for any of the machine learning techniques mentioned, nor did it demonstrate any specific use case upon which financial institutions could build their credit risk analytics applications. As Hilbert (2016) pointed out, the major obstacle to big data adoption among large organizations is the lack of understanding of how big data analytics models work, and the missing links between analytics insights/findings and business value. The goal of the research study was to examine and modify a qualification analytics application, iQual, based on the artificial neural network machine learning algorithm by Yeh and Lien (2009), for financial institutions to evaluate borrowers from an unstructured data set of credit applicants.
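To make the idea of an ANN producing a default probability more concrete, the following Python sketch shows the forward pass of a very small feed-forward network. It is only an illustration under stated assumptions and is not the model of Yeh and Lien (2009) or the iQual code: the three input features, the single hidden layer of two units, and all weight and bias values are hypothetical and hand-picked, whereas a real network would learn its weights from training data.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_default_probability(features, hidden_weights, hidden_biases,
                                output_weights, output_bias):
    """Forward pass of a one-hidden-layer network with sigmoid activations."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical, hand-picked parameters (not learned from any data set).
HIDDEN_WEIGHTS = [[0.8, -0.5, 1.2], [-0.3, 0.9, 0.4]]
HIDDEN_BIASES = [0.1, -0.2]
OUTPUT_WEIGHTS = [1.5, -1.1]
OUTPUT_BIAS = -0.4

# Hypothetical normalized features:
# [credit utilization, income level, months past due], each scaled to 0..1.
low_risk_applicant = [0.1, 0.9, 0.0]
high_risk_applicant = [0.9, 0.2, 1.0]

p_low = predict_default_probability(
    low_risk_applicant, HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS)
p_high = predict_default_probability(
    high_risk_applicant, HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS)

print(f"low-risk applicant:  {p_low:.3f}")
print(f"high-risk applicant: {p_high:.3f}")
```

Because the final activation is a sigmoid, the output always lies strictly between 0 and 1 and can be read as a predicted probability of default.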

For the qualification analytics system (iQual) used to facilitate online lending decisions for financial institutions, the central theoretical framework behind the data warehouse would be the Entity Relationship (ER) model, which is used to implement the transaction-oriented database within the Online Transactional Processing (OLTP) system that contains the complete process for the discovery-driven artificial neural network data mining algorithm for the data warehouses (Han, Pei, & Kamber, 2011). The star model to be used includes various consumers’ credit and identification data, with a central fact table of data collected from various sources, including publicly available sources, peer reporting sources, and customer-provided information, cross-verified by the other tables linked to this central fact table (Gandomi & Murtaza, 2015).
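A star model with a central fact table can be sketched in a few lines using Python’s built-in `sqlite3` module. The sketch below is an assumption-laden illustration rather than the iQual schema: every table name, column name, and data value is hypothetical, and cross-verification is reduced to comparing the income reported for the same applicant by different source types.

```python
import sqlite3

# In-memory database; all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_applicant (
    applicant_key INTEGER PRIMARY KEY,
    full_name     TEXT NOT NULL
);
CREATE TABLE dim_source (
    source_key  INTEGER PRIMARY KEY,
    source_type TEXT NOT NULL  -- 'public', 'peer', or 'customer'
);
-- Central fact table: one row per observation, linked to both dimensions.
CREATE TABLE fact_credit_observation (
    observation_id  INTEGER PRIMARY KEY,
    applicant_key   INTEGER NOT NULL REFERENCES dim_applicant(applicant_key),
    source_key      INTEGER NOT NULL REFERENCES dim_source(source_key),
    reported_income REAL NOT NULL
);
""")

conn.executemany(
    "INSERT INTO dim_applicant VALUES (?, ?)",
    [(1, "Applicant One"), (2, "Applicant Two")],
)
conn.executemany(
    "INSERT INTO dim_source VALUES (?, ?)",
    [(1, "public"), (2, "peer"), (3, "customer")],
)
conn.executemany(
    "INSERT INTO fact_credit_observation "
    "(applicant_key, source_key, reported_income) VALUES (?, ?, ?)",
    [(1, 1, 52000.0), (1, 3, 55000.0), (2, 2, 40000.0), (2, 3, 90000.0)],
)

# Cross-verify: how far apart are the incomes reported by different sources?
rows = conn.execute("""
SELECT a.full_name,
       COUNT(DISTINCT s.source_type) AS source_count,
       MAX(f.reported_income) - MIN(f.reported_income) AS income_spread
FROM fact_credit_observation AS f
JOIN dim_applicant AS a ON a.applicant_key = f.applicant_key
JOIN dim_source AS s ON s.source_key = f.source_key
GROUP BY a.applicant_key
ORDER BY a.applicant_key
""").fetchall()

for full_name, source_count, income_spread in rows:
    print(full_name, source_count, income_spread)
```

A large spread between sources, as with the second made-up applicant, is the kind of discrepancy that cross-verification against the central fact table is meant to surface.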


The data was analyzed in the iQual analytical protocol and then drilled through the knowledge discovery database, established from clusters in the repository of previous findings, to support the decision, made either by automated Artificial Intelligence or through a manual human underwriting process. The Knowledge Discovery in Database (KDD) process for the data is described briefly as: Data --> Information --> Understanding --> Knowledge --> Business (Process) Intelligence (iQual). The Business Intelligence (BI) system may employ the added data from the knowledge gained through credit applications and online profiling to make better lending decisions that are not as overarching or risk averse. The data warehouse utilizing a centralized fact table and a star ER model was the founding framework for the BI system, which became more powerful as it learned and grew with the volume of consumer lending data passing through the database.

Assumptions/Biases

Assumptions are the critical notions that form part of this dissertation’s underlying framework and are generally assumed to be either true or plausible in a given credit qualification scenario using big data methods (Piantanida & Garman, 2009). There were three assumptions. Assumption 1 was that policy makers would maintain a favorable regulatory landscape that allows the sharing and collecting of consumers’ social media data. Assumption 2 was that the majority of borrowers have a wide variety of data sources linkable to their social identity and status. Assumption 3 was that the data used for evaluation by the study’s instrument either had received full consent for research use or had data of a sensitive nature removed or redacted.

Biases are defined as the tendencies that qualitative researchers may hold from their own professional experiences or viewpoints that prevent unprejudiced consideration of the research question (Pannucci & Wilkins, 2010). Bias 1 came from the researcher’s own professional background, as the researcher had been actively working in the financial services field. The researcher may hold bias and a certain conservatism toward automated tools for sensitive financial matters. Bias 2 was the researcher’s optimistic outlook on big data and Artificial Intelligence tools as a solution to the issues existing in the financial industry. However, the literature review in Chapter 2 countered these biases by presenting various scholarly articles and experiments, with the methods and findings of other researchers working on the same topics, so as to minimize the biases and assumptions that may be present in the study.

Significance of the Study

The significance of the study was that credit lending decision models were evidently broken, as they had allowed numerous credit defaults and even the recent housing market crisis to take place (Diamond & Rajan, 2009). The end result was a significant default rate at major financial institutions, as greedy individuals could easily defraud the system for their own gain. The advent of big data created a wealth of client-related information, which could produce a possible alternative to the traditional framework. The current system took into consideration only a very low percentage of the available financial, personal profile, and consumer habit data in the consumer character package. Other vital information that can determine a borrower’s credit trustworthiness was not measured under the current system. For example, if the borrower were a bright entrepreneur with great venture ideas and stellar academic or research credentials, he or she should have been considered using additional or alternative metrics of creditworthiness instead of the regular 4C (Character, Capacity, Capital, and Conditions) formula, which would have simply looked at the negative net worth and limited credit history (“What are the 4 C’s,” n.d.).

The societal and organizational benefit of this study would be the integration into bank operations of an advanced technical framework that leverages big data (mostly in unstructured formats) from diverse public, peer, and private data sources (knowledge bases). Such a lending qualification model would minimize the risk of default and potentially offer the best rates to deserving clients, which would in turn make banks stronger and stabilize economies through smarter, better-informed lending decisions. When it comes to large credit-based purchases that affect the national economy’s bottom line, serious due diligence needs to be conducted by financial institutions, and that is where a big data platform can assist.

Delimitations

Delimitations are those characteristics that limit the scope of the research problem and define the boundaries of the research (Simon, 2011). Delimitation 1 was that the study discussed the benefits of big data within the narrow scope of lending qualification for loan products using American financial data metrics, albeit with more access to alternative data than a regular credit due diligence process conducted without the assistance of a big data platform. Delimitation 2 was that the study examined a limited amount of authorized data through a data set obtained from publicly available sources, and as such could not represent the actual outcomes in a financial institution with large data stores of actual customer profiles. Delimitation 3 was that the study did not cover fraud prevention in credit qualification, even though such a practice may be performed concurrently with loan decision models to safeguard banks from criminal attempts. Delimitation 4 was the criteria used to select the interview participants for the study, as the participants may not represent all of the differing views that bank managers and big data analysts in the industry hold of the iQual application. Delimitation 5 was the limited scope of the reference sample dataset on which iQual was tested, due to the sensitive nature of users’ credit information and the limited public availability of such data.


Limitations

Limitations in this study are the specific characteristics and influences of the topic and design methodology that affect the study, especially the interpretation of the findings (Price & Murnan, 2004). Limitation 1 was the data privacy concerns of users and regulators across the world, especially after Facebook’s Cambridge Analytica scandal. Officially, Facebook stated that it banned the use of its social media data for any monetization-related service (LaForgia, Dance, & Confessore, 2018). However, this did not exclude the data that users opt in to provide to financial institutions, or publicly available data. Limitation 2 was the lack of guidance on privacy protection policy for social network data, which was still up for debate among regulators and policy makers and which left the technology discussed in this paper in pending status or limited in scope for the time being (Zhang, 2018). This limitation largely depended on how federal regulators and lawmakers were trying to curb the use of social media metadata that used to be available to businesses and researchers. Big data analysts had to watch the regulatory landscape to determine the future of this technology. Limitation 3 was that the sociotechnical plan is limited for users with no or limited credit history and no social media or publicly available data linked to their identity.

Definition of Terms

Big data: a very large and complex dataset that is made up of an assortment of structured, semi-structured, and unstructured data and has the characteristics of the three big Vs: (1) Volume, (2) Velocity, and (3) Variety (Heripracoyo & Kurniawan, 2016). The big data value chain also contains four phases: data generation, data acquisition, data storage, and data analysis (Chen et al., 2014).

Reproduced with permission of copyright owner. Further reproduction prohibited without permission.