Agent Survival Simulation in Minecraft: A Research Proposal

Anthony Harris
Department of Computer Science
Missouri State University
Springfield, Missouri, United States of America

[email protected]

Abstract—Game developers have been striving to design artificial intelligence agents in video games in such a way that players feel as though they are interacting with another human player since the founding of the gaming industry. There are many questions that must be answered prior to truly achieving this goal: How do we simulate human playstyles in artificial agents? Do we imitate human sensory input? If so, how? How do we do so in a way that is computationally feasible for agents to respond in human-like response times? What learning algorithm and models yield the most promising, or even any, results? These are but a few of the questions that have been asked over the years. In this paper, I propose and outline a research project, complete with some preliminary results, that will strive to answer some of these questions, as well as others. I outline the proposed research questions to be answered, my corresponding hypothesis, the methodology to be used, as well as the potential resource requirements. Specifically, I propose the use of q-learning and the exploration of various input structures to develop artificial general intelligence agents whose behaviors and survival strategies closely resemble human players in Minecraft and are capable of surviving without exterior influence in the survival game mode.

Keywords—artificial intelligence, reinforcement learning, Minecraft, malmo, marlo, artificial neural networks, q-learning

I. INTRODUCTION

Artificial intelligence (AI) in games can make all the difference in player immersion and overall perception of the game experience. It was clear as early as 2001, when major improvements in processing capabilities led to higher-quality graphics and a desire from players to interact with more intelligent agents, that the interactions between players and AI agents played a substantial role in the success or failure of a game [6]. Modern methods are as vast as they are varied: Q-Learning as an offshoot of Reinforcement Learning (RL) [2][3][4][5][10][11], Neuro-Evolution of Augmenting Topologies (NEAT) [7][8], and Genetic Algorithms (GA) [9] are only a few of the more popular methods, each distinct from the rest. Organizations such as Microsoft and the Association for the Advancement of Artificial Intelligence (AAAI) have taken to crowd-sourcing this problem by holding competitions. These competitions urge researchers to develop more intelligent AI agents in various games, including Minecraft. That said, even with the support of a company as large as Microsoft, artificial general intelligence (AGI) agents, defined as agents that can perform a vast range of tasks with minimal exterior manipulation, have yet to be successfully implemented to an acceptable degree. To this end, I propose the use of RL techniques, specifically Deep Q-Networks (DQN), exploration of automated classification of meaningful sub-goals, and the exploration of various input processing and topological techniques to maximize human-like sensory comparability while minimizing computational and memory complexities, using Minecraft. The somewhat lofty end-goal would be to develop an agent that is capable of surviving in the survival game mode in Minecraft while solving various challenging tasks like sustenance acquisition and consumption, shelter design and use, and potentially even multi-agent cooperation and/or competition.

The research project outlined in this proposal is already somewhat underway. At the time of this proposal, I have spent several weeks familiarizing myself with the Malmö [1] environment and running preliminary tests using some of the more simple-to-implement methods I outline.

The remainder of this proposal is partitioned as follows:

- Section II defines my motivations for proposing such a research project.

- Section III outlines the most significant and relevant literature I investigated in my preliminary research.

- Section IV outlines the research questions I will try to answer as part of this research project.

- Section V defines my hypothesis.

- Section VI defines, in detail, the proposed methodology to be used for the duration of this project.

- Section VII provides a tentative timeline for the project.

- Section VIII outlines the resources required for the project.

- Section IX presents some of the preliminary results I have obtained from the work done on the project so far.

- The References section contains all of the relevant references used when creating this report.

- The Appendices section contains a build guide developed during my preliminary research into this project. The git repo mentioned is currently private, so any requests to clone the repo must be approved by me.

II. MOTIVATIONS

There are several core motivations that drive the proposal and development of this research project. The leading motivations are educational: this project will serve as part of my graduate thesis studies at Missouri State University (MSU). As such, the success or failure of the project will depend largely upon the contribution made to the field and the defense of said contribution as a thesis.

There are 2 primary personal motivations that should be made apparent: first, it is my intention to apply all knowledge and experience gained during this project to a personal, unrelated project called the Omega Gaming Project (OGP). Information related to OGP can be found at https://www.omega-gaming-project.org/. Second, it is also my intention to apply all knowledge and experience gained to the development of real-world AGI agents; given our current limitations in terms of processing power when compared to the processing power of the human brain, it is necessary to demonstrate success in AGI agents in simulated environments, i.e., video games, prior to real-world applications.

III. LITERATURE REVIEW

There is an enormous body of literature that focuses on the use of various machine learning techniques to make more intelligent AI agents. This is both a boon and a challenge regarding the task of reviewing literature prior to making a research proposal; the vast number of works means that there are plenty of different scopes, perspectives, and implementations to consider. That said, there is such a vast array of works that it can be challenging to identify those that are the most applicable and/or relevant to the proposed topic. The works outlined in this literature review are but a few of the works considered; they represent some of the work most relevant to the proposed methods.

Much of the work done in the field of RL in Minecraft is built upon the Malmö [1] project, including the work proposed here. Malmö is a project that Microsoft developed to better assist researchers in the pursuit of RL agents; when combined with CrowdAI’s MarLö project, a project that was built upon the Malmö platform in an effort to make it easier to work with, it becomes a fairly simple task to begin training agents via RL. The MarLö project, as well as associated documentation, can be found at https://github.com/crowdAI/marLo. These 2 tools are representative of the foundation upon which Minecraft RL communities have thrived.

Another piece of the foundation can be found in Observational Learning (OL) techniques, as outlined in [2]. These techniques include using an agent that has a more advanced understanding of the environment as a teacher that will affect the training of the student agent indirectly. That is, the student agent has no understanding of why or how the teacher agent performs tasks, but merely witnesses the effect upon the environment. This is unique in that it is similar to having a labeled dataset to train on, as in Supervised Learning (SL) techniques, but the training is done via RL instead. This technique allows the agent to learn in a way that closely resembles how a human can learn just by watching someone perform a task; the student agent observes how the teacher interacts with the environment surrounding it and attempts to derive a policy that will result in the greatest reward to itself. [2] also shows that this process is not simply Imitation Learning (IL) by giving the student agent a different set of possible actions than the actions given to the teacher.

Figure 1: Literature map of the 5 most relevant papers reviewed, organized by level of simulation complexity. The literature review has grown somewhat since the design of this map; this map represents the core support for the methods proposed by this paper.

Building up the complexity further, an agent referred to as “HogRider” won the Malmö Collaborative AI Challenge in 2017 using a novel DQN and extensive state-action abstraction techniques [3]. By defining the state space as 3 numeric values, 1 defining the state of the target as catchable or not, and 2 representing the distances between the agents and the target, they managed to reduce the possible number of states from 4,057,200 to 60,000. This abstraction is critical as DQNs tend to be more successful when the possible state space is smaller.
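The three-value state described above can be illustrated with a short sketch. This is not the HogRider implementation: the function names, the (x, z) position format, and the use of Manhattan distance are all assumptions made for illustration, since [3] is only summarized here.

```python
from typing import Tuple

Position = Tuple[int, int]  # (x, z) grid cell; a hypothetical representation


def manhattan(a: Position, b: Position) -> int:
    """Grid distance between two cells (an assumed metric, not taken from [3])."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def abstract_state(agent: Position, partner: Position, target: Position,
                   catchable: bool) -> Tuple[int, int, int]:
    """Collapse raw positions into a three-value state:
    (catchable flag, agent-to-target distance, partner-to-target distance)."""
    return (int(catchable), manhattan(agent, target), manhattan(partner, target))


# Two different raw configurations collapse to the same abstract state,
# which is how an abstraction of this kind shrinks the state space.
s1 = abstract_state((0, 0), (4, 4), (2, 1), catchable=True)
s2 = abstract_state((4, 2), (0, 4), (2, 1), catchable=True)
```

Because many raw position combinations share one abstract state, the agent needs far fewer experiences to cover the space, at the cost of discarding details such as direction.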

A 2016 Stanford paper explores the effectiveness of DQNs in a survive-and-kill scenario in Minecraft. The researchers focused upon the use of a local replay memory to help smooth over the time gap between an action and its rightful reward. They utilized Convolutional Neural Networks (CNN) to represent the world states, which made the learning process more challenging. That said, they yielded positive results, showing that it is possible to train an agent to survive by being both aggressive and evasive [4].
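The time gap between an action and its reward can be made concrete with the standard tabular Q-learning update, shown here for reference only. The symbols are conventional notation rather than anything defined in this proposal: $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

A delayed reward reaches earlier state-action pairs only through the discounted term $\gamma \max_{a'} Q(s_{t+1}, a')$, one step per update. Replaying stored transitions gives those earlier pairs repeated chances to receive that propagated value. In a DQN, the table is replaced by a neural network trained to shrink the bracketed difference.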

Finally, the most complex research project considered in this literature review is the project that earned the second runner-up position in the MarLö Challenge in 2018 [5]. This project used the concept of Hierarchical Learning (HL) to create an agent that was more representative of an AGI agent. They did this by training the agent in two stages: first, the agent would attempt to identify the policy to use based upon the visual inputs provided when spawning into the world. Then, based upon the predicted policy, the agent would interact with the world and train said policy further. This required the agent to have policies defined that applied to each of the 3 tasks given in the challenge. While this approach is arguably short-sighted in terms of a large-scale AGI agent, it represents a possible step in the right direction. Potential improvements to this approach are outlined in [11], where the authors successfully determined meaningful sub-goals in reward-sparse Atari games like Montezuma’s Revenge. Games are considered to have sparse rewards if the actions taken by an agent do not yield obvious results in that moment; for example, in Montezuma’s Revenge the agent might need to travel to one side of the map to collect a key that is necessary to exit the level on the other side. Agents tend to fail spectacularly when this is the case. By automatically determining meaningful sub-tasks, the agent becomes more capable of solving overarching complex tasks. That said, a game like Montezuma’s Revenge is still far less complex than the indefinite survival of an agent in Minecraft; part of the challenge will be determining just how far the automatic sub-task determination systems can go.

IV. RESEARCH QUESTIONS

There are numerous questions that this project will strive to answer:

1. How can modern reinforcement learning techniques, namely Deep Q-Networks, be improved upon to produce more successful agents in Minecraft?

2. Is it possible to produce a self-sufficient agent in Minecraft by starting with a simple “Find the Goal” challenge and gradually increasing the complexity of the challenge, ending with the survival game mode, using an automated sub-task determination system?

3. How can the inputs to the agents be optimized to maximize the overall likeness to a human player’s perception of the environment and minimize memory (storage) and computational (time) complexities?

4. To what extent can the training time of Reinforcement Learning agents be reduced using methods such as pre- training using human player data?

Question 1 represents the desired contribution to the field; this project will strive to design and implement an improved DQN model that can more closely represent the level of agent interaction needed to model human players. Question 2 represents the overarching goal of the project; it is very much a desired goal to produce an agent that is capable of surviving without external influence in a Minecraft survival world. This question can be expanded in many ways but will be considered in a simplified scope initially. That is, an agent will be considered successfully surviving if it learns to harvest and consume sustenance while surviving the various dangers present in the Minecraft environment; this process requires an agent to be able to accurately classify and respond to immediate sub-tasks while simultaneously preparing for future needs. Question 3 represents a more explorative goal; one of the vital elements of human-like survival agents is undoubtedly the representation of the world state. This project will strive to define a world state that is similar to the human senses of sight, sound, and touch, while also attempting to minimize the time between agent exposure to the world state and agent determination of an action to take based upon said world state.

V. HYPOTHESIS

There are 2 hypotheses that will be tested during this project:

1. A Deep Q-Network can be made to yield more effective agents when combining world state abstraction, population-wide replay memory that is accessed and updated by agents of each generation (with population size of 1 for single-agent experiments), utilization of an automated sub-task determination system that will essentially direct the agent to perform the tasks necessary to succeed, and by using human player data to jumpstart the learning process.

2. Human-like sensory inputs can be imitated to a degree that will retain information and promote a successful learning process in agents.

Hypothesis 1 represents the proposed method to improve upon existing RL techniques; the goal is to combine some existing methods like world state abstraction and replay memories with the accelerated learning process offered by using human player data as a pseudo-labeled dataset. Hypothesis 2 represents the explorative aspect of this project; the tentative starting point for this hypothesis would be to represent the world state as follows:

- Visual input that takes a low-resolution image of what the agent sees and feeds it through a CNN to produce a simplified representation.

- Audio input that takes the raw audio data output by the Minecraft game, minus the background music, and turns it into a similar simplified representation (the details are still somewhat unclear in this regard and additional research is necessary prior to implementation).

- Touch input that is represented by information regarding the equipment and inventory of the agent, as well as information regarding the type of block it is currently occupying. This input will likely not be necessary until later phases of the project.
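As one illustration of the visual input described above, a frame can be reduced to a low resolution by average pooling before it is passed to a CNN. This is only a sketch under stated assumptions: the proposal does not fix a resolution or a downsampling method, and the `downsample` function and the list-of-rows frame format are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale pixel rows; a hypothetical frame format


def downsample(frame: Frame, factor: int) -> Frame:
    """Average-pool a frame by an integer factor in both dimensions."""
    rows, cols = len(frame), len(frame[0])
    if rows % factor or cols % factor:
        raise ValueError("frame dimensions must be divisible by factor")
    pooled = []
    for r in range(0, rows, factor):
        pooled_row = []
        for c in range(0, cols, factor):
            # Collect the factor-by-factor block whose top-left corner is (r, c).
            block = [frame[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled


frame = [[0, 0, 8, 8],
         [0, 0, 8, 8],
         [4, 4, 2, 2],
         [4, 4, 2, 2]]
small = downsample(frame, 2)  # each 2x2 block becomes one averaged pixel
```

A smaller frame reduces both the memory held per stored observation and the computation per forward pass, which relates directly to Research Question 3.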

VI. PROPOSED METHODOLOGY

A. Study Design

This study will be a quantitative, true-experimental study that will focus on comparing the results of the various attempted methods to determine which method yields the greatest level of success. Furthermore, the various methods will be compared when using various types of world state representation to determine the most human-sensory-like representation that can be applied successfully.

B. People and Roles

At the time of this proposal, there are only 2 roles: the researcher and the research supervisor/advisor. I, Anthony Harris, will serve as the researcher, and Dr. Siming Liu of MSU is currently serving as the research supervisor.

C. Data Collection and Instruments

All data related to the project will be collected using various tools provided by the Malmö and MarLö projects, as well as potentially with visualization tools provided by the TensorFlow and/or Keras libraries. This data will largely be quantitative and will be rooted closely to the performance of the agents being examined. Example data to be collected includes, but is not limited to:

- Time taken for the “fitness” of an agent to reach a predetermined threshold value.

- “Fitness” of an agent or population.

- Comparisons of different world state definitions.

No specialized instruments are required for these data collection tools, with the obvious exception being a computer equipped with said tools.
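The first metric listed above, the time taken for fitness to reach a threshold, could be computed from a per-episode fitness log along the following lines. The function name, the use of a moving average, and the window size are assumptions for illustration; the proposal does not specify how the threshold crossing is detected.

```python
from typing import Optional, Sequence


def episodes_to_threshold(fitness: Sequence[float], threshold: float,
                          window: int = 3) -> Optional[int]:
    """Return the 1-based episode at which the moving-average fitness
    first reaches the threshold, or None if it never does."""
    if window < 1:
        raise ValueError("window must be at least 1")
    for end in range(window, len(fitness) + 1):
        average = sum(fitness[end - window:end]) / window
        if average >= threshold:
            return end
    return None


history = [-1.0, -0.5, 0.0, 0.5, 0.5, 0.5]
first_hit = episodes_to_threshold(history, threshold=0.3, window=3)
```

Averaging over a window keeps a single lucky episode from being counted as convergence, which matters when comparing noisy RL runs.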

D. Data Analysis Methods

The data collected will be analyzed and compared using traditional statistical techniques; there will be no novel statistical analysis.

E. Quality Assurance

All data collected, as well as all data analysis techniques employed, will be made publicly available upon the completion of this project to promote transparency and peer review. Any possible contributions to the field will be reviewed by the appropriate methods prior to the submission of any literary works.

F. Ethical Considerations

There are no ethical concerns that apply directly to this project. Potential ethical concerns might arise if successful results are applied to real-world scenarios, in that the development of functioning AGI in the real-world would potentially result in a tremendous shift in the physical labor requirements, large-scale project planning and analysis, and a multitude of other fields. That said, the application of successful results is dependent upon both the acquisition and application of said successful results and the corresponding hardware/robotics technology required to effectively impact the world around us.

VII. RESEARCH PLAN

The proposed project attempts to address a topic that has baffled researchers for generations; as such, it is meant to be a step in the direction of, rather than the complete development of, AGI. The complexity and challenge of the project play a large role in the proposed timetable; my somewhat limited experience level and the steep learning curve associated with the topic also play a part. The tentative timetable is as follows:

- May 2020

o Initial research proposal

o Outline tentative requirements

o Finalize preliminary phase

- June 2020

o Experiment with different learning models in the “Find the Goal” challenge

o Advance successful models to the “Find the Mob” challenge

o Further define project scope and expected products

o Firmly define desired contribution to the field

- July 2020

o Begin process of developing a literary work based on current progress

o Finalize desired input structure for use in various challenges

o Begin application of potential models in various challenges

The project plan beyond July 2020 depends heavily upon the obtained level of success to that point.

VIII. REQUIRED RESOURCES

There are minimal resources required, but there are several resources that would make the process easier and faster. Each type of resource is outlined below:

- Required

o Computer with Malmö, MarLö, TensorFlow, Keras, Python 3.6 (for Malmö), and any desired coding environment to work in (in my case, Visual Studio Code for Python).

- Optional

o Graphics Processing Unit (GPU) capable of handling large-scale computations efficiently.

o Remote processing to minimize local system requirements.

IX. PRELIMINARY RESULTS

The preliminary results of using DQNs with replay memories and abstracted world state representations are promising. Currently, I am able to demonstrate an agent that is capable of solving the “Find the Goal” challenge in MarLö, although the agent’s movements appear erratic, as though it were drunk. The parameters currently in use are:

- World State

o Abstracted to simply the difference in the X and Z directions (the plane that players can walk along in Minecraft)

- Replay Memory

o 50,000 memories containing the Markov Decision Process (MDP) tuple of World State, Action, Reward, New World State, and a Boolean Terminal State

- Action Space

o 5 possible actions: Move forward, move backward, turn left, turn right, and no action.

- DQN Model

o 3 hidden layers, each with 256 nodes.

- Rewards

o Small negative reward defined by MarLö at each iteration (associated with an energy cost of being alive); approximately -0.01 to -0.02.

o Large positive reward defined by MarLö when the goal is found; approximately 0.5.
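The parameters listed above can be sketched in plain Python. This is a minimal illustration rather than the code in the project repository: the class and function names, the action labels, and the discount factor `gamma=0.99` are assumptions, while the capacity of 50,000, the five-field memory tuple, the five actions, and the X/Z offset state come from the list.

```python
import random
from collections import deque, namedtuple

# The MDP tuple listed above: state, action, reward, new state, terminal flag.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "terminal"])

# Five actions as listed; the label strings themselves are hypothetical.
ACTIONS = ["forward", "backward", "turn_left", "turn_right", "no_op"]


def abstract_state(agent_x, agent_z, goal_x, goal_z):
    """World state as the X and Z offsets from the agent to the goal."""
    return (goal_x - agent_x, goal_z - agent_z)


class ReplayMemory:
    """Fixed-size buffer that drops the oldest memory once full."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *fields):
        self.buffer.append(Transition(*fields))

    def sample(self, batch_size):
        # Never request more transitions than are currently stored.
        count = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), count)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, terminal, gamma=0.99):
    """Q-learning target: the reward alone at a terminal state, otherwise
    the reward plus the discounted best next-state value."""
    return reward if terminal else reward + gamma * max(next_q_values)


# Walk an agent five steps along X toward a goal at (5, 0), using a tiny
# capacity so that eviction of the oldest memories is visible.
memory = ReplayMemory(capacity=3)
for step in range(5):
    state = abstract_state(step, 0, 5, 0)
    next_state = abstract_state(step + 1, 0, 5, 0)
    reached_goal = next_state == (0, 0)
    reward = 0.5 if reached_goal else -0.01
    memory.push(state, ACTIONS.index("forward"), reward, next_state, reached_goal)
```

In a full DQN, `next_q_values` would come from the network's prediction for `next_state`, and the target would be used as the regression label for the action that was taken.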

Currently, the World State has been experimentally adjusted to include the pitch and yaw of the agent in an effort to reduce the drunken-like movements and promote a more direct movement to the target. While solving the “Find the Goal” challenge in MarLö is not a particularly impressive accomplishment, it does show that the general RL algorithm is functional as is, which opens up room for modifications like changing the World State representation in an effort to compare the results.

X. CONCLUSION

There have been several breakthroughs in the field of AGI; big names like Google and Microsoft have been working diligently to solve advanced challenges, yet none have been fully successful. One of the most notable successes in recent history is Google’s implementation of an agent capable of playing several Atari games on a level that is comparable to human players [10]. While this is undoubtedly a remarkable accomplishment, the level of complexity pales in comparison to the complexity of Minecraft; the successful creation and implementation of an AGI agent in a game like Minecraft has the potential to revolutionize the gaming industry, as well as the rest of the world. It is for this reason that I propose further research into the design, development, and implementation of an AI agent in Minecraft using the Malmö and MarLö tools. By starting with a simple task and progressively scaling the task up in terms of complexity, it may be possible to take another step toward a true AGI agent.

REFERENCES

[1] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016, July). The Malmo Platform for Artificial Intelligence Experimentation. In IJCAI (pp. 4246-4247).

[2] Borsa, D., Heess, N., Piot, B., Liu, S., Hasenclever, L., Munos, R., & Pietquin, O. (2019, May). Observational learning by reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (pp. 1117-1124). International Foundation for Autonomous Agents and Multiagent Systems.

[3] Xiong, Y., Chen, H., Zhao, M., & An, B. (2018, April). HogRider: champion agent of Microsoft Malmo collaborative AI challenge. In Thirty-Second AAAI Conference on Artificial Intelligence.

[4] Udagawa, H., Narasimhan, T., & Lee, S. Y. (2016). Fighting zombies in minecraft with deep reinforcement learning. Technical report, Stanford University.

[5] Xu, L., & Chen, Y. (2019, August). A Hierarchical Approach for MARLÖ Challenge. In 2019 IEEE Conference on Games (CoG) (pp. 1-4). IEEE.

[6] D. Johnson and J. Wiles, "Computer games with intelligence," 10th IEEE International Conference on Fuzzy Systems. (Cat. No.01CH37297), Melbourne, Victoria, Australia, 2001, pp. 1355-1358 vol.2.

[7] Stanley, Kenneth O., and Risto Miikkulainen. "Evolving neural networks through augmenting topologies." Evolutionary computation 10.2 (2002): 99-127.

[8] Han, Amy, and Han, Jeremy. "Evolving Mario to Maximize Coin Score Using Neat and Novelty."

[9] L. Cardamone, D. Loiacono, P. L. Lanzi and A. P. Bardelli, "Searching for the optimal racing line using genetic algorithms," Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, Dublin, 2010, pp. 388-394.

[10] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[11] Dann, Michael, Fabio Zambetta, and John Thangarajah. "Deriving Subgoals Autonomously to Accelerate Learning in Sparse Reward Domains." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.

APPENDICES

How to Set Up and Run the Malmo / Marlo Environment (Windows 10)

1. INITIAL SETUP

a. Install Anaconda

2. ANACONDA SETUP

a. In Anaconda

i. Enter “conda install python=3.6”

ii. Enter “conda create python=3.6 --name marlo”

iii. Enter “conda config --add channels conda-forge”

iv. Enter “conda activate marlo”

v. Enter “conda install -c crowdai malmo”

vi. Enter “pip install -U marlo”

vii. Enter “pip install tensorflow”

viii. Enter “pip install keras”

3. TEST INSTALLATION

a. In Anaconda

i. Enter “python -c "import marlo"”

ii. Enter “python -c "from marlo import MalmoPython"”
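The two checks above can also be run from a single script, which is convenient when the environment is rebuilt often. This is a small sketch rather than part of the MarLö tooling; the `can_import` helper is a hypothetical name, and it relies only on Python's built-in import machinery.

```python
def can_import(module_name, attribute=None):
    """Return True if the module imports and, when an attribute is given,
    "from module_name import attribute" would succeed."""
    try:
        # Passing fromlist mirrors the behaviour of a from-import statement.
        module = __import__(module_name, fromlist=[attribute] if attribute else [])
    except ImportError:
        return False
    return attribute is None or hasattr(module, attribute)


if __name__ == "__main__":
    checks = [("marlo", None), ("marlo", "MalmoPython")]
    for module_name, attribute in checks:
        label = module_name if attribute is None else f"{module_name}.{attribute}"
        status = "ok" if can_import(module_name, attribute) else "MISSING"
        print(f"{label}: {status}")
```

Run it with the marlo environment activated; a MISSING line points to the install step in Section 2 that needs to be repeated.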

4. GET THE REPO WITH GIT BASH

a. In Git Bash

i. Navigate to the marlo directory in the anaconda envs

ii. Enter “git init”

iii. Enter “git remote add origin https://github.com/KillerBOB999/MarLoResearch.git”

iv. Enter “git fetch”

v. Enter “git checkout -t origin/master -f”

5. OPEN DIRECTORY WITH VS CODE FOR FILE MANIPULATION AND DEBUGGING

a. In Anaconda

i. Enter “cd %MALMO_MINECRAFT_ROOT%”

ii. Enter “cd ..”

iii. Enter “code Minecraft”

1. This will open a VS Code window with the Minecraft directory set as active

b. In VS Code

i. In the side menu, click “Extensions”

ii. Install the Python extension (ms-python.python)

iii. In the bottom of the window, there should be a python extension added

1. Click it and select the “Python 3.6.7 64-bit ‘marlo’: conda” option

iv. The main files of interest are

1. MAIN.PY

2. MAIN_BOTBRAIN.PY

3. MAIN_HELPERS.PY

4. Lib/site-packages/marlo/base_env_builder.py

v. When you have MAIN.PY as the active file and the client running, you can then run the program

6. START CLIENT IN THE MARLO ANACONDA ENVIRONMENT

a. In Anaconda with marlo environment activated

i. Enter “cd %MALMO_MINECRAFT_ROOT%”

ii. Enter “launchClient.bat”

1. This will likely take a while the first time through as it needs to do some preliminary work. After you successfully build it once, all subsequent builds should take less time.
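Before launching, it can help to confirm that the MALMO_MINECRAFT_ROOT variable is set and that the launch script is where the steps above expect it. The following is a hedged sketch: `find_launcher` is a hypothetical helper, and it assumes only what the steps state, namely that launchClient.bat sits directly inside the directory named by that variable.

```python
import os


def find_launcher(environ=os.environ, script="launchClient.bat"):
    """Return the full path to the launch script, or None if the variable
    is unset or the script is not present in that directory."""
    root = environ.get("MALMO_MINECRAFT_ROOT")
    if not root:
        return None
    candidate = os.path.join(root, script)
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    path = find_launcher()
    print(path if path else "launchClient.bat not found; check MALMO_MINECRAFT_ROOT")
```

If the script prints the not-found message, the marlo environment is probably not activated in the current Anaconda prompt.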

7. DOCUMENTATION

a. https://marlo.readthedocs.io/en/latest/usage/.html