
2017 IEEE International Conference on Big Data (BIGDATA)

978-1-5386-2715-0/17/$31.00 ©2017 IEEE

Improving Cyber-Attack Predictions Through Information Foraging

Adam Dalton, Bonnie Dorr, Leon Liang, Kristy Hollingshead

Florida Institute for Human and Machine Cognition, Ocala, Florida

{adalton, bdorr, lliang, kseitz}@ihmc.us

Abstract—This paper describes how information foraging is useful in the implementation of new algorithms to anticipate cyber attacks. The exploration of publicly available data has been used to predict events in the socio-political domain, but the adversarial and covert behavior of actors in cyber security creates additional challenges. This paper describes a framework for Information Foraging for Algorithm Discovery (IFAD) that addresses standard data-science issues of volume and variety, by balancing human intuition with automation, and thus taking initial steps toward supporting the increasing need for rapid analysis of, and tool development for, big data. Our results demonstrate that cognitive augmentation, and information foraging in particular, is useful in the development of tools to anticipate cyber attacks using publicly available data.

Keywords-information foraging; anticipatory intelligence; predictions; cyber security; threat intelligence;

I. INTRODUCTION

Information foraging strategies can improve the accuracy

of cyber-focused forecasting systems on very large data sets.

Successful risk management in the cyber domain relies on

the ability to understand the range of potential outcomes

and probabilities of each outcome in a vast sea of data.

Improving cognitive performance is central to this objective,

particularly given the infeasibility of exhaustive inspection

of massive and ever-growing information sources.

We present Information Foraging for Algorithm Discovery (IFAD) for contemporary exploration of very large, unconventional data sources, i.e., open source indicators that are

either not directly related to the target or not typically used

to predict certain types of events [1]. The main research

question here concerns the ability to rapidly discover new

algorithmic approaches to anticipating cyber attacks through

information foraging. We make the case that information

foraging can be leveraged to efficiently focus the user’s

attention to appropriate aspects of unconventional resources.

We then present an implementation of an architecture that

combines aspects of a cognitive architecture for information

foraging with an anticipatory intelligence architecture.

Our results demonstrate that cognitive augmentation, and

information foraging in particular, is useful in the devel-

opment of tools to anticipate cyber attacks using publicly

available data. The goal is to provide a framework that

addresses standard data-science issues of volume and variety

by balancing human intuition with automation, and thus

taking initial steps toward supporting the increasing need for

rapid analysis of, and tool development for, big data [2].

The next section reviews related literature to establish the

appropriateness of applying information foraging techniques

to open source anticipatory intelligence systems in the

cyber security domain. The general information foraging

approach and design of the IFAD are then described, with an

application to a well-known cyber attack timeline resource

called Hackmageddon. Strategies for maximizing recall, pre-

cision, and F-score of cyber attack predictions are examined.

Finally, the system developed to transform the environment

and enrich the available information is detailed.

II. BACKGROUND

This section presents related work in open source intel-

ligence and information exploration. Information foraging

is presented in the context of publicly available data envi-

ronments and discovery of new algorithms for anticipating

cyber attacks in the face of massive volumes of data.

A. Open Source Indicators

As long as there is valuable information openly available

online in places like marketplaces, forums, blogs, and social

media via the clear-, dark-, and deep-nets, it makes sense to

create systems to organize that information in order to put

it to use. Below we describe two projects designed with this

goal in mind: EMBERS [3] and STUCCO (implemented in

PACE [4]). IFAD adopts many design decisions from these

earlier efforts, but transcends these in its application to the

domain of anticipatory intelligence for prediction of cyber

attacks from large, unconventional resources.

EMBERS was designed to identify information sources

from large data sets to enable warning generation for well

defined events and to be scored against a gold standard

report generated by a third party, typically with the assis-

tance of human analysts. Within this framework, events that

happened with a degree of regularity were separated from

significant or rare events because of the additional benefit

of being able to forecast surprising events [5].

The STUCCO project (http://stucco.github.io/) had the goal of “quickly putting security events in context,” initially focusing on the efficient discovery of cyber-security concepts in open source information [4]. STUCCO was designed to extract software vulnerabilities and exploits from online forums and marketplaces, but conventional language processing tools did not work because of the unconventional text used on those forums, particularly on the dark web. STUCCO’s creators achieved improvements in the PACE [4] implementation: a training dataset was constructed with minimal supervision from a very small seed of known entities, and this was used to train a Maximum Entropy model that yielded “almost perfect” results in extracting cybersecurity-related entities [6].

EMBERS focuses on recall because it delegates precision

to curated input sources and then runs fusion on the output.

STUCCO, on the other hand, focuses on precision, seeking

to augment the limited availability of attention from human

analysts by avoiding the introduction of multiple sources of

noisy data. Because the domain consists of the entirety of

the web, even low recall will result in a large amount of

additional information.

B. Information Foraging Strategies

IFAD’s focus on the cyber attack domain distinguishes

the approach from those designed to operate in the domain

of societal events. Both domains require big data (e.g.,

GDELT [7], Shodan [8]), but the tools must be designed

quite differently. Civil unrest, protests, strikes, and ‘occupy’

events are often designed with the explicit purpose of

generating awareness, whereas malicious online activity is

usually performed with the goal of remaining undetected [9].

Because the signal for cyber attacks is obfuscated in

big data, foraging for information that supports accurate

forecasts of attacks requires models beyond the conventional

information retrieval and natural language processing mod-

els seen in EMBERS and STUCCO. Two biology-inspired

models of information foraging theory [10, 11] align with the

goal of discovering approaches to anticipating cyber attacks

from unconventional resources: (1) the diet model fits when a set of resources is determined in advance, and the decisions

to be made include which resources to harvest and whether

or not to invest energy into developing improved methods

of harvesting those resources; (2) the patch model fits when the availability of information resources is less predictable.

These two models serve as the basis of IFAD’s design.

IFAD assists its users in deciding whether to pursue

signals that are widespread, easily transformed, and fre-

quently occurring, but low in gained value, or signals that

may be difficult to discover, require intensive computation

to transform, and substantial effort to maintain, but offer

a larger gain of value. Provenance capabilities provided

by IFAD are useful in the lifecycle management of new

algorithms to anticipate cyber attacks.

C. Information Foraging Environment

Information foragers can attempt to enrich the environment by reducing the number of resources considered, by reducing the amount of time it takes to process the information contained in a resource, or by increasing the yield of each specific resource [11]. Interactive information foraging as used in IFAD enriches the environment by automatically tracking paths from one resource to the next and creating a network of resources. When a user designates a resource as a potential lead for finding new predictive features, the yield of that resource is tracked over time and used to score the resource. Using resources’ scores, network analytics can be applied to the tracked paths to identify other potentially high-yield resources.

[Figure 1: architecture diagram. Layer labels: Information Foraging Architecture; Application Architecture; Anticipatory Intelligence Architecture. Component labels: Internet; Web Browser; Interface; WebView; Visualizations; DevTools & Code Editor; IPC; Declarative & Working Memory; Procedural Knowledge; Application State (Structured, Linked Data & Ground Truth); Actions (Web Scraping); Declarative Knowledge (Patch and Diet Models); Procedures (Automated routines to ingest data streams); Enriched Knowledge (Post-processing); Predictions (Data fusion and forecasting).]

Figure 1. The IFAD architecture is informed by anticipatory intelligence and application development.

Visualization and statistical analysis capabilities enable a

user to analyze an extracted data set against a ground truth

report without leaving the information foraging application.

This is important because investigative search requires hu-

man participation [12], and the overhead of moving between

tools is disruptive [13].

Thus, an important part of IFAD’s design is proper

allocation of human attention over the vast amount of infor-

mation available through network protocols. This problem

has proved challenging since at least 1971 when Simon

[14] said “What information consumes is rather obvious:

it consumes the attention of its recipients. Hence a wealth

of information creates a poverty of attention, and a need to

allocate that attention efficiently among the over-abundance

of information sources that might consume it.”

III. INFORMATION FORAGING ENVIRONMENT DESIGN

IFAD is designed to capture the intent of a user when

he/she comes across the information, logging the path that

was taken to get to the information, and structuring the

information contained on the page in such a way that it can

be stored, referenced, and linked to other related datapoints.

IFAD is implemented as a web browser extended according

to the Foraging Architecture illustrated in Figure 1.

Built with Electron, IFAD employs web views as a


window into the internet. Information is consumed by the

user who then ingests it into the application state through

widgets built with React. Declarative and working memory

are managed via the Redux interpretation of the Flux ap-

plication architecture. Procedural knowledge is delegated to

the human user who can leverage standard browser dev tools

and D3 visualizations to improve matches and generate insights. This technology stack enables standalone application building, without requiring separate services such as databases and web servers or any third-party involvement.
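The action-and-reducer style of state management described above can be illustrated with a small sketch. IFAD itself is built on Redux in JavaScript; the Python analogue below only mirrors the pattern, and the action names (`NAVIGATE`, `MARK_LEAD`), the `ForagingState` fields, and the URLs are hypothetical and do not come from the IFAD code base.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ForagingState:
    """Immutable application state: the path taken and the resources marked as leads."""
    path: Tuple[str, ...] = ()
    leads: Tuple[str, ...] = ()


def reducer(state: ForagingState, action: dict) -> ForagingState:
    """Return a new state for each action; unknown actions leave the state unchanged."""
    kind = action["type"]
    if kind == "NAVIGATE":  # hypothetical action: the user visited a page
        return replace(state, path=state.path + (action["url"],))
    if kind == "MARK_LEAD":  # hypothetical action: the user flagged a resource
        return replace(state, leads=state.leads + (action["url"],))
    return state


# Hypothetical foraging session.
actions = [
    {"type": "NAVIGATE", "url": "https://example.org/timeline"},
    {"type": "NAVIGATE", "url": "https://example.org/article"},
    {"type": "MARK_LEAD", "url": "https://example.org/article"},
]
state = ForagingState()
for action in actions:
    state = reducer(state, action)
print(state)
```

Because every state change flows through one pure function, the sequence of actions doubles as a record of the path taken to reach a piece of information, which is the property the design relies on.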

IFAD’s design is a concrete rendering of the following

formalization by Pirolli [11]:

$$\max\left[\frac{\text{Expected value of knowledge gained}}{\text{Cost of interaction}}\right] \tag{1}$$

Specifically, IFAD aims to maximize the numerator—the

expected value of the knowledge gained—by extracting

information such that it can be useful later (storing the path

that was taken to get this information) while minimizing

the need (cost) to recreate the circumstances necessary for

discovery as specified in the denominator.

The user is an information explorer, with a tool that en-

gages seamlessly in recording steps taken and key findings.

Vitally, IFAD is not a hosted service, but a client application

managed and operated entirely by the user. The fruit of

each information foraging session is owned entirely by the

user. The system allows users to share their sessions with

each other, thus adding a social dimension to the augmented

memory described in [15]. The information-rich resources

and strategies discovered by one user can thus immediately

be inherited by any other user.

Previous work on information foraging (e.g., approaches

described in Section II-B) has focused on how improving

the information scent could improve web page design and

utility. By contrast, IFAD is a tactical tool that takes the

lessons learned from building web sites and the insights into

how they’re used, and extends them to compile a knowledge

and belief environment from which predictions about the

real world can be made. The ability to navigate and forage

on the web is a skill on its own, at which some are more

talented than others, but IFAD is designed as a cognitive

orthotic [16], extending, amplifying, and leveraging the

user’s existing capabilities.

IV. METHODOLOGY

This section describes the set-up for application and

evaluation of the IFAD design. The problem space was

reduced to determining the best resource on the web to

use when predicting what cyber attacks will be included

in Hackmageddon timelines. New big data techniques were

needed for task reduction within the cyber-attack domain;

thus, IFAD’s tasks included the following steps:

1) Scrape 2017 timeline content on Hackmageddon from

the beginning of the year through the first half of June;

2) Extract host from each referenced URL in the timeline;

3) Count each host occurrence in each timeline;

4) Run Google’s advanced search to estimate total number

of items published by that host along the timeline;

5) Calculate the rate of information gain for each host;

6) Use information gain rate to suggest information forag-

ing strategy to maximize precision, recall, and F-score.
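Steps 2 and 3 above can be sketched in a few lines using only the Python standard library. This is a minimal illustration, not IFAD's implementation; the URLs are made-up placeholders rather than real timeline entries, and the function name `count_hosts` is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(urls):
    """Extract the host from each referenced URL and count its occurrences."""
    hosts = (urlparse(u).hostname for u in urls)
    # Strings that do not parse to a host (hostname is None) are skipped.
    return Counter(h for h in hosts if h)


# Hypothetical reference links scraped from one timeline.
sample_urls = [
    "https://www.databreaches.net/example-incident-1/",
    "https://www.databreaches.net/example-incident-2/",
    "https://www.hackread.com/example-incident/",
    "not a url",
]
host_counts = count_hosts(sample_urls)
print(host_counts.most_common())
```

The resulting counts play the role of the per-host true positives used in the later steps.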

Two key metrics of concern when evaluating a forecasting

system are the lead time and the utility time. Lead time is the difference between the time when an event was reported

on and the time when a system generates a warning about

the event. Utility time (also known as date quality [5]) is the difference between when an event occurs and when a

system generates a warning. The distinction between these

two metrics is particularly important in a cyber security

context because of the measures that adversaries take in

order to remain undetected for long periods of time [9].
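The difference between the two metrics can be made concrete with a small sketch. The dates are hypothetical, and the sign convention (positive values mean the warning came first) is an assumption, since the definitions above do not fix one.

```python
from datetime import date


def lead_time_days(warning: date, reported: date) -> int:
    """Days between the warning and the public report of the event."""
    return (reported - warning).days


def utility_time_days(warning: date, occurred: date) -> int:
    """Days between the warning and the actual occurrence of the event."""
    return (occurred - warning).days


# Hypothetical event: warned on June 10, occurred June 12, reported June 20.
warning = date(2017, 6, 10)
occurred = date(2017, 6, 12)
reported = date(2017, 6, 20)

lead = lead_time_days(warning, reported)
utility = utility_time_days(warning, occurred)
print(lead, utility)
```

In this made-up case the lead time is much larger than the utility time, which is the situation to expect when an intrusion goes unreported for a while after it happens.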

A. Foraging for Cyber Attack Leading Indicators

By interpreting recall, precision, or F-score as the average information value gained ($g$) from a web resource, the rate of gain of information for different “diet” combinations (the information sources consumed by the system) was calculated using Equation (2), where $R$ is the function for rate of gain, $k$ is the identifier of the resource, $\lambda_i$ is the average rate of encountering resource $i$, $g_i$ is the average information value gained for resources of type $i$, $t_{W_i}$ is the effort required to consume resources of type $i$, and so $g_i(t_{W_i})$ is the cumulative value gained from information resources of type $i$ [11].

$$R(k) = \frac{\sum_{i=1}^{k} \lambda_i g_i}{1 + \sum_{i=1}^{k} \lambda_i t_{W_i}} \tag{2}$$

In order to develop strategies for foraging online resources

to predict timeline entries on Hackmageddon, these values

were mapped to the parameters of Equation (2). Precision,

recall, and F-score were used as representations of the

average information gain (gi) in order to compare how strategies would differ in order to optimize those metrics.

Precision and recall were calculated using the number of

Hackmageddon entries for resource i as the true positives, and Google’s estimate for the total number of results for

each resource i was chosen to represent the total number of retrieved documents for resource i and also λi.

All values for tWi were set to 1 as a simplifying assump- tion. This assumption is justified as the task of assessing

whether a resource could be included in a Hackmageddon

timeline is similar to the search task described in [11]

where workers were skilled in examining search results

and estimating the expected amount of relevant content in

documents. The search domain for Hackmageddon resource

predictions was also simplified using a method similar to

Good-Turing Discounting [17, 18]. Resources that occurred only once across all the combined timelines were aggregated into a meta-resource and used to estimate the resources that have not yet been seen.

Table I. SAMPLE OF ITEMS AVAILABLE FOR INFORMATION DIET

| Host | TP | λ | Precision | Recall | F-Score |
|---|---|---|---|---|---|
| www.databreaches.net | 54 | 368 | 0.15 | 0.13 | 0.14 |
| www.hackread.com | 26 | 400 | 0.07 | 0.06 | 0.06 |
| securityaffairs.co | 27 | 635 | 0.04 | 0.07 | 0.025 |
| krebsonsecurity.com | 9 | 345 | 0.03 | 0.02 | 0.02 |
| Seen Once Aggregate | 85 | 6.6M | 1E-05 | 0.21 | 3E-05 |
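The precision, recall, and F-score columns of Table I can be approximated from the TP and λ columns. The sketch below assumes that λ (Google's result estimate) is the number of retrieved documents and that the recall denominator is the 413 linked entries reported later in the paper; the second assumption is an inference that happens to match the table, not something stated explicitly. The function name `diet_metrics` is illustrative.

```python
TOTAL_ENTRIES = 413  # assumed recall denominator: total linked Hackmageddon entries


def diet_metrics(tp, retrieved, total=TOTAL_ENTRIES):
    """Return (precision, recall, F-score) for one host."""
    precision = tp / retrieved
    recall = tp / total
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# www.databreaches.net row of Table I: TP = 54, lambda = 368.
p, r, f = diet_metrics(54, 368)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.15 0.13 0.14
```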

Using the Optimal Diet Selection Algorithm [11], the resources were ranked according to their profitability ($\pi_i = g_i / t_{W_i}$) in terms of recall, precision, and F-score. A resource was added to the “diet” if the rate of gain for a diet that includes the resource is greater than the profitability of the $(k+1)$st resource, that is, as long as the inequality in Equation (3) is satisfied.

$$R(k) = \frac{\sum_{i=1}^{k} \lambda_i g_i}{1 + \sum_{i=1}^{k} \lambda_i t_{W_i}} > \frac{g_{k+1}}{t_{W_{k+1}}} \tag{3}$$
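As a worked check, the sketch below evaluates both sides of Equation (3) for $k = 1$, using precision as the gain $g$ and the values from Table I for www.databreaches.net and www.hackread.com, with every $t_W$ set to 1 as in the simplifying assumption above. The function names and the tuple layout are illustrative choices, not part of IFAD.

```python
def rate_of_gain(diet):
    """Equation (2): diet is a list of (lam, g, t_w) tuples for the included resources."""
    gain = sum(lam * g for lam, g, _ in diet)
    cost = 1 + sum(lam * t_w for lam, _, t_w in diet)
    return gain / cost


def profitability(g, t_w):
    """pi = g / t_w, the quantity used to rank resources."""
    return g / t_w


# (lambda, precision, t_W) taken from Table I, with t_W = 1.
databreaches = (368, 0.15, 1.0)
hackread = (400, 0.07, 1.0)

r1 = rate_of_gain([databreaches])                # left-hand side of Equation (3), k = 1
pi2 = profitability(hackread[1], hackread[2])    # right-hand side: next-ranked resource
inequality_holds = r1 > pi2
print(round(r1, 4), pi2, inequality_holds)
```

With these inputs the rate of gain for the single-resource diet is 55.2/369, which exceeds the profitability of the next-ranked resource.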

B. Optimal Diet Results

Analyzing recall, precision, and F-score as representations of profitability resulted in two distinct strategies

for predicting Hackmageddon entries. In the cases where

precision or F-score determine profitability, the optimal diet

consisted of only www.databreaches.net. When recall de-

termined profitability, the diet consisted of databreaches.net

and the meta-class that combined the instances of resources

with a single entry.

To illustrate how these strategies can inform predictions,

we built an abstract model of the 2017 June 16-30 and

July 1-15 Hackmageddon timelines based on historical

information to use as a baseline. Open source foraging

techniques were applied by IFAD to perform the information

seeking task on databreaches.net alone. For the baseline

model the distribution for each host was calculated across

all 2017 Hackmageddon timelines prior to the June 16-30

report. This resulted in 11 reports consisting of 413 entries

across 121 host sites. The timelines averaged approximately

37.5 incident reports each, so this number was used to

determine the expected number of attacks reported by each

site according to the historical distribution.

All sites with an expected contribution of fewer than 0.5 attacks were aggregated. This resulted in the expected values seen in Table II, where count is the number of times a host occurs in a Hackmageddon timeline, average is the count divided by 11 (the number of timelines before June 16, 2017), weight is the count divided by 413 (the total number of linked attacks between January 1 and June 16, 2017), and the expected value is the weight multiplied by the 37.5 average attacks per timeline, rounded up to the nearest whole number. When Google was called within IFAD to search results for the named hosts during the June 16-30 timeframe, approximately 60,400 results were returned.

Table II. ANTICIPATED RESULTS FOR HACKMAGEDDON JUNE 16-30 TIMELINE

| Host | Count | Average | Weight | Expected |
|---|---|---|---|---|
| krebsonsecurity.com | 9 | 0.818 | 0.02 | 1 |
| motherboard.vice.com | 15 | 1.363 | 0.04 | 2 |
| news.softpedia.com | 8 | 0.727 | 0.02 | 1 |
| securityaffairs.co | 27 | 2.454 | 0.06 | 3 |
| bleepingcomputer.com | 21 | 1.909 | 0.05 | 2 |
| databreaches.net | 54 | 4.909 | 0.13 | 5 |
| hackread.com | 26 | 2.363 | 0.06 | 3 |
| ibtimes.co.uk | 39 | 3.545 | 0.09 | 4 |
| infosecurity-magazine.com | 9 | 0.818 | 0.02 | 1 |
| reuters.com | 12 | 1.090 | 0.03 | 2 |
| scmagazine.com | 16 | 1.454 | 0.04 | 2 |
| theregister.co.uk | 16 | 1.454 | 0.04 | 2 |
| zdnet.com | 11 | 1 | 0.03 | 1 |
| Rest of Web | 108 | 9 | 0.26 | 10 |

Twenty hosts were represented in the June 16-30 Hackmageddon timeline report. Of these, 8 were expected to appear using the criteria above, an additional 5 had appeared previously but below the cutoff, and 7 appeared for the first time. The results are shown in Table III, where links refers to the number of times the June 16-30 timeline references a given host, published refers to the Google estimate of total hits published by that host in the June 16-30 timeframe (this value is left blank for any host that was not predicted to be included in the timeframe), predicted is the value predicted based on the analysis described above, and premier indicates whether or not this is the first time a site has appeared on a Hackmageddon timeline in 2017.

Table III. ACTUAL RESULTS FOR HACKMAGEDDON JUNE 16-30 TIMELINE

| Hostname | Links | Published | Predicted | Premier |
|---|---|---|---|---|
| scmagazine.com | 4 | 177 | 2 | No |
| bleepingcomputer.com | 4 | 783 | 2 | No |
| databreaches.net | 4 | 66 | 5 | No |
| motherboard.vice.com | 2 | 633 | 2 | No |
| nytimes.com | 2 | | 0 | No |
| bbc.co.uk | 2 | | 0 | No |
| proofpoint.com | 1 | | 0 | Yes |
| reuters.com | 1 | 20000 | 2 | No |
| nydailynews.com | 1 | | 0 | Yes |
| securityaffairs.co | 1 | 95 | 3 | No |
| krebsonsecurity.com | 1 | 54 | 1 | No |
| darkreading.com | 1 | | 0 | No |
| techtalk.pcpitstop.com | 1 | | 0 | Yes |
| nymag.com | 1 | | 0 | Yes |
| miamiherald.com | 1 | | 0 | Yes |
| thetimes.co.uk | 1 | | 0 | No |
| theregister.co.uk | 1 | 569 | 2 | No |
| thehill.com | 1 | | 0 | Yes |
| blog.talosintelligence.com | 1 | | 0 | No |
| bravenewcoin.com | 1 | | 0 | Yes |
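The expected column of Table II follows directly from the definitions given with that table: weight is count divided by 413, and the expected value is the weight times 37.5, rounded up. A minimal sketch, with hypothetical constant and function names, is:

```python
import math

TOTAL_LINKS = 413        # linked attacks between January 1 and June 16, 2017
AVG_PER_TIMELINE = 37.5  # approximate average entries per timeline


def expected_entries(count):
    """Weight the host by its historical share, scale to one timeline, round up."""
    weight = count / TOTAL_LINKS
    return math.ceil(weight * AVG_PER_TIMELINE)


# Counts taken from three rows of Table II.
counts = {"databreaches.net": 54, "reuters.com": 12, "zdnet.com": 11}
expected = {host: expected_entries(c) for host, c in counts.items()}
print(expected)  # {'databreaches.net': 5, 'reuters.com': 2, 'zdnet.com': 1}
```

Rounding up rather than to the nearest integer is why a host such as reuters.com, with a scaled value just above 1, is assigned an expected value of 2.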

Alternatively, an information worker was tasked with reviewing IFAD’s Google search results for databreaches.net on dates between June 1-15, 2017. The worker identified 11 out of 50 sites as potential entries. In actuality, the June 16-30 timeline included 4 entries from

databreaches.net. Two of the 11 candidates turned out to be

true positives, one record was falsely scored as a negative,

and one Hackmageddon entry was not returned by Google.

Limiting the prediction domain to databreaches.net results

in a recall of 0.5 for databreaches.net entries and 0.6 overall

and a precision of 0.18 for both databreaches.net and overall.

Since all other hosts in the timeline contained fewer links

than published articles during the June 16-30 time frame,

it is reasonable to assume that any improvements to recall

gained by simply including more hosts for analysis would

come at a significant cost in terms of precision and effort.

Thus, in order to efficiently improve prediction performance,

the environment must be enriched so that either it takes

less work to extract value from a resource, or unproductive

resources are filtered out. The next section describes how

the system facilitates such environmental enrichment.

C. Enrichment

The setup above enabled enrichment of the information environment and implementation of strategies with improved production. The goal was to reduce the amount of energy expended in information foraging tasks while maintaining a high expected gain. To illustrate our approach, we used IFAD to enrich hackread.com, taking a functional reactive programming approach that splits memory into declarative and procedural structures [11]. Elements of the EMBERS system architecture [19] were also used to account for the task model. The software itself uses the Redux (http://redux.js.org) implementation of the Flux (http://facebook.github.io/flux/) application architecture.

1) Declarative Knowledge

Declarative knowledge contains the type of information that can be stored in a knowledge graph. Conceptual chunks are maintained with semantic links between them. The Redux data flow approach, based on the Flux application architecture, was used to implement declarative memory.

2) Procedural Knowledge

Procedural knowledge contains memories of how to do things. It is a representation of the skills possessed by a cognitive system. This was implemented in two ways in the IFAD design. First, repeatable actions such as navigating the web and interacting with the webview were stored as actions and reducers. Second, unanticipated or improvised behavior was incorporated into the system by taking advantage of the development environment made available in browsers via webtools. Redux state, and therefore declarative knowledge, was accessible thanks to the redux-devtools project (https://github.com/gaearon/redux-devtools).

3) Enrichment Walkthrough

Despite being the second most productive resource in terms of precision and F-score as shown in Table I, the www.hackread.com resource is not included in any of the strategies produced by the optimal diet algorithm because its rate of gain is not justified by the cost of effort and opportunity. To expand the diet beyond databreaches.net, other resources like hackread.com need to be enriched.

Enrichment shapes the environment to increase the yield

of a given strategy [11]. IFAD provides the ability to orga-

nize data contained in documents on the web and analyze

these to gain insight, and extract patterns in order to improve

the proportion of hackread.com resources that are included

in Hackmageddon timelines. The approach involved ex-

traction of all href values from links in Hackmageddon

timelines. From these, the hostname was parsed, and a filter

was applied to find links to hackread.com. This resulted in

26 documents, of which 10 were randomly selected. Each

article was visited and the article text extracted.

IFAD uses human-in-the-loop detection of webpage con-

tent rather than relying on automatic classifiers such as

Kohlschütter et al. [20]. Chrome DevTools were used to investigate important webpage elements. JavaScript commands

executed via inter-process communication (IPC) extracted

interesting content from public web sites and loaded them

into a structured state within IFAD’s working memory.

The main content was selected from hackread.com pages. The ten pages linked from Hackmageddon formed the HACK class, and the first ten Google search results between January 1, 2017 and May 31, 2017 formed the BASE class. A language model was constructed by counting the terms in both the HACK and BASE corpora, after downcasing the text and removing stopwords.

Next, the TF-IDF score was calculated for each term, with the HACK and BASE classes each treated as one of the two documents. The top 30 scoring terms were then extracted

from the HACK class and used to refine the search. Without

the refinement, Google returned 47 results for hackread.com

between June 1-15, 2017. Only 13 results were returned

when the results included the term refinements, a 72% reduc-

tion. Importantly, the June 1-15 Hackmageddon timeframe

included one reference to hackread.com, and that resource

was also included in the refined search results.
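The term-selection step can be sketched as follows. The paper does not state which TF-IDF variant was used, so the weighting here (raw term count times log(N/df) over the two class documents) is an assumption, as are the tiny stopword list, the function names, and the two sample texts, which are invented for illustration.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "was", "on"}  # tiny illustrative list


def tokenize(text):
    """Downcase, split into alphanumeric tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def top_terms(target, other, n=30):
    """Rank terms of the target document by TF-IDF against a two-document corpus."""
    docs = [Counter(tokenize(target)), Counter(tokenize(other))]
    scores = {}
    for term, count in docs[0].items():
        df = sum(1 for d in docs if term in d)
        scores[term] = count * math.log(len(docs) / df)  # zero when both classes share the term
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, score in ranked[:n] if score > 0]


# Invented stand-ins for the HACK and BASE corpora.
hack_text = "Hackers leaked the database of breached accounts; breached data was dumped"
base_text = "Hackers released a review of the new phone and the database app"
terms = top_terms(hack_text, base_text, n=3)
print(terms)

# Reduction in results reported above: 47 unrefined versus 13 refined.
reduction = (47 - 13) / 47
print(f"{reduction:.0%}")
```

With only two documents, any term that appears in both classes scores zero, so the ranking keeps only terms that are distinctive of the HACK class.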

V. RESULTS

The results discussed above do not alter the Optimal Diet

Strategy determined by Equation (3) because if all resources

required the same amount of energy, then the optimal diet

would almost always be consumption of only the resource

offering the highest gain. However, the R(k) change for hackread.com from 0.09 to 0.12 narrows the gap between it

and databreaches.net’s score of 0.13. This means that even

enrichments that are only able to reduce the work required

to consume hackread.com resources by a small amount can

still make it a viable resource.

VI. CONCLUSION

IFAD is a tool that supports data exploration with the

goal of discovering publicly available resources that can be

used to predict future cyber attacks. The system is built


on the concept of information foraging and the design

decisions it incorporates are adapted from previous open

source intelligence tools such as EMBERS and STUCCO.

The system is distinct from other open source intelligence

systems because it prioritizes rapid feedback, exploratory

search, and the discovery of unconventional sources for

overcoming the challenges of concealment that come with

an adversarial domain such as cyber security. The ability to

enrich data sources to improve information gain was demon-

strated. However, in order to be effective at constructing

information exploration strategies, more research is needed

to understand the cost of information extraction.

Future iterations will include improvements to the inter-

face that make it easier to maximize Equation (2). Improve-

ments to the expected value of knowledge gained will be

incorporated by increasing the analytical toolset available

within the system, using links to declarative knowledge and

application state. Similarly, new procedural knowledge can

be used to decrease the cost of interaction by increasing

the integration of browser developer tools, making it easier to extract semantic meaning contained in a web page.

A positive side effect of sustained usage of the system will

be an annotated corpus on any topic the user is exploring.

The Hackmageddon example described in this paper, for

example, resulted in a small corpus of documents labeled

according to whether they were included in a Hackmageddon

timeline or not. When the system is used to discover

unconventional resources that enable cyber-attack prediction

more generally, the corpus will grow in size and the anno-

tations will become more granular, possibly aiding in the

construction of language models for a multitude of purposes.

Discovering resources that anticipate cyber attacks in

publicly available data is difficult because of the growth rate

of the search domain and the adaptability of attackers. The

implementation of information foraging strategies balances

human intuition with automation so that a wide variety

of new concepts from high-volume data collections can be

quickly tested for veracity.

ACKNOWLEDGMENT

This research is supported by the Office of the Director of

National Intelligence (ODNI) and the Intelligence Advanced Re-

search Projects Activity (IARPA) via the Air Force Research

Laboratory (AFRL) contract number FA875016C0114. The U.S.

Government is authorized to reproduce and distribute reprints for

Governmental purposes notwithstanding any copyright annotation

thereon. Disclaimer: The views and conclusions contained herein

are those of the authors and should not be interpreted as necessarily

representing the official policies or endorsements, either expressed

or implied, of ODNI, IARPA, AFRL, or the U.S. Government.

REFERENCES

[1] A. Okutan, S. J. Yang, and K. McConky, “Predicting cyber attacks with Bayesian networks using unconventional signals,” in Proceedings of the 12th Annual Conference on Cyber and Information Security Research. ACM, 2017, p. 13.

[2] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, Tech. Rep., May, 2011.

[3] A. Doyle, G. Katz, K. Summers, C. Ackermann, I. Zavorin, Z. Lim, S. Muthiah, P. Butler, N. Self, L. Zhao et al., “Forecasting significant societal events using the EMBERS streaming predictive analytics system,” Big Data, vol. 2, no. 4, pp. 185–195, 2014.

[4] N. McNeil, R. A. Bridges, M. D. Iannacone, B. Czejdo, N. Perez, and J. R. Goodall, “PACE: Pattern accurate computationally efficient bootstrapping for timely discovery of cyber-security concepts,” in Proceedings of the 12th International Conference on Machine Learning and Applications (ICMLA), vol. 2. IEEE, 2013, pp. 60–65.

[5] S. Muthiah, P. Butler, R. P. Khandpur, P. Saraf, N. Self, A. Rozovskaya, L. Zhao, J. Cadena, C.-T. Lu, A. Vullikanti, A. Marathe, K. Summers, G. Katz, A. Doyle, J. Arredondo, D. K. Gupta, D. Mares, and N. Ramakrishnan, “EMBERS at 4 years: Experiences operating an open source indicators forecasting system,” in Proceedings of the 22nd ACM SIGKDD. New York, NY, USA: ACM, 2016, pp. 205–214.

[6] R. A. Bridges, C. L. Jones, M. D. Iannacone, K. M. Testa, and J. R. Goodall, “Automatic labeling for entity extraction in cyber security,” arXiv preprint arXiv:1308.4941, 2013.

[7] K. Leetaru and P. A. Schrodt, “GDELT: Global data on events, location, and tone,” in ISA Annual Convention, 2013.

[8] J. Matherly, “Shodan,” 2017, https://www.shodan.io/ [Accessed: November 2017].

[9] R. Bejtlich, The Tao of network security monitoring: beyond intrusion detection. Pearson Education, 2004.

[10] D. W. Stephens and J. R. Krebs, Foraging theory. Princeton University Press, 1986.

[11] P. Pirolli, Information foraging theory: Adaptive interaction with information. Oxford University Press, 2007.

[12] G. Marchionini, “Exploratory search: from finding to understanding,” Communications of the ACM, vol. 49, no. 4, pp. 41–46, 2006.

[13] J. Heer, B. Shneiderman, and C. Park, “A taxonomy of tools that support the fluent and flexible use of visualizations,” ACM Queue, vol. 10, no. 2, pp. 1–26, 2012.

[14] H. A. Simon, “Designing organizations for an information-rich world,” Computers, communications, and the public interest, vol. 4, pp. 37–53, 1971.

[15] V. Bush, “As we may think,” The Atlantic Monthly, vol. 176, no. 1, pp. 101–108, 1945.

[16] K. M. Ford, P. J. Hayes, C. Glymour, and J. Allen, “Cognitive orthoses: toward human-centered AI,” AI Magazine, vol. 36, no. 4, pp. 5–8, 2015.

[17] I. J. Good, “The population frequencies of species and the estimation of population parameters,” Biometrika, vol. 40, no. 3-4, pp. 237–264, 1953.

[18] D. Jurafsky, Speech & Language Processing. Pearson Education India, 2000.

[19] N. Ramakrishnan, P. Butler, S. Muthiah, N. Self, R. Khandpur, P. Saraf, W. Wang, J. Cadena, A. Vullikanti, and G. Korkmaz, “‘Beating the news’ with EMBERS: forecasting civil unrest using open source indicators,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1799–1808.

[20] C. Kohlschütter, P. Fankhauser, and W. Nejdl, “Boilerplate detection using shallow text features,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010, pp. 441–450.
