2017 IEEE International Conference on Big Data (BIGDATA)
978-1-5386-2715-0/17/$31.00 ©2017 IEEE 4642
Improving Cyber-Attack Predictions Through Information Foraging
Adam Dalton, Bonnie Dorr, Leon Liang, Kristy Hollingshead
Florida Institute for Human and Machine Cognition Ocala, Florida
{adalton, bdorr, lliang, kseitz}@ihmc.us
Abstract—This paper describes how information foraging is useful in the implementation of new algorithms to anticipate cyber attacks. The exploration of publicly available data has been used to predict events in the socio-political domain, but the adversarial and covert behavior of actors in cyber security creates additional challenges. This paper describes a framework for Information Foraging for Algorithm Discovery (IFAD) that addresses standard data-science issues of volume and variety, by balancing human intuition with automation, and thus taking initial steps toward supporting the increasing need for rapid analysis of, and tool development for, big data. Our results demonstrate that cognitive augmentation, and information foraging in particular, is useful in the development of tools to anticipate cyber attacks using publicly available data.
Keywords-information foraging; anticipatory intelligence; predictions; cyber security; threat intelligence;
I. INTRODUCTION
Information foraging strategies can improve the accuracy
of cyber-focused forecasting systems on very large data sets.
Successful risk management in the cyber domain relies on
the ability to understand the range of potential outcomes
and probabilities of each outcome in a vast sea of data.
Improving cognitive performance is central to this objective,
particularly given the infeasibility of exhaustive inspection
of massive and ever-growing information sources.
We present Information Foraging for Algorithm Discovery (IFAD) for contemporary exploration of very large, unconventional data sources, i.e., open source indicators that are
either not directly related to the target or not typically used
to predict certain types of events [1]. The main research
question here concerns the ability to rapidly discover new
algorithmic approaches to anticipating cyber attacks through
information foraging. We make the case that information
foraging can be leveraged to efficiently focus the user’s
attention to appropriate aspects of unconventional resources.
We then present an implementation of an architecture that
combines aspects of a cognitive architecture for information
foraging with an anticipatory intelligence architecture.
Our results demonstrate that cognitive augmentation, and
information foraging in particular, is useful in the devel-
opment of tools to anticipate cyber attacks using publicly
available data. The goal is to provide a framework that
addresses standard data-science issues of volume and variety
by balancing human intuition with automation, and thus
taking initial steps toward supporting the increasing need for
rapid analysis of, and tool development for, big data [2].
The next section reviews related literature to establish the
appropriateness of applying information foraging techniques
to open source anticipatory intelligence systems in the
cyber security domain. The general information foraging
approach and design of the IFAD are then described, with an
application to a well-known cyber attack timeline resource
called Hackmageddon. Strategies for maximizing recall, pre-
cision, and F-score of cyber attack predictions are examined.
Finally, the system developed to transform the environment
and enrich the available information is detailed.
II. BACKGROUND
This section presents related work in open source intel-
ligence and information exploration. Information foraging
is presented in the context of publicly available data envi-
ronments and discovery of new algorithms for anticipating
cyber attacks in the face of massive volumes of data.
A. Open Source Indicators
As long as there is valuable information openly available
online in places like marketplaces, forums, blogs, and social
media via the clear-, dark-, and deep-nets, it makes sense to
create systems to organize that information in order to put
it to use. Below we describe two projects designed with this
goal in mind: EMBERS [3] and STUCCO (implemented in
PACE [4]). IFAD adopts many design decisions from these
earlier efforts, but transcends these in its application to the
domain of anticipatory intelligence for prediction of cyber
attacks from large, unconventional resources.
EMBERS was designed to identify information sources
from large data sets to enable warning generation for well
defined events and to be scored against a gold standard
report generated by a third party, typically with the assis-
tance of human analysts. Within this framework, events that
happened with a degree of regularity were separated from
significant or rare events because of the additional benefit
of being able to forecast surprising events [5].
The STUCCO project¹ had the goal of “quickly putting security events in context,” initially focusing on the efficient discovery of cyber-security concepts in open source information [4]. STUCCO was designed to extract software vulnerabilities and exploits from online forums and marketplaces, but conventional language processing tools did not work because of the unconventional text used on those forums, particularly on the dark web. The STUCCO creators achieved improvements in the PACE [4] implementation: a training dataset was constructed with minimal supervision from a very small seed of known entities, and this was used to train a Maximum Entropy model that yielded “almost perfect” results in extracting cybersecurity-related entities [6].
¹ http://stucco.github.io/
EMBERS focuses on recall because it delegates precision
to curated input sources and then runs fusion on the output.
STUCCO, on the other hand, focuses on precision, seeking
to augment the limited availability of attention from human
analysts by avoiding the introduction of multiple sources of
noisy data. Because the domain consists of the entirety of
the web, even low recall will result in a large amount of
additional information.
B. Information Foraging Strategies
IFAD’s focus on the cyber attack domain distinguishes
the approach from those designed to operate in the domain
of societal events. Both domains require big data (e.g.,
GDELT [7], Shodan [8]), but the tools must be designed
quite differently. Civil unrest, protests, strikes, and ‘occupy’
events are often designed with the explicit purpose of
generating awareness, whereas malicious online activity is
usually performed with the goal of remaining undetected [9].
Because the signal for cyber attacks is obfuscated in
big data, foraging for information that supports accurate
forecasts of attacks requires models beyond the conventional
information retrieval and natural language processing mod-
els seen in EMBERS and STUCCO. Two biology-inspired
models of information foraging theory [10, 11] align with the
goal of discovering approaches to anticipating cyber attacks
from unconventional resources: (1) the diet model fits when a set of resources is determined in advance, and the decisions
to be made include which resources to harvest and whether
or not to invest energy into developing improved methods
of harvesting those resources; (2) the patch model fits when the availability of information resources is less predictable.
These two models serve as the basis of IFAD’s design.
IFAD assists its users in deciding whether to pursue
signals that are widespread, easily transformed, and fre-
quently occurring, but low in gained value, or signals that
may be difficult to discover, require intensive computation
to transform, and substantial effort to maintain, but offer
a larger gain of value. Provenance capabilities provided
by IFAD are useful in the lifecycle management of new
algorithms to anticipate cyber attacks.
C. Information Foraging Environment
Information foragers can attempt to enrich the environment by reducing the number of resources considered, by
reducing the amount of time it takes to process the informa-
tion contained in a resource, or by increasing the yield of
each specific resource [11]. Interactive information foraging
as used in IFAD enriches the environment by automatically tracking paths from one resource to the next and creating a network of resources. When a user designates a resource as a potential lead for finding new predictive features, then the yield of that resource will be tracked over time and used to score the resource. Using resources’ scores, network analytics can be applied to the tracked paths to identify other potentially high-yield resources.
[Figure 1. The IFAD architecture is informed by anticipatory intelligence and application development. Columns: Information Foraging Architecture; Application Architecture; Anticipatory Intelligence Architecture.]
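The path tracking and scoring described above can be pictured with a small sketch. The paper does not specify the data structure or the network analytics used, so everything here is an assumption for illustration: the `ForagingGraph` class, its method names, and the simple rule of ranking an unscored resource by the mean score of the scored pages that led to it.

```python
from collections import defaultdict


class ForagingGraph:
    """Hypothetical graph of visited resources with yields tracked for leads."""

    def __init__(self):
        self.edges = defaultdict(set)    # source URL -> URLs reached from it
        self.yields = defaultdict(list)  # lead URL -> yield observations over time

    def record_path(self, source, target):
        """Log that the user navigated from `source` to `target`."""
        self.edges[source].add(target)

    def record_yield(self, lead, value):
        """Append one yield observation for a resource marked as a lead."""
        self.yields[lead].append(value)

    def score(self, resource):
        """Mean observed yield, or 0.0 when nothing has been observed."""
        observed = self.yields.get(resource, [])
        return sum(observed) / len(observed) if observed else 0.0

    def suggest(self):
        """Rank unscored resources by the mean score of scored pages linking to them."""
        inbound = defaultdict(list)
        for source, targets in self.edges.items():
            if source in self.yields:
                for target in targets:
                    if target not in self.yields:
                        inbound[target].append(self.score(source))
        ranked = {t: sum(s) / len(s) for t, s in inbound.items()}
        return sorted(ranked.items(), key=lambda item: item[1], reverse=True)


# Usage with placeholder resource names.
graph = ForagingGraph()
graph.record_path("lead-a", "page-b")
graph.record_path("lead-a", "page-c")
graph.record_path("lead-d", "page-c")
graph.record_yield("lead-a", 0.2)
graph.record_yield("lead-a", 0.4)
graph.record_yield("lead-d", 0.1)
suggestions = graph.suggest()
```

A production system would likely use a richer centrality or propagation measure; the neighbor average is only the simplest rule that uses both the tracked paths and the scores.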
Visualization and statistical analysis capabilities enable a
user to analyze an extracted data set against a ground truth
report without leaving the information foraging application.
This is important because investigative search requires hu-
man participation [12], and the overhead of moving between
tools is disruptive [13].
Thus, an important part of IFAD’s design is proper
allocation of human attention over the vast amount of infor-
mation available through network protocols. This problem
has proved challenging since at least 1971 when Simon
[14] said “What information consumes is rather obvious:
it consumes the attention of its recipients. Hence a wealth
of information creates a poverty of attention, and a need to
allocate that attention efficiently among the over-abundance
of information sources that might consume it.”
III. INFORMATION FORAGING ENVIRONMENT DESIGN
IFAD is designed to capture the intent of a user when
he/she comes across the information, logging the path that
was taken to get to the information, and structuring the
information contained on the page in such a way that it can
be stored, referenced, and linked to other related datapoints.
IFAD is implemented as a web browser extended according
to the Foraging Architecture illustrated in Figure 1.
Built with Electron, IFAD employs web views as a
window into the internet. Information is consumed by the
user who then ingests it into the application state through
widgets built with React. Declarative and working memory
are managed via the Redux interpretation of the Flux ap-
plication architecture. Procedural knowledge is delegated to
the human user who can leverage standard browser dev tools
and D3 visualizations to improve matches and generate in-
sights. This technology stack enables standalone application
building, without requiring separate services such as databases or web servers, or any third-party involvement.
IFAD’s design is a concrete rendering of the following
formalization by Pirolli [11]:
$$\max\left[\frac{\text{Expected value of knowledge gained}}{\text{Cost of interaction}}\right] \tag{1}$$
Specifically, IFAD aims to maximize the numerator—the
expected value of the knowledge gained—by extracting
information such that it can be useful later (storing the path
that was taken to get this information) while minimizing
the need (cost) to recreate the circumstances necessary for
discovery as specified in the denominator.
The user is an information explorer, with a tool that en-
gages seamlessly in recording steps taken and key findings.
Vitally, IFAD is not a hosted service, but a client application
managed and operated entirely by the user. The fruit of
each information foraging session is owned entirely by the
user. The system allows users to share their sessions with
each other, thus adding a social dimension to the augmented
memory described in [15]. The information-rich resources
and strategies discovered by one user can thus immediately
be inherited by any other user.
Previous work on information foraging (e.g., approaches
described in Section II-B) has focused on how improving
the information scent could improve web page design and
utility. By contrast, IFAD is a tactical tool that takes the
lessons learned from building web sites and the insights into
how they’re used, and extends them to compile a knowledge
and belief environment from which predictions about the
real world can be made. The ability to navigate and forage
on the web is a skill on its own, at which some are more
talented than others, but IFAD is designed as a cognitive
orthotic [16], extending, amplifying, and leveraging the
user’s existing capabilities.
IV. METHODOLOGY
This section describes the set-up for application and
evaluation of the IFAD design. The problem space was
reduced to determining the best resource on the web to
use when predicting what cyber attacks will be included
in Hackmageddon timelines. New big data techniques were
needed for task reduction within the cyber-attack domain;
thus, IFAD’s tasks included the following steps:
1) Scrape 2017 timeline content on Hackmageddon from
the beginning of the year through the first half of June;
2) Extract host from each referenced URL in the timeline;
3) Count each host occurrence in each timeline;
4) Run Google’s advanced search to estimate total number
of items published by that host along the timeline;
5) Calculate the rate of information gain for each host;
6) Use information gain rate to suggest information forag-
ing strategy to maximize precision, recall, and F-score.
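Steps 2 and 3 above can be sketched with the Python standard library. This is a minimal illustration rather than IFAD's own code (IFAD is built on Electron and JavaScript); the function name `count_hosts` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(urls):
    """Count how often each host appears in a list of referenced URLs."""
    hosts = (urlparse(url).hostname for url in urls)
    # Strings without a parseable host yield None and are skipped.
    return Counter(host for host in hosts if host)


# Placeholder links standing in for the references scraped from one timeline.
timeline_links = [
    "https://www.databreaches.net/example-post-1/",
    "https://www.databreaches.net/example-post-2/",
    "https://www.hackread.com/example-post/",
]
host_counts = count_hosts(timeline_links)
```

Keeping the counts in a `Counter` makes it straightforward to sum counts across timelines with `+` before the rate of information gain is computed in step 5.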
Two key metrics of concern when evaluating a forecasting
system are the lead time and the utility time. Lead time is the difference between the time when an event was reported
on and the time when a system generates a warning about
the event. Utility time (also known as date quality [5]) is the difference between when an event occurs and when a
system generates a warning. The distinction between these
two metrics is particularly important in a cyber security
context because of the measures that adversaries take in
order to remain undetected for long periods of time [9].
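The two metrics can be made concrete with a short sketch. The text defines each as a difference but does not fix a sign convention, so the choice below (positive values mean the warning came first) is an assumption, as are the function names and the example dates.

```python
from datetime import date


def lead_time(reported_on, warned_on):
    """Days between the warning and the public report (positive: warning came first)."""
    return (reported_on - warned_on).days


def utility_time(occurred_on, warned_on):
    """Days between the warning and the event itself (positive: warning came first)."""
    return (occurred_on - warned_on).days


# Hypothetical incident: the breach happens, a warning follows nine days later,
# and the breach is publicly reported ten days after the warning.
occurred = date(2017, 6, 1)
warned = date(2017, 6, 10)
reported = date(2017, 6, 20)

lead = lead_time(reported, warned)        # 10: the warning beat the report
utility = utility_time(occurred, warned)  # -9: the warning trailed the event
```

The example shows why the distinction matters in this domain: a system can beat the news by a wide margin and still warn only after the attack has already taken place.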
A. Foraging for Cyber Attack Leading Indicators
By interpreting recall, precision, or F-score as the average information value gained (g) by a web resource, the rate of gain of information for different “diet” combinations (the information sources consumed by the system) was calculated using Equation (2), where R is the function for rate of gain, k is the index of the last resource included in the diet (with resources numbered in rank order), λi is the average rate of encountering resource i, gi is the average information value gained for resources of type i, and tWi is the effort required to consume resources of type i, so that gi(tWi) is the cumulative value gained from information resources of type i [11].
$$R(k) = \frac{\sum_{i=1}^{k} \lambda_i g_i}{1 + \sum_{i=1}^{k} \lambda_i t_{W_i}} \tag{2}$$
In order to develop strategies for foraging online resources
to predict timeline entries on Hackmageddon, these values
were mapped to the parameters of Equation (2). Precision,
recall, and F-score were used as representations of the
average information gain (gi) in order to compare how strategies would differ in order to optimize those metrics.
Precision and recall were calculated using the number of
Hackmageddon entries for resource i as the true positives, and Google’s estimate for the total number of results for
each resource i was chosen to represent the total number of retrieved documents for resource i and also λi.
All values for tWi were set to 1 as a simplifying assumption. This assumption is justified as the task of assessing
whether a resource could be included in a Hackmageddon
timeline is similar to the search task described in [11]
where workers were skilled in examining search results
and estimating the expected amount of relevant content in
documents. The search domain for Hackmageddon resource
predictions was also simplified using a method similar to
Good-Turing Discounting [17, 18]. Resources that occurred only once in all the combined timelines were aggregated into a meta-resource and used to estimate the resources that have not yet been seen.
Table I. SAMPLE OF ITEMS AVAILABLE FOR INFORMATION DIET
| Host | TP | λ | Precision | Recall | F-Score |
|---|---|---|---|---|---|
| www.databreaches.net | 54 | 368 | 0.15 | 0.13 | 0.14 |
| www.hackread.com | 26 | 400 | 0.07 | 0.06 | 0.06 |
| securityaffairs.co | 27 | 635 | 0.04 | 0.07 | 0.025 |
| krebsonsecurity.com | 9 | 345 | 0.03 | 0.02 | 0.02 |
| Seen Once Aggregate | 85 | 6.6M | 1E-05 | 0.21 | 3E-05 |
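The metric columns of Table I can be approximated from the TP and λ columns with the standard definitions. In this sketch the recall denominator is assumed to be the 413 linked timeline entries reported in Section IV-B; the paper does not state the denominator explicitly, but that value is consistent with the recall figures shown. The function name is hypothetical.

```python
def precision_recall_f(true_positives, retrieved, relevant):
    """Standard precision, recall, and balanced F-score from raw counts."""
    precision = true_positives / retrieved
    recall = true_positives / relevant
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# www.databreaches.net: TP = 54, Google estimate (λ) = 368.
# ASSUMPTION: 413 relevant entries, taken from the count reported in Section IV-B.
precision, recall, f_score = precision_recall_f(54, 368, 413)
rounded = (round(precision, 2), round(recall, 2), round(f_score, 2))
# rounded == (0.15, 0.13, 0.14), matching the first row of Table I
```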
Using the Optimal Diet Selection Algorithm [11], the resources were ranked according to their profitability (πi = gi/tWi) in terms of recall, precision, and F-score. Resources were added to the “diet” in decreasing order of profitability until the rate of gain for a diet of the top k resources exceeded the profitability of the (k+1)st resource, that is, until the inequality in Equation (3) was satisfied.
$$R(k) = \frac{\sum_{i=1}^{k} \lambda_i g_i}{1 + \sum_{i=1}^{k} \lambda_i t_{W_i}} > \frac{g_{k+1}}{t_{W_{k+1}}} \tag{3}$$
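The selection procedure can be sketched as follows. This follows the textbook formulation of the diet model, in which resources are added in decreasing order of profitability until Equation (3) first holds; whether the authors' implementation matches it in every detail is an assumption. The tuple layout and function names are hypothetical, and the sample values are the λ and precision columns of Table I with every tWi set to 1.

```python
def rate_of_gain(diet):
    """Equation (2) for a diet given as (name, lam, gain, cost) tuples."""
    total_gain = sum(lam * gain for _, lam, gain, _ in diet)
    total_cost = sum(lam * cost for _, lam, _, cost in diet)
    return total_gain / (1 + total_cost)


def optimal_diet(resources):
    """Add resources by decreasing profitability until Equation (3) holds."""
    ranked = sorted(resources, key=lambda r: r[2] / r[3], reverse=True)
    diet = []
    for k, resource in enumerate(ranked):
        diet.append(resource)
        if k + 1 == len(ranked):
            break  # nothing left to compare against
        next_profitability = ranked[k + 1][2] / ranked[k + 1][3]
        if rate_of_gain(diet) > next_profitability:
            break  # the next resource would lower the overall rate of gain
    return [name for name, *_ in diet]


# (host, λ, precision as g, tW = 1) from Table I
TABLE_I_PRECISION = [
    ("www.databreaches.net", 368, 0.15, 1),
    ("www.hackread.com", 400, 0.07, 1),
    ("securityaffairs.co", 635, 0.04, 1),
    ("krebsonsecurity.com", 345, 0.03, 1),
]
precision_diet = optimal_diet(TABLE_I_PRECISION)
```

With these inputs the sketch stops after the first resource, which agrees with the precision-based diet reported in the next subsection. It is not claimed to reproduce the recall-based diet, since that depends on details the paper leaves open.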
B. Optimal Diet Results
Analyzing recall, precision, and F-score as representations of profitability resulted in two distinct strategies
for predicting Hackmageddon entries. In the cases where
precision or F-score determine profitability, the optimal diet
consisted of only www.databreaches.net. When recall de-
termined profitability, the diet consisted of databreaches.net
and the meta-class that combined the instances of resources
with a single entry.
To illustrate how these strategies can inform predictions,
we built an abstract model of the 2017 June 16-30 and
July 1-15 Hackmageddon timelines based on historical
information to use as a baseline. Open source foraging
techniques were applied by IFAD to perform the information
seeking task on databreaches.net alone. For the baseline
model the distribution for each host was calculated across
all 2017 Hackmageddon timelines prior to the June 16-30
report. This resulted in 11 reports consisting of 413 entries
across 121 host sites. The timelines averaged approximately
37.5 incident reports each, so this number was used to
determine the expected number of attacks reported by each
site according to the historical distribution.
All sites with an expected contribution of fewer than 0.5
attacks were aggregated. This resulted in the expected values
seen in Table II where count is the number of times a host occurs in a Hackmageddon timeline, average is the count divided by 11 (number of timelines before June 16th,
2017), weight is the count divided by 413 (total number of linked attacks between January 1 and June 16, 2017),
and the expected value is the weight multiplied by the 37.5 average attacks per timeline, rounded up to the nearest whole
number. When Google was called within IFAD to search results for the named hosts during the June 16-30 timeframe, approximately 60,400 results were returned.
Table II. ANTICIPATED RESULTS FOR HACKMAGEDDON JUNE 16-30 TIMELINE
| Host | Count | Average | Weight | Expected |
|---|---|---|---|---|
| krebsonsecurity.com | 9 | 0.818 | 0.02 | 1 |
| motherboard.vice.com | 15 | 1.363 | 0.04 | 2 |
| news.softpedia.com | 8 | 0.727 | 0.02 | 1 |
| securityaffairs.co | 27 | 2.454 | 0.06 | 3 |
| bleepingcomputer.com | 21 | 1.909 | 0.05 | 2 |
| databreaches.net | 54 | 4.909 | 0.13 | 5 |
| hackread.com | 26 | 2.363 | 0.06 | 3 |
| ibtimes.co.uk | 39 | 3.545 | 0.09 | 4 |
| infosecurity-magazine.com | 9 | 0.818 | 0.02 | 1 |
| reuters.com | 12 | 1.090 | 0.03 | 2 |
| scmagazine.com | 16 | 1.454 | 0.04 | 2 |
| theregister.co.uk | 16 | 1.454 | 0.04 | 2 |
| zdnet.com | 11 | 1 | 0.03 | 1 |
| Rest of Web | 108 | 9 | 0.26 | 10 |
Table III. ACTUAL RESULTS FOR HACKMAGEDDON JUNE 16-30 TIMELINE
| Hostname | Links | Published | Predicted | Premier |
|---|---|---|---|---|
| scmagazine.com | 4 | 177 | 2 | No |
| bleepingcomputer.com | 4 | 783 | 2 | No |
| databreaches.net | 4 | 66 | 5 | No |
| motherboard.vice.com | 2 | 633 | 2 | No |
| nytimes.com | 2 | | 0 | No |
| bbc.co.uk | 2 | | 0 | No |
| proofpoint.com | 1 | | 0 | Yes |
| reuters.com | 1 | 20000 | 2 | No |
| nydailynews.com | 1 | | 0 | Yes |
| securityaffairs.co | 1 | 95 | 3 | No |
| krebsonsecurity.com | 1 | 54 | 1 | No |
| darkreading.com | 1 | | 0 | No |
| techtalk.pcpitstop.com | 1 | | 0 | Yes |
| nymag.com | 1 | | 0 | Yes |
| miamiherald.com | 1 | | 0 | Yes |
| thetimes.co.uk | 1 | | 0 | No |
| theregister.co.uk | 1 | 569 | 2 | No |
| thehill.com | 1 | | 0 | Yes |
| blog.talosintelligence.com | 1 | | 0 | No |
| bravenewcoin.com | 1 | | 0 | Yes |
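The Average, Weight, and Expected columns of Table II follow from the definitions given in the text. The sketch below uses the constants stated there (11 timelines, 413 linked attacks, 37.5 attacks per timeline); the function and constant names are hypothetical.

```python
import math

TIMELINES = 11           # timelines before June 16, 2017
TOTAL_LINKS = 413        # linked attacks between January 1 and June 16, 2017
AVG_PER_TIMELINE = 37.5  # average attacks per timeline, as stated in the text


def anticipate(count):
    """Return (average, weight, expected) for a host seen `count` times."""
    average = count / TIMELINES
    weight = count / TOTAL_LINKS
    # The text rounds the expected value up to the nearest whole number.
    expected = math.ceil(weight * AVG_PER_TIMELINE)
    return average, weight, expected


average, weight, expected = anticipate(54)  # databreaches.net
# round(average, 3) == 4.909, round(weight, 2) == 0.13, expected == 5
```

Because the expected value is rounded up, a host such as reuters.com with 12 occurrences (about 1.09 expected attacks) is listed with an expected value of 2.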
Twenty hosts were represented in the June 16-30 Hack-
mageddon timeline report. Of these, 8 were expected to
appear using the criteria above, an additional 5 had appeared
previously but below the cutoff, and 7 appeared for the first
time. The results are shown in Table III where links refers to the number of times the June 16-30 timeline references a
given host, published refers to the Google estimate of total hits published by that host in the June 16-30 timeframe (this
value is left blank for any host that was not predicted to be
included in the timeframe), predicted is the value predicted based on analysis described above, and premier indicates whether or not this is the first time a site has appeared on a
Hackmageddon timeline in 2017.
Alternatively, an information worker was tasked with re-
viewing IFAD’s Google search results from databreaches.net
for that site on dates between June 1-15, 2017. The worker
identified 11 out of 50 sites as potential entries. In actuality, the June 16-30 timeline included 4 entries from
databreaches.net. Two of the 11 candidates turned out to be
true positives, one record was falsely scored as a negative,
and one Hackmageddon entry was not returned by Google.
Limiting the prediction domain to databreaches.net results
in a recall of 0.5 for databreaches.net entries and 0.6 overall
and a precision of 0.18 for both databreaches.net and overall.
Since all other hosts in the timeline contained fewer links
than published articles during the June 16-30 time frame,
it is reasonable to assume that any improvements to recall
gained by simply including more hosts for analysis would
come at a significant cost in terms of precision and effort.
Thus, in order to efficiently improve prediction performance,
the environment must be enriched so that either it takes
less work to extract value from a resource, or unproductive
resources are filtered out. The next section describes how
the system facilitates such environmental enrichment.
C. Enrichment
The setup above enabled enrichment of the information
environment and implementation of strategies with improved
production. The goal was to reduce the amount of energy
expended in information foraging tasks while maintaining
a high expected gain. To illustrate our approach, we used
IFAD to enrich hackread.com, taking a functional reactive
programming approach that splits memory into declarative and procedural structures [11]. Elements of the EMBERS system architecture [19] were also used to account for the task model. The software itself uses the Redux² implementation of the Flux³ application architecture.
1) Declarative Knowledge
Declarative knowledge contains the type of information
that can be stored in a knowledge graph. Conceptual chunks
are maintained with semantic links between them. The
Redux data flow approach, based on the Flux application
architecture, was used to implement declarative memory.
2) Procedural Knowledge
Procedural knowledge contains memories of how to do
things. It is a representation of the skills possessed by a
cognitive system. This was implemented in two ways in the
IFAD design. First, repeatable actions such as navigating
the web and interacting with the webview were stored as
actions and reducers. Second, unanticipated or improvised behavior was incorporated into the system by taking advantage of the development environment made available in browsers via developer tools. Redux state, and therefore declarative knowledge, was accessible thanks to the redux-devtools project.⁴
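The actions-and-reducers pattern mentioned above can be illustrated in a few lines. IFAD itself uses Redux in JavaScript; the sketch below only mirrors the idea in Python, and the action types (`NAVIGATE`, `EXTRACT`), the state shape, and the URLs are hypothetical.

```python
from functools import reduce

INITIAL_STATE = {"path": [], "facts": {}}


def reducer(state, action):
    """Return a new state for an action without mutating the previous state."""
    if action["type"] == "NAVIGATE":
        # Record the step in the foraging path.
        return {**state, "path": state["path"] + [action["url"]]}
    if action["type"] == "EXTRACT":
        # Store structured data pulled from the page at `url`.
        facts = {**state["facts"], action["url"]: action["data"]}
        return {**state, "facts": facts}
    return state  # unknown actions leave the state unchanged


def replay(actions, initial=INITIAL_STATE):
    """Rebuild the state by applying a recorded action log in order."""
    return reduce(reducer, actions, initial)


session = [
    {"type": "NAVIGATE", "url": "https://example.org/timeline"},
    {"type": "NAVIGATE", "url": "https://example.org/article"},
    {"type": "EXTRACT", "url": "https://example.org/article",
     "data": {"title": "Example incident"}},
]
state = replay(session)
```

Because the state is derived entirely from the action log, a recorded session can be replayed or shared, which fits the session sharing described in Section III.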
3) Enrichment Walkthrough
Despite being the second most productive resource in terms of precision and F-score as shown in Table I, the www.hackread.com resource is not included in any of the strategies produced by the optimal diet algorithm because its rate of gain is not justified by the cost of effort and opportunity. To expand the diet beyond databreaches.net, other resources like hackread.com need to be enriched.
² http://redux.js.org
³ http://facebook.github.io/flux/
⁴ https://github.com/gaearon/redux-devtools
Enrichment shapes the environment to increase the yield
of a given strategy [11]. IFAD provides the ability to orga-
nize data contained in documents on the web and analyze
these to gain insight, and extract patterns in order to improve
the proportion of hackread.com resources that are included
in Hackmageddon timelines. The approach involved ex-
traction of all href values from links in Hackmageddon
timelines. From these, the hostname was parsed, and a filter
was applied to find links to hackread.com. This resulted in
26 documents, of which 10 were randomly selected. Each
article was visited and the article text extracted.
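The link extraction, host filtering, and random selection described above can be sketched with the Python standard library. This is an illustration only: IFAD performs these steps inside an Electron web view, and the class name, function names, fixed seed, and sample HTML below are hypothetical.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every anchor tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_host(html, host):
    """Return hrefs whose hostname is `host` or one of its subdomains."""
    collector = HrefCollector()
    collector.feed(html)
    matches = []
    for href in collector.hrefs:
        hostname = urlparse(href).hostname or ""
        if hostname == host or hostname.endswith("." + host):
            matches.append(href)
    return matches


def sample_links(links, k, seed=0):
    """Randomly select up to k links; the fixed seed keeps runs repeatable."""
    return random.Random(seed).sample(links, min(k, len(links)))


page = ('<a href="https://www.hackread.com/example-a/">a</a>'
        '<a href="https://example.org/other/">b</a>')
hackread_links = links_to_host(page, "hackread.com")
selected = sample_links(hackread_links, 10)
```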
IFAD uses human-in-the-loop detection of webpage content rather than relying on automatic classifiers such as that of Kohlschütter et al. [20]. Chrome DevTools were used to investigate important webpage elements. JavaScript commands executed via inter-process communication (IPC) extracted interesting content from public web sites and loaded it into a structured state within IFAD’s working memory.
The main content was selected from hackread.com pages. The ten pages linked from Hackmageddon formed the HACK class, and the first ten Google search results between January 1, 2017 and May 31, 2017 formed the BASE class. A language model was constructed by counting the terms in both the HACK and BASE corpora, after downcasing the text and removing stopwords.
Next, the TF-IDF score was calculated for each term, with the HACK and BASE classes treated as the two documents. The top 30 scoring terms were then extracted
from the HACK class and used to refine the search. Without
the refinement, Google returned 47 results for hackread.com
between June 1-15, 2017. Only 13 results were returned
when the results included the term refinements, a 72% reduc-
tion. Importantly, the June 1-15 Hackmageddon timeframe
included one reference to hackread.com, and that resource
was also included in the refined search results.
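The term-refinement step can be sketched as a two-document TF-IDF ranking. The paper does not say which TF-IDF variant was used, so the unsmoothed form below (relative term frequency times log(N/df)) is an assumption, as are the function names, the tiny stopword list, and the sample texts.

```python
import math
import re
from collections import Counter

# ASSUMPTION: a tiny illustrative stopword list, not the one used by the authors.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "by"}


def tokenize(text):
    """Lowercase the text, split into alphanumeric terms, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def top_terms(target_text, baseline_text, n):
    """Rank target terms by TF-IDF, treating each corpus as one document."""
    docs = [Counter(tokenize(target_text)), Counter(tokenize(baseline_text))]
    target = docs[0]
    total = sum(target.values())
    scores = {}
    for term, count in target.items():
        df = sum(1 for doc in docs if term in doc)
        scores[term] = (count / total) * math.log(len(docs) / df)
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


hack_text = "Hackers leaked the database. Database breach exposed passwords."
base_text = "The company released a phone. Hackers reviewed the phone."
refinements = top_terms(hack_text, base_text, 2)
# refinements == ["database", "breach"]
```

With only two documents and no smoothing, any term that appears in both classes receives a score of zero, so the ranking surfaces only vocabulary unique to the HACK class. A smoothed IDF would soften that cutoff.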
V. RESULTS
The results discussed above do not alter the Optimal Diet
Strategy determined by Equation (3) because if all resources
required the same amount of energy, then the optimal diet
would almost always be consumption of only the resource
offering the highest gain. However, the R(k) change for hackread.com from 0.09 to 0.12 narrows the gap between it
and databreaches.net’s score of 0.13. This means that even
enrichments that are only able to reduce the work required
to consume hackread.com resources by a small amount can
still make it a viable resource.
VI. CONCLUSION
IFAD is a tool that supports data exploration with the
goal of discovering publicly available resources that can be
used to predict future cyber attacks. The system is built
on the concept of information foraging and the design
decisions it incorporates are adapted from previous open
source intelligence tools such as EMBERS and STUCCO.
The system is distinct from other open source intelligence
systems because it prioritizes rapid feedback, exploratory
search, and the discovery of unconventional sources for
overcoming the challenges of concealment that come with
an adversarial domain such as cyber security. The ability to
enrich data sources to improve information gain was demon-
strated. However, in order to be effective at constructing
information exploration strategies, more research is needed
to understand the cost of information extraction.
Future iterations will include improvements to the inter-
face that make it easier to maximize Equation (2). Improve-
ments to the expected value of knowledge gained will be
incorporated by increasing the analytical toolset available
within the system, using links to declarative knowledge and
application state. Similarly, new procedural knowledge can
be used to decrease the cost of interaction by increasing
the integration of browser developer tools, making it easier to extract semantic meaning contained in a web page.
A positive side effect of sustained usage of the system will
be an annotated corpus on any topic the user is exploring.
The Hackmageddon example described in this paper, for
example, resulted in a small corpus of documents labeled
according to whether they were included in a Hackmageddon
timeline or not. When the system is used to discover
unconventional resources that enable cyber-attack prediction
more generally, the corpus will grow in size and the anno-
tations will become more granular, possibly aiding in the
construction of language models for a multitude of purposes.
Discovering resources that anticipate cyber attacks in
publicly available data is difficult because of the growth rate
of the search domain and the adaptability of attackers. The
implementation of information foraging strategies balances
human intuition with automation so that a wide variety
of new concepts from high-volume data collections can be
quickly tested for veracity.
ACKNOWLEDGMENT
This research is supported by the Office of the Director of
National Intelligence (ODNI) and the Intelligence Advanced Re-
search Projects Activity (IARPA) via the Air Force Research
Laboratory (AFRL) contract number FA875016C0114. The U.S.
Government is authorized to reproduce and distribute reprints for
Governmental purposes notwithstanding any copyright annotation
thereon. Disclaimer: The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed
or implied, of ODNI, IARPA, AFRL, or the U.S. Government.
REFERENCES
[1] A. Okutan, S. J. Yang, and K. McConky, “Predicting cyber attacks with Bayesian networks using unconventional signals,” in Proceedings of the 12th Annual Conference on Cyber and Information Security Research. ACM, 2017, p. 13.
[2] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, Tech. Rep., May, 2011.
[3] A. Doyle, G. Katz, K. Summers, C. Ackermann, I. Zavorin, Z. Lim, S. Muthiah, P. Butler, N. Self, L. Zhao et al., “Forecasting significant societal events using the EMBERS streaming predictive analytics system,” Big Data, vol. 2, no. 4, pp. 185–195, 2014.
[4] N. McNeil, R. A. Bridges, M. D. Iannacone, B. Czejdo, N. Perez, and J. R. Goodall, “PACE: Pattern accurate computationally efficient bootstrapping for timely discovery of cyber-security concepts,” in Proceedings of the 12th International Conference on Machine Learning and Applications (ICMLA), vol. 2. IEEE, 2013, pp. 60–65.
[5] S. Muthiah, P. Butler, R. P. Khandpur, P. Saraf, N. Self, A. Rozovskaya, L. Zhao, J. Cadena, C.-T. Lu, A. Vullikanti, A. Marathe, K. Summers, G. Katz, A. Doyle, J. Arredondo, D. K. Gupta, D. Mares, and N. Ramakrishnan, “EMBERS at 4 years: Experiences operating an open source indicators forecasting system,” in Proceedings of the 22nd ACM SIGKDD. New York, NY, USA: ACM, 2016, pp. 205–214.
[6] R. A. Bridges, C. L. Jones, M. D. Iannacone, K. M. Testa, and J. R. Goodall, “Automatic labeling for entity extraction in cyber security,” arXiv preprint arXiv:1308.4941, 2013.
[7] K. Leetaru and P. A. Schrodt, “GDELT: Global data on events, location, and tone,” in ISA Annual Convention, 2013.
[8] J. Matherly, “Shodan,” 2017, https://www.shodan.io/ [Accessed: November 2017].
[9] R. Bejtlich, The Tao of network security monitoring: beyond intrusion detection. Pearson Education, 2004.
[10] D. W. Stephens and J. R. Krebs, Foraging theory. Princeton University Press, 1986.
[11] P. Pirolli, Information foraging theory: Adaptive interaction with information. Oxford University Press, 2007.
[12] G. Marchionini, “Exploratory search: from finding to understanding,” Communications of the ACM, vol. 49, no. 4, pp. 41–46, 2006.
[13] J. Heer, B. Shneiderman, and C. Park, “A taxonomy of tools that support the fluent and flexible use of visualizations,” ACM Queue, vol. 10, no. 2, pp. 1–26, 2012.
[14] H. A. Simon, “Designing organizations for an information-rich world,” Computers, communications, and the public interest, vol. 4, pp. 37–53, 1971.
[15] V. Bush, “As we may think,” The Atlantic Monthly, vol. 176, no. 1, pp. 101–108, 1945.
[16] K. M. Ford, P. J. Hayes, C. Glymour, and J. Allen, “Cognitive orthoses: toward human-centered AI,” AI Magazine, vol. 36, no. 4, pp. 5–8, 2015.
[17] I. J. Good, “The population frequencies of species and the estimation of population parameters,” Biometrika, vol. 40, no. 3-4, pp. 237–264, 1953.
[18] D. Jurafsky, Speech & Language Processing. Pearson Education India, 2000.
[19] N. Ramakrishnan, P. Butler, S. Muthiah, N. Self, R. Khandpur, P. Saraf, W. Wang, J. Cadena, A. Vullikanti, and G. Korkmaz, “‘Beating the news’ with EMBERS: forecasting civil unrest using open source indicators,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1799–1808.
[20] C. Kohlschütter, P. Fankhauser, and W. Nejdl, “Boilerplate detection using shallow text features,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010, pp. 441–450.