GATE: General Architecture for Text Engineering


Asma Aldhahri CS330 Artificial Intelligence

General Architecture: a software development environment and framework, plus a macro-level organizational structure for families of software systems

Text Engineering: performing tasks that involve processing human language. It relates to the application of computational linguistics, natural language processing and artificial intelligence

Computational Linguistics (CL): science of language that uses computation as an investigative tool.

Natural Language Processing (NLP): science of computation whose subject matter is data structures and algorithms for computer processing of human language.

In short, GATE is an environment for building every part of a software system that extracts information from text


Develop & deploy software components

Text analysis, language processing

Language processing tasks:

Parsers

Morphology

Tagging

Information retrieval tools

Information extraction tools

All for various languages


Combines application development environment with GUI

Resources pane:

applications: groups of processing resources that run on a single document or a group of documents

language resources: documents and document collections to be annotated (CORPUS = collection in GATE terms)

processing resources: annotation tools that operate on the unstructured text within documents

data stores: specialized storage on the hard disk or in a database where documents and corpora are kept for future use
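
The resource types above can be pictured as a small data model. This is a minimal sketch in plain Python, not GATE's actual API: the class names Annotation, Document and Corpus and the toy processing resource mark_whole_text are hypothetical and only illustrate how the pieces relate.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int            # character offset where the span begins
    end: int              # character offset just past the span
    type: str             # e.g. "Token", "Person", "Location"
    features: dict = field(default_factory=dict)

@dataclass
class Document:           # language resource: text plus its annotations
    text: str
    annotations: list = field(default_factory=list)

@dataclass
class Corpus:             # language resource: a collection of documents
    name: str
    documents: list = field(default_factory=list)

def mark_whole_text(doc):
    # Toy processing resource: adds one annotation covering the whole text.
    doc.annotations.append(Annotation(0, len(doc.text), "Text"))

# An "application" here is just a list of processing resources run over a corpus.
corpus = Corpus("demo", [Document("GATE extracts information from text.")])
application = [mark_whole_text]
for doc in corpus.documents:
    for pr in application:
        pr(doc)
```

Annotations refer to the text by character offsets rather than changing it, so several tools can annotate the same document independently.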


Annotating a Document Example: Step 1 Load the Language Resource (document)

To add a document, right-click Language Resources and select New > GATE Document

Set the encoding and the URL of the document (you can browse your files or take a URL straight from the web)
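
Why the encoding matters can be shown with a short sketch: a document is fetched as bytes from a URL and only becomes text once it is decoded. The helper load_document is a hypothetical name, not part of GATE, and the sample sentence is made up for illustration. A local file is addressed with a file: URL in the same way a web page is addressed with an http(s): URL.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def load_document(url, encoding="utf-8"):
    """Fetch the bytes at `url` and decode them with the given encoding."""
    with urlopen(url) as response:
        return response.read().decode(encoding)

# Write a small sample file, then load it back through its file: URL.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("ماريا lives in Yanbu.", encoding="utf-8")
    text = load_document(path.as_uri(), encoding="utf-8")
```

Choosing the wrong encoding for a file like this would either fail or produce garbled characters, which is why it is set when the document is created.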


Step 2 Load Processing Resources (ANNIE)

Document Reset – removes old annotations

Tokenizer – splits text into tokens and classifies each one by kind (word, number, symbol, punctuation, space)

Gazetteer – looks up names in lists and links them to more general concepts (e.g. Heathrow = airport)

Sentence Splitter – recognizes sentences

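Tagger – assigns a part-of-speech tag to each token

Transducer – in ANNIE, the NE transducer applies JAPE grammar rules to the earlier annotations to produce named-entity annotations such as Person or Location (a separate transducer inside the tokenizer adapts tokenizer output to the POS tagger's requirements)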

OrthoMatcher – adds identity relations between named entities
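
The tokenizer step in the list above can be sketched with a regular expression that has one alternative per token kind. This is an illustrative approximation in plain Python, not ANNIE's tokenizer rules; the pattern and the function name tokenize are assumptions.

```python
import re

# One alternative per token kind; the first alternative that matches wins.
TOKEN_PATTERN = re.compile(
    r"(?P<word>[^\W\d_]+)"                    # runs of letters (any script)
    r"|(?P<number>\d+)"                       # runs of digits
    r"|(?P<space>\s+)"                        # runs of whitespace
    r"|(?P<punctuation>[.,;:!?'\"()\[\]-])"   # a single punctuation mark
    r"|(?P<symbol>.)",                        # anything else, one character
    re.DOTALL,
)

def tokenize(text):
    """Return (start, end, kind, string) tuples covering the whole text."""
    return [
        (m.start(), m.end(), m.lastgroup, m.group())
        for m in TOKEN_PATTERN.finditer(text)
    ]

tokens = tokenize("Flight 42 costs $90.")
kinds = [kind for _, _, kind, _ in tokens]
```

Every character ends up in exactly one token, so joining the token strings gives back the original text.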


ANNIE: A Nearly-New Information Extraction system

Its plugin-based architecture separates processing components from document-based resources, which helps minimize memory usage
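
The gazetteer among ANNIE's processing resources is essentially a list lookup. The sketch below assumes a tiny hand-made list; the dictionary GAZETTEER, its entries and the function lookup are hypothetical and do not reflect ANNIE's list files or matching rules.

```python
import re

# Hypothetical gazetteer list: lower-cased name -> more general concept.
GAZETTEER = {
    "heathrow": "airport",
    "gatwick": "airport",
    "yanbu": "location",
}

def lookup(text):
    """Return (start, end, string, concept) for every word found in the list."""
    hits = []
    for m in re.finditer(r"\w+", text):
        concept = GAZETTEER.get(m.group().lower())
        if concept is not None:
            hits.append((m.start(), m.end(), m.group(), concept))
    return hits

hits = lookup("She flew from Heathrow to Yanbu.")
```

A list lookup only finds names it already contains, which is one reason later pipeline steps add rules on top of it.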


Step 3 Before Running an Application, Add Document to a Corpus

Right-click the document you want to put in a corpus

Select multiple documents to build a larger corpus


Step 4 Create Corpus Pipeline

First add the document to a corpus

Press “Run this Application”

GATE then shows the time taken to run the application

The order of the processing resources (PRs) matters: each PR takes the output of the previous PR as its input
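
Why the order of processing resources matters can be seen in a two-step sketch, where the sentence splitter reads the annotations that the tokenizer writes. The functions below are simplified stand-ins written for this example, not GATE components, and the document is just a dictionary.

```python
import re

def tokenizer(doc):
    # Writes "Token" spans: runs of non-whitespace characters.
    doc["Token"] = [(m.start(), m.end()) for m in re.finditer(r"\S+", doc["text"])]

def sentence_splitter(doc):
    # Reads the tokenizer's output, so it must run after the tokenizer.
    sentences, start = [], None
    for tok_start, tok_end in doc["Token"]:
        if start is None:
            start = tok_start
        if doc["text"][tok_end - 1] in ".!?":
            sentences.append((start, tok_end))
            start = None
    if start is not None:  # trailing text without a final full stop
        sentences.append((start, doc["Token"][-1][1]))
    doc["Sentence"] = sentences

def run_pipeline(prs, doc):
    for pr in prs:
        pr(doc)
    return doc

doc = run_pipeline([tokenizer, sentence_splitter],
                   {"text": "GATE is free. It runs on Java."})

# Swapping the order fails: there are no tokens yet for the splitter to read.
try:
    run_pipeline([sentence_splitter, tokenizer], {"text": "GATE is free."})
    wrong_order_failed = False
except KeyError:
    wrong_order_failed = True
```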


Step 5 View Annotations & Add Annotation


Highlight the text you want to annotate and right-click it
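
Adding an annotation by hand amounts to recording the offsets of the highlighted text together with a type. A minimal sketch, assuming annotations are stored as plain dictionaries; the function annotate and the sample sentence are made up for illustration and are not GATE's API.

```python
def annotate(text, annotations, selected, ann_type):
    """Record an annotation over the first occurrence of `selected`."""
    start = text.find(selected)
    if start == -1:
        raise ValueError(f"{selected!r} does not occur in the text")
    annotations.append(
        {"start": start, "end": start + len(selected), "type": ann_type}
    )
    return annotations[-1]

text = "Maria studies in Yanbu."
annotations = []
ann = annotate(text, annotations, "Maria", "Person")
```

The text itself is left untouched; only the list of annotations grows.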


The original document before and after adding the “Arabic” and “Person” annotations


Step 6 Change Unknown Annotation

Open the drop-down list and select “Location”

“Yanbu” is now highlighted as a “Location”
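
Changing an annotation from “Unknown” to “Location” only changes its type; the span stays where it is. A small sketch under the assumption that annotations are plain dictionaries with start, end and type keys; the function retype is a hypothetical helper, not a GATE call.

```python
text = "Maria visited Yanbu."
annotations = [
    {"start": 0, "end": 5, "type": "Person"},
    {"start": 14, "end": 19, "type": "Unknown"},
]

def retype(text, annotations, surface, old_type, new_type):
    """Change the type of annotations of `old_type` whose span reads `surface`."""
    changed = 0
    for ann in annotations:
        if ann["type"] == old_type and text[ann["start"]:ann["end"]] == surface:
            ann["type"] = new_type
            changed += 1
    return changed

changed = retype(text, annotations, "Yanbu", "Unknown", "Location")
```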


Step 7

Tokens & Spaces Highlighted

Annotations List


Step 8

Sentences & Splits Highlighted


Step 9

Annotations List


Step 10

Create Arabic Pipeline


Step 11

Arabic Annotation

GATE now recognizes the Arabic word ماريا (Maria) as a “Person” and annotates it as such


Annotations List (closer look)


Another Annotation Example with more token types


Application Pipelines, Corpora, Documents and Processing Resources can all be saved

Annotated documents can be saved for use outside of the GATE environment
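
One way to picture an annotated document saved for use elsewhere is text with inline tags. The sketch below turns offset-based annotations into inline XML; it is an illustration only and does not reproduce GATE's own export formats. The function to_inline_xml is a hypothetical helper and assumes the annotations do not overlap.

```python
from xml.sax.saxutils import escape

def to_inline_xml(text, annotations):
    """Wrap each annotated span in a tag named after its type.

    Assumes the annotations do not overlap.
    """
    out, pos = [], 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        out.append(escape(text[pos:ann["start"]]))       # text before the span
        span = escape(text[ann["start"]:ann["end"]])
        out.append(f"<{ann['type']}>{span}</{ann['type']}>")
        pos = ann["end"]
    out.append(escape(text[pos:]))                       # text after the last span
    return "".join(out)

text = "Maria visited Yanbu."
annotations = [
    {"start": 14, "end": 19, "type": "Location"},
    {"start": 0, "end": 5, "type": "Person"},
]
xml = to_inline_xml(text, annotations)
```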


Thank You
