GATE: General Architecture for Text Engineering

Asma Aldhahri CS330 Artificial Intelligence

General Architecture: a software development environment and framework, plus a macro-level organizational structure for families of software systems

Text Engineering: performing tasks that involve processing human language. It relates to the application of computational linguistics, natural language processing, and artificial intelligence

Computational Linguistics (CL): science of language that uses computation as an investigative tool.

Natural Language Processing (NLP): science of computation whose subject matter is data structures and algorithms for computer processing of human language.

Basically, it’s an environment that lets you build every part of a software system that extracts information from text

Develop & deploy software components

Text analysis, language processing

Language processing tasks:

Parsers

Morphology

Tagging

Information retrieval tools

Information extraction tools

All for various languages

Combines application development environment with GUI

Resources pane:

Applications: groups of processing resources that run on a single document or a group of documents

Language resources: documents or document collections to be annotated (a collection is called a CORPUS in GATE terms)

Processing resources: annotation tools that operate on the unstructured text within documents

Datastores: specialized files on the hard disk, or a database, where docs & processing resources are kept for future use

Annotating a Document Example: Step 1 Load the Language Resource (document)

To add a document, right-click Language Resources and select New > GATE Document

Set the encoding and the URL for the document (you can browse your files or take a URL straight from the web)
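Why the encoding setting matters can be sketched outside GATE. The snippet below is a minimal illustration in plain Python, not GATE code; `load_document` is a hypothetical helper name. It reads raw bytes from a local path or a URL and decodes them with the encoding the user states, which is roughly the choice the dialog asks for.

```python
from pathlib import Path
from urllib.request import urlopen


def load_document(source: str, encoding: str = "utf-8") -> str:
    """Read a document from a URL or a local path and decode it.

    Hypothetical helper for illustration only; it is not part of GATE.
    """
    if source.startswith(("http://", "https://", "file://")):
        # Remote (or file://) source: fetch the raw bytes.
        with urlopen(source) as response:
            raw = response.read()
    else:
        # Plain local path.
        raw = Path(source).read_bytes()
    # Decoding with the wrong encoding either fails or garbles the text,
    # which is why the encoding should be set when the document is loaded.
    return raw.decode(encoding)
```

For example, a file saved as windows-1256 (a common Arabic encoding) would be loaded with `load_document("doc.txt", "cp1256")`, where `doc.txt` is a placeholder file name.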

Step 2 Load Processing Resources ANNIE

Document Reset – removes old annotations

Tokenizer – splits text into tokens and classifies each by type (word, number, symbol, punctuation, space)

Gazetteer – matches names in the text against lists to link them to more general concepts (e.g., Heathrow = airport)

Sentence Splitter – recognizes sentences

POS Tagger – produces a part-of-speech tag for each token

Transducer – adapts tokenizer output to POS tagger requirements

OrthoMatcher – adds identity relations between named entities
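What the Tokenizer and Gazetteer do can be sketched in a few lines of plain Python. This is a toy illustration, not ANNIE's implementation; `TOKEN_RE`, `tokenize`, `GAZETTEER`, and `lookup` are made-up names, and the one-entry gazetteer list is an assumption based on the Heathrow example above.

```python
import re

# One named group per token type listed for the Tokenizer.
TOKEN_RE = re.compile(
    r"(?P<word>[^\W\d_]+)"          # runs of letters (Unicode-aware)
    r"|(?P<number>\d+)"             # runs of digits
    r"|(?P<space>\s+)"              # runs of whitespace
    r"|(?P<punctuation>[.,;:!?()-])"
    r"|(?P<symbol>.)",              # anything else, one character at a time
    re.DOTALL,
)

# Toy gazetteer list: surface form (lower-cased) -> more general concept.
GAZETTEER = {"heathrow": "airport"}


def tokenize(text):
    """Return (kind, start, end, string) tuples covering the whole text."""
    return [(m.lastgroup, m.start(), m.end(), m.group())
            for m in TOKEN_RE.finditer(text)]


def lookup(tokens):
    """Return (start, end, concept) for every word found in the gazetteer."""
    return [(start, end, GAZETTEER[string.lower()])
            for kind, start, end, string in tokens
            if kind == "word" and string.lower() in GAZETTEER]
```

Usage: `lookup(tokenize("Heathrow has 5 terminals."))` marks the span of "Heathrow" as `airport`. Note that each result is stored as offsets into the text, which is also how annotations sit on top of a document without changing it.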

ANNIE: A Nearly-New Information Extraction system

A plugin-based architecture separates processing components from document-based resources, which helps minimize memory usage

Step 3 Before Running an Application, Add Document to a Corpus

Right click the document you want in a corpus

Highlight multiple documents for a larger corpus

Step 4 Create Corpus Pipeline

Shows the time for running the application

First add document to corpus

Press “Run this Application”

The order of PRs is important: each PR takes the output of the previous PRs as its input
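Why PR order matters can be shown with a toy corpus pipeline. This is a sketch in plain Python under simplifying assumptions, not GATE's API: a document is a dict, an annotation is a `(type, start, end)` tuple, and the three PR functions are hypothetical stand-ins for Document Reset, Tokenizer, and Sentence Splitter.

```python
import re


def document_reset(doc):
    """Remove old annotations so each run starts clean."""
    doc["annotations"] = []


def tokenizer(doc):
    """Add a Token annotation for every word or punctuation mark."""
    for m in re.finditer(r"\w+|[^\w\s]", doc["text"]):
        doc["annotations"].append(("Token", m.start(), m.end()))


def sentence_splitter(doc):
    """Add Sentence annotations; depends on Token annotations existing."""
    tokens = [a for a in doc["annotations"] if a[0] == "Token"]
    if not tokens:
        raise ValueError("no Token annotations: run the tokenizer first")
    start = None
    for _, t_start, t_end in tokens:
        if start is None:
            start = t_start                      # first token of a sentence
        if doc["text"][t_start:t_end] in {".", "!", "?"}:
            doc["annotations"].append(("Sentence", start, t_end))
            start = None
    if start is not None:                        # text without a final stop
        doc["annotations"].append(("Sentence", start, tokens[-1][2]))


def run_pipeline(corpus, prs):
    """Run every PR, in the given order, over every document in the corpus."""
    for doc in corpus:
        for pr in prs:
            pr(doc)
```

With the order `[document_reset, tokenizer, sentence_splitter]` the run succeeds; swapping the last two makes the splitter raise, because the tokens it consumes have not been produced yet.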

Step 5 View Annotations & Add Annotation

Shows the time for running the application

Highlight the text you want to annotate and right-click

The original document before and after adding the “Arabic” and “Person” annotations

Step 6 Change Unknown Annotation

Open the drop-down list and select “Location”

“Yanbu” will now be highlighted as “Location”
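Changing an annotation's type can be pictured as editing one field of a stand-off record. The sketch below is plain Python, not GATE's annotation classes; `Annotation` and `retype` are illustrative names, and the sample sentence is made up around the "Yanbu" example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A stand-off annotation: offsets into the text plus a type label."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


def retype(annotations, text, covered, new_type):
    """Set new_type on every annotation whose covered text equals `covered`.

    Returns the number of annotations changed. The document text itself
    is never modified; only the label attached to the span changes.
    """
    changed = 0
    for ann in annotations:
        if text[ann.start:ann.end] == covered:
            ann.type = new_type
            changed += 1
    return changed


text = "Maria lives in Yanbu"                      # made-up sample sentence
annotations = [Annotation(0, 5, "Person"), Annotation(15, 20, "Unknown")]
retype(annotations, text, "Yanbu", "Location")     # Unknown -> Location
```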

Step 7

Tokens & Spaces Highlighted

Annotations List

Step 8

Sentences & Splits Highlighted

Step 9

Annotations List

Step 10

Create Arabic Pipeline

Step 11

Arabic Annotation

GATE now recognizes the Arabic word ماريا as a “Person” and annotates it as such
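The same list-lookup idea carries over to Arabic because the matching works on Unicode characters. The sketch below is plain Python, not the Arabic plugin itself; `PERSON_NAMES` and `annotate_people` are hypothetical names, and the one-entry name list is an assumption taken from the example word.

```python
import re

# Toy name list; a real gazetteer would hold many entries.
PERSON_NAMES = {"ماريا"}


def annotate_people(text):
    """Return ("Person", start, end) for each word found in the name list.

    Offsets count characters in logical (storage) order, so they are
    unaffected by the right-to-left direction in which Arabic is displayed.
    """
    return [("Person", m.start(), m.end())
            for m in re.finditer(r"[^\W\d_]+", text)   # runs of letters
            if m.group() in PERSON_NAMES]
```

Calling `annotate_people` on a text that begins with ماريا returns a single `Person` annotation starting at offset 0.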

Annotation List

(closer look)

Another Annotation Example with more token types

Application Pipelines, Corpora, Documents and Processing Resources can all be saved

Annotated documents can be saved for use outside of the GATE environment
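One way to picture an annotated document leaving the tool is as text with inline tags. The sketch below is a plain-Python illustration and does not reproduce GATE's own save formats; `to_inline_xml` is a hypothetical helper, and it assumes annotations are non-overlapping `(start, end, type)` tuples whose types are valid XML element names.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text, annotations):
    """Wrap each annotated span in a tag named after its annotation type.

    Assumes non-overlapping (start, end, type) tuples; overlapping spans
    would need a stand-off format instead of inline tags.
    """
    parts, pos = [], 0
    for start, end, ann_type in sorted(annotations):
        parts.append(escape(text[pos:start]))            # untagged text
        parts.append(f"<{ann_type}>{escape(text[start:end])}</{ann_type}>")
        pos = end
    parts.append(escape(text[pos:]))                     # trailing text
    return "".join(parts)
```

For the made-up sentence "Maria lives in Yanbu" with a `Person` span at 0-5 and a `Location` span at 15-20, the result is `<Person>Maria</Person> lives in <Location>Yanbu</Location>`, which any XML-aware tool can read.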

Thank You