
IT 380

Electronic Document and

Record Management

Systems

Unit 7 Metadata, Classification, and

Converting Manual Records

Instructor: Dr. Michelle Liu

METADATA


Topics

▪ Metadata concepts and standards

▪ Sources of metadata

▪ Applying metadata to records

▪ Automated metadata collection


Definition of Metadata


“Data describing context, content and structure of

records and their management through time.”

-Source: ISO 15489

“Initially, metadata defines the record at its point of

capture, fixing the record into its business context and

establishing management control over it.”

-Source: ISO 23081

“Structured information that describes, explains,

locates, or otherwise makes it easier to retrieve, use,

or manage an information resource. Metadata is often

called data about data or information about

information.”

-Source: NISO (National Information Standards Organization)

Metadata


Metadata will have “properties”

These describe the characteristics and rules

Where do you encounter metadata

every day?


Metadata


Where else do you use metadata

every day?


What is Metadata?

▪ Provides information about a document or record

▪ Machine understandable

▪ Has well defined semantics and structure

▪ May include descriptive information about the context,

quality, condition, or characteristics of the data

▪ Metadata provides context for data and is used to

facilitate the understanding, characteristics, and

usage of data

▪ All the objects managed by an ERM system have

metadata, including volumes, files, and classes.

▪ Also referred to as Tags or Attributes

▪ Adding consistent tags to an unstructured piece of data gives it context

What is Metadata?, cont’d

▪ For a document or record metadata could be

its author, its title, the issue date, and other

information which can usefully be associated

with it

▪ Defined in terms of units called elements,

fields, index fields, or profile fields.

▪ Some fields also support sub-elements or

attributes to differentiate different types of

similar fields.

▪ E.g., Date_paid, Date_enrolled, Date_graduated
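A minimal sketch of how elements and sub-elements might be held for one record. The class name `RecordMetadata`, its fields, and the sample values are assumptions made for illustration; they do not come from any particular ERM product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RecordMetadata:
    """Metadata profile for one record: plain elements plus refined date elements."""
    title: str
    author: str
    # Sub-elements that differentiate similar fields, keyed by refinement name,
    # e.g. "paid", "enrolled", "graduated" (all of them dates).
    dates: dict[str, date] = field(default_factory=dict)

    def element(self, name: str):
        """Look up 'Title', 'Author', or a refined element such as 'Date_paid'."""
        if name.lower().startswith("date_"):
            return self.dates.get(name[5:].lower())
        return getattr(self, name.lower(), None)


# Hypothetical record used only to exercise the lookup.
profile = RecordMetadata(
    title="Tuition invoice",
    author="Registrar",
    dates={"paid": date(2024, 1, 15), "enrolled": date(2023, 9, 1)},
)
print(profile.element("Date_paid"))       # 2024-01-15
print(profile.element("Date_graduated"))  # None: not recorded for this record
```

Keeping the refinements under one parent element means a search on "any date" and a search on "Date_paid" can both be answered from the same profile.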


Why Metadata?


▪ Supporting efficient retrieval

▪ Providing logical links between records and the

context of their creation, and maintaining them

in a structured, reliable and meaningful way

▪ Supporting the identification of the technological

environment used to create the record and

needed to access it

▪ Supporting efficient and successful migration of

records from one environment or computer

platform to another or any other preservation

strategy.

Why Metadata?


▪ Protecting records as evidence and ensuring

their accessibility and usability through time

▪ Facilitating the ability to understand records

▪ Supporting and ensuring the evidential value

of records

▪ Helping to ensure the authenticity, reliability,

and integrity of records

▪ Supporting and managing access, privacy,

and rights

Purpose of Metadata

▪ Identification / Distinction [Title, Date, Publisher, Version, Type, etc.]

▪ Search / Retrieve / Browse [Country, Region, Subject, Sector, Theme, Topic, etc.]

▪ Use Management [Authorized By, Rights Management, Access Rights, Location, Disclosure Status, etc.]

▪ Compliant Document Management [Record Identifier, Retention Schedule, Relation, Disposal Status, etc.]


Types of Metadata

Source: ISO 23081-1

Image source: http://archives.govt.nz/advice/continuum-resource-kit/continuum-publications-html/g14-technical-guide-implementing-recordkee

Metadata about the Record

▪ Provides minimal, but essential information

about an item or record

▪ May come directly from the file itself

▪ Examples: Author, Contributor, Creator,

Date, Identifier, Status, Title, or Document

Type.


Metadata about business rules,

policies, and mandates

▪ For “discovery” and context

▪ Making future search and retrieval

operations more effective

▪ More precise and targeted searches against

a range of criteria

▪ Facilitates collection of items into “virtual

files” that satisfy certain criteria

▪ Examples: Audience, Accessibility,

Addressee, Coverage, Description,

Language, Location, Publisher, Relation, or

Source

Metadata about Agents or Users

▪ Who created the record

▪ Who indexed it

▪ Who has accessed it to perform record-related

functions

▪ Who has extracted it

▪ Examples: Indexer, Scanner, User, Manager,

Owner, or Reviewer


Administrative Metadata

▪ About business activities or processes

▪ Provide general information on how to

manage a resource

▪ Often includes technical metadata intrinsic to

the file or its location

▪ Examples: File Type, File Date,

Compression Type, or Access Level


Metadata about Records

Management Activities

▪ Subset of administrative metadata that is

more specifically designed to support

records management activities

▪ Usually of most relevance to an

organization’s record managers.

▪ Actions such as changing metadata entries

for security, disposal and preservation are

the domain of records administration

▪ Examples: Aggregation, Digital signature,

Disposal, Mandate, Preservation, Rights

Business-specific Metadata

▪ It may be necessary to add additional

elements or refinements to meet particular

business needs.

▪ Any new elements need to be carefully

defined and specified so that the user

community understands their purpose

▪ Constrain the choice of values to a ‘pick-list’

or pre-determined encoding scheme for the

information to be held, to ensure consistency

of use.

Examples of Business-Specific

Metadata

▪ Case 1: A business organization determines

that existing date-related metadata is

confusing because all fields use the same

date. It decides to create and use refined

elements including “date.invoicepaid”,

“date.billed”, and “date.projcomplete”.

▪ Case 2: A university decides to introduce a

Student element to differentiate records of

staff from student records.

Mandatory vs. Optional Metadata

▪ Trade-off between too many mandatory

elements and too few

▪ Why?

▪ Records Management staff often want as

much mandatory metadata as possible to

achieve a metadata-rich repository.

▪ At odds with the user community, who are often

not prepared to spend the time and effort

entering it!

▪ A sensible compromise is necessary, as well

as the use of automation techniques
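The trade-off can be made concrete with a small check that separates mandatory from optional elements. This is only a sketch: the element names in `MANDATORY` and `OPTIONAL` are hypothetical choices, and a real scheme would be agreed between records staff and users.

```python
MANDATORY = {"title", "record_type", "date_created"}   # hypothetical choice
OPTIONAL = {"subject", "description", "language"}      # hypothetical choice


def check_profile(profile: dict) -> dict:
    """Report missing mandatory elements and elements outside the schema."""
    # An element counts as filled only if it has a non-empty value.
    filled = {name for name, value in profile.items() if value not in (None, "")}
    return {
        "missing": sorted(MANDATORY - filled),
        "unknown": sorted(set(profile) - MANDATORY - OPTIONAL),
    }


report = check_profile({"title": "Contract 17", "record_type": "", "subject": "IT"})
print(report)  # {'missing': ['date_created', 'record_type'], 'unknown': []}
```

The shorter the `MANDATORY` set, the less users must type before a record can be saved; values such as `date_created` are natural candidates for automatic capture.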

Metadata for Physical Objects

▪ Metadata entries in an ERM system can be

made not only for electronic records, but also

for physical objects

▪ A library of audio tapes, CD-ROMs containing

experimental data, or other collections of

physical objects

▪ Physical objects need a Location metadata

element if the objects can be moved about.

▪ Record-keeping information and actions can

apply to physical objects just as with

electronic items

▪ E.g., retention periods and disposal instructions

Sources of Metadata

▪ The record itself can provide much

metadata

▪ When the record is first created, it will have

its own intrinsic metadata relating to the

digital file itself: file date, file size, and so

forth.

▪ The record could also provide metadata

itself

▪ A Word document will often have a title, author,

etc.


Sources of Metadata

▪ Document metadata/properties:

▪ Most Office Automation (OA)

software packages can associate

metadata with documents as they

are created or received.

▪ Example: MS Office permits the

entry of document “Properties”


Sources of Metadata

▪ Manual data entry

▪ Users are expected or required to enter

information into profile fields

▪ Profile fields could be in the document

properties themselves

▪ They could be in an ERMS, ECMS, or

imaging application.

▪ Data entry staff vs. all users


How metadata works for an

Individual Record

▪ Suppose you work as an intern at a local IT company. You need to create a Word file to document the details of a contract. When you first create the contract document, it has some intrinsic metadata:

▪ Title

▪ File format

▪ File size

▪ Date of creation
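As a rough illustration, the intrinsic values listed above can be read from the file system without any user input. The function name `intrinsic_metadata` and the file name `contract.docx` are made up for this sketch. The last-modified time stands in for the creation date, because a true creation time is not available the same way on every platform.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def intrinsic_metadata(path: Path) -> dict:
    """Collect metadata the file system already knows about a file."""
    info = path.stat()
    return {
        "title": path.stem,                       # file name without extension
        "file_format": path.suffix.lstrip("."),   # e.g. "docx"
        "file_size": info.st_size,                # size in bytes
        # st_mtime is the last-modified time; it only approximates
        # the "date of creation" for a file that has not been edited.
        "date_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


# Demonstration on a throwaway file (not a real Word document).
with tempfile.TemporaryDirectory() as folder:
    contract = Path(folder) / "contract.docx"
    contract.write_bytes(b"placeholder")
    print(intrinsic_metadata(contract))
```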


Declare a Record

▪ The document is declared as a record later

▪ As part of the process, additional metadata is

added to the record depending on what type

of record it is, whether it is a vital record, its

location, etc.


Perspectives on Metadata

▪ The act of entering metadata values is often

called “Indexing”

▪ Different views and perspectives on records

management metadata

▪ The business perspective: metadata support

business processes;

▪ The records management perspective:

metadata support records management over

time

▪ The user view: metadata enable the retrieval

and support understanding and interpretation of

records.

Manual Metadata Entry

▪ The most obvious method

▪ Users enter metadata for the record: paper

or electronic

▪ Application may prescribe types and/or

values of metadata

▪ Controlled vocabularies

▪ Data masks

▪ Expensive!


Controlled Vocabularies

▪ One of the challenges with metadata entry and

usage:

▪ Ensuring enough freedom to users that the system is

usable

▪ Putting a control framework in place that ensures the

metadata is manageable

▪ Ensure that users use metadata fields and terms

consistently

▪ Controlled vocabularies: supporting tools which

should be based on collections of terms which

users use to describe aspects of a record, other

than its business context.
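One possible way to enforce a controlled vocabulary at data entry is sketched below. The terms and synonyms are hypothetical; the point is only that free-text entries are mapped to a preferred term or rejected.

```python
# Hypothetical controlled vocabulary: preferred terms and accepted synonyms.
PREFERRED_TERMS = {"contract", "invoice", "purchase order"}
SYNONYMS = {"agreement": "contract", "bill": "invoice", "po": "purchase order"}


def normalize_term(entered: str) -> str:
    """Map a user's entry to a preferred term, or raise if it is not in the vocabulary."""
    term = " ".join(entered.lower().split())   # trim and collapse whitespace
    term = SYNONYMS.get(term, term)            # replace a synonym with its preferred term
    if term not in PREFERRED_TERMS:
        raise ValueError(f"{entered!r} is not in the controlled vocabulary")
    return term


print(normalize_term("  Bill "))          # invoice
print(normalize_term("Purchase  Order"))  # purchase order
```

Accepting synonyms gives users some freedom in what they type while the stored value stays consistent.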

Extraction from the Record

▪ Intrinsic metadata from the properties

▪ Recognition technologies

▪ OCR/IR

▪ Barcode recognition

▪ QR code

▪ Specialized recognition technologies

▪ Recognizing and extracting data from audio,

video, and other rich media types


Metadata from other data sources

▪ The idea is that the data is already stored in

a database somewhere

▪ Stored in a logical structure

▪ Normalized and deduplicated

▪ The organization would do well to reuse that

data

▪ Manual or automated

Case Study

▪ An organization scanned project-related paper

documentation including contracts and

invoices.

▪ As part of this process the organization

performed data entry on more than 20 index

fields.

▪ The large number of fields and amount of

scanning being done caused them to look at

ways to automate the metadata capture

process.

▪ They decided on the following approach:

The Solution


The Result

▪ The organization went from approximately a 2%

error and rework rate to less than 0.01%

▪ Resulting in savings of more than $800,000 the

first year.

▪ Not bad for a system that cost less than $75,000

to acquire, integrate, and implement.


Automated Metadata Collection

▪ Software can help

▪ Document templates can contain code to

capture metadata

▪ Templates can also contain “bookmarks”,

“fields” and other features to “grab” metadata

▪ Software can also be used to look-up details

of the user

▪ E.g., an LDAP (Lightweight Directory Access Protocol) lookup can capture certain details, such as the current user’s name, job title, department, etc.

▪ An ERM system should automate the

capture of metadata values as much as

possible
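The directory lookup can be sketched as follows. A plain dictionary stands in for the directory service here; a real system would query LDAP instead. The user `jdoe`, the attribute names, and the function `auto_metadata` are all assumptions for illustration.

```python
from datetime import date

# Stand-in for a directory service; a real system would query LDAP instead.
DIRECTORY = {
    "jdoe": {"name": "J. Doe", "job_title": "Records Analyst", "department": "IT"},
}


def auto_metadata(username: str, today: date) -> dict:
    """Fill in metadata values the user should not have to type."""
    person = DIRECTORY.get(username, {})
    return {
        "creator": person.get("name", username),   # fall back to the login name
        "job_title": person.get("job_title"),
        "department": person.get("department"),
        "date_captured": today.isoformat(),
    }


print(auto_metadata("jdoe", date(2024, 3, 1)))
print(auto_metadata("guest", date(2024, 3, 1)))   # unknown user: only the login name is known
```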

Metadata for Disposed Records

▪ MoReq2 changed its terminology from

‘retention schedule’ to ‘retention and

disposition schedule’

▪ Some metadata will remain after the records

are destroyed.

▪ Additional metadata might be created

▪ Destruction_date

▪ “Reviewer” or other similar fields
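A minimal sketch of the idea that a metadata stub outlives the record content. Which elements are kept is an assumption here (`KEPT_AFTER_DESTRUCTION` is a hypothetical choice); `Destruction_date` and `Reviewer` follow the field names above.

```python
from datetime import date

KEPT_AFTER_DESTRUCTION = ("record_id", "title", "record_type")  # hypothetical choice


def dispose(metadata: dict, reviewer: str, when: date) -> dict:
    """Return the metadata stub that survives once the record content is destroyed."""
    stub = {key: metadata[key] for key in KEPT_AFTER_DESTRUCTION if key in metadata}
    stub["Destruction_date"] = when.isoformat()   # created at disposal time
    stub["Reviewer"] = reviewer                   # who approved the destruction
    return stub


stub = dispose(
    {"record_id": "R-42", "title": "Old invoice", "record_type": "invoice",
     "location": "Shelf B4"},
    reviewer="Records Manager",
    when=date(2024, 6, 30),
)
print(stub)
```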


Document Metadata: What is

Hidden?


So what’s the problem?

▪ Information Security Policy relates to

employees’ duty to protect the company’s

confidential information and IP.

▪ We can’t say we have control over this if we

don’t implement education and technology to

protect against it.


Education: What employees must

consider


Risk of Data Leaking

▪ Comments

▪ Hidden columns/rows

▪ Previous authors

▪ Track changes

▪ Versions

▪ Redacted text

▪ Reviewers

▪ Footnotes

▪ Small text
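To see how author names can travel with a file, the sketch below reads two properties from a .docx package. A .docx file is a ZIP archive, and Word conventionally stores core properties in `docProps/core.xml`. The function name `hidden_authors` is made up, and the demonstration builds a tiny stand-in package in memory rather than opening a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def hidden_authors(docx_file) -> dict:
    """Read creator and last-modified-by from a .docx package's core properties."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", default="", namespaces=NS),
        "last_modified_by": root.findtext("cp:lastModifiedBy", default="", namespaces=NS),
    }


# Build a minimal stand-in package in memory (illustrative values only).
core_xml = (
    f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
    "<dc:creator>First Author</dc:creator>"
    "<cp:lastModifiedBy>Second Editor</cp:lastModifiedBy>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("docProps/core.xml", core_xml)

print(hidden_authors(buffer))
```

This covers only the author properties; comments, tracked changes, and hidden rows live elsewhere in the package and need their own checks.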

Technology: Implement to enforce

policy


CLASSIFICATION


Overall Requirements

▪ Compliance

▪ Record capture

▪ Classification scheme

▪ Authenticity

▪ Audit trail

▪ Metadata

▪ Retention and disposal

▪ Access and use

▪ Documentation

▪ System Testing

▪ Non-electronic record handling

Compliance

▪ The system must manage and control

electronic records according to the standards

for compliance and the requirements for

legal admissibility and security, and must be

capable of demonstrating this compliance

▪ Compliance regulations vary from industry to

industry

▪ Legislation and regulation

▪ Industry standards

What is Classification?

▪ Simply put, grouping information together

▪ Think about how you’ve structured the files that

you’ve got on your computer into folders and

why you’ve done it that way.

▪ Formal definition from ISO 15489:

▪ “The systematic identification and arrangement

of business activities and/or records into

categories according to logically structured

conventions, methods, and procedural rules

represented in a classification system.”

Why Important?

▪ Two main approaches for effective access to

stored information:

▪ Classification: 'Aggregate and organize'

▪ Search engine: 'Find by raw power'

▪ We want to be able to find all records related

to a particular topic, project, or client.

▪ Nobody wants to look at a “My Documents”

folder with 1,000 documents in it

▪ Easier to manage files and records in groups

▪ It is what we as a species do.

Why Important? - Identify Your

“Toxic” Data


Benefits of an Effective

Classification

▪ Providing linkages between individual records, which accumulate to provide a continuous record of an activity

▪ Ensure that records are named in a

consistent manner over time

▪ Assist in the retrieval of all records relating to

a particular function, topic or activity.

▪ Determining security protection and access

appropriate for sets of records

Benefits of Classification, Cont’d

▪ Allocating user permissions for access to, or

action on, particular groups of records

▪ Distributing responsibility for the

management of particular sets of records

▪ Distributing records for action

▪ Determining appropriate retention periods

and disposition actions for records more

easily


Source: ISO 15489

For the “Data User” – has to be

obvious


Classification Scheme

▪ From ISO 11179-1:

▪ “Descriptive information for an arrangement or

division of objects into groups based on

characteristics, which the objects have in

common”

▪ Any structure an organization uses for

organizing, accessing, retrieving, storing &

managing its information.

▪ A business classification scheme (BCS) is a

classification scheme which is based on an

organization's business functions & activities
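A business classification scheme can be pictured as a small hierarchy running from function to activity to record class. The sketch below uses invented functions and activities; the three-level shape is an assumption, since real schemes vary in depth.

```python
# Hypothetical business classification scheme: function -> activity -> record class.
BCS = {
    "Finance": {
        "Accounts Payable": ["Invoices", "Payment Approvals"],
        "Budgeting": ["Annual Budgets"],
    },
    "Human Resources": {
        "Recruitment": ["Job Postings", "Applications"],
    },
}


def classify(function: str, activity: str, record_class: str) -> str:
    """Return a classification path, or raise if it is not part of the scheme."""
    if record_class not in BCS.get(function, {}).get(activity, []):
        raise KeyError(f"not in scheme: {function}/{activity}/{record_class}")
    return f"{function}/{activity}/{record_class}"


def classes_under(function: str) -> list[str]:
    """List every classification path below one business function."""
    return [
        f"{function}/{activity}/{record_class}"
        for activity, classes in BCS.get(function, {}).items()
        for record_class in classes
    ]


print(classify("Finance", "Accounts Payable", "Invoices"))
print(classes_under("Human Resources"))
```

Because every record carries a path, rules such as access permissions or retention periods can be attached once to a node and inherited by everything beneath it.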

Awareness Programs


Organizational Taxonomy

▪ The most extensive and most complex

classification structure

▪ Enterprise-wide, including all departments

and functions and all information

repositories.

▪ There are prebuilt taxonomies available for a

number of different vertical industries.


Authenticity

▪ To be authentic in archives and records management, a record must be genuine, or be “what it claims to be”

▪ In order to trust that a record is authentic, the user must be assured that the systems that create, capture, and manage electronic records maintain inviolate records that are protected from accidental or unauthorized alteration and from deletion while the record still has value

Audit Trails

▪ A system audit trail is a record that tracks operations performed on the system

▪ The audit trail documents the activities performed on records and their metadata from creation to disposal

▪ The audit trail typically documents the activities of creation, migration and other preservation activities, transfers or the movement of records, modification, deletion, defining access, and usage history

▪ The system must automatically capture the audit trail

▪ The audit trail data must be unalterable

▪ The audit trail must be logically linked to the records it documents, so that users can review audit information when they retrieve records
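One way to approximate an "unalterable" audit trail in software is to chain entries by hash, so that a later edit becomes detectable. This is a technique chosen for illustration and is not prescribed by the requirements above; the class `AuditTrail` and its field names are assumptions.

```python
import hashlib
import json


class AuditTrail:
    """Append-only log in which each entry's hash covers the previous entry."""

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, record_id: str, action: str, user: str, when: str) -> None:
        """Capture one event; called by the system, not typed in by users."""
        previous = self._entries[-1]["hash"] if self._entries else ""
        body = {"record_id": record_id, "action": action, "user": user,
                "when": when, "previous": previous}
        self._entries.append({**body, "hash": self._digest(body)})

    def for_record(self, record_id: str) -> list:
        """Entries linked to one record, for review when it is retrieved."""
        return [e for e in self._entries if e["record_id"] == record_id]

    def verify(self) -> bool:
        """Recompute the chain; an edited entry, or one removed from the middle, breaks it."""
        previous = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["previous"] != previous or entry["hash"] != self._digest(body):
                return False
            previous = entry["hash"]
        return True


trail = AuditTrail()
trail.record("R-42", "created", "jdoe", "2024-03-01T09:00:00")
trail.record("R-42", "metadata modified", "asmith", "2024-03-02T14:30:00")
print(trail.verify())                 # True
print(len(trail.for_record("R-42")))  # 2
```

A hash chain makes tampering evident rather than impossible, so it complements access controls instead of replacing them.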

Demonstrating Compliance


Track Key Trends


Security and Control

▪ The system must allow only authorized personnel to create, capture, update or purge records, metadata associated with records, files of records, classes in classification schemes, and retention schedules

▪ The system must control access to the records according to well-defined criteria

Records Retention Schedule

▪ “A comprehensive instruction covering the

disposition of records to assure that they are

retained for as long as necessary based on

their administrative, fiscal, legal and historic

value.”

Source: UN ARMS


Retention and Disposition

▪ Specify periods of time an organization must

retain a document, based on the content of a

document and how it is used.

▪ Present a list of Record Series, categories of

documents with very similar purpose, and a

period of time an organization must retain the

records in each series.

▪ The system must provide for the automated

destruction of records in accordance with

authorized and approved records retention

schedules
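The link between record series, retention periods, and automated destruction can be sketched as a date calculation. The series names and periods in `RETENTION_YEARS` are invented; real periods come from an approved retention schedule.

```python
from datetime import date

# Hypothetical retention schedule: record series -> years to retain.
RETENTION_YEARS = {"Invoices": 7, "Job Applications": 2}


def disposal_date(series: str, closed_on: date) -> date:
    """Date on which a record in this series becomes eligible for destruction."""
    years = RETENTION_YEARS[series]
    try:
        return closed_on.replace(year=closed_on.year + years)
    except ValueError:            # 29 February in a non-leap target year
        return closed_on.replace(year=closed_on.year + years, month=3, day=1)


def due_for_destruction(records: list, today: date) -> list:
    """IDs of records whose retention period has ended."""
    return [r["id"] for r in records
            if disposal_date(r["series"], r["closed_on"]) <= today]


records = [
    {"id": "R-1", "series": "Invoices", "closed_on": date(2015, 6, 30)},
    {"id": "R-2", "series": "Job Applications", "closed_on": date(2023, 5, 1)},
]
print(due_for_destruction(records, today=date(2024, 1, 1)))  # ['R-1']
```

In practice the list produced here would feed a review and approval step before anything is destroyed.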

Retention Schedule Format

▪ Often a database

▪ Enables people to update retention periods as regulations change

▪ Allows new record series to be added as the organization expands its operations into new functional areas

▪ Only records management staff and their

direct delegates should have access to the

retention schedule database


Preservation Strategies, Backups

and Recovery

▪ The system must incorporate a strategy or plan for backing up and preserving records

▪ The system must ensure that records, components of records, audit trails, metadata, links to metadata or to files, and classification schemes can be converted or migrated to new system hardware, software and storage media without loss of vital information

System Testing

▪ The performance and reliability of system

hardware and software must be regularly

tested

▪ Most important component is backups

▪ Computer components do and will fail

▪ Human errors will occur

▪ Upgrades are always a point of vulnerability

Non-Electronic Records

▪ The system must be capable of classifying and managing non-electronic, physical records and of managing electronic and non-electronic records in an integrated manner.

▪ The system must be able to classify, create and retrieve audit information and other metadata, and control access

▪ The system must be capable of defining security, and applying retention and disposal schedules for non-electronic records according to the same requirements that have been defined for electronic records


Systems Design

▪ Systems Development Lifecycle (SDLC)

▪ System concept: purpose, goals, scope

▪ Analysis: user/functional requirements

▪ Design

▪ data design: what information?

▪ software design: processed how?

▪ interface design: user interaction?

▪ Coding and testing: execute & evaluate

▪ Key issue: Systems do (only) what they’re designed to – purpose, goals, scope, requirements


Cost Benefits

▪ Costs up front to digitize your office

▪ Costs saved at “back end of process”

▪ No storage costs

▪ Can locate documents

▪ Ease in use of documents

▪ Saved paper costs (80% estimate)

▪ Immediate access to information

▪ Easy to reorganize information

▪ Better stewardship


Collaboration

▪ The need to collaborate electronically

▪ With employees

▪ With clients

▪ With partners

▪ With consultants

▪ With others, dependent on type of business

▪ Impact of globalization

▪ Need to reduce time frames for work


What Hardware Do We Need?

▪ High speed scanner:

▪ Key to successful conversion and operation

▪ Must be networked

▪ Must be able to save and send documents as

PDF

▪ Must be able to send documents directly to

email recipients

▪ Larger flat screen monitors (22 inches or

larger)

▪ Networked workstations

▪ Connected with the high speed scanner

Additional Hardware?

▪ Digital faxing

▪ Faxes arrive as emails and are sent as emails

▪ Get rid of the facsimile machine: use software to convert or use a service

▪ Tablet computers

▪ Pen input

▪ Capacitive touch: using your finger to scroll

▪ OneNote: electronic notebook

▪ Smart Pens

▪ Capture notes electronically

▪ Record audio

What Software Do We Need?

▪ Adobe Acrobat (Latest Version):

▪ Portable document format reader

▪ PDF editor

▪ Bookmarks and nesting

▪ OCR functions: Acrobat 8 Pro and later

▪ Text on image = searchable PDF

▪ Document security

▪ Removal of metadata (E.g., Document Inspector in

MS Office 2007 and later)

▪ Signatures

▪ “Locking”

What Software Do We Need?

▪ Office products:

▪ Word

▪ Excel

▪ Access

▪ PowerPoint

▪ Now include ability to make PDF documents

▪ Adobe released details of standard

▪ Acrobat add-ins:

▪ Autoink (http://www.evermap.com/autoink.asp)

▪ PDF Annotator (http://www.pdfannotator.com/en/)

▪ Xobni: email searching and organizing

CONVERTING MANUAL

RECORDS


Topics

▪ Scanning Historical Records

▪ OCR

▪ Barcoding

▪ On-site vs. Off-site

▪ Quality assurance


Digitization

▪ Definition:

▪ Converting written and printed information into

electronic form

▪ Creation of computerized version of a printed analog.

▪ Contents: text, image, audio, video, or a combination of these (multimedia)

Capture all data in one system

[Diagram: Product Overview. Labels: EDRMS; Centralized Document Repository; Email; Desktop Files; Faxes; Scanned Documents; Simultaneous & Restricted Access to Documents over the web]

Optical Character Recognition

(OCR)

▪ The recognition of printed or written text

characters by a computer

▪ involves analysis of the scanned-in image

▪ translation of the character image into character

codes, such as American Standard Code for

Information Interchange (ASCII)

▪ Being applied by libraries, businesses, &

government agencies

▪ to create text-searchable files for digital

collections


Imaging

▪ Scan paper, film to create electronic images

▪ Allows simultaneous access to records

▪ May reduce storage costs

▪ Can improve or enhance:

▪ Security

▪ Findability

Issues with Imaging

▪ Determine what to scan

▪ Records vs. non-records

▪ Backfile vs. day-forward

▪ Document preparation

▪ Indexing

▪ What to do with the originals


Reasons for Retaining Paper

Records

▪ Not cost effective to scan because of:

▪ Large volumes

▪ Document size

▪ Document condition

▪ Little activity

▪ Intrinsic value of record

▪ Legislation


OCR - Process

▪ The scanner or camera typically produced TIFF images, but PDF is now common

▪ The software cleans noise from the image and starts recognizing patterns

▪ Recognized patterns are converted into letters and numbers

▪ Unrecognized patterns are kept as images

Barcodes

▪ As far back as the 1960s, barcodes were used in

industrial work environments

▪ In the early 1970s, common barcodes started

appearing on grocery shelves

▪ To automate the process of identifying grocery

items, UPC barcodes were placed on products

Bar Code Systems

▪ An organization may determine a need to use a bar code system with Record Management Applications (RMA)

▪ Target: how to manage physical records or documents in a manner consistent with their electronic counterparts

▪ The app stores a digital object that represents the record

▪ The digital object contains metadata such as where the physical record is being stored, the appropriate disposition schedule for the record, etc.

▪ Users need a barcode to identify which physical records correspond to which digital objects
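The pairing of a physical record with its digital object can be sketched as a lookup keyed by the barcode value. The barcode `BOX-000123`, the metadata fields, and the function names are hypothetical.

```python
# Hypothetical digital objects standing in for physical records.
DIGITAL_OBJECTS = {
    "BOX-000123": {"title": "Signed contracts 2019", "location": "Shelf B4",
                   "disposition_schedule": "Contracts, 7 years"},
}


def scan(barcode: str) -> dict:
    """Look up the digital object that represents a scanned physical record."""
    try:
        return DIGITAL_OBJECTS[barcode.strip().upper()]
    except KeyError:
        raise LookupError(f"no digital object for barcode {barcode!r}") from None


def move(barcode: str, new_location: str) -> None:
    """Update the Location element when the physical item is moved."""
    scan(barcode)["location"] = new_location


print(scan("box-000123")["location"])   # Shelf B4
move("BOX-000123", "Off-site vault 2")
print(scan("BOX-000123")["location"])   # Off-site vault 2
```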

Storage of Records

▪ Records must be stored in such a way that they are

accessible and safeguarded against environmental

damage

▪ Typical paper documents may be stored in a filing

cabinet in an office

▪ Some organizations employ file rooms with specialized

environmental controls including temperature & humidity

▪ Vital records may need to be stored in a disaster-

resistant safe or vault to protect against fire, flood,

earthquakes and conflict

▪ In extreme cases, the item may require both disaster-

proofing and public access

▪ E.g. the original, signed US Constitution

Off-site vs On-Site Scanning

▪ Most on-going scanning and indexing is done on-site

▪ In addition to on-site storage of records, many

organizations operate their own off-site records

centers or contract with commercial records centers

▪ Off-site scanning often used to handle the backlog

as an EDRMS is initiated

▪ Often the value of the system is diminished if all the data

is not present

▪ Many companies specialize in scanning documents

▪ Scanning America

▪ archSCAN

Quality Assurance

▪ Concerned with assessing and ensuring that

▪ data is accurate and consistent

▪ the RMA is consistent with its requirements

▪ May include:

▪ requirements of protection for CIA (confidentiality, integrity, and availability)

▪ establishment of a robust assessment process

▪ use of third-party assessments

▪ contract performance measures/incentives

▪ use of regulatory and contract enforcement

authority, including civil, criminal, and financial

penalties

Quality Assurance

▪ May Also Include:

▪ customer review and approval

▪ Not addressed in US law except for credit reports

▪ Fair Credit Reporting Act

▪ EU: individuals have the right to access data

collected about them, to correct inaccurate or

incomplete data, and to have those corrections sent

to those who have received the data

▪ Directive 95/46/EC (European Privacy Directive)


Data in Document Management

▪ Much of a document management system’s

content is documents which have been

imaged

▪ Not searchable in their own right

▪ Two technologies available:

▪ Optical Character Recognition followed by full

text searching

▪ Indexing (addition of metadata) followed by

metadata searching
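The two routes to finding imaged documents can be compared side by side. In this sketch the OCR output is assumed to exist already as plain text, and the sample documents and index fields are invented.

```python
from collections import defaultdict

# Hypothetical imaged documents: OCR text plus index (metadata) fields.
DOCUMENTS = {
    1: {"ocr_text": "Invoice for network cabling services",
        "metadata": {"doc_type": "invoice", "client": "Acme"}},
    2: {"ocr_text": "Contract for cabling and maintenance",
        "metadata": {"doc_type": "contract", "client": "Acme"}},
}


def build_full_text_index(documents: dict) -> dict:
    """Map each lower-cased word in the OCR text to the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for word in doc["ocr_text"].lower().split():
            index[word].add(doc_id)
    return index


def full_text_search(index: dict, word: str) -> set:
    """Documents whose OCR text contains the word."""
    return set(index.get(word.lower(), set()))


def metadata_search(documents: dict, **criteria) -> set:
    """Documents whose index fields match every given criterion."""
    return {doc_id for doc_id, doc in documents.items()
            if all(doc["metadata"].get(k) == v for k, v in criteria.items())}


index = build_full_text_index(DOCUMENTS)
print(full_text_search(index, "cabling"))               # {1, 2}
print(metadata_search(DOCUMENTS, doc_type="contract"))  # {2}
```

Full-text search finds any word the OCR step recognized, while metadata search is more precise but only as good as the indexing.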


Copyright

▪ Traditional copyright laws apply

▪ Fair use: a copyright principle based on the belief that the public is entitled to freely use portions of copyrighted materials for purposes of commentary and criticism

▪ Unfortunately, if the copyright owner disagrees with your fair use interpretation, the dispute will have to be resolved by courts or arbitration

▪ The four factors for measuring fair use:

▪ the purpose and character of your use

▪ the nature of the copyrighted work

▪ the amount and substantiality of the portion taken

▪ the effect of the use upon the potential market

▪ Extended in the academic environment by the TEACH Act

Integrity Issues

▪ Need to guarantee the integrity of all documents

in the system

▪ System of record

▪ E-Discovery processes

▪ Version control

▪ Understanding when and why documents get changed

▪ Check-in/checkout functionality

▪ Only one person can be modifying a document at one time

▪ Access controls

▪ Who can upload documents to a repository
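Check-in/checkout and version control can be sketched together in a toy repository. The class `Repository` and its method names are assumptions for illustration, not the interface of any real document management product.

```python
class Repository:
    """Toy repository: one editor per document, new version on every check-in."""

    def __init__(self):
        self._versions = {}     # doc_id -> list of (user, content)
        self._checked_out = {}  # doc_id -> user holding the lock

    def upload(self, doc_id: str, user: str, content: str) -> None:
        """Store the first version of a document."""
        self._versions[doc_id] = [(user, content)]

    def check_out(self, doc_id: str, user: str) -> str:
        """Lock the document for one user and return its latest content."""
        holder = self._checked_out.get(doc_id)
        if holder is not None and holder != user:
            raise PermissionError(f"{doc_id} is checked out by {holder}")
        self._checked_out[doc_id] = user
        return self._versions[doc_id][-1][1]

    def check_in(self, doc_id: str, user: str, content: str) -> int:
        """Save a new version, release the lock, and return the version number."""
        if self._checked_out.get(doc_id) != user:
            raise PermissionError(f"{user} has not checked out {doc_id}")
        self._versions[doc_id].append((user, content))
        del self._checked_out[doc_id]
        return len(self._versions[doc_id])

    def history(self, doc_id: str) -> list:
        """Who saved each version, oldest first."""
        return [user for user, _ in self._versions[doc_id]]


repo = Repository()
repo.upload("contract-17", "jdoe", "draft")
repo.check_out("contract-17", "asmith")
print(repo.check_in("contract-17", "asmith", "draft with edits"))  # 2
print(repo.history("contract-17"))                                  # ['jdoe', 'asmith']
```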


Quiz 3 Terms

-Covers Units 5, 6, and 7

▪ Auditing

▪ Audit trail

▪ Authenticity

▪ Barcode

▪ BCS (Business Classification Scheme)

▪ Check-in and check-out

▪ Classification scheme

▪ Cloud computing

▪ Controlled vocabularies

▪ Data masks

▪ DoD 5015.2-STD

▪ DRM

▪ Indexing

▪ Metadata

▪ MoReq 2

▪ NARA

▪ Non-electronic record

▪ OA (Office Automation)

▪ OCR

▪ QA (Quality Assurance)

▪ Retention schedule

▪ RMA (Record Management Application)

▪ SDLC

▪ Types of metadata

▪ Version control