IT quiz
IT 380
Electronic Document and
Record Management
Systems
Unit 7 Metadata, Classification, and
Converting Manual Records
Instructor: Dr. Michelle Liu
METADATA
Topics
▪ Metadata concepts and standards
▪ Sources of metadata
▪ Applying metadata to records
▪ Automated metadata collection
Definition of Metadata
“Data describing context, content and structure of
records and their management through time.”
-Source: ISO 15489
“Initially, metadata defines the record at its point of
capture, fixing the record into its business context and
establishing management control over it.”
-Source: ISO 23081
“Structured information that describes, explains,
locates, or otherwise makes it easier to retrieve, use,
or manage an information resource. Metadata is often
called data about data or information about
information.”
-Source: NISO (National Information Standards Organization)
Metadata
Metadata will have “properties”
These describe the characteristics and rules
Where do you encounter metadata
every day?
Metadata
Where else do you use metadata
every day?
What is Metadata?
▪ Provides information about a document or record
▪ Machine understandable
▪ Has well defined semantics and structure
▪ May include descriptive information about the context,
quality, condition, or characteristics of the data
▪ Metadata provides context for data and is used to
facilitate the understanding, characteristics, and
usage of data
▪ All the objects managed by an ERM system have
metadata, including volumes, files, and classes.
▪ Also referred to as Tags or Attributes
▪ Adding consistent tags to an unstructured piece of
data gives it context
What is Metadata?, cont’d
▪ For a document or record metadata could be
its author, its title, the issue date, and other
information which can usefully be associated
with it
▪ Defined in terms of units called elements,
fields, index fields, or profile fields.
▪ Some fields also support sub-elements or
attributes to distinguish between types of
similar fields.
▪ E.g., Date_paid, Date_enrolled,
Date_graduated
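A minimal sketch of how elements and sub-elements might be held for one record. The values and the `get_element` helper are hypothetical, chosen only for illustration; the three date sub-elements are the ones named above.

```python
from datetime import date

# Metadata for one record: each key is an element (field).
# "Date" carries sub-elements that distinguish similar date fields.
record_metadata = {
    "Title": "Tuition invoice",        # hypothetical value
    "Author": "Registrar's Office",    # hypothetical value
    "Date": {
        "Date_paid": date(2024, 1, 15),
        "Date_enrolled": date(2023, 9, 1),
        "Date_graduated": None,        # not yet known
    },
}

def get_element(metadata, element, sub_element=None):
    """Return an element's value, or one sub-element of it."""
    value = metadata[element]
    if sub_element is not None:
        return value[sub_element]
    return value
```

For example, `get_element(record_metadata, "Date", "Date_paid")` returns only the payment date, while `get_element(record_metadata, "Date")` returns all three date sub-elements.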
Why Metadata?
▪ Supporting efficient retrieval
▪ Providing logical links between records and the
context of their creation, and maintaining them
in a structured, reliable and meaningful way
▪ Supporting the identification of the technological
environment used to create the record and
needed to access it
▪ Supporting efficient and successful migration of
records from one environment or computer
platform to another or any other preservation
strategy.
Why Metadata?
▪ Protecting records as evidence and ensuring
their accessibility and usability through time
▪ Facilitating the ability to understand records
▪ Supporting and ensuring the evidential value
of records
▪ Helping to ensure the authenticity, reliability,
and integrity of records
▪ Supporting and managing access, privacy,
and rights
Purpose of Metadata
▪ Identification / Distinction [Title, Date, Publisher, Version, Type, etc.]
▪ Search / Retrieve / Browse [Country, Region, Subject, Sector, Theme, Topic, etc.]
▪ Use Management [Authorized By, Rights Management, Access Rights, Location, Disclosure Status, etc.]
▪ Compliant Document Management [Record Identifier, Retention Schedule, Relation, Disposal Status, etc.]
Types of Metadata
Source: ISO 23081-1
Image source: http://archives.govt.nz/advice/continuum-resource-kit/continuum-publications-html/g14-technical-guide-implementing-recordkee
Metadata about the Record
▪ Provides minimal, but essential information
about an item or record
▪ May come directly from the file itself
▪ Examples: Author, Contributor, Creator,
Date, Identifier, Status, Title, or Document
Type.
Metadata about business rules,
policies, and mandates
▪ For “discovery” and context
▪ Making future search and retrieval
operations more effective
▪ More precise and targeted searches against
a range of criteria
▪ Facilitates collection of items into “virtual
files” that satisfy certain criteria
▪ Examples: Audience, Accessibility,
Addressee, Coverage, Description,
Language, Location, Publisher, Relation, or
Source
Metadata about Agents or Users
▪ Who created the record
▪ Who indexed it
▪ Who has accessed it to perform record-related
functions
▪ Who has extracted it
▪ Examples: Indexer, Scanner, User, Manager,
Owner, or Reviewer
Administrative Metadata
▪ About business activities or processes
▪ Provide general information on how to
manage a resource
▪ Often includes technical metadata intrinsic to
the file or its location
▪ Examples: File Type, File Date,
Compression Type, or Access Level
Metadata about Records
Management Activities
▪ Subset of administrative metadata that is
more specifically designed to support
records management activities
▪ Usually of most relevance to an
organization’s records managers.
▪ Actions such as changing metadata entries
for security, disposal and preservation are
the domain of records administration
▪ Examples: Aggregation, Digital signature,
Disposal, Mandate, Preservation, Rights
Business-specific Metadata
▪ It may be necessary to add additional
elements or refinements to meet particular
business needs.
▪ Any new elements need to be carefully
defined and specified so that the user
community understands their purpose
▪ Constrain the choice of values to a ‘pick-list’
or pre-determined encoding scheme for the
information to be held, to ensure consistency
of use.
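Constraining an element to a pick-list can be sketched as follows. The element ("Document Type") and its allowed values are hypothetical; a real system would load them from its own configuration.

```python
# Hypothetical pick-list for a "Document Type" element.
DOCUMENT_TYPES = {"Contract", "Invoice", "Memo", "Report"}

def validate_pick_list(value, allowed):
    """Accept a value only if it appears in the pick-list."""
    if value not in allowed:
        raise ValueError(
            f"{value!r} is not an allowed value; choose one of {sorted(allowed)}"
        )
    return value
```

Rejecting free-text variants such as "invoice" or "Inv." at entry time is what keeps later searches on the element consistent.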
Examples of Business-Specific
Metadata
▪ Case 1: A business organization determines
that existing date-related metadata is
confusing because all fields use the same
date. It decides to create and use refined
elements including “date.invoicepaid”,
“date.billed”, and “date.projcomplete”.
▪ Case 2: A university decides to introduce a
Student element to differentiate records of
staff from student records.
Mandatory vs. Optional Metadata
▪ Trade-off between too many mandatory
elements and too few
▪ Why?
▪ Records Management staff often want as
much mandatory metadata as possible to
achieve a metadata-rich repository.
▪ At odds with the user community, who are often
not prepared to spend the time and effort
entering it!
▪ A sensible compromise is necessary, as well
as the use of automation techniques
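One way to picture the compromise is a short mandatory list that is enforced at capture and a longer optional list that is not. The element names chosen for each set are assumptions for illustration only.

```python
MANDATORY = {"Title", "Author", "Date"}      # hypothetical choice
OPTIONAL = {"Description", "Language"}       # hypothetical choice

def missing_mandatory(metadata):
    """Return the mandatory elements that are absent or empty, sorted."""
    return sorted(e for e in MANDATORY if not metadata.get(e))

def can_capture(metadata):
    """A record is accepted only when no mandatory element is missing."""
    return not missing_mandatory(metadata)
```

Every element moved from `OPTIONAL` to `MANDATORY` improves the repository but adds one more value users must enter before `can_capture` lets the record in.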
Metadata for Physical Objects
▪ Metadata entries in an ERM system can be
made not only for electronic records, but also
for physical objects
▪ A library of audio tapes, CD-ROMs containing
experimental data, or other collections of
physical objects
▪ Physical objects need a Location metadata
element if the objects can be moved about.
▪ Record-keeping information and actions can
apply to physical objects just as with
electronic items
▪ E.g., Retention periods and disposal
instructions
Sources of Metadata
▪ The record itself can provide much
metadata
▪ When the record is first created, it will have
its own intrinsic metadata relating to the
digital file itself: file date, file size, and so
forth.
▪ The record can also carry metadata within
itself
▪ A Word document will often have a title, author,
etc.
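Intrinsic file metadata can be read without any user input. This sketch uses only the Python standard library; the dictionary keys are hypothetical names, not fields from any particular ERM product.

```python
import os
from datetime import datetime, timezone

def intrinsic_metadata(path):
    """Collect metadata the file system already holds for a file."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "file_format": os.path.splitext(path)[1].lstrip(".").lower(),
        "file_size_bytes": info.st_size,
        # Last-modified time; creation time is not portable across platforms.
        "file_date": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }
```

Note that the file format here is inferred from the extension alone, which is a simplification.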
Sources of Metadata
▪ Document metadata/properties:
▪ Most Office Automation (OA)
software packages can associate
metadata with documents as they
are created or received.
▪ Example: MS Office permits the
entry of document “Properties”
Sources of Metadata
▪ Manual data entry
▪ Users are expected or required to enter
information into profile fields
▪ Profile fields could be in the document
properties themselves
▪ They could be in an ERMS, ECMS, or
imaging application.
▪ Data entry staff vs. all users
How metadata works for an
Individual Record
▪ Suppose you work as an intern at a local
IT company. You need to create a Word file
to document the details of a contract. When
you first create the contract document, it
has some intrinsic metadata:
▪ Title
▪ File format
▪ File size
▪ Date of creation
Declare a Record
▪ The document is declared as a record later
▪ As part of the process, additional metadata is
added to the record depending on what type
of record it is, whether it is a vital record, its
location, etc.
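Declaring a record can be pictured as adding record-level metadata on top of what the document already has. The function and key names below are hypothetical, and `declared_at` is an extra element assumed for illustration.

```python
from datetime import datetime, timezone

def declare_record(document_metadata, record_type, vital, location):
    """Return a new dict: the document's metadata plus record-level metadata."""
    record = dict(document_metadata)   # keep the intrinsic metadata
    record.update({
        "record_type": record_type,
        "vital_record": vital,
        "location": location,
        "declared_at": datetime.now(timezone.utc),  # hypothetical extra element
    })
    return record
```

The original document metadata is copied rather than modified, so the pre-declaration state stays available for comparison.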
Perspectives on Metadata
▪ The act of entering metadata values is often
called “Indexing”
▪ Different views and perspectives on records
management metadata:
▪ The business perspective: metadata support
business processes;
▪ The records management perspective:
metadata support records management over
time
▪ The user view: metadata enable the retrieval
and support understanding and interpretation of
records.
Manual Metadata Entry
▪ The most obvious method
▪ Users enter metadata for the record: paper
or electronic
▪ Application may prescribe types and/or
values of metadata
▪ Controlled vocabularies
▪ Data masks
▪ Expensive!
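A data mask prescribes the shape a typed value must have. The sketch below expresses two masks as regular expressions; the field names and patterns are assumptions for illustration.

```python
import re

# Hypothetical masks: an ISO-style date and an invoice number "INV-" + 6 digits.
MASKS = {
    "Date_paid": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "Invoice_no": re.compile(r"INV-\d{6}"),
}

def matches_mask(field, value):
    """True if the whole value fits the mask defined for the field."""
    return MASKS[field].fullmatch(value) is not None
```

A mask checks shape only: "2024-99-99" fits the date mask even though it is not a real date, so masks complement rather than replace other validation.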
Controlled Vocabularies
▪ One of the challenges with metadata entry and
usage:
▪ Ensuring enough freedom to users that the system is
usable
▪ Putting a control framework in place that ensures the
metadata is manageable
▪ Ensure that users use metadata fields and terms
consistently
▪ Controlled vocabularies: supporting tools which
should be based on collections of terms which
users use to describe aspects of a record, other
than its business context.
Extraction from the Record
▪ Intrinsic metadata from the properties
▪ Recognition technologies
▪ OCR/IR
▪ Barcode recognition
▪ QR code
▪ Specialized recognition technologies
▪ Recognizing and extracting data from audio,
video, and other rich media types
Metadata from other data sources
▪ The idea is that the data is already stored in
a database somewhere
▪ Stored in a logical structure
▪ Normalized and deduplicated
▪ The organization would do well to reuse that
data
▪ Manual or automated
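Reusing existing data can be sketched as a lookup keyed on one value the user already has. An in-memory SQLite table stands in for the existing database; the table, columns, and sample row are all hypothetical.

```python
import sqlite3

# An in-memory table stands in for an existing line-of-business database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (project_no TEXT PRIMARY KEY, client TEXT, manager TEXT)"
)
conn.execute("INSERT INTO projects VALUES ('P-1001', 'Acme Corp', 'J. Rivera')")

def lookup_project_metadata(conn, project_no):
    """Fetch index values for a project so nobody has to re-key them."""
    row = conn.execute(
        "SELECT client, manager FROM projects WHERE project_no = ?",
        (project_no,),
    ).fetchone()
    if row is None:
        return None
    return {"project_no": project_no, "client": row[0], "manager": row[1]}
```

With this approach the indexer enters only the project number; the remaining fields are filled from data that has already been normalized and checked.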
Case Study
▪ An organization scanned project-related paper
documentation including contracts and
invoices.
▪ As part of this process the organization
performed data entry on more than 20 index
fields.
▪ The large number of fields and amount of
scanning being done caused them to look at
ways to automate the metadata capture
process.
▪ They decided on the following approach:
The Solution
The Result
▪ The organization went from an error and rework
rate of approximately 2% to less than 0.01%
▪ This resulted in savings of more than $800,000 in the
first year.
▪ Not bad for a system that cost less than $75,000
to acquire, integrate, and implement.
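The figures above can be put side by side with a little arithmetic. Because the case study gives bounds ("approximately", "more than", "less than"), each computed value is a rough lower bound rather than an exact result.

```python
# Figures from the case study; each is a bound, not an exact value.
error_rate_before = 2 / 100        # "approximately 2%"
error_rate_after = 0.01 / 100      # "less than 0.01%"
first_year_savings = 800_000       # "more than $800,000"
system_cost = 75_000               # "less than $75,000"

# How many times lower the error and rework rate became (at least).
error_reduction_factor = error_rate_before / error_rate_after

# First-year savings per dollar spent on the system (at least).
savings_per_dollar = first_year_savings / system_cost

# First-year savings left after paying for the system (at least).
net_first_year = first_year_savings - system_cost

print(f"Error rate reduced about {error_reduction_factor:.0f}-fold")
print(f"About ${savings_per_dollar:.2f} saved per $1 spent")
print(f"Net first-year benefit of at least ${net_first_year:,}")
```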
Automated Metadata Collection
▪ Software can help
▪ Document templates can contain code to
capture metadata
▪ Templates can also contain “bookmarks”,
“fields” and other features to “grab” metadata
▪ Software can also be used to look up details
of the user
▪ E.g., an LDAP (Lightweight Directory Access
Protocol) lookup can capture certain details, such as the
current user’s name, job title, department, etc.
▪ An ERM system should automate the
capture of metadata values as much as
possible
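A minimal sketch of automated capture. A plain dictionary stands in for the directory service; a real deployment would query LDAP instead, which is not shown here. The function name, keys, and sample entry are hypothetical.

```python
import getpass
from datetime import datetime, timezone

def auto_capture(directory, username=None):
    """Fill in metadata values without asking the user to type them."""
    # Default to the logged-in account name when none is supplied.
    user = username or getpass.getuser()
    # `directory` stands in for an LDAP lookup; here it is a plain dict.
    details = directory.get(user, {})
    return {
        "creator": user,
        "job_title": details.get("job_title"),
        "department": details.get("department"),
        "captured_at": datetime.now(timezone.utc),
    }

# Hypothetical directory entry, for illustration only.
sample_directory = {"jdoe": {"job_title": "Analyst", "department": "Finance"}}
```

Calling `auto_capture(sample_directory, "jdoe")` fills four fields that would otherwise need manual entry.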
Metadata for Disposed Records
▪ MoReq2 changed its terminology from
‘retention schedule’ to ‘retention and
disposition schedule’
▪ Some metadata will remain after the records
are destroyed.
▪ Additional metadata might be created
▪ Destruction_date
▪ “Reviewer” or other similar fields
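The idea of a metadata stub that outlives the record can be sketched as follows. Which elements survive destruction is a policy decision; the set used here is a hypothetical choice.

```python
from datetime import date

# Hypothetical choice of elements retained after destruction.
KEEP_AFTER_DESTRUCTION = {"identifier", "title", "record_series"}

def destroy_record(metadata, reviewer, destruction_date):
    """Return the metadata stub that remains once the content is destroyed."""
    stub = {k: v for k, v in metadata.items() if k in KEEP_AFTER_DESTRUCTION}
    stub["Destruction_date"] = destruction_date
    stub["Reviewer"] = reviewer
    return stub
```

The stub lets the organization show later that the record existed and was destroyed under an approved process.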
Document Metadata: What is
Hidden?
So what’s the problem?
▪ Information Security Policy relates to
employees’ duty to protect the company’s
confidential information and IP.
▪ We can’t say we have control over this if we
don’t implement education and technology to
protect against it.
Education: What employees must
consider
Risk of Data Leaking
▪ Comments
▪ Hidden columns/rows
▪ Previous authors
▪ Track changes
▪ Versions
▪ Redacted text
▪ Reviewers
▪ Footnotes
▪ Small text
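Some of this hidden data is easy to read programmatically. A .docx file is a ZIP package whose `docProps/core.xml` part stores core properties such as the creator and the last editor. This sketch reads only those two values; comments, tracked changes, and the other items listed above live elsewhere in the package and are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def hidden_authors(docx_file):
    """Report the creator and last editor stored inside a .docx package."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "last_modified_by": root.findtext("cp:lastModifiedBy", namespaces=NS),
    }
```

Running a check like this before a document leaves the organization shows what a recipient could learn about previous authors.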
Technology: Implement to enforce
policy
CLASSIFICATION
Overall Requirements
▪ Compliance
▪ Record capture
▪ Classification scheme
▪ Authenticity
▪ Audit trail
▪ Metadata
▪ Retention and disposal
▪ Access and use
▪ Documentation
▪ System Testing
▪ Non-electronic record handling
Compliance
▪ The system must manage and control
electronic records according to the standards
for compliance and the requirements for
legal admissibility and security, and must be
capable of demonstrating this compliance
▪ Compliance regulations vary from industry to
industry
▪ Legislation and regulation
▪ Industry standards
What is Classification?
▪ Simply put, grouping information together
▪ Think about how you’ve structured the files that
you’ve got on your computer into folders and
why you’ve done it that way.
▪ Formal definition from ISO 15489:
▪ “The systematic identification and arrangement
of business activities and/or records into
categories according to logically structured
conventions, methods, and procedural rules
represented in a classification system.”
Why Important?
▪ Two main approaches for effective access to
stored information:
▪ Classification: ‘Aggregate and organize’
▪ Search engine: ‘Find by raw power’
▪ We want to be able to find all records related
to a particular topic, project, or client.
▪ Nobody wants to look at a “My Documents”
folder with 1,000 documents in it
▪ Easier to manage files and records in groups
▪ It is what we as a species do.
Why Important? - Identify Your
“Toxic” Data
Benefits of an Effective
Classification
▪ Provides linkages between individual records,
which accumulate to provide a continuous record
of an activity
▪ Ensure that records are named in a
consistent manner over time
▪ Assist in the retrieval of all records relating to
a particular function, topic or activity.
▪ Determining security protection and access
appropriate for sets of records
Benefits of Classification, Cont’d
▪ Allocating user permissions for access to, or
action on, particular groups of records
▪ Distributing responsibility for the
management of particular sets of records
▪ Distributing records for action
▪ Determining appropriate retention periods
and disposition actions for records more
easily
Source: ISO 15489
For the “Data User” – has to be
obvious
Classification Scheme
▪ From ISO 11179-1:
▪ “Descriptive information for an arrangement or
division of objects into groups based on
characteristics, which the objects have in
common”
▪ Any structure an organization uses for
organizing, accessing, retrieving, storing &
managing its information.
▪ A business classification scheme (BCS) is a
classification scheme which is based on an
organization's business functions & activities
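A business classification scheme can be pictured as a small hierarchy of functions, activities, and record classes. The three-level layout and every name in it are hypothetical, meant only to show how a record gets a place in the scheme.

```python
# Hypothetical scheme: function -> activity -> record classes.
BCS = {
    "Finance": {
        "Accounts Payable": ["Invoices", "Payment Vouchers"],
        "Budgeting": ["Annual Budgets"],
    },
    "Human Resources": {
        "Recruitment": ["Job Postings", "Applications"],
    },
}

def classification_path(function, activity, record_class):
    """Return 'Function/Activity/Class' if the scheme defines it."""
    if record_class not in BCS.get(function, {}).get(activity, []):
        raise KeyError(f"Not in scheme: {function}/{activity}/{record_class}")
    return f"{function}/{activity}/{record_class}"
```

Because a path is accepted only when the scheme defines it, users cannot invent ad hoc categories at filing time.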
Awareness Programs
Organizational Taxonomy
▪ The most extensive and most complex
classification structure
▪ Enterprise-wide, including all departments
and functions and all information
repositories.
▪ There are prebuilt taxonomies available for a
number of different vertical industries.
Authenticity
▪ To be authentic in archives and records
management, a record must be genuine, or be “what it claims to be”
▪ In order to trust that a record is authentic, the user must be assured that the systems that create, capture, and manage electronic records maintain inviolate records that are protected from accidental or unauthorized alteration and from deletion while the record still has value
Audit Trails
▪ A system audit trail is a record that tracks operations performed on the system
▪ The audit trail documents the activities performed on records and their metadata from creation to disposal
▪ The audit trail typically documents the activities of creation, migration and other preservation activities, transfers or the movement of records, modification, deletion, defining access, and usage history
▪ The system must automatically capture the audit trail
▪ The audit trail data must be unalterable
▪ The audit trail must be logically linked to the records it documents, so that users can review audit information when they retrieve records
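One common way to make tampering with an audit trail detectable is to chain entries together with hashes. The sketch below is an illustrative assumption, not a technique prescribed by these requirements; the class and field names are hypothetical. It detects alteration rather than preventing it.

```python
import hashlib
import json

class AuditTrail:
    """Append-only log; each entry includes the hash of the previous entry."""

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(entry):
        # Hash every field except the stored hash itself.
        body = {k: v for k, v in entry.items() if k != "hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, record_id, action, user, timestamp):
        """Capture one activity performed on a record."""
        previous = self._entries[-1]["hash"] if self._entries else ""
        entry = {"record_id": record_id, "action": action,
                 "user": user, "timestamp": timestamp, "previous": previous}
        entry["hash"] = self._digest(entry)
        self._entries.append(entry)

    def for_record(self, record_id):
        """Entries linked to one record, for review at retrieval time."""
        return [e for e in self._entries if e["record_id"] == record_id]

    def verify(self):
        """True if no entry has been altered or removed from the middle."""
        previous = ""
        for entry in self._entries:
            if entry["previous"] != previous or entry["hash"] != self._digest(entry):
                return False
            previous = entry["hash"]
        return True
```

`for_record` covers the requirement that audit information be linked to the records it documents; `verify` gives a simple check that earlier entries have not been changed.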
Demonstrating Compliance
Track Key Trends
Security and Control
▪ The system must allow only authorized personnel to create, capture, update or purge records, metadata associated with records, files of records, classes in classification schemes, and retention schedules
▪ The system must control access to the records according to well-defined criteria
Records Retention Schedule
▪ “A comprehensive instruction covering the
disposition of records to assure that they are
retained for as long as necessary based on
their administrative, fiscal, legal and historic
value.”
Source: UN ARMS
Retention and Disposition
▪ Specify periods of time an organization must
retain a document, based on the content of a
document and how it is used.
▪ Present a list of Record Series, categories of
documents with very similar purpose, and a
period of time an organization must retain the
records in each series.
▪ The system must provide for the automated
destruction of records in accordance with
authorized and approved records retention
schedules
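A retention schedule lends itself to a simple calculation: record series plus trigger date gives the earliest destruction date. The series names and periods below are hypothetical, and the sketch only identifies candidates; actual destruction would still follow the organization's approval process.

```python
from datetime import date

# Hypothetical record series and retention periods, in years.
RETENTION_YEARS = {"Contracts": 7, "Invoices": 6, "Meeting Minutes": 3}

def disposal_date(record_series, trigger_date):
    """Earliest date the record may be destroyed under the schedule."""
    years = RETENTION_YEARS[record_series]
    try:
        return trigger_date.replace(year=trigger_date.year + years)
    except ValueError:  # 29 February with a non-leap target year
        return trigger_date.replace(year=trigger_date.year + years, day=28)

def due_for_destruction(records, today):
    """Identifiers of records whose retention period has ended."""
    return [r["id"] for r in records
            if disposal_date(r["series"], r["closed"]) <= today]
```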
Retention Schedule Format
▪ Often a database
▪ Enables people to make updates to
retention periods as regulations change
▪ Allows new record series to be added as the organization
expands its operations into new functional
areas
▪ Only records management staff and their
direct delegates should have access to the
retention schedule database
Preservation Strategies, Backups
and Recovery
▪ The system must incorporate a strategy or plan for backing up and preserving records
▪ The system must ensure that records, components of records, audit trails, metadata, links to metadata or to files, and classification schemes can be converted or migrated to new system hardware, software and storage media without loss of vital information
System Testing
▪ The performance and reliability of system
hardware and software must be regularly
tested
▪ Most important component is backups
▪ Computer components do and will fail
▪ Human errors will occur
▪ Upgrades are always a point of vulnerability
Non-Electronic Records
▪ The system must be capable of classifying and managing non-electronic, physical records and of managing electronic and non-electronic records in an integrated manner.
▪ The system must be able to classify, create and retrieve audit information and other metadata, and control access
▪ The system must be capable of defining security, and applying retention and disposal schedules for non-electronic records according to the same requirements that have been defined for electronic records
Systems Design
▪ Systems Development Lifecycle (SDLC)
▪ System concept: purpose, goals, scope
▪ Analysis: user/functional requirements
▪ Design
▪ data design: what information?
▪ software design: processed how?
▪ interface design: user interaction?
▪ Coding and testing: execute & evaluate
▪ Key issue: Systems do (only) what they’re designed to – purpose, goals, scope, requirements
Cost Benefits
▪ Costs up front to digitize your office
▪ Costs saved at “back end of process”
▪ No storage costs
▪ Can locate documents
▪ Ease in use of documents
▪ Saved paper costs (80% estimate)
▪ Immediate access to information
▪ Easy to reorganize information
▪ Better stewardship
Collaboration
▪ The need to collaborate electronically
▪ With employees
▪ With clients
▪ With partners
▪ With consultants
▪ With others, dependent on type of business
▪ Impact of globalization
▪ Need to reduce time frames for work
What Hardware Do We Need?
▪ High speed scanner:
▪ Key to successful conversion and operation
▪ Must be networked
▪ Must be able to save and send documents as
▪ Must be able to send documents directly to
email recipients
▪ Larger flat screen monitors (22 inches or
larger)
▪ Networked workstations
▪ Connected with the high speed scanner
Additional Hardware?
▪ Digital faxing
▪ Faxes arrive as emails and are sent as emails
▪ Get rid of the facsimile machine: use software to
convert or use a service
▪ Tablet computers
▪ Pen input
▪ Capacitive touch: using your finger to scroll
▪ OneNote: electronic notebook
▪ Smart Pens
▪ Capture notes electronically
▪ Record audio
What Software Do We Need?
▪ Adobe Acrobat (Latest Version):
▪ Portable document format reader
▪ PDF editor
▪ Bookmarks and nesting
▪ OCR functions: Acrobat 8 Pro and later
▪ Text on image = searchable PDF
▪ Document security
▪ Removal of metadata (E.g., Document Inspector in
MS Office 2007 and later)
▪ Signatures
▪ “Locking”
What Software Do We Need?
▪ Office products:
▪ Word
▪ Excel
▪ Access
▪ PowerPoint
▪ Now include ability to make PDF documents
▪ Adobe released details of standard
▪ Acrobat add-ins:
▪ Autoink (http://www.evermap.com/autoink.asp)
▪ PDF Annotator (http://www.pdfannotator.com/en/)
▪ Xobni: email searching and organizing
CONVERTING MANUAL
RECORDS
Topics
▪ Scanning Historical Records
▪ OCR
▪ Barcoding
▪ On-site vs. Off-site
▪ Quality assurance
Digitization
▪ Definition:
▪ Converting written and printed information into
electronic form
▪ Creation of a computerized version of a printed analog.
▪ Contents: text, image, audio, video, or a combination of these (multimedia)
Capture all data in one system
[Figure: Product Overview. Labels: EDRMS; Centralized Document Repository; Simultaneous & Restricted Access to Documents over the web; Desktop Files; Faxes; Scanned Documents]
Optical Character Recognition
(OCR)
▪ The recognition of printed or written text
characters by a computer
▪ involves analysis of the scanned-in image
▪ translation of the character image into character
codes, such as American Standard Code for
Information Interchange (ASCII)
▪ Being applied by libraries, businesses, &
government agencies
▪ to create text-searchable files for digital
collections
Imaging
▪ Scan paper, film to create electronic images
▪ Allows simultaneous access to records
▪ May reduce storage costs
▪ Can improve or enhance:
▪ Security
▪ Findability
Issues with Imaging
▪ Determine what to scan
▪ Records vs. non-records
▪ Backfile vs. day-forward
▪ Document preparation
▪ Indexing
▪ What to do with the originals
Reasons for Retaining Paper
Records
▪ Not cost effective to scan because of:
▪ Large volumes
▪ Document size
▪ Document condition
▪ Little activity
▪ Intrinsic value of record
▪ Legislation
OCR - Process
▪ The scanner or camera typically produced
TIFF images in the past, but PDF is now common
▪ The software cleans noise from the image
and starts recognizing patterns
▪ Recognized patterns are converted into letters and numbers
▪ Unrecognized patterns are kept as images
Barcodes
▪ As far back as the 1960s, barcodes were used in
industrial work environments
▪ In the early 1970s, common barcodes started
appearing on grocery shelves
▪ To automate the process of identifying grocery
items, UPC barcodes were placed on products
Bar Code Systems
▪ An organization may determine a need to use a
bar code system with Record Management
Applications (RMA)
▪ Goal: manage physical records or
documents in a manner consistent with their
electronic counterparts
▪ The application stores a digital object that
represents the record
▪ The digital object contains metadata such as where the physical record is
being stored, the appropriate disposition schedule for
the record, etc.
▪ Users need a barcode to identify which physical
records correspond to which digital objects
Storage of Records
▪ Records must be stored in such a way that they are
accessible and safeguarded against environmental
damage
▪ Typical paper documents may be stored in a filing
cabinet in an office
▪ Some organizations employ file rooms with specialized
environmental controls including temperature & humidity
▪ Vital records may need to be stored in a disaster-
resistant safe or vault to protect against fire, flood,
earthquakes and conflict
▪ In extreme cases, the item may require both disaster-
proofing and public access
▪ E.g. the original, signed US Constitution
Off-site vs On-Site Scanning
▪ Most on-going scanning and indexing is done on-site
▪ In addition to on-site storage of records, many
organizations operate their own off-site records
centers or contract with commercial records centers
▪ Off-site scanning often used to handle the backlog
as an EDRMS is initiated
▪ Often the value of the system is diminished if all the data
is not present
▪ Many companies specialize in scanning documents
▪ Scanning America
▪ archSCAN
Quality Assurance
▪ Concerned with assessing and ensuring that
▪ data is accurate and consistent
▪ the RMA is consistent with its requirements
▪ May include:
▪ requirements of protection for CIA (confidentiality, integrity, and availability)
▪ establishment of a robust assessment process
▪ use of third-party assessments
▪ contract performance measures/incentives
▪ use of regulatory and contract enforcement
authority, including civil, criminal, and financial
penalties
Quality Assurance
▪ May Also Include:
▪ customer review and approval
▪ Not addressed in US law except for credit reports
▪ Fair Credit Reporting Act
▪ EU: Individuals have the right to access data
collected about them, to correct inaccurate or
incomplete data, and to have those corrections sent
to those who have received the data
▪ Directive 95/46/EC (European Privacy Directive)
Data in Document Management
▪ Much of a document management system’s
content is documents which have been
imaged
▪ Not searchable in their own right
▪ Two technologies available:
▪ Optical Character Recognition followed by full
text searching
text searching
▪ Indexing (addition of metadata) followed by
metadata searching
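The two approaches can be contrasted with a small sketch. The sample documents, their OCR text, and the metadata values are invented for illustration; the OCR step itself is assumed to have already run.

```python
# Hypothetical imaged documents with OCR output and index metadata.
documents = [
    {"id": 1, "ocr_text": "Service contract between Acme and Globex",
     "metadata": {"Document Type": "Contract", "Client": "Acme"}},
    {"id": 2, "ocr_text": "Invoice 1042 for consulting services",
     "metadata": {"Document Type": "Invoice", "Client": "Globex"}},
]

def full_text_search(docs, term):
    """Match a term anywhere in the OCR output (case-insensitive)."""
    return [d["id"] for d in docs if term.lower() in d["ocr_text"].lower()]

def metadata_search(docs, criteria):
    """Match documents whose metadata equals every given criterion."""
    return [d["id"] for d in docs
            if all(d["metadata"].get(k) == v for k, v in criteria.items())]
```

Searching the text for "Globex" finds document 1, where the name merely appears, while a metadata search on Client = "Globex" finds document 2, which was indexed to that client. The two technologies answer different questions.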
Copyright
▪ Traditional copyright laws apply
▪ Fair use is a copyright principle based on the belief that the public is entitled to freely use portions of copyrighted materials for purposes of commentary and criticism
▪ Unfortunately, if the copyright owner disagrees with your fair use interpretation, the dispute will have to be resolved by courts or arbitration
▪ The four factors for measuring fair use:
▪ the purpose and character of your use
▪ the nature of the copyrighted work
▪ the amount and substantiality of the portion taken
▪ the effect of the use upon the potential market
▪ Extended in the academic environment by the TEACH Act
Integrity Issues
▪ Need to guarantee the integrity of all documents
in the system
▪ System of record
▪ E-Discovery processes
▪ Version control
▪ Understanding when and why documents get
changed
▪ Check-in/checkout functionality
▪ Only one person can be modifying a document at
one time
▪ Access controls
▪ Who can upload documents to a repository
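Check-in/checkout can be sketched as a lock held per document. The class and method names are hypothetical; real document management systems add versioning and time-outs on top of this basic idea.

```python
class Repository:
    """Tracks which user, if any, has each document checked out."""

    def __init__(self):
        self._checked_out = {}  # document id -> user holding it

    def check_out(self, doc_id, user):
        """Give `user` the exclusive right to modify the document."""
        holder = self._checked_out.get(doc_id)
        if holder is not None and holder != user:
            raise PermissionError(f"{doc_id} is checked out by {holder}")
        self._checked_out[doc_id] = user

    def check_in(self, doc_id, user):
        """Release the document; only the current holder may do so."""
        if self._checked_out.get(doc_id) != user:
            raise PermissionError(f"{user} does not hold {doc_id}")
        del self._checked_out[doc_id]
```

A second user who tries to check out the same document gets an error until the first user checks it back in, which is what ensures only one person modifies a document at a time.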
Quiz 3 Terms
-Covers Units 5, 6, and 7
▪ Auditing
▪ Audit trail
▪ Authenticity
▪ Barcode
▪ BCS (Business Classification Scheme)
▪ Check-in and check-out
▪ Classification scheme
▪ Cloud computing
▪ Controlled vocabularies
▪ Data masks
▪ DoD 5015.2-STD
▪ DRM
▪ Indexing
▪ Metadata
▪ MoReq2
▪ NARA
▪ Non-electronic record
▪ OA (Office Automation)
▪ OCR
▪ QA (Quality Assurance)
▪ Retention schedule
▪ RMA (Record Management Application)
▪ SDLC
▪ Types of metadata
▪ Version control