MODULE 3: QUALITY CONTROL AND QUALITY ASSURANCE

Terminology

Data Quality Terminology

Ensuring Data Quality

Errors Introduced During Encoding Process

Fixing Errors in Data

Edgematching and Rubber Sheeting

Geocoding Address Information

Re-Projection, Transformation, and Re-Scaling

Integrating the Database

Introduction

Why Quality Control (QC) And Quality

Assurance (QA) Are Important

GIGO (Garbage In, Garbage Out) is a common saying in computing circles

--meaning that, if you input poor-quality data, your results will also be poor

This is especially important in geographic

information systems --GIS programs allow the user to combine and

analyze spatial and non-spatial data in a

variety of ways:

--by combining maps

--through the use of buffer zones

--through the interpolation of point data sets, etc.

However, if your data sets have errors --these errors can be compounded further down the road

This can in turn lead to a final output that is

misleading for users --this is compounded by the observation that maps look

professional and authoritative

--this in turn can lead naïve users to believe what they see

Terminology

Data Quality refers to how good your data are

This can refer to the overall fitness or suitability of

data for a specific purpose

It can also be used to indicate whether data are free

from errors or other problems

Issues That Can Affect the Quality of Individual Data

Sets:

1) Error

2) Accuracy

3) Precision

4) Bias

1) Error refers to flaws in data --i.e. the physical difference between the real world and

the GIS facsimile of the Real World

2) Accuracy refers to the extent to which an

estimated data value approaches its true value

--100% accuracy is impossible to attain

--it is possible, however, to have an accuracy within

specified tolerances (.002” for example)

3) Precision refers to the recorded level of detail in the data

A planar coordinate that is specified as being precise to within three meters is more precise than a geographic coordinate that is specified as being precise to the nearest three decimal places (.001 degree of latitude would be approximately 111 meters)
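The 111-meter figure can be checked with a short calculation. The sketch below assumes a spherical Earth with a mean radius of 6,371 km (an approximation, not a value from the lecture); the function name is illustrative only.

```python
import math

# Assumption: spherical Earth with mean radius 6,371 km.
EARTH_RADIUS_M = 6_371_000

def latitude_precision_m(decimal_places: int) -> float:
    """Ground distance covered by the last recorded decimal place of a latitude in degrees."""
    metres_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360  # roughly 111 km
    return metres_per_degree * 10 ** (-decimal_places)

# Three decimal places of latitude span roughly 111 m on the ground,
# which is much coarser than a planar coordinate precise to 3 m.
print(round(latitude_precision_m(3), 1))
```

Under this approximation, about five decimal places of latitude are needed before the recorded detail is finer than three meters.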

It is important to remember, however, that precision is not the same as accuracy --a map’s precision may be high (precise to

.001 meters, for instance) --it could still contain inaccurate data

4) Bias refers to the systematic variation of data from reality

Bias tends to introduce a consistent error throughout a data set
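The difference between precision, accuracy, and bias can be illustrated with a handful of made-up survey readings. In this sketch the readings, the true value, and the function name are all hypothetical; the mean error stands in for bias, and the root-mean-square error (RMSE) is used as one common summary of accuracy.

```python
import math
import statistics

def bias_and_rmse(measured, true_value):
    """Return (mean error, root-mean-square error) of readings against a known true value."""
    errors = [m - true_value for m in measured]
    bias = statistics.mean(errors)                          # consistent offset from reality
    rmse = math.sqrt(statistics.mean([e * e for e in errors]))  # overall size of the error
    return bias, rmse

# Hypothetical readings recorded to .001 m (high precision),
# taken of a point whose true coordinate is 100.000 m.
readings = [103.001, 102.998, 103.002, 102.999]
bias, rmse = bias_and_rmse(readings, 100.000)
print(round(bias, 3), round(rmse, 3))
```

The readings are recorded to the millimeter, yet every one is about three meters off in the same direction: precise, biased, and therefore inaccurate.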

Issues That Affect the Representation of Features Within A Database

Includes: 1) Resolution 2) Generalization

1) Resolution describes the smallest feature in a data set that can be displayed or mapped

Resolution is dependent on:

a. The scale of the original map

b. The point size and line width used to represent features

c. The precision of the data acquisition method (i.e. digitizing board, scanner, gps unit, other survey instruments, etc.)

2) Generalization is the process of simplifying real-world complexities to produce maps

Generalization is, by definition, a subjective process

--a cartographer selectively removes real-world detail to make it understandable and attractive in map form

Generalization (Continued)

Generalization introduces several issues, of which

the GIS user needs to be made aware:

1) It can introduce positional inaccuracies

As map scale decreases, objects may be shifted to

depict their locations relative to one another

2) Generalization can cause area features

(features that cover a specified territory) to

be represented as point features

While a city will be represented by a

polygon (areal symbol) on a large-scale

map

--many small-scale maps may represent cities as

points

3) Line thicknesses are often exaggerated on smaller-scale maps

--at a 1:1,000,000 scale, a .3 mm line (smallest visible) would represent 300 meters on the ground

--I-10 (both lanes) is less than 100 meters wide
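The line-width arithmetic is simply the map distance multiplied by the scale denominator. A minimal sketch, with a hypothetical function name and the 100-meter road width taken as the upper bound quoted above:

```python
def ground_distance_m(map_distance_mm: float, scale_denominator: int) -> float:
    """Ground distance in meters represented by a distance measured on the map in millimeters."""
    return map_distance_mm * scale_denominator / 1000  # mm on the ground -> m

line_on_ground = ground_distance_m(0.3, 1_000_000)  # about 300 m
road_width_m = 100                                   # upper bound for I-10 from the notes
print(line_on_ground, line_on_ground / road_width_m)
```

At this scale the thinnest visible line is at least three times wider than the road it stands for, so the symbol necessarily exaggerates the feature.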

Data Quality Terminology

Other Factors That Affect Overall Data Quality

1) Completeness 2) Compatibility 3) Consistency 4) Applicability

1) Completeness

Complete data sets cover the entire study area and time period of interest

Incomplete data sets result from:

a. Missing attribute data

b. Dual sets of attributes for one polygon

c. Data that come from samples of a larger set

--samples are incomplete, by definition --sample completeness is determined by having a

sufficiently large sample size to adequately represent real-world variations in the data

2) Compatibility

Compatible datasets can be used together sensibly

Maps with 1:50,000 and 1:500,000 scale can be said to have incompatible scales --this is because a 1:500,000 scale map generally has a

higher degree of generalization

3) Consistency

Data sets need to be developed using similar

methods of data capture, storage, manipulation,

and editing

Inconsistency can be introduced by:

a. Obtaining different sections of the data from

different source documents

b. Different sections of the data being input by

different people

--Each person involved with data input

introduces different types of bias

--Input devices (gps devices, scanners,

digitizing tablets/tables) have different

input accuracies

4) Applicability

a. Is the data suitable for the type of analysis

that the user wants to conduct?

b. Is the type of analysis that the user intends to

conduct appropriate to the data set that is

being used?

Digitizing Table/PC Combination

Large Format Scanner

Ensuring Data Quality

When data is input into a GIS database, it is

bound to contain errors --it is therefore important to manage the data’s quality

It is best to intercept errors during the early

stages --i.e. during and after data input

Otherwise, these data errors can contaminate the

GIS database --these errors are likely to become compounded later in

the project

There Are Three Main Sources of Input Errors:

1) Errors in the Source Data

Errors in the source data are often difficult to

identify

They may be due to several factors:

a. Due to printing or copying errors

b. Due to surveying methods

c. Due to data input errors in the field

2) Encoding Errors

Digitizing (see reading supplement) can

introduce various errors into a spatial dataset

COMMON DIGITIZING ERRORS

Error: Description

Missing entities: Missing points, lines or boundary segments

Duplicate entities: Points, lines or boundary segments that have been digitized twice

Mislocated entities: Points, lines or boundary segments that have been digitized in the wrong place

Missing labels: Unidentified polygons

Duplicate labels: Two or more identification labels for the same polygon

Digitizing artifacts: Undershoots, overshoots, wrongly placed nodes, loops, and spikes

Noise: Irrelevant data entered during digitizing, scanning, or data transfer

Errors Introduced

During Encoding Process

Certain types of error often help to identify

other problems with encoded data

Example: Dangling, or “dead end” nodes

may indicate missing lines, overshoots, or

undershoots

Looking for patterns of this type can help to

direct editing, and speed up the process
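A dangling node check can be sketched by counting how many line ends meet at each coordinate. This is a simplified illustration, not how any particular GIS package does it: it assumes lines are lists of (x, y) tuples and that connected lines share exactly identical end coordinates.

```python
from collections import Counter

def find_dangling_nodes(lines):
    """Return end nodes that are touched by exactly one line end."""
    counts = Counter()
    for line in lines:
        counts[line[0]] += 1    # start node
        counts[line[-1]] += 1   # end node
    return sorted(node for node, n in counts.items() if n == 1)

# A closed square plus one short stub that overshoots from a corner.
lines = [
    [(0, 0), (1, 0)],
    [(1, 0), (1, 1)],
    [(1, 1), (0, 1)],
    [(0, 1), (0, 0)],
    [(1, 1), (1.5, 1.5)],   # overshoot
]
print(find_dangling_nodes(lines))  # [(1.5, 1.5)]
```

An undershoot shows up the same way: the line stops short of its neighbor, so its end node is counted only once.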

Most GIS packages provide sets of tools that

can identify and remove vector data errors

Although the user can allow the GIS to automatically correct errors, visual comparison against the source document is still worthwhile:

--it helps to reveal obvious omissions, duplications, and erroneous additions often missed by autocorrection

--automatic correction can also "correct" non-existent errors if tolerances are set incorrectly, causing the user to lose data

Examples of Spatial Error in Vector Data

Encoding Errors in Raster Data

Missing entities and noise are also common

errors associated with encoding raster data

Environmental or cultural obstacles can

result in missing entities in raster data, due

to:

1) Difficulties obtaining data near airports

2) High cloud cover can result in missing

vegetation data during the rainy season

with certain types of sensors

Noise

Nonsensical data that is inadvertently added

during collection or processing

Example: The user may find an individual

pixel representing water in a forest, despite

ground truthing that finds no water present

in this area

Fixing Errors in Data

FILTERING

Used with Raster Data --although it is used to remove noise in raster data

--filtering is generally considered to be an analysis

technique

Filtering involves passing a filter (a small grid of

pixels: 3 x 3 is a common filter) over a noisy data set --then the value of the central pixel is recalculated as a

function of all pixel values within the filter

Filters are most commonly used for:

1) Enhancing line and edge detail to produce a

sharper, less blurry image

2) Selectively smoothing the image to remove

spurious values that were introduced by the

imaging system or other raster creation process

Filters are generally used to improve or enhance a

raster image in some specific way --so that an analyst can more readily interpret the image

NOTE: Use Raster Filters with Caution

--the user can lose genuine features if too

large a filter is used

Example Filter Window (3x3)
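A 3 x 3 filter pass can be sketched in a few lines. The lecture only says the central pixel is recalculated as a function of the values in the window; the median used here is one common choice for noise removal, picked for illustration. Edge pixels are left unchanged for simplicity, and the class codes (1 = forest, 9 = water) are made up.

```python
import statistics

def median_filter_3x3(grid):
    """Return a copy of grid with each interior pixel replaced by the median of its 3 x 3 window."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]          # copy, so the input is not modified
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [grid[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = statistics.median(window)
    return out

# Hypothetical 5 x 5 raster: forest everywhere, one stray water pixel in the middle.
forest = [[1] * 5 for _ in range(5)]
forest[2][2] = 9
cleaned = median_filter_3x3(forest)
print(cleaned[2][2])  # 1
```

The same pass would also erase a genuine feature that is only one pixel wide, such as a narrow stream, which is why filters need to be used with caution.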

Edgematching and Rubbersheeting

Edgematching

When a study area extends over more than one map sheet --small differences or mismatches often occur between

adjacent sheets --also caused by the datum changes in the North

American Datum since 1983

The maps need to be edgematched to ensure contiguity across boundaries --this is done by joining adjacent lines that are

mismatched across map boundaries --or by finishing incomplete polygons so they match

across former map sheet boundaries
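One way to picture the joining step is to pair line ends that nearly meet across the sheet boundary and move both to a shared point. The sketch below is an assumption-laden illustration: the function name, the tolerance, and the choice of the midpoint as the shared location are all hypothetical, and coordinates are assumed to be in the same units on both sheets.

```python
import math

def edgematch(endpoints_a, endpoints_b, tolerance):
    """Pair each line end on sheet A's edge with the nearest line end on sheet B's edge.

    Returns (a, b, midpoint) for every pair that lies within the tolerance.
    """
    if not endpoints_b:
        return []
    matches = []
    for a in endpoints_a:
        nearest = min(endpoints_b, key=lambda b: math.dist(a, b))
        if math.dist(a, nearest) <= tolerance:
            midpoint = ((a[0] + nearest[0]) / 2, (a[1] + nearest[1]) / 2)
            matches.append((a, nearest, midpoint))
    return matches

# Line ends along the shared boundary x = 100 on two adjacent sheets.
sheet_a = [(100.0, 20.0), (100.0, 55.0)]
sheet_b = [(100.0, 20.4), (100.0, 80.0)]
for a, b, mid in edgematch(sheet_a, sheet_b, tolerance=1.0):
    print(a, b, mid)
```

Only the first pair is joined; the line ending at (100.0, 55.0) has no partner within the tolerance and would need to be checked by hand against the source maps.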

Rubbersheeting

If edgematching problems are major, or if the data contains internal distortions --a process known as rubber sheeting is used to correct the

problem

Rubber sheeting involves

--“Tacking down” objects that are accurately placed to provide controls

--other objects are then moved to fit with the control points that were established

Care is needed when rubbersheeting

--having too few control points may lead to unrealistic distortions

Edgematching: Match the polygons across

the boundary between adjacent map units

Rubbersheeting: Establish a number of fixed

points on the original line or polygon

then establish a set of new control points to

which you wish to move the line or boundary
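The idea of tacking down some points and moving others to fit can be sketched with inverse-distance weighting of the control-point displacements. This is a simple stand-in chosen for illustration, not the method any particular GIS package uses; the function name and the `power` parameter are hypothetical.

```python
import math

def rubber_sheet(point, control_pairs, power=2):
    """Shift a point by the inverse-distance-weighted average of the control displacements.

    control_pairs is a list of ((source_x, source_y), (target_x, target_y)).
    A pair whose source equals its target is "tacked down" and pulls nearby points toward zero shift.
    """
    weighted_dx = weighted_dy = total_weight = 0.0
    for (sx, sy), (tx, ty) in control_pairs:
        d = math.dist(point, (sx, sy))
        if d == 0:
            return (tx, ty)          # a control point moves exactly to its target
        w = 1 / d ** power
        weighted_dx += w * (tx - sx)
        weighted_dy += w * (ty - sy)
        total_weight += w
    return (point[0] + weighted_dx / total_weight,
            point[1] + weighted_dy / total_weight)

controls = [
    ((0.0, 0.0), (0.0, 0.0)),     # tacked down: stays where it is
    ((10.0, 0.0), (11.0, 0.0)),   # moved one unit to the east
]
print(rubber_sheet((5.0, 0.0), controls))   # halfway between: shifted by about half a unit
```

With only two control points, every other point on the map is stretched along the same axis, which illustrates why too few control points can lead to unrealistic distortions.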

Geocoding Address Data

Geocoding is the process of converting an

address into a point location

Geocoding is often required to:

1) Turn names and addresses from a census or

questionnaire survey

--into a map of client distributions

2) Locate an incident on a street for efficient

routing of emergency responses

3) Locate houses for sale that fit potential clients’

search criteria

Address Matching involves geocoding street

addresses to a published street network

Locations are determined based on the address

ranges stored for each street segment

Since address data are frequently inconsistent --it is often necessary to geocode addresses interactively
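Address matching against a street segment's address range amounts to linear interpolation along the segment. The sketch below uses a hypothetical segment record (the field names are invented for illustration) and ignores details such as odd and even numbers falling on opposite sides of the street.

```python
def geocode_on_segment(house_number, segment):
    """Estimate a point location for a house number from a segment's address range.

    Returns None when the number falls outside the stored range.
    """
    low, high = segment["from_address"], segment["to_address"]
    if not low <= house_number <= high:
        return None
    fraction = 0.0 if high == low else (house_number - low) / (high - low)
    (x1, y1), (x2, y2) = segment["start"], segment["end"]
    return (x1 + fraction * (x2 - x1), y1 + fraction * (y2 - y1))

# Hypothetical street segment covering house numbers 100 to 198.
segment = {"from_address": 100, "to_address": 198,
           "start": (0.0, 0.0), "end": (98.0, 0.0)}
print(geocode_on_segment(149, segment))  # (49.0, 0.0)
print(geocode_on_segment(250, segment))  # None
```

A `None` result is the kind of case that typically has to be resolved interactively, for example when the address was mistyped or the range on file is out of date.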

Re-Projection, Transformation, and Re-Scaling

When different datasets are used in a project --it is often necessary to process the data geometrically to

provide a common reference network

Projections, scaling, and resolution of source data may all need to be addressed, once spatial and attribute data have all been encoded and edited

Re-Projection

Data derived from maps drawn on different projections will need to be converted to a common projection --before combining or analysis can take place

If not re-projected, the data will not plot in the same location --this causes offsets when plotted

ArcGIS can re-project datasets on the fly when the projection has been defined --sometimes the projection has not been defined for the data, however --something which was addressed in Chapter 3 of MGIS

When working with project datasets, it is important to have all datasets in the same projected coordinate system --to minimize offset issues when combining layers --to minimize analysis errors due to the use of different datums

and/or projection parameters

Re-projection involves converting x-y coordinate values in a spatial data file to a different coordinate system
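As a concrete instance of converting coordinate values, the sketch below turns longitude and latitude in degrees into spherical Mercator x-y values in meters using the standard formulas. It is meant only to show that re-projection is a mathematical conversion of coordinates; production work would normally rely on the GIS package or a projection library, and the sketch does not handle datum differences.

```python
import math

# Sphere radius used by the spherical (Web) Mercator projection, in meters.
EARTH_RADIUS_M = 6_378_137.0

def lonlat_to_spherical_mercator(lon_deg, lat_deg):
    """Convert geographic coordinates (degrees) to spherical Mercator x, y (meters)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

x, y = lonlat_to_spherical_mercator(-90.0, 30.0)
print(round(x), round(y))
```

Plotting the original degree values alongside layers stored in meters would place the two datasets far apart, which is the offset problem described above.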

Transformation

At times it is necessary to convert a dataset from one datum to another

--due to data being referenced to different datums

--such as NAD 1927 vs. NAD 1983 for USGS datasets

This process is called datum transformation

Usually, one datum is selected for a project, to be used throughout the project

--datum transformation does not always use exact mathematical formulas and may require localized estimates and fitting

--the process of transformation may introduce new errors in coordinate locations (up to several meters)

--it is thus not desirable to transform GIS data back and forth repeatedly, since each pass can add to these errors

Transformation Might Involve:

1) Translation and scaling (1 m vs. 10 m)

--the transformation may involve multiplying coordinates by the ratio between the two resolutions

2) Creating a common origin --the two datasets may share coordinate resolutions, but not

common origins --the origin of one of the datasets may be shifted in line by

adding the difference between the two origins to its coordinates

3) Rotating Map Coordinates --use simple trigonometry to fit one or more datasets onto a

grid of common origin
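The three operations above (scaling, shifting the origin, and rotating) can be combined in one small function. This is a minimal sketch: the function and parameter names are hypothetical, and the order of operations (scale, then rotate, then shift) is an assumption made for the example.

```python
import math

def transform_point(point, scale=1.0, rotation_deg=0.0, shift=(0.0, 0.0)):
    """Scale a point, rotate it counterclockwise about the origin, then shift it."""
    x, y = point
    theta = math.radians(rotation_deg)
    xs, ys = x * scale, y * scale                       # 1) scaling
    xr = xs * math.cos(theta) - ys * math.sin(theta)    # 3) rotation
    yr = xs * math.sin(theta) + ys * math.cos(theta)
    return (xr + shift[0], yr + shift[1])               # 2) common origin

# A coordinate stored in 10 m units, rotated 90 degrees, and moved to a new origin.
moved = transform_point((1.0, 0.0), scale=10.0, rotation_deg=90.0, shift=(100.0, 200.0))
print(round(moved[0], 6), round(moved[1], 6))
```

Because each step involves floating-point arithmetic and, in practice, estimated parameters, applying such transformations back and forth repeatedly tends to accumulate small coordinate errors.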

Re-Scaling

When data comes from maps of different scales, it is important to keep in mind the following axiom:

The accuracy of the output of a GIS analysis is only as good as the worst input data

When using two different scale sources --you should generalize data from the larger-scale source to fit

with that of the smaller-scale source --i.e., you should generalize a 1:50,000 map to a 1:100,000

scale, not the reverse

This is because of the compromises that are made during generalization --removal of detail --removal of features --shifting of features

Thus, the 1:50,000 scale map contains more details, and is likely to be more accurate at higher resolutions than the 1:100,000 scale map --you cannot reasonably add more detail to the latter to make the

two equal in content --the analysis results would thus be misleading, if the analysis is

performed with the expectation that they would be valid at the larger of the two scales

Remember, a map resolution of .001 meters (1 mm) results in a potential offset of --50 meters (164 ft) at 1:50,000 --100 meters (328 ft) at 1:100,000

Integrating the Database

For large projects, 30 or more data sources may be used

This may require various data encoding and editing tasks that take place over a period of several years:

1) Manual digitizing

2) Re-projection

3) Edgematching

4) Geocoding

5) Transformation

Conclusion

When putting together a GIS project, it is important to plan for the processes of:

1) Obtaining the data

2) Encoding the data

3) Manipulating the Data

4) Transforming the Data

It is common to expect that 50-80% of project time

may be taken up with data encoding and editing --time-consuming, but necessary parts of any GIS project

The quality of the analysis results ultimately depends on --the quality of the input data and the integration of this data into

the database