MODULE 3: QUALITY CONTROL AND QUALITY ASSURANCE
Terminology
Data Quality Terminology
Ensuring Data Quality
Errors Introduced During Encoding Process
Fixing Errors in Data
Edgematching and Rubber Sheeting
Geocoding Address Information
Re-Projection, Transformation, and Re-Scaling
Integrating the Database
Introduction
Why Quality Control (QC) And Quality
Assurance (QA) Are Important
GIGO (Garbage In, Garbage Out) is a common
saying in computing circles --meaning that, if you input poor-quality data, your results will
also be poor
This is especially important in geographic
information systems --GIS programs allow the user to combine and
analyze spatial and non-spatial data in a
variety of ways:
--by combining maps
--through the use of buffer zones
--through the interpolation of point data sets, etc.
However, if your data sets have errors --these errors can be compounded further down the road
This can in turn lead to a final output that is
misleading for users --this is made worse by the fact that maps look
professional and authoritative
--this in turn can lead naïve users to believe what they see
Terminology
Data Quality refers to how good your data are
This can refer to the overall fitness or suitability of
data for a specific purpose
It can also be used to indicate whether data are free
from errors or other problems
Issues That Can Affect the Quality of Individual Data
Sets:
1) Error
2) Accuracy
3) Precision
4) Bias
1) Error refers to flaws in data --i.e. the physical difference between the real world and
the GIS facsimile of the Real World
2) Accuracy refers to the extent to which an
estimated data value approaches its true value
--100% accuracy is impossible to attain
--it is possible, however, to achieve accuracy within
specified tolerances (0.002” for example)
3) Precision refers to the recorded level of detail in the data
A planar coordinate that is specified as being precise to within three meters is more precise than a geographic coordinate that is specified as being precise to three decimal places (0.001 degree of latitude is approximately 111 meters)
It is important to remember, however, that precision is not the same as accuracy --a map’s precision may be high (precise to
0.001 meters, for instance) --yet it could still contain inaccurate data
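The comparison between the two precisions can be checked with rough arithmetic. The following is a minimal sketch, assuming a spherical Earth with a mean radius of 6,371 km (an assumption that is not part of these notes); the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius (spherical approximation)

def latitude_degrees_to_meters(degrees):
    """Approximate north-south ground distance spanned by a latitude difference."""
    return math.radians(degrees) * EARTH_RADIUS_M

planar_precision_m = 3.0                                    # from the example above
geographic_precision_m = latitude_degrees_to_meters(0.001)  # three decimal places

print(f"0.001 degree of latitude is about {geographic_precision_m:.0f} m")
print("Planar coordinate is more precise:", planar_precision_m < geographic_precision_m)
```

The same number of decimal places in longitude covers less ground away from the equator, so this figure should be read as a rough guide only.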
4) Bias refers to the systematic variation of data from reality
Bias tends to introduce a consistent error throughout a data set
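A small numeric sketch may help separate bias from scattered error. The readings below are invented for illustration, and the function and variable names are hypothetical. Bias is estimated as the mean error, spread as the standard deviation of the readings, and overall inaccuracy as the root-mean-square error (RMSE).

```python
import math
import statistics

def error_summary(measured, true_value):
    """Return (bias, spread, rmse) for repeated measurements of one known value."""
    errors = [m - true_value for m in measured]
    bias = statistics.mean(errors)        # consistent offset from the true value
    spread = statistics.pstdev(measured)  # scatter of the readings around their own mean
    rmse = math.sqrt(statistics.mean([e * e for e in errors]))  # overall inaccuracy
    return bias, spread, rmse

TRUE_ELEVATION_M = 100.0                     # hypothetical surveyed benchmark
biased_unit = [102.9, 103.0, 103.1, 103.0]   # tight cluster, consistently about 3 m high
unbiased_unit = [98.0, 102.0, 99.0, 101.0]   # scattered, but centered on the true value

for name, readings in (("biased", biased_unit), ("unbiased", unbiased_unit)):
    bias, spread, rmse = error_summary(readings, TRUE_ELEVATION_M)
    print(f"{name:9s} bias={bias:+.2f} m  spread={spread:.2f} m  rmse={rmse:.2f} m")
```

In this made-up case the tightly clustered readings are the less accurate ones, because the consistent 3 m offset dominates the error.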
Issues That Affect the Representation of Features Within A Database
Includes: 1) Resolution 2) Generalization
1) Resolution describes the smallest feature in a data set that can be displayed or mapped
Resolution is dependent on:
a. The scale of the original map
b. The point size and line width used to represent features
c. The precision of the data acquisition method (e.g. digitizing board, scanner, GPS unit, or other survey instruments)
2) Generalization is the process of simplifying real-world complexities to produce maps
Generalization is, by definition, a subjective process --a cartographer selectively removes real-world
detail to make the map understandable and attractive
Generalization introduces several issues, of which
the GIS user needs to be aware:
1) It can introduce positional inaccuracies
As map scale decreases, objects may be shifted to
depict their locations relative to one another
2) Generalization can cause area features
(features that cover a specified territory) to
be represented as point features
While a city will be represented by a
polygon (areal symbol) on a large-scale
map
--many small-scale maps may represent cities as
points
3) Line thicknesses are often exaggerated on
smaller scale maps --at a 1:1,000,000 scale, a 0.3 mm line (smallest visible)
would represent 300 meters on the ground
--I-10 (both lanes) is less than 100 meters wide
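The line-width arithmetic can be reproduced in a few lines. This is a sketch with hypothetical function names; the 100 m road width is the upper bound quoted above, so the resulting factor is a minimum.

```python
def ground_width_m(line_width_mm, scale_denominator):
    """Ground distance (meters) covered by a drawn line of the given width."""
    return line_width_mm * scale_denominator / 1000.0

def exaggeration_factor(line_width_mm, scale_denominator, true_width_m):
    """How many times wider the drawn line is than the real feature."""
    return ground_width_m(line_width_mm, scale_denominator) / true_width_m

for denominator in (50_000, 100_000, 1_000_000):
    width = ground_width_m(0.3, denominator)
    print(f"1:{denominator:,}  a 0.3 mm line covers {width:,.0f} m on the ground")

factor = exaggeration_factor(0.3, 1_000_000, true_width_m=100.0)
print(f"At 1:1,000,000 the road symbol is at least {factor:.0f}x too wide")
```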
Data Quality Terminology
Other Factors That Affect Overall Data Quality
1) Completeness 2) Compatibility 3) Consistency 4) Applicability
1) Completeness
Complete data sets cover the entire study area and time period of interest
Incomplete data sets result from:
a. Missing attribute data
b. Dual sets of attributes for one polygon
c. Data that come from samples of a larger set
--samples are incomplete, by definition
--sample completeness is determined by having a sufficiently large sample size to adequately represent real-world variations in the data
2) Compatibility
Compatible datasets can be used together sensibly
Maps with 1:50,000 and 1:500,000 scale can be said to have incompatible scales --this is because a 1:500,000 scale map generally has a
higher degree of generalization
3) Consistency
Data sets need to be developed using similar
methods of data capture, storage, manipulation,
and editing
Inconsistency can be introduced by:
a. Obtaining different sections of the data from
different source documents
b. Different sections of the data being input by
different people
--Each person involved with data input
introduces different types of bias
--Input devices (GPS devices, scanners,
digitizing tablets/tables) have different
input accuracies
4) Applicability
a. Is the data suitable for the type of analysis
that the user wants to conduct?
b. Is the type of analysis that the user intends to
conduct appropriate to the data set that is
being used?
Digitizing Table/PC Combination
Large Format Scanner
Ensuring Data Quality
When data is input into a GIS database, it is
bound to contain errors --it is therefore important to manage the data’s quality
It is best to intercept errors during the early
stages --i.e. during and after data input
Otherwise, these data errors can contaminate the
GIS database --these errors are likely to become compounded later in
the project
There Are Three Main Sources of Input Errors:
1) Errors in the Source Data
Errors in the source data are often difficult to
identify
They may be due to several factors:
a. Printing or copying errors
b. Surveying methods
c. Data input errors in the field
2) Encoding Errors
Digitizing (see reading supplement) can
introduce various errors into a spatial dataset
COMMON DIGITIZING ERRORS
Error | Description
Missing entities | Missing points, lines or boundary segments
Duplicate entities | Points, lines or boundary segments that have been digitized twice
Mislocated entities | Points, lines or boundary segments that have been digitized in the wrong place
Missing labels | Unidentified polygons
Duplicate labels | Two or more identification labels for the same polygon
Digitizing artifacts | Undershoots, overshoots, wrongly placed nodes, loops, and spikes
Noise | Irrelevant data entered during digitizing, scanning, or data transfer
Errors Introduced
During Encoding Process
Certain types of error often help to identify
other problems with encoded data
Example: Dangling, or “dead end” nodes
may indicate missing lines, overshoots, or
undershoots
Looking for patterns of this type can help to
direct editing, and speed up the process
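As an illustration of looking for such patterns, the sketch below flags dangling nodes in a small set of digitized lines. It assumes each line is a list of (x, y) tuples and that nodes which should coincide have exactly equal coordinates; the function name and sample data are hypothetical, and real GIS tools apply a tolerance instead of exact matching.

```python
from collections import Counter

def find_dangling_nodes(lines):
    """Return line endpoints that are used by exactly one line end."""
    endpoint_counts = Counter()
    for line in lines:
        endpoint_counts[line[0]] += 1    # start node
        endpoint_counts[line[-1]] += 1   # end node
    return sorted(pt for pt, count in endpoint_counts.items() if count == 1)

# A square whose last side stops short of the corner (an undershoot)
square_with_undershoot = [
    [(0, 0), (10, 0)],
    [(10, 0), (10, 10)],
    [(10, 10), (0, 10)],
    [(0, 10), (0, 0.5)],   # should have ended at (0, 0)
]

print(find_dangling_nodes(square_with_undershoot))
```

Both reported nodes sit next to each other at the same corner, which points the editor directly at the gap that needs closing.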
Most GIS packages provide sets of tools that
can identify and remove vector data errors
Although the user can allow the GIS to
correct errors automatically, visual comparison against the source
document is still valuable:
--it helps to reveal obvious omissions, duplications,
and erroneous additions often missed by
autocorrection
--automatic correction can also "correct" non-existent
errors if tolerances are set incorrectly,
causing the user to lose data
Examples of Spatial Error in Vector Data
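The risk of an incorrectly set tolerance during automatic correction can be shown with a toy snapping routine. This is a simplified greedy sketch, not the algorithm of any particular GIS package; the function name, coordinates, and tolerance values are all hypothetical.

```python
import math

def snap_points(points, tolerance):
    """Greedy snapping: each point moves to the first kept point within tolerance."""
    kept = []
    snapped = []
    for p in points:
        target = next((k for k in kept if math.dist(p, k) <= tolerance), None)
        if target is None:       # nothing close enough, so keep this point as-is
            kept.append(p)
            target = p
        snapped.append(target)
    return snapped

points = [
    (0.0, 0.0),
    (0.02, 0.01),   # digitizing slip: should coincide with (0.0, 0.0)
    (5.0, 5.0),
    (5.8, 5.0),     # a genuinely separate feature
]

for tolerance in (0.05, 1.0):
    distinct = len(set(snap_points(points, tolerance)))
    print(f"tolerance={tolerance}: {distinct} distinct locations remain")
```

With the small tolerance only the slip is repaired. With the large one the genuinely separate point is merged into its neighbor, which is the kind of silent data loss that a visual check against the source document can catch.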
Encoding Errors in Raster Data
Missing entities and noise are also common
errors associated with encoding raster data
Environmental or cultural obstacles can
result in missing entities in raster data, due
to:
1) Difficulties obtaining data near airports
2) High cloud cover can result in missing
vegetation data during the rainy season
with certain types of sensors
Noise
Nonsensical data that is inadvertently added
during collection or processing
Example: The user may find an individual
pixel representing water in a forest, despite
ground truthing that finds no water present
in this area
Fixing Errors in Data
FILTERING
Used with raster data --although it can be used to remove noise from raster data,
filtering is generally considered to be an analysis
technique
Filtering involves passing a filter (a small grid of
pixels: 3 x 3 is a common filter) over a noisy data set --then the value of the central pixel is recalculated as a
function of all pixel values within the filter
Filters are most commonly used for:
1) Enhancing line and edge detail to produce a
sharper, less blurry image
2) Selectively smoothing the image to remove
spurious values that were introduced by the
imaging system or other raster creation process
Filters are generally used to improve or enhance a
raster image in some specific way --so that an analyst can more readily interpret the image
NOTE: Use Raster Filters with Caution
--the user can lose genuine features if too
large a filter is used
Example Filter Window (3x3)
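A 3 x 3 filter can be sketched in plain Python. The version below uses a majority (most common value) rule, which is only one possible choice of function and was picked here because the example raster holds land-cover classes. The function name and the tiny raster are hypothetical, and edge cells are left unchanged for simplicity.

```python
from collections import Counter

def majority_filter(grid):
    """Replace each interior cell with the most common value in its 3x3 window."""
    rows, cols = len(grid), len(grid[0])
    result = [row[:] for row in grid]   # copy, so edge cells keep their original value
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [grid[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            result[r][c] = Counter(window).most_common(1)[0][0]
    return result

F, W = "forest", "water"
noisy = [
    [F, F, F, F],
    [F, W, F, F],   # a single "water" pixel inside the forest
    [F, F, F, F],
    [F, F, F, F],
]

cleaned = majority_filter(noisy)
for row in cleaned:
    print(row)
```

The same rule would also erase a real one-pixel pond, which is why filters should be used with caution and the window kept small.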
Edgematching and Rubbersheeting
Edgematching
When a study area extends over more than one map sheet --small differences or mismatches often occur between
adjacent sheets --mismatches can also be caused by changes to the North
American Datum since 1983
The maps need to be edgematched to ensure contiguity across boundaries --this is done by joining adjacent lines that are
mismatched across map boundaries --or by finishing incomplete polygons so they match
across former map sheet boundaries
Rubbersheeting
If edgematching problems are major, or if the data contains internal distortions --a process known as rubber sheeting is used to correct the
problem
Rubber sheeting involves
--“Tacking down” objects that are accurately placed to provide controls
--other objects are then moved to fit with the control points that were established
Care is needed when rubbersheeting
--having too few control points may lead to unrealistic distortions
Edgematching: Match the polygons across
the boundary between adjacent map units
Rubbersheeting: Establish a number of fixed
points on the original line or polygon
then establish a set of new control points to
which you wish to move the line or boundary
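The idea of tacking down control points and moving everything else to fit can be sketched as follows. This example uses inverse-distance weighting of the control-point displacements, which is one simple scheme chosen for illustration; GIS packages may use other interpolation methods. All names and coordinates are hypothetical.

```python
import math

def rubber_sheet(point, control_pairs, power=2):
    """Shift a point by the inverse-distance-weighted average of control displacements.

    control_pairs: list of ((from_x, from_y), (to_x, to_y)) tuples.
    """
    x, y = point
    weighted_dx = weighted_dy = total_weight = 0.0
    for (fx, fy), (tx, ty) in control_pairs:
        distance = math.hypot(x - fx, y - fy)
        if distance == 0:
            return (tx, ty)             # a control point moves exactly to its target
        weight = 1.0 / distance ** power
        weighted_dx += weight * (tx - fx)
        weighted_dy += weight * (ty - fy)
        total_weight += weight
    return (x + weighted_dx / total_weight, y + weighted_dy / total_weight)

controls = [
    ((0, 0), (0, 0)),     # tacked down: stays in place
    ((10, 0), (10, 0)),   # tacked down: stays in place
    ((5, 5), (5, 6)),     # this location should move 1 unit north
]

print(rubber_sheet((5, 5), controls))     # the control point itself
print(rubber_sheet((5, 2.5), controls))   # a nearby vertex is dragged part of the way
```

With only three control points, every other vertex is pulled toward the single moved point, which illustrates why too few controls can produce unrealistic distortions.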
Geocoding Address Data
Geocoding is the process of converting an
address into a point location
Geocoding is often required to:
1) Turn names and addresses from a census or
questionnaire survey
--into a map of client distributions
2) Locate an incident on a street for efficient
routing of emergency responses
3) Locate houses for sale that fit potential clients’
search criteria
Address Matching involves geocoding street
addresses to a published street network
Locations are determined based on the address
ranges stored for each street segment
Since address data are frequently inconsistent --it is often necessary to geocode addresses interactively
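Address matching against stored address ranges can be sketched as a linear interpolation along the street segment. This is a minimal illustration only: the segment record, its field names, and the function name are hypothetical, and the sketch ignores street-name matching and odd/even sides of the street.

```python
def geocode_on_segment(house_number, segment):
    """Estimate a point location by linear interpolation along a street segment."""
    low, high = segment["from_address"], segment["to_address"]
    if not low <= house_number <= high:
        return None                      # address is outside this segment's range
    fraction = 0.5 if high == low else (house_number - low) / (high - low)
    (x1, y1), (x2, y2) = segment["start"], segment["end"]
    return (x1 + fraction * (x2 - x1), y1 + fraction * (y2 - y1))

segment = {
    "street": "MAIN ST",
    "from_address": 100,
    "to_address": 200,
    "start": (0.0, 0.0),
    "end": (100.0, 0.0),
}

print(geocode_on_segment(150, segment))   # halfway along the segment
print(geocode_on_segment(250, segment))   # None: needs interactive review
```

Addresses that fall outside every stored range, or that are spelled inconsistently, return no match and are the ones that have to be resolved interactively.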
Re-Projection, Transformation, and Re-Scaling
When different datasets are used in a project --it is often necessary to process the data geometrically to
provide a common reference network
Projections, scaling, and resolution of source data may all need to be addressed, once spatial and attribute data have all been encoded and edited
Re-Projection
Data derived from maps drawn on different projections will need to be converted to a common projection --before combining or analysis can take place
If not re-projected, the data will not plot in the same location --this causes offsets when plotted
ArcGIS can re-project datasets on the fly when the projection has been defined --sometimes the projection has not been defined for the data, however --something which was addressed in Chapter 3 of MGIS
When working with project datasets, it is important to have all datasets in the same projected coordinate system --to minimize offset issues when combining layers --to minimize analysis errors due to the use of different datums
and/or projection parameters
Re-projection involves converting x-y coordinate values in a spatial data file to a different coordinate system
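To make the coordinate conversion concrete, the sketch below converts longitude/latitude values to spherical Mercator x-y coordinates and back. Spherical Mercator was chosen only because its formulas are short; it is not a projection named in these notes, and a real project would normally rely on the GIS or a projection library. Function names are hypothetical.

```python
import math

SPHERE_RADIUS_M = 6_378_137.0   # sphere radius assumed for spherical Mercator

def lonlat_to_mercator(lon_deg, lat_deg):
    """Convert geographic coordinates (degrees) to spherical Mercator (meters)."""
    x = SPHERE_RADIUS_M * math.radians(lon_deg)
    y = SPHERE_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def mercator_to_lonlat(x, y):
    """Inverse conversion: spherical Mercator (meters) back to degrees."""
    lon = math.degrees(x / SPHERE_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / SPHERE_RADIUS_M)) - math.pi / 2)
    return lon, lat

x, y = lonlat_to_mercator(-95.0, 30.0)   # an arbitrary sample location
print(f"projected: x={x:,.1f} m, y={y:,.1f} m")
print("round trip:", mercator_to_lonlat(x, y))
```

The same location has very different coordinate values in the two systems, which is why layers that are not in a common system plot in different places.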
Transformation
At times it is necessary to convert a dataset from one datum to another --due to data being referenced to different datums --such as NAD 1927 vs. NAD 1983 for USGS datasets
This process is called datum transformation
Usually, one datum is selected for a project, to be used throughout the project --datum transformation does not always use exact mathematical
formulas and may require localized estimates and fitting --the process of transformation may introduce new errors in
coordinate locations (up to several meters) --it is thus not desirable to transform GIS data back and forth
repeatedly, since doing so compounds these errors
Transformation Might Involve:
1) Translation and scaling (1 m vs. 10 m)
--the transformation may involve multiplying coordinates by the ratio between the two resolutions
2) Creating a common origin
--the two datasets may share coordinate resolutions, but not common origins
--the origin of one of the datasets may be shifted into line by adding the difference between the two origins to its coordinates
3) Rotating map coordinates
--use simple trigonometry to fit one or more datasets onto a grid of common origin
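The three operations listed above (scaling, shifting the origin, and rotating) can be combined in one small function. This is a sketch of a plane similarity transformation with hypothetical parameter names; it is not a datum transformation, which generally needs more than these three steps.

```python
import math

def transform_point(point, scale=1.0, rotation_deg=0.0, shift=(0.0, 0.0)):
    """Scale, then rotate counterclockwise about the origin, then translate."""
    x, y = point
    x, y = x * scale, y * scale                          # 1) scaling
    angle = math.radians(rotation_deg)
    x, y = (x * math.cos(angle) - y * math.sin(angle),   # 3) rotation
            x * math.sin(angle) + y * math.cos(angle))
    return (x + shift[0], y + shift[1])                  # 2) shift to common origin

# Coordinates stored in 10 m units converted to 1 m units
print(transform_point((12, 7), scale=10))

# Same resolution, but origins differ by (500, 250)
print(transform_point((12, 7), shift=(500, 250)))

# Grid rotated 90 degrees relative to the target grid
print(transform_point((1, 0), rotation_deg=90))
```

The order of the steps matters: rotating after translating would swing the point around the old origin and give a different result.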
Re-Scaling
When data come from maps of different scales, it is important to keep in mind the following axiom:
The accuracy of the output of a GIS analysis is only as good as the worst input data
When using two different scale sources --you should generalize data from the larger-scale source to fit
with that of the smaller-scale source --i.e., you should generalize a 1:50,000 map to a 1:100,000
scale, not the reverse
This is because of the compromises that are made during generalization --removal of detail --removal of features --shifting of features
Thus, the 1:50,000 scale map contains more details, and is likely to be more accurate at higher resolutions than the 1:100,000 scale map --you cannot reasonably add more detail to the latter to make the
two equal in content --the analysis results would thus be misleading, if the analysis is
performed with the expectation that they would be valid at the larger of the two scales
Remember, a map resolution of 0.001 m (1 mm) results in a potential ground offset of --50 meters (164 ft) at 1:50,000 --100 meters (328 ft) at 1:100,000
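The rule of working at the smaller of the two scales, and the offsets quoted above, can be checked with a short sketch. It assumes the 0.001 figure refers to 1 mm measured on the map; the function names are hypothetical.

```python
METERS_TO_FEET = 3.28084

def ground_offset_m(map_offset_mm, scale_denominator):
    """Ground distance (meters) represented by an offset measured on the map."""
    return map_offset_mm * scale_denominator / 1000.0

def common_analysis_scale(scale_denominators):
    """The smallest-scale source (largest denominator) limits the analysis."""
    return max(scale_denominators)

sources = [50_000, 100_000]
for denominator in sources:
    meters = ground_offset_m(1.0, denominator)
    feet = meters * METERS_TO_FEET
    print(f"1:{denominator:,}  1 mm on the map = {meters:.0f} m ({feet:.0f} ft)")

print(f"Generalize all inputs to 1:{common_analysis_scale(sources):,}")
```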
Integrating the Database
For large projects, 30 or more data sources may be used
This may require various data encoding and editing tasks
that take place over a period of several years:
1) Manual digitizing
2) Re-projection
3) Edgematching
4) Geocoding
5) Transformation
Conclusion
When putting together a GIS project, it is
important to plan for the processes of:
1) Obtaining the data
2) Encoding the data
3) Manipulating the Data
4) Transforming the Data
It is common to expect that 50-80% of project time
may be taken up with data encoding and editing --time-consuming, but necessary parts of any GIS project
The quality of the analysis results ultimately depends on --the quality of the input data and the integration of this data into
the database