inverted Indexing
2 years ago
15
Ex3_InvertedIndexing_Instructions.pdf
Exercises_rubric.pdf
Ex3_InvertedIndexing_Instructions.pdf
ITI220: Inverted Indexing Exercise List Student Names in Group (and circle your name): An inverted index, which is a common data structure in IR systems, is a list of all the index terms in a collection with pointers to the documents in which each term occurs, as well as the frequency that a term appears in the document and its location in the document. Specifically, an inverted index consists of:
• A collection of postings lists - one associated with each unique term in the collection. • Each posting list consists of a number of individual postings. • Each posting holds a unique document identifier (docno) and the frequency (count) of the term in
a given document.
Let's look at an example starting with the following famous lines from Shakespeare's Merchant of Venice
Now, let's treat each line (above) as if it were a "document." The completed inverted index would look something like this:
Let's interpret one of the terms - if - in the inverted index:
The number directly after the term is its document frequency or df for short. The df specifies the number of documents that contain this term. Since if appears in all four documents, its df is 4. Although the df can be easily reconstructed by counting the number of postings, it is often explicitly stored in the inverted index. The postings list contains a number of postings, each of which is a (docno, tf) tuple. The docno is simply a unique identifier for the document (one through four, in this case). The tf, which stands for term frequency, is the number of times the term appears in the document. The term if appears once in every document. Typically, postings are sorted by ascending docno (as shown above).
Exercise #3 - Instructions
1. Get in small groups of 2-4. The instructor will distribute the “In-Class Activity: Exercise #3” handout to each student. Put the names of your group at the top of the handout and circle your name. Each student will complete the activity handout.
2. As a group, select ONE of the document collections (see below - circle your choice and the group should select the same document collection) from Shakespeare's famous quotes to create an inverted index. Each line is ~ to a document. (Quotations are from famous William Shakespeare Quotes, http://www.brainyquote.com/quotes/authors/w/william_shakespeare.html)
3. Working together as a group, each member will complete “In-Class Activity Exercise #3” and write an inverted index (on the handout) that would be built for the selected document collection. Refer to the "If you prick us" inverted index example for guidance in creating your group's inverted index. Note: The terms are to be listed in alphabetical order.
4. As a group, discuss and write your response to the following question (in your handout): what did you learn about indexing as it pertains to information retrieval from this exercise?
5. Submit your group's inverted index (and the selected Shakespeare Document Collection) to the instructor at the end of class. This exercise will be graded based on the “Information Retrieval Exercises Rubric.”
Shakespeare Document Collection #1 1. doubt thou the stars are fire 2. doubt that the sun doth move 3. doubt truth to be a liar, but 4. never doubt I love Shakespeare Document Collection #2 1. a fool thinks himself to be wise 2. but a wise man knows himself to be a fool Shakespeare Document Collection #3 1. we know what we are, but 2. know not what we may be Shakespeare Document Collection #4 1. when a father gives to his son, both laugh 2. when a son gives to his father, both cry Shakespeare Document Collection #5 1. we know what we are 2. but know not what we may be Shakespeare Document Collection #6 1. some are born great 2. some achieve greatness 3. and some have greatness thrust upon them Group’s Response to the Following Question: What did you learn about indexing as it pertains to information retrieval from this exercise?
Exercises_rubric.pdf
Rubric by Denise Kreiger, Instructional Design/Technology Services, SC&I, Rutgers, 11/2015 1
Information Retrieval Exercises Rubric
Criteria Unsatisfactory-Beginning Developing Accomplished Exemplary Total Analysis of
Exercise
0-30 points 31-35 points 36-40 points 41-45 points /45 Demonstrates an unsatisfactory analysis of relevant aspects of information retrieval systems inherent in the exercise. Discussion is incomplete in comparing technology tools, search processes, and/or search methods and their usefulness in locating information. Observations, interpretations, and/or conclusions, if included, are unclear, illogical, and/or unsupported in relation to the stated exercise results.
Demonstrates a less than adequate analysis of relevant aspects of information retrieval systems inherent in the exercise. Discussion includes a minimal comparison of technology tools, search processes, and/or search methods and their usefulness in locating information. Observations, interpretations, and/or conclusions are somewhat reasonable and/or vague in relation to the stated exercise results.
Demonstrates satisfactory analysis of relevant aspects of information retrieval systems inherent in the exercise. Discussion includes an adequate comparison of technology tools, search processes, and/or search methods and their usefulness in locating information. Makes somewhat intuitive and reasonable observations, interpretations, and/or conclusions pertaining to the exercise results that are clearly stated for the most part.
Demonstrates strong analysis of relevant aspects of information retrieval systems inherent in the exercise. Discussion includes a thorough comparison of technology tools, search processes, and/or search methods and their usefulness in locating information. Makes insightful and logical observations, interpretations, and conclusions pertaining to the exercise results that are clearly stated.
Information Retrieval Technologies and Search Processes Used in Exercise
0-30 points 31-35 points 36-40 points 41-45 points /45 Demonstrates unsatisfactory proficiency and understanding in using information retrieval technologies, search processes, and/or effective search strategies inherent in the exercise and/or the exercise is incomplete, missing required steps.
Demonstrates a less than adequate understanding in using information retrieval technologies, search processes, and/or effective search strategies inherent in the exercise and/or may be missing a required step in the exercise.
Demonstrates an adequate proficiency and understanding in using information retrieval technologies, search processes, and/or effective search strategies inherent in the exercise. Required steps in the exercise are completed for the most part.
Demonstrates strong proficiency and understanding in using information retrieval technologies (e.g., search engines, databases), search processes (e.g., query, indexing, ranking), and/or effective search strategies (e.g., Boolean logic, phrasing, ordering) inherent in the exercise. All required steps in the exercise are fully completed.
Writing Quality
0-6 points 7 points 8 points 9-10 points /10 Poor writing style that is lacking in standard English, clarity, language used, and/or frequent errors in grammar, punctuation, usage, and spelling.
Average writing style that is sometimes unclear and/or with some errors in grammar, punctuation, usage, and spelling.
Above average writing style using standard English with minor errors in grammar, punctuation, usage, and spelling.
Well written using standard English, characterized by elements of a clear writing style and basically free from grammar, punctuation, usage, and spelling errors.
Timeliness* (* unexcused
late)
Deduct 11 points to failing Deduct 6-10 points Deduct 1-5 points 0 points deducted /-- Exercise paper is submitted 2-3 days (49-72 hours) after the deadline.
Exercise paper is submitted 1-2 days (25-48 hours) after the deadline.
Exercise paper is submitted within 1 day (24 hours) after the deadline.
Exercise paper is submitted on or before the deadline.
TOTAL POINTS (sum of 4 Criteria) /100
- Discussion Question 5
- please do it on time
- APPENDIX I - IO AND MULTIMEDIA DEVICES WORKSHEET*****Already A++ Rated Tutorial Use as Guide Paper*****
- Considering the article, write an APA-formatted paper of 4–5 pages that addresses the f
- Area of 23 ft circle
- 3a
- Writing assignment
- ENGLISH II Unit III Research Proposal
- Kim Woods
- *** KIM WOODS** Research-Based Interventions on Stress Disorders, Anxiety Disorders, Conduct Disorders, and ADHD Paper