Database problems help needed
Wiki admin database
Wikipedia adminship election Dataset information
Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. A small part of Wikipedia contributors are administrators, who are users with access to additional technical features that aid in maintenance. In order for a user to become an administrator a Request for adminship (RfA) is issued and the Wikipedia community via a public discussion or a vote decides who to promote to adminship. Using the latest complete dump of Wikipedia page edit history (from January 3 2008) we extracted all administrator elections and vote history data. This gave us nearly 2,800 elections with around 100,000 total votes and about 7,000 users participating in the elections (either casting a vote or being voted on). Out of these 1,200 elections resulted in a successful promotion, while about 1,500 elections did not result in the promotion. About half of the votes in the dataset are by existing admins, while the other half comes from ordinary Wikipedia users.
Dataset has the following format:
E: did the elector result in promotion (1) or not (0)
T: time election was closed
U: user id (and screen name) of editor that is being considered for promotion
N: user id (and screen name) of the nominator
V: vote(1:support, 0:neutral, -1:oppose) user_id time screen_name
Source (citation)
· J. Leskovec, D. Huttenlocher, J. Kleinberg. Signed Networks in Social Media. CHI 2010.
· J. Leskovec, D. Huttenlocher, J. Kleinberg. Predicting Positive and Negative Links in Online Social Networks. WWW 2010.
Note the following points about the dataset:
1. It is a textfile encoded in UTF-8 format and human viewable. But if you decide to store it any other format you should be able to figure out how to read it into computer memory – I will not be able to offer you help. On Windows platform you can use WordPad to view it. On Mac/Linux any text viewer should be fine.
2. The same person can be nominated or voted upon more than once, example, “sam_vimes”.
3. The dataset is about 5 MB – small but not too small for you to manually do anything with it. You need computer help for sure.
4. You can assume that the screen name and user id number are unique to a user. While I haven’t checked the data, either one can serve as the key for a user.
5. While framing a solution to the problem, some of the data in the dataset that may be irrelevant to the questions may be ignored.
6. Since Wikipedia has users from all over the world, it won’t be uncommon to see screen names that have a character set outside of the English alphabet – such as björn-isak. For the purpose of the finals, you will simply store them as such, and you can assume that queries will not require you to handle such characters.
7. The screen names can be numerical, or alpha-numerical.
The goals of this exercise are:
(i) To reverse engineer the dataset into a database so we can answer some queries efficiently
(ii) To design the database in an extensible manner. Meaning that, if a future election were to be held, data from this new election can be added to the database easily. So your design of the database should accommodate future data. You should not have to reinsert everything in the tables just the new data.
Programming Problems:
1. Given a user id, output the total number of times the user has voted to support or be neutral or oppose the candidate considered for promotion for all people the user has voted for collectively.
2. Given a screen name, output the total number of times the user has voted to support or be neutral or oppose the candidate considered for promotion for all people the user has voted collectively. Note that this is the same as question 1 except the input is different.
Note: To emphasize, for questions 1 and 2, you must use the voting information of the user in all the elections the user participated as a voter. Also, if the user didn’t participate in any voting, an appropriate message output would be relevant.
3. Given a user id considered for promotion, output the user id and screen name of the nominator. For multiple nominations, you will output all nominators. If the user is not nominated ever, output an appropriate message.
4. Given user screen name, output the total number of votes the user got in support, neutral and in opposition and the result, if the user was considered was promotion. Otherwise, (the user was never considered for promotion), output an appropriate message.
Work to be done and submitted:
1. You should design relevant relational database schemas and load the entire data into the tables. Submit the schema diagram with keys, foreign keys and referential constraints.
2. You may or may not decide to use the “time” information in the dataset. You may introduce additional attributes as you see fit or absolutely essential to do this exercise. Obviously, you need to create the data on your own if you introduce additional attributes.
3. You will submit DDL scripts for creating the tables. You can do this directly on your db platform via SQL. No php or JDBC is needed.
4. For processing the programming problems above, use php in conjunction with simple web forms that can be used to test the correctness of the queries. Submit the URLs so I can verify. Also submit the php scripts. The web interface just needs to be functional and by no means fancy.
5. I am not particular whether you read the data from the given file directly into the tables or programmatically pre-process it somehow and load the data into tables via code. In either case, you will submit the code to do it. I did say on the last day of classes that you want to learn how to load data directly from file to the tables. Now I am saying you can do it either way.
6. Design a relevant data model (whether it is ER or UML or whatever) for this exercise and submit it.
7. Submit information pertaining to whether your schema implementation in (1) satisfies 1st, 2nd or 3rd normal forms.
8. Please make sure your schema implementation satisfies the goals (i) and (ii) of the exercise described above. You will be marked off failing to meet the goals.
Guidelines or Suggestions:
· I would like to see the entire application on the departmental MySQL server preferably. But I won’t insist on it.
· You can develop the database application on either MySQL or Oracle but not SQL server.
· You can develop the application on your laptop or desktop first to make life easier. On the other hand, you may find life just as easy developing on our server.