School of Computing and Information Systems
The University of Melbourne
COMP30027, Machine Learning, 2019

Project 2: Short Text Location Prediction

Task: Gain understanding about using textual information to build a location classifier
Due: Stage I: Friday 24 May, 1pm UTC+10
     Stage II: Friday 31 May, 1pm UTC+10
Submission: Stage I: Report (PDF) to Turnitin; test output(s) and code to LMS
            Stage II: Reviews and Reflection (via Turnitin PeerMark)
Marks: The Project will be marked out of 20, and will contribute 20% of your total mark.
Groups: Groups of 1 or 2, with commensurate expectations for each (see below).

1 Overview

The goal of this Project is to build and critically analyse some supervised Machine Learning methods, with the aim of automatically identifying the location from which a textual message was sent, as one of four Australian cities. Although this is a simplification of the more general problem of geotagging, it is still a very difficult problem: it has been well studied, but a solution remains elusive.

The Project aims to reinforce the largely theoretical lecture concepts surrounding learners, data representation, and evaluation, by applying them to a sophisticated problem. You will also have an opportunity to practise your general problem-solving skills, written communication skills, and creativity.

2 Deliverables

1. Stage I: the output(s) of your classifier(s), comprising predictions of labels for the test instances
2. Stage I: one or more programs, written in Python [1], which implement machine learning methods that build the model, make predictions, and evaluate where necessary
3. Stage I: an anonymous written report, of 1,100-1,350 words (for a group of one student) or 2,200-2,400 words (for a group of two students)
4. Stage II: reviews of two reports written by other students, of 200-400 words each
5. Stage II: a written reflection piece, with the same structure as a review

3 Terms of Use

The data has been collected from Twitter (https://www.twitter.com), specifically for this Project.
According to Twitter’s Terms of Service, you cannot share this data for any other purpose, or reproduce it in any form, other than as isolated examples. (Which is to say, you can include a few tweets in your report.)

Please note that the dataset is a sample of actual data posted to the World Wide Web. As such, it may contain information that is in poor taste, or that could be construed as offensive. We would ask you, as much as possible, to look beyond this to the task at hand. (For example, it is generally not necessary to read individual tweets.)

The opinions expressed within the data are those of the (anonymised) authors, and in no way express the official views of the University of Melbourne or any of its employees; using the data in an educative capacity does not constitute endorsement of the content contained therein.

If you object to these terms, please contact us (nj@unimelb.edu.au) as soon as possible.

[1] We will waive the Python requirement under certain circumstances.

4 Data

The data files are available via the LMS, and are described in a corresponding README.

Briefly, you will be provided with a set of training tweets, and a set of development tweets. These have been labelled with a “class” according to the location, as one of four Australian cities: Sydney, Melbourne, Brisbane, Perth [2]. There is also a set of test tweets, which will not be labelled; you will submit predictions for these instances as part of your submission.

Your job is to come up with one or more implemented Machine Learning system(s), which are trained using (a representation of) the training dataset, and evaluated using the development dataset. You will then run the trained classifier over the test dataset, and submit the corresponding predicted labels. Three possible preprocessed data representations have been provided, which you may use or ignore according to your needs.

5 Assessment

The Project will be marked out of 20, and is worth 20% of your overall mark for the subject.
The mark breakdown will be:

Ranking of your best-performing classifier     3 marks
Report                                        12 marks
Reviews                                        3 marks
Reflection                                     2 marks
TOTAL                                         20 marks

The report will be marked according to the accompanying rubric; the details of the Stage II assessment will be announced via the LMS.

The mark for the system ranking will be calculated by first determining the accuracy of each set of predictions for every group. We will then apply equal-frequency binning of the systems in the final system ranking, and assign a score to each group based on the output which occurs in the highest-ranking bin. This procedure will be applied separately for groups of 1 member, and groups of 2 members. We may assign a bonus mark to remarkable submissions.

Since all of the tweets exist on the World Wide Web, it is inconvenient but possible to “cheat” and identify the locations of some of the test tweets using non-Machine Learning methods. If there is any evidence of this, the system ranking will be ignored, and you will instead receive a mark of 0 for this component. The code will not be directly assessed, but if you do not submit it, it will be assumed that you are attempting to circumvent the Machine Learning requirement, and you will receive a mark of 0 for the system ranking component.

6 Submission

All submissions will be via the LMS. Stage I submissions will open one week before the due date. Stage II submissions will be open as soon as the reports are available, immediately following the Stage I submission deadline.

[2] Obviously, there were other cities that could also have been included. However, the problem is already very difficult, and having more classes would have made it more so. Please do not read into the absence of various large urban areas; there is nothing inherently significant about the choice of class set.

7 Individual vs. Two-Person Participation

You have the option of participating as a group of one member, or in a group of two.
If you opt to participate individually, you will be required to enter the predictions (and accompanying code to generate them) for at least 1, and up to 4, distinct systems. Groups of two will be required to enter at least 3 and up to 4 distinct systems, of which one must be either a semi-supervised system, or a stacked ensemble system [3]. The report length requirement also differs, as detailed below:

Group size    Distinct system submissions required    Report length
1             1–4                                     1,100–1,350 words
2             3–4                                     2,200–2,400 words

If you wish to form a two-student group, contact us by Friday 10 May, indicating this. Note that once you have signed up for a given group, you will not be allowed to change groups. If you do not contact us, we will assume that you will be participating as an individual, even if you were in a two-student group for Project 1.

8 Report

The report should be 1,100-1,350 words (groups of one student) or 2,200-2,400 words (groups of two students) in length (±10%) and will include:

1. a basic description of the task;
2. a short summary of some related literature;
3. a conceptual description of what you have done, including any learners that you have used, or features that you have engineered [4];
4. an evaluation of your classifier(s) over the development tweets;

You should also aim to have a more detailed discussion, which:

5. contextualises the behaviour of the method(s), in terms of the theoretical properties we have identified in the lectures;
6. attempts some error analysis of the method(s);
7. and summarises the principal conclusions, which is to say, what a reasonably-informed reader will have learned from your efforts;

And don’t forget:

8. A bibliography, which includes related work from your literature summary

Note that we are more interested in seeing evidence of you having thought about the task and determined reasons for the relative performance of different methods, rather than the raw scores of the different methods you select.
This is not to say that you should ignore the relative performance of different runs over the data, but rather that you should think beyond simple numbers to the reasons that underlie them.

We will provide LaTeX and RTF style files that we would prefer that you use in writing the report. Reports are to be submitted in the form of a single PDF file. If a report is submitted in any format other than PDF, we reserve the right to return the report with a mark of 0.

Your name and student ID should not appear anywhere in the report, including the metadata (filename, etc.).

[3] scikit-learn provides some support for both of these concepts, but in a form that is ultimately quite different from how they are defined in the context of this subject. Consequently, there will be a non-trivial implementation component.
[4] Again, this should be at a conceptual level; a detailed description of the code is not appropriate for the report.

9 Reviews

During the reviewing process, you will read two submissions by other students. This is to help you contemplate some other ways of approaching the Project, and to ensure that students get some extra feedback. For each paper, you should aim to write 200-400 words total, responding to three “questions”:

- Briefly summarise what the author has done
- Indicate what you think that the author has done well, and why
- Indicate what you think could have been improved, and why

10 Changes/Updates to the Project Specifications

We will use the LMS to advertise any (hopefully small-scale) changes or clarifications in the Project specifications. Any addendums made to the Project specifications via the LMS will supersede information contained in this version of the specifications.

11 Late Submissions

Late submissions will create serious havoc with the reviewing process. You are strongly encouraged to submit by the date and time specified above.
If circumstances do not permit this, then the marks will be adjusted as follows:

- Each business day (or part thereof) that the report is submitted after the due date (and time) specified above, 10% will be deducted from the marks available, up until 5 business days (1 week) have passed, after which regular submissions will no longer be accepted. A late report submission will mean that your report might not participate in the reviewing process, and so you will probably receive less feedback.
- There is no mechanism by which the reviews may be uploaded to the system after the deadline; consequently, it is a major hassle to accept late submissions. Any late submission of the reviews will incur a 50% penalty (i.e. 1.5 of the 3 marks), and will not be accepted more than a week after the reviewing deadline.
- The reflective task will largely be nonsensical to attempt after the deadline. We will reluctantly accept late submissions at a 50% penalty (1 of the 2 marks) up until a week after the task deadline.

12 Academic Honesty

While it is acceptable to discuss the Project in general terms, excessive collaboration with students outside of your group is considered cheating. We will be vetting system submissions for originality and will invoke the University’s Academic Misconduct policy (http://academichonesty.unimelb.edu.au/policy.html) where either inappropriate levels of collaboration or plagiarism are deemed to have taken place.

13 Important Dates

Release of training and development data             1 May 2019
Deadline for group registration                      10 May 2019
Deadline for submission of results over test data    24 May 2019 (1:00pm)
Deadline for submission of written report            24 May 2019 (1:00pm)
Deadline for submission of reviews                   31 May 2019 (1:00pm)
Deadline for submission of reflection                31 May 2019 (1:00pm)
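
As an illustration only, the system-ranking procedure described under Assessment (accuracy of each set of predictions, equal-frequency binning of the ranked systems, then crediting each group with its highest-ranking bin) might be sketched as follows. The number of bins, the handling of ties, and the mapping from bins to marks are not stated in this specification, so `n_bins`, the tie-breaking, and all system and group names below are assumptions made for the example. The sketch also handles a single pool of groups, whereas the specification applies the procedure separately to groups of 1 and groups of 2.

```python
from typing import Dict, List, Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of instances whose predicted city matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no instances to score")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def equal_frequency_bins(scores: Dict[str, float], n_bins: int) -> Dict[str, int]:
    """Rank systems by score and split the ranking into n_bins near-equal bins.

    Bin 0 is the highest-ranking bin. Tied systems keep their input order,
    which is an assumption; the specification does not say how ties are broken.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    total = len(ranked)
    return {name: (rank * n_bins) // total for rank, name in enumerate(ranked)}


def best_bin_per_group(system_bins: Dict[str, int],
                       groups: Dict[str, List[str]]) -> Dict[str, int]:
    """Credit each group with the highest-ranking (lowest-numbered) bin it reached."""
    return {group: min(system_bins[s] for s in systems)
            for group, systems in groups.items()}


# Toy data: four hypothetical test instances and four hypothetical systems.
gold = ["Sydney", "Melbourne", "Brisbane", "Perth"]
predictions = {
    "g1_sysA": ["Sydney", "Melbourne", "Brisbane", "Perth"],
    "g1_sysB": ["Sydney", "Sydney", "Sydney", "Sydney"],
    "g2_sysA": ["Sydney", "Melbourne", "Perth", "Perth"],
    "g3_sysA": ["Perth", "Melbourne", "Brisbane", "Sydney"],
}
groups = {"g1": ["g1_sysA", "g1_sysB"], "g2": ["g2_sysA"], "g3": ["g3_sysA"]}

scores = {name: accuracy(gold, labels) for name, labels in predictions.items()}
system_bins = equal_frequency_bins(scores, n_bins=2)  # n_bins=2 is an assumption
group_bins = best_bin_per_group(system_bins, groups)
print(scores)
print(group_bins)
```

In this toy run, group g1 is credited with the top bin because of its best system, even though its second system lands in the bottom bin; this is why submitting a weak additional system does not lower a group's ranking mark under the stated procedure.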