COMP3425 Data Mining 2019
Assignment 2

Maximum marks: 100
Weight: 20% of the total marks for the course
Length: Maximum of 10 pages, excluding cover sheet, bibliography and appendices.
Layout: A4, at least 11 point type size; use of typeface, margins and headings consistent with a professional style.
Submission deadline: 9:00am, Monday, 20 May
Submission mode: Electronic, via Wattle
Estimated time: 15 hours
Penalty for lateness: 100% after the deadline has passed
First posted: 1st April, 9am
Last modified: 1st April, 9am
Questions to: Wattle Discussion Forum

This assignment specification may be updated to reflect clarifications and modifications after it is first issued.

It is strongly suggested that you start working on the assignment right away. You can submit as many times as you like. Only the most recent submission at the due date will be assessed.

In this assignment, you are required to submit a single report in the form of a PDF file. You may also attach supporting information (appendices) as one or more identified sections at the end of the same PDF file. Appendices will not be marked but may be treated as supporting information to your report. Please use a cover sheet at the front that identifies you as the author of the work, using your u-number and name, and identifies this as your submission for COMP3425 Assignment 2. The cover sheet and appendices do not contribute to the page limit. You are expected to write in a style appropriate to a professional report. You may refer to http://www.anu.edu.au/students/learning-development/writing-assessment/report-writing for some useful stylistic advice. You are expected to use the question and sub-question numbering in this assignment to identify the relevant answers in your report.

No particular layout is specified, but you should follow a professional style, use no smaller than an 11 point typeface, and stay within the maximum specified page count. Page margins, heading sizes, paragraph breaks and so forth are not specified, but a professional style must be maintained. Text beyond the page limit will be treated as non-existent.

This is a single-person assignment and should be completed on your own. Make certain you carefully reference all the material that you use, although the nature of this assignment suggests few references will be needed. It is unacceptable to cut and paste another author's work and pass it off as your own. Anyone found doing this, from whatever source, will get a mark of zero for the assignment and, in addition, CECS procedures for plagiarism will apply.

No particular referencing style is required. However, you are expected to reference conventionally, conveniently, and consistently. References are not included in the page limit. Due to the context in which this assignment is placed, you may refer to the course notes or course software where appropriate (e.g. "For this experiment Rattle was used"), without formal reference to original sources, unless you copy text, which always requires a formal reference to the source.

An assessment rubric is provided. The rubric will be used to mark your assignment. You are advised to use it to supplement your understanding of what is expected for the assignment and to direct your effort towards the most rewarding parts of the work.

Your assignment submission will be treated confidentially. It will be available to ANU staff involved in the course for the purposes of marking.

Task

You are to complete the following exercises.
For simplicity, the exercises are expressed using the assumption that you are using Rattle; however, you are free to use R directly or any other data mining platform you choose that can deliver the required functions. You are expected, in your own words, to interpret selected tool output in the context of the learning task. Write just what is needed to explain the results you see. Similarly, you should describe the methods used in the language of data mining, not in terms of the commands you typed or the buttons you selected.

1. Platform

Briefly describe the platform for your experiments in terms of memory, CPU, operating system, and the software that you use for the exercises. If your platform is not consistent throughout, you must describe it for each exercise. This is to ensure your results are reproducible.

2. Data

Look at the pairwise correlation amongst all the numeric variables using the Pearson product-moment correlation.

(a) Explain why you would expect DayofYear and WeekofYear to be highly correlated.
(b) Qualitatively describe the correlations amongst the variables High, Low, Open, Close and Volume. Explain what you see in terms of the source of the data.

3. Association mining: What factors affect volume of sales?

(a) Compute and give the 5-number summary for Volume. Qualitatively describe what it tells you about Volume.
(b) For this exercise, bin Volume into 5 categories using quantiles. Why is quantile binning appropriate for association mining with Volume? When you have completed this question 3, remove the extra variable you created so it does not interfere with other exercises.
(c) Generate association rules, adjusting the minimum support and minimum confidence parameters as you need. What parameters do you use? Bearing in mind we are looking for insight into what factors affect Volume, find 3 interesting rules, and explain both objectively and subjectively why they are interesting.
(d) Comment on whether, in general, association mining could be a useful technique on this data.

4. Study a very simple classification task

Aim to build a model to classify Change. Use Change as the target class and set every other variable as Input (independent). Using sensible defaults for model parameters is fine for this exercise, where we aim to compare methods rather than optimise them.

(a) This should be a very easy task for a learner. Why? Hint: think about how Change is defined.
(b) Train each of a Linear, Decision Tree, SVM and Neural Net classifier, so you have 4 classifiers. Hint: because the dataset is large, begin with a small training set, 20%, and where run-time speeds are acceptable, move up to a 70% training set. Evaluate each of these 4 classifiers using a confusion matrix, interpreting the results in the context of the learning task.
(c) Inspect the models themselves, where that is possible, to assist in your evaluation and to explain the performance results. Which learner(s) performed best, and why?
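The following minimal R sketch illustrates one way Questions 2 and 3 could be approached outside Rattle. It assumes the data has been loaded into a data frame named stocks with the variable names used in this specification; the data frame name, the choice of the arules package, and the support/confidence thresholds shown are illustrative assumptions rather than requirements.

```r
# Q2: Pearson product-moment correlations amongst the numeric variables
num_vars <- stocks[sapply(stocks, is.numeric)]
round(cor(num_vars, method = "pearson", use = "pairwise.complete.obs"), 2)

# Q3(a): five-number summary (min, lower hinge, median, upper hinge, max) of Volume
fivenum(stocks$Volume)

# Q3(b): quantile (equal-frequency) binning of Volume into 5 categories
breaks <- quantile(stocks$Volume, probs = seq(0, 1, by = 0.2))
stocks$VolumeBin <- cut(stocks$Volume, breaks = breaks,
                        include.lowest = TRUE, labels = paste0("Q", 1:5))

# Q3(c): association rules with the Volume bins on the right-hand side
library(arules)                                # illustrative package choice
cat_vars <- stocks[sapply(stocks, is.factor)]  # association mining needs categoric data;
                                               # convert character columns with as.factor() first
trans <- as(cat_vars, "transactions")
rules <- apriori(trans,
                 parameter  = list(supp = 0.01, conf = 0.5, minlen = 2),
                 appearance = list(default = "lhs",
                                   rhs = grep("^VolumeBin=", itemLabels(trans), value = TRUE)))
inspect(head(sort(rules, by = "lift"), 3))     # candidate "interesting" rules, ranked by lift

# remove the helper variable afterwards so it does not interfere with later exercises
stocks$VolumeBin <- NULL
```

The same steps map onto Rattle's Explore (Correlation), Transform (Recode/Binning) and Associate tabs.

For Question 4, the sketch below shows one possible way to train the four classifiers and compare them with confusion matrices on a held-out partition. The rpart, e1071 and nnet packages, the 20% training fraction and the parameter values are assumptions; it also assumes Change is, or has been converted to, a two-level factor, and that Date is coded numerically (convert or drop it otherwise).

```r
library(rpart); library(e1071); library(nnet)

stocks$Change <- as.factor(stocks$Change)        # assuming Change is categorical (e.g. up/down)

set.seed(1)
idx   <- sample(nrow(stocks), 0.2 * nrow(stocks)) # start with a 20% training partition
train <- stocks[idx, ]
test  <- stocks[-idx, ]

dt  <- rpart(Change ~ ., data = train, method = "class")        # decision tree
sv  <- svm(Change ~ ., data = train, kernel = "radial")         # SVM
nn  <- nnet(Change ~ ., data = train, size = 10, decay = 5e-4,
            maxit = 200, MaxNWts = 5000)                        # neural net; raise MaxNWts if
                                                                # the input dimension is large
lin <- glm(Change ~ ., data = train, family = binomial)         # linear (logistic) model

# confusion matrices on the held-out data
confmat <- function(pred) table(actual = test$Change, predicted = pred)
confmat(predict(dt, test, type = "class"))
confmat(predict(sv, test))
confmat(predict(nn, test, type = "class"))
lin_prob <- predict(lin, test, type = "response")
confmat(factor(ifelse(lin_prob > 0.5, levels(test$Change)[2], levels(test$Change)[1]),
               levels = levels(test$Change)))
```

In Rattle, the same comparison corresponds to the Tree, SVM, Neural Net and Linear options on the Model tab, evaluated through the Error Matrix option on the Evaluate tab.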
5. Predict a Numeric Variable

One investment strategy could rely on the previous day's price to predict the opening price for a stock, enabling you to place a buy or sell offer overnight, ready for the next day. To predict the opening price for a day, you cannot use any of the other prices or Volume for that same day, as that information is not available until the close of the day, when it is too late. So, using the variables you have in the dataset, but ignoring all of High, Low, Close, Volume, Close-Open, Change, High-Low and HMLOL, aim to predict the opening price Open using the previous day's closing price PriorClose and the date and stock-related variables. Use a regression tree or a neural net.

(a) Explain which of a regression tree or a neural net you chose, and justify your choice.
(b) Train your chosen model and tune it by setting controllable parameters to achieve a reasonable performance. Explain what parameters you varied and how, and the values you chose finally.
(c) Assess the performance of your best result using the subjective and objective evaluation appropriate for the method you chose, and justify why you settled on that result.

6. More Complex Classification

An alternative investment strategy might be to predict where there will be a big proportional change in the price over a day, once the day has opened, and so a good opportunity for a short-term gain (or loss) that day.

(a) Transform HMLOL to a categoric class variable by binning it into 2 classes, using a k-means clustering of HMLOL. The skewed distribution of HMLOL is helpful here, as the higher values we need for investing are relatively rare. When you have completed this question 6, remove the extra variable you created so it does not interfere with other exercises. Now, be sure to ignore most of the current-day price and volume variables; that is, ignore all of High, Low, Close, Volume, Close-Open, Change, High-Low and HMLOL. Explain why HMLOL should be ignored. This time, use the Open price, PriorClose and all the date and company description variables for learning.

Initially, use a small training set, 20%, and where run-time speeds are acceptable, experiment with a larger training set. Explain how you will partition the available dataset to train and validate the classification models below.

(b) Train a Decision Tree classifier. You will need to adjust default parameters to obtain optimal performance. State what parameters you varied and (briefly) their effect on your results. Evaluate your optimal classifier using the error matrix, ROC, and any quality information specific to the classifier method.
(c) Train an SVM classifier. Then proceed as for (b) Decision Tree above.
(d) Train a Neural Net classifier. Then proceed as for (b) Decision Tree above.

7. Clustering

Restore the dataset to its original distributed form, removing any new variables you have constructed above. For clustering, use only the five raw variables Date, Open, High, Low and Volume, and remove all of the others.

Experiment with clustering using the k-means algorithm. Rescale the variables to fall in the range 0-1 prior to clustering. Use the full dataset for clustering (do not partition), building cluster models for each of k = 2, 5 and the recommended default (i.e. the square root of n/2 for a dataset of size n). Choose your preferred k and its cluster model for k-means to answer the following.

(a) Justify your choice of k as your preferred value (Hint: have a look at parts (b)-(d) below for each cluster model).
(b) Calculate the sum of the within-cluster sum-of-squares for your chosen model. The within-cluster sum-of-squares is the sum of the squares of the Euclidean distance of each object from its cluster mean. Discuss why this is interesting.
(c) Look at the cluster centres for each variable. Using this information, discuss qualitatively how each cluster differs from the others.
(d) Use a scatterplot to plot (a sample of) the objects projected onto each combination of 2 variables, with objects mapped to each cluster by colour (Hint: the Data button on Rattle's Cluster tab can do this). Describe what you can see as the major influences on clustering. Include the image in your answer.
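By way of illustration for Question 5, the sketch below fits a regression tree that predicts Open from PriorClose and the remaining date and stock-related variables, then reports the root mean squared error on a held-out partition. The rpart package, the control values, and the exact names of the excluded same-day columns are assumptions to adapt to your data.

```r
library(rpart)

# same-day information that must not be used; adjust the names to match your file,
# including the Close-Open and High-Low difference columns
same_day <- c("High", "Low", "Close", "Volume", "Change", "HMLOL")
keep     <- setdiff(names(stocks), same_day)

set.seed(1)
idx   <- sample(nrow(stocks), 0.7 * nrow(stocks))
train <- stocks[idx, keep]                 # convert Date to a numeric or date type first
test  <- stocks[-idx, keep]                # if it is stored as text

reg <- rpart(Open ~ ., data = train, method = "anova",
             control = rpart.control(cp = 0.0005, minsplit = 20, maxdepth = 10))

pred <- predict(reg, test)
sqrt(mean((pred - test$Open)^2))           # RMSE, in the same units as Open
cor(pred, test$Open)^2                     # squared correlation as a pseudo R-squared
plotcp(reg)                                # cross-validated error vs complexity, to guide pruning
```

For Question 6, one possible workflow is to derive the two-class target from HMLOL with a one-dimensional k-means clustering and then evaluate a decision tree with an error matrix and an ROC/AUC measure, as sketched below. The pROC package, the helper variable name BigMove, its class labels and the tree parameters are all illustrative assumptions.

```r
library(rpart); library(pROC)

# Q6(a): bin HMLOL into 2 classes with a one-dimensional k-means clustering
set.seed(42)
km  <- kmeans(stocks$HMLOL, centers = 2, nstart = 25)
big <- which.max(km$centers)                               # cluster with the larger mean HMLOL
stocks$BigMove <- factor(ifelse(km$cluster == big, "big", "small"))

# ignore the same-day price and volume variables (adjust names to your file)
same_day <- c("High", "Low", "Close", "Volume", "Change", "HMLOL")
keep     <- setdiff(names(stocks), same_day)

set.seed(1)
idx   <- sample(nrow(stocks), 0.2 * nrow(stocks))          # small training partition first
train <- stocks[idx, keep]
test  <- stocks[-idx, keep]

fit  <- rpart(BigMove ~ ., data = train, method = "class",
              control = rpart.control(cp = 0.001, minsplit = 20))
pred <- predict(fit, test, type = "class")
prob <- predict(fit, test, type = "prob")[, "big"]

table(actual = test$BigMove, predicted = pred)             # error (confusion) matrix
auc(roc(test$BigMove, prob))                               # area under the ROC curve

stocks$BigMove <- NULL     # remove the helper variable afterwards
```

Finally, for Question 7, the sketch below rescales the five raw variables to the range 0-1, builds k-means models for k = 2, 5 and the recommended default of sqrt(n/2), reports the total within-cluster sum-of-squares and the cluster centres, and draws a pairwise scatterplot of a sample coloured by cluster. Converting Date via as.Date() is an assumption about how the date is stored.

```r
raw <- stocks[, c("Date", "Open", "High", "Low", "Volume")]
raw$Date <- as.numeric(as.Date(raw$Date))         # assumes Date is stored as a date/text column

rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- as.data.frame(lapply(raw, rescale01))   # all variables now in [0, 1]

set.seed(42)
ks     <- c(2, 5, round(sqrt(nrow(scaled) / 2)))  # k = 2, 5 and the recommended default
models <- lapply(ks, function(k) kmeans(scaled, centers = k, nstart = 25))

# 7(b): sum of the within-cluster sums of squares for each model
sapply(models, function(m) m$tot.withinss)

# 7(c): cluster centres, here for the k = 5 model
models[[2]]$centers

# 7(d): sample of objects on each pair of variables, coloured by cluster membership
samp <- sample(nrow(scaled), min(2000, nrow(scaled)))
pairs(scaled[samp, ], col = models[[2]]$cluster[samp], pch = 20)
```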
8. Qualitative Summary of Findings (approx. 1/2 page)

Would you use these results to advise your investment decisions? Comparatively evaluate the techniques you have used and their suitability, or not, for mining this data. This should be a qualitative opinion that draws on what you have found already in doing the exercises above. For example, what can you say about training and classification speeds, the size or other aspects of the training data, or the predictive power of the models built? Finally, what else would you propose to investigate to assist your investment decisions?

Assessment Rubric

This rubric will be used to mark your assignment. You are advised to use it to supplement your understanding of what is expected for the assignment and to direct your effort towards the most rewarding parts of the work. Your assignment will be marked out of 100, and marks will be scaled back to contribute to the defined weighting for assessment of the course. Each criterion is marked against five performance levels: Exemplary, Excellent, Good, Acceptable and Unsatisfactory.

1. Platform & 2. Data (10 marks)
- 9-10: 1. Platform description complete (memory, CPU, operating system, software). 2. All required correlations (DayofYear/WeekofYear and all pairs of High, Low, Open, Close and Volume) clearly explained in terms of the data domain, in the correct directions and for correct reasons, demonstrating an understanding of the data.
- 7-8: 1. Platform description complete. 2a partial or unclear; 2b partial or unclear; 2b partial explanation.
- 5-6: 1. Platform description complete. 2a attempt but with correlation direction wrong; 2b partial description of 4 variables or unclear; 2b partial explanation.
- 0-4: 1. Platform description incomplete. 2a correlation reason missing or unrelated to DayofYear and WeekofYear; 2b description unrelated to High, Low, Open, Close and Volume; 2b explanation unrelated to the data source.

3. Association mining (10 marks)
- 9-10: Answers demonstrate deep understanding of association mining, by the careful selection of interesting and differentiated rules and a clear rationale for interestingness. Comment shows original and insightful analysis of association mining on the problem.
- 7-8: a 5-number summary complete; b 5-number explanation for Volume clear; b binning understood; c support and confidence clear; c 3 interesting rules given; c objective interestingness given for all 3; c subjective interestingness attempted; d comment makes sense.
- 5-6: a 5-number summary OK; b 5-number explanation for Volume poor; b binning misunderstood; c support or confidence not clear; c objective interestingness incomplete; c subjective interestingness incomplete; d comment is cursory or off track.
- 0-4: Required information not provided and/or incorrect or misleading, demonstrating lack of engagement with the problem.
4. Simple classification (10 marks)
- 9-10: Deep understanding of the 4 models demonstrated through analysis of performance on the Change task.
- 7-8: a correctly explains why the definition of Change makes it seem easy; b 4 confusion matrices given; b confusion matrices explained in terms of the data, the method and the model learnt; c some evidence of understanding what the models are doing; c reasoning for comparative performance demonstrates understanding of the methods behind them.
- 5-6: a partially explains why the definition of Change makes it seem easy; b 4 confusion matrices given; b confusion matrices explained at face value only; c weak understanding of the learnt models; c comparative performance only cursorily presented; c reason for comparative performance is shallow.
- 0-4: a inadequate explanation; b confusion matrix missing; b confusion matrix misunderstood; c interpretation of confusion matrix missing; c no apparent understanding of what the models are doing; c missing or unexplained comparative analysis.

5. Prediction (20 marks)
- 17-20: Approach to the problem demonstrates effort to produce good results and a deep understanding of the relative benefits of the 2 models in the context of the problem domain. Results are interpreted in the context of the problem domain.
- 14-16: a justification for choice shows understanding of the comparative benefits of each and extensive experiments with performance; b parameter variation shows a combination of experimentation and understanding of the parameters, with justification for stopping at the selected parameters; c several subjective and objective evaluation measures used as appropriate to the method chosen; c justification for stopping demonstrates awareness of the appropriateness of the best result and the scope of potential for further improvement.
- 12-13: a justification for choice shows understanding of the comparative benefits of each and experiments with performance; b parameter variation shows a combination of experimentation and understanding of the parameters; c multiple subjective and objective evaluation measures used as appropriate to the method chosen; c justification for stopping demonstrates awareness of the appropriateness of the best result.
- 10-11: a justification for choice shows some understanding of the comparative benefits of each, or experiments with performance; b parameter variation demonstrates some experimentation; c cursory evaluation given; c justification for stopping perfunctory.
- 0-9: a weak justification for choice; b parameter variation insufficient; c evaluation fails to demonstrate effort or understanding of evaluation; c justification for stopping effectively absent.
6. Complex Classification (30 marks)
- 26-30: Exemplary use of classification models with comprehensive and fit-for-purpose performance analysis on the problem.
- 22-25: a explanation correct; b, c, d parameter variation clear and extensive, demonstrating understanding of its effect, in all 3 methods; b, c, d error matrix and ROC correctly interpreted in all 3 methods; b, c, d extensive use of method-specific evaluation, with significance clearly explained, in all 3 methods.
- 18-21: a explanation correct; a satisfactory approach to dataset partitioning; b, c, d (for each of the 3 methods) parameter variation clear and sufficient for good results, error matrix correctly interpreted, ROC correctly interpreted, and some specific evaluation methods used.
- 15-17: a explanation correct; a satisfactory approach to dataset partitioning; b, c, d (for each of the 3 methods) parameter variation perfunctory, error matrix given, ROC given, and few specific evaluation methods used.
- 0-14: a explanation incorrect; a unsound use of training/testing/validation data; b, c, d (for each of the 3 methods) no parameter variation, no error matrix, no or faulty ROC, and specific evaluation methods missing.

7. Clustering (10 marks)
- 9-10: The application of the k-means algorithm to the dataset and its evaluation demonstrates exemplary understanding of the algorithm, its evaluation, and its limitations. Suitable evaluation methods or clustering experiments beyond those required here may be used.
- 7-8: a justification convincing; b measure calculated correctly, with discussion recognising its value and limitations; c discussion of the centres reflects the numeric results and emphasises the interesting parts that relate to their significance in domain terms; d correct image included and description shows understanding linked to the data domain.
- 5-6: a justification offered but not clear or unconvincing; b measure calculated correctly; c discussion of the centres reflects the numeric results; d correct image included.
- 0-4: Clustering experimentation and discussion inadequate.

8. Qualitative Summary (10 marks)
- 9-10: Many aspects of evaluation are discussed and a clear conclusion is drawn, with direct reference to potential goals of the domain of the data. The proposal for further investigation demonstrates creativity and thoughtful engagement with the problem, clearly building on the work reported.
- 8: A clear conclusion is drawn from the work reported and a defended proposal for further investigation is made, with clear links to both the work reported and the domain of application.
- 7: A rounded, balanced summary of the work is presented, with a justified proposal given.
- 6: A summary of the work is presented and a proposal made.
- 0-4: The answer does not demonstrate adequate engagement with the problem nor a qualitative understanding of the work reported.