Explained: dataset, Python, Python, scikit-

Author: kuijinzhong | Published: 2020-01-12 11:44 | Views: 0

Assignment 2

For this assignment you will experiment with various classification models using subsets of some real-world datasets. In particular, you will use the K-Nearest-Neighbor algorithm to classify text documents, experiment with and compare classifiers that are part of the scikit-learn machine learning package for Python, and use some additional preprocessing capabilities of the pandas and scikit-learn packages.

1. K-Nearest-Neighbor (KNN) classification on Newsgroups [Dataset: newsgroups.zip]

For this problem you will use a subset of the 20 Newsgroup data set. The full data set contains 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, and has often been used for experiments in text applications of machine learning techniques, such as text classification and text clustering (see the description of the full dataset). The assignment data set contains a subset of 1000 documents and a vocabulary of terms. Each document belongs to one of two classes: Hockey (class label 1) and Microsoft Windows (class label 0). The data has already been split (80%, 20%) into training and test data. The class labels for the training and test data are also provided in separate files. The training and test data contain a row for each term in the vocabulary and a column for each document. The values in the table represent raw term frequencies. The data has already been preprocessed to extract terms, remove stop words, and perform stemming (so the vocabulary contains stems, not full terms). Please be sure to read the readme.txt file in the distribution.

Your tasks in this problem are the following. [Note: for this problem you should not use scikit-learn for classification, but create your own KNN classifier. You may use Pandas, NumPy, standard Python libraries, and Matplotlib.]

a. Create your own KNN classifier function.
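As a rough illustration of what such a function might look like, here is a minimal NumPy-only sketch. The name `knn_classify`, its parameters, and the toy data are hypothetical. It also assumes documents are stored as rows, so the term-by-document matrix from the distribution would need to be transposed first.

```python
import numpy as np

def knn_classify(train, labels, instance, k, measure="euclidean"):
    """Return (predicted_label, indices_of_the_k_nearest_training_rows).

    train:    2-D array, one row per training document (assumed layout)
    labels:   1-D array of class labels aligned with the rows of train
    instance: 1-D array holding the document to classify
    """
    train = np.asarray(train, dtype=float)
    labels = np.asarray(labels)
    instance = np.asarray(instance, dtype=float)

    if measure == "euclidean":
        dist = np.sqrt(((train - instance) ** 2).sum(axis=1))
    elif measure == "cosine":
        norms = np.linalg.norm(train, axis=1) * np.linalg.norm(instance)
        # Guard against zero vectors: their similarity is treated as 0.
        sims = np.divide(train @ instance, norms,
                         out=np.zeros(len(train)), where=norms > 0)
        dist = 1.0 - sims  # turn similarity into a distance
    else:
        raise ValueError("measure must be 'euclidean' or 'cosine'")

    neighbors = np.argsort(dist, kind="stable")[:k]
    # Majority vote; np.unique sorts classes, so ties go to the smallest label.
    classes, counts = np.unique(labels[neighbors], return_counts=True)
    return classes[np.argmax(counts)], neighbors

# Toy data (made up): four documents over a three-term vocabulary.
train = np.array([[5, 0, 1], [4, 1, 0], [0, 3, 4], [1, 4, 5]])
labels = np.array([1, 1, 0, 0])
query = np.array([4, 0, 1])

pred, nbrs = knn_classify(train, labels, query, k=3)
pred_cos, nbrs_cos = knn_classify(train, labels, query, k=3, measure="cosine")
```

Handling both measures through one `measure` parameter is only one option; two separate functions would work equally well.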
Your classifier should accept as input the training data matrix, the training labels, the instance to be classified, and the value of K, and should return the predicted class for the instance and the top K neighbors. Your classifier should work with Euclidean distance as well as Cosine similarity. You may create two separate classifiers, or add this capability as a parameter for the classifier function.

b. Create a function to compute the classification accuracy over the test data set (the ratio of correct predictions to the number of test instances). This function will call the classifier function from part a on all the test instances and, in each case, compare the actual test class label to the predicted class label.

c. Run your accuracy function on a range of values for K in order to compare accuracy values for different numbers of neighbors. Do this using both Euclidean distance and the Cosine similarity measure. [For example, you can try evaluating your classifiers on a range of values of K from 1 through 20 and present the results as a table or a graph.]

d. Using Python, modify the training and test data sets so that term weights are converted to TFxIDF weights (instead of raw term frequencies). Then rerun your evaluation on the range of K values (as above) and compare the results to the results without TFxIDF weights.

e. Create a new classifier based on the Rocchio Method adapted for text categorization. You should separate the training function from the classification function. The training part of the classifier can be implemented as a function that takes as input the training data matrix and the training labels, returning the prototype vectors for each class. The classification part can be implemented as another function that takes as input the prototypes returned from the training function and the instance to be classified. This function should measure the Cosine similarity of the test instance to each prototype vector.
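The separation between training and classification described here could be sketched as follows. This is a simplified illustration, not the assignment's reference solution. It assumes each prototype is the plain centroid (mean vector) of its class's training documents, and that documents are stored as rows. The function names and toy data are hypothetical.

```python
import numpy as np

def rocchio_train(train, labels):
    """Return {class_label: prototype}; prototype = class centroid (assumption)."""
    train = np.asarray(train, dtype=float)
    labels = np.asarray(labels)
    return {c.item(): train[labels == c].mean(axis=0) for c in np.unique(labels)}

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def rocchio_classify(prototypes, instance):
    """Return (predicted_label, {class_label: similarity_to_prototype})."""
    instance = np.asarray(instance, dtype=float)
    sims = {c: cosine_sim(p, instance) for c, p in prototypes.items()}
    return max(sims, key=sims.get), sims

# Toy data (made up): four documents over a three-term vocabulary.
train = np.array([[5, 0, 1], [4, 1, 0], [0, 3, 4], [1, 4, 5]])
labels = np.array([1, 1, 0, 0])

prototypes = rocchio_train(train, labels)
pred, sims = rocchio_classify(prototypes, np.array([4, 0, 1]))
```

Because training is a separate step, the prototypes are computed once and reused for every test instance.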
Your output should indicate the predicted class for the test instance and the similarity values of the instance to each of the category prototypes. Finally, compute the classification accuracy using the test instances and compare your results to the best KNN approach you tried earlier. (Ignore this question.)

2. Classification using scikit-learn [Dataset: bank_data.csv]

For this problem you will experiment with various classifiers provided as part of the scikit-learn (sklearn) machine learning module, as well as with some of its preprocessing and model evaluation capabilities. [Note: This module is already part of the Anaconda distributions. However, if you are using a standalone Python distribution, you will need to first obtain and install it.] You will work with a modified subset of a real data set of customers for a bank. This is the same data set used in Assignment 1. The data is provided in a CSV-formatted file with the first row containing the attribute names.

Your tasks in this problem are the following:

a. Load and preprocess the data using NumPy or Pandas and the preprocessing functions from scikit-learn. Specifically, you need to separate the target attribute (pep) from the portion of the data to be used for training and testing. You will need to convert the selected dataset into the Standard Spreadsheet format (scikit-learn functions generally assume that all attributes are in numeric form). Finally, you need to split the transformed data into training and test sets (using an 80%-20% randomized split). [Review the IPython Notebook examples from Week 4 for different ways to perform these tasks.]

b. Run scikit-learn's KNN classifier on the test set. Note: in the case of KNN, you should first normalize the data so that all attributes are on the same scale (normalize so that the values are between 0 and 1). Generate the confusion matrix (visualize it using Matplotlib), as well as the classification report.
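The normalize, fit, and evaluate sequence above might be sketched as below. Since bank_data.csv is not reproduced here, the sketch substitutes synthetic data from `make_classification` as a stand-in. The variable names, `random_state` values, and KNN settings are illustrative assumptions, not recommended answers.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed, all-numeric bank data (assumption).
X, y = make_classification(n_samples=600, n_features=8, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=33)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into preprocessing.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# weights="distance" enables distance weighting; "uniform" disables it.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train_s, y_train)
y_pred = knn.predict(X_test_s)

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted
print(cm)
print(classification_report(y_test, y_pred))
```

In a notebook, the matrix `cm` could then be drawn with Matplotlib, for instance via `plt.matshow(cm)`.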
Also, compute the average accuracy score. Experiment with different values of K and the weight parameter (i.e., with or without distance weighting) for KNN to see if you can improve accuracy (you do not need to provide the details of all of your experimentation, but provide a short discussion of which parameters worked best as well as your final results).

c. Repeat the classification using scikit-learn's decision tree classifier (using the default parameters) and the Naive Bayes (Gaussian) classifier. As above, generate the confusion matrix, classification report, and average accuracy scores for each classifier. For each model, compare the average accuracy scores on the test and the training data sets. What does the comparison tell you in terms of the bias-variance tradeoff?

d. Discuss your observations based on the above experiments.

3. Data Analysis and Predictive Modeling on Census data [Dataset: adult-modified.csv]

For this problem you will use a simplified version of the Adult Census Data Set. In the subset provided here, some of the attributes have been removed and some preprocessing has been performed.

Your tasks in this problem are the following:

a. Preprocessing and data analysis:

- Examine the data for missing values. In the case of categorical attributes, remove instances with missing values. In the case of numeric attributes, impute and fill in the missing values using the attribute mean.
- Examine the characteristics of the attributes, including relevant statistics for each attribute, histograms illustrating the distributions of numeric attributes, bar graphs showing value counts for categorical attributes, etc.
- Perform the following cross-tabulations (including generating bar charts): education+race, work-class+income, work-class+race, and race+income. In the latter case (race+income), also create a table or chart showing percentages of each race category that fall in the low-income group.
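A cross-tabulation with per-category percentages could be produced with `pandas.crosstab`, as in this small sketch. The DataFrame is made up for illustration. The column names and category values are assumptions and may not match adult-modified.csv.

```python
import pandas as pd

# Tiny made-up sample; column names and values are assumptions.
df = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian",
             "Black", "White", "Asian", "White"],
    "income": ["<=50K", "<=50K", ">50K", ">50K",
               "<=50K", "<=50K", "<=50K", ">50K"],
})

# Raw counts for each (race, income) combination.
counts = pd.crosstab(df["race"], df["income"])

# normalize="index" divides each row by its row total, giving the share
# of each race category that falls into each income group.
pct = pd.crosstab(df["race"], df["income"], normalize="index") * 100
low_income_pct = pct["<=50K"]

print(counts)
print(low_income_pct.round(1))
```

In a notebook, `counts.plot(kind="bar")` would give the corresponding bar chart (this requires Matplotlib).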
  Discuss your observations from this analysis.
- Compare and contrast the characteristics of the low-income and high-income categories across the different attributes.

b. Predictive Modeling and Model Evaluation:

- Using either Pandas or scikit-learn, create dummy variables for the categorical attributes. Then separate the target attribute (income_>50K) from the attributes used for training. [Note: you need to drop income_...]
- Use scikit-learn to build classifiers using Naive Bayes (Gaussian), decision tree (using entropy as the selection criterion), and linear discriminant analysis (LDA). For each of these, perform 10-fold cross-validation (using the cross-validation module in scikit-learn) and report the overall average accuracy.
- For the decision tree model (generated on the full training data), generate a visualization of the tree and submit it as a separate file (png, jpg, or pdf) or embed it in the Jupyter Notebook.

Notes on Submission: You must submit your Jupyter Notebook (similar to the examples in class), which includes your documented code, the results of your interactions, and any discussions or explanations of the results. Please organize your notebook so that it is clear which parts of the notebook correspond to which problems in the assignment. Please submit the notebook in both IPYNB and HTML formats (along with any auxiliary files). Your assignment should be submitted via D2L.

Reposted from: http://www.7daixie.com/2019050556938040.html
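For the 10-fold cross-validation step in Problem 3(b), a minimal sketch might look like the following. It uses `sklearn.model_selection.cross_val_score`. In current scikit-learn releases this module replaces the older `sklearn.cross_validation`. Synthetic data stands in for the dummy-coded census attributes, and all names and `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dummy-coded census data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "Naive Bayes (Gaussian)": GaussianNB(),
    "Decision tree (entropy)": DecisionTreeClassifier(
        criterion="entropy", random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}

mean_accuracy = {}
for name, model in models.items():
    # With cv=10 and a classifier, scikit-learn uses stratified 10-fold
    # splits and scores each fold by accuracy by default.
    scores = cross_val_score(model, X, y, cv=10)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {mean_accuracy[name]:.3f}")
```

For the tree visualization, one option in recent scikit-learn versions is `sklearn.tree.plot_tree`, applied to a tree fitted on the full training data.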


Article link: https://www.haomeiwen.com/subject/mnmiactx.html