Book Review: Applied Predictive Analytics

Author: 马文Marvin | Published 2018-09-12 01:19

Author: Dean Abbott
Publisher: Wiley
Subtitle: Principles and Techniques for the Professional Data Analyst
Published: March 31st 2014
Source: downloaded PDF version
Goodreads: 4.57 (23 ratings)
Douban: not listed

A small direct response company had developed dozens of programs in cooperation with major brands to sell books and DVDs. These affinity programs were very successful, but required considerable up-front work to develop the creative content and determine which customers, already engaged with the brand, were worth the significant marketing spend to purchase the books or DVDs on subscription. Typically, they first developed test mailings on a moderately sized sample to determine if the expected response rates were high enough to justify a larger program.
One analyst with the company identified a way to help the company become more profitable. What if one could identify the key characteristics of those who responded to the test mailing? Furthermore, what if one could generate a score for these customers and determine what minimum score would result in a high enough response rate to make the campaign profitable? The analyst discovered predictive analytics techniques that could be used for both purposes, finding key customer characteristics and using those characteristics to generate a score that could be used to determine which customers to mail.

Analytics is the process of using computational methods to discover and report influential patterns in data. The goal of analytics is to gain insight and often to affect decisions. Data is necessarily a measure of historic information, so by definition analytics examines historic data. The term itself rose to prominence in 2005, in large part due to the introduction of Google Analytics. Nevertheless, the ideas behind analytics are not new at all but have been represented by different terms throughout the decades, including cybernetics, data analysis, neural networks, pattern recognition, statistics, knowledge discovery, data mining, and now even data science.

The difference between the business intelligence and predictive analytics measures is that the business intelligence variables identified in the questions were, as already described, user driven. In the predictive analytics approach, the predictive modeling algorithms considered many patterns, sometimes all possible patterns, and determined which ones were most predictive of the measure of interest (likelihood). The discovery of the patterns is data driven.

Figure 2-2 shows a conceptualized timeline for defining the target variable. If the date the data is pulled is the last vertical line on the right, the “Data pull timestamp,” all data used for modeling by definition must precede that date. Because the timestamp for the definition of the target variable value must occur after the last information known from the input variables, the timeline for constructing the modeling data must be shifted to the left.
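To make the timeline concrete, here is a minimal Python sketch of keeping the input variables strictly earlier than the target window. The column names and dates are hypothetical illustrations, not taken from the book's dataset.

```python
import pandas as pd

# A minimal sketch of the timeline idea: input variables may only use
# information known before the target window begins. Column names are
# hypothetical, not from the book's dataset.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "txn_date": pd.to_datetime(["2024-01-05", "2024-03-20", "2024-02-11", "2024-04-02"]),
    "amount": [25.0, 40.0, 10.0, 60.0],
})

target_window_start = pd.Timestamp("2024-04-01")  # cutoff separating inputs from the target

# Inputs: behavior known strictly before the target window
inputs = (transactions[transactions["txn_date"] < target_window_start]
          .groupby("customer_id")["amount"].sum()
          .rename("amount_before_cutoff"))

# Target: did the customer transact during the target window?
target = (transactions[transactions["txn_date"] >= target_window_start]
          .groupby("customer_id").size().gt(0).astype(int)
          .rename("TARGET"))

modeling_data = inputs.to_frame().join(target, how="left").fillna({"TARGET": 0})
print(modeling_data)
```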

The Target Variables
Two target variables were identified for use in modeling: TARGET_B and TARGET_D. Responders to the recovery mailing were assigned a value of 1 for TARGET_B. If the lapsed donor did not respond to the mailing, he received a TARGET_B value of 0. TARGET_D was populated with the amount of the gift that the lapsed donor gave as a result of the mailing. If he did not give a gift at all (i.e., TARGET_B = 0), the lapsed donor was assigned a value of 0 for TARGET_D.
Thus, there are at least two kinds of models that can be built. If TARGET_B is the target variable, the model will be a binary classification model to predict the likelihood a lapsed donor can be recovered with a single mailing. If TARGET_D is the target variable, the model predicts the amount a lapsed donor gives from a single recovery campaign.
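A minimal sketch of how the two targets could be derived from raw response fields; the "responded" and "gift_amount" columns below are hypothetical stand-ins, not the actual KDD Cup 98 field names.

```python
import numpy as np
import pandas as pd

# A minimal sketch of how the two targets relate. The "responded" and
# "gift_amount" columns are hypothetical stand-ins for the raw mailing
# response fields, not the actual KDD Cup 98 field names.
donors = pd.DataFrame({
    "responded": [True, False, True, False],
    "gift_amount": [25.0, np.nan, 10.0, np.nan],
})

donors["TARGET_B"] = donors["responded"].astype(int)    # 1 = responded to the recovery mailing
donors["TARGET_D"] = donors["gift_amount"].fillna(0.0)  # gift amount; 0 if no gift was given

# TARGET_B drives a binary classification model (will a lapsed donor respond?);
# TARGET_D drives a regression model of the gift amount.
print(donors[["TARGET_B", "TARGET_D"]])
```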

First, look at the minimum and maximum values. Do these make sense? Are there any unexpected values? For most of the variables, the values look fine. RAMNTALL, the total donation amount given by a donor to the non-profit organization over his or her lifetime, ranges from 13 to 9,485. These are plausible.
Next, consider AGE. The maximum value is 98 years old, which is high, but believable. The minimum value, however, is 1. Were there 1-year-old donors? Does this make sense? Obviously, 1-year-olds did not write checks or submit credit card donations to the organization. Could it mean that donations were given in the name of a 1-year-old? This could be, but will require further investigation by someone who can provide the necessary information. Note, however, that we only know that there is at least one 1-year-old in the data; this summary does not indicate how many 1-year-olds there are.
Third, consider FISTDATE, the date of the first gift to the organization. The maximum value is 9603, meaning that the most recent date of the first gift in this data is March 1996. Given the timeframe for the data, this makes sense. But the minimum value is 0. Literally, this means that the first donation date was the year of Christ’s birth. Obviously this is not the intent. The most likely explanation is that the value was unknown but at some time a value of 0 was used to replace the missing or null value.
Next consider the standard deviations. For AGE, the standard deviation is about 17 years compared to the mean of 61.61 years old. If AGE were normally distributed, you would expect 68 percent of the donors to have AGE values between 44 years old and 78 years old, 95 percent of the donors to have AGE values between 27 and 95 years old, and 99.7 percent of the AGE values to range between 10 and 112 years old. It is obvious from these values, where the maximum value is less than the value assumed by the normal distribution for three standard deviations above the mean, that the data is not normally distributed and therefore the standard deviation is not an accurate measure of the true spread of the data. This can be seen even more clearly with RAMNTALL where one standard deviation below the mean is negative, less than the minimum value (and a value that makes no operational sense).
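As a hedged illustration of these sanity checks, the sketch below runs them on a tiny invented sample that reuses the AGE, FISTDATE, and RAMNTALL field names; the real checks would of course run on the full donor file.

```python
import pandas as pd

# A minimal sketch of the sanity checks above, run on a tiny invented sample
# that reuses the AGE, FISTDATE, and RAMNTALL field names; the real checks
# would run on the full donor file.
donors = pd.DataFrame({
    "AGE":      [1, 34, 61, 98, 72],
    "FISTDATE": [0, 8901, 9603, 9105, 8712],   # YYMM format; 0 likely encodes "unknown"
    "RAMNTALL": [13, 250, 9485, 120, 75],
})

print(donors.describe())                 # min/max, mean, std for a first plausibility check
print(donors[donors["AGE"] < 10])        # implausible donors, e.g. the 1-year-olds
print(donors[donors["FISTDATE"] == 0])   # 0 almost certainly stands in for a missing date

# The 68-95-99.7 rule only describes the spread if a variable is roughly normal;
# comparing mean +/- 3*std against the observed min and max exposes the skew.
mean, std = donors["AGE"].mean(), donors["AGE"].std()
print(mean - 3 * std, mean + 3 * std, donors["AGE"].min(), donors["AGE"].max())
```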

Quartiles are a special case of the general rank-ordered measure called quantiles. Other commonly used quantiles are listed in Table 3-5. The consensus of predictive analytics software is to use quartile, quintile, decile, and percentile. However, there is no consensus for the name of the label for 20 bins; three of the names that appear most often are shown in the table. Of course, if you know the deciles, you also know the quintiles because quintiles can be computed by summing pairs of deciles. If you know the percentiles, you also know all of the other quantiles.
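A small sketch of the quantile relationships on invented donation amounts: viewed as cut points, every quintile boundary is every second decile boundary (the bin-count version of the same fact is what the excerpt describes), and all of them appear among the percentiles.

```python
import numpy as np

# A small sketch of the quantile relationships, using invented donation amounts:
# every quintile boundary is also a decile boundary (every second decile), and
# all of them appear among the percentiles.
amounts = np.random.default_rng(0).lognormal(mean=3.0, sigma=1.0, size=1000)

quartiles = np.percentile(amounts, [25, 50, 75])
quintiles = np.percentile(amounts, [20, 40, 60, 80])
deciles   = np.percentile(amounts, np.arange(10, 100, 10))

print(quartiles)
print(np.allclose(quintiles, deciles[1::2]))   # True: quintiles are every other decile
```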

Simpson’s Paradox isn’t just an academic oddity; it happens in many real data sets. One of the most famous is the case against the University of California at Berkeley, which was accused of gender bias in admissions to graduate schools. In 1973, 44 percent of male applicants were accepted, whereas only 35 percent of female applicants were accepted, a statistically significant difference very unlikely to occur by chance. It seemed like a clear case of discrimination.
However, after separating the admissions rates by department, it turned out that women were admitted at a higher rate than men in most departments. How can this be when the overall rates were so overwhelmingly in favor of men? Women applied to departments that were more competitive and therefore accepted lower percentages of applicants, suppressing their overall acceptance rates, whereas men tended to apply to departments with higher overall acceptance rates. It was only after considering the interactions that the trend appeared.
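The reversal is easy to reproduce with invented admissions counts (not the actual Berkeley figures): each department admits women at a higher rate, yet the aggregate appears to favor men because women mostly apply to the far more selective department.

```python
import pandas as pd

# A sketch of Simpson's Paradox with invented admissions counts (not the actual
# Berkeley figures): each department admits women at a higher rate, yet the
# aggregate appears to favor men.
apps = pd.DataFrame({
    "dept":     ["A", "A", "B", "B"],
    "gender":   ["M", "F", "M", "F"],
    "applied":  [800, 100, 200, 900],
    "admitted": [480,  65,  20, 100],
})

# Per-department rates: women are higher in both A (0.65 vs 0.60) and B (0.11 vs 0.10)
print(apps.assign(rate=apps["admitted"] / apps["applied"]))

# Aggregate rates: men 0.50 vs women 0.165, because women mostly applied to the
# far more selective department B
overall = apps.groupby("gender")[["applied", "admitted"]].sum()
print(overall["admitted"] / overall["applied"])
```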

Spurious correlations are relationships that appear meaningful but in fact have no direct connection whatsoever. For example, when the AFC wins the Super Bowl, 80 percent of the time the next year will be a bull market. Or butter production in Bangladesh produces a statistically significant correlation to the S&P 500 index.
One such rule is the so-called Redskin Rule. The rule is defined this way: examine the outcome of the last Washington Redskins home football game prior to a U.S. Presidential Election. Between 1940 and 2000, when the Redskins won that game, the incumbent party won the electoral vote for the White House; when the Redskins lost, the incumbent party lost the election. Interestingly, it worked perfectly for 17 consecutive elections (1940–2000). If you were to flip a fair coin (a random event with 50/50 odds), the likelihood of flipping 17 consecutive heads is 1 in 131,072, which is very unlikely.
However, the problem with spurious correlations is that we aren't asking the question, "How associated is the Redskins winning a home game prior to the presidential election with the election outcome?" any more than we are asking the same question of any other NFL team. Instead, the pattern of Redskins home wins is found after the fact to be correlated, as one of thousands of possible patterns to examine. If you examined enough other NFL outcomes, such as the relationship between Presidential elections and all teams winning their third game of the year, or fourth game of the year, or first pre-season game of the year, and so on until you've examined over 130,000 possible patterns, one of them is likely to be associated.
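The arithmetic behind this multiple-testing argument is easy to sketch. Treating each candidate rule as an independent sequence of 17 fair coin flips (a simplification, since real NFL outcomes are not independent), a perfect 17-for-17 rule somewhere among ~130,000 candidates becomes more likely than not.

```python
# A sketch of the multiple-testing arithmetic: treat each candidate rule as an
# independent sequence of 17 fair coin flips (a simplification). One rule
# matching 17 elections is rare, but with ~130,000 candidate rules a perfect
# match somewhere is more likely than not.
p_single = 0.5 ** 17                        # ~1 in 131,072 for any one rule
n_rules = 130_000                           # roughly the number of candidate patterns
p_at_least_one = 1 - (1 - p_single) ** n_rules

print(p_single)                             # ~7.6e-06
print(p_at_least_one)                       # ~0.63
```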

Data preparation clearly takes considerable thought and effort to ensure the data is presented to the algorithms in a form they can use effectively. Predictive modelers need to understand how the algorithms interpret the data so they can prepare the data appropriately for the algorithm. The extensive discussion in this chapter about how to correct for skewed distributions is not needed for decision trees, for example, but can be necessary for linear regression.
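As an aside on the skew-correction point, a log transform is one common way to compress a long right tail before linear regression; the sketch below uses invented donation amounts rather than the book's data.

```python
import numpy as np

# A minimal sketch of the skew-correction point: a log transform compresses the
# long right tail of a skewed variable (invented donation amounts) before linear
# regression; a decision tree, which splits on rank order, would not need this.
amounts = np.random.default_rng(1).lognormal(mean=3.0, sigma=1.0, size=1000)
log_amounts = np.log1p(amounts)             # log(1 + x) stays defined at zero

print(np.mean(amounts), np.median(amounts))          # mean well above median: right-skewed
print(np.mean(log_amounts), np.median(log_amounts))  # roughly symmetric after the transform
```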
Do not consider data preparation a process that concludes after the first pass. This stage is often revisited once problems or deficiencies are discovered while building models. Feature creation, in particular, is iterative as you discover which kinds of features work well for the data.
Overfitting the data is perhaps the biggest reason predictive models fail when deployed. You should always take care to construct the sampling strategy well so that overfitting can be identified and models adjusted appropriately.

Text mining can uncover insights into data that extend well beyond information in structured databases. Processing steps that interpret the text to uncover these patterns and insights can range from simple extraction of words and phrases to a more complex linguistic understanding of what the data means. The former treats words and phrases as dumb patterns, requiring minimal linguistic understanding of the data. This approach can apply, with minor modifications, to most languages. The latter requires extensive understanding of the language, including sentence structure, grammar, connotations, slang, and context.
This chapter takes the pattern extraction approach to text mining rather than a natural language processing (NLP) approach. In the former, the purpose of text mining is to convert unstructured data to structured data, which then can be used in predictive modeling. The meaning of the text, while helpful for interpreting the models, is not of primary concern for creating the features. While text mining of patterns is simpler, there is still considerable complexity in the text processing and the features can provide significant improvements in modeling accuracy.
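To illustrate the pattern-extraction idea, the sketch below turns a few invented documents into a structured term-frequency matrix with scikit-learn's CountVectorizer; that is one possible tool, not one the book prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A sketch of the pattern-extraction approach: turn raw text into a structured
# term-frequency matrix that predictive models can consume. CountVectorizer is
# one possible tool; the documents are invented for illustration.
docs = [
    "late delivery, product damaged",
    "fast delivery, great product",
    "product never arrived, requesting refund",
]

vectorizer = CountVectorizer()        # words treated as "dumb" patterns, no linguistics
X = vectorizer.fit_transform(docs)    # rows = documents, columns = term counts

print(vectorizer.get_feature_names_out())
print(X.toarray())                    # structured features ready for modeling
```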
