ML4T | week 9 The Big Short, Tim


Author: 我的名字叫清阳 | Published 2019-04-28 06:54

    This Week's Lessons.

    This week there are no regular lessons; instead, watch a few videos: The Big Short, Time Series Data (first 30 minutes), and Technical Trading.

    • Slides for Technical Trading

    The videos are hosted on YouTube.

    Project 5 Video

    Here is the process to create a market simulator (a minimal sketch follows below):

    1. Construct df_prices to record the adjusted close price of each stock in the portfolio on each trading day. Add a "CASH" column to the frame and set its value to $1 on every day.
    2. Construct df_trades to hold all the trades. Make sure the "CASH" column is updated correctly for each trade.
    3. Construct df_holdings: the number of shares of each stock, and the amount of cash, held on each day.
    4. df_value = df_prices * df_holdings.
    5. df_port_val = the row-wise sum of df_value.
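
    A minimal sketch of the five steps in pandas. The orders-frame columns ("Symbol", "Order", "Shares"), the date index, the starting cash, and the function name are assumptions for illustration, not the project's required API:

        import pandas as pd

        def compute_portvals(prices, orders, start_cash=100000.0):
            # 1. df_prices: adjusted close for each symbol, plus a CASH column fixed at $1
            df_prices = prices.copy()
            df_prices["CASH"] = 1.0

            # 2. df_trades: change in shares per day, plus the matching change in cash
            df_trades = pd.DataFrame(0.0, index=df_prices.index, columns=df_prices.columns)
            for date, row in orders.iterrows():
                sign = 1 if row["Order"] == "BUY" else -1
                df_trades.loc[date, row["Symbol"]] += sign * row["Shares"]
                df_trades.loc[date, "CASH"] -= sign * row["Shares"] * df_prices.loc[date, row["Symbol"]]

            # 3. df_holdings: cumulative trades, starting with start_cash in the CASH column
            df_holdings = df_trades.cumsum()
            df_holdings["CASH"] += start_cash

            # 4. df_value: dollar value of every position on every day
            df_value = df_prices * df_holdings

            # 5. df_port_val: total portfolio value per day
            return df_value.sum(axis=1)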

    Time Series Data (First 30 Minutes)

    • When using technical indicators to predict stock prices, we usually use the indicator values on day t as X to predict the price several days later (see the sketch below).
    • To construct a dataset for this, pair Y at time t with X at time t - n to get one X-Y pair for model building.
    • Usually we split the data into a training set and a test set, and the test data should always come later in time than the training data.
      • Building a model on later data and testing on earlier data biases the result, because the present is influenced by history.
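
    A small sketch of building such X-Y pairs with pandas. The 5-day horizon and the use of forward returns as Y are illustrative assumptions:

        import pandas as pd

        def make_xy(indicators, prices, horizon=5):
            # Y on day t is the percent change in price from t to t + horizon
            y = prices.shift(-horizon) / prices - 1.0
            xy = indicators.copy()        # X on day t is the indicator row for day t
            xy["Y"] = y
            return xy.dropna()            # the last `horizon` days have no future price yet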

    Backtesting to Validate the Model

    1. Go back to the beginning of the data and select a chunk of it for training and building the model. Based on that model, make a forecast into the future (n days ahead), then make a trading decision.
    2. Then roll forward past the training and forecast period, select another chunk of data, and repeat step 1.

    This process is called roll-forward cross validation, or out-of-sample validation.

    In this method we may only test on slices of data that lie in the future relative to the training data, because it is much easier to "predict" the past than the future.
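
    A minimal sketch of generating roll-forward train/test splits. The window sizes (252 training days, 5 test days) and the function name are assumed values for illustration:

        def roll_forward_splits(n_days, train_len=252, test_len=5):
            # Train on a fixed look-back window, test on the next few days, then roll forward.
            start = 0
            while start + train_len + test_len <= n_days:
                train = range(start, start + train_len)
                test = range(start + train_len, start + train_len + test_len)
                yield train, test
                start += test_len    # advance past the period just tested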

    10 Ways That a Backtest Can Go Wrong

    1. In-sample backtesting

    • Description: train and test the model on the same dataset.
    • Problem: the in-sample error will be misleadingly small, so results look better than they really are.
    • How to avoid: use different data for training and testing, e.g. train on 2007 data and test on 2008+ data.

    2. Survivor bias

    [Figure: S&P 500 performance computed with today's membership (green) vs. the actual membership at each point in time (purple)]
    • Description: as time goes on, participants may drop out of an experiment, so the final results can only be measured on the participants who are still in it.
    • Problem: as seen in the graph above, the green line is the S&P 500 performance computed from its current members (the survivors), while the purple line is the S&P 500 computed from the membership at the start of the period. The green line overestimates the return (and the price of the index).
    • How to avoid: 1) use the historical membership; 2) use survivor-bias-free (SBF) data; 3) use the index constituents at each point in time as the universe for testing.

    3. Ignoring market impact

    • Description: historical data contains no information about your own trading, but your trades can move the price of the stock you trade. Ignoring this leads to overly optimistic backtests.
    • Problem: when you act on a prediction with live trading, your own orders can push the price against you and erode the predicted edge.
    • How to avoid: include a "slippage" or "market impact" model when backtesting (a toy example follows below).
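
    A toy constant-impact model of the kind you might add to a backtest; the functional form and the 0.005 coefficient are assumptions for illustration, not the course's required model:

        def fill_price(price, shares, impact=0.005):
            # Buys fill slightly above the observed price, sells slightly below it,
            # so every simulated trade pays a small penalty for moving the market.
            return price * (1 + impact) if shares > 0 else price * (1 - impact)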

    Case Study in Vectorization of Technical Strategy Python Code

    • Note: we already have code from project 5 to read in an orders file and calculate portfolio value.
    • The Relative Strength Index (wiki) can be used to detect oversold and overbought conditions.

    The chart shown in the video includes:

    • SMA (the brown line)
    • Bollinger Bands (the red lines)
    • Relative Strength Index (the black curve below the bar chart)

    The easy but slow way of calculating the indicators

    The fully iterative method above is a correct way of doing the calculation, but it is very slow; timing it gives:

    real 5m24.461s
    user 5m24.281s
    sys 0m0.185s

    Solution: vectorize the calculation (see the sketch below). Here is the PPT used in the video.
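
    A sketch of the same three indicators computed with vectorized pandas operations. It uses built-in rolling windows rather than the cumulative-sum tricks discussed later in the lecture; the 14-day look-back matches the example, but the function name and layout are assumptions:

        import pandas as pd

        def indicators(prices, lookback=14):
            # prices: DataFrame of adjusted close prices (rows = days, columns = symbols)
            sma = prices.rolling(lookback).mean()
            price_sma = prices / sma                    # > 1 above the SMA, < 1 below it

            std = prices.rolling(lookback).std()
            top, bottom = sma + 2 * std, sma - 2 * std
            bbp = (prices - bottom) / (top - bottom)    # 0 at the lower band, 1 at the upper band

            rets = prices.diff()
            up = rets.clip(lower=0).rolling(lookback).mean()     # average gain on up days
            down = -rets.clip(upper=0).rolling(lookback).mean()  # average loss on down days
            rsi = 100 - 100 / (1 + up / down)           # conventional 0-100 RSI scale

            return price_sma, bbp, rsi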


    Closed Caption of the video.

    • Each plane represents one feature for all the stocks, as a data frame, with time running left to right instead of down the rows as it usually does.
    • In technical analysis these are technical features: features computed from price and volume.
    • A vertical slice represents the values of the features on a particular day. We also have historical prices, which we use to train the system. The last day, highlighted in yellow, is "today"; we are trying to make a forecast into the future, and that forecast is our Y.

    To train the system: 1) go back to some historical date and record the values of the features on that date (X); 2) look at the change in price some days into the future (Y); 3) use these X-Y pairs to train the system.

    How we would actually backtest one of these strategies:

    • First, you cannot train and test on the same data.
    • Second, you cannot train on later data and test on earlier data.

    ---------- I will stop here--------

    If the stock had these features and, ten days later, it went up 5%, then looking at that outcome is like time travel: like being able to go back and know you should have sold everything before the market crash. It is a kind of cheating, so we take steps to prohibit it. You roll time back to an artificial "today" (the boundary between the blue and white regions in the slide). You may train your model on all the data before that date, but on nothing after it. You train the model and make a forecast, say five days into the future, classifying stocks as buys or sells or predicting how much they will go up or down; you are not allowed to peek at what actually happens there, only to query the model with data up to that last day. Then you enter your orders and trade.

    An important point: if your signal is based on the market close, you should not trade at that same close. You cannot act on something at the same moment you observe it; the earliest you can trade is the next day. Keep that in mind when you build your backtester.

    You then hold those positions and see what happens; time advances to the date you forecast, and you train the system again on the data available at that point. One question is how far back you should train. In my experience it is better to be consistent and always look back a constant amount rather than over all of history: if you train over all the data, performance gets muddled because the model is trained across different market regimes, so it is better to train over more recent periods. You rebuild the model, make another forecast, trade, and advance time again.

    This method is called roll-forward cross validation. When we first introduced machine learning we talked about cross validation, where the data is sliced into, say, ten pieces, we train on nine and test on the tenth, and we vary which piece is the test set. The important distinction here is that you may only test on slices that lie in the future relative to the training data; you cannot, for example, train on the last nine years and test on the year ten years ago. The reason is that it is much easier to "predict" the past than the future: changes in market regimes happen in the past, affect what happens in the future, and are usually irreversible, so if you train on data from the future you will do extremely well very fast, and you should not trust those results. With stock data you cannot use ordinary multi-fold cross validation; you can only roll forward.
    That is basically an overview of what you should do for the next project, and in general of how we train and test machine learning strategies. Now I want to talk about ways that backtests can go wrong; the name of this talk is roughly "ways that backtests lie." Every now and then in this business some young whippersnapper will show you a fantastic backtest and try to convince you to use the strategy, but if you ask a few simple questions you usually discover something is wrong. These are the questions you should ask yourself as you build your own strategies.

    The first way a backtest lies is in-sample backtesting: you train your model over a few years and then test it over those same years. Believe it or not, it does not even occur to a lot of people that this is flawed, but it should certainly occur to you; this method is essentially guaranteed to "succeed." For the upcoming project we are actually going to allow you to test in sample, because with the training you have so far it is almost impossible to build a machine learning system that works well in a real market, and we want you to at least see it work on its training data; but we also want you to test out of sample so you can see the difference.

    There is a subtler version of the same fallacy. Suppose you do roll-forward cross validation from 2007 to 2016, it does not work well, so you switch from decision trees to linear regression, swap one factor for another, and run it again, and again. You are still doing roll-forward cross validation, but you are tweaking your method each time so that it works better over that specific period; you now know which features work over the whole span, which you could not have known in 2007, so you are back in the in-sample trap. The fix is an additional form of roll-forward validation: do all of your initial tweaking over a limited historical window (say, nothing past 2010), find the best model you can, and only then let it run forward and see how it does.

    Survivor bias is the selective use of data in a statistical study that emphasizes the examples that are still "alive." An example: suppose a drug company runs a five-year study of a blood pressure drug on 500 randomly chosen people, measuring blood pressure every month. In the first month the average blood pressure is 160 over 110, which is high; by the end of the study the average patient is at 135 over 80. The drug sounds fantastic, until you learn that 58 of the 500 patients died during the study, and they were the ones with the highest blood pressure. At the end you are only measuring the people who lived, who are in some sense guaranteed to be in better health than the ones who died.

    It is the same if someone says, "I tested my great strategy on the S&P 500; that's 500 stocks, so it must be a thorough test." If you go back from 2008 to the present, 58 of the members of today's S&P 500 "died" in 2008-2009, so working with the S&P 500 as of today means working only with stocks that survived the downturn. In the chart, the purple line is the performance of the S&P 500 over time according to which stocks were actually in the index at each point in time; the green line is the performance of a portfolio made of the index's membership as of today, traced back through that same period. There is about a 13% difference. A study whose universe is the current S&P 500 membership therefore has a built-in advantage of roughly 13%, so when a young whippersnapper shows you a strategy that beats the S&P 500 by 13% and their universe is "the S&P 500 as of today," you know where that edge came from. (Investing in SPY itself tracks the purple line, because in 2009 SPY held the stocks that were in the index in 2009.)

    How to prevent it: in the data you were given for this class you have the S&P 500 membership as of 2008 and as of 2012, so you can run a survivor-bias-free backtest using the lists of stocks that actually existed at those times. If you go work at Citadel or some other hedge fund, they pay for this data: for a number of major indexes they know, for every day in history, which stocks made up the index. To backtest a strategy on, say, the Russell 3000, you go back to the start date, ask which stocks were in the Russell 3000 on that day, use that list to select what you might trade, and then step forward, doing the same thing every day.
    I will do just one more and then hand it over to Dave: market impact. Even if your system is genuinely predictive, meaning that if you only made predictions and watched what happened the predictions would be accurate, the very act of engaging the market with that information can move the price against you enough to nullify your informational advantage. Here is how I discovered this. I was working at a hedge fund in California, building machine learning models, and we found something that backtested with an absurd Sharpe ratio of seven. My boss said "no way does that work," which is what you should say any time someone shows you a Sharpe ratio of seven. I convinced him it looked good enough to trade, so in 2009 we started trading a million dollars against the strategy, and from that point the equity curve went flat; it was a long/short strategy and it simply stopped making money. We spent weeks overhauling the code, certain that we had somehow managed to peek into the future in the backtest, because of course when you trade live you cannot peek; we found nothing wrong. We stopped trading it, and on paper it "worked" again.

    When we dug down to figure out what had happened, it turned out the moves we were predicting were about 20 basis points (0.2%) over roughly a week, and the stocks the model was selecting were very thinly traded, sometimes only around $50,000 of volume in a whole day. We were trading about $10,000 of them, roughly 20% of the daily volume, and that was enough to move the price by about 12 basis points when we entered and another 12 when we exited. Remember you have to exit the position too: we were paying roughly 24 basis points of impact against a 20 basis point prediction edge. Someone asked, "what if you just traded less?" Probably that would have worked. What we had really found was a market inefficiency: according to the efficient market hypothesis that information should already have been reflected in the price, and the reason it was not is that these stocks trade so little that it is not worth a twenty-billion-dollar fund like Citadel pointing its guns at them; you cannot buy twenty million dollars of a stock that trades a million dollars without obliterating the price. A bunch of young whippersnappers in a garage might profitably buy a few hundred dollars of each of 500 thinly traded stocks, but every strategy has a capacity: it might work great with five hundred thousand dollars in it and break once there are two million.

    A question from the audience about legality: are there legal issues when your trades affect the price? It is not well defined. Simply buying a stock and impacting the price because you bought it is a legitimate thing. What some people do, though, is place a big order from a co-located machine so that everyone sees it in the order book, then pull it within a few milliseconds; a trader elsewhere reacts to an order that no longer exists, and by revealing themselves they can be exploited. That may be, or may become, illegal; it is currently a gray area people are getting agitated about.

    Those are the main points I wanted to cover. Now I will hand it over to Dave, who is going to talk about how to operationalize some technical strategies.

    Dave: Tucker asked me to do a vectorization tutorial targeted at the next assignment. I did not actually solve the next assignment; I solved an older version of it. The assignment is a trading strategy: you create a custom technical indicator, meaning any time series you can compute from just price and volume, and use it to build a custom trading strategy, which you then backtest through time and evaluate. You do not have to write the backtester yourself: the assignment produces an orders file in the format of the market simulator you already wrote, so you can leverage that simulator to evaluate how well the strategy did. Given any date range and symbol set, your code should be able to generate the orders file of what your strategy would do for that time period and those symbols.
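
    For reference, the orders file the market simulator consumes is a simple CSV; the exact columns below are shown as an assumed example of that format:

        Date,Symbol,Order,Shares
        2011-01-10,AAPL,BUY,100
        2011-01-13,IBM,SELL,100
        2011-02-02,AAPL,SELL,100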
    To make it complicated and difficult to vectorize (because I am a masochist), I chose a composite indicator: I take several technical indicators and make a buy/sell decision from them together. The composite looks for divergence between a stock and the overall market. Divergence-based strategies look for two things that normally move together: most stocks go up when the market goes up and down when the market goes down; that is your beta from the CAPM. But sometimes a stock that typically behaves that way suddenly breaks from the market, going up while the market goes down or vice versa, and that can be an interesting event that tells you something you might care about is happening.

    The constituent indicators of my custom indicator are:

    • Price/SMA ratio: exactly what it sounds like; today's price divided by the n-day simple moving average (the last n days added up and divided by n). A value of 1 means the price is right at the SMA, above 1 means above it, below 1 means below it.
    • Bollinger Band percentage (BB%): you have seen Bollinger Bands earlier in the course; the SMA is the center line, the SMA plus two standard deviations over the same rolling period is the upper band, and the SMA minus two standard deviations is the lower band, and the usual signal is crossing back into the bands after being outside them. Here I take that one step further and express the price as a percentage of the distance from the bottom band (0%) to the top band (100%), so it becomes a single number.
    • Relative Strength Index (RSI): a different type of indicator, an oscillator, because it attempts to signal when the market or the stock is reaching a high or a low rather than telling you which direction to trade. It is the ratio of how much the stock goes up on days it goes up to how much it goes down on days it goes down. A stock might have roughly equal numbers of up and down days, which looks untradeable, but if on up days it gains twice as much as it loses on down days, that is useful information. RSI normalizes this ratio to a 0-100 scale; below 30 is typically interpreted as oversold (unfairly punished, likely to go back up) and above 70 as overbought (unfairly rewarded, likely to come back down).

    An example chart from stockcharts.com (one I sometimes use) shows these indicators, with the RSI in the bottom area and shaded regions identifying where the indicator says Google's stock was overbought or oversold; you have already dealt with the Bollinger Bands yourselves.

    The basic strategy wrapped around this: go long whenever the stock looks oversold (perhaps unfairly punished) but the index is not oversold, so the stock has gone down more than seems reasonable relative to the market in a short period and I can try to capture the expected reversion to the mean; conversely, go short when the symbol is overbought and the index is not; and close any position whenever the price crosses back through the moving average, since the reversion to the mean has completed and I am not sure what happens next. Concretely, I want to be a little outside the upper or lower Bollinger Band, substantially above or below 1 on the price/SMA ratio, and past the RSI thresholds. I do not claim this is a great strategy; it is just one I thought would be complex enough to make the assignment interesting and show some vectorization.

    I initially wrote it as iterative code, which is how most people who come from a conventional software engineering background think (either that or recursion); vectorization is not natural unless you come from a math background, in which case you are lucky to have chosen machine learning. Fully iteratively, to calculate the SMA I create an array of the same size and shape as the daily price frame. (The copy function is not as expensive as you might think: pandas does not immediately duplicate the values when you issue a copy, it copies pointers and only diverges the data later, and it actually turned out to be faster than creating a new zero-filled array, which is weird.) I clear it out so I have a blank, correctly sized SMA array, and then loop over all the days, over all the symbols, and, in a third nested loop, over the look-back period, summing the prices and taking the average to get my SMA time series. A triple nested loop: not great. Same idea for the Bollinger Bands: since nothing is vectorized, I loop over all the days, all the symbols, and the look-back period; I already have the SMA, so for each day I accumulate the squared differences between the price and the SMA as in a normal standard deviation calculation, then the bottom band is the SMA minus two standard deviations and the top band is the SMA plus two, and the BB% comes from normalizing the price between them (a sketch of this loop style follows below).
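
    A sketch of the iterative style being described (a reconstruction, not the lecturer's exact code); variable names and NaN handling are assumptions:

        import pandas as pd

        def sma_iterative(prices, lookback=14):
            # Triple nested loop: days x symbols x look-back window. Correct but very slow.
            sma = prices.copy()
            sma.iloc[:, :] = 0.0
            for day in range(prices.shape[0]):
                for sym in prices.columns:
                    if day < lookback - 1:
                        sma.loc[prices.index[day], sym] = float("nan")
                        continue
                    total = 0.0
                    for k in range(lookback):
                        total += prices[sym].iloc[day - k]
                    sma.loc[prices.index[day], sym] = total / lookback
            return sma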
    After the Bollinger Bands are done I no longer need the raw SMA; what I want is the price/SMA ratio, so I simply run through and, for every day, replace the SMA with that day's price divided by that day's SMA, and now the array holds my price/SMA ratio indicator.

    The last indicator, relative strength, is straightforward iteratively: loop over all the days, all the symbols, and the look-back period. For each day in the look-back window I compare it with the day before to see whether the price went up or down between those two days; if it went up I accumulate the gain into an "up" variable, and if it went down I accumulate the loss separately into a "down" variable, because I want the average of how much it rose on up days and how much it fell on down days. Once I have both, I take the ratio of up to down and normalize it with the RSI formula. This one actually turns out to be really hard to vectorize.

    Finally, here is the chunk of code that applies the indicators to actually make trading decisions. Again I loop over all the days and all the symbols, but now I use the indicators I have calculated. The first condition covers all the times I want to go long: for each day and stock, if the price/SMA ratio is below 0.95 (I am trading well below my moving average), the Bollinger Band percentage is below 0 (I am outside the bottom band), my RSI is below 30 (the oversold range), and the index's RSI is not oversold (so I have the divergence I wanted), then, if I am not already holding the stock long, I output an order to buy it on this day. The next line does the opposite for entering short positions. The two closing conditions compare two consecutive days: if yesterday I was on one side of the SMA and today I am on the other, the price crossed over, so if I am holding the stock I issue the appropriate order to zero out my holdings.

    That works correctly, but it took five and a half minutes to run on the quad-core i7 in my fairly new MacBook Pro. Five and a half minutes does not sound bad for stock trading, but consider: I am looking at seven years of data (people often want much more), only seven stocks (you might want to consider the entire S&P 500, maybe the entire Russell 3000), only three indicators (people often use baskets of 10, 20, or 50), and only a 14-day look-back (you might want 28, or a 60-day look-back). Given all these order-n-squared triple loops, if I start inflating those numbers I may not live to see my answers. So the goal is to get that five and a half minutes down on this toy problem so there is room to scale up to a really serious one. (All of this code, in every version from fully iterative through each change, is in the sticky posts at the top of Piazza.)
    We will go through multiple steps of vectorization, in the order they actually occurred to me, one step at a time; I am not someone to whom vectorizing comes naturally. The way I always approach it, and the way I recommend you do it: first get the code working iteratively, no matter how long it takes to run, and keep a copy of its output. Then, one step at a time, replace each loop with vectorized code and diff the output against the iterative reference to make sure it never changes. If you do that at every step, you know immediately when something breaks and can fix it before you build up a huge snowball of incorrect results.

    The first obvious step was to get rid of the innermost of the three loops for the two simple indicators, the SMA and the Bollinger Bands, and that turns out to be easy. Instead of looping through the 14-day look-back period and accumulating the price into the SMA variable one day at a time, I can define a range covering that 14-day window for that symbol and sum it in a single step. The Bollinger Bands are the same: instead of looping through the 14 days over and over, I take the difference between the price and the SMA, square it, and accumulate the sum in one step. This cuts more work than you might appreciate: in the iterative SMA each day's price participates in 14 separate calculations (on day 15 I re-add almost the same prices I just added on day 14), so 13 of every 14 additions are redundant. Just those two simple changes took the run from five and a half minutes down to about 1:50, roughly 66% faster, for changing about two lines of code. Even if you cannot figure out how to fully vectorize everything, pick off the low-hanging fruit and you will probably see dramatic improvement from your first couple of iterations.

    The next obvious step: any time I am zeroing out an array, doing it element by element is foolish; you all know how to do that in one vectorized statement. Then I can eliminate the second loop in the SMA: why loop over the symbols when I can define a range over the symbols and do them all at once? Now that more than one symbol is involved I have to make sure I sum across the correct axis, so that I am summing over days rather than over symbols, then divide by the look-back to get the SMA in one shot. That one simple change was about 35% faster than the previous step, so we are making rapid progress.

    Now I can get rid of the last SMA loop, the outer loop over days, and do the whole thing as a single vector operation. This strategy comes up again later. You might think: with a 14-day look-back, how do I define a range over all the days and somehow also iterate over 14-day subsets within it? You cannot do that with a simple range slice, but if you think a little indirectly (and you will do this a lot) you can find a formulation that allows a simple range. What occurred to me is the cumulative sum: a 14-day sliding sum over a series is just the cumulative sum today minus the cumulative sum 14 days ago, because everything except those 14 days cancels. So I cumsum the whole array once, subtract the version from 14 days before, and divide by the look-back, and I have manually built a rolling window over the entire date range in one step (see the sketch below). There are, of course, built-in pandas rolling functions for this, shown below the slide, but using them felt like cheating given that I was trying to show how to vectorize things.

    The same works for the Bollinger Bands loop, which just iterated through the days, the stocks, and the look-back period, hand-calculating the standard deviation from the SMA. I can use the pandas rolling standard deviation function to get my rolling standard deviations in one step, and I already have my rolling SMA, so as vector math the whole top band for my entire date range is the SMA plus two times the rolling standard deviation, and the bottom band is the SMA minus two times it. With a vector top band and a vector bottom band, the whole Bollinger Band percentage series is one step: subtract the bottom band from the prices and divide by the distance from bottom to top. Vectorizing the price/SMA ratio is trivial.

    The harder part is the RSI; it was really hard to vectorize and I kind of hated myself for picking it. The first thing the innermost loop does is compute the delta between each day and the previous day to see whether it was an up day or a down day. That is clearly the same as calculating daily returns, so I can precompute the daily returns once, before the first loop (there are several ways to do it; you know at least three, but essentially you subtract each day from the day before and store it), instead of recalculating those deltas every single time through the loop. Just vectorizing that sped it up by another 46%; we are now down in the 30-second range.
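
    A minimal sketch of the cumulative-sum trick for a rolling mean; the function name is an assumption:

        import pandas as pd

        def rolling_mean_cumsum(prices, lookback=14):
            # Rolling sum over `lookback` days = cumsum(today) - cumsum(`lookback` days ago).
            csum = prices.cumsum()
            rsum = csum - csum.shift(lookback)
            rsum.iloc[lookback - 1] = csum.iloc[lookback - 1]   # first full window has nothing to subtract
            sma = rsum / lookback
            sma.iloc[: lookback - 1] = float("nan")             # windows before the first full one are undefined
            return sma
            # Equivalent built-in: prices.rolling(lookback).mean()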
    Now I can tackle the inner RSI loop, the one handling the look-back window. Instead of dealing with the look-back period day by day, I would like to get, in one step, the sum of all the up days and the sum of all the down days over the whole 14-day window, separately. I cannot do that with a plain slice, because I do not know in advance which days are which. This is where the real power of numpy comes in: you can approach an array like a database and give it a "where" clause, any boolean array of the same size as the one you are slicing. So I can get all my up gains at once by taking the slice of the look-back period but keeping only the elements that are greater than or equal to zero, and summing those; the boolean mask keeps the indexes and gives zeros on the days where the condition was not true, so it does not affect the sum. One statement gives the total gain over all up days in the period, another gives the total loss over all the down days, and I use those in the RSI calculation. That saved another roughly 60% of the time.

    Then I can eliminate the other inner loop: why look at the symbols one at a time when everything is already precomputed? Instead of looping through the symbols and indexing into the up-gain array, I take the whole column vector across all symbols and calculate RSI in one step for all of them; you have done that plenty of times yourselves. One important note: it is easy to trip yourself up when vectorizing. I lost one of my checks; I used to have a guard for the special case where I would divide by zero, which would be really bad. I cannot branch like that when everything happens in one vectorized step, but numpy anticipated this: it does not explode when you divide by zero; by default it assigns the special numpy infinity value to the result, since that is the limit being approached. So I can do the division first and then go back afterwards and fix up everything that came out as infinity, setting it to the value it is supposed to have when the down days sum to zero.

    The very last RSI loop is hard to eliminate, and I actually had to add code to do it, because I need, ahead of time, one vector of all the up gains and one of all the down losses. So at the very beginning of the code I take the daily returns, mask out the days where the return is greater than or equal to zero (the up days), fill the NaN values with zero to make sure they do not poison the sums, and take the cumulative sum; and the same for the down days. On a given day, the value is now the sum of that day and every up day that came before it. Then I use the same trick as before: to get the sum of the up gains within a window, subtract the cumulative sum at the start of the window from the cumulative sum at the end. That cumsum trick will come in handy over and over again: one simple subtraction gives you a sliding window without having to loop. With those offsets and subtractions done, each day holds the up gains of the previous 14 days and the down losses of the previous 14 days (a sketch of this follows below), and then I can
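
    A sketch of the masked-cumsum idea described above; names and the exact NaN handling are assumptions:

        import pandas as pd

        def up_down_sums(prices, lookback=14):
            # Split daily returns into gains (up days) and losses (down days), then use
            # cumulative sums so a shifted subtraction yields every 14-day window at once.
            rets = prices.diff()
            up = rets.where(rets >= 0, 0.0).fillna(0.0).cumsum()
            down = (-rets.where(rets < 0, 0.0)).fillna(0.0).cumsum()
            up_gain = up - up.shift(lookback)        # sum of gains over each trailing window
            down_loss = down - down.shift(lookback)  # sum of losses over each trailing window
            return up_gain, down_loss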
    finish fully vectorizing the RSI. Now that everything is calculated, the relative strength for all days and all symbols comes in one shot: take the whole up-gain array divided by the look-back, divide it by the whole down-loss array divided by the look-back, and plug the result into the RSI formula. That step is roughly another 40% faster.

    The final piece is the trading strategy, which was also awkward because I made complicated decisions from four conditions. The first problem is that I had been appending orders to a Python list that starts out empty, and you cannot vectorize building that up; so I switch to an orders array of the same size and shape as the price array and everything else. To fully vectorize the conditions I want everything the same shape, so I take the index's RSI, which I use in my decisions, and broadcast it out into a full-sized array matching the prices. The last problem is that I do not have an indicator for crossovers, and my closing rule fires when the price crosses the SMA. Building that indicator turns out to be really simple: create a zeroed array of the same shape as everything else, set it to 1 only on days where the price/SMA ratio is above one, so the array is 0 when I am below the SMA and 1 when I am above it, and then take the diff of that array, a built-in vectorized operation that subtracts each day from the one before. On days where I was below and stayed below it is 0 minus 0; above and stayed above, 1 minus 1; on a day I crossed up I get +1, and on a day I crossed down I get -1. In just a couple of lines of code I have a crossover indicator that is nonzero only on crossing days and tells me the direction.

    Then I can do my entire trading strategy at once with a compound boolean index: a single ampersand and a single pipe act as "and" and "or" inside a boolean index. This only works because all the arrays are the same size, so please remember that. Now I can say: place orders where the price/SMA ratio is this, and BB% is this, and my RSI is this, and the index's RSI is that, and I get back a mask that is true only where all these things hold. I set my target share position to +100 or -100 depending on whether I want to be long or short, and to 0 on days where I want to be out of the market. I could not think of a way to calculate orders directly from that; what I really have is my desired holdings, so I forward-fill the gaps, since days where I had no opinion should just keep the previous day's position, and then take one more diff: the difference between today's target holdings and yesterday's is exactly an order. The same holdings as yesterday means no change and no order; more than yesterday means buy something; fewer means sell something. Finally, there is no way to vectorize writing a custom CSV file to disk, which is not especially clean, but I can delete all the days that have no orders, which reduces the rows by about 90% (see the sketch below),
    and indeed I get one more 90 percent savings there. So my final output, which is exactly the same as what I started with, is now about a thousand times faster than where I began: from 5 minutes 24 seconds down to 0.35 seconds, producing identical orders and identical output in all cases. The full and complete code for every single step, from iterative through each change, is posted on Piazza, so you can run it yourself, see what I did, and see how it works; there is also a Piazza thread where I will be happy to answer questions. Thanks, Dave, that was great. We can take one question in case anybody has one, we will be at the back of the room if you want to ask something later, and any online questions I will answer on Piazza. Thanks, everybody.
