Survival Analysis of Mortgage Data
- PowerPoint Presentation - I present the methodology and findings here.
- Code Used
- 1_Sample.sas - Assembles the complete set of loan IDs available in the dataset, then draws a 1% subsample on which I build the analysis. The complete dataset is approximately 15 million loans, so the sample contains about 150,000 loans.
- 2_BuildOrigData.sas - Merges the sampled loan IDs against the origination dataset to keep origination information only for the loans we are interested in. Since I am trying to build a model that generalizes beyond this time period, I don't subset based on time (i.e., I'm not making cohorts; a cohort study would, of course, find a cohort effect). Had I more time, I would have made more effort to eliminate any cohort effect by including macroeconomic indicators.
- 3_BuildPerfData.sas - Same approach as for the origination data: each performance data file is merged against the sampled loan IDs to keep only the loans we're interested in. Both origination and performance data are stored in datasets identified by the year and quarter in which the loan was originated.
- 4_ExpandData.sas - Merges the origination and performance datasets. This attaches the origination data to every row (observation) of the performance set, which hugely expands the size of the dataset (hence the name ExpandData).
- 5_rme.sas - Adds the macroeconomic effects, as well as transformations of them (for example, the change in unemployment since the start of the loan).
- 6_CollapseData.sas - Collapses the data into 6-month, 4-month, or 3-month groups in order to limit the size of the dataset for import into R and to make the computation a little less unwieldy.
- 7_Analysis.R - Finally, I build the Cox PH Models in R.
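
The data-preparation steps above (sample, filter, expand, collapse) can be illustrated with a minimal Python sketch. This is not the SAS code from the repository. The function names, the field names (`loan_id`, `month`, `event`), and the collapse rule (keep the last row of each window and flag the window if any month in it had an event) are all assumptions made for illustration only.

```python
import random


def sample_loan_ids(loan_ids, fraction, seed=0):
    """Step 1 (sketch): draw a simple random subsample of loan IDs."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = round(len(loan_ids) * fraction)
    return set(rng.sample(sorted(loan_ids), k))


def filter_to_sample(rows, sampled_ids):
    """Steps 2-3 (sketch): keep only rows belonging to sampled loans."""
    return [r for r in rows if r["loan_id"] in sampled_ids]


def expand(orig_rows, perf_rows):
    """Step 4 (sketch): attach origination fields to every performance row."""
    orig_by_id = {r["loan_id"]: r for r in orig_rows}
    return [
        {**orig_by_id[p["loan_id"]], **p}
        for p in perf_rows
        if p["loan_id"] in orig_by_id
    ]


def collapse(rows, width):
    """Step 6 (sketch): one row per loan per `width`-month window.

    Assumption: `month` is loan age starting at 1. The last row in each
    window is kept, and `event` is 1 if any month in the window had an event.
    """
    windows = {}
    for r in sorted(rows, key=lambda r: (r["loan_id"], r["month"])):
        key = (r["loan_id"], (r["month"] - 1) // width)
        prev = windows.get(key)
        merged = dict(r)  # later months overwrite earlier ones
        merged["event"] = max(r["event"], prev["event"] if prev else 0)
        windows[key] = merged
    return [windows[k] for k in sorted(windows)]
```

For example, six monthly rows for one loan collapsed with `width=3` yield two rows, one per quarter, each carrying the origination fields attached by `expand`. The collapsed rows could then be exported for a Cox PH fit in R.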