Tuesday, August 30, 2011

The Methodology Behind Words of Loss and Words of Win

Last month I posted two lists of words from Prosper listings that successfully became loans: words whose loans were paid back more often than average, and words whose loans were paid back less often than average.

A comment from havastat on the prospers.org forum made me realize that I had neglected to talk about the methodology used to find these words. It is as follows:


For every loan created successfully on Prosper before the end of 2007 I created a list of all of the words used in the Title and Description of the loan. For every instance of a word in a loan that was Paid I added 1 to a running total of PaidInstances. For every instance of a word in a loan that had any other status I added 1 to a running total of UnpaidInstances.

I then calculated the percentage for the word with the formula:
PaidInstances / (PaidInstances + UnpaidInstances)
(Which is to say: PaidInstances / TotalWordUsage)

I reduced the list to words which had been used at least 1000 times in the loan set, sorted it from words that were most often in Paid loans to words that were least often in Paid loans, and compared that list to the overall likelihood of any loan being paid back.
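
A minimal sketch of the counting described above might look like this in Python. The `(title, description, status)` tuple layout, the function name `paid_rates`, and the regular expression used to split text into words are all assumptions made for illustration; they are not taken from the original analysis or from Prosper's data export.

```python
import re
from collections import Counter


def paid_rates(loans, min_uses=1000):
    """Return {word: paid fraction} for words used at least min_uses times.

    `loans` is an iterable of (title, description, status) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    paid = Counter()
    unpaid = Counter()
    for title, description, status in loans:
        # Every instance counts, so a repeated word is counted repeatedly.
        words = re.findall(r"[a-z']+", f"{title} {description}".lower())
        target = paid if status == "Paid" else unpaid
        target.update(words)

    rates = {}
    for word in paid.keys() | unpaid.keys():
        total = paid[word] + unpaid[word]
        if total >= min_uses:
            # PaidInstances / (PaidInstances + UnpaidInstances)
            rates[word] = paid[word] / total
    # Sort from most often in Paid loans to least often in Paid loans.
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))
```

The `min_uses` cutoff corresponds to the 1000-use threshold; it is a parameter so the sketch can be tried on a small sample.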

I found the word 'lender' at the top, with loans containing the word having been Paid 68.96% of the time. I found the word 'payday' at the bottom, with loans containing the word having been Paid only 38.89% of the time. (This compared to an average Paid percentage, across all loans, of about 61% for this time period.)

Now I think there is an argument to be made that it would have been better to count each word a maximum of once per listing. What I did measures use of the word itself more than it measures use of the word in a listing ("help, help, help, help!" in one listing counts 4 times instead of just once). Still, I think the best choice really depends on what you're trying to do with the information.
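
To make the difference between the two counting choices concrete, here is a small sketch. The function name `count_words` and the tokenizing regular expression are assumptions for illustration only.

```python
import re
from collections import Counter


def count_words(text, once_per_listing=False):
    """Count the words in one listing's text.

    With once_per_listing=True each distinct word counts a maximum of once.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if once_per_listing:
        # Collapse repeats so each word is counted once for this listing.
        words = set(words)
    return Counter(words)
```

Calling `count_words("help, help, help, help!")` counts 'help' 4 times, while passing `once_per_listing=True` counts it once.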

Sunday, August 28, 2011

An initial investment in Lending Club

I've been a Prosper lender for years, but I've been thinking about branching out to Lending Club as well. Since I've been looking at words which have performed poorly on Prosper, I wanted to extend that search to Lending Club.

What follows is Lending Club data taken from about two months ago. For D, E, F and G loans we can see that the words used seem to correlate with the loans similarly on both sites:

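| Description | Total Loans | Percent Good | Percent Bad | Fully Paid | Current | Charged Off | Default |
|---|---|---|---|---|---|---|---|
| D,E,F,G All | 8840 | 81.1% | 9.4% | 15.5% | 65.6% | 7.5% | .1% |
| D,E,F,G With 'need' | 1455 | 77.9% | 16.2% | 21.6% | 56.3% | 13.8% | .1% |
| D,E,F,G With 'help' | 1422 | 80.3% | 11.3% | 18.9% | 61.4% | 9.3% | .4% |
| D,E,F,G With 'chance' | 84 | 83.3% | 7.1% | 22.6% | 60.7% | 6% | 0% |
| D,E,F,G With 'behind' | 65 | 78.5% | 16.9% | 18.5% | 60% | 15.4% | 0% |
| D,E,F,G With 'payday' | 12 | 41.7% | 58.3% | 25% | 16.7% | 58.3% | 0% |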

'Need', 'help', 'behind', and 'payday' all have a lower percentage of Good loans (defined as 'Fully Paid' and 'Current') and a higher percentage of Bad loans (defined as 'Charged Off', 'Default', and 'Late (31-120 days)'). The only exception in the words I searched here is the word 'chance', which actually performed better than the average loan. (This could be true for Prosper, as well, and is worth further investigation.)
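
The Good and Bad groupings defined above can be sketched as follows. The status strings are the ones named in this post; the function name `good_bad_percent` and the list-of-statuses input are assumptions for illustration. Statuses that fall in neither group are left out of both percentages, which is why Percent Good and Percent Bad need not add up to 100%.

```python
GOOD = {"Fully Paid", "Current"}
BAD = {"Charged Off", "Default", "Late (31-120 days)"}


def good_bad_percent(statuses):
    """Return (percent good, percent bad) for a list of loan statuses."""
    total = len(statuses)
    if total == 0:
        return (0.0, 0.0)
    good = sum(status in GOOD for status in statuses)
    bad = sum(status in BAD for status in statuses)
    # Any other status counts toward the total but toward neither group.
    return (round(100 * good / total, 1), round(100 * bad / total, 1))
```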

When taken as a group, the four bad words yield the following results:

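| Description | Total Loans | Percent Good | Percent Bad | Fully Paid | Current | Charged Off | Default |
|---|---|---|---|---|---|---|---|
| D,E,F,G All | 8840 | 81.1% | 9.4% | 15.5% | 65.6% | 7.5% | .1% |
| D,E,F,G without 'help', 'behind', 'need', 'payday' | 6436 | 81.8% | 7.8% | 13.8% | 68% | 6% | .1% |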

So, as I start to invest, I expect that I'll only be investing in loans where the title and description do not contain any of these words.
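
A screen like this could be sketched as below. The names `AVOID` and `passes_word_filter` are hypothetical, and matching on whole words only is an assumption: under it "needs" does not count as "need". A substring match would be stricter and would reject more loans.

```python
import re

# The four words that performed poorly in the tables above.
AVOID = {"help", "behind", "need", "payday"}


def passes_word_filter(title, description, avoid=AVOID):
    """Return True if neither the title nor the description uses an avoided word."""
    words = set(re.findall(r"[a-z']+", f"{title} {description}".lower()))
    return words.isdisjoint(avoid)
```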

Saturday, August 27, 2011

AI Series: A Performance Measure

As mentioned previously, I'm planning on following along with Stanford's Introduction to Artificial Intelligence class this coming semester. I've just received the book, Artificial Intelligence: A Modern Approach, and am starting to work my way through it.

As I go through the book, I'll be thinking about what I would do if I were building my own program to analyze loans and writing up my analysis here.

Chapter 2 discusses the idea of rationality and how we determine whether an agent (program, in this case) has done well. Specifically we would create a performance measure. In the case of Peer to Peer lending, I think the following performance measure would be in order:

  1. The total return from investing in a loan
  2. Subtracting a small percentage of the investment for the time the money is invested but the loan has not yet started
  3. Subtracting some amount for a loan that goes over 30 days late

The first rule speaks for itself. Since our goal is to maximize return we want our agent to choose to invest in loans which will give the most return on investment. Let's say that we are investing $100 in every loan and the total return from the loan is $110. We know that this loan has done well for us but it hasn't done as well as a loan that returns $115. And it has done much better than a loan that returns $40.

The second rule is a rule to encourage the agent to pick loans which are closest to closing. Given two loans that are exactly equal in every other way, we'd rather invest in the one that is two days from being funded than the one that is 10 days from being funded.

The third rule is more of a personal preference. Even if the agent were able to pick out borrowers who would pay over 30 days late and still end up paying off all of the loan value (and perhaps even more, with penalties) I don't want these loans. They would drive me crazy every time I would look at my portfolio. So rule three biases these loans downwards to make the agent value these loans less.
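
The three rules could be combined into one number along these lines. This is only a sketch: the function name, the parameter names, and the default values for `daily_idle_rate` and `late_penalty` are placeholders I made up for illustration, not tuned figures.

```python
def performance(total_return, investment, days_until_funded,
                went_over_30_days_late,
                daily_idle_rate=0.0001, late_penalty=5.0):
    """Score one loan; higher is better.

    daily_idle_rate and late_penalty are hypothetical placeholder values.
    """
    # Rule 1: start from the total return of the loan.
    score = total_return
    # Rule 2: subtract a small percentage of the investment for each day
    # the money sits invested before the loan starts.
    score -= investment * daily_idle_rate * days_until_funded
    # Rule 3: subtract a fixed amount if the loan ever went over 30 days late.
    if went_over_30_days_late:
        score -= late_penalty
    return score
```

With these rules, two otherwise identical loans score differently if one is 2 days from funding and the other is 10 days away, and a loan that went over 30 days late always scores below the same loan paid on time.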

Sunday, August 21, 2011

A look at "family" words

It's nice to see all the traffic we've been getting for the analysis of words from Prosper loans. A couple of sites latched onto the fact that in pre-2008 loans we saw certain family words appearing in loans that performed poorly.

Since that topic appeared to be of interest to people I wanted to explore it in more detail:

| Description | Total Loans | Paid | Recovered | Never Recovered |
|---|---|---|---|---|
| Pre 2008, D, All | 3117 | 59.3% | 60.2% | 39.8% |
| Pre 2008, D, Contain a "family" word | 924 | 52.2% | 53% | 47% |
| 2008-2009, D, All | 2379 | 46.7% | 47.4% | 31.6% |
| 2008-2009, D, Contain a "family" word | 487 | 43.3% | 43.9% | 35.5% |
| 2010, D, All | 1314 | 13.9% | 13.9% | 4.9% |
| 2010, D, Contain a "family" word | 155 | 14.8% | 14.8% | 2.6% |
| 2011, D, All | 1193 | 2.8% | 2.8% | 0% |
| 2011, D, Contain a "family" word | 103 | 1.9% | 1.9% | 0% |

It looks like the case is true through the 2009 loans so far. The 2010 and 2011 loans are too young to draw conclusions, but given the initial numbers I'm not certain that I would be comfortable saying that loans with a "family" word are a bad investment.
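
For anyone who wants to try the same grouping, here is one way the "family" word check (husband, child, children, mother, daughter, son) could be written. The names `FAMILY_WORDS` and `has_family_word` are hypothetical, and matching on word boundaries is an assumption. It matters for a short word like "son", which a plain substring search would also find inside "person" or "personal".

```python
import re

FAMILY_WORDS = ("husband", "child", "children", "mother", "daughter", "son")

# \b keeps "son" from matching inside words such as "person" or "personal".
_FAMILY_PATTERN = re.compile(
    r"\b(?:" + "|".join(FAMILY_WORDS) + r")\b", re.IGNORECASE
)


def has_family_word(title, description):
    """Return True if the title or description contains a "family" word."""
    return bool(_FAMILY_PATTERN.search(f"{title} {description}"))
```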

"Family" words are defined as: husband, child, children, mother, daughter, son in either the title or description of the loan.

Saturday, August 20, 2011

A look at Prosper's 2008 Loans

Previously I published a list of words which did poorly in pre-2008 loans and words that did well in pre-2008 loans. I wanted to see if those lists could predict what would happen in 2008 loans on Prosper.

I created a new value, WordValue, which will become negative if a loan has more words which had previously failed in it and become positive if a loan has more words which had previously succeeded in it. (Additional description of the value is below.)

Suffice it to say, I expected that lower WordValues would repay less often than higher WordValues. It turns out that this was not the case for 2008 loans:

| Description | Paid | Recovered | Never Recovered |
|---|---|---|---|
| 2008, D, All | 49.8% | 50.7% | 34.7% |
| 2008, D, WordValue < -1 | 50.9% | 51.7% | 34.2% |
| 2008, D, WordValue >= -1 | 47.5% | 48.3% | 35.8% |
| 2008, E, WordValue < -1 | 44.7% | 45.6% | 39.1% |
| 2008, E, WordValue >= -1 | 42.2% | 42.7% | 46.4% |

What I found is that lower WordValues actually repaid at a higher rate than higher WordValues--exactly the opposite of what I was expecting. This means that some of the low performing words in loans before 2008 performed better than average in 2008 loans.

Since my original conjecture is that the word "need" performs less well than average I tested that on this same set of loans and found the following:

| Description | Paid | Recovered | Never Recovered |
|---|---|---|---|
| 2008, D, All | 49.8% | 50.7% | 34.7% |
| 2008, D, Title or Description contain "need" | 49.7% | 50.7% | 36.2% |
| 2008, D, Title or Description do not contain "need" | 49.9% | 50.7% | 33.8% |
| 2008, E, Title or Description contain "need" | 43.5% | 44.6% | 39.3% |
| 2008, E, Title or Description do not contain "need" | 44.6% | 45.4% | 41.6% |
| 2009, D, All | 27.9% | 27.9% | 13.4% |
| 2009, D, need in title or desc | 25.5% | 25.5% | 14.7% |

We get mixed messages here, too. In 2008, D loans with "need" were about 2.5% more likely to never recover (meaning they were confirmed to Default or Charge Off) but roughly as likely to have ended with a "Paid" status. 2008 E loans with "need" are, to date, less likely to have finished with a "Paid" status but 5.5% less likely to never recover. 2009 D loans with "need" are less likely to have ended as Paid and more likely to never recover.

Now obviously not all 2008 and 2009 3-year loans have reached the end of their term. We'll be able to draw better conclusions in the coming months, but it's entirely possible that there is no correlation between "need" and loans which aren't repaid.

In future posts I'll whittle down my list of words that fail and see if I can find a set of words that consistently has results which are worse than the average.



About the WordValue number:
I created the WordValue by taking the difference between the Paid percentage of loans containing each word and the average repayment rate for loans before 2008. I only used words whose Paid percentage was more than .5% above or below that average.

The WordValue number for a loan is the sum of those differences for the words that appear in it, with each word counted only once.
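
As a sketch, the calculation might look like the following. The function signature, the tokenizing regular expression, and the reading of ".5%" as half a percentage point are assumptions. The two word rates in the usage example ('lender' at 68.96% and 'payday' at 38.89%, against an average of about 61%) are the figures from the methodology post.

```python
import re


def word_value(title, description, word_rates, average_rate, threshold=0.5):
    """Sum each scored word's difference from the average, once per word.

    word_rates maps a word to its pre-2008 Paid percentage (0 to 100).
    threshold is in percentage points (an assumed reading of ".5%").
    """
    # A set means a word repeated in the loan is only counted once.
    words = set(re.findall(r"[a-z']+", f"{title} {description}".lower()))
    value = 0.0
    for word in words:
        if word in word_rates:
            difference = word_rates[word] - average_rate
            # Skip words too close to the average to be informative.
            if abs(difference) > threshold:
                value += difference
    return value


rates = {"lender": 68.96, "payday": 38.89}
print(word_value("Paying off a payday loan", "payday lender", rates, 61.0))
```

A loan made up mostly of words that failed before 2008 ends up with a negative WordValue, and one made up mostly of words that succeeded ends up positive.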

Wednesday, August 17, 2011

Machine Lending

For those of you not already in the know, Stanford is offering a few free AI classes online during the fall semester. Introduction to Artificial Intelligence and Machine Learning seem like they'd both offer interesting advice on creating a program which could pick loans with a better chance of repayment than I would pick by hand.

I'm looking forward to the start of the classes. It isn't as cool as designing a self-driving car but it's a start.

Monday, August 15, 2011

What's Coming

With the belief that the way people write their requests for loans reflects their attitude towards lending and, therefore, their probability of repayment, I set out on a journey to find words which are associated with failed loans.

So I looked at the word "need" in a few different ways. I saw that there is variation on Prosper across the years and across the credit grades, and I've seen that something similar is happening on Lending Club.

I also looked at words that are associated with failed loans and words that are associated with loans that get paid off.

In the next few weeks, I'll start testing the words that have failed before and see if they fail on newer loans with the idea of building a list of words that, if used by a borrower, indicate that a loan is more likely to fail. I'll show how my first attempt failed -- words associated with failure in Prosper loans before 2008 weren't all associated with failure in 2008 loans -- and I'll try different sets of words to see if I can find some consistent pattern.

Sunday, August 7, 2011

Needs Series: Comparing to Lending Club

Of course, all of this needs data does us absolutely no good if it isn't generalizable beyond the loans that it came from. Since the initial data was taken from Prosper, it seemed that Lending Club loans would make a good comparison and help tell us whether we're finding something relevant or just random noise.

Data Set: All Lending Club Loans made in 2007 and 2008

| 2007-2008 Loans | Total Loans | Percent Charged Off or Defaulted |
|---|---|---|
| All Loans | 2996 | 21.2% |
| "Need" in title or description | 814 | 26.8% |
| "Need" not in title nor description | 2182 | 19.2% |
| "Payday" in title or description | 7 | 57.1% |
| "Payday" not in title nor description | 2989 | 21.1% |

So there you have it, the word "Need" appears to correlate more often with a failure to repay a loan in Lending Club as well. (I included "Payday" data just for fun -- with only 7 loans the data isn't likely to be relevant, but it does fall in line with what we'd expect.)
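
One way to see why 7 loans tell us so little is to put a rough uncertainty range around each percentage. The sketch below uses the Wilson score interval, which is my choice of technique for illustration and not something used in the original analysis. The failure counts in the usage lines (4 of 7 for "Payday", 218 of 814 for "Need") are back-calculated from the rounded percentages in the table, so treat them as approximate.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Approximate 95% Wilson score interval for a failure rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    ) / denom
    return center - half, center + half


# Counts inferred from the rounded table percentages (an assumption).
print(wilson_interval(4, 7))      # "Payday": 7 loans
print(wilson_interval(218, 814))  # "Need": 814 loans
```

The range for the 7 "Payday" loans comes out far wider than the range for the 814 "Need" loans, which is the sense in which the "Payday" figure isn't likely to be relevant.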

Since so many of the 2009 loans won't be paid off until 2012, I included the >30 days late category with the already Charged Off and Defaulted loans and found the following stats:

| 2009 Loans | Total Loans | Percent Charged Off, Defaulted or >30 days late |
|---|---|---|
| All Loans | 5281 | 10.7% |
| "Need" in title or description | 1263 | 13.6% |
| "Need" not in title nor description | 4018 | 9.8% |
| "Payday" in title or description | 6 | 83.3% |

The trend continues...


All Articles in the Needs Series
An Introduction
Initial Findings
Correlation Matrix
Comparing to Lending Club