Tuesday, August 30, 2011

The Methodology Behind Words of Loss and Words of Win

Last month I posted a list of words from Prosper listings that successfully became loans: words whose loans were more successful than average, and words whose loans were less successful than average.

A comment from havastat on the prospers.org forum made me realize that I had neglected to talk about the methodology used to find these words. It is as follows:


For every loan created successfully on Prosper before the end of 2007 I created a list of all of the words used in the Title and Description of the loan. For every instance of a word in a loan that was Paid I added 1 to a running total of PaidInstances. For every instance of a word in a loan that had any other status I added 1 to a running total of UnpaidInstances.

I then calculated the percentage for the word with the formula:
PaidInstances / (PaidInstances + UnpaidInstances)
(Which is to say: PaidInstances / TotalWordUsage)
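
A minimal Python sketch of this counting is shown here. It is an illustration, not the code used for the original analysis: the shape of `loans` (dicts with `title`, `description` and `status` keys), the tokenizing regex, and the function names are all assumptions, and the sample loans are made up.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    return re.findall(r"[a-z']+", text.lower())

def count_word_instances(loans):
    """Return (paid, unpaid) Counters of word instances."""
    paid, unpaid = Counter(), Counter()
    for loan in loans:
        words = tokenize(loan["title"] + " " + loan["description"])
        # Every instance counts, so a word repeated in one loan counts repeatedly.
        target = paid if loan["status"] == "Paid" else unpaid
        target.update(words)
    return paid, unpaid

def paid_fraction(word, paid, unpaid):
    # PaidInstances / (PaidInstances + UnpaidInstances)
    total = paid[word] + unpaid[word]
    return paid[word] / total if total else None

# Made-up sample data, for illustration only.
sample_loans = [
    {"title": "Help with bills", "description": "Please help", "status": "Paid"},
    {"title": "Need help", "description": "payday loan", "status": "Defaulted"},
]
paid, unpaid = count_word_instances(sample_loans)
print(paid_fraction("help", paid, unpaid))
```

In the sample, 'help' appears twice in the Paid loan and once in the other loan, so its fraction is 2 out of 3.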

I reduced the list to words which had been used at least 1000 times in the loan set, sorted it from words that were most often in Paid loans to words that were least often in Paid loans, and compared that list to the overall likelihood of any loan being paid back.
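
The filtering and sorting step might look something like the following sketch. The function name `rank_words`, the dict-based inputs, and the counts are hypothetical; only the 1000-use threshold and the rough 61% baseline come from this post.

```python
def rank_words(paid, unpaid, min_uses=1000):
    """Return (word, paid_fraction, total_uses) rows, most often Paid first."""
    ranked = []
    for word in set(paid) | set(unpaid):
        total = paid.get(word, 0) + unpaid.get(word, 0)
        if total >= min_uses:  # drop rarely used words
            ranked.append((word, paid.get(word, 0) / total, total))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked

# Made-up instance counts, for illustration only.
paid = {"lender": 1400, "payday": 400, "rare": 5}
unpaid = {"lender": 600, "payday": 600, "rare": 1}

baseline = 0.61  # approximate overall Paid rate quoted in this post
for word, fraction, total in rank_words(paid, unpaid):
    # A positive difference means the word did better than the average loan.
    print(word, total, round(fraction - baseline, 4))
```

The word 'rare' is dropped because it falls below the usage threshold, however good its Paid fraction looks.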

I found the word 'lender' at the top, with loans containing the word having been Paid 68.96% of the time. I found the word 'payday' at the bottom, with loans containing the word having been Paid only 38.89% of the time. (This compared to an average Paid percentage, across all loans, of about 61% for this time period.)

Now, I think there is an argument to be made that it would have been better to count each word a maximum of once per listing. What I did measures use of the word itself more than it measures use of the word in the listing ("help, help, help, help!" in one listing counts 4 times instead of just once). But I think the best choice really depends on what you're trying to do with the information.
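
The difference between the two counting choices can be seen in a short sketch. The `word_counts` helper and its `once_per_listing` flag are hypothetical names, and the tokenizing regex is an assumption.

```python
import re
from collections import Counter

def word_counts(text, once_per_listing=False):
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    if once_per_listing:
        words = set(words)  # each distinct word counts at most once
    return Counter(words)

listing = "help, help, help, help!"
print(word_counts(listing)["help"])                         # 4
print(word_counts(listing, once_per_listing=True)["help"])  # 1
```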

Sunday, August 28, 2011

An initial investment in Lending Club

I've been a Prosper lender for years, but I've been thinking about branching out to Lending Club as well. Since I've been looking at words which have performed poorly on Prosper, I wanted to extend that search to Lending Club.

What follows is Lending Club data taken from about two months ago. For D, E, F and G loans we can see that the words seem to correlate with loan performance similarly on both sites:

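| Description | Total Loans | Percent Good | Percent Bad | Fully Paid | Current | Charged Off | Default |
|---|---|---|---|---|---|---|---|
| D,E,F,G All | 8840 | 81.1% | 9.4% | 15.5% | 65.6% | 7.5% | 0.1% |
| D,E,F,G With 'need' | 1455 | 77.9% | 16.2% | 21.6% | 56.3% | 13.8% | 0.1% |
| D,E,F,G With 'help' | 1422 | 80.3% | 11.3% | 18.9% | 61.4% | 9.3% | 0.4% |
| D,E,F,G With 'chance' | 84 | 83.3% | 7.1% | 22.6% | 60.7% | 6% | 0% |
| D,E,F,G With 'behind' | 65 | 78.5% | 16.9% | 18.5% | 60% | 15.4% | 0% |
| D,E,F,G With 'payday' | 12 | 41.7% | 58.3% | 25% | 16.7% | 58.3% | 0% |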

'Need', 'help', 'behind', and 'payday' all have fewer Good loans (defined as 'Fully Paid' and 'Current') and more Bad loans (defined as 'Charged Off', 'Default', and 'Late (31-120 days)') than the average loan. The only exception among the words I searched is 'chance', which actually performed better than the average loan. (This could be true for Prosper, as well, and is worth further investigation.)
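
A sketch of how the Good and Bad percentages could be computed from the definitions above. The `(text, status)` tuple layout, the whole-word matching, the function names, and the sample loans are assumptions for illustration. Loans in any other status count toward neither group, which is presumably why the two percentages need not sum to 100%.

```python
import re

GOOD = {"Fully Paid", "Current"}
BAD = {"Charged Off", "Default", "Late (31-120 days)"}

def contains_word(text, word):
    # Assumption: whole-word, case-insensitive matching.
    return word.lower() in set(re.findall(r"[a-z']+", text.lower()))

def good_bad_percent(loans, word=None):
    """loans: iterable of (text, status). Returns (count, pct_good, pct_bad)."""
    statuses = [status for text, status in loans
                if word is None or contains_word(text, word)]
    n = len(statuses)
    if n == 0:
        return 0, None, None
    good = sum(status in GOOD for status in statuses)
    bad = sum(status in BAD for status in statuses)
    return n, 100.0 * good / n, 100.0 * bad / n

# Made-up sample loans; "Some Other Status" stands in for any status
# that is neither Good nor Bad.
sample = [
    ("I need a payday loan", "Charged Off"),
    ("Need to consolidate cards", "Current"),
    ("Kitchen remodel", "Fully Paid"),
    ("Car repair", "Some Other Status"),
]
print(good_bad_percent(sample))
print(good_bad_percent(sample, "need"))
```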

Taking the four bad words as a group and excluding every loan that contains any of them yields the following results:

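| Description | Total Loans | Percent Good | Percent Bad | Fully Paid | Current | Charged Off | Default |
|---|---|---|---|---|---|---|---|
| D,E,F,G All | 8840 | 81.1% | 9.4% | 15.5% | 65.6% | 7.5% | 0.1% |
| D,E,F,G without 'help', 'behind', 'need', 'payday' | 6436 | 81.8% | 7.8% | 13.8% | 68% | 6% | 0.1% |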

So, as I start to invest, I expect that I'll only be investing in loans where the title and description do not contain any of these words.
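
As a sketch, a screen like this could be written as follows. The function name and the whole-word matching are assumptions, and the example titles are made up.

```python
import re

AVOID_WORDS = {"help", "behind", "need", "payday"}

def passes_word_filter(title, description, avoid=AVOID_WORDS):
    """True when neither the title nor the description uses an avoided word."""
    text = (title + " " + description).lower()
    # Assumption: whole-word matching, so "needs" is not treated as "need".
    words = set(re.findall(r"[a-z']+", text))
    return words.isdisjoint(avoid)

print(passes_word_filter("Debt consolidation", "Paying off two credit cards"))  # True
print(passes_word_filter("Need a hand", "Fell behind on rent"))                 # False
```

Whether variants such as 'needs' or 'helping' should also be screened out is a design choice this sketch leaves open.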

Saturday, August 27, 2011

AI Series: A Performance Measure

As mentioned previously, I'm planning on following along with Stanford's Introduction to Artificial Intelligence class this coming semester. I've just received the book, Artificial Intelligence: A Modern Approach, and am starting to work my way through it.

As I go through the book, I'll be thinking about what I would do if I were building my own program to analyze loans and writing up my analysis here.

Chapter 2 discusses the idea of rationality and how we determine whether an agent (program, in this case) has done well. Specifically we would create a performance measure. In the case of Peer to Peer lending, I think the following performance measure would be in order:

  1. The total return from investing in a loan
  2. Subtracting a small percentage of the investment for the time the money is invested but the loan has not yet started
  3. Subtracting some amount for a loan that goes over 30 days late
The first rule speaks for itself. Since our goal is to maximize return we want our agent to choose to invest in loans which will give the most return on investment. Let's say that we are investing $100 in every loan and the total return from the loan is $110. We know that this loan has done well for us but it hasn't done as well as a loan that returns $115. And it has done much better than a loan that returns $40.

The second rule is a rule to encourage the agent to pick loans which are closest to closing. Given two loans that are exactly equal in every other way, we'd rather invest in the one that is two days from being funded than the one that is 10 days from being funded.

The third rule is more of a personal preference. Even if the agent were able to pick out borrowers who would pay over 30 days late and still end up paying off all of the loan value (and perhaps even more, with penalties), I don't want these loans. They would drive me crazy every time I looked at my portfolio. So rule three biases these loans downwards to make the agent value them less.
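
Put together, the three rules could be sketched as a single scoring function. The parameter names and the default values for `idle_rate_per_day` and `late_penalty` are placeholders chosen for illustration; the rules above do not fix any particular numbers.

```python
def performance(invested, total_returned, days_until_funded,
                went_over_30_days_late,
                idle_rate_per_day=0.0001, late_penalty=5.0):
    """Score one loan; higher is better. Default rates are assumed values."""
    score = total_returned                                     # rule 1
    score -= invested * idle_rate_per_day * days_until_funded  # rule 2
    if went_over_30_days_late:
        score -= late_penalty                                  # rule 3
    return score

# $100 invested in each loan, as in the example above.
print(performance(100, 110, 2, False))
print(performance(100, 110, 10, False))  # same return, slower to fund
print(performance(100, 110, 2, True))    # same return, but went late
```

With these placeholder values, the loan that funds sooner scores slightly higher than the otherwise identical one, and the loan that went late scores lower than both.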

Sunday, August 21, 2011

A look at "family" words

It's nice to see all the traffic we've been getting for the analysis of words from Prosper loans. A couple of sites latched onto the fact that in Pre-2008 loans we saw certain family words appearing in loans that performed poorly.

Since that topic appeared to be of interest to people I wanted to explore it in more detail:

| Description | Total Loans | Paid | Recovered | Never Recovered |
|---|---|---|---|---|
| Pre 2008, D, All | 3117 | 59.3% | 60.2% | 39.8% |
| Pre 2008, D, Contain a "family" word | 924 | 52.2% | 53% | 47% |
| 2008-2009, D, All | 2379 | 46.7% | 47.4% | 31.6% |
| 2008-2009, D, Contain a "family" word | 487 | 43.3% | 43.9% | 35.5% |
| 2010, D, All | 1314 | 13.9% | 13.9% | 4.9% |
| 2010, D, Contain a "family" word | 155 | 14.8% | 14.8% | 2.6% |
| 2011, D, All | 1193 | 2.8% | 2.8% | 0% |
| 2011, D, Contain a "family" word | 103 | 1.9% | 1.9% | 0% |

It looks like the pattern holds through the 2009 loans so far: in both the Pre 2008 and 2008-2009 groups, loans with a "family" word were Paid less often than D loans overall. The 2010 and 2011 loans are too young to draw conclusions, but given the initial numbers I'm not certain that I would be comfortable saying that loans with a "family" word are a bad investment.

"Family" words are defined as any of the following appearing in either the title or description of the loan: husband, child, children, mother, daughter, son.
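
A sketch of such a check follows. The post does not say whether matching was whole-word or substring; this sketch assumes whole-word matching, since a substring search for 'son' would also hit words like 'person' and 'reason'. The function name and the example titles are made up.

```python
import re

FAMILY_WORDS = {"husband", "child", "children", "mother", "daughter", "son"}

def has_family_word(title, description):
    """True when a "family" word appears in the title or description."""
    text = (title + " " + description).lower()
    # Splitting on anything that is not a letter means "son's" yields "son".
    words = set(re.findall(r"[a-z]+", text))
    return not words.isdisjoint(FAMILY_WORDS)

print(has_family_word("Tuition", "Paying for my son's classes"))     # True
print(has_family_word("Personal loan", "No particular reason"))      # False
```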