By Manuel Garrido — Nov 17, 2014

Building a Recommendation Engine for Reddit. Part 3

We left Part 2 with a dataset of Redditors and the Subreddits they comment on. We also chose the similarity index (the Jaccard index) that we are going to use to measure how similar each pair of Subreddits is.

So, let's continue!

Step 3. Calculating Subreddit similarity

Let's refresh what we want the Subreddit similarity table to look like:

sub1      |  sub2   |  similarity
funny     |  aww    |  similarity(funny-aww)
funny     |  Iama   |  similarity(funny-Iama)
...

My first implementation of the similarity calculation was written in Python.
However, this task is very heavy in terms of processing, so I ended up writing another implementation in Go (Go generally runs this kind of workload faster than Python, and its goroutines make it easy to run many comparisons concurrently).

The algorithm goes like this:
0. Sort all the Subreddits alphabetically.
1. For each Subreddit, sub1:
2.     For each other Subreddit alphabetically after sub1, sub2:
3.         Calculate the number of users who wrote comments in sub1 AND sub2, Intersection1-2.
4.         Calculate the number of users who wrote comments in sub1 OR sub2, Union1-2.
5.         Calculate the Jaccard Similarity Index of sub1 and sub2 as similarity1-2 = Intersection1-2 / Union1-2.
6.         Write on the similarity table the row sub1,sub2,similarity1-2.
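
To make the index itself concrete before bringing a database into the picture, here is a minimal in-memory sketch of the Jaccard calculation for one pair of Subreddits. The names userSet, newUserSet and jaccard, and the redditor names, are made up for illustration; the real script counts users with SQL queries instead of Go maps.

```go
package main

import "fmt"

// userSet is a hypothetical in-memory stand-in for the redditors who
// commented in one subreddit.
type userSet map[string]struct{}

// newUserSet builds a set from a list of redditor names.
func newUserSet(users ...string) userSet {
	s := make(userSet, len(users))
	for _, u := range users {
		s[u] = struct{}{}
	}
	return s
}

// jaccard returns |a AND b| / |a OR b|, or 0 when both sets are empty.
func jaccard(a, b userSet) float64 {
	intersection := 0
	for u := range a {
		if _, ok := b[u]; ok {
			intersection++
		}
	}
	// |a OR b| = |a| + |b| - |a AND b|
	union := len(a) + len(b) - intersection
	if union == 0 {
		return 0
	}
	return float64(intersection) / float64(union)
}

func main() {
	funny := newUserSet("ann", "bob", "cat", "dan")
	aww := newUserSet("cat", "dan", "eve")
	// Shared users: cat, dan (2). Distinct users overall: 5.
	fmt.Println(jaccard(funny, aww)) // prints 0.4
}
```

The index is always between 0 (no shared commenters) and 1 (exactly the same commenters), which is why the values in the result tables are all small fractions.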

Here is the code to implement the algorithm:

package main

import (
    "database/sql"
    "sync"

    "utils" // local helper package that provides CheckErr

    "github.com/coopernurse/gorp"
    _ "github.com/lib/pq" // registers the "postgres" driver with database/sql
)

func main() {  
    dbmap := InitDb()
    var subs []string
    n_threads := 200 //This will depend on the computing power of your machine
    _, err := dbmap.Select(&subs, "select distinct(sub) from redditors_comments order by sub asc")
    utils.CheckErr(err, "Error selecting redditors")
    calculateSimilarity(subs, n_threads)
}
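// createSplitSlice splits s into n consecutive chunks of roughly equal size.
// It is a helper that main does not call.
func createSplitSlice(s []string, n int) [][]string {
    lenSlice := len(s) / n
    splitSlice := make([][]string, n)
    i := 0
    for _, e := range s {
        // Move to the next chunk once the current one is full;
        // the last chunk absorbs any remainder.
        if len(splitSlice[i]) >= lenSlice && i < n-1 {
            i = i + 1
        }
        splitSlice[i] = append(splitSlice[i], e)
    }
    return splitSlice
}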

type SimilarityComments struct {  
    Sub1       string  `db:"sub1"`
    Sub2       string  `db:"sub2"`
    Similarity float64 `db:"similarity"`
}

func InitDb() *gorp.DbMap {  
    user := <DB_USER>
    password := <DB_PWD>
    host := <DB_HOST>
    port := <DB_PORT>
    dbname := <DB_NAME>

    dbString := "host=" + host +
        " port=" + port +
        " user=" + user +
        " password=" + password +
        " dbname=" + dbname
    db, err := sql.Open("postgres", dbString)
    utils.CheckErr(err, "sql.Open failed") // check the error before using db
    dbmap := &gorp.DbMap{Db: db, Dialect: gorp.PostgresDialect{}}

    //dbmap.TraceOn("[gorp]", log.New(os.Stdout, "reddit:", log.Lmicroseconds)) //uncomment to log the sql queries (needs the "log" and "os" imports)
    // RedditorSubsComments is not defined in this listing; it is the struct mapped to the redditors_comments table.
    dbmap.AddTableWithName(RedditorSubsComments{}, "redditors_comments")
    dbmap.AddTableWithName(SimilarityComments{}, "similarity_comments")
    return dbmap
}

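// calculateSimilarity splits the sorted subs into n_operators contiguous ranges
// and processes each range in its own goroutine.
func calculateSimilarity(subs []string, n_operators int) {
    lenSlice := len(subs) / n_operators
    var wg sync.WaitGroup
    for op := 0; op < n_operators; op++ {
        start := op * lenSlice
        end := (op + 1) * lenSlice
        if op == n_operators-1 {
            // The last operator also takes the remainder, so no sub is skipped
            // when len(subs) is not a multiple of n_operators.
            end = len(subs)
        }
        wg.Add(1)
        go calculateSliceSimilarity(subs, start, end, &wg)
    }
    wg.Wait()
}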
func calculateSliceSimilarity(subs []string, start int, end int, wg *sync.WaitGroup) {
    /* subs is already sorted
        n processors
        each processor handles lenSlice = len(subs) / n_processors subs
        each processor handles subs[i*lenSlice:(i+1)*lenSlice]
            for each sub:
            calculate similarity of that sub and those after it
    */
    defer wg.Done() // always signal completion, even on an early exit
    dbmap := InitDb()
    defer dbmap.Db.Close()
    lenSubs := len(subs)
    for i := start; i < end; i++ {
        for j := i + 1; j < lenSubs; j++ {
            calculateSubsSimilarity(subs[i], subs[j], dbmap)
        }
    }
}

func calculateSubsSimilarity(sub1 string, sub2 string, dbmap *gorp.DbMap) {
    var count_union []int
    var count_intersection []int
    var union int
    var intersection int
    var similarity float64

    // Union: redditors who commented in sub1 OR sub2.
    // Bind parameters ($1, $2) replace string concatenation in the SQL.
    query := "select count(distinct(redditor)) from redditors_comments where sub = $1 or sub = $2"
    _, err := dbmap.Select(&count_union, query, sub1, sub2)
    utils.CheckErr(err, "Error on subsimilarity union")
    union = count_union[0]

    // Intersection: redditors who commented in sub1 AND sub2.
    query = "select count(distinct(redditor)) from redditors_comments where sub = $1" +
        " and redditor in (select distinct(redditor) from redditors_comments where sub = $2)"
    _, err = dbmap.Select(&count_intersection, query, sub1, sub2)
    utils.CheckErr(err, "Error on subsimilarity intersection")
    intersection = count_intersection[0]

    // Jaccard index = |intersection| / |union|
    similarity = float64(intersection) / float64(union)
    s := &SimilarityComments{sub1, sub2, similarity}
    err = dbmap.Insert(s)
    utils.CheckErr(err, "Error inserting similarity")
}

This script takes a long time to finish: on my two-year-old laptop it ran for 3 full days.

If you are using PostgreSQL like me, you might need to edit your postgresql.conf file and raise max_connections, because every goroutine opens its own database connection.

Once the similarity calculation is done, we will know how similar each of Reddit's top subreddits is to every other one.

A nice property of this implementation is that you can keep running it while your application is in production, so the results get better with every new piece of user information that reaches the database.

Here is the list of the top pairs of subreddits in terms of Similarity:

sub1	sub2	similarity
iOSthemes	jailbreak	0.142020815264527
asktransgender	transgender	0.112781954887218
aviation	flying	0.101759227319765
ukpolitics	unitedkingdom	0.101730688054031
Liberal	progressive	0.0998427672955975
keto	ketorecipes	0.0974065138721351
Pokemongiveaway	pokemontrades	0.0894308943089431
ImaginaryCharacters	ImaginaryMonsters	0.088479809976247
frugalmalefashion	malefashionadvice	0.0846720846720847
GiftofGames	RandomActsOfGaming	0.0812588069516205

Looking at the table of the most similar subreddits, the similarity calculation seems to be meaningful: the two subreddits in every pair cover closely related topics.

Another interesting analysis is to display which subreddits are most similar to a specific one.

For example, here are the most similar subreddits to the subreddit /r/beer:

sub	similarity
beerporn	0.0723035554227667
Homebrewing	0.0376598768356229
cocktails	0.0122950819672131
wine	0.0110444935310824
bourbon	0.0105506099571382
newjersey	0.0103806228373702
drunk	0.0100190839694656
showerbeer	0.00997719498289624
Cooking	0.00884573894282632
Coffee	0.008186852877438

Pretty decent results: the subreddits most similar to /r/beer are alcohol or food related, plus, for some reason, /r/newjersey.
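
One detail of a lookup like this is that the similarity table stores each pair only once, so the subreddit we ask about can appear in either the sub1 or the sub2 column. A minimal sketch of the lookup, assuming the rows are already loaded in memory: similarityRow, neighbor and mostSimilar are hypothetical names, the scores are rounded from the tables above, and which column each name falls in is assumed here (it depends on the sort order used by the database).

```go
package main

import (
	"fmt"
	"sort"
)

// similarityRow mirrors one row of the similarity table.
type similarityRow struct {
	Sub1, Sub2 string
	Similarity float64
}

// neighbor is a subreddit together with its similarity to the queried one.
type neighbor struct {
	Sub        string
	Similarity float64
}

// mostSimilar returns up to n subreddits most similar to sub, best first.
func mostSimilar(rows []similarityRow, sub string, n int) []neighbor {
	var out []neighbor
	for _, r := range rows {
		// The queried sub may sit in either column of the pair.
		switch sub {
		case r.Sub1:
			out = append(out, neighbor{r.Sub2, r.Similarity})
		case r.Sub2:
			out = append(out, neighbor{r.Sub1, r.Similarity})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Similarity > out[j].Similarity })
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	// Illustrative rows; column placement is an assumption.
	rows := []similarityRow{
		{"Homebrewing", "beer", 0.0377},
		{"beer", "beerporn", 0.0723},
		{"beer", "wine", 0.0110},
		{"aviation", "flying", 0.1018},
	}
	for _, nb := range mostSimilar(rows, "beer", 2) {
		fmt.Printf("%s\t%.4f\n", nb.Sub, nb.Similarity)
	}
	// Lists beerporn first, then Homebrewing.
}
```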

Now, the last step to figure out is: how can we recommend Subreddits to a Redditor?

Very easy. Given a Redditor r with a set of subreddits he/she likes [sub1,sub2,...], we can take every other subreddit in the dataset and sum its similarity to each subreddit that Redditor likes.

So in this example, r1 likes [sub1,sub2,...], and we look up the subreddits most similar to each of sub1, sub2, ...:

For sub1:

sub1	sub3	sim1-3
sub1	sub4	sim1-4
sub1	sub5	sim1-5
sub1	...	...

For sub2:

sub2	sub3	sim2-3
sub2	sub4	sim2-4
sub2	sub5	sim2-5
sub2	...	...

We continue building these subtables for each Subreddit that r1 likes.
Note that we skip sim1-2 because r1 already likes both sub1 and sub2.

Now, we calculate the similarity of each Subreddit with the Redditor r1:

sim3-r1 = sim1-3 + sim2-3 ...  
sim4-r1 = sim1-4 + sim2-4 ...  
sim5-r1 = sim1-5 + sim2-5 ...  
...

We now have the similarity of each Subreddit with Redditor r1.

The last step is just returning to the user the top N subreddits sorted by these similarities.
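
The whole recommendation step can be sketched in a few lines. This is a minimal in-memory version under assumptions, not the code of the live application: similarityRow, recommendation and recommend are hypothetical names, and the similarity values are made up so the sums are easy to follow.

```go
package main

import (
	"fmt"
	"sort"
)

// similarityRow mirrors one row of the similarity table.
type similarityRow struct {
	Sub1, Sub2 string
	Similarity float64
}

// recommendation is a candidate subreddit with its summed similarity score.
type recommendation struct {
	Sub   string
	Score float64
}

// recommend sums, for every subreddit the redditor does not like yet, its
// similarity to each liked subreddit, and returns the top n by that sum.
func recommend(rows []similarityRow, liked []string, n int) []recommendation {
	likedSet := make(map[string]bool, len(liked))
	for _, s := range liked {
		likedSet[s] = true
	}
	scores := make(map[string]float64)
	for _, r := range rows {
		switch {
		case likedSet[r.Sub1] && !likedSet[r.Sub2]:
			scores[r.Sub2] += r.Similarity
		case likedSet[r.Sub2] && !likedSet[r.Sub1]:
			scores[r.Sub1] += r.Similarity
		}
		// Pairs with both subs liked (or neither) contribute nothing.
	}
	recs := make([]recommendation, 0, len(scores))
	for sub, score := range scores {
		recs = append(recs, recommendation{sub, score})
	}
	sort.Slice(recs, func(i, j int) bool {
		if recs[i].Score != recs[j].Score {
			return recs[i].Score > recs[j].Score
		}
		return recs[i].Sub < recs[j].Sub // stable order for ties
	})
	if len(recs) > n {
		recs = recs[:n]
	}
	return recs
}

func main() {
	rows := []similarityRow{
		{"sub1", "sub2", 0.5},   // both liked: skipped
		{"sub1", "sub3", 0.25},  // sim1-3
		{"sub2", "sub3", 0.5},   // sim2-3
		{"sub1", "sub4", 0.125}, // sim1-4
		{"sub2", "sub4", 0.125}, // sim2-4
		{"sub3", "sub4", 0.5},   // neither liked: skipped
	}
	for _, rec := range recommend(rows, []string{"sub1", "sub2"}, 2) {
		fmt.Println(rec.Sub, rec.Score)
	}
	// sub3 scores 0.25 + 0.5 = 0.75, sub4 scores 0.125 + 0.125 = 0.25
}
```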

So now we can say we have built a similarity engine. Oh yeah!

In the next (and final) part I will talk about how to use the similarity engine in a live web application.