# Building a Recommendation Engine for Reddit. Part 3

We left Part 2 with a dataset containing a set of Redditors and the Subreddits they comment on. We also defined which similarity index we are going to use to measure the similarity between Subreddits.

So, let's continue!

#### Step 3. Calculating Subreddit similarity

Let's recall what we want the Subreddit similarity table to look like:

``````
sub1   |  sub2  |  similarity
funny  |  aww   |  similarity(funny-aww)
funny  |  Iama  |  similarity(funny-Iama)
...
``````

My first implementation of the similarity calculation algorithm was written in Python.
However, this task is very heavy in terms of processing, so I ended up writing another implementation in Go (Go generally performs better than Python for CPU-bound work like this).

The algorithm goes like this:
0. Sort all the Subreddits alphabetically.
1. For each Subreddit, `sub1`:
2. For each other Subreddit alphabetically after `sub1`, `sub2`:
3. Calculate the number of users who wrote comments in `sub1` AND `sub2`: `Intersection1-2`.
4. Calculate the number of users who wrote comments in `sub1` OR `sub2`: `Union1-2`.
5. Calculate the Jaccard Similarity Index of `sub1` and `sub2` as `similarity1-2 = Intersection1-2 / Union1-2`.
6. Write the row `sub1,sub2,similarity1-2` to the `similarity` table.
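As a quick sanity check on step 5, here is a minimal, self-contained sketch of the Jaccard index computed over two small in-memory sets of made-up commenter names (the real implementation below obtains these counts from SQL instead):

``````go
package main

import "fmt"

// jaccard returns |a ∩ b| / |a ∪ b| for two sets of redditor names.
func jaccard(a, b map[string]bool) float64 {
	intersection := 0
	for user := range a {
		if b[user] {
			intersection++
		}
	}
	union := len(a) + len(b) - intersection
	if union == 0 {
		return 0
	}
	return float64(intersection) / float64(union)
}

func main() {
	// Hypothetical commenters, just to illustrate the arithmetic.
	funny := map[string]bool{"alice": true, "bob": true, "carol": true}
	aww := map[string]bool{"bob": true, "carol": true, "dave": true}
	// 2 shared commenters out of 4 distinct commenters → 0.5
	fmt.Println(jaccard(funny, aww))
}
``````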

Here is the code to implement the algorithm:

``````go
package main

import (
	"database/sql"
	"sync"

	"utils"

	"github.com/coopernurse/gorp"
	_ "github.com/lib/pq"
)

// Similarity maps to a row of the `similarity` table.
type Similarity struct {
	Sub1       string  `db:"sub1"`
	Sub2       string  `db:"sub2"`
	Similarity float64 `db:"similarity"`
}

func main() {
	dbmap := InitDb()
	defer dbmap.Db.Close()
	var subs []string
	n_threads := 200 // This will depend on the computing power of your machine
	_, err := dbmap.Select(&subs, "select distinct(sub) from redditors_comments order by sub asc")
	utils.CheckErr(err, "Error selecting subreddits")
	calculateSimilarity(subs, n_threads)
}

// createSplitSlice splits s into n roughly equal slices. It is not used by
// main (the workers split the data by index instead) but is kept as an
// alternative way to divide the work.
func createSplitSlice(s []string, n int) [][]string {
	lenSlice := len(s) / n
	splitSlice := make([][]string, n)
	i := 0
	for _, e := range s {
		if len(splitSlice[i]) >= lenSlice && i < n-1 {
			i = i + 1
		}
		splitSlice[i] = append(splitSlice[i], e)
	}
	return splitSlice
}

func InitDb() *gorp.DbMap {
	user := "<DB_USER>" // replace the placeholders with your own credentials
	host := "<DB_HOST>"
	port := "<DB_PORT>"
	dbname := "<DB_NAME>"

	dbString := "host=" + host +
		" port=" + port +
		" user=" + user +
		" dbname=" + dbname
	db, err := sql.Open("postgres", dbString)
	utils.CheckErr(err, "sql.Open failed")
	dbmap := &gorp.DbMap{Db: db, Dialect: gorp.PostgresDialect{}}
	dbmap.AddTableWithName(Similarity{}, "similarity")
	//dbmap.TraceOn("[gorp]", log.New(os.Stdout, "reddit:", log.Lmicroseconds)) //uncomment to log the sql queries
	return dbmap
}

func calculateSimilarity(subs []string, n_operators int) {
	lenSlice := len(subs) / n_operators
	var wg sync.WaitGroup
	for op := 0; op < n_operators; op++ {
		start := op * lenSlice
		end := (op + 1) * lenSlice
		if op == n_operators-1 {
			end = len(subs) // the last worker also picks up the remainder
		}
		wg.Add(1)
		go calculateSliceSimilarity(subs, start, end, &wg)
	}
	wg.Wait()
}

// calculateSliceSimilarity is one worker: it processes subs[start:end],
// pairing each of its subs with every sub that comes after it alphabetically.
func calculateSliceSimilarity(subs []string, start int, end int, wg *sync.WaitGroup) {
	defer wg.Done()
	dbmap := InitDb()
	defer dbmap.Db.Close()
	lenSubs := len(subs)
	for i := start; i < end; i++ {
		for j := i + 1; j < lenSubs; j++ {
			calculateSubsSimilarity(subs[i], subs[j], dbmap)
		}
	}
}

func calculateSubsSimilarity(sub1 string, sub2 string, dbmap *gorp.DbMap) {
	var count_union []int
	var count_intersection []int

	// Union: users who commented in sub1 OR sub2
	query := "select count(distinct(redditor)) from redditors_comments where sub ='" + sub1 +
		"' or sub ='" + sub2 + "';"
	_, err := dbmap.Select(&count_union, query)
	utils.CheckErr(err, "Error on subsimilarity union")
	union := count_union[0]

	// Intersection: users who commented in sub1 AND sub2
	query = "select count(distinct(redditor)) from redditors_comments where sub='" +
		sub1 + "' and redditor in (select distinct(redditor) from redditors_comments where sub='" + sub2 + "');"
	_, err = dbmap.Select(&count_intersection, query)
	utils.CheckErr(err, "Error on subsimilarity intersection")
	intersection := count_intersection[0]

	similarity := float64(intersection) / float64(union)
	s := &Similarity{Sub1: sub1, Sub2: sub2, Similarity: similarity}
	err = dbmap.Insert(s)
	utils.CheckErr(err, "Error inserting similarity")
}
``````

This script will take several hours to finish (on my two-year-old laptop it took three full days).

If you are using PostgreSQL like me, you might need to modify your postgresql.conf file to increase the maximum number of connections.
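For reference, the setting lives in `postgresql.conf`; the value below is only an illustrative sketch, since the right number depends on your machine and on the `n_threads` you picked (every worker goroutine opens its own connection):

``````
# postgresql.conf
max_connections = 300   # must comfortably exceed n_threads plus any other clients
``````

Note that a change to `max_connections` only takes effect after restarting the PostgreSQL server.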

Once the similarity calculation is done, we will know how similar each of Reddit's top Subreddits is to every other.

A nice property of this implementation is that you can keep running it while your application is in production, so the algorithm gets better with every new piece of user information that reaches the database.

Here is the list of the top pairs of subreddits in terms of Similarity:

``````
sub1                 |  sub2               |  similarity
iOSthemes            |  jailbreak          |  0.142020815264527
aviation             |  flying             |  0.101759227319765
ukpolitics           |  unitedkingdom      |  0.101730688054031
Liberal              |  progressive        |  0.0998427672955975
keto                 |  ketorecipes        |  0.0974065138721351
ImaginaryCharacters  |  ImaginaryMonsters  |  0.088479809976247
``````

Looking at the table of the most similar Subreddits, we can see that the similarity calculation is meaningful: each pair consists of Subreddits that are clearly closely related.

Another interesting analysis is to display which Subreddits are most similar to a specific one.

For example, here are the most similar subreddits to the subreddit /r/beer:

``````
sub          |  similarity
beerporn     |  0.0723035554227667
Homebrewing  |  0.0376598768356229
cocktails    |  0.0122950819672131
wine         |  0.0110444935310824
bourbon      |  0.0105506099571382
newjersey    |  0.0103806228373702
drunk        |  0.0100190839694656
showerbeer   |  0.00997719498289624
Cooking      |  0.00884573894282632
Coffee       |  0.008186852877438
``````

Pretty decent results: the Subreddits most similar to /r/beer are alcohol- and food-related, and, for some reason, /r/newjersey.

Now, the last thing to figure out is: how can we recommend Subreddits to a Redditor?

Very easy. Given a Redditor `r` with the set of Subreddits he/she likes, `[sub1,sub2,...]`,
we can sum, for each Subreddit in the dataset, its similarity with every Subreddit the Redditor likes.

So in this example, `r1` likes `[sub1,sub2,...]`, and we compute the most similar Subreddits to `[sub1,sub2,...]`:

For sub1:

``````
sub1  |  sub3  |  sim1-3
sub1  |  sub4  |  sim1-4
sub1  |  sub5  |  sim1-5
sub1  |  ...   |  ...
``````

For sub2:

``````
sub2  |  sub3  |  sim2-3
sub2  |  sub4  |  sim2-4
sub2  |  sub5  |  sim2-5
sub2  |  ...   |  ...
``````

We continue calculating these subtables for each Subreddit that `r1` likes.
Note that we don't calculate `sim1-2`, because `sub1` and `sub2` are already liked by `r1`.

Now, we calculate the similarity of each Subreddit with the Redditor `r1`:

``````sim3-r1 = sim1-3 + sim2-3 ...
sim4-r1 = sim1-4 + sim2-4 ...
sim5-r1 = sim1-5 + sim2-5 ...
...
``````

We now have the similarity of each Subreddit with Redditor `r1`.

The last step is simply returning to the user the top N Subreddits sorted by this similarity score.
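The scoring described above can be sketched in a few lines of Go. Here the `similarity` table is modeled as an in-memory map keyed by Subreddit pair (in the real application these rows would come from the database), and the pair names and numbers below are made up purely for illustration:

``````go
package main

import (
	"fmt"
	"sort"
)

// pair keys the similarity table; each pair is stored once, alphabetically ordered.
type pair struct{ sub1, sub2 string }

// recommend sums, for every candidate Subreddit, its similarity with each
// Subreddit the Redditor already likes, and returns the top n candidates.
// Subreddits the Redditor already likes are never recommended.
func recommend(sims map[pair]float64, liked []string, n int) []string {
	likedSet := make(map[string]bool)
	for _, s := range liked {
		likedSet[s] = true
	}
	scores := make(map[string]float64)
	for p, sim := range sims {
		// Accumulate the score on whichever side of the pair is not yet liked.
		if likedSet[p.sub1] && !likedSet[p.sub2] {
			scores[p.sub2] += sim
		} else if likedSet[p.sub2] && !likedSet[p.sub1] {
			scores[p.sub1] += sim
		}
	}
	candidates := make([]string, 0, len(scores))
	for s := range scores {
		candidates = append(candidates, s)
	}
	sort.Slice(candidates, func(i, j int) bool {
		return scores[candidates[i]] > scores[candidates[j]]
	})
	if len(candidates) > n {
		candidates = candidates[:n]
	}
	return candidates
}

func main() {
	// Hypothetical similarity rows, just for illustration.
	sims := map[pair]float64{
		{"beer", "beerporn"}:    0.072,
		{"beer", "Homebrewing"}: 0.038,
		{"beer", "wine"}:        0.011,
		{"Homebrewing", "wine"}: 0.020,
	}
	fmt.Println(recommend(sims, []string{"beer"}, 2))
}
``````

Since each pair is stored only once, the sketch checks both sides of the key; pairs where both Subreddits are already liked contribute nothing, which mirrors the `sim1-2` note above.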

So now we can say we have built a similarity engine. Oh yeah!

In the next (and final) part, I will talk about how to use the similarity engine in a live web application.