Building a food recommendation engine with Spark / MLlib and Play

Recommendation engines have become very popular in the last decade with the explosion of e-commerce, on-demand music and movie services, dating sites, local reviews, news aggregation and advertising (behavioral targeting, intent targeting, …). Based on your past actions (e.g., purchases, reviews left, pages visited, …) or your interests (e.g., Facebook likes, Twitter follows), the recommendation engine presents other products that might interest you, using other users' actions and behaviors (page clicks, page views, time spent on a page, clicks on images/reviews, …).

In this post, we’re going to implement a recommender for food using Apache Spark and MLlib. For instance, if a user is interested in some coffee products, we might recommend her other coffee brands, coffee filters or related products that other users like too.

[Screenshot: the rating page]

To make it more interactive, we implemented a simple web interface using the Play framework where you are presented with a list of products that you can rate (à la Hot or Not, but for food). After rating a certain number of products, you can get recommendations by clicking on the “Recommendation” tab at the top. This trains the recommender and then displays about 10 products that might interest you. Note that this can take a minute or so (this process is normally run offline).

[Screenshot: the recommendation page]

In order to run the example you will need a running MongoDB instance (the ratings you enter are stored in it) and SBT or Typesafe Activator to start the Play application.

The example is on GitHub at: http://github.com/chimpler/blog-spark-food-recommendation

You can pull the code as follows:

$ git clone http://github.com/chimpler/blog-spark-food-recommendation

As a training set, we use Amazon reviews from the Fine Foods section written between 2010 and 2012. These reviews can be downloaded from the SNAP Stanford website. The dataset contains about 500,000 reviews written by 256,000 users on 74,000 products. For the sake of simplicity, we’re going to keep only the reviews left by users who wrote between 10 and 20 reviews, which leaves us with around 60,000 reviews.

You can run the script download.sh to download the file and convert it into a simple CSV (userId, productId, rating).
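For reference, the conversion essentially boils down to the following (a minimal Scala sketch of what download.sh does; it assumes the SNAP dump layout with its product/productId, review/userId and review/score fields, and the file name and encoding are assumptions):

import scala.io.Source

object SnapToCsv {
  def main(args: Array[String]) {
    var userId = ""
    var productId = ""
    // the SNAP dump is a sequence of "key: value" lines, one block per review
    for (line <- Source.fromFile("finefoods.txt", "ISO-8859-1").getLines()) {
      line.split(": ", 2) match {
        case Array("product/productId", value) => productId = value
        case Array("review/userId", value)     => userId = value
        case Array("review/score", value)      => println(s"$userId,$productId,$value")
        case _                                 => // ignore the other fields (summary, text, ...)
      }
    }
  }
}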

The Play application is pretty straightforward, with two main routes, /rating and /recommendation, that map to the Application controller (see the routes sketch below). The rating action picks a random product from the CSV, shows it so you can rate it, and stores your rating in MongoDB.
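The routes file looks roughly like this (a sketch: the HTTP methods and action names are assumptions, only the two paths come from the description above):

GET     /rating             controllers.Application.rating
POST    /rating             controllers.Application.saveRating
GET     /recommendation     controllers.Application.recommendation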

The parsing of the CSV is pretty straightforward:

  val rawTrainingRatings = sc.textFile(ratingFile).map {
    line =>
      val Array(userId, productId, scoreStr) = line.split(",")
      AmazonRating(userId, productId, scoreStr.toDouble)
  }

  // only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
  val trainingRatings = rawTrainingRatings.groupBy(_.userId)
                                          .filter(r => MinRecommendationsPerUser <= r._2.size  && r._2.size < MaxRecommendationsPerUser)
                                          .flatMap(_._2)
                                          .repartition(NumPartitions)
                                          .cache()
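The snippet above relies on a few definitions along these lines (a sketch: the 10 and 20 bounds come from the filtering described earlier, while the application name and the number of partitions are arbitrary choices):

import org.apache.spark.{SparkConf, SparkContext}

case class AmazonRating(userId: String, productId: String, rating: Double)

val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val NumPartitions = 20 // assumption: tune this to your number of cores

val sc = new SparkContext(new SparkConf().setAppName("food-recommendation")
                                         .setMaster("local[*]"))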

In order to present the product information (image and description), we fetch the corresponding Amazon page and parse it using the Lagarto HTML parser as follows:

import jodd.lagarto.dom.{LagartoDOMBuilder, NodeSelector}
import scala.collection.JavaConverters._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object AmazonPageParser {
  def parse(productId: String): Future[AmazonProduct] = {
    val url = s"http://www.amazon.com/dp/$productId"
    // HttpClient.fetchUrl is the project's asynchronous HTTP helper and returns a Future
    HttpClient.fetchUrl(url) map {
      httpResponse =>
        if (httpResponse.getStatusCode == 200) {
          val body = httpResponse.getResponseBody
          val domBuilder = new LagartoDOMBuilder()
          val doc = domBuilder.parse(body)

          val responseUrl = httpResponse.getUri.toString
          val nodeSelector = new NodeSelector(doc)
          // NodeSelector.select returns a java.util.List, hence the asScala conversions
          val title = nodeSelector.select("span#productTitle").asScala.head.getTextContent
          val img = nodeSelector.select("div#main-image-container img").asScala.head.getAttribute("src")
          val description = nodeSelector.select("div#feature-bullets").asScala.headOption.map(_.getHtml).mkString

          AmazonProduct(productId, title, responseUrl, img, description)
        } else {
          throw new RuntimeException(s"Invalid url $url")
        }
    }
  }
}
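For instance, fetching a product page looks like this (a sketch: the ASIN is a placeholder, we assume AmazonProduct exposes title and url fields matching the constructor arguments above, and an implicit execution context is in scope):

AmazonPageParser.parse("B00EXAMPLE").foreach { product =>
  println(s"${product.title} -> ${product.url}")
}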

The ALS recommender accepts as input an RDD of Rating(user: Int, product: Int, rating: Double). Since all the IDs in the CSV are String identifiers, we create a simple Dictionary class to map each String to its position in an index.
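A minimal Dictionary can be written as follows (a sketch consistent with the getIndex, getWord and size calls used below):

class Dictionary(words: Seq[String]) {
  private val indexByWord: Map[String, Int] = words.zipWithIndex.toMap

  // number of distinct words in the index
  val size: Int = words.size

  def getIndex(word: String): Int = indexByWord(word)

  def getWord(index: Int): String = words(index)
}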

The ratings from the existing reviews can then be created as follows:

  val userDict = new Dictionary(MyUsername +: trainingRatings.map(_.userId).distinct.collect)
  val productDict = new Dictionary(trainingRatings.map(_.productId).distinct.collect)

  private def toSparkRating(amazonRating: AmazonRating) = {
    Rating(userDict.getIndex(amazonRating.userId),
      productDict.getIndex(amazonRating.productId),
      amazonRating.rating)
  }

  private def toAmazonRating(rating: Rating) = {
    AmazonRating(userDict.getWord(rating.user),
      productDict.getWord(rating.product),
      rating.rating
    )
  }

  // convert to Spark Ratings using the dictionaries
  val sparkRatings = trainingRatings.map(toSparkRating)

Then, in order to train the recommender, we pass in the existing ratings from the CSV as well as the ratings that you left, and predict ratings for all the products that you haven’t rated yet. We then order them by predicted rating and keep the top 10.

  def predict(ratings: Seq[AmazonRating]) = {
    // train model
    val myRatings = ratings.map(toSparkRating)
    val myRatingRDD = sc.parallelize(myRatings)
    // ALS.train(ratings, rank, iterations, lambda) with rank = 10, 20 iterations and lambda = 0.01
    val model = ALS.train((sparkRatings ++ myRatingRDD).repartition(NumPartitions),
                          10, 20, 0.01)

    val myProducts = myRatings.map(_.product).toSet
    val candidates = sc.parallelize((0 until productDict.size).filterNot(myProducts.contains))

    // get all products not in my history ordered by rating (higher first)
    val myUserId = userDict.getIndex(MyUsername)
    val recommendations = model.predict(candidates.map((myUserId, _))).collect()
    recommendations.sortBy(-_.rating).take(NumRecommendations).map(toAmazonRating)
  }
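Wiring this into the controller then boils down to loading your ratings and rendering the result (a sketch; loadMyRatingsFromMongo is a hypothetical helper standing in for the MongoDB lookup):

val myRatings: Seq[AmazonRating] = loadMyRatingsFromMongo() // hypothetical helper
predict(myRatings).foreach { r =>
  println(s"${r.productId} predicted rating: ${r.rating}")
}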

Conclusion

We showed in this post how to implement a simple recommender. We skipped the steps to determine the best values for the lambda, rank and number-of-iterations parameters using cross-validation, which is well explained on the official Spark tutorial pages on recommendations with MLlib. As one can see, the recommendation phase takes some time (around a minute for only 60,000 ratings) and thus cannot be real-time. It is generally computed offline, and recommendations are usually shown or sent by email with a slight delay. In the past 10 years, recommendation algorithms have improved to support incremental updates (Incremental Collaborative Filtering, or ICF) to provide real-time recommendations, which can be particularly useful in advertising for real-time intent targeting. Indeed, presenting relevant ads to a user based on her immediate history has more impact than presenting ads based on her history from 2 hours ago.
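For reference, that cross-validation boils down to a grid search over (rank, lambda, iterations) that keeps the model with the lowest RMSE on a held-out validation set; a minimal sketch (the grid values are arbitrary):

val Array(training, validation) = sparkRatings.randomSplit(Array(0.8, 0.2))

val (bestRmse, bestRank, bestLambda, bestIters) = (for {
  rank   <- Seq(8, 10, 12)
  lambda <- Seq(0.01, 0.1, 1.0)
  iters  <- Seq(10, 20)
} yield {
  val model = ALS.train(training, rank, iters, lambda)
  // predict on the validation pairs and compare with the observed ratings
  val predictions = model.predict(validation.map(r => (r.user, r.product)))
                         .map(p => ((p.user, p.product), p.rating))
  val observed = validation.map(r => ((r.user, r.product), r.rating))
  val rmse = math.sqrt(observed.join(predictions).values
                               .map { case (o, p) => (o - p) * (o - p) }
                               .mean())
  (rmse, rank, lambda, iters)
}).minBy(_._1)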



Responses to Building a food recommendation engine with Spark / MLlib and Play

  danilobits says:

    Hello,
    This example is very interesting. Great job!

    I tried to run it on my computer and I got an error in the Recommendations section:
    [RuntimeException: The key 'userId' could not be found in this document or array]

    Can you help me?
    I’m using Scala 2.10.4, MongoDB 3.0.2 and Play Framework 2.3.8 with Activator 1.3.2.

    Thanks!
