By Neal Lathia, Senior Data Scientist, London.

Learning to Rank for Flight Itinerary Search

Searching for the perfect flight is one of the core activities for Skyscanner app users. Similarly, thinking about ways to help our users book the ideal trip is one of the problems that we find ourselves discussing the most. Recently, at the Workshop on Recommenders in Tourism (part of ACM RecSys 2017), I had the opportunity to give a keynote where I shared some of our recent experiments in this domain, and in particular how we are applying machine learning to this problem. In this post, I'll expand on some of the thoughts that I shared there.

A Complex Search Problem

Many of us will have already experienced the pain of searching for an ideal flight: there are so many different factors that come into play. Is this one too expensive? Could we get to the airport in time for this one? Should we stop over somewhere to save some money, or pay a bit more for a direct flight? Should we leave a day earlier and come back a day later? Depending on where you are flying from and where you are trying to fly to, the possibilities may seem endless.

In a recent post, Zsombor described how the Skyscanner app's flight search result page has evolved over time: it is moving away from being an infinite list of price-sorted flights and towards being a set of tools (or widgets) that help users navigate the complexity of this choice. While there are many ideas and improvements that we can envisage to surface information and control, one of the questions that we asked was: can we help users by immediately surfacing the 'best' flights?

Learning to Rank

In the research literature, sorting 'items' (in this case, flight itineraries) using some notion of 'best' or 'relevant' is known as learning to rank. Applying various forms of machine learning to this problem space has been studied extensively and is increasingly common across products (e.g., Search at Slack, Venue Search in Foursquare, and Ranking Twitter's Feed).
The basic premise is that if we had data about items that users think of as positive examples (relevant to their query) and items that users think of as negative examples (not relevant to their query), then we could use these to train a machine learning model that can predict the probability that a user will find a flight relevant to their query. In doing so, we have translated the problem of ranking items into a binary regression one (note: there are other methods, such as the pairwise approach to ranking, which I do not cover here).

Defining relevance in any context is tricky. Many systems rely on measuring it implicitly, by looking at what items users are clicking on. In the Skyscanner app, though, clicking on a search result is not a very strong signal that users have found what they are looking for; you may simply be clicking on an itinerary to find out a little bit more. A much stronger signal of relevance is the commitment of clicking through to the airline or travel agent's website to purchase it, which requires multiple actions from the user.

Relevance in flight search: a search result is relevant if you bought it.

There were a few stages in the journey from the idea of ranking flight search results with machine learning to running experiments. We tackled these with two streams of work: offline and online experiments.

Could it work? Offline Experiments

We first refined how we collected data about each user's search experience; this data was then fed through a pipeline that took care of joining, transforming, and reshaping it into a set of features and a binary relevance score.

Having this kind of historical data allows us to ask what-if kinds of questions. What if we had sorted flights in a different way: would the flight that you picked appear closer to the top of your search result list? Formally, this means conducting a number of offline evaluations.
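To make this kind of what-if evaluation concrete, here is a minimal sketch in Python. It takes historical sessions where we know which itinerary was booked, re-ranks each session's results under a candidate ordering, and measures where the booked flight lands. All session data, feature names, and the alternative sort key are invented for illustration; this is not Skyscanner's data or model.

```python
# Toy offline "what-if" evaluation: for each historical search session we
# know which itinerary the user booked. Re-rank the session's results under
# a candidate ordering and ask how close to the top the booked flight lands.

def reciprocal_rank(ranked_ids, booked_id):
    """1 / (1-based position) of the booked itinerary, or 0 if absent."""
    for pos, itin_id in enumerate(ranked_ids, start=1):
        if itin_id == booked_id:
            return 1.0 / pos
    return 0.0

def mean_reciprocal_rank(sessions, sort_key):
    """Average reciprocal rank of the booked itinerary across sessions,
    after sorting each session's itineraries ascending by `sort_key`."""
    total = 0.0
    for itineraries, booked_id in sessions:
        ranked = sorted(itineraries, key=sort_key)
        total += reciprocal_rank([i["id"] for i in ranked], booked_id)
    return total / len(sessions)

# Each session: (list of itineraries, id of the itinerary that was booked).
sessions = [
    ([{"id": "A", "price": 120, "duration_h": 2.0},
      {"id": "B", "price": 95,  "duration_h": 7.5},
      {"id": "C", "price": 150, "duration_h": 2.2}], "A"),
    ([{"id": "D", "price": 300, "duration_h": 3.0},
      {"id": "E", "price": 220, "duration_h": 11.0}], "D"),
]

# Baseline: cheapest first.  What-if: trade price off against duration.
mrr_price = mean_reciprocal_rank(sessions, lambda i: i["price"])
mrr_tradeoff = mean_reciprocal_rank(
    sessions, lambda i: i["price"] + 20 * i["duration_h"])
print(mrr_price, mrr_tradeoff)  # → 0.5 1.0
```

In this toy example the what-if ordering places the booked flight first in every session, so its Mean Reciprocal Rank beats the price-sorted baseline; an offline toolkit runs exactly this comparison over millions of real sessions and several metrics.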
We developed a toolkit to support this work. Much like similar open-source tools, it took care of the basics of conducting a machine learning experiment: splitting the data into training and test sets, and collecting a variety of metrics (e.g., Mean Average Precision and Mean Reciprocal Rank) during each test. We developed our own tool so that we could compare how any machine learning approach performs against a simple baseline: price sorting.

Unsurprisingly, many of our initial experiments did not pay off. Our baseline was proving incredibly difficult to beat, perhaps because all of our data was sourced from users viewing search results ranked this way. We kept iterating until we found one set of features that seemed to do better than price sorting, using an incredibly simple model to begin with: logistic regression.

Will it work? An Online Experiment

An open challenge with offline machine learning experiments is understanding how offline metrics correlate with online performance. In other words, just because an algorithm seems to do better than the baseline on historical data, it doesn't mean that it will do better for users. To validate this, we turned to an online experiment.

To do so, we built all of the remaining parts we needed to run a production experiment. At Skyscanner, this means building components that interact with our data platform and a microservice to serve predictions.

One way to evaluate this approach in an A/B test would be to completely replace the price-sorted list with a new, relevance-sorted one. While that approach is being actively explored, the UI that we initially tested was a widget that recommended flights above the price-sorted list.

In the end, we ran an experiment comparing users who were given recommendations using machine learning, users who were given recommendations using a heuristic that only took price and duration into account, and users who were not given any recommendations at all.
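The machine-learning variant boils down to the pointwise formulation described earlier: a logistic regression over itinerary features that predicts the probability of relevance, with results sorted by that score. Here is a minimal, self-contained sketch; the hand-rolled gradient-descent trainer, the features, and the training data are all made up for illustration (a production system would use a proper ML library and far richer features).

```python
import math

# Pointwise ranking sketch: logistic regression trained on
# (feature vector, was-it-booked) pairs, then used to sort itineraries
# by predicted probability of relevance.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Stochastic gradient descent on the log-loss; returns (weights, bias)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(w, b, x):
    """Predicted probability that this itinerary is relevant."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Features per itinerary: (normalised price, normalised duration, is_direct).
# Label: 1 if the user clicked through and booked it, else 0.
X = [(0.2, 0.3, 1), (0.1, 0.9, 0), (0.9, 0.2, 1), (0.3, 0.8, 0),
     (0.2, 0.2, 1), (0.8, 0.9, 0), (0.4, 0.3, 1), (0.6, 0.7, 0)]
y = [1, 0, 1, 0, 1, 0, 1, 0]

w, b = train_logistic(X, y)

# Rank a fresh result page by predicted relevance, highest first.
results = [("budget red-eye", (0.1, 0.9, 0)), ("direct midday", (0.4, 0.3, 1))]
ranked = sorted(results, key=lambda r: score(w, b, r[1]), reverse=True)
print([name for name, _ in ranked])  # the direct, shorter flight ranks first
```

A recommendation widget would then surface the top few itineraries by this score above the price-sorted list, while the heuristic variant replaces `score` with a fixed price-and-duration rule.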
Online experiment: recommending flights above the traditional, price-sorted list.

Our First Results: Search Effort and Conversion

As above, we did not evaluate this by looking at clicks; we were not interested in users clicking on a widget. Instead, we first quantified search effort by looking at how often users would filter, re-sort, or re-search for an itinerary in a single session, both when they received flight recommendations and when they didn't. In this case, we didn't find any significant differences.

More broadly, though, we were interested in measuring conversion: how often users would purchase a flight that was recommended by the widget versus purchasing a flight that fell below it, in the 'traditional' result list. We found evidence of great promise; much to our surprise, this first ranking model drove more purchases into the widget than the rule-based variant did.

A Glimpse into Future Work

This experiment is one of many; much of the process was about laying the foundation so that we can explore the thousands of hypotheses that using machine learning for flight ranking offers, both offline and online. There is much more to come in this space at Skyscanner: experiments that focus on the UI, that go across platforms, that test new machine learning models, and that build various widgets using this approach.

Work with us

We're hiring! We do things differently at Skyscanner, and we're on the lookout for more Engineering Tribe Members across our global offices. Take a look at our Skyscanner Jobs page for vacancies.

About the Author

Hi, I'm Neal. I'm currently a Senior Data Scientist in Skyscanner's London office. For the experiments described above, I worked with a squad that is mostly based in our Budapest office, so the actual story behind building these features for travellers also included travelling a few times (to a beautiful city that I had not been to before).
You can find me on Medium and Twitter, or find out about my former academic life on my website.