Photo by NASA on Unsplash

In this article series I will describe some essential techniques for querying geospatial data in MongoDB, which can be useful if you want your app or API to provide access to information ordered by distance from a specific location: for example businesses (restaurants, shops etc.) or other places of interest, or other users of the app (as in the case of dating apps). The techniques will allow your service to scale and remain efficient, as they enable constant time access to the data (regardless of the amount of data in your database) and minimise the caching required on the client side.

What you will learn

In the following sections I will show you how to:

1. store location data (longitude, latitude pairs) in MongoDB documents
2. query such documents using the location data, with results sorted by distance from a specified point (from closest to furthest)
3. efficiently page through results of such queries

You should be able to follow this tutorial even if you haven't used MongoDB before. On the other hand, if the first 2 points sound familiar, you might want to skip straight to section 3.

We'll be using Node.js and the official JavaScript MongoDB driver. Code snippets will be in CoffeeScript 2. If you want to run the code examples locally, clone the accompanying repo and follow the instructions in its README.md.

Storing location data.

For longitude, latitude pairs we need to use the GeoJSON Point object format. The document `doc` that you'd insert into a mongo collection has a field called `location`, whose value is a GeoJSON Point object with two keys, `type` and `coordinates`. The name of the field is immaterial: we could have used any other valid key name instead of `location`. We could also nest the GeoJSON deeper into the object structure, however putting `type` and `coordinates` at the top level of `doc` wouldn't work.
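As a minimal sketch of the document shape described here, the helper below builds such a document in plain JavaScript (used for this illustration instead of CoffeeScript so it runs without a compile step). The name `make_point_doc` and the range checks are assumptions made for illustration, not code from the accompanying repo; the checks mirror the coordinate limits that apply to GeoJSON points.

```javascript
// Hypothetical helper (not from the accompanying repo): builds a document
// whose `location` field holds a GeoJSON Point.
function make_point_doc(longitude, latitude) {
  // GeoJSON lists longitude first, then latitude.
  if (!(longitude >= -180 && longitude <= 180)) {
    throw new RangeError(`longitude out of range: ${longitude}`);
  }
  if (!(latitude >= -90 && latitude <= 90)) {
    throw new RangeError(`latitude out of range: ${latitude}`);
  }
  return {
    location: {
      type: 'Point',
      coordinates: [longitude, latitude],
    },
  };
}

// Example: a point on the prime meridian, 15 degrees north of the equator.
const example_doc = make_point_doc(0, 15);
console.log(JSON.stringify(example_doc));
```

Keeping `type` and `coordinates` together under a single field is what allows that field to be indexed and queried as a location.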
It's also possible to store multiple GeoJSON objects in one document.

Note that:

- longitude comes first (the reverse of what you might be used to from, for example, Google Maps queries)
- longitude values need to be between -180 and 180 (both inclusive)
- latitude values need to be between -90 and 90 (both inclusive)

Now that you know the basics, let's generate some data to work with: 6 documents, with locations starting at the equator, increasing the latitude in equal intervals, while keeping longitude fixed at 0.

(Illustration: the points plotted on a sphere. See code used to generate this on JSFiddle.)

Note that in the following code examples we will omit the boilerplate required to obtain a mongodb collection object and insert documents into it. In the accompanying repo, this boilerplate has been contained in the [with_collection](https://github.com/adrian-gierakowski/paging_geospatial_data_code/tree/f7014c4418c185a06dc3a83d2da0c72ce158e193/src/helpers/with_collection.coffee) and [with_collection_with_points](https://github.com/adrian-gierakowski/paging_geospatial_data_code/tree/f7014c4418c185a06dc3a83d2da0c72ce158e193/src/helpers/with_collection_with_points) helper functions.

Querying for documents based on location.

To query documents based on their distance from a specific point we are going to use the [$near](https://docs.mongodb.com/manual/reference/operator/query/near/index.html) query operator. But before we do this, we need to create a 2dsphere index on the field containing the GeoJSON Point objects (runnable example on GitHub).

Setting the `background` option is very important when creating an index on a live database, since by default `createIndex` will block all other operations on the database while the index is being created (which might take a while if the collection is big). From MongoDB 4.2 onwards the option is deprecated and ignored, because index builds no longer hold the lock for their whole duration.
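The index creation described above might look roughly like this in plain JavaScript. This is a sketch rather than the code from the accompanying repo: `ensure_location_index` is a hypothetical name, and `collection` is assumed to be a `Collection` object obtained from the official `mongodb` driver.

```javascript
// Sketch: create a 2dsphere index on the field holding GeoJSON Points.
// `collection` is assumed to be a Collection from the official driver.
function ensure_location_index(collection, field = 'location') {
  // `background: true` matters on servers older than MongoDB 4.2, where a
  // foreground build blocks other operations; newer servers ignore it.
  return collection.createIndex({ [field]: '2dsphere' }, { background: true });
}

// Usage, inside an async function with a connected client
// (the database and collection names are placeholders):
//   await ensure_location_index(client.db('test').collection('points'));
```

The function returns the driver's promise, so the caller decides whether to wait for the build to finish before issuing queries.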
The basic query returns all documents sorted by distance (great-circle distance, to be precise) from point `[ 0, 0 ]`, based on the data in the `location` field of the documents (runnable example on GitHub).

Unless your intention is to process all documents in the collection (in which case you'll probably call [.stream](http://mongodb.github.io/node-mongodb-native/3.1/api/Cursor.html#stream) instead of [.toArray](http://mongodb.github.io/node-mongodb-native/3.1/api/Cursor.html#toArray)), you'll want to limit the number of returned documents using `limit` (runnable example on GitHub).

(Illustration: the white point marks the location used in the query, `[ 0, 0 ]`, and the points circled in green represent the results of the query with `limit` set to `2`. See code used to generate this on JSFiddle.)

Paging through results, from closest to furthest (forwards).

Note: the technique discussed in this section has been previously described by A. Jesse Jiryu Davis, who implemented the MongoDB feature which makes this technique possible. His article goes into details of why this method is performant, so it's worth a read. However, its code examples are in Python, therefore we'll go through it step by step here for the benefit of the Node.js community.

Building on the query we defined in the previous section, the simplest way to implement paging would be to use `limit` to control the page/batch size, and the [skip](https://docs.mongodb.com/manual/reference/method/cursor.skip/index.html) method to set the desired page offset (runnable example on GitHub). With the offset equal to the page size, such a query returns the 2nd page of results when querying our test data.

(Illustration: the 2nd page of results. See code used to generate this on JSFiddle.)
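The `$near` query and the naive `skip`/`limit` paging described above can be sketched as follows in plain JavaScript. The function names and the zero-based `page_index` parameter are assumptions made for illustration; `collection` is assumed to be a `Collection` from the official driver, with a 2dsphere index on `location`.

```javascript
// Builds the `$near` filter for documents whose `location` field holds a
// GeoJSON Point. Matching documents come back sorted from closest to furthest.
function near_filter(query_point) {
  return {
    location: {
      $near: {
        $geometry: { type: 'Point', coordinates: query_point },
      },
    },
  };
}

// Naive paging: skip whole pages, then take one page.
// `page_index` is zero-based (0 is the first page).
function fetch_page_with_skip(collection, query_point, page_size, page_index) {
  return collection
    .find(near_filter(query_point))
    .skip(page_size * page_index)
    .limit(page_size)
    .toArray();
}

// Usage, inside an async function:
//   const second_page = await fetch_page_with_skip(collection, [0, 0], 2, 1);
```

Keeping the filter in its own function makes it easy to reuse the same query with and without paging.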
However, the time taken by `skip` grows linearly as the offset increases (as demonstrated in the article mentioned above), since the MongoDB server needs to scan through all the query results from the beginning until the offset is reached.

A constant time alternative involves using the [$minDistance](https://docs.mongodb.com/manual/reference/operator/query/minDistance/index.html) query operator to exclude results which lie within a given radius from the query point. Assuming that we know the distance between the query point and the furthest document from a given page of results, we could query for the next page by passing that distance as `$minDistance` alongside `$near`.

But where do we get the distance from? We could try to calculate it using a hand-rolled implementation of a formula for the distance between two points on a sphere, or use something like [getDistance](https://github.com/manuelbieh/Geolib/blob/8273a52d86f7dfbd3b0e2aa2b7473ef5149c5374/src/geolib.js#L237) from Geolib. However, we'd have to make sure that our chosen implementation matches the one used by MongoDB. Fortunately we don't have to go through all that hassle, since we can ask MongoDB to attach a dynamically calculated distance to each document in the query results. We'll just have to convert our `find` query into an equivalent `aggregation` pipeline, using the `$geoNear` stage with its `distanceField` option.

Now we can use the `calculated_distance` property (the name given to `distanceField`) of the last document from the query results, since they are sorted by distance in ascending order, to fetch the next page. Let's use a `fetch_page` function built on this pipeline to fetch the first two pages (runnable example on GitHub).

(Illustration: the results of the second call to `fetch_page`, with the yellow circle enclosing the area excluded from the query by using `minDistance`. See code used to generate this on JSFiddle.)
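A `fetch_page` along the lines described above might look like the following sketch in plain JavaScript. It is not the author's implementation: the parameter object, the helper name `build_page_pipeline` and the use of a separate `$limit` stage are assumptions made for illustration. Distances are in meters, as `$geoNear` reports them for GeoJSON points.

```javascript
// Builds an aggregation pipeline equivalent to the `$near` find query, with
// the computed distance attached to each result as `calculated_distance`.
// `last_distance` is the distance of the last document of the previous page,
// or undefined when fetching the first page.
function build_page_pipeline({ query_point, page_size, last_distance }) {
  const geo_near = {
    near: { type: 'Point', coordinates: query_point },
    distanceField: 'calculated_distance',
    spherical: true,
  };
  if (last_distance !== undefined) {
    // Excludes documents closer to the query point than `last_distance`.
    geo_near.minDistance = last_distance;
  }
  // Assumption: on servers older than MongoDB 4.2, $geoNear itself returns
  // at most 100 documents by default, so page_size should stay below that.
  return [{ $geoNear: geo_near }, { $limit: page_size }];
}

// `collection` is assumed to be a Collection from the official driver.
function fetch_page(collection, params) {
  return collection.aggregate(build_page_pipeline(params)).toArray();
}

// Usage, inside an async function:
//   const page_1 = await fetch_page(collection, { query_point: [0, 0], page_size: 2 });
//   const last_distance = page_1[page_1.length - 1].calculated_distance;
//   const page_2 = await fetch_page(collection, { query_point: [0, 0], page_size: 2, last_distance });
```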
However, this is not exactly what we want: the last document of the previous page is included in the next page, since `minDistance` only excludes documents whose distance is smaller than the given value. To prevent this, we need to add a `query` to our `$geoNear` aggregation stage (runnable example on GitHub). The query uses the [$nin](https://docs.mongodb.com/manual/reference/operator/query/nin/) operator to skip documents with the given `_id`s when gathering the results.

Are we done? Nearly! Consider a set of points in which the 2nd and 3rd points (`[ 0, 15 ]` and `[ 0, -15 ]`) lie at exactly the same distance from point `[ 0, 0 ]`. Now, if we set the `query_point` to `[ 0, 0 ]` and `page_size` to `3`, what would the result of fetching the second page be? Assuming that the first query returned the points in that order (`[ 0, -15 ]` last), the second page would include `[ 0, 15 ]` again, even though it was already part of the first page.

(Illustration: the results of fetching the second page. See code used to generate this on JSFiddle.)

This is because neither `minDistance` nor `$nin` excludes the document with coordinates `[ 0, 15 ]`. Therefore, instead of just using the `_id` of the last document, we need to collect the `_id`s of all documents whose distance is equal to the distance of `last_doc`. Putting it all together: given `current_page`, you would fetch the next one by computing `last_distance` with `get_last_distance` and `exclude_ids` with `get_ids_to_exclude`, and passing both to `fetch_page` (runnable example on GitHub).

Note that the logic in `get_last_distance` and `get_ids_to_exclude` will most likely be executed on the client, hence it was not included in the `fetch_page` function, which expects precomputed `exclude_ids` and `last_distance` to be passed in. This is to minimize the amount of data sent over the wire. Alternatively, the server could include an HTTP Link header with all the information necessary to fetch the next page, in which case all of the above code would be executed on the server.
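The pieces named above can be sketched in plain JavaScript as follows. The function names `get_last_distance` and `get_ids_to_exclude` come from the article, but their bodies here are an assumption about how they might work, not the code from the accompanying repo; `build_next_page_pipeline` is a hypothetical name. The sketch assumes a non-empty page and, like the article, relies on exact equality of the distances reported by MongoDB.

```javascript
// Distance of the last (furthest) document in a non-empty page of results.
function get_last_distance(current_page) {
  return current_page[current_page.length - 1].calculated_distance;
}

// _ids of every document in the page that shares the last document's
// distance; `minDistance` alone would let all of them through again.
function get_ids_to_exclude(current_page) {
  const last_distance = get_last_distance(current_page);
  return current_page
    .filter((doc) => doc.calculated_distance === last_distance)
    .map((doc) => doc._id);
}

// Pipeline for the page that follows the one which produced
// `last_distance` and `exclude_ids`.
function build_next_page_pipeline({ query_point, page_size, last_distance, exclude_ids }) {
  return [
    {
      $geoNear: {
        near: { type: 'Point', coordinates: query_point },
        distanceField: 'calculated_distance',
        spherical: true,
        minDistance: last_distance,
        query: { _id: { $nin: exclude_ids } },
      },
    },
    { $limit: page_size },
  ];
}

// Example page with placeholder _ids and distances (not real query output).
const sample_page = [
  { _id: 'a', calculated_distance: 0 },
  { _id: 'b', calculated_distance: 1500 },
  { _id: 'c', calculated_distance: 1500 },
];
console.log(get_last_distance(sample_page), get_ids_to_exclude(sample_page));
```

Only `last_distance` and `exclude_ids` need to travel with the next request, which keeps the client's paging state small.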
Finally, we need to handle the case where there are so many documents with the same distance that the last document in a page ends up having the same distance as the `last_distance` used to fetch that page. For example, consider fetching the 3rd page (with `query_point = [ 0, 0 ]` and `page_size = 2`) from a set of points in which the whole 2nd page lies at the same distance as `[ 0, -15 ]`, the last point of page 1. Using the implementation described so far we would get points `[ [ 0, -15 ], [ 0, 30 ] ]` instead of the desired `[ [ 0, 30 ], [ 0, 45 ] ]`, since the `_id` of the last point from page 1 (`[ 0, -15 ]`) is not excluded even though its distance is the same as `minDistance`. Therefore, in such cases, we need to carry `exclude_ids` from the previous query to the next (runnable example on GitHub).

It is worth noting that in the extreme case of all documents having the same distance, the size of the `exclude_ids` array, and hence the amount of data sent over the wire on each request, would increase linearly as we progress through the pages. On top of this, query performance itself would degrade in a similar fashion (and my bet is that it would be worse than the naive implementation relying solely on `skip`). To mitigate this, instead of accumulating `exclude_ids`, we could keep a count of documents to skip (and use it in conjunction with `minDistance`). This would keep the size of the request constant, and the query performance within the bounds of the simple `skip` solution. However, keeping track of `_id`s to exclude has some advantages over using skip in cases where the queried data is highly dynamic (when changes are expected to happen in between fetches), which we might discuss in future articles.

Finally, I'd like to draw your attention to the fact that the above method doesn't facilitate jumping directly to a specific page (without fetching all pages in between). However, this can be achieved by adding an appropriate `skip` (equal to `page_size * pages_to_skip_count`) to the query.
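The two refinements discussed in this part of the section, carrying excluded `_id`s forward and jumping ahead by whole pages, can be sketched in plain JavaScript as follows. Both function names and the shape of the `previous` argument are assumptions made for illustration, not code from the accompanying repo.

```javascript
// Carry-over rule: if the page ends at the same distance that was used as
// `minDistance` to fetch it, keep the previously excluded _ids as well.
// `previous` describes the request that produced `current_page`.
function next_exclude_ids(current_page, previous = {}) {
  const last_distance = current_page[current_page.length - 1].calculated_distance;
  const ids = current_page
    .filter((doc) => doc.calculated_distance === last_distance)
    .map((doc) => doc._id);
  const carried = last_distance === previous.last_distance
    ? previous.exclude_ids || []
    : [];
  return carried.concat(ids);
}

// Pipeline that jumps ahead by `pages_to_skip_count` pages from the current
// position; with a count of 0 it simply fetches the next page.
function build_jump_pipeline({
  query_point,
  page_size,
  pages_to_skip_count = 0,
  last_distance = 0,
  exclude_ids = [],
}) {
  const pipeline = [
    {
      $geoNear: {
        near: { type: 'Point', coordinates: query_point },
        distanceField: 'calculated_distance',
        spherical: true,
        minDistance: last_distance,
        query: { _id: { $nin: exclude_ids } },
      },
    },
  ];
  if (pages_to_skip_count > 0) {
    pipeline.push({ $skip: page_size * pages_to_skip_count });
  }
  pipeline.push({ $limit: page_size });
  return pipeline;
}
```

The `$skip` stage is applied after `minDistance` and `$nin` have narrowed the results, so only the documents between the current position and the target page are scanned.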
Using `skip` will obviously incur a performance penalty proportional to the number of documents skipped, but there is really no way around it (unless, instead of specifying the number of pages/documents to skip, your use case could do with using `minDistance` to skip an unknown number of documents). Fortunately we only need to pay this price once per page jump, since once the desired page is fetched using `skip`, the next one can be queried for using only `minDistance` and `exclude_ids`.

Conclusion

We have demonstrated how to store and query geospatial data in MongoDB, and discussed how to efficiently page through large amounts of such data (from closest location to furthest). In Part 2 of this series, you will learn how to use a nifty trick to page through locations in reverse order.

Don't forget to clone the accompanying repo and play with the code examples, which can easily be run from the command line.