Recently on Facebook, David Smooke (the CEO of Hackernoon) posted an article in which he listed 2018’s Top Tech Stories. He also mentioned that if someone wished to make a similar list about, say, JavaScript, he would be happy to feature it on the frontpage of Hackernoon.

In a constant struggle to get more people to read my work I could not miss this opportunity, so I immediately started to plan how to approach making such a list. Since the year was coming to an end and I had limited time, I decided not to search for the posts by hand but to use my web-scraping skills instead.

I believe learning how to make such a scraper can be a useful exercise and serve as an interesting case study. If you have read my article about how I created an Instagram bot, then you know that the best way to interact with websites in Node.js is to use the puppeteer library, which controls a Chromium instance. This way we can do everything a potential user could do on a website.

Here is the link to the repository.

Creating a scraper

Let’s abstract away creating a puppeteer browser and pages with this simple helper:

```typescript
const createBrowser = async () => {
  const browser = await puppeteer.launch({ headless: true })

  return async function getPage<T>(url: string, callback: (page: puppeteer.Page) => Promise<T>) {
    const page = await browser.newPage()

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' })
      page.on('console', (msg) => console.log(msg.text()))

      const result = await callback(page)
      await page.close()
      return result
    } catch (e) {
      await page.close()
      throw e
    }
  }
}
```

We use the page inside a callback so that we can avoid repeating the same code over and over again. Thanks to this helper we don’t need to worry about navigating to a given URL, listening to console.logs from inside page.evaluate, and closing the page after everything is done. The result of the function is going to be returned inside a promise, so we can just await it later and don’t have to use the result inside the callback.
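Stripped of the puppeteer specifics, the pattern the helper follows is: acquire a resource, hand it to a callback, and release it on both the success and the failure path. Here is a minimal sketch of that pattern with a plain object standing in for the page; the `withResource` name and the fake `Resource` type are mine, for illustration only:

```typescript
// A fake "page": something that must be closed when we are done with it.
type Resource = { closed: boolean }

// Same shape as getPage: run the callback against a freshly acquired
// resource and guarantee cleanup whether the callback resolves or throws.
const withResource = async <T>(callback: (res: Resource) => Promise<T>): Promise<T> => {
  const res: Resource = { closed: false } // stands in for browser.newPage()
  try {
    const result = await callback(res)
    res.closed = true // stands in for page.close() on success
    return result
  } catch (e) {
    res.closed = true // stands in for page.close() on failure
    throw e
  }
}
```

The two cleanup calls could also be collapsed into a single `finally` block; the helper above simply spells both paths out.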
Let’s talk about the data

There is a website where we can find all the articles with the JavaScript tag published by Hackernoon. They are sorted by date, but sometimes an article published much earlier (say, in 2016) shows up out of nowhere, so we have to watch out for this.

We can extract all the needed information from the post preview alone, without actually opening the post in a new tab, which makes our work much easier. In the box shown above we can see all the data we want:

Author’s name and the URL of his/her profile
Title of the article and its URL
Number of claps
Read time
Date

Here’s the interface of an article:

```typescript
interface Article {
  articleUrl: string
  date: string
  claps: number
  articleTitle: string
  authorName: string
  authorUrl: string
  minRead: string
}
```

On Medium there is an infinite scroll, which means that as we scroll down more articles are loaded. If we were to use GET requests to fetch the static HTML and parse it with a library such as JSDOM, getting those articles would be impossible, because we can’t scroll static HTML. That is why puppeteer is a life-saver when it comes to any kind of interaction with a website.

To get all the loaded posts we can use:

```typescript
Array.from(document.querySelectorAll('.postArticle'))
  .slice(offset)
  .map((post) => {})
```

Now we can use each post as a context for the selectors: instead of writing document.querySelector we are now going to write post.querySelector. This way we restrict the search to a given post element.

Also, notice the .slice(offset) snippet: since we are scrolling down and not opening a new page, the already parsed articles are still there. Of course, we could parse them again, but that would not be very efficient. The offset starts at 0, and every time we scrape some articles we add the length of the collection to the offset:

```typescript
offset += scrapedArticles.length
```

Scraping the data of a post

The most popular error when it comes to scraping data is “Cannot read property ‘textContent’ of null”.
We are going to create a simple helper function that prevents us from ever trying to get a property of a non-existing element:

```typescript
function safeGet<T extends Element, K>(
  element: T,
  callback: (element: T) => K,
  fallbackValue = null,
): K {
  if (!element) {
    return fallbackValue
  }
  return callback(element)
}
```

safeGet will only execute the callback if the element exists. Now let’s use it to access the properties of the elements holding the data we are interested in.

Date when an article was published

```typescript
const dateElement = post.querySelector('time')
const date = safeGet(
  dateElement,
  (el) => new Date(el.dateTime).toUTCString(),
  '',
)
```

Should something happen with dateElement and it not be found, our safeGet will prevent errors. The <time> element has an attribute called dateTime which holds a string representation of the date when the article was published.

Author’s name and profile URL

```typescript
const authorDataElement = post.querySelector<HTMLLinkElement>(
  '.postMetaInline-authorLockup a[data-action="show-user-card"]',
)

const { authorUrl, authorName } = safeGet(
  authorDataElement,
  (el) => {
    return {
      authorUrl: removeQueryFromURL(el.href),
      authorName: el.textContent,
    }
  },
  {},
)
```

Inside this <a> element we can find both the user’s profile URL and his/her name.

Also, here we use removeQueryFromURL because both the author’s profile URL and the post’s URL have this weird source parameter in the query that we would like to remove:

https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1?source=———1———————

The ? character in a URL denotes the start of the query parameters, so let’s simply remove everything after it:

```typescript
const removeQueryFromURL = (url: string) => url.split('?').shift()
```

We split the string at ? and return only the first part.

Claps

In the example post above we see that the number of “claps” is 204, which is accurate. However, once the numbers exceed 1000 they are displayed as 1K, 2K, 2.5K. This could be a problem if we needed the exact number of claps, but in our use case this rounding works just fine.
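The K-suffix handling is easy to pull out into a standalone function and sanity-check on its own. The `parseClaps` name is mine; in the scraper itself the same logic lives inline in the safeGet callback shown below:

```typescript
// Convert Medium's abbreviated clap count ("204", "2.5K") into a number.
// Values displayed with a K suffix are already rounded by Medium, so the
// result is approximate above 1000 claps.
const parseClaps = (clapsString: string): number => {
  if (clapsString.endsWith('K')) {
    return Number(clapsString.slice(0, -1)) * 1000
  }
  return Number(clapsString)
}
```

For example, parseClaps('204') gives 204, while parseClaps('2.5K') gives 2500.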
```typescript
const clapsElement = post.querySelector('span > button')

const claps = safeGet(
  clapsElement,
  (el) => {
    const clapsString = el.textContent

    if (clapsString.endsWith('K')) {
      return Number(clapsString.slice(0, -1)) * 1000
    }

    return Number(clapsString)
  },
  0,
)
```

If the string representation of claps ends with K, we just remove the K and multiply the number by 1000. Pretty straightforward stuff.

Article’s URL and title

```typescript
const articleTitleElement = post.querySelector('h3')
const articleTitle = safeGet(articleTitleElement, (el) => el.textContent)

const articleUrlElement = post.querySelector<HTMLLinkElement>(
  '.postArticle-readMore a',
)
const articleUrl = safeGet(articleUrlElement, (el) => removeQueryFromURL(el.href))
```

Again, since the selectors are used inside the post context, we don’t need to get overly specific with their structure.

“Min read”

```typescript
const minReadElement = post.querySelector<HTMLSpanElement>('span[title]')
const minRead = safeGet(minReadElement, (el) => el.title)
```

Here we use a somewhat different selector: we look for a <span> that has a title attribute.

Note: later we read the element’s .title property, so it is important to distinguish between the title attribute (used in the selector) and the .title DOM property (used to read the value).

OK, we have now scraped all the articles currently displayed on the page, but how do we scroll down to load more articles?

Scroll to load more articles

```typescript
// scroll to the bottom of the page
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight)
})

// wait to fetch the new articles
await page.waitFor(7500)
```

We scroll the page to the bottom and wait for 7.5 seconds. This is a “safe” time: the articles could load in 2 seconds, but we would rather be sure that all posts are loaded than miss some. If time was an important factor, we would probably set up an interceptor on the request which fetches the posts and move on once it’s done.

When to end the scraping

If the posts were sorted by date we could stop the scraping the moment we came across an article from 2017.
However, since there are some weird cases of old articles showing up in between the articles from 2018, we cannot do this. What we can do instead is filter the scraped articles for those published in 2018 or later. If the resulting array is empty, we can safely assume that there are no more articles we are interested in.

In matchingArticles we keep the articles that were posted in 2018 or later, and in parsedArticles we have only the articles that were posted in 2018.

```typescript
const matchingArticles = scrapedArticles.filter((article) => {
  return article && new Date(article.date).getFullYear() >= 2018
})

if (!matchingArticles.length) {
  return articles
}

const parsedArticles = matchingArticles.filter((article) => {
  return new Date(article.date).getFullYear() === 2018
})

articles = [...articles, ...parsedArticles]
```

If matchingArticles is empty, we return all the articles and thus end the scraping.

Putting it all together

Here is the entire code needed to get the articles:

```typescript
const scrapArticles = async () => {
  const createPage = await createBrowser()

  return createPage<Article[]>('https://hackernoon.com/tagged/javascript', async (page) => {
    let articles: Article[] = []
    let offset = 0

    while (true) {
      console.log({ offset })

      const scrapedArticles: Article[] = await page.evaluate((offset) => {
        function safeGet<T extends Element, K>(
          element: T,
          callback: (element: T) => K,
          fallbackValue = null,
        ): K {
          if (!element) {
            return fallbackValue
          }
          return callback(element)
        }

        const removeQueryFromURL = (url: string) => url.split('?').shift()

        return Array.from(document.querySelectorAll('.postArticle'))
          .slice(offset)
          .map((post) => {
            try {
              const dateElement = post.querySelector('time')
              const date = safeGet(dateElement, (el) => new Date(el.dateTime).toUTCString(), '')

              const authorDataElement = post.querySelector<HTMLLinkElement>(
                '.postMetaInline-authorLockup a[data-action="show-user-card"]',
              )
              const { authorUrl, authorName } = safeGet(
                authorDataElement,
                (el) => {
                  return {
                    authorUrl: removeQueryFromURL(el.href),
                    authorName: el.textContent,
                  }
                },
                {},
              )

              const clapsElement = post.querySelector('span > button')
              const claps = safeGet(
                clapsElement,
                (el) => {
                  const clapsString = el.textContent

                  if (clapsString.endsWith('K')) {
                    return Number(clapsString.slice(0, -1)) * 1000
                  }

                  return Number(clapsString)
                },
                0,
              )

              const articleTitleElement = post.querySelector('h3')
              const articleTitle = safeGet(articleTitleElement, (el) => el.textContent)

              const articleUrlElement = post.querySelector<HTMLLinkElement>(
                '.postArticle-readMore a',
              )
              const articleUrl = safeGet(articleUrlElement, (el) => removeQueryFromURL(el.href))

              const minReadElement = post.querySelector<HTMLSpanElement>('span[title]')
              const minRead = safeGet(minReadElement, (el) => el.title)

              return {
                claps,
                articleTitle,
                articleUrl,
                date,
                authorUrl,
                authorName,
                minRead,
              } as Article
            } catch (e) {
              console.log(e.message)
              return null
            }
          })
      }, offset)

      offset += scrapedArticles.length

      // scroll to the bottom of the page
      await page.evaluate(() => {
        window.scrollTo(0, document.body.scrollHeight)
      })

      // wait to fetch the new articles
      await page.waitFor(7500)

      const matchingArticles = scrapedArticles.filter((article) => {
        return article && new Date(article.date).getFullYear() >= 2018
      })

      if (!matchingArticles.length) {
        return articles
      }

      const parsedArticles = matchingArticles.filter((article) => {
        return new Date(article.date).getFullYear() === 2018
      })

      articles = [...articles, ...parsedArticles]

      console.log(articles[articles.length - 1])
    }
  })
}
```

Before we save the data in a proper format, let’s sort the articles by claps in descending order:

```typescript
const sortArticlesByClaps = (articles: Article[]) => {
  return articles.sort((fArticle, sArticle) => sArticle.claps - fArticle.claps)
}
```

Now let’s output the articles to a readable format, because so far they exist only in the memory of our computer.

Output formats

JSON

We can use the JSON format to dump all the data into a single file.
Having all the articles stored this way may come in handy sometime in the future.

Converting to the JSON format comes down to typing:

```typescript
const jsonRepresentation = JSON.stringify(articles)
```

We could stop right now with the JSON representation of the articles and just copy and paste into our list the articles we believe belong there. But, as you can imagine, this can also be automated.

HTML

The HTML format will surely make it easier to copy and paste an item from the list than to manually copy everything from the JSON format.

David in his article listed the articles in the following manner:

David’s list format

We would like our list to be in a similar format. We could, again, use puppeteer to create and operate on HTML elements but, since we are working with HTML, we can just embed the values inside a string; the browser is going to parse them anyway.

```typescript
const createHTMLRepresentation = async (articles: Article[]) => {
  const list = articles
    .map((article) => {
      return `<li>
        <a href="${article.articleUrl}">${article.articleTitle}</a> by
        <a href="${article.authorUrl}">${article.authorName}</a>
        [${article.minRead}] (${article.claps})
      </li>`
    })
    .join('')

  return `<!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="UTF-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <meta http-equiv="X-UA-Compatible" content="ie=edge" />
        <title>Articles</title>
      </head>
      <body>
        <ol>${list}</ol>
      </body>
    </html>`
}
```

As you can see, we just .map() over the articles and return a string containing the data formatted the way we like. We now have an array of <li> elements, each representing an article. Now we just have to .join() them to create a single string and embed it inside a simple HTML5 template.

Saving the files

The last thing left to do is to save the representations in separate files.
```typescript
const scrapedArticles = await scrapArticles()
const articles = sortArticlesByClaps(scrapedArticles)

console.log(`Scraped ${articles.length} articles.`)

const jsonRepresentation = JSON.stringify(articles)
const htmlRepresentation = await createHTMLRepresentation(articles)

await Promise.all([
  fs.writeFileAsync(jsonFilepath, jsonRepresentation),
  fs.writeFileAsync(htmlFilepath, htmlRepresentation),
])
```

Note that fs.writeFileAsync is not part of the standard fs module; it assumes the module has been promisified (for example with Bluebird’s promisifyAll). fs.promises.writeFile would work just as well.

The results

According to the scraper there were 894 articles with the JavaScript tag published this year on Hackernoon, which averages 2.45 articles a day.

Here’s what the HTML file looks like:

```html
<li>
  <a href="https://hackernoon.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5">I’m harvesting credit card numbers and passwords from your site. Here’s how.</a> by
  <a href="https://hackernoon.com/@david.gilbertson">David Gilbertson</a>
  [10 min read] (222000)
</li>
<li>
  <a href="https://hackernoon.com/part-2-how-to-stop-me-harvesting-credit-card-numbers-and-passwords-from-your-site-844f739659b9">Part 2: How to stop me harvesting credit card numbers and passwords from your site</a> by
  <a href="https://hackernoon.com/@david.gilbertson">David Gilbertson</a>
  [16 min read] (18300)
</li>
<li>
  <a href="https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1">JAVASCRIPT 2018 — TOP 20 HACKERNOON ARTICLES OF THE YEAR</a> by
  <a href="https://hackernoon.com/@maciejcieslar">Maciej Cieślar</a>
  [2 min read] (332)
</li>
```

And now the JSON file:

```json
[
  {
    "claps": 222000,
    "articleTitle": "I’m harvesting credit card numbers and passwords from your site. Here’s how.",
    "articleUrl": "https://hackernoon.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5",
    "date": "Sat, 06 Jan 2018 08:48:50 GMT",
    "authorUrl": "https://hackernoon.com/@david.gilbertson",
    "authorName": "David Gilbertson",
    "minRead": "10 min read"
  },
  {
    "claps": 18300,
    "articleTitle": "Part 2: How to stop me harvesting credit card numbers and passwords from your site",
    "articleUrl": "https://hackernoon.com/part-2-how-to-stop-me-harvesting-credit-card-numbers-and-passwords-from-your-site-844f739659b9",
    "date": "Sat, 27 Jan 2018 08:38:33 GMT",
    "authorUrl": "https://hackernoon.com/@david.gilbertson",
    "authorName": "David Gilbertson",
    "minRead": "16 min read"
  },
  {
    "claps": 218,
    "articleTitle": "JAVASCRIPT 2018 -- TOP 20 HACKERNOON ARTICLES OF THE YEAR",
    "articleUrl": "https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1",
    "date": "Sat, 29 Dec 2018 16:26:36 GMT",
    "authorUrl": "https://hackernoon.com/@maciejcieslar",
    "authorName": "Maciej Cieślar",
    "minRead": "2 min read"
  }
]
```

I have probably saved myself a good 7–8 hours by creating a scraper which did all of the tedious, mind-numbing work for me. Once it was done, all that was left to do was to review the top articles and choose what to put in the list. The code took about an hour to create, whereas copying and pasting all the data by hand, let alone saving it in both HTML and JSON formats, would easily have taken a lot more time.

Here is the article, if you are interested in seeing what I chose to put in the list.

Originally published at www.mcieslar.com on January 7, 2019.