Learn how to leverage AWS and RServer for your scrapers

When you are regularly scraping websites, you might want to outsource that process. There are clear benefits of using cloud resources for scraping:

- schedule your scrapers
- scale the number of scrapers
- no need for an internet connection on your local machine
- free up computational resources on your local machine

Although I run only a couple of scrapers, the major advantage for me is that I can schedule my scrapers and do not need to worry about whether my laptop is connected to wifi when I'm not at home.

This article describes the architecture and steps to set up a free and remote scraper using RServer and AWS. I will not go into too many details but rather explain the concept behind it.

Also works with Python and on DigitalOcean

For those of you who prefer Python with BeautifulSoup or DigitalOcean, you can build a similar setup. The architecture would be the same, and the necessary steps are very similar.

Step 1: Create an AWS account

First, head over to Amazon Web Services and sign up for a free account. If you sign up for the first time, you get a 12-month trial period with free access to cloud resources. This is also referred to as free tier usage. These free resources do not include much processing power, but they are more than sufficient for our purposes.

Step 2: Install RServer on an EC2 instance

Next, create an EC2 instance and install RServer on it. There are great YouTube tutorials on how to do this, and it actually takes less than 10 minutes. Check out the two by Manuel using Ubuntu or CentOS. Both tutorials are great and only differ in the operating system of the EC2 instance.

With CentOS, you do not need to use any terminal commands for the installation. However, you might want to choose Ubuntu, as there are more help resources and tutorials available if you want to expand and configure your instance later.
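If you prefer the command line over following a video, the Ubuntu route roughly looks like the sketch below. The `.deb` URL is a placeholder: the exact release and version in the path change over time, so copy the current link from the official RStudio Server download page.

```
# On the Ubuntu EC2 instance (sketch; <release> and <version> are placeholders):
sudo apt-get update
sudo apt-get install -y r-base gdebi-core
wget https://download2.rstudio.org/server/<release>/amd64/rstudio-server-<version>-amd64.deb
sudo gdebi rstudio-server-<version>-amd64.deb
# The server is then reachable in the browser at http://<ec2-public-ip>:8787
```

Remember to open port 8787 in the security group of your EC2 instance, otherwise the login page will not load.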
Step 3: Install rvest and your favorite R packages

Now you can log into your RServer with the IP address of your EC2 instance. Check out any tutorial from the previous step if you don't know how. In RServer, you have to install all packages you need for your script. To scrape a website, you will most likely use rvest. Additionally, I installed all packages from the tidyverse to clean and pre-process the scraping results.

Note: Before you can install rvest, you might need to install openssl and libxml2 first. To do this, log into your instance with the terminal and install both packages:

```
# On Ubuntu:
ssh -i ".ssh/rstudio.pem" ec2-user@<server>.compute.amazonaws.com
sudo apt-get install libssl-dev
sudo apt-get install libxml2-dev

# On CentOS:
ssh -i ".ssh/rstudio.pem" ec2-user@<server>.compute.amazonaws.com
sudo yum install openssl-devel
sudo yum install libxml2-devel
```

Afterwards, you should be able to successfully install rvest and scrape webpages.

Step 4: Install cronR on RServer

So far, you are able to scrape websites from your AWS instance. But to leverage the advantages of your cloud instance, install cronR. The cronR package allows you to schedule your scripts and scrapers using crontabs.

That's it! Now you can scrape websites with R autonomously using an AWS instance. Just upload your scripts to RServer and schedule them with cronR. In addition, you can also connect more services to enhance your workflow:

Connect RServer with GitHub

If you are already using GitHub, this step might seem natural to you. If you are not using GitHub, start considering it. I keep all my scrapers in a private repository and synchronize it with RServer and my local machine. That way, I make sure I'm always working with the most recent script, which makes my scrapers easier to maintain.

Add a database to store your results

Finally, you can set up a database that stores the results of your scraper.
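To make Steps 3 and 4 concrete before turning to storage, a minimal scraper script plus its cronR schedule could look like the following sketch. The URL, CSS selector, and file paths are placeholders for illustration, not part of my actual setup:

```
# scraper.R -- minimal example scraper; URL and selector are placeholders
library(rvest)

page    <- read_html("https://example.com/news")
titles  <- html_text2(html_elements(page, "h2.headline"))
results <- data.frame(scraped_at = Sys.time(), title = titles)
write.csv(results, "/home/rstudio/results.csv", row.names = FALSE)

# Run once in the RServer console to schedule the script daily at 7:00:
# library(cronR)
# cmd <- cron_rscript("/home/rstudio/scraper.R")
# cron_add(cmd, frequency = "daily", at = "07:00", id = "news_scraper")
```

cron_add() writes an entry to the crontab of the RServer user, so the script keeps running on schedule even when you are logged out.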
With your AWS free tier usage, you can set up a MySQL, PostgreSQL, MariaDB, or Oracle database. I use a MySQL database to store my scraper results. Every time a scraper is done, the results are added to the database. This way, I make sure to push my results to permanent storage that is unaffected if I pause or terminate the EC2 instance.

Another benefit is that I can directly access the most recent entries from further services. For example, dashboards and visualizations in Google Data Studio, Tableau, Plotly, etc. are always up to date. And I can, of course, also access the database from my local machine with programs like Sequel Pro.

The architecture to run a remote webscraper using R and AWS
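As a sketch of the database step, appending scraper results to a MySQL table from R can be done with the DBI and RMySQL packages. The host, database name, credentials, and table name below are placeholders:

```
# Append scraper results to a MySQL table (host/credentials are placeholders)
library(DBI)

con <- dbConnect(RMySQL::MySQL(),
                 host     = "<rds-endpoint>.rds.amazonaws.com",
                 dbname   = "scraping",
                 user     = "scraper",
                 password = Sys.getenv("DB_PASSWORD"))

results <- data.frame(scraped_at = Sys.time(), title = "example headline")
dbWriteTable(con, "results", results, append = TRUE, row.names = FALSE)
dbDisconnect(con)
```

Reading the password from an environment variable keeps credentials out of the script, which matters once the script is synchronized to a repository.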