What is Web Scraping?

Web scraping is a technique for extracting large amounts of data from a website and displaying it or storing it in a file for further use.

Often, companies need to extract volumes of data from a particular site to process and analyze. Web scraping with Ruby on Rails is an easy way to do that.

It is used to crawl and extract the required data from a static website or a JS-rendered website.

You can also use a web scraping API tool to integrate data harvesting directly into your Ruby on Rails project. The choice comes down to the type of scraping you want to carry out, the extent of your coding skills, and the degree to which you hope to automate the processes involved.

When talking about web scraping using Ruby, here are a few terms to get familiar with:

  • Nokogiri:
    • A Ruby gem for web scraping; Nokogiri parses HTML/XML and selects elements using CSS selectors or XPath (see the snippet after this list).
  • Capybara:
    • Allows JS-based interaction with websites by driving a browser.
  • Kimurai:
    • A framework for web scraping with Ruby.
    • A combination of Nokogiri + Capybara.
    • Allows scraping data from JS-rendered websites and even via static HTTP requests.
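
For instance, here is a minimal Nokogiri snippet showing both selection styles; the HTML string is a stand-in for a real downloaded page:

require 'nokogiri'

# Parse an HTML string (a fetched page body works the same way)
doc = Nokogiri::HTML("<html><body><h1>Hello</h1></body></html>")

doc.css("h1").text     # CSS selector => "Hello"
doc.xpath("//h1").text # XPath        => "Hello"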

There are a few tools available for web scraping, such as Nokogiri, Capybara, and Kimurai. But Kimurai is the most powerful framework for scraping data.

Kimurai

Kimurai is a web scraping framework in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and allows us to scrape and interact with JavaScript-rendered websites. It also works alongside any Ruby web scraping library for extra functionality.

Features:

  • Ruby web scraping for JavaScript websites.
  • Supports Headless Chrome, Headless Firefox, PhantomJS, or simple HTTP request (Mechanize) engines.
  • Capybara methods are used to fetch data.
  • Rich set of built-in helpers to make scraping easy.
  • Parallel scraping – processes web pages concurrently.
  • Pipelines – organize and store data in one place for processing across all spiders.

You can also scrape data from JS-rendered websites, e.g. infinitely scrollable websites, and even static websites. Amazing, right?!

Static Websites:

You can use this framework in two ways:

  1. Making a Rails app and extracting information with the help of models and controllers.
    • Create a new Rails app:

rails _5.2.3_ new web_scrapping_demo --database=postgresql

    • Change the database configuration in config/database.yml as required to run in the development environment.
    • Open the terminal and create a database for the web app:

rails db:create

    • Add the kimurai gem to the Gemfile:
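
gem 'kimurai'
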
    • Install the dependencies using:

bundle install

    • Generate a model using the command below, with Kimurai::Base as the parent instead of ApplicationRecord:

rails g model WebScrapper --parent Kimurai::Base

    • Perform database migrations for this generated model.

rails db:migrate

    • Generate a controller using:

rails g controller WebScrappers index

    • Make a root path for the index action:

root 'web_scrappers#index'

    • Add routes for the WebScrapper model:

resources :web_scrappers

    • Add a link to the index.html.erb file as shown below:

<%= link_to 'Start Scrap', new_web_scrapper_path %>

    • Now add an action in the WebScrappersController to perform scraping:

def new
  WebScrapper.crawl!
end

Note: Here, crawl! performs a full run of the spider. The parse method is very important and should be present in every spider, as parse is the entry point of any spider.

    • Now add some website configuration in the model for the site you need to scrape (a sketch follows the attribute descriptions below).

Here,

@name = name of the spider/web scraper.

@engine = specifies the supported engine.

@start_urls = array of start URLs, processed one by one inside the parse method.

@config = optional; provides various custom configurations such as user_agent, delay, etc.
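
A minimal sketch of such a model is shown below. The spider name, start URL, and config values are placeholders, not the original project's values:

class WebScrapper < Kimurai::Base
  @name = "web_scrapper_spider"                  # name of the spider (placeholder)
  @engine = :mechanize                           # simple HTTP requests, no JavaScript
  @start_urls = ["https://example.com/products"] # placeholder start URL
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64)", # optional custom user agent
    before_request: { delay: 2..4 }                # optional random delay between requests
  }
end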

Note: You can use any of the supported engines here. With :mechanize, no configuration or installation is involved; it works for simple HTTP requests but does not render JavaScript. The other engines, such as :selenium_chrome, :poltergeist_phantomjs, and :selenium_firefox, are all JavaScript-capable and run in headless mode.

    • Add the parse method to the model to initiate the scraping process (a sketch follows the descriptions below).

Here, in the parse method,

response = Nokogiri::HTML::Document object for the requested website.

url = string URL of the processed web page.

data = used to pass data between two requests.

The data to be fetched from the website is selected using XPath and structured as per the requirement.
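
A minimal sketch of such a parse method is shown below; the XPath selectors and item keys are illustrative placeholders rather than the original project's markup:

def parse(response, url:, data: {})
  # response is a Nokogiri::HTML::Document; select nodes with XPath
  response.xpath("//div[@class='product']").each do |product|
    item = {
      title: product.xpath(".//h2").text.strip,
      price: product.xpath(".//span[@class='price']").text.strip
    }
    # save_to is a built-in Kimurai helper; it appends each item to the file
    save_to "results.json", item, format: :json
  end
end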

    • Open the terminal and run the application using:

rails s

    • Click on the link 'Start Scrap'.
      • The results will be saved in the results.json file using the save_to helper of the gem.
    • Now check the stored JSON file; you will find the scraped data (illustrated below).
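
The stored file holds an array of the scraped items; the values below are purely illustrative placeholders:

[
  { "title": "Sample product", "price": "$10" },
  { "title": "Another product", "price": "$15" }
]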

Hooray!! You have extracted information from a static website.

  2. Making a simple Ruby file for extracting the information.
    • Open the terminal and install kimurai using the following command:

gem install kimurai

    • You can refer to the code written for the generated model and make a Ruby file from it (see the sketch after these steps).
    • Run that Ruby file using:

ruby filename.rb
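
A standalone version of the spider might look like the sketch below; as before, the spider name, URL, and selectors are placeholders:

require 'kimurai'

class WebScrapper < Kimurai::Base
  @name = "web_scrapper_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/products"]

  def parse(response, url:, data: {})
    response.xpath("//div[@class='product']").each do |product|
      item = { title: product.xpath(".//h2").text.strip }
      save_to "results.json", item, format: :json
    end
  end
end

WebScrapper.crawl!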

Dynamic Websites / JS-rendered Websites:

Prerequisites:

Install browsers with web drivers:

For Ubuntu 18.04:

  • For automatic installation, use the setup command:

$ kimurai setup localhost --local --ask-sudo

Note: It works using Ansible. If Ansible is not installed, install it using:

$ sudo apt install ansible

  • First, install the basic tools:

sudo apt install -q -y unzip wget tar openssl
sudo apt install -q -y xvfb

  • For manual installation, follow the commands for the specific browsers, for example the sketch below.
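
For example, a manual Chromium install on Ubuntu might look like the following; exact package names vary by release, so treat this as a sketch:

sudo apt install -q -y chromium-browser chromium-chromedriver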

You can use this framework in two ways:

  • Making a Rails app and extracting information with the help of models and controllers.
    • Follow all the steps above for static websites.
    • Change @engine from :mechanize to :selenium_chrome to use the Chrome driver for scraping (see the sketch below).
    • Also, change the parse method in the model to get the desired output.
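
For instance, the engine change is a one-line switch in the model sketched earlier; everything else stays the same:

class WebScrapper < Kimurai::Base
  @name = "web_scrapper_spider"
  @engine = :selenium_chrome  # headless Chrome; renders JavaScript before parsing
  @start_urls = ["https://example.com/products"]

  # parse method as before, with selectors adjusted to the JS-rendered markup
end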

  • Making a simple Ruby file for extracting the information.
    • Open the terminal and install kimurai using the following command:

gem install kimurai

    • You can refer to the code written for the generated model in the dynamic website section and make a Ruby file from it.
    • Run that Ruby file using:

ruby filename.rb

You can find the whole source code here.

Visit BoTree Technologies for excellent Ruby on Rails web development services and hire Ruby on Rails web developers with experience in handling marketplace development projects.

Reach out to learn more about New York web development agencies and the various ways they can improve the quality of projects across your company.

Consulting is free – let us help you grow!