CLI scraping gem using Object-Oriented Ruby and Nokogiri

Salma Elmasry
5 min read · Mar 18, 2018


What is Nokogiri?

Nokogiri is a Ruby gem for parsing HTML and XML, with support for SAX parsing as well. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors and pull out the desired data (ref).
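For example, a few lines are enough to parse a page and query it with a CSS selector (a minimal sketch; the URL and selector here are placeholders):

```ruby
require 'nokogiri'
require 'open-uri'

# Parse a page into a searchable Nokogiri document
# (on Rubies older than 2.5, use open instead of URI.open).
doc = Nokogiri::HTML(URI.open("https://example.com"))

# Search the document with a CSS selector and print each match.
doc.css("h1").each { |heading| puts heading.text }
```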

Application overview

One of the most challenging tasks for me, as a beginner, was scraping a website with a hierarchical structure and nested data linked to external pages. The application was designed to let the user explore the website of a pump manufacturing company (Packo) that produces different pump models with different specifications. Specifically, they produce six main pump models, and under each model there is a different set of pump series. My goal was to build a user-friendly application that allows users to navigate the pump models and find the available series under each model. I used Object-Oriented Ruby to build this application. In the following sections, I will demonstrate how to implement the scraping technique using Nokogiri to build this application. You can view and download the gem from GitHub.

First step: Environment setup

1. Run bundle gem <file_name> in your terminal.

Tip: Be cautious when naming your gem. Don’t use dashes; instead, use underscores (ex. pump_selection).

2. To run the file as ./bin/selection, create a new file called selection.rb inside the /bin folder. Next, make it executable by running chmod +x selection.rb. Finally, run ls -lah to confirm that the file’s permissions have changed.

3. In lib/pump_selection.rb, require all the needed libraries, such as pry, nokogiri, and open-uri.

You can install a gem by running gem install <gem name> in your terminal (ex. gem install nokogiri).

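Requiring the gem libraries might look like this (a minimal sketch; the require_relative paths assume scraper.rb and cli.rb live under lib/pump_selection/):

```ruby
# lib/pump_selection.rb
require 'pry'       # debugging console
require 'nokogiri'  # HTML/XML parsing
require 'open-uri'  # open URLs as readable IO objects

require_relative 'pump_selection/scraper'
require_relative 'pump_selection/cli'
```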

4. Add the gem dependencies to the pump_selection.gemspec file.
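The relevant gemspec lines might look like this sketch (versions are omitted, and which gems are runtime vs. development dependencies is an assumption):

```ruby
# pump_selection.gemspec (excerpt)
Gem::Specification.new do |spec|
  # ...name, version, authors, summary...

  spec.add_dependency "nokogiri"            # runtime dependency

  spec.add_development_dependency "bundler" # development-only dependencies
  spec.add_development_dependency "pry"
end
```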

Second step: Instantiate your classes

  1. Create two files in your lib folder: scraper.rb, which will be responsible for the scraping methods, and cli.rb, which will be responsible for the interface with the user.
  2. Namespace the constants and methods in both scraper.rb and cli.rb under the PumpSelection module, referenced with the :: syntax (ex. PumpSelection::Scraper).
  3. Add an attr_accessor for both name and series in scraper.rb.
  4. Initialize both variables with a nil value and set them to the name and series.
  5. Initialize a class variable @@all equal to an empty array, representing the collection array.
  6. Build a fake cli-interface to help you test the methods as you go (both files are sketched below).
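Here is a sketch of how both files might be set up at this point (the exact names beyond Scraper, name, series, and @@all are assumptions):

```ruby
# lib/pump_selection/scraper.rb
module PumpSelection
  class Scraper
    attr_accessor :name, :series

    @@all = [] # the collection array

    def initialize(name = nil, series = nil)
      @name   = name
      @series = series
    end

    def self.all
      @@all
    end
  end
end

# lib/pump_selection/cli.rb
module PumpSelection
  class CLI
    # A fake CLI interface for testing the scraper methods as you go.
    def call
      puts "Welcome to the pump selector!"
      # scraper methods get wired in here as they are written
    end
  end
end
```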

Third step: Pull out the targeted data

  1. Inspect the site’s HTML using Google Chrome’s developer tools: View => Developer => Developer Tools, or Cmd + Opt + I if you are on a Mac.
  2. Select the targeted element by hovering over it in the HTML (Cmd + Shift + C).
  3. Copy the CSS selectors.

Fourth step: Build the scraping methods (scraper.rb)

1- Name each method in a descriptive way. This will help you understand the objective of the code if you revisit it in the future.

2- In the HTML, the pump products and their corresponding series were found in one div (“div.listingByBlockContainer div.categoryBlock”). Thus, I built one scraper method that manages parsing the page and selecting the targeted div (this and the following methods are sketched below, after the tip).

3- Create another Ruby method that scrapes both the product name and the corresponding series names and pushes them into the collection array.

4- To display the product names (#display_product), I iterated over the collection array using #each.with_index and printed only the product name.

5- To display the series names (#display_series), I built an additional Ruby method that reads the user’s input and translates it into an array index.

Tip: you will need to convert the string input to an integer value and subtract one, since array indices start at 0.
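A hedged sketch of those methods, continuing the Scraper class from earlier. The div selector is the one quoted in point 2; the page URL and the inner selectors for the name and series are placeholders:

```ruby
require 'nokogiri'
require 'open-uri'

module PumpSelection
  class Scraper
    URL = "https://example.com/pumps" # placeholder; use the Packo listing page here

    # 2. Parse the page and select the targeted div.
    def self.scrape_page
      doc = Nokogiri::HTML(URI.open(URL))
      doc.css("div.listingByBlockContainer div.categoryBlock")
    end

    # 3. Scrape each product name and its series names,
    #    and push a new pump object into the collection array.
    def self.scrape_products
      scrape_page.each do |block|
        pump        = new
        pump.name   = block.css("h3").text.strip                 # assumed selector
        pump.series = block.css("li").map { |li| li.text.strip } # assumed selector
        @@all << pump
      end
    end

    # 4. Display only the product names, numbered starting at 1.
    def self.display_product
      @@all.each.with_index(1) { |pump, i| puts "#{i}. #{pump.name}" }
    end

    # 5. Translate the user's input into an array index
    #    (convert the string to an integer and subtract one).
    def self.display_series(input)
      pump = @@all[input.to_i - 1]
      puts "#{pump.name} series:"
      pump.series.each { |series| puts "  - #{series}" }
    end
  end
end
```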

Fifth step: Build the CLI interface (cli.rb)

Build two methods, #list_product and #list_series, that will communicate with the methods in scraper.rb.
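A minimal sketch of what those two methods might look like (the prompt text and overall flow are assumptions):

```ruby
# lib/pump_selection/cli.rb
module PumpSelection
  class CLI
    def call
      Scraper.scrape_products
      list_product
      list_series
    end

    # Ask the scraper to print the numbered list of pump products.
    def list_product
      puts "Please choose a pump model by number:"
      Scraper.display_product
    end

    # Read the user's choice and show the matching series.
    def list_series
      input = gets.strip
      Scraper.display_series(input)
    end
  end
end
```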

This is how the output looks:

The interface successfully shows the list of pumps to the user.
Going one level deeper invokes each product’s related series.

Conclusion

There are many lessons I learned from developing this application that I would like to share with new developers who intend to build their first scraping CLI gem.

  1. Try to pick a simple website with a simple structure.
  2. Plan your gem and imagine how the interface should work.
  3. Use debugging tools as much as you can (ex. binding.pry).
  4. Write a good README file telling users how to install the gem, state the instructions in detail, and include an overview of your gem.
  5. Refactor your code, and make it as DRY as you can.

Finally, I hope this post provided you with an easy entry point for applying Nokogiri to scrape websites under the Object-Oriented Ruby hood.

Happy coding!

References:

Flatiron — online web-developer program
