Throughout the Learn program, I am constantly putting the concepts and materials I am taught to the test (quite literally) through various labs (verified through RSpec tests) that are interspersed throughout the curriculum. At the end of larger modules, final projects are assigned, and after recently completing a module focused on Object Oriented Ruby (OO Ruby), I was tasked with building a command line interface (CLI) program packaged as a Ruby gem. I ultimately created a Ruby gem that scrapes information from TV.com, including a listing of the top shows as determined by its users, and pulls additional information on those shows.*
The gem is available for download from github.com (see link at end of this post).
The following is a screen-capture walkthrough of the gem in action:
And if you are looking for more of the nitty gritty details, read on…
The general requirements for the project were to:
- Build an Object Oriented Ruby application that provides a CLI interface to an external data source, whether scraped or via a public API.
- Data provided must go at least a level deep, generally by showing the user a list of available data and then being able to drill into a specific item.
- Package your code as a RubyGem and install a CLI for the user
In order to complete this project, I first outlined the steps that would be required to create a working application:
- Determine a domain of information and source to use in the application
- Build out the gem framework and enable git version control
- Define the methods required in the application
- Determine which elements to retrieve from the data source
- Program the methods to pull the appropriate data elements
- Build out the CLI flow control
- Test the code and iterate until it is working properly
- Refactor the code to simplify as necessary
- Run additional tests to confirm functionality still works
- Finalize documentation and publish the gem
Then, I got to work.
Determine a domain of information and source to use in the application
When I’m not working or programming, and need a little mindless entertainment, I’ll usually turn on the TV and find a show to watch. One of the places I like to look for information on showtimes and episodes is TV.com, which has an easy to use interface and lots of information.
TV.com has a page listing the top shows as determined by its users, with links to an individual web page for each show. After inspecting the markup with Google Chrome’s Developer Tools and looking at the hierarchy TV.com uses to display its different informational elements, I determined that it would be a suitable website to scrape. My first level of information would be the top 20 shows, and the second level would let the user select a show and drill down to view more information about it.*
Build out the gem framework and enable git version control
RubyGems.org is the primary online repository for Ruby gems, and the Bundler gem gives the developer a command that generates the basic components and folder structure of a new gem. I initialized the gem using the command:
bundle gem <gem name>
This also created template files for the LICENSE and README, among other files.
I also initialized git version control by running “git init” from the terminal within the local directory where I was building the application. This allowed me to track the changes I made and, if necessary, roll back to a previously committed change. It would also let me push the application to a repository on github.com and eventually publish it to rubygems.org.
Define the files/methods required in the application
I split the functionality of my application into three main classes: CLI, Scraper, and Show (model).
The CLI class would be responsible for the interaction between the end user and the application, calling for information to be scraped, saving that information to the Show class and displaying the information.
The Scraper class, called by the CLI class, would access the listing of all shows or a specific show page and scrape the specified attributes.
The Show class would be responsible for holding the attributes about the Shows that were scraped.
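As an illustration, this division of responsibilities might be sketched as follows. The module name TopShows, the method names, and the constructor arguments are hypothetical and not taken from the gem itself, and the scraper methods are deliberately left as stubs.

```ruby
# Hypothetical skeleton; all names are illustrative, not the gem's own.
module TopShows
  # Model: holds the attributes scraped for a single show.
  class Show
    attr_reader :attributes

    def initialize(attributes = {})
      @attributes = attributes
    end

    # Merge in detail-page attributes once they have been scraped.
    def add_details(details)
      @attributes.merge!(details)
      self
    end
  end

  # Scraper: fetches and parses pages (stubbed out in this sketch).
  class Scraper
    def scrape_show_list
      raise NotImplementedError, "would fetch and parse the top shows page"
    end

    def scrape_show_details(url)
      raise NotImplementedError, "would fetch and parse #{url}"
    end
  end

  # CLI: asks the scraper for data, stores it in Show objects, displays it.
  class CLI
    attr_reader :shows

    def initialize(scraper = Scraper.new)
      @scraper = scraper
      @shows = []
    end

    def load_shows
      @shows = @scraper.scrape_show_list.map { |attrs| Show.new(attrs) }
    end
  end
end
```

Passing the scraper into the CLI is one possible design choice; it makes it easy to substitute a stand-in scraper when testing without hitting the website.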
Determine which elements to retrieve from the data source
This task was pretty straightforward: I looked at the details available on the TV.com list page and on each show’s detail page.
For my initial Top 20 show listing, I decided to display the show name, its primary channel and its ranking. Having the ranking allowed me to order the listing from highest ranked to lowest. I also had to retrieve the URL for each show’s detail page so that, if a user chose to view details about that show, I had it available to scrape.
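A minimal sketch of ordering and formatting such a listing is shown here. The hash keys, the sample shows, and the format_listing helper are assumptions made up for illustration.

```ruby
# Sample data is invented; the keys (:name, :channel, :rank, :url) are assumed.
shows = [
  { name: "Show B", channel: "Channel 2", rank: 2, url: "/shows/show-b/" },
  { name: "Show A", channel: "Channel 1", rank: 1, url: "/shows/show-a/" }
]

# Order by rank (1 is the highest ranked) and build one display line per show.
def format_listing(shows)
  shows.sort_by { |show| show[:rank] }.map do |show|
    "#{show[:rank]}. #{show[:name]} (#{show[:channel]})"
  end
end

puts format_listing(shows)
```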
For the show detail, I wanted to pull a variety of elements about the show in general as well as cast and episode information. My final list was:
- Show Name
- Air Time and TV Channel
- Premiere Date
- User Ranking
- Number of Votes
- Show Summary
- Top Billed Cast
- Recent Episodes
Program the methods to pull the appropriate data elements
Once I had defined my methods and determined which data elements I was going to scrape, I used the Nokogiri gem and Ruby’s open-uri library to open a connection to TV.com and parse its pages. Nokogiri can identify elements precisely via XPath expressions or a little more abstractly through CSS selectors. I chose the latter for this project.
Scraping both the initial list of twenty shows and each show’s respective information had its challenges.
In scraping the show list, I realized I had to account for an ad that TV.com places in the middle of the list, following the second show, and which uses some of the same CSS selectors as the shows. I did not want the ad in my list, and when I made no attempt to remove it, the scrape threw an error at the ad and could not continue past it. Consequently, I added error handling to the scraping process so that it skips such an item and moves on to the next one. In this way, I was able to get only the top 20 real “show” elements.
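The skip-on-error idea can be sketched without touching the network. In this sketch the list items are plain hashes standing in for parsed page nodes, and the field names and the parse_items helper are hypothetical; the real scraper would read these values from Nokogiri nodes instead.

```ruby
# Stand-ins for parsed list items. The middle entry mimics the ad: it matches
# the list selector but lacks the fields a real show entry has.
raw_items = [
  { name: "Show A", channel: "Channel 1", rank: "1" },
  { headline: "Sponsored content" },
  { name: "Show B", channel: "Channel 2", rank: "2" }
]

def parse_items(raw_items)
  raw_items.each_with_object([]) do |item, shows|
    begin
      shows << {
        name: item.fetch(:name),           # raises KeyError for the ad
        channel: item.fetch(:channel),
        rank: Integer(item.fetch(:rank))   # raises ArgumentError if not numeric
      }
    rescue KeyError, ArgumentError
      next # not a real show entry, so move on to the next item
    end
  end
end

puts parse_items(raw_items).map { |show| show[:name] }
```

Rescuing only the specific errors expected from a malformed item, rather than every exception, keeps genuine bugs from being silently swallowed.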
Scraping the cast and episode information was an interesting challenge as well. I wanted to get and store a series of pieces of information about each cast member and episode listed (usually around 5). Within my hash containing the show details, I created empty arrays for cast members and episodes; then, as I iterated through the listings of episodes and cast members, I created a hash for each one and added it to the corresponding array. When the information was displayed back to the user, I had to iterate again through these arrays of hashes stored in memory.
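The nested structure described above might look roughly like this. The keys, the sample rows, and the helper names are invented for illustration and are not the gem's actual fields.

```ruby
# Details hash with empty arrays waiting for cast and episode entries.
details = { name: "Example Show", cast: [], episodes: [] }

# Made-up rows standing in for values scraped from the detail page.
cast_rows = [["Actor One", "Role One"], ["Actor Two", "Role Two"]]
episode_rows = [["Pilot", "S1 E1"], ["Second Episode", "S1 E2"]]

# Build one hash per cast member or episode and append it to its array.
cast_rows.each do |name, role|
  details[:cast] << { name: name, role: role }
end
episode_rows.each do |title, number|
  details[:episodes] << { title: title, number: number }
end

# Iterate the stored arrays again when displaying to the user.
def cast_lines(details)
  details[:cast].map { |member| "#{member[:name]} as #{member[:role]}" }
end

def episode_lines(details)
  details[:episodes].map { |ep| "#{ep[:number]}: #{ep[:title]}" }
end

puts cast_lines(details)
puts episode_lines(details)
```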
Build out the basic CLI flow control
I wanted to make the app simple and straightforward to use, and I reflected that in the flow control. On launch, the gem displays a listing of the top 20 shows from TV.com and prompts the user to enter a number corresponding to the ranking of a show. The details for the selected show are then scraped and displayed. If the user wants to see the list of top shows again, they can enter “list” to pull up the listing that was previously scraped and stored in memory. The app keeps prompting until the user enters “exit”, at which point it quits and returns to the command prompt. Finally, if a user enters an invalid entry (anything other than a number from 1-20, “list”, or “exit”), the error message “Invalid Option, Try again.” is printed and the user is prompted again for input. I also wanted to add color to the command line and make the error message red, and learned I could do so by wrapping the text in ANSI escape sequences, which consist of an escape character followed by a numeric color code:
puts "\n\e[31mInvalid Option, Try again.\e[0m"
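A minimal sketch of this input loop follows. The method names (red, parse_command, run) and the placeholder output lines are hypothetical; only the accepted inputs and the error message come from the description above.

```ruby
RED = "\e[31m"   # ANSI escape sequence that switches the text color to red
RESET = "\e[0m"  # ANSI escape sequence that restores the default style

def red(text)
  "#{RED}#{text}#{RESET}"
end

# Classify one line of input as :exit, :list, [:show, number], or :invalid.
def parse_command(input, show_count = 20)
  command = input.to_s.strip.downcase
  return :exit if command == "exit"
  return :list if command == "list"
  if command.match?(/\A\d+\z/) && command.to_i.between?(1, show_count)
    return [:show, command.to_i]
  end
  :invalid
end

# Prompt repeatedly until the user enters "exit" or the input ends.
def run(input: $stdin, output: $stdout)
  loop do
    output.print "Enter a show number, list, or exit: "
    line = input.gets
    break if line.nil?

    command = parse_command(line)
    case command
    when :exit then break
    when :list then output.puts "(stored list of shows would be printed here)"
    when :invalid then output.puts "\n#{red('Invalid Option, Try again.')}"
    else output.puts "(details for show number #{command.last} would be printed here)"
    end
  end
end
```

Calling run with no arguments would read from the terminal. Keeping the input classification in its own method, separate from the loop, is one way to make the flow easy to test.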
Test the code, refactor until it is working as expected
When I first programmed the scraping, I scraped the listing of the top twenty shows, which included the URL to each show’s detail page, and then immediately iterated through the listing and scraped each detail page. However, in my testing I found that all of this happened before any information beyond a welcome message was shown to the user, which made the program appear slow. Consequently, I decided to refactor my code to scrape the details for a show on demand, when the user specifically chose that show. As a result, the list of top shows appears quickly, and the scraping of each show is quick as well (when selected). Additionally, I decided that in a given run of the application, the detail information for a show would be scraped only once. The program checks whether a show’s details have been scraped by looking to see if the user_rating attribute is populated in the show’s hash (an attribute that only appears on the detail page). If it isn’t, the program scrapes the details and adds them to the show’s existing hash. For any subsequent lookup of the same show, the program sees that user_rating is populated and pulls directly from the hash without re-scraping.
Once I was pleased with the refactoring, I retested and confirmed that all the same functionality worked, with better response time.
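The scrape-once check can be sketched as follows. The details_for helper, the sample values, and the lambda standing in for the real scraper are assumptions for illustration; only the user_rating check comes from the description above.

```ruby
# Return the show's hash with details, scraping only on the first request.
# `scrape_details` is any callable that takes a URL and returns a hash.
def details_for(show, scrape_details)
  # user_rating only comes from the detail page, so if it is populated the
  # details were already scraped earlier in this run.
  return show if show[:user_rating]

  show.merge!(scrape_details.call(show[:url]))
end

# Stand-in scraper that counts how many times it is called.
calls = 0
fake_scraper = lambda do |url|
  calls += 1
  { user_rating: 8.7, summary: "Details scraped from #{url}" }
end

show = { name: "Example Show", url: "/shows/example-show/" }
details_for(show, fake_scraper) # first lookup scrapes and stores the details
details_for(show, fake_scraper) # second lookup reuses the stored hash
puts calls # the stand-in scraper ran only once
```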
Finalize documentation and publish the gem
Lastly, I completed the metadata that RubyGems requires for the gem in the gemspec file (also initially created by Bundler). I also wrote a README describing how the gem works, much as I have described here.
Currently my gem is available on GitHub and can be downloaded from the repo below and bundled locally. Feel free to enhance it and/or give me feedback on how it can be improved!
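For reference, gemspec metadata generally takes a shape like the sketch below. Every value here (gem name, version, author, summary, executable name) is a hypothetical placeholder rather than the actual gem's metadata; the Nokogiri dependency is assumed from its use in the scraper.

```ruby
# Hypothetical gemspec values; none are taken from the actual gem.
spec = Gem::Specification.new do |s|
  s.name        = "top_tv_shows"
  s.version     = "0.1.0"
  s.authors     = ["Your Name"]
  s.summary     = "CLI that lists top shows scraped from TV.com"
  s.files       = Dir["lib/**/*.rb"] + ["README.md"]
  s.executables = ["top-tv-shows"]  # the command installed for the user
  s.add_dependency "nokogiri"       # runtime dependency used for scraping
end
```

In a real .gemspec file the Gem::Specification.new expression is usually the last thing in the file; it is assigned to a variable here only so the sketch can be inspected.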