N Degrees of Wikipedia
Why I built N Degrees
After building my Keyword Counter, I thought I would try my hand at another web scraping project. I also wanted my next project to use search in some capacity because I’ve been interested in exploring how search functions worked. And that’s when I decided to build N Degrees.
Anyone who is familiar with American trivia games will know the common parlor game, “Six Degrees of Kevin Bacon.” For the uninitiated, the object of this game is to connect actors to Kevin Bacon in six degrees or fewer. For example, if you wanted to connect Kevin Bacon to Danny DeVito, you could reason that Kevin Bacon starred with Jack Nicholson in “A Few Good Men” and Jack Nicholson worked with Danny DeVito in “One Flew Over the Cuckoo’s Nest.” So that’s two degrees.
I thought it might be fun to try out the “Six Degrees” logic with Wikipedia—but instead of connecting actors through movies, I would use the links on a specific Wikipedia page to find a path to another Wikipedia page in a certain number of degrees. So, using my method, I could connect Kevin Bacon to Tom Hanks by discovering that the Kevin Bacon Wikipedia page is linked to the “Apollo 13” film Wikipedia page and the “Apollo 13” Wikipedia page is linked to the Wikipedia page for Tom Hanks. So that’s counted as two degrees.
I wanted my project to be flexible. I didn’t want users to only be able to search from the Kevin Bacon Wikipedia page—I wanted them to be able to put in any page they liked. I also didn’t want them to be limited to searching only within six degrees—I wanted users to be able to find related pages in any number of degrees. Finally, I wanted to show users the successful path from the beginning page to the end page. I needed to account for all of these factors in the design.
What it does
My N Degrees project does what I intended (aside from a few issues noted below).
To get started, the user runs the n_degrees() function and is then asked for a beginning Wikipedia url, an end Wikipedia url, and a target number of degrees in which to find the end url. The beginning url is then scraped for links, and the unique urls are added to a continuing “search” list and a SQL database, which captures the url and its predecessor url. If the end url is not found in the first attempt, the urls found in the first pass are then scraped, added to the search list and database, and so on. The program continues this cycle of searching, scraping, and cataloging urls until either the end url is found within the target number of degrees or the search fails. If the search succeeds, a success message is returned, along with the number of degrees needed to complete the search and the search path from the beginning url to the ending url.
How I designed it
I decided to start out by writing out a pseudocode outline. One of my takeaways from the Keyword Counter project was that having a detailed outline really helped me figure out where I was in the program design, what the code inputs needed to be, what I needed that code to return, and why (in other words, what the next steps should be).
This process took me many, many hours.
In retrospect, going through this exercise improved my workflow considerably, and it was an important first step. N Degrees is the first, “real” multi-step project I’ve worked on, and having an outline helped me avoid getting “lost in the code.” After writing out the structure, I knew what each of the subfunctions needed to do. All I had to figure out was how exactly to do it. This was especially helpful when I was working with SQL and recursion, which I hadn’t used much before, and breadth-first search, which was a completely new concept.
So, to give a quick overview of the code, this program has three types of functions:
- “Simple” subfunctions – these do basic things like get the initial inputs, create the SQL database, and close the SQL database. These functions are named accordingly and are pretty straightforward.
- “Complex” subfunctions – these are more involved functions that do things like scrape pages and find the return path.
- The n_degrees function – this is the master function that sets up the breadth-first search, orders all of the subfunctions in a logical sequence, and returns the end result.
I used my pseudocode outline to come up with descriptive names for the functions, which helped keep things nice and organized. I perhaps could have combined the “simple” functions into the main function, but keeping them separate helped to make things visually simple and keep me focused on the primary task at hand. You can see the complete code on my GitHub page.
Some shortcomings of this code
So, no program is perfect, and mine is certainly no exception. And although I’m reasonably happy with what I’ve built, I realize that there are a few issues with my solution.
1. The scope of this project is large.
This will come as a surprise to no one, but Wikipedia is really big and many pages have a lot of links. Like, a lot of links. And while that’s not a problem in and of itself, it does have a few implications for executing this program.
It can take a long while to run.
Doing a breadth-first search can take a while, especially if the breadth of the page is large and a node to the end page is deep within the scraped page. For example, if the beginning page has 128 links on it, but the node to the end page is the 127th link, then the program needs to scrape the links on 126 pages before it finally gets to the node on the path to the end page. And that’s a little inefficient.
(A quick brainstorming aside: One way to try to speed things up might be to somehow predict the likelihood that a given page is related to the end page by examining a set of factors, like whether the potential node page and the end page share the same topic, had a certain number of links in common, etc. And then the program could sort these pages by likelihood “rank” and scrape the pages in that order. That might make the search a bit more efficient.)
It’s not fully tested.
This was hard program to test. One problem was that I needed to verify that the “answer” N Degrees came up with was actually the best answer. This meant that I had to manually verify the “best” path beforehand. Given the number and interconnectedness of the pages, this meant that I could only realistically verify two degrees before getting hopelessly lost in the weeds. It would have been nice to have a “sandbox” to test with, where every path to each node was known so it would be easier to verify the results.
Another problem is that running N Degrees is a rather CPU-intensive process. I only dared to try it with two-degree searches. Even with that consideration, sometimes my computer took a long time to process and made loud, whirring noises. So, I don’t know how it handles more than two degrees of separation. There might be an AWS solution for that; that’s another thing to consider for later.
2. It travels all links on the page—including “hidden” links.
One quirk that I realized about Wikipedia pages is that all of the links that connect to other Wikipedia pages are relative links. I was able to use this to my advantage when I set up the search function—in my program, I only add pages to the search list that are relative links.
That decision came with an interesting side effect. Oddly, there are relative links that aren’t visible on the page. Usually these are in Wikitables that provide aggregated information about a given topic. Although these links aren’t visible on the page, they are visible when you view the page source. I discovered this when I tried to find a path from the Kevin Bacon Wikipedia page to the Paul Giamatti Wikipedia page. I had expected to find this page in two degrees of separation, but it only took one—because the Paul Giamatti page is linked on the Kevin Bacon page through a Wikitable (specifically, the Wikipedia category page, “Golden Globe Award for Best Actor – Limited Series or Television Film”). I couldn’t figure out how to filter these tables out of my web scraping results with Beautiful Soup. That’s not a huge deal for me, but I may try to address it at some later point.
3. The search function is limited.
N Degrees uses unidirectional search, which makes the search straightforward, but sometimes incomplete. If a start page and a target page have a page in common, the target page must have a link on the common page to complete the path. I realized this was a problem when I tried to test whether N Degrees could successfully connect the Paul Resnick Wikipedia page to the Charles Severance Wikipedia page. Both Paul Resnick and Charles Severance are professors at the University of Michigan (UMich)—and they also happen to offer fantastic Python courses on Coursera that I can personally vouch for.
I had expected that N Degrees would connect these two pages through the UMich page. Unfortunately, while both of these professors’ pages link to the UMich page, the UMich page isn’t a node to either of the professors’ pages. So there actually is a connection, but it wasn’t found. I think using a bidirectional, breadth-first search might solve this problem—that’s something I may explore further later.
Concluding thoughts
So, summing up, this was a fun project to work on. I really did learn a lot, and it was especially gratifying to successfully write the recursive search function (because recursion is a concept that it’s taken me a while to get my head around).
Going forward, I’m interested in writing more complicated programs and getting more practice working with interconnected functions. And I definitely want to do more with SQL and continue my study of search algorithms.
You can view this project (and all of my other projects) on GitHub.
Never miss a post!
Subscribe to the blog and get updates every two weeks!