Breadth-first Search in Python

Aug 21, 2020


What is breadth-first search?

Breadth-first search (BFS) is a search algorithm that visits the vertices of a graph through a (roughly) horizontal, level-by-level traversal. Breadth-first search is a lot like an archeological dig: you want to skim off the earth one fine layer at a time to find your fossil (or end target).

Breadth-first search is often mentioned in the same breath as depth-first search. Depth-first search differs from breadth-first search in that vertices are found by traversing the graph “vertically” instead of “horizontally.” To continue the excavation analogy, in a depth-first search, you dig a narrow, deep hole, and if you don’t find what you’re looking for, you move to the adjacent patch of earth, dig another deep hole, and so on.

The traditional BFS uses a queue to keep track of all of the vertices that still need to be visited. Basically, to find all of the vertices in a graph, you:

  1. Start at a vertex.
  2. Find the first adjacent vertex to your starting point.
  3. Determine whether you have visited that vertex before.
  4. If you have not visited the vertex, add it to a queue of vertices you need to visit.
  5. Progress to the next adjacent vertex to the starting point.
  6. When you have recorded all of the vertices adjacent to the starting point, make your queue the next set of vertices you need to traverse.
  7. Take the first item from your queue and make this your new starting point.
  8. Rinse and repeat.
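If it helps to see the steps above in code, here’s a minimal sketch of a textbook BFS that uses Python’s collections.deque as the queue. The adjacency dictionary is just a made-up example graph for illustration:

```python
from collections import deque

def bfs(graph, start):
    """Return the vertices reachable from start, in breadth-first order.

    graph is assumed to be an adjacency dict: {vertex: [neighbors]}.
    """
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        vertex = queue.popleft()          # take the first item from the queue
        order.append(vertex)
        for neighbor in graph[vertex]:
            if neighbor not in visited:   # skip vertices we've already seen
                visited.add(neighbor)
                queue.append(neighbor)    # queue it up for the next layer
    return order

# Example: "A" is the starting point, "B" and "C" are its layer, then "D"
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D']
```

Note how the visited set and the queue split the work between them: the set remembers where you’ve been, and the queue remembers where you still need to dig.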

Using principles of breadth-first search in code

The breadth-first search algorithm is used in many search functions, and my program, N Degrees of Wikipedia, relies heavily on its key features.

So, for a bit of background, N Degrees determines whether it’s possible to find a path from a beginning Wikipedia page to an end Wikipedia page within a given number of degrees of separation. To do this, it scrapes the links from the beginning page, adds those pages to a list and a SQL database, and increases the degree counter by 1. It then scrapes all of the pages in the first list, adds their links to a second list, increases the counter by 1, and so on until it either finds the page or exceeds the given number of degrees, in which case the search fails. If the search succeeds, the program returns the path from the beginning page to the end page, along with the number of degrees needed to make the search. To make the process more efficient, it keeps track of the pages that it has already scraped in the database and only adds newly discovered pages to the search list.

The scanning, adding, and cataloging cycle is essentially what a breadth-first search does. The main difference between a typical breadth-first search and my implementation is that I use Python lists and a SQL database to simulate the behavior of a queue.

To make this work, I needed to set up a few different components:

1. I created a SQL database using SQLite to serve as a container for all of the links I encountered throughout the search.

This database has two fields: current_url, to record the actual url of the scraped page (i.e. vertex), and predecessor_url, to record how I arrived at this page (i.e. the parent of the vertex). The predecessor_url field is used to determine whether I have already visited and scraped a particular page (and to find the return path in a later step).
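To sketch what that looks like with the sqlite3 module (the table name pages and the sample urls here are illustrative placeholders of my own, not the actual schema from N Degrees):

```python
import sqlite3

# In-memory database for illustration; the real program would use a file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS pages (
           current_url TEXT,         -- the scraped page (the vertex)
           predecessor_url TEXT      -- how we arrived there (the vertex's parent)
       )"""
)

# Record that we reached /wiki/Graph_theory by scraping /wiki/Algorithm
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("/wiki/Graph_theory", "/wiki/Algorithm"),
)

# A page counts as already scraped once it shows up as a predecessor_url
already_scraped = conn.execute(
    "SELECT 1 FROM pages WHERE predecessor_url = ? LIMIT 1",
    ("/wiki/Algorithm",),
).fetchone() is not None
print(already_scraped)  # True
```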


2. Next, I wrote a function, find_urls_on_page, that uses Beautiful Soup to look for new pages to add to my search list.

One nice thing about Wikipedia is that it uses relative links to link to its internal pages. So I was able to search through all of the links and find only the relative links—which were the ones relevant for my program—by looking for links that started with a “/”. These links are added to a Python list called “links_on_url”.
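A rough sketch of how such a function might look (the HTML snippet is a stand-in I wrote for illustration; the real find_urls_on_page would receive the fetched page’s HTML):

```python
from bs4 import BeautifulSoup

def find_urls_on_page(html):
    """Collect Wikipedia-internal (relative) links from a page's HTML.

    Keeps only hrefs that start with "/", since Wikipedia uses relative
    links for its internal pages.
    """
    soup = BeautifulSoup(html, "html.parser")
    links_on_url = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith("/") and href not in links_on_url:
            links_on_url.append(href)
    return links_on_url

html = """
<a href="/wiki/Graph_theory">Graph theory</a>
<a href="https://example.com">External site</a>
<a href="/wiki/Queue_(abstract_data_type)">Queue</a>
"""
print(find_urls_on_page(html))
# ['/wiki/Graph_theory', '/wiki/Queue_(abstract_data_type)']
```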


3. Then I set up my list containers, find_urls and all_urls_list in the main function, n_degrees.

The find_urls list includes all of the urls I started with when beginning a pass through a given degree level. The all_urls_list variable includes all of the urls found during the pass through a given degree level.

4. I also created a degree counter, called degree_counter (appropriately), which counts the number of degrees of separation traversed during the search.

Executing the program

With all of these pieces, everything is in place to run a query. To start, the program takes the beginning Wikipedia page as its starting point and adds it to a list called find_urls. If the end page is not the same as the beginning page (in which case the end page doesn’t actually need to be found), it increases the degree counter by 1 and checks to make sure that the page was not already marked as a predecessor_url in the database. If it wasn’t (which it isn’t on the first pass through), it scrapes that page with the find_urls_on_page function.

If the end page is found during the scrape of the url in the find_urls list, the end page and its predecessor_url are added to the database. Then the breadth-first search ends and the program moves on to find the return path.

If the end page is not found, then for each link that was scraped, the program queries the database to see whether that link already appears as a predecessor_url for other urls in the database. (This would indicate that the page has already been visited.) If it is not a predecessor_url, it is added to the all_urls_list.

Once the program cycles through all of the urls in the find_urls list, it clears the find_urls list and the new urls in the all_urls_list are added to the find_urls list—and the cycle begins again.

This cycle continues until either the end page is found or the page isn’t found within the target number of degrees. In this way, N Degrees is able to execute a breadth-first search.
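To tie the pieces together, here’s a simplified sketch of the search loop just described. It substitutes a plain Python set for the SQL database and a stand-in get_links function for the scraping step, so it illustrates the structure rather than reproducing the actual N Degrees code:

```python
def n_degrees_sketch(get_links, start_url, end_url, max_degrees):
    """Sketch of the search loop: scan a degree level, collect new links,
    then promote them to the next level's search list.

    get_links(url) stands in for find_urls_on_page plus the page fetch;
    the visited set stands in for the database of predecessor_urls.
    """
    if start_url == end_url:
        return 0                          # nothing to find
    find_urls = [start_url]
    visited = {start_url}
    degree_counter = 0
    while find_urls and degree_counter < max_degrees:
        degree_counter += 1
        all_urls_list = []
        for url in find_urls:
            for link in get_links(url):
                if link == end_url:
                    return degree_counter # found it: report the degree
                if link not in visited:   # not yet a predecessor_url
                    visited.add(link)
                    all_urls_list.append(link)
        find_urls = all_urls_list         # next degree level's starting list
    return None                           # not found within max_degrees

# Tiny fake "Wikipedia" for illustration
fake_links = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
print(n_degrees_sketch(lambda u: fake_links.get(u, []), "A", "D", 3))  # 2
```

The swap at the bottom of the while loop, where all_urls_list becomes the new find_urls, is the list-based equivalent of refilling a queue with the next layer of vertices.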

Concluding thoughts

So that’s a brief explanation of BFS and how I used features of it in my code for N Degrees. (Also, a quick side note: While writing this blog post, I noticed a few inefficiencies that I’ve since gone back and corrected. The process of explaining concepts to others really does help you see things in a new light.)

It is important to know the basics of BFS, to be sure, but I think it is also important to understand the algorithm well enough to adapt its key components to new situations. Although my N Degrees program does not use the typical structure of a breadth-first search, it does capture its essence. I think that being able to extract the core elements of concepts and modify them will allow me to be more creative with my code and ultimately design better solutions.

You can view the complete code on GitHub.
