Project 1: Scientific School weblinks

In this project we aim to find all hyperlinks outgoing from the Scientific Programming School homepage.

Problem

The task in this starter project: given the homepage at https://scientificprogramming.io, extract every hyperlink (the href value of each <a> tag) that leads out from the page.

Solution

The solution below uses the BS4 method findAll (spelled find_all in current releases) of a BeautifulSoup object. It first downloads the raw HTML:

import urllib.request  # in Python 3, urlopen lives in urllib.request
html_page = urllib.request.urlopen("https://scientificprogramming.io")

Then a BeautifulSoup object, soup, is created from the downloaded page:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, "html.parser")  # name the parser explicitly

Finally, we use this object to find all links:

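import re
# find_all is the current spelling of findAll; keep only <a> tags whose href starts with https://
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))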
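
The href filter in the loop above can be checked in isolation. The sketch below is only an illustration: the sample href values are made up, and it relies on BS4 applying a compiled pattern with its search() method, so the ^ anchor is what pins the match to the start of the value.

```python
import re

# Same pattern as in the loop above.
https_pattern = re.compile("^https://")

# Made-up href values, for illustration only.
sample_hrefs = [
    "https://scientificprogramming.io/courses",
    "http://example.com/",
    "/about",
    "mailto:someone@example.com",
    "http://example.com/?next=https://example.org",
]

# search() plus the ^ anchor keeps only values that begin with https://
kept = [href for href in sample_hrefs if https_pattern.search(href)]
print(kept)
```

Without the ^ anchor, the last sample value would also be kept, because https:// appears later in the string.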

Note that we use a Python regular expression (re.compile) and match only links that start with https://, that is, links served over SSL/TLS. Executed under Python 3.6.9, the code prints one matching URL per line.
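
The three steps can also be gathered into one script. What follows is a minimal sketch, not the course's own listing: the helper names fetch_html and extract_https_links are hypothetical, the sample HTML is made up so the script runs without network access, and the bs4 package is assumed to be installed.

```python
import re
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_https_links(html):
    """Return the href of every <a> tag whose href starts with https://."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", attrs={"href": re.compile("^https://")})
    return [a.get("href") for a in anchors]


# Made-up HTML, for illustration only.
sample_html = """
<html><body>
  <a href="https://example.com/one">one</a>
  <a href="http://example.com/two">two</a>
  <a href="/three">three</a>
  <a name="no-href">four</a>
  <a href="https://example.org/five">five</a>
</body></html>
"""

for href in extract_https_links(sample_html):
    print(href)
```

To run it against the live page, pass the result of fetch_html("https://scientificprogramming.io") to extract_https_links; the output then depends on the page's content at that moment.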