Kayode Oluwasegun

Networked Programs in Python - But With a Program to Convert Web Pages to Markdowns

Dec 8, 2021

HTTP (Hypertext Transfer Protocol) is the protocol that powers the web. It lets two programs, typically a client such as a browser and a web server, open a connection and exchange data.

A protocol is a set of rules and guidelines for communicating data. Rules are defined for each step and process during communication between two or more computers. Networks have to follow these rules to successfully transmit data.
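
To make the idea of "rules" concrete, here is a minimal sketch of the text an HTTP/1.0 client is expected to send: a request line, optional header lines, and a blank line that ends the request. The helper name `build_request` and the `Host` header are illustrative assumptions, not part of the original example.

```python
def build_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request (illustrative helper)."""
    lines = [
        "GET {} HTTP/1.0".format(path),  # request line: method, path, version
        "Host: {}".format(host),         # header line naming the target host
        "",                              # empty line marks the end of the headers
        "",
    ]
    # HTTP separates lines with CRLF ("\r\n"), not a bare newline
    return "\r\n".join(lines).encode("ascii")


request = build_request("data.pr4e.org")
print(request)
```

The server applies the same rules in reverse: it reads up to the blank line before deciding how to respond.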

Python provides a built-in module called socket for opening connections and sending data over them.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(("data.pr4e.org", 80))

# sendall() expects bytes, so encode the request string
cmd = "GET http://data.pr4e.org/ HTTP/1.0\r\n\r\n".encode()
mysock.sendall(cmd)

# keep receiving until recv() returns an empty bytes object,
# which indicates the server has closed the connection

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()
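
The bytes printed by the loop above contain both the response headers and the body, separated by a blank line. Here is a small sketch of how they could be separated once the whole response has been collected. `split_response` is a hypothetical helper, and the sample bytes are made up for illustration.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (status_line, headers, body)."""
    # headers and body are separated by the first empty line (CRLF CRLF)
    head, _, body = raw.partition(b"\r\n\r\n")
    header_lines = head.decode("iso-8859-1").split("\r\n")
    status_line = header_lines[0]
    headers = {}
    for line in header_lines[1:]:
        # each header line has the form "Name: value"
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# made-up response, used only to demonstrate the helper
sample = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nHello"
status, headers, body = split_response(sample)
print(status, headers, body)
```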

Python also provides a library called urllib that handles HTTP for you and abstracts away the request and response headers.
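
As a rough sketch of what that abstraction looks like in practice, the response object returned by `urllib.request.urlopen` exposes the headers separately from the body, so no manual parsing is needed. The helper name `fetch_text` and the UTF-8 fallback are assumptions made for this example.

```python
import urllib.request


def fetch_text(url):
    """Fetch a URL and return (content_type, text) (illustrative helper)."""
    with urllib.request.urlopen(url) as response:
        # the headers are already parsed into a message object
        content_type = response.headers.get("Content-Type")
        # assumption: fall back to UTF-8 when the server names no charset
        charset = response.headers.get_content_charset() or "utf-8"
        text = response.read().decode(charset)
    return content_type, text
```

A call such as `fetch_text("http://data.pr4e.org/")` returns the content type reported by the server together with the decoded page text.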

Downloading a file from a web server

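import urllib.request, urllib.parse, urllib.error

# urlopen() returns a response object; read() gives the body as bytes
data = urllib.request.urlopen("http://data.pr4e.org/cover3.jpg").read()

# write the bytes to disk in binary mode
with open("image.jpg", "wb") as fhand:
    fhand.write(data)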
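
Calling `read()` loads the whole file into memory, which is fine for a small image. For larger files, one possible variation is to copy the response to disk in chunks. The name `download` and the 64 KiB chunk size are choices made for this sketch.

```python
import shutil
import urllib.request


def download(url, filename, chunk_size=64 * 1024):
    """Stream the body of `url` into `filename` and return the bytes written."""
    with urllib.request.urlopen(url) as response, open(filename, "wb") as fhand:
        # copyfileobj reads and writes up to chunk_size bytes at a time
        shutil.copyfileobj(response, fhand, chunk_size)
        return fhand.tell()
```

For example, `download("http://data.pr4e.org/cover3.jpg", "image.jpg")` would save the same image without holding it all in memory at once.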

urllib is particularly useful when you want to scrape a website and use the information it contains.

Now to the example: let's write a program that converts web pages to Markdown.

  • Install markdownify (pip install markdownify), which converts HTML to Markdown
  • Install BeautifulSoup (pip install beautifulsoup4), which parses HTML
  • The code below requests the URL you enter, parses the HTML, converts it to Markdown, and saves the result in a file named after the document title

from markdownify import markdownify as md
import urllib.request, urllib.parse, urllib.error
import ssl
from bs4 import BeautifulSoup

url = input("Enter URL: ")

# ignore SSL certificate errors (acceptable for a demo, unsafe in production)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

try:
    # pass the context so the SSL settings above are actually applied
    html = urllib.request.urlopen(url, context=ctx).read()
except (urllib.error.URLError, ValueError):
    print("Error opening URL")
    raise SystemExit(1)

soup = BeautifulSoup(html, "html.parser")

# remove the lengthy class, id, name and inline style attributes
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

# remove the tags in the list below, together with their contents
for tag in soup(["style", "script", "sidebar", "aside"]):
    tag.decompose()

prettified_html = soup.prettify()
print(prettified_html)

markdownified_html = md(prettified_html)

# name the file after the page title; fall back to "page" if there is none
title = soup.title.string.strip() if soup.title and soup.title.string else "page"
with open("{}.md".format(title), "w", encoding="utf-8") as fhand:
    fhand.write(markdownified_html)
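
Page titles often contain characters such as `/`, `:` or `?` that are awkward or invalid in file names, so using the title directly can fail. Below is a minimal sketch of one way to sanitise it first, using only the standard library. `safe_filename` and its fallback name `page` are hypothetical and not part of the program above.

```python
import re


def safe_filename(title, fallback="page"):
    """Turn a page title into a conservative file name stem."""
    # treat a missing title (None) as an empty string
    title = (title or "").strip()
    # replace every run of characters other than letters, digits, "_" and "-"
    stem = re.sub(r"[^\w\-]+", "-", title).strip("-")
    return stem or fallback


print(safe_filename("Hello: World / Docs?"))
print(safe_filename(None))
```

The result could then be used in place of the raw title when building the `.md` file name.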

[Figure: react-md.png, a preview of the Markdown generated from the React framework URL.]

Conclusion

The code we wrote here is by no means perfect. It is just a quick example of the things you can do with HTTP connections in Python. We could develop it further by adding extra features and integrations.

Thanks for reading through to the end! I look forward to your comments 🤗

 