TechWriter at work blog

Living and writing documentation at Documatt, a small team of programmers who write documentation too.


URL encoding#

URL encoding (also known as percent-encoding) is a way to pass characters that are otherwise prohibited in URLs and HTML forms because they have a special meaning there. For example, to use http:// inside a URL rather than at its beginning, it has to be %-encoded as http%3A%2F%2F.
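
As a minimal sketch of that example, the standard library function urllib.parse.quote() can produce the encoded form. The example.com and example.org addresses and the next parameter name are made up for illustration.

```python
from urllib.parse import quote

# The URL we want to pass as data inside another URL (made-up address).
target = "http://example.com/"

# safe="" asks quote() to encode every reserved character, including ":" and "/".
encoded = quote(target, safe="")
print(encoded)  # http%3A%2F%2Fexample.com%2F

# Hypothetical redirect endpoint that receives the target as a query parameter.
redirect_url = "https://example.org/redirect?next=" + encoded
print(redirect_url)
```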

URL anatomy#

scheme://host-or-ip:port/path/to/somewhere?query=param&yet=another

where

  • scheme - the type of service (like http or https)

  • host-or-ip - textual or IP address of the server

  • port - defines the port number at the host (default for http is 80)

  • path/to/somewhere - request path

  • query=param - a query parameter name and its value

  • yet=another - query parameters may occur multiple times, and they are separated by &
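
The parts listed above can be pulled out of a URL with urllib.parse.urlsplit(). This is only a sketch; the host name and port number are assumptions chosen for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Example URL following the anatomy above; host and port are made up.
url = "https://example.com:8443/path/to/somewhere?query=param&yet=another"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8443
print(parts.path)      # /path/to/somewhere
print(parts.query)     # query=param&yet=another

# parse_qs() splits the query string on "&" and "=" into a dict of lists,
# because the same parameter name may occur more than once.
params = parse_qs(parts.query)
print(params)          # {'query': ['param'], 'yet': ['another']}
```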

Characters allowed in URL#

Many applications use URL-friendly strings as identifiers, names, or allowed values. A URL-friendly string is sometimes called a slug.

The only characters that can appear inside a URL are split into two groups:

  • reserved characters ! * ' ( ) ; : @ & = + $ , / ? # [ ] have a special meaning in a URL and must be %-encoded when they are passed as data.

  • unreserved characters A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 - _ . ~ are allowed in URLs as-is.

All other characters (e.g., non-English letters, math symbols, and the % sign itself) must also be URL-encoded.
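
A rough way to check whether a string is URL-friendly is to test it against the unreserved set. The helper name is_url_friendly below is hypothetical, not part of any library. The last line shows how quote() turns a non-English letter into the %-encoded bytes of its UTF-8 representation.

```python
import string
from urllib.parse import quote

# The unreserved set listed above: letters, digits, and - _ . ~
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")

def is_url_friendly(text):
    """Return True if text is non-empty and uses only unreserved characters."""
    return bool(text) and all(ch in UNRESERVED for ch in text)

print(is_url_friendly("my-first-post_v2"))  # True
print(is_url_friendly("café & cake"))       # False

# Characters outside ASCII are encoded as their UTF-8 bytes, one %XX per byte.
print(quote("café", safe=""))  # caf%C3%A9
```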

Troubles with slashes#

Slashes are especially problematic. A slash written as / is a path separator, while a slash written as %2F is data.

For an imaginary REST API endpoint on /get-file/<path>, compare two completely different URLs.

https://api.somewhere.com/get-file/sweet/cheescake.html

will end up with 404 Not Found because there is no /get-file/sweet/cheescake.html endpoint.

However,

https://api.somewhere.com/get-file/sweet%2Fcheescake.html

will be correctly routed to the /get-file/<path> endpoint because the file path sweet/cheescake.html is URL-encoded as sweet%2Fcheescake.html.
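
The difference can be sketched with a toy router. This is an illustration only: the route_get_file function and its regular expression are assumptions, and it matches the still-encoded path before decoding the captured segment.

```python
import re
from urllib.parse import unquote

# Hypothetical route /get-file/<path>, where <path> is exactly one path segment.
GET_FILE = re.compile(r"^/get-file/([^/]+)$")

def route_get_file(raw_path):
    """Match the still-encoded request path, then decode the captured segment."""
    match = GET_FILE.match(raw_path)
    if match is None:
        return None  # a real server would answer 404 Not Found here
    return unquote(match.group(1))

print(route_get_file("/get-file/sweet/cheescake.html"))    # None
print(route_get_file("/get-file/sweet%2Fcheescake.html"))  # sweet/cheescake.html
```

Real frameworks and proxies differ in when they decode %2F, so it is worth checking how the one you use behaves.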

HTML forms#

HTML forms are the second domain where percent-encoding is used. When data entered in an HTML form is submitted, the browser percent-encodes the field names and values and sends them with the application/x-www-form-urlencoded MIME type.

The slight difference between percent encoding for forms and URLs is described below.

For example, submitting a form with two fields:

POST /send-feedback HTTP/1.1
Content-Type: application/x-www-form-urlencoded

who=Matt&text=I+want+more+examples
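
The same request body can be produced in Python with urllib.parse.urlencode(), and decoded again with parse_qs(). A small sketch that reuses the two fields from the request above:

```python
from urllib.parse import urlencode, parse_qs

# The two form fields from the request above.
fields = {"who": "Matt", "text": "I want more examples"}

# urlencode() joins name=value pairs with "&" and encodes spaces as "+".
body = urlencode(fields)
print(body)     # who=Matt&text=I+want+more+examples

# parse_qs() reverses the encoding; each value is a list because
# a field name may appear more than once.
decoded = parse_qs(body)
print(decoded)  # {'who': ['Matt'], 'text': ['I want more examples']}
```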

Troubles with spaces#

The space character is another special case. URLs cannot contain spaces.

Within a URL, a space is encoded as %20. For example, to request the file sweet cheescake.html:

https://api.somewhere.com/get-file/sweet%20cheescake.html

(Using space for file names is not a wise idea, anyway.)

However, when a space occurs in an HTML form field name or value, it is encoded as +.
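
Because + stands for a space in form data, a literal plus sign has to be encoded as %2B. The following sketch compares the two styles using urllib.parse.quote() and quote_plus() on a made-up value:

```python
from urllib.parse import quote, quote_plus

text = "1 + 1"

# URL style: space becomes %20, the literal plus becomes %2B.
url_style = quote(text)
print(url_style)   # 1%20%2B%201

# Form style: space becomes +, the literal plus still becomes %2B.
form_style = quote_plus(text)
print(form_style)  # 1+%2B+1
```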

URL quoting in Python#

How do you perform URL encoding in Python? The standard library module urllib.parse provides, among others, the functions quote() and quote_plus() for encoding, and unquote() and unquote_plus() for decoding.

By default, the quote() function doesn’t encode / to %2F because it is considered a “safe” character.

from urllib.parse import quote, quote_plus

path = "some/file with space.html"

# some/file%20with%20space.html
print(quote(path))

To encode all disallowed characters, including /, pass safe="":

# some%2Ffile%20with%20space.html
print(quote(path, safe=""))

quote_plus() and unquote_plus() work the same way as quote() and unquote(), but a space is encoded/decoded as +, and quote_plus() has no safe characters by default:

# some%2Ffile+with+space.html
print(quote_plus(path))
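
Decoding works the other way round. As a short sketch, unquote() reverses quote() and unquote_plus() reverses quote_plus(); mixing them up leaves the + signs in place:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

path = "some/file with space.html"

url_encoded = quote(path, safe="")  # some%2Ffile%20with%20space.html
form_encoded = quote_plus(path)     # some%2Ffile+with+space.html

# Each decoder restores the original string from its matching encoder.
print(unquote(url_encoded))         # some/file with space.html
print(unquote_plus(form_encoded))   # some/file with space.html

# unquote() does not treat "+" as a space, so the plus signs survive.
print(unquote(form_encoded))        # some/file+with+space.html
```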
