URL encoding#
URL encoding (also known as percent-encoding) is a way to pass characters that are otherwise prohibited in URLs and HTML forms because they have special meanings. For example, to use http:// as part of a URL rather than at its beginning, it has to be %-encoded as http%3A%2F%2F.
URL anatomy#
scheme://host-or-ip:port/path/to/somewhere?query=param&yet=another
where
- scheme is the type of service (like http or https)
- host-or-ip is the textual or IP address of the server
- port defines the port number at the host (default for http is 80)
- path/to/somewhere is the request path
- query=param is an additional parameter name and its value
- yet=another shows that query parameters may occur multiple times, separated by &
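As a rough illustration of the anatomy above, the standard library can split a URL into these parts. This is a minimal sketch; the host example.com and port 8080 are placeholders chosen for the example, not values from this article.

```python
from urllib.parse import urlsplit, parse_qs

# example.com and 8080 are placeholder values for illustration only
url = "http://example.com:8080/path/to/somewhere?query=param&yet=another"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /path/to/somewhere
print(parts.query)     # query=param&yet=another

# parse_qs() splits the query on & and = and decodes names and values
params = parse_qs(parts.query)
print(params)          # {'query': ['param'], 'yet': ['another']}
```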
Characters allowed in URL#
Many applications embrace URL-friendly strings as identifiers, names, or allowed values. A URL-friendly string is sometimes called a slug.
The characters that may appear inside a URL are split into two groups:
- reserved characters ! * ' ( ) ; : @ & = + $ , / ? # [ ] have a special meaning in a URL and must be %-encoded to pass them as data in a URL.
- unreserved characters A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 - _ . ~ are allowed in URLs as-is.
All other characters (e.g., non-English letters, math symbols) must also be URL-encoded.
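To make the mechanism concrete, here is a minimal sketch of percent-encoding written by hand: the text is converted to UTF-8 bytes, and every byte outside the unreserved set is written as % followed by two hexadecimal digits. The helper name percent_encode is hypothetical; in practice the standard library's quote() does this job.

```python
from urllib.parse import quote

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.~"
)

def percent_encode(text):
    """Hypothetical helper: encode everything except unreserved characters."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)               # allowed as-is
        else:
            out.append(f"%{byte:02X}")   # e.g. a space becomes %20
    return "".join(out)

encoded = percent_encode("café & tea")
print(encoded)                                   # caf%C3%A9%20%26%20tea
print(encoded == quote("café & tea", safe=""))   # True
```

Note that the non-ASCII letter é takes two bytes in UTF-8, so it becomes two %-escapes.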
Troubles with slashes#
Slashes are particularly problematic. A slash written as / is a path separator, while a slash written as %2F is data.
For an imaginary REST API endpoint at /get-file/<path>, compare two completely different URLs.
https://api.somewhere.com/get-file/sweet/cheescake.html
will end up with 404 Not Found because there is no /get-file/sweet/cheescake.html endpoint.
However,
https://api.somewhere.com/get-file/sweet%2Fcheescake.html
will be correctly routed to the /get-file/<path> endpoint because the file path sweet/cheescake.html is URL-encoded as sweet%2Fcheescake.html.
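The difference between the two URLs can be sketched by splitting the raw path on / first and decoding each segment afterwards. This is only an illustration of the idea; path_segments is a hypothetical helper, and real web frameworks differ in when and how they decode %2F.

```python
from urllib.parse import unquote

def path_segments(raw_path):
    """Hypothetical helper: split on literal slashes, then decode each segment."""
    return [unquote(segment) for segment in raw_path.strip("/").split("/")]

plain = path_segments("/get-file/sweet/cheescake.html")
encoded = path_segments("/get-file/sweet%2Fcheescake.html")

print(plain)    # ['get-file', 'sweet', 'cheescake.html']
print(encoded)  # ['get-file', 'sweet/cheescake.html']
```

The first path has three segments, so it cannot match a two-segment route, while the second one carries the whole file path as a single segment.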
HTML forms#
HTML forms are the second percent-encoding domain. When data entered in an HTML form is submitted, the browser percent-encodes the field names and values using the application/x-www-form-urlencoded MIME type.
The slight difference between percent-encoding for forms and for URLs is described below.
For example, sending a form with two fields:
POST /send-feedback HTTP/1.1
Content-Type: application/x-www-form-urlencoded

who=Matt&text=I+want+more+examples
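A request body like this one can be produced and parsed with the standard library. The sketch below assumes the two field values from the example; urlencode() applies form-style encoding by default, and parse_qs() reverses it.

```python
from urllib.parse import urlencode, parse_qs

fields = {"who": "Matt", "text": "I want more examples"}

# urlencode() joins name=value pairs with & and encodes spaces as +
body = urlencode(fields)
print(body)            # who=Matt&text=I+want+more+examples

# parse_qs() decodes the body back; each name maps to a list of values
decoded = parse_qs(body)
print(decoded)         # {'who': ['Matt'], 'text': ['I want more examples']}
```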
Troubles with spaces#
The space character is also special. URLs cannot contain spaces.
Within a URL, a space is encoded as %20. For example, to obtain the file sweet cheescake.html:
https://api.somewhere.com/get-file/sweet%20cheescake.html
(Using spaces in file names is not a wise idea anyway.)
However, when a space occurs in an HTML form field name or value, it is encoded as +.
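The two conventions matter when decoding, because a + only means a space in form data. A short sketch using the standard library decoders on the strings from the examples above:

```python
from urllib.parse import unquote, unquote_plus

# URL convention: %20 is a space
from_url = unquote("sweet%20cheescake.html")
print(from_url)       # sweet cheescake.html

# unquote() leaves + untouched, because in a URL path it is just a plus sign
kept_plus = unquote("I+want+more+examples")
print(kept_plus)      # I+want+more+examples

# Form convention: unquote_plus() turns + into a space
from_form = unquote_plus("I+want+more+examples")
print(from_form)      # I want more examples
```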
URL quoting in Python#
How do you perform URL encoding in Python? The standard library module urllib.parse provides (among others) these functions:
- quote() and unquote() for encoding and decoding URLs
- quote_plus() and unquote_plus() for encoding and decoding HTML form values
By default, the quote() function doesn’t encode / as %2F because it is a “safe” character.
from urllib.parse import quote
path = "some/file with space.html"
# some/file%20with%20space.html
print(quote(path))
To encode all disallowed characters, including the slash, set the safe parameter to an empty string (safe=""):
# some%2Ffile%20with%20space.html
print(quote(path, safe=""))
quote_plus() and unquote_plus() work the same, but the space is encoded/decoded as +, and quote_plus() has no safe characters by default:
# some%2Ffile+with+space.html
print(quote_plus(path))
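Decoding should restore the original string as long as the decoder matches the encoder. The following round-trip sketch uses a made-up value that contains literal plus signs, an ampersand, and a slash to show that both conventions are reversible:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Made-up value for illustration; literal + must not be confused with a space
value = "C++ & more/less"

as_url = quote(value, safe="")
as_form = quote_plus(value)
print(as_url)    # C%2B%2B%20%26%20more%2Fless
print(as_form)   # C%2B%2B+%26+more%2Fless

# Each decoder restores the original when paired with its own encoder
print(unquote(as_url) == value)        # True
print(unquote_plus(as_form) == value)  # True
```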