What is Encoding?
Encoding is the process of converting data into the format required for a particular information-processing need, such as transmission, storage, or safe display.
Web applications employ several different encoding schemes for their data. The schemes discussed here are:
- URL Encoding
- Unicode Encoding
- HTML Encoding
- Base64 Encoding
- HEX Encoding
URL Encoding
When you pass information through a URL, you need to make sure it uses only specifically allowed characters: alphabetic characters, numerals, and a few special characters that have meaning in the URL string. When submitting data to CGI scripts using the GET method, you should encode the data because it is sent as part of the URL.
The table below lists characters that have a special purpose in a URL, along with their encodings:
|Character||Purpose in URL||Encoding|
|:||Separate protocol (HTTP) from address||%3A|
|/||Separate domain and directories||%2F|
|?||Separate query string||%3F|
|&||Separate query elements||%26|
|@||Separate username and password from domain||%40|
|%||Indicates an encoded character||%25|
|+||Indicates a space||%2B|
|<Space>||Not recommended in URLs||%20 or +|
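The encodings in the table can be checked with a short sketch using Python's standard `urllib.parse` module. The sample string `fish & chips+salt` is an illustrative value, not something from the text above.

```python
from urllib.parse import quote, quote_plus, unquote_plus

# Characters from the table, percent-encoded one at a time.
# safe="" makes quote() encode "/" too, which it leaves alone by default.
reserved = [":", "/", "?", "&", "@", "%", "+", " "]
encoded = {ch: quote(ch, safe="") for ch in reserved}

for ch, enc in encoded.items():
    print(repr(ch), "->", enc)

# quote_plus() follows the form-submission convention: a space becomes "+",
# so a literal "+" in the data has to be written as %2B.
query_value = quote_plus("fish & chips+salt")
print(query_value)

# Decoding reverses the process.
round_trip = unquote_plus(query_value)
print(round_trip)
```

Note that `quote()` encodes a space as `%20`, while `quote_plus()` uses `+`, which matches the two alternatives in the last table row.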
Unicode Encoding
Unicode is a character encoding standard designed to support all of the world’s writing systems. It employs various encoding schemes, some of which can be used to represent unusual characters in web applications.
- UTF-8 (used by default on Linux and across much of the Internet)
- UTF-16 (used by Microsoft Windows, Mac OS X’s file systems, and the Java programming language, among others)
UTF-8
- A character encoding capable of encoding all possible characters (called code points) in Unicode.
- A code unit is 8 bits.
- It uses one to four code units to encode each code point.
- Examples: 00100100 for “$” (one code unit); 11000010 10100010 for “¢” (two code units); 11100010 10000010 10101100 for “€” (three code units).
UTF-16
- Another character encoding capable of encoding all Unicode code points.
- A code unit is 16 bits.
- It uses one or two code units to encode each code point.
- Examples: 00000000 00100100 for “$” (one code unit); 11011000 01010010 11011111 01100010 for “𤭢” (two code units).
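The bit patterns above can be reproduced with a minimal Python sketch. The helper name `to_bits` is hypothetical, and `utf-16-be` (big-endian, without a byte-order mark) is chosen here so that the output lines up with the examples.

```python
def to_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in data)

for ch in ["$", "¢", "€", "𤭢"]:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    print(f"U+{ord(ch):04X}")
    print(f"  UTF-8  ({len(utf8)} code units of 8 bits):  {to_bits(utf8)}")
    print(f"  UTF-16 ({len(utf16) // 2} code units of 16 bits): {to_bits(utf16)}")
```

The last character lies outside the Basic Multilingual Plane, so it takes four code units in UTF-8 and two (a surrogate pair) in UTF-16.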
HTML Encoding
HTML encoding is used to represent problematic characters so that they can be safely incorporated into an HTML document.
HTML encoding defines numerous HTML entities to represent specific literal characters:
- &quot; – "
- &apos; – '
- &amp; – &
- &lt; – <
- &gt; – >
In addition, any character can be HTML-encoded using its ASCII code in decimal form:
- &#34; – "
- &#39; – '
or by using its ASCII code in hexadecimal form (prefixed by an x):
- &#x22; – "
- &#x27; – '
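As a rough illustration, Python's standard `html` module can both produce and decode these entities. The sample markup in `raw` is made up for the example.

```python
import html

# A string containing all five problematic characters.
raw = '<a href="x">Tom & Jerry\'s</a>'

# html.escape() replaces &, < and >; with quote=True it also replaces
# the double quote (named entity) and single quote (hexadecimal form).
escaped = html.escape(raw, quote=True)
print(escaped)

# html.unescape() understands named, decimal, and hexadecimal forms alike.
named = html.unescape("&quot; &apos; &amp; &lt; &gt;")
decimal = html.unescape("&#34; &#39;")
hexadecimal = html.unescape("&#x22; &#x27;")
print(named, decimal, hexadecimal)
```

All three notations decode to the same literal characters, so a decoder has to treat them as equivalent.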
Base64 Encoding
Base64 encoding schemes are commonly used when binary data needs to be stored in, or transferred over, media that are designed to deal with textual data. This ensures that the data remains intact, without modification, during transport. Base64 is commonly used in a number of applications, including email via MIME and storing complex data in XML.
If we want to convert the word ‘cake’ to Base64, we convert each of the characters to its ASCII decimal value and then write each decimal value as an 8-bit binary number. In our case:
cake = 01100011011000010110101101100101
We now need to break up our binary string into chunks of 6 bits each. Because every 3 input characters must produce 4 Base64 characters, we pad with 0s until the last chunk is complete and the number of chunks is a multiple of 4:
011000 110110 000101 101011 011001 010000 000000 000000
We now convert each of the 6-bit binary values into its corresponding Base64 character. The all-zero chunks that were added purely as padding are written as the padding character (=). In our case we get:
Y2FrZQ==
This is the Base64-encoded value of the word ‘cake’.
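The manual steps for ‘cake’ can be sketched in Python and compared with the standard library. The function name `base64_by_hand` is hypothetical, and the alphabet is the standard Base64 alphabet (A-Z, a-z, 0-9, +, /).

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def base64_by_hand(data: bytes) -> str:
    # Step 1: write every byte as 8 binary digits.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Step 2: pad with zeros so the bit string splits into whole 6-bit chunks.
    bits += "0" * (-len(bits) % 6)
    # Step 3: map each 6-bit chunk to a character of the alphabet.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Step 4: add "=" until the output length is a multiple of 4.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)

manual = base64_by_hand(b"cake")
library = base64.b64encode(b"cake").decode("ascii")
print(manual, library)
```

Both approaches give the same string. In everyday code the library call is the sensible choice, and the manual version is only there to mirror the walkthrough.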
Hex Encoding
Hex encoding serves much the same purpose as Base64 encoding: it is used when you want to send or store 8-bit data on a medium that only accepts 6 or 7 bits. Hex encoding converts each 8-bit byte to two hex characters, which are then stored as a two-character string. Often, some kind of separator is added between byte values to make the encoded data easier for humans to read. With the separator, each 8-bit byte becomes three characters, and since each character may be stored as 1 to 4 bytes depending on the text encoding, you might use up to 12 bytes (or even more in some cases) for each byte of information.
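A minimal Python sketch of hex encoding, reusing the word ‘cake’ as sample data. The space separator and the UTF-32 storage are assumptions chosen to illustrate the size overhead.

```python
data = b"cake"

# Two hex characters per byte, no separator.
plain = "".join(f"{byte:02x}" for byte in data)

# The same bytes with a space between each pair for readability.
spaced = " ".join(f"{byte:02x}" for byte in data)
print(plain, "|", spaced)

# Decoding: bytes.fromhex() ignores whitespace between byte values.
decoded = bytes.fromhex(spaced)

# Size overhead for one input byte written as two hex digits plus a separator:
# three characters, each stored in 1 byte (ASCII) or 4 bytes (UTF-32).
one_byte_as_text = "63 "
ascii_size = len(one_byte_as_text.encode("ascii"))
utf32_size = len(one_byte_as_text.encode("utf-32-be"))
print(ascii_size, utf32_size)
```

Without a separator and with a one-byte-per-character text encoding, the overhead is a factor of two, which is the common case in practice.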