<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>A Tutorial on Data Representation - Integers, Floating-point numbers, and characters</title>
<link href="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/programming_notes_v1.css" rel="stylesheet" type="text/css">
<script type="text/javascript" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/programming_notes_v1.js"></script>
<link rel="shortcut icon" href="http://www3.ntu.edu.sg/home/ehchua/programming/favicon.ico" type="image/x-icon"></head>
<body>
<div id="wrap-outer">
<!-- header filled by JavaScript -->
<div id="header" class="header-footer"><p>yet another insignificant programming notes... | <a href="http://www3.ntu.edu.sg/home/ehchua/programming/index.html">HOME</a></p></div>
<div id="wrap-inner">
<div id="wrap-toc">
<h5>TABLE OF CONTENTS <a id="show-toc" href="#show-toc">(SHOW)</a></h5>
<div style="display: none;" id="toc"><a class="toc-H3" href="#zz-1.">1. Number Systems</a><br><a class="toc-H4" href="#zz-1.1">1.1 Decimal (Base 10) Number System</a><br><a class="toc-H4" href="#zz-1.2">1.2 Binary (Base 2) Number System</a><br><a class="toc-H4" href="#zz-1.3">1.3 Hexadecimal (Base 16) Number System</a><br><a class="toc-H4" href="#zz-1.4">1.4 Conversion from Hexadecimal to Binary</a><br><a class="toc-H4" href="#zz-1.5">1.5 Conversion from Binary to Hexadecimal</a><br><a class="toc-H4" href="#zz-1.6">1.6 Conversion from Base <em>r</em> to Decimal (Base 10)</a><br><a class="toc-H4" href="#zz-1.7">1.7 Conversion from Decimal (Base 10) to Base <em>r</em></a><br><a class="toc-H4" href="#zz-1.8">1.8 General Conversion between 2 Base Systems with Fractional Part</a><br><a class="toc-H4" href="#zz-1.9">1.9 Exercises (Number Systems Conversion)</a><br><a class="toc-H3" href="#zz-2.">2. Computer Memory & Data Representation</a><br><a class="toc-H3" href="#zz-3.">3. Integer Representation</a><br><a class="toc-H4" href="#zz-3.1">3.1 <em>n</em>-bit Unsigned Integers</a><br><a class="toc-H4" href="#zz-3.2">3.2 Signed Integers</a><br><a class="toc-H4" href="#zz-3.3">3.3 <em>n</em>-bit Sign Integers in Sign-Magnitude Representation</a><br><a class="toc-H4" href="#zz-3.4">3.4 <em>n</em>-bit Sign Integers in 1's Complement Representation</a><br><a class="toc-H4" href="#zz-3.5">3.5 <em>n</em>-bit Sign Integers in 2's Complement Representation</a><br><a class="toc-H4" href="#zz-3.6">3.6 Computers use 2's Complement Representation for Signed Integers</a><br><a class="toc-H4" href="#zz-3.7">3.7 Range of <em>n</em>-bit 2's Complement Signed Integers</a><br><a class="toc-H4" href="#zz-3.8">3.8 Decoding 2's Complement Numbers</a><br><a class="toc-H4" href="#zz-3.9">3.9 Big Endian vs. Little Endian</a><br><a class="toc-H4" href="#zz-3.10">3.10 Exercise (Integer Representation)</a><br><a class="toc-H3" href="#zz-4.">4. 
Floating-Point Number Representation</a><br><a class="toc-H4" href="#zz-4.1">4.1 IEEE-754 32-bit Single-Precision Floating-Point Numbers</a><br><a class="toc-H4" href="#zz-4.2">4.2 Exercises (Floating-point Numbers)</a><br><a class="toc-H4" href="#zz-4.3">4.3 IEEE-754 64-bit Double-Precision Floating-Point Numbers</a><br><a class="toc-H4" href="#zz-4.4">4.4 More on Floating-Point Representation</a><br><a class="toc-H3" href="#zz-5.">5. Character Encoding</a><br><a class="toc-H4" href="#zz-5.1">5.1 7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)</a><br><a class="toc-H4" href="#zz-5.2">5.2 8-bit Latin-1 (aka ISO/IEC 8859-1)</a><br><a class="toc-H4" href="#zz-5.3">5.3 Other 8-bit Extension of US-ASCII (ASCII Extensions)</a><br><a class="toc-H4" href="#zz-5.4">5.4 Unicode (aka ISO/IEC 10646 Universal Character Set)</a><br><a class="toc-H4" href="#zz-5.5">5.5 UTF-8 (Unicode Transformation Format - 8-bit)</a><br><a class="toc-H4" href="#zz-5.6">5.6 UTF-16 (Unicode Transformation Format - 16-bit)</a><br><a class="toc-H4" href="#zz-5.7">5.7 UTF-32 (Unicode Transformation Format - 32-bit)</a><br><a class="toc-H4" href="#zz-5.8">5.8 Formats of Multi-Byte (e.g., Unicode) Text Files</a><br><a class="toc-H4" href="#zz-5.9">5.9 Formats of Text Files</a><br><a class="toc-H4" href="#zz-5.10">5.10 Windows' CMD Codepage</a><br><a class="toc-H4" href="#zz-5.11">5.11 Chinese Character Sets</a><br><a class="toc-H4" href="#zz-5.12">5.12 Collating Sequences (for Ranking Characters)</a><br><a class="toc-H4" href="#zz-5.13">5.13 For Java Programmers - <code>java.nio.Charset</code></a><br><a class="toc-H4" href="#zz-5.14">5.14 For Java Programmers - <code>char</code> and <code>String</code></a><br><a class="toc-H4" href="#zz-5.15">5.15 Displaying Hex Values & Hex Editors</a><br><a class="toc-H3" href="#zz-6.">6. 
Summary - Why Bother about Data Representation?</a><br><a class="toc-H4" href="#zz-6.1">6.1 Exercises (Data Representation)</a><br><br></div> <!-- for showing the "Table of Content" -->
</div>
<div id="content-header">
<h1>A Tutorial on Data Representation</h1>
<h2>Integers, Floating-point Numbers, and Characters</h2>
</div>
<div id="content-main">
<h3>1. Number Systems<a id="zz-1."></a></h3>
<p>Human beings use <em>decimal</em> (base 10) and <em>duodecimal</em> (base 12) number systems for counting and measurements (probably because we have 10 fingers and two big toes). Computers use the <em>binary</em>
(base 2) number system, as they are made from binary digital components
(known as transistors) operating in two states: on and off. In
computing, we also use the <em>hexadecimal</em> (base 16) or <em>octal</em> (base 8) number systems as a <em>compact</em> form for representing binary numbers.</p>
<h4>1.1 Decimal (Base 10) Number System<a id="zz-1.1"></a></h4>
<p>Decimal number system has ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, called <em>digit</em>s. It uses <em>positional notation</em>. That is, the least-significant digit (right-most digit) is of the order of <code>10^0</code> (units or ones), the second right-most digit is of the order of <code>10^1</code> (tens), the third right-most digit is of the order of <code>10^2</code> (hundreds), and so on. For example,</p>
<pre class="color-example">735 = 7×10^2 + 3×10^1 + 5×10^0</pre>
<p>We shall denote a decimal number with an optional suffix <code>D</code> if ambiguity arises.</p>
<h4>1.2 Binary (Base 2) Number System<a id="zz-1.2"></a></h4>
<p>Binary number system has two symbols: 0 and 1, called <em>bits</em>. It is also a <em>positional notation</em>, for example,</p>
<pre class="color-example">10110B = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0</pre>
<p>We shall denote a binary number with a suffix <code>B</code>. Some programming languages denote binary numbers with prefix <code>0b</code> (e.g., <code>0b1001000</code>), or prefix <code>b</code> with the bits quoted (e.g., <code>b'10001111'</code>).</p>
<p>A binary digit is called a <em>bit</em>. Eight bits is called a <em>byte</em> (why 8-bit unit? Probably because <code>8=2<sup>3</sup></code>).</p>
<h4>1.3 Hexadecimal (Base 16) Number System<a id="zz-1.3"></a></h4>
<p>Hexadecimal number system uses 16 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F, called <em>hex digits</em>. It is a <em>positional notation</em>, for example,</p>
<pre class="color-example">A3EH = 10×16^2 + 3×16^1 + 14×16^0</pre>
<p>We shall denote a hexadecimal number (in short, hex) with a suffix <code>H</code>. Some programming languages denote hex numbers with prefix <code>0x</code> (e.g., <code>0x1A3C5F</code>), or prefix <code>x</code> with hex digit quoted (e.g., <code>x'C3A4D98B'</code>).</p>
<p>Each hexadecimal digit is also called a <em>hex digit</em>. Most programming languages accept lowercase <code>'a'</code> to <code>'f'</code> as well as uppercase <code>'A'</code> to <code>'F'</code>.</p>
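<p>As a small illustration of these prefixes, Python is one language that accepts both <code>0b</code> and <code>0x</code> literals. The sketch below uses only built-in functions; the variable names are chosen here for illustration.</p>

```python
# Literals with the 0b and 0x prefixes denote the same integers as the
# positional expansions shown in the text.
binary_value = 0b10110        # 1*2^4 + 0*2^3 + 1*2^2 + 1*2^1 + 0*2^0
hex_value = 0xA3E             # 10*16^2 + 3*16^1 + 14*16^0
hex_lower = 0xa3e             # lowercase hex digits are accepted as well

print(binary_value)           # 22
print(hex_value)              # 2622
print(bin(22), hex(2622))     # 0b10110 0xa3e
```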
<p>Computers use the binary system in their internal operations, as they
are built from binary digital electronic components. However, writing or
reading a long sequence of binary bits is cumbersome and error-prone.
The hexadecimal system is used as a <em>compact</em> form or <em>shorthand</em> for binary bits. Each hex digit is equivalent to 4 binary bits, i.e., shorthand for 4 bits, as follows:</p>
<table class="table-zebra font-code" style="width:60%">
<tbody><tr class="tr-alt">
<td>0H (0000B) (0D)</td>
<td>1H (0001B) (1D)</td>
<td>2H (0010B) (2D)</td>
<td>3H (0011B) (3D)</td>
</tr>
<tr>
<td>4H (0100B) (4D)</td>
<td>5H (0101B) (5D)</td>
<td>6H (0110B) (6D)</td>
<td>7H (0111B) (7D)</td>
</tr>
<tr class="tr-alt">
<td>8H (1000B) (8D)</td>
<td>9H (1001B) (9D)</td>
<td>AH (1010B) (10D)</td>
<td>BH (1011B) (11D)</td>
</tr>
<tr>
<td>CH (1100B) (12D)</td>
<td>DH (1101B) (13D)</td>
<td>EH (1110B) (14D)</td>
<td>FH (1111B) (15D)</td>
</tr>
</tbody></table>
<h4>1.4 Conversion from Hexadecimal to Binary<a id="zz-1.4"></a></h4>
<p>Replace each hex digit by the 4 equivalent bits, for example,</p>
<pre class="color-example">A3C5H = 1010 0011 1100 0101B
102AH = 0001 0000 0010 1010B</pre>
<h4>1.5 Conversion from Binary to Hexadecimal<a id="zz-1.5"></a></h4>
<p>Starting from the right-most bit (least-significant bit), replace
each group of 4 bits by the equivalent hex digit (pad the left-most bits
with zero if necessary), for example,</p>
<pre class="color-example">1001001010B = 0010 0100 1010B = 24AH
10001011001011B = 0010 0010 1100 1011B = 22CBH</pre>
<p>It is important to note that hexadecimal number provides a <em>compact form</em> or <em>shorthand</em> for representing binary bits.</p>
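<p>The two digit-grouping procedures above can be sketched in a few lines of Python. This is only an illustration; the helper names <code>hex_to_binary</code> and <code>binary_to_hex</code> are hypothetical and not part of any library.</p>

```python
HEX_DIGITS = "0123456789ABCDEF"

def hex_to_binary(hex_str):
    """Replace each hex digit by its 4-bit pattern (groups separated by spaces)."""
    return " ".join(format(HEX_DIGITS.index(d), "04b") for d in hex_str.upper())

def binary_to_hex(bit_str):
    """Group bits in fours from the right, padding the left with zeros."""
    bits = bit_str.replace(" ", "")
    padded_len = -(-len(bits) // 4) * 4      # round length up to a multiple of 4
    bits = bits.zfill(padded_len)
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(HEX_DIGITS[int(g, 2)] for g in groups)

print(hex_to_binary("A3C5"))         # 1010 0011 1100 0101
print(binary_to_hex("1001001010"))   # 24A
```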
<h4>1.6 Conversion from Base <em>r</em> to Decimal (Base 10)<a id="zz-1.6"></a></h4>
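<p>Given an <em>n</em>-digit base <em>r</em> number: <code>d<sub>n-1</sub> d<sub>n-2</sub> d<sub>n-3</sub> ... d<sub>3</sub> d<sub>2</sub> d<sub>1</sub> d<sub>0</sub></code> (base <em>r</em>), the decimal equivalent is given by:</p>
<pre class="color-syntax">d<sub>n-1</sub> × r^(n-1) + d<sub>n-2</sub> × r^(n-2) + ... + d<sub>1</sub> × r^1 + d<sub>0</sub> × r^0</pre>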
<p>For example,</p>
<pre class="color-example">A1C2H = 10×16^3 + 1×16^2 + 12×16^1 + 2 = 41410 (base 10)
10110B = 1×2^4 + 1×2^2 + 1×2^1 = 22 (base 10)</pre>
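<p>The positional sum above can be evaluated digit by digit. The following Python sketch uses a hypothetical helper <code>to_decimal</code> and assumes a radix between 2 and 16; it accumulates the value from the most-significant digit, which yields the same result as the formula.</p>

```python
def to_decimal(digits, radix):
    """Evaluate a digit string in the given radix (2 to 16) as an integer."""
    symbols = "0123456789ABCDEF"
    value = 0
    for ch in digits.upper():
        d = symbols.index(ch)
        if d >= radix:
            raise ValueError(f"digit {ch!r} is not valid in base {radix}")
        value = value * radix + d    # shift previous digits up one position
    return value

print(to_decimal("A1C2", 16))   # 41410
print(to_decimal("10110", 2))   # 22
```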
<h4>1.7 Conversion from Decimal (Base 10) to Base <em>r</em><a id="zz-1.7"></a></h4>
<p>Use repeated division/remainder. For example,</p>
<pre class="color-example">To convert 261D to hexadecimal:
261/16 => quotient=16 remainder=5
16/16 => quotient=1 remainder=0
1/16 => quotient=0 remainder=1 (quotient=0 stop)
Hence, 261D = 105H</pre>
<p>The above procedure is actually applicable to conversion between any 2 base systems. For example,</p>
<pre class="color-example">To convert 1023(base 4) to base 3:
1023(base 4)/3 => quotient=25D remainder=0
25D/3 => quotient=8D remainder=1
8D/3 => quotient=2D remainder=2
2D/3 => quotient=0 remainder=2 (quotient=0 stop)
Hence, 1023(base 4) = 2210(base 3)</pre>
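<p>The repeated division/remainder procedure can be sketched as follows. The helper name <code>from_decimal</code> is an assumption made for this example, and the sketch handles non-negative integers and radixes from 2 to 16 only.</p>

```python
def from_decimal(value, radix):
    """Convert a non-negative integer to a digit string in radix 2 to 16."""
    symbols = "0123456789ABCDEF"
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, radix)
        digits.append(symbols[remainder])   # least-significant digit comes first
    return "".join(reversed(digits))        # so reverse before returning

print(from_decimal(261, 16))   # 105
print(from_decimal(75, 3))     # 2210  (75D is 1023 in base 4)
```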
<h4>1.8 General Conversion between 2 Base Systems with Fractional Part<a id="zz-1.8"></a></h4>
<ol>
<li>Separate the integral and the fractional parts.</li>
<li>For the integral part, divide by the target radix repeatedly, and collect the remainders in reverse order.</li>
<li>For the fractional part, multiply the fractional part by the target
radix repeatedly, and collect the integral parts in the same order.</li>
</ol>
<p class="line-heading">Example 1:</p>
<pre class="color-example">Convert 18.6875D to binary
Integral Part = 18D
18/2 => quotient=9 remainder=0
9/2 => quotient=4 remainder=1
4/2 => quotient=2 remainder=0
2/2 => quotient=1 remainder=0
1/2 => quotient=0 remainder=1 (quotient=0 stop)
Hence, 18D = 10010B
Fractional Part = .6875D
.6875*2=1.375 => whole number is 1
.375*2=0.75 => whole number is 0
.75*2=1.5 => whole number is 1
.5*2=1.0 => whole number is 1
Hence .6875D = .1011B
Therefore, 18.6875D = 10010.1011B</pre>
<p class="line-heading">Example 2:</p>
<pre class="color-example">Convert 18.6875D to hexadecimal
Integral Part = 18D
18/16 => quotient=1 remainder=2
1/16 => quotient=0 remainder=1 (quotient=0 stop)
Hence, 18D = 12H
Fractional Part = .6875D
.6875*16=11.0 => whole number is 11D (BH)
Hence .6875D = .BH
Therefore, 18.6875D = 12.BH</pre>
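<p>The fractional-part step can be sketched in Python as shown below. The helper <code>fraction_to_base</code> and its <code>max_digits</code> cut-off are assumptions of this example. It uses the standard <code>fractions.Fraction</code> type so that the repeated multiplication is exact, and it stops after a fixed number of digits because many decimal fractions do not terminate in another base.</p>

```python
from fractions import Fraction

def fraction_to_base(frac, radix, max_digits=16):
    """Convert a fraction in [0, 1) to the digits after the radix point."""
    symbols = "0123456789ABCDEF"
    frac = Fraction(frac)
    digits = []
    while frac != 0 and max_digits > len(digits):
        frac *= radix
        whole = int(frac)              # the integral part is the next digit
        digits.append(symbols[whole])
        frac -= whole                  # keep only the fractional part
    return "".join(digits)

print(fraction_to_base("0.6875", 2))    # 1011
print(fraction_to_base("0.6875", 16))   # B
```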
<h4>1.9 Exercises (Number Systems Conversion)<a id="zz-1.9"></a></h4>
<ol>
<li>Convert the following <em>decimal</em> numbers into <em>binary</em> and <em>hexadecimal</em> numbers:
<ol>
<li><code>108</code></li>
<li><code>4848</code></li>
<li><code>9000</code></li>
</ol></li>
<li>Convert the following binary numbers into hexadecimal and decimal numbers:
<ol>
<li><code>1000011000</code></li>
<li><code>10000000</code></li>
<li><code>101010101010</code></li>
</ol></li>
<li>Convert the following hexadecimal numbers into binary and decimal numbers:
<ol>
<li><code>ABCDE</code></li>
<li><code>1234</code></li>
<li><code>80F</code></li>
</ol></li>
<li>Convert the following decimal numbers into binary equivalent:
<ol>
<li><code>19.25D</code></li>
<li><code>123.456D</code></li>
</ol>
</li>
</ol>
<p><span class="line-heading">Answers:</span> You could use the Windows Calculator (<code>calc.exe</code>)
to carry out number system conversion, by setting it to the Programmer or Scientific
mode. (Run "calc" ⇒ Select "View" menu ⇒ Choose "Programmer" or
"Scientific" mode.)</p>
<ol>
<li><code>1101100B</code>, <code>1001011110000B</code>, <code>10001100101000B</code>, <code>6CH</code>, <code>12F0H</code>, <code>2328H</code>.</li>
<li><code>218H</code>, <code>80H</code>, <code>AAAH</code>, <code>536D</code>, <code>128D</code>, <code>2730D</code>.</li>
<li><code>10101011110011011110B</code>, <code>1001000110100B</code>, <code>100000001111B</code>, <code>703710D</code>, <code>4660D</code>, <code>2063D</code>.</li>
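<li><code>10011.01B</code>, <code>1111011.01110100...B</code> (the binary fraction of <code>.456D</code> does not terminate).</li>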
</ol>
<h3>2. Computer Memory & Data Representation<a id="zz-2."></a></h3>
<p>Computers use <em>a fixed number of bits</em> to represent a piece of data, which could be a number, a character, or others. An <em>n</em>-bit storage location can represent up to <code>2^<em>n</em></code> distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: <code>000</code>, <code>001</code>, <code>010</code>, <code>011</code>, <code>100</code>, <code>101</code>, <code>110</code>, or <code>111</code>.
Hence, it can represent at most 8 distinct entities. You could use them
to represent numbers 0 to 7, numbers 8881 to 8888, characters 'A' to
'H', or up to 8 kinds of fruits like apple, orange, banana; or up to 8
kinds of animals like lion, tiger, etc.</p>
<p>Integers, for example, can be represented in 8-bit, 16-bit, 32-bit or
64-bit. You, as the programmer, choose an appropriate bit-length for
your integers. Your choice will impose a constraint on the range of
integers that can be represented. Besides the bit-length, an integer can
be represented in various <em>representation</em> schemes, e.g.,
unsigned vs. signed integers. An 8-bit unsigned integer has a range of 0
to 255, while an 8-bit signed integer has a range of -128 to 127 - both
representing 256 distinct numbers.</p>
<p>It is important to note that a computer memory location merely <em>stores a binary pattern</em>. It is entirely up to you, as the programmer, to decide on how these patterns are to be <em>interpreted</em>. For example, the 8-bit binary pattern <code>"0100 0001B"</code> can be interpreted as an unsigned integer <code>65</code>, or an ASCII character <code>'A'</code>,
or some secret information known only to you. In other words, you have
to first decide how to represent a piece of data in a binary pattern
before the binary patterns make sense. The interpretation of binary
pattern is called <em>data representation</em> or <em>encoding</em>.
Furthermore, it is important that the data representation schemes are
agreed-upon by all the parties, i.e., industrial standards need to be
formulated and strictly followed.</p>
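<p>To make the point about interpretation concrete, the Python sketch below reads the same 8-bit pattern in two ways. The variable names are illustrative only; <code>int.from_bytes</code> and <code>bytes.decode</code> are standard Python calls.</p>

```python
pattern = bytes([0b01000001])                  # the 8-bit pattern 0100 0001B

as_unsigned = int.from_bytes(pattern, "big")   # read as an unsigned integer
as_ascii = pattern.decode("ascii")             # read as an ASCII character

print(as_unsigned)   # 65
print(as_ascii)      # A
```

<p>The stored bits are identical in both cases; only the decoding rule applied to them differs.</p>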
<p>Once you have decided on the data representation scheme, certain
constraints, in particular the precision and range, will be imposed.
Hence, it is important to understand <em>data representation</em> to write <em>correct</em> and <em>high-performance</em> programs.</p>
<h5>Rosetta Stone and the Decipherment of Egyptian Hieroglyphs</h5>
<img class="image-float-right" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/Representation_RosettaStone.jpg">
<img class="image-float-right" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/Representation_hieroglyphs.jpg">
<p>Egyptian hieroglyphs (next-to-left) were used by the ancient
Egyptians since 4000BC.
Unfortunately, since 500AD, no one could read the ancient
Egyptian hieroglyphs any longer, until the re-discovery of the Rosetta Stone in
1799 by Napoleon's troops (during Napoleon's Egyptian invasion) near the
town of Rashid (Rosetta) in the Nile Delta.</p>
<p>The Rosetta Stone (left) is inscribed with a decree in 196BC on behalf of King Ptolemy V. The decree appears in <em>three</em> scripts: the upper text is <em>Ancient Egyptian hieroglyphs</em>, the middle portion Demotic script, and the lowest <em>Ancient Greek</em>.
Because it presents essentially the same text in all three scripts, and
Ancient Greek could still be understood, it provided the key to the
decipherment of the Egyptian hieroglyphs.</p>
<p>The moral of the story is that unless you know the encoding scheme, there is no way that you can decode the data.</p>
<p>Reference and images: Wikipedia.</p>
<h3 id="int_rep" class="float-clear">3. Integer Representation<a id="zz-3."></a></h3>
<p>Integers are <em>whole numbers</em> or <em>fixed-point numbers</em> with the radix point <em>fixed</em> after the least-significant bit. They stand in contrast to <em>real numbers</em> or <em>floating-point numbers</em>,
where the position of the radix point varies. It is important to
note that integers and floating-point numbers are treated differently in
computers. They have different representations and are processed
differently (e.g., floating-point numbers are processed in a so-called
floating-point processor). Floating-point numbers will be discussed
later.</p>
<p>Computers use <em>a fixed number of bits</em> to represent an
integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit,
32-bit or 64-bit. Besides bit-lengths, there are two representation
schemes for integers:</p>
<ol>
<li><em>Unsigned Integers</em>: can represent zero and positive integers.</li>
<li><em>Signed Integers</em>: can represent zero, positive and negative integers. Three representation schemes have been proposed for signed integers:
<ol>
<li>Sign-Magnitude representation</li>
<li>1's Complement representation</li>
<li>2's Complement representation</li>
</ol>
</li>
</ol>
<p>You, as the programmer, need to decide on the bit-length and
representation scheme for your integers, depending on your application's
requirements. For example, if you need a counter for counting a small
quantity from 0 up to 200, you might choose the 8-bit unsigned integer
scheme, as there are no negative numbers involved.</p>
<h4>3.1 <em>n</em>-bit Unsigned Integers<a id="zz-3.1"></a></h4>
<p>Unsigned integers can represent zero and positive integers, but not negative integers.
The value of an unsigned integer is interpreted as "<em>the magnitude of its underlying binary pattern</em>".</p>
<p><span class="line-heading">Example 1:</span> Suppose that <code><em>n</em>=8</code> and the binary pattern is<code> 0100 0001B</code>, the value of this unsigned integer is<code> 1×2^0 + 1×2^6 = 65D</code>.</p>
<p><span class="line-heading">Example 2:</span> Suppose that <code><em>n</em>=16</code> and the binary pattern is<code> 0001 0000 0000 1000B</code>, the value of this unsigned integer is<code> 1×2^3 + 1×2^12 = 4104D</code>.</p>
<p><span class="line-heading">Example 3:</span> Suppose that <code><em>n</em>=16</code> and the binary pattern is<code> 0000 0000 0000 0000B</code>, the value of this unsigned integer is <code>0</code>.</p>
<p>An <em>n</em>-bit pattern can represent <code>2^<em>n</em></code> distinct integers. An <em>n</em>-bit unsigned integer can represent integers from <code>0</code> to <code>(2^<em>n</em>)-1</code>, as tabulated below:</p>
<table class="table-zebra font-code" style="width:60%">
<tbody><tr>
<th>n</th>
<th>Minimum</th>
<th>Maximum</th>
</tr>
<tr>
<td class="text-center">8</td>
<td class="text-center">0</td>
<td>(2^8)-1 (=255)</td>
</tr>
<tr class="tr-alt">
<td class="text-center">16</td>
<td class="text-center">0</td>
<td>(2^16)-1 (=65,535)</td>
</tr>
<tr>
<td class="text-center">32</td>
<td class="text-center">0</td>
<td>(2^32)-1 (=4,294,967,295) (9+ digits)</td>
</tr>
<tr class="tr-alt">
<td class="text-center">64</td>
<td class="text-center">0</td>
<td>(2^64)-1 (=18,446,744,073,709,551,615) (19+ digits)</td>
</tr>
</tbody></table>
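<p>The unsigned interpretation and the ranges in the table can be checked with a short Python sketch. The helper names <code>decode_unsigned</code> and <code>unsigned_range</code> are hypothetical, introduced only for this example.</p>

```python
def decode_unsigned(bits):
    """Value of a bit pattern read as an unsigned integer."""
    return int(bits.replace(" ", ""), 2)

def unsigned_range(n):
    """Smallest and largest values of an n-bit unsigned integer."""
    return 0, 2 ** n - 1

print(decode_unsigned("0100 0001"))             # 65
print(decode_unsigned("0001 0000 0000 1000"))   # 4104
for n in (8, 16, 32, 64):
    print(n, unsigned_range(n))
```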
<h4>3.2 Signed Integers<a id="zz-3.2"></a></h4>
<p>Signed integers can represent zero, positive integers, as well as
negative integers. Three representation schemes are available for signed
integers:</p>
<ol>
<li>Sign-Magnitude representation</li>
<li>1's Complement representation</li>
<li>2's Complement representation</li>
</ol>
<p>In all the above three schemes, the <em>most-significant bit</em> (msb) is called the <em>sign bit</em>. The sign bit is used to represent the <em>sign</em> of the integer - with 0 for positive integers and 1 for negative integers. The <em>magnitude</em> of the integer, however, is interpreted differently in different schemes.</p>
<h4>3.3 <em>n</em>-bit Sign Integers in Sign-Magnitude Representation<a id="zz-3.3"></a></h4>
<p>In sign-magnitude representation:</p>
<ul>
<li>The most-significant bit (msb) is the <em>sign bit</em>, with value of 0 representing positive integer and 1 representing negative integer.</li>
<li>The remaining <em>n</em>-1 bits represent the magnitude (absolute
value) of the integer. The absolute value of the integer is interpreted
as "the magnitude of the (<em>n</em>-1)-bit binary pattern".</li>
</ul>
<p><span class="line-heading">Example 1</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 0 100 0001B</code>.<br>
Sign bit is <code>0</code> ⇒ positive<br>
Absolute value is <code>100 0001B = 65D</code><br>
Hence, the integer is <code>+65D</code></p>
<p><span class="line-heading">Example 2</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 1 000 0001B</code>.<br>
Sign bit is <code>1</code> ⇒ negative<br>
Absolute value is <code>000 0001B = 1D</code><br>
Hence, the integer is <code>-1D</code></p>
<p><span class="line-heading">Example 3</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 0 000 0000B</code>.<br>
Sign bit is <code>0</code> ⇒ positive<br>
Absolute value is <code>000 0000B = 0D</code><br>
Hence, the integer is <code>+0D</code></p>
<p><span class="line-heading">Example 4</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 1 000 0000B</code>.<br>
Sign bit is <code>1</code> ⇒ negative<br>
Absolute value is <code>000 0000B = 0D</code><br>
Hence, the integer is <code>-0D</code></p>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_SignMagnitude.png" alt="sign-magnitude representation">
<p>The drawbacks of sign-magnitude representation are:</p>
<ol>
<li>There are two representations (<code>0000 0000B</code> and <code>1000 0000B</code>) for the number zero, which could lead to inefficiency and confusion.</li>
<li>Positive and negative integers need to be processed separately.</li></ol>
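<p>A minimal Python sketch of sign-magnitude decoding follows. The function name <code>decode_sign_magnitude</code> is an assumption of this example, and it returns the sign and magnitude separately so that the two zero patterns stay distinguishable.</p>

```python
def decode_sign_magnitude(bits):
    """Decode a sign-magnitude pattern into (sign, magnitude)."""
    bits = bits.replace(" ", "")
    sign = "-" if bits[0] == "1" else "+"   # msb is the sign bit
    magnitude = int(bits[1:], 2)            # remaining n-1 bits
    return sign, magnitude

print(decode_sign_magnitude("0 100 0001"))   # ('+', 65)
print(decode_sign_magnitude("1 000 0001"))   # ('-', 1)
print(decode_sign_magnitude("1 000 0000"))   # ('-', 0)
```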
<h4>3.4 <em>n</em>-bit Sign Integers in 1's Complement Representation<a id="zz-3.4"></a></h4>
<p>In 1's complement representation:</p>
<ul>
<li>Again, the most significant bit (msb) is the <em>sign bit</em>, with value of 0 representing positive integers and 1 representing negative integers.</li>
<li>The remaining <em>n</em>-1 bits represent the magnitude of the integer, as follows:
<ul>
<li>for positive integers, the absolute value of the integer is equal to "the magnitude of the (<em>n</em>-1)-bit binary pattern".</li>
<li>for negative integers, the absolute value of the integer is equal to "the magnitude of the <em>complement</em> (<em>inverse</em>) of the (<em>n</em>-1)-bit binary pattern" (hence called 1's complement).</li>
</ul>
</li>
</ul>
<p><span class="line-heading">Example 1</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 0 100 0001B</code>.<br>
Sign bit is <code>0</code> ⇒ positive<br>
Absolute value is <code>100 0001B = 65D</code><br>
Hence, the integer is <code>+65D</code></p>
<p><span class="line-heading">Example 2</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 1 000 0001B</code>.<br>
Sign bit is <code>1</code> ⇒ negative<br>
Absolute value is the complement of <code>000 0001B</code>, i.e., <code>111 1110B = 126D</code><br>
Hence, the integer is <code>-126D</code></p>
<p><span class="line-heading">Example 3</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 0 000 0000B</code>.<br>
Sign bit is <code>0</code> ⇒ positive<br>
Absolute value is <code>000 0000B = 0D</code><br>
Hence, the integer is <code>+0D</code></p>
<p><span class="line-heading">Example 4</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 1 111 1111B</code>.<br>
Sign bit is <code>1</code> ⇒ negative<br>
Absolute value is the complement of <code>111 1111B</code>, i.e., <code>000 0000B = 0D</code><br>
Hence, the integer is <code>-0D</code></p>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_OneComplement.png" alt="1's complement">
<p>Again, the drawbacks are:</p>
<ol>
<li>There are two representations (<code>0000 0000B</code> and <code>1111 1111B</code>) for zero.</li>
<li>The positive integers and negative integers need to be processed separately.</li>
</ol>
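<p>The 1's complement rule can be sketched in Python as follows. The function name <code>decode_ones_complement</code> is hypothetical. As with sign-magnitude, the sketch returns the sign separately so that <code>+0</code> and <code>-0</code> remain visible.</p>

```python
def decode_ones_complement(bits):
    """Decode a 1's complement pattern into (sign, magnitude)."""
    bits = bits.replace(" ", "")
    if bits[0] == "0":
        return "+", int(bits[1:], 2)
    # Negative: invert every one of the remaining n-1 bits.
    inverted = "".join("1" if b == "0" else "0" for b in bits[1:])
    return "-", int(inverted, 2)

print(decode_ones_complement("1 000 0001"))   # ('-', 126)
print(decode_ones_complement("1 111 1111"))   # ('-', 0)
```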
<h4>3.5 <em>n</em>-bit Sign Integers in 2's Complement Representation<a id="zz-3.5"></a></h4>
<p>In 2's complement representation:</p>
<ul>
<li>Again, the most significant bit (msb) is the <em>sign bit</em>, with value of 0 representing positive integers and 1 representing negative integers.</li>
<li>The remaining <em>n</em>-1 bits represent the magnitude of the integer, as follows:
<ul>
<li>for positive integers, the absolute value of the integer is equal to "the magnitude of the (<em>n</em>-1)-bit binary pattern".</li>
<li>for negative integers, the absolute value of the integer is equal to "the magnitude of the <em>complement</em> of the (<em>n</em>-1)-bit binary pattern <em>plus one</em>" (hence called 2's complement).</li>
</ul>
</li>
</ul>
<p><span class="line-heading">Example 1</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 0 100 0001B</code>.<br>
Sign bit is <code>0</code> ⇒ positive<br>
Absolute value is <code>100 0001B = 65D</code><br>
Hence, the integer is <code>+65D</code></p>
<p><span class="line-heading">Example 2</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 1 000 0001B</code>.<br>
Sign bit is <code>1</code> ⇒ negative<br>
Absolute value is the complement of <code>000 0001B</code> plus <code>1</code>, i.e., <code>111 1110B + 1B = 127D</code><br>
Hence, the integer is <code>-127D</code></p>
<p><span class="line-heading">Example 3</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 0 000 0000B</code>.<br>
Sign bit is <code>0</code> ⇒ positive<br>
Absolute value is <code>000 0000B = 0D</code><br>
Hence, the integer is <code>+0D</code></p>
<p><span class="line-heading">Example 4</span>: Suppose that <code><em>n</em>=8</code> and the binary representation is<code> 1 111 1111B</code>.<br>
Sign bit is <code>1</code> ⇒ negative<br>
Absolute value is the complement of <code>111 1111B</code> plus <code>1</code>, i.e., <code>000 0000B + 1B = 1D</code><br>
Hence, the integer is <code>-1D</code></p>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_TwoComplement.png" alt="2's complement">
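<p>The 2's complement rule can be sketched in both directions. The Python helper names <code>decode_twos_complement</code> and <code>encode_twos_complement</code> are assumptions of this example; the encoder assumes the value fits in <code>n</code> bits.</p>

```python
def decode_twos_complement(bits):
    """Decode a 2's complement bit pattern into a Python int."""
    bits = bits.replace(" ", "")
    if bits[0] == "0":
        return int(bits[1:], 2)
    # Negative: invert the remaining n-1 bits, then add one.
    inverted = "".join("1" if b == "0" else "0" for b in bits[1:])
    return -(int(inverted, 2) + 1)

def encode_twos_complement(value, n=8):
    """Encode an int as an n-bit 2's complement pattern (value assumed in range)."""
    return format(value % 2 ** n, f"0{n}b")   # wrap negatives into 0 .. 2^n - 1

print(decode_twos_complement("1 000 0001"))   # -127
print(decode_twos_complement("1 111 1111"))   # -1
print(encode_twos_complement(-5))             # 11111011
```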
<h4>3.6 Computers use 2's Complement Representation for Signed Integers<a id="zz-3.6"></a></h4>
<p>We have discussed three representations for signed integers:
sign-magnitude, 1's complement and 2's complement. Computers use 2's
complement in representing signed integers. This is because:</p>
<ol>
<li>There is only one representation for the number zero in 2's
complement, instead of two representations in sign-magnitude and 1's
complement.</li>
<li>Positive and negative integers can be treated together in addition
and subtraction. Subtraction can be carried out using the "addition
logic".</li>
</ol>
<p><span class="line-heading">Example 1: Addition of Two Positive Integers:</span> Suppose that<code> n=8, 65D + 5D = 70D</code></p>
<pre class="color-example">65D → 0100 0001B
5D → 0000 0101B(+
0100 0110B → 70D (OK)</pre>
<p><span class="line-heading">Example 2: Subtraction is treated as Addition of a Positive and a Negative Integer:</span> Suppose that<code> n=8, 65D - 5D = 65D + (-5D) = 60D</code></p>
<pre class="color-example">65D → 0100 0001B
-5D → 1111 1011B(+
0011 1100B → 60D (discard carry - OK)</pre>
<p><span class="line-heading">Example 3: Addition of Two Negative Integers:</span> Suppose that<code> n=8, -65D - 5D = (-65D) + (-5D) = -70D</code></p>
<pre class="color-example">-65D → 1011 1111B
-5D → 1111 1011B(+
1011 1010B → -70D (discard carry - OK)</pre>
<p>Because of the <em>fixed precision</em> (i.e., <em>fixed number of bits</em>), an <em>n</em>-bit 2's complement signed integer has a certain range. For example, for <code><em>n</em>=8</code>, the range of 2's complement signed integers is <code>-128</code> to <code>+127</code>. During addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether <em>overflow</em> or <em>underflow</em> has occurred.</p>
<p><span class="line-heading">Example 4: Overflow:</span> Suppose that<code> n=8, 127D + 2D = 129D</code> (overflow - beyond the range)</p>
<pre class="color-example">127D → 0111 1111B
2D → 0000 0010B(+
1000 0001B → -127D (wrong)</pre>
<p><span class="line-heading">Example 5: Underflow:</span> Suppose that<code> n=8, -125D - 5D = -130D</code> (underflow - below the range)</p>
<pre class="color-example">-125D → 1000 0011B
-5D → 1111 1011B(+
0111 1110B → +126D (wrong)</pre>
<p>The following diagram explains how the 2's complement works. By re-arranging the number line, values from <code>-128</code> to <code>+127</code> are represented contiguously by ignoring the carry bit.</p>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_SignedIntegers.gif" alt="signed integer">
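<p>The five addition examples above can be reproduced with a short simulation of <em>n</em>-bit addition. This is a sketch only: the function name <code>add_twos_complement</code> is hypothetical, both operands are assumed to lie within the <em>n</em>-bit signed range, and overflow is detected by comparing against the exact sum rather than by inspecting carry bits as hardware would.</p>

```python
def add_twos_complement(a, b, n=8):
    """Add two n-bit signed integers as n-bit hardware would.

    Returns (result, overflowed). The carry out of the top bit is discarded.
    """
    modulus = 2 ** n
    raw = (a + b) % modulus                  # keep only the low n bits
    # Patterns with the sign bit set stand for negative values.
    result = raw - modulus if raw >= modulus // 2 else raw
    overflowed = result != a + b             # exact sum fell outside the range
    return result, overflowed

print(add_twos_complement(65, -5))     # (60, False)
print(add_twos_complement(127, 2))     # (-127, True)
print(add_twos_complement(-125, -5))   # (126, True)
```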
<h4>3.7 Range of <em>n</em>-bit 2's Complement Signed Integers<a id="zz-3.7"></a></h4>
<p>An <em>n</em>-bit 2's complement signed integer can represent integers from <code>-2^(<em>n</em>-1)</code> to <code>+2^(<em>n</em>-1)-1</code>,
as tabulated below. Take note that the scheme can represent all the integers
within the range, without any gap. In other words, there are no missing
integers within the supported range.</p>
<table class="table-zebra font-code" style="width:80%">
<tbody><tr>
<th>n</th>
<th>minimum</th>
<th>maximum</th>
</tr>
<tr>
<td class="text-center">8</td>
<td>-(2^7) (=-128)</td>
<td>+(2^7)-1 (=+127)</td>
</tr>
<tr>
<td class="text-center">16</td>
<td>-(2^15) (=-32,768)</td>
<td>+(2^15)-1 (=+32,767)</td>
</tr>
<tr>
<td class="text-center">32</td>
<td>-(2^31) (=-2,147,483,648)</td>
<td>+(2^31)-1 (=+2,147,483,647)(9+ digits)</td>
</tr>
<tr>
<td class="text-center">64</td>
<td>-(2^63) (=-9,223,372,036,854,775,808)</td>
<td>+(2^63)-1 (=+9,223,372,036,854,775,807)(18+ digits) </td>
</tr>
</tbody></table>
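<p>The table can be reproduced with a short Java sketch. The helper names <code>min</code> and <code>max</code> are assumptions made for illustration; the results are compared against the JDK constants for <code>byte</code>, <code>short</code>, <code>int</code> and <code>long</code>.</p>

```java
class TwosComplementRange {
    // Smallest n-bit 2's complement value, -2^(n-1), for 1 <= n <= 64.
    static long min(int n) {
        return -1L << (n - 1);
    }

    // Largest n-bit 2's complement value, +2^(n-1)-1, for 1 <= n <= 64.
    static long max(int n) {
        return ~(-1L << (n - 1));
    }

    public static void main(String[] args) {
        for (int n : new int[] {8, 16, 32, 64}) {
            System.out.println(n + ": " + min(n) + " to " + max(n));
        }
        System.out.println(min(32) == Integer.MIN_VALUE); // prints true
        System.out.println(max(64) == Long.MAX_VALUE);    // prints true
    }
}
```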
<h4>3.8 Decoding 2's Complement Numbers<a id="zz-3.8"></a></h4>
<ol>
<li>Check the <em>sign bit</em> (denoted as <code>S</code>).</li>
<li>If <code>S=0</code>, the number is positive and its absolute value is the binary value of the remaining <em>n</em>-1 bits.</li>
<li>If <code>S=1</code>, the number is negative. You could "invert the <em>n</em>-1 bits and add 1" to get the absolute value of the negative number.<br>
Alternatively,
you could scan the remaining <em>n</em>-1 bits from the right
(least-significant bit). Look for the first occurrence of 1. Flip all
the bits to the left of that first occurrence of 1. The flipped pattern
gives the absolute value. For example,
<pre class="color-example">n = 8, bit pattern = 1 100 0100B
S = 1 → negative
Scan from the right and flip all the bits to the left of the first occurrence of 1 ⇒ <span class="underline">011 1</span>100B = 60D
Hence, the value is -60D</pre>
</li>
</ol>
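<p>The decoding steps above can be sketched in Java as follows. The method <code>decode</code> is a hypothetical helper, limited here to patterns of at most 31 bits so that the intermediate values fit in an <code>int</code>. It inverts all <em>n</em> bits (the sign bit becomes 0) and adds 1, which yields the same absolute value as inverting the remaining <em>n</em>-1 bits.</p>

```java
class TwosComplementDecoder {
    // Decodes an n-bit 2's complement pattern (1 <= n <= 31) given as a
    // string of '0'/'1' characters; spaces are ignored.
    static int decode(String pattern) {
        String bits = pattern.replace(" ", "");
        int n = bits.length();
        int raw = Integer.parseInt(bits, 2);   // the pattern read as unsigned
        if (bits.charAt(0) == '0') {
            return raw;                        // S=0: positive
        }
        int mask = (1 << n) - 1;               // n one-bits
        int magnitude = (~raw & mask) + 1;     // invert, then add 1
        return -magnitude;                     // S=1: negative
    }

    public static void main(String[] args) {
        System.out.println(decode("1 100 0100")); // prints -60
        System.out.println(decode("1000 0000"));  // prints -128
    }
}
```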
<h4>3.9 Big Endian vs. Little Endian<a id="zz-3.9"></a></h4>
<p>Modern computers store one byte of data in each memory address or
location, i.e., byte addressable memory. A 32-bit integer is,
therefore, stored in 4 memory addresses.</p>
<p>The term "Endian" refers to the <em>order</em> of storing bytes in
computer memory. In the "Big Endian" scheme, the most significant byte is
stored first in the lowest memory address (or big in first), while
"Little Endian" stores the least significant byte in the lowest memory
address.</p>
<p>For example, the 32-bit integer 12345678H (305419896<sub>10</sub>)
is stored as 12H 34H 56H 78H in big endian; and 78H 56H 34H 12H in
little endian. A 16-bit integer stored as the bytes 00H 01H is interpreted as 0001H in big
endian, and 0100H in little endian.</p>
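<p>A minimal Java sketch of the two byte orders, using the JDK classes <code>java.nio.ByteBuffer</code> and <code>java.nio.ByteOrder</code>. The helper names <code>store</code> and <code>load16</code> are illustrative only.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    // Returns the 4 bytes of value, lowest memory address first.
    static byte[] store(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Reads 2 bytes (lowest address first) as an unsigned 16-bit integer.
    static int load16(byte lowAddr, byte highAddr, ByteOrder order) {
        byte[] mem = {lowAddr, highAddr};
        return ByteBuffer.wrap(mem).order(order).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        for (byte b : store(0x12345678, ByteOrder.LITTLE_ENDIAN)) {
            System.out.printf("%02XH ", b); // 78H 56H 34H 12H
        }
        System.out.println();
        System.out.printf("%04XH%n", load16((byte) 0x00, (byte) 0x01, ByteOrder.BIG_ENDIAN));    // 0001H
        System.out.printf("%04XH%n", load16((byte) 0x00, (byte) 0x01, ByteOrder.LITTLE_ENDIAN)); // 0100H
    }
}
```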
<h4>3.10 Exercise (Integer Representation)<a id="zz-3.10"></a></h4>
<ol>
<li>What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integer, in "unsigned" and "signed" representation?</li>
<li>Give the value of <code>88</code>, <code>0</code>, <code>1</code>, <code>127</code>, and <code>255</code> in 8-bit unsigned representation.</li>
<li>Give the value of <code>+88</code>, <code>-88</code> , <code>-1</code>, <code>0</code>, <code>+1</code>, <code>-128</code>, and <code>+127</code> in 8-bit 2's complement signed representation.</li>
<li>Give the value of <code>+88</code>, <code>-88</code> , <code>-1</code>, <code>0</code>, <code>+1</code>, <code>-127</code>, and <code>+127</code> in 8-bit sign-magnitude representation.</li>
<li>Give the value of <code>+88</code>, <code>-88</code> , <code>-1</code>, <code>0</code>, <code>+1</code>, <code>-127</code> and <code>+127</code> in 8-bit 1's complement representation.</li>
</ol>
<h5>Answers</h5>
<ol>
<li>The range of unsigned <em>n</em>-bit integers is <code>[0, 2^n - 1]</code>. The range of <em>n</em>-bit 2's complement signed integers is <code>[-2^(n-1), +2^(n-1)-1]</code>.</li>
<li><code>88 (0101 1000)</code>, <code>0 (0000 0000)</code>, <code>1 (0000 0001)</code>, <code>127 (0111 1111)</code>, <code>255 (1111 1111)</code>.</li>
<li><code>+88 (0101 1000)</code>, <code>-88 (1010 1000)</code>, <code>-1 (1111 1111)</code>, <code>0 (0000 0000)</code>, <code>+1 (0000 0001)</code>, <code>-128 (1000 0000)</code>, <code>+127 (0111 1111)</code>.</li>
<li><code>+88 (0101 1000)</code>, <code>-88 (1101 1000)</code>, <code>-1 (1000 0001)</code>, <code>0 (0000 0000 or 1000 0000)</code>, <code>+1 (0000 0001)</code>, <code>-127 (1111 1111)</code>, <code>+127 (0111 1111)</code>.</li>
<li><code>+88 (0101 1000)</code>, <code>-88 (1010 0111)</code>, <code>-1 (1111 1110)</code>, <code>0 (0000 0000 or 1111 1111)</code>, <code>+1 (0000 0001)</code>, <code>-127 (1000 0000)</code>, <code>+127 (0111 1111)</code>.</li>
</ol>
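<p>The answers to exercises 3 to 5 can be checked with a small Java sketch. The three encoder methods are hypothetical helpers written for this purpose; each returns the 8-bit pattern as a string without spaces, and the valid input ranges are noted in the comments.</p>

```java
class EightBitEncodings {
    // Formats the low 8 bits of pattern as an 8-character binary string.
    static String bits8(int pattern) {
        String s = Integer.toBinaryString(pattern & 0xFF);
        return String.format("%8s", s).replace(' ', '0');
    }

    // 2's complement, valid for -128..+127.
    static String twosComplement(int v) {
        return bits8(v);
    }

    // Sign-magnitude, valid for -127..+127 (zero is encoded as +0).
    static String signMagnitude(int v) {
        return bits8(v < 0 ? (0x80 | -v) : v);
    }

    // 1's complement, valid for -127..+127 (zero is encoded as +0).
    static String onesComplement(int v) {
        return bits8(v < 0 ? ~(-v) : v);
    }

    public static void main(String[] args) {
        System.out.println(twosComplement(-88)); // prints 10101000
        System.out.println(signMagnitude(-88));  // prints 11011000
        System.out.println(onesComplement(-88)); // prints 10100111
    }
}
```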
<h3 id="fp_rep">4. Floating-Point Number Representation<a id="zz-4."></a></h3>
<p>A floating-point number (or real number) can represent a very large (<code>1.23×10^88</code>) or a very small (<code>1.23×10^-88</code>) value. It can also represent a very large negative number (<code>-1.23×10^88</code>) and a very small negative number (<code>-1.23×10^-88</code>), as well as zero, as illustrated:</p>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/Representation_FloatingPointNumbers.png">
<p>A floating-point number is typically expressed in the scientific notation, with a <em>fraction</em> (<code>F</code>), and an <em>exponent</em> (<code>E</code>) of a certain <em>radix</em> (<code>r</code>), in the form of <code>F×r^E</code>. Decimal numbers use radix of 10 (<code>F×10^E</code>); while binary numbers use radix of 2 (<code>F×2^E</code>).</p>
<p>Representation of floating point number is not unique. For example, the number <code>55.66</code> can be represented as <code>5.566×10^1</code>, <code>0.5566×10^2</code>, <code>0.05566×10^3</code>, and so on. The fractional part can be <em>normalized</em>. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number <code>123.4567</code> can be normalized as <code>1.234567×10^2</code>; binary number <code>1010.1011B</code> can be normalized as <code>1.0101011B×2^3</code>.</p>
<p>It is important to note that floating-point numbers suffer from <em>loss of precision</em> when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are <em>infinitely</em> many real numbers (even within a small range, say 0.0 to 0.1). On the other hand, an <em>n</em>-bit binary pattern can represent only a <em>finite</em> <code>2^<em>n</em></code>
distinct numbers. Hence, not all the real numbers can be represented.
The nearest approximation is used instead, resulting in loss of
accuracy.</p>
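<p>The loss of precision is easy to observe in Java. The sketch below (class and method names are illustrative) uses <code>java.math.BigDecimal</code> to show the exact value actually stored for the literal <code>0.1</code>, and compares a sum of approximations with the expected result.</p>

```java
import java.math.BigDecimal;

class PrecisionDemo {
    // Exact decimal expansion of the value stored in the given double.
    static BigDecimal storedValue(double d) {
        return new BigDecimal(d);
    }

    public static void main(String[] args) {
        System.out.println(storedValue(0.1));   // the nearest double, slightly above 0.1
        System.out.println(0.1 + 0.2 == 0.3);   // prints false
        System.out.println(0.5 + 0.25 == 0.75); // prints true (all three are exactly representable)
    }
}
```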
<p>It is also important to note that floating-point arithmetic is
much less efficient than integer arithmetic. It can be sped up with a
dedicated <em>floating-point co-processor</em>. Hence, use integers if your application does not require floating-point numbers.</p>
<p>In computers, floating-point numbers are represented in scientific notation of <em>fraction</em> (<code>F</code>) and <em>exponent</em> (<code>E</code>) with a <em>radix</em> of 2, in the form of <code>F×2^E</code>. Both <code>E</code> and <code>F</code>
can be positive as well as negative. Modern computers adopt IEEE 754
standard for representing floating-point numbers. There are two
representation schemes: 32-bit single-precision and 64-bit
double-precision.</p>
<h4>4.1 IEEE-754 32-bit Single-Precision Floating-Point Numbers<a id="zz-4.1"></a></h4>
<p>In 32-bit single-precision floating-point representation:</p>
<ul>
<li>The most significant bit is the <em>sign bit</em> (<code>S</code>), with 0 for positive numbers and 1 for negative numbers.</li>
<li>The following 8 bits represent <em>exponent</em> (<code>E</code>).</li>
<li>The remaining 23 bits represent <em>fraction</em> (<code>F</code>).</li>
</ul>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_Float.gif" alt="float">
<h5>Normalized Form</h5>
<p>Let's illustrate with an example. Suppose that the 32-bit pattern is <code><span class="underline">1</span> <span class="underline">1000 0001</span> <span class="underline">011 0000 0000 0000 0000 0000</span></code>, with:</p>
<ul>
<li><code>S = 1</code></li>
<li><code>E = 1000 0001</code></li>
<li><code>F = 011 0000 0000 0000 0000 0000</code></li>
</ul>
<p>In the <em>normalized form</em>, the actual fraction is normalized with an implicit leading 1 in the form of <code>1.F</code>. In this example, the actual fraction is <code>1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D</code>.</p>
<p>The sign bit represents the sign of the number, with <code>S=0</code> for positive and <code>S=1</code> for negative number. In this example with <code>S=1</code>, this is a negative number, i.e., <code>-1.375D</code>.</p>
<p>In normalized form, the actual exponent is <code>E-127</code>
(so-called excess-127 or bias-127). This is because we need to represent
both positive and negative exponent. With an 8-bit E, ranging from 0 to
255, the excess-127 scheme could provide actual exponent of -127 to
128. In this example, <code>E-127=129-127=2D</code>.</p>
<p>Hence, the number represented is <code>-1.375×2^2=-5.5D</code>.</p>
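<p>Going the other way, the three fields of a <code>float</code> can be extracted in Java with <code>Float.floatToIntBits()</code>. The following is a minimal sketch; the class and its helper methods are assumptions made for illustration.</p>

```java
class FloatFields {
    // Left-pads a binary string with zeros to the given width.
    static String pad(String s, int width) {
        return String.format("%" + width + "s", s).replace(' ', '0');
    }

    // Returns "S E F": 1 sign bit, 8 exponent bits, 23 fraction bits.
    static String fields(float x) {
        int bits = Float.floatToIntBits(x);
        int s = bits >>> 31;
        int e = (bits >>> 23) & 0xFF;
        int f = bits & 0x7FFFFF;
        return s + " " + pad(Integer.toBinaryString(e), 8)
                 + " " + pad(Integer.toBinaryString(f), 23);
    }

    public static void main(String[] args) {
        System.out.println(fields(-5.5f)); // prints 1 10000001 01100000000000000000000
    }
}
```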
<h5>De-Normalized Form</h5>
<p>Normalized form has a serious problem: with an implicit leading 1 for
the fraction, it cannot represent the number zero! Convince yourself of
this!</p>
<p>De-normalized form was devised to represent zero and other numbers.</p>
<p>For <code>E=0</code>, the numbers are in the de-normalized form. An
implicit leading 0 (instead of 1) is used for the fraction; and the
actual exponent is always <code>-126</code>. Hence, the number zero can be represented with <code>E=0</code> and <code>F=0</code> (because <code>0.0×2^-126=0</code>).</p>
<p>We can also represent very small positive and negative numbers in de-normalized form with <code>E=0</code>. For example, suppose that <code>S=1</code>, <code>E=0</code>, and <code>F=011 0000 0000 0000 0000 0000</code>. The actual fraction is <code>0.011=1×2^-2+1×2^-3=0.375D</code>. Since <code>S=1</code>, it is a negative number. With <code>E=0</code>, the actual exponent is <code>-126</code>. Hence the number is <code>-0.375×2^-126 = -4.4×10^-39</code>, which is an extremely small negative number (close to zero).</p>
<h5>Summary</h5>
<p>In summary, the value (<code>N</code>) is calculated as follows:</p>
<ul>
<li>For <code>1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127)</code>. These numbers are in the so-called <em>normalized</em> form. The sign bit represents the sign of the number. The fractional part (<code>1.F</code>) is normalized with an implicit leading 1. The exponent is biased by (or in excess of) <code>127</code>, so as to represent both positive and negative exponents. The range of the actual exponent is <code>-126</code> to <code>+127</code>.</li>
<li>For <code>E = 0, N = (-1)^S × 0.F × 2^(-126)</code>. These numbers are in the so-called <em>denormalized</em> form. The exponent of <code>2^-126</code> evaluates to a very small number. Denormalized form is needed to represent zero (with <code>F=0</code> and <code>E=0</code>). It can also represent very small positive and negative numbers close to zero.</li>
<li>For <code>E = 255</code>, it represents special values, such as <code>±INF</code> (positive and negative infinity) and <code>NaN</code> (not a number). This is beyond the scope of this article.</li>
</ul>
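<p>The summary above translates almost directly into code. The following Java sketch is a hand-written decoder for illustration only (the JDK method <code>Float.intBitsToFloat()</code> does the same job); it applies the two formulas for <code>E = 0</code> and <code>1 ≤ E ≤ 254</code>, and rejects <code>E = 255</code>. The arithmetic is done in <code>double</code>, which holds every 32-bit single-precision value exactly.</p>

```java
class SingleDecoder {
    // Decodes an IEEE-754 32-bit pattern using the formulas in the summary.
    static double decode(int bits) {
        int s = bits >>> 31;            // sign bit
        int e = (bits >>> 23) & 0xFF;   // 8-bit exponent field
        int f = bits & 0x7FFFFF;        // 23-bit fraction field
        if (e == 255) {
            throw new IllegalArgumentException("E = 255 (INF/NaN) is not handled in this sketch");
        }
        double sign = (s == 0) ? 1.0 : -1.0;
        double zeroPointF = f / 8388608.0;  // 0.F, i.e., F / 2^23
        if (e == 0) {
            return sign * Math.scalb(zeroPointF, -126);       // denormalized: 0.F × 2^-126
        }
        return sign * Math.scalb(1.0 + zeroPointF, e - 127);  // normalized: 1.F × 2^(E-127)
    }

    public static void main(String[] args) {
        System.out.println(decode(0xC0B00000)); // prints -5.5
        System.out.println(decode(0x40600000)); // prints 3.5
    }
}
```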
<p><span class="line-heading">Example 1:</span> Suppose that IEEE-754 32-bit floating-point representation pattern is <code><span class="underline">0</span> <span class="underline">10000000</span> <span class="underline">110 0000 0000 0000 0000 0000</span></code>.</p>
<pre class="color-example">Sign bit S = 0 ⇒ positive number
E = 1000 0000B = 128D (in normalized form)
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75D
The number is +1.75 × 2^(128-127) = +3.5D</pre>
<p><span class="line-heading">Example 2:</span> Suppose that IEEE-754 32-bit floating-point representation pattern is <code><span class="underline">1</span> <span class="underline">01111110</span> <span class="underline">100 0000 0000 0000 0000 0000</span></code>.</p>
<pre class="color-example">Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(126-127) = -0.75D</pre>
<p><span class="line-heading">Example 3:</span> Suppose that IEEE-754 32-bit floating-point representation pattern is <code><span class="underline">1</span> <span class="underline">01111110</span> <span class="underline">000 0000 0000 0000 0000 0001</span></code>.</p>
<pre class="color-example">Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.000 0000 0000 0000 0000 0001B (with an implicit leading 1) = 1 + 2^-23
The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625 (may not be exact in decimal!)</pre>
<p><span class="line-heading">Example 4 (De-Normalized Form):</span> Suppose that IEEE-754 32-bit floating-point representation pattern is <code><span class="underline">1</span> <span class="underline">00000000</span> <span class="underline">000 0000 0000 0000 0000 0001</span></code>.</p>
<pre class="color-example">Sign bit S = 1 ⇒ negative number
E = 0 (in de-normalized form)
Fraction is 0.000 0000 0000 0000 0000 0001B (with an implicit leading 0) = 1×2^-23
The number is -2^-23 × 2^(-126) = -2^(-149) ≈ -1.4×10^-45</pre>
<h4>4.2 Exercises (Floating-point Numbers)<a id="zz-4.2"></a></h4>
<ol>
<li>Compute the largest and smallest positive numbers that can be represented in the 32-bit normalized form.</li>
<li>Compute the largest and smallest negative numbers that can be represented in the 32-bit normalized form.</li>
<li>Repeat (1) for the 32-bit denormalized form.</li>
<li>Repeat (2) for the 32-bit denormalized form.</li>
</ol>
<h5>Hints:</h5>
<ol>
<li>Largest positive number: <code>S=0</code>, <code>E=1111 1110 (254)</code>, <code>F=111 1111 1111 1111 1111 1111</code>.<br>
Smallest positive number: <code>S=0</code>, <code>E=0000 0001 (1)</code>, <code>F=000 0000 0000 0000 0000 0000</code>.</li>
<li>Same as above, but <code>S=1</code>.</li>
<li>Largest positive number: <code>S=0</code>, <code>E=0</code>, <code>F=111 1111 1111 1111 1111 1111</code>.<br>
Smallest positive number: <code>S=0</code>, <code>E=0</code>, <code>F=000 0000 0000 0000 0000 0001</code>.</li>
<li>Same as above, but <code>S=1</code>.</li>
</ol>
<h5>Notes For Java Users</h5>
<p>You can use JDK methods <code>Float.intBitsToFloat(int bits)</code> or <code>Double.longBitsToDouble(long bits)</code> to create a single-precision 32-bit <code>float</code> or double-precision 64-bit <code>double</code> with the specific bit patterns, and print their values. For example,</p>
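<pre class="color-example">// 007F FFFFH: E=0, F all 1's (largest denormalized float)
System.out.println(Float.intBitsToFloat(0x7fffff));
// 000F FFFF FFFF FFFFH: E=0, F all 1's (largest denormalized double)
System.out.println(Double.longBitsToDouble(0xfffffffffffffL));</pre>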
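<p>The same JDK method can be used to check the hints for the exercises. The sketch below builds the four positive single-precision patterns and compares them with the constants <code>Float.MAX_VALUE</code>, <code>Float.MIN_NORMAL</code> and <code>Float.MIN_VALUE</code>; the class and method names are illustrative.</p>

```java
class SingleLimits {
    // S=0, E=254, F all 1's
    static float largestNormalized() {
        return Float.intBitsToFloat(0x7F7FFFFF);
    }

    // S=0, E=1, F=0
    static float smallestNormalized() {
        return Float.intBitsToFloat(0x00800000);
    }

    // S=0, E=0, F all 1's
    static float largestDenormalized() {
        return Float.intBitsToFloat(0x007FFFFF);
    }

    // S=0, E=0, F=0...01
    static float smallestDenormalized() {
        return Float.intBitsToFloat(0x00000001);
    }

    public static void main(String[] args) {
        System.out.println(largestNormalized() == Float.MAX_VALUE);     // prints true
        System.out.println(smallestNormalized() == Float.MIN_NORMAL);   // prints true
        System.out.println(smallestDenormalized() == Float.MIN_VALUE);  // prints true
        // The largest denormalized value is the float just below the smallest normalized one.
        System.out.println(largestDenormalized() == Math.nextDown(Float.MIN_NORMAL)); // prints true
    }
}
```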
<h4>4.3 IEEE-754 64-bit Double-Precision Floating-Point Numbers<a id="zz-4.3"></a></h4>
<p>The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:</p>
<ul>
<li>The most significant bit is the <em>sign bit</em> (<code>S</code>), with 0 for positive numbers and 1 for negative numbers.</li>
<li>The following 11 bits represent <em>exponent</em> (<code>E</code>).</li>
<li>The remaining 52 bits represent <em>fraction</em> (<code>F</code>).</li>
</ul>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_Double.gif" alt="double">
<p>The value (<code>N</code>) is calculated as follows:</p>
<ul>
<li>Normalized form: For <code>1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023)</code>.</li>
<li>Denormalized form: For <code>E = 0, N = (-1)^S × 0.F × 2^(-1022)</code>.</li>
<li>For <code>E = 2047</code>, <code>N</code> represents special values, such as <code>±INF</code> (infinity), <code>NaN</code> (not a number).</li>
</ul>
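<p>As a sketch of the 64-bit layout, the Java class below splits a <code>double</code> into its three fields with <code>Double.doubleToLongBits()</code>. The method names are assumptions for illustration. For normalized numbers, <code>E-1023</code> agrees with the JDK method <code>Math.getExponent()</code>.</p>

```java
class DoubleFields {
    // Sign bit S (bit 63).
    static int sign(double x) {
        return (int) (Double.doubleToLongBits(x) >>> 63);
    }

    // 11-bit biased exponent E (bits 62..52).
    static int biasedExponent(double x) {
        return (int) ((Double.doubleToLongBits(x) >>> 52) & 0x7FF);
    }

    // 52-bit fraction F (bits 51..0).
    static long fraction(double x) {
        return Double.doubleToLongBits(x) & 0xFFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        double x = -5.5; // -1.375 × 2^2
        System.out.println(sign(x));                  // prints 1
        System.out.println(biasedExponent(x) - 1023); // prints 2
        System.out.println(Long.toHexString(fraction(x))); // prints 6000000000000
    }
}
```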
<h4>4.4 More on Floating-Point Representation<a id="zz-4.4"></a></h4>
<p>There are three parts in the floating-point representation:</p>
<ul>
<li>The <em>sign bit</em> (<code>S</code>) is self-explanatory (0 for positive numbers and 1 for negative numbers).</li>
<li>For the <em>exponent</em> (<code>E</code>), a so-called <em>bias</em> (or <em>excess</em>)
is applied so as to represent both positive and negative exponents. The
bias is set at half of the range. For single precision with an 8-bit
exponent, the bias is 127 (or excess-127). For double precision with an
11-bit exponent, the bias is 1023 (or excess-1023).</li>
<li>The <em>fraction</em> (<code>F</code>) (also called the <em>mantissa</em> or <em>significand</em>)
is composed of an implicit leading bit (before the radix point) and the
fractional bits (after the radix point). The leading bit for normalized
numbers is 1; while the leading bit for denormalized numbers is 0.</li>
</ul>
<h5>Normalized Floating-Point Numbers</h5>
<p>In normalized form, the radix point is placed after the first non-zero digit, e.g., <code>9.8765D×10^-23D</code>, <code>1.001011B×2^11B</code>. For binary numbers, the leading bit is always 1 and need not be represented explicitly; this saves 1 bit of storage.</p>
<p>In IEEE 754's normalized form:</p>
<ul>
<li>For single-precision, <code>1 ≤ E ≤ 254</code> with excess of 127. Hence, the actual exponent is from <code>-126</code> to <code>+127</code>.
Negative exponents are used to represent small numbers (&lt; 1.0);
while positive exponents are used to represent large numbers (&gt; 1.0).<br>
<code>N = (-1)^S × 1.F × 2^(E-127)</code></li>
<li>For double-precision, <code>1 ≤ E ≤ 2046</code> with excess of 1023. The actual exponent is from <code>-1022</code> to <code>+1023</code>, and<br>
<code>N = (-1)^S × 1.F × 2^(E-1023)</code></li>
</ul>
<p>Take note that an n-bit pattern has a <em>finite</em> number of combinations (<code>=2^n</code>), which can represent only a <em>finite</em> number of distinct values. It is not possible to represent the <em>infinitely</em>
many numbers on the real axis (even a small range, say 0.0 to 1.0, contains
infinitely many numbers). That is, not all real numbers can be
accurately represented. Instead, the closest approximation is used,
which leads to <em>loss of accuracy</em>.</p>
<p>The <em>minimum</em> and <em>maximum</em> normalized floating-point numbers are:</p>
<table class="table-zebra font-code" style="width:80%">
<tbody><tr>
<th>Precision</th>
<th>Normalized N(min)</th>
<th>Normalized N(max)</th>
</tr>
<tr>
<td class="text-center">Single</td>
<td>0080 0000H<br>
0 00000001 00000000000000000000000B<br>
E = 1, F = 0<br>
N(min) = 1.0B × 2^-126<br>
(≈1.17549435 × 10^-38)</td>
<td>7F7F FFFFH<br>
0 11111110 11111111111111111111111B<br>
E = 254, F = 11111111111111111111111B<br>
N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127<br>
(≈3.4028235 × 10^38)</td>
</tr>
<tr>
<td class="text-center">Double</td>
<td>0010 0000 0000 0000H<br>
N(min) = 1.0B × 2^-1022<br>
(≈2.2250738585072014 × 10^-308)</td>
<td>7FEF FFFF FFFF FFFFH<br>
N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023<br>
(≈1.7976931348623157 × 10^308)</td>
</tr>
</tbody></table>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_RealNumbers.png" alt="real numbers">
<h5>Denormalized Floating-Point Numbers</h5>
<p>If <code>E = 0</code>, but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:</p>
<ul>
<li>For single-precision, <code>E = 0</code>,<br>
<code>N = (-1)^S × 0.F × 2^(-126)</code></li>
<li>For double-precision, <code>E = 0</code>,<br>
<code>N = (-1)^S × 0.F × 2^(-1022)</code></li>
</ul>
<p>Denormalized form can represent very small numbers close to zero,
and zero, which cannot be represented in normalized form, as shown in
the above figure.</p>
<p>The minimum and maximum of <em>denormalized floating-point numbers</em> are:</p>
<table class="table-zebra font-code" style="width:90%">
<tbody><tr>
<th>Precision</th>
<th>Denormalized D(min)</th>
<th>Denormalized D(max)</th>
</tr>
<tr>
<td class="text-center">Single</td>
<td>0000 0001H<br>
0 00000000 00000000000000000000001B<br>
E = 0, F = 00000000000000000000001B<br>
D(min) = 0.0...1 × 2^-126 = 1 × 2^-23 × 2^-126 = 2^-149<br>
(≈1.4 × 10^-45)</td>
<td>007F FFFFH<br>
0 00000000 11111111111111111111111B<br>
E = 0, F = 11111111111111111111111B<br>
D(max) = 0.1...1 × 2^-126 = (1-2^-23)×2^-126<br>
(≈1.1754942 × 10^-38)</td>
</tr>
<tr>
<td class="text-center">Double</td>
<td>0000 0000 0000 0001H<br>
D(min) = 0.0...1 × 2^-1022 = 1 × 2^-52 × 2^-1022 = 2^-1074<br>
(≈4.9 × 10^-324)</td>
<td>000F FFFF FFFF FFFFH<br>
D(max) = 0.1...1 × 2^-1022 = (1-2^-52)×2^-1022<br>
(≈2.2250738585072009 × 10^-308)</td>
</tr>
</tbody></table>
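<p>The double-precision limits can be cross-checked against the JDK constants. This is a minimal sketch with illustrative names; it relies on the fact that a denormalized pattern has an exponent field of 0, so the largest denormalized double has only its 52 fraction bits set.</p>

```java
class DoubleLimits {
    static final long MIN_NORMAL_BITS = 0x0010000000000000L; // E=1, F=0
    static final long MAX_NORMAL_BITS = 0x7FEFFFFFFFFFFFFFL; // E=2046, F all 1's
    static final long MIN_DENORM_BITS = 0x0000000000000001L; // E=0, F=0...01
    static final long MAX_DENORM_BITS = 0x000FFFFFFFFFFFFFL; // E=0, F all 1's

    // Returns the 11-bit exponent field of a 64-bit pattern.
    static int exponentField(long bits) {
        return (int) ((bits >>> 52) & 0x7FF);
    }

    public static void main(String[] args) {
        System.out.println(Double.longBitsToDouble(MIN_NORMAL_BITS) == Double.MIN_NORMAL); // prints true
        System.out.println(Double.longBitsToDouble(MAX_NORMAL_BITS) == Double.MAX_VALUE);  // prints true
        System.out.println(Double.longBitsToDouble(MIN_DENORM_BITS) == Double.MIN_VALUE);  // prints true
        // The largest denormalized double is the value just below the smallest normalized one.
        System.out.println(Double.longBitsToDouble(MAX_DENORM_BITS)
                == Math.nextDown(Double.MIN_NORMAL)); // prints true
    }
}
```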
<h5>Special Values</h5>
<p><strong>Zero</strong>: Zero cannot be represented in the normalized form, and must be represented in denormalized form with <code>E=0</code> and <code>F=0</code>. There are two representations for zero: <code>+0</code> with <code>S=0</code> and <code>-0</code> with <code>S=1</code>.</p>
<p><strong>Infinity</strong>: The values +infinity (e.g., <code>1/0</code>) and -infinity (e.g., <code>-1/0</code>) are represented with an exponent of all 1's (<code>E = 255</code> for single-precision and <code>E = 2047</code> for double-precision), <code>F=0</code>, and <code>S=0</code> (for <code>+INF</code>) and <code>S=1</code> (for <code>-INF</code>).</p>
<p><strong>Not a Number (NaN)</strong>: <code>NaN</code> denotes a value that cannot be represented as a real number (e.g. <code>0/0</code>). <code>NaN</code> is represented with an exponent of all 1's (<code>E = 255</code> for single-precision and <code>E = 2047</code> for double-precision) and any non-zero fraction.</p>
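<p>The rules for the special values can be sketched as a small classifier over the single-precision fields. The method <code>classify</code> is a hypothetical helper, not a JDK method. Note that in Java only floating-point division produces these values (e.g., <code>1.0f/0.0f</code>); integer division by zero throws an <code>ArithmeticException</code> instead.</p>

```java
class SpecialValues {
    // Classifies a 32-bit single-precision pattern by its E and F fields.
    static String classify(int bits) {
        int e = (bits >>> 23) & 0xFF;
        int f = bits & 0x7FFFFF;
        if (e == 255) {
            return (f == 0) ? "infinity" : "NaN";
        }
        if (e == 0) {
            return (f == 0) ? "zero" : "denormalized";
        }
        return "normalized";
    }

    public static void main(String[] args) {
        System.out.println(classify(Float.floatToIntBits(1.0f / 0.0f))); // prints infinity
        System.out.println(classify(Float.floatToIntBits(0.0f / 0.0f))); // prints NaN
        System.out.println(classify(Float.floatToIntBits(-0.0f)));       // prints zero
        System.out.println(0.0f == -0.0f);                               // prints true
    }
}
```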
<h3 id="charencoding">5. Character Encoding<a id="zz-5."></a></h3>
<p>In computer memory, characters are "encoded" (or "represented") using a
chosen "character encoding scheme" (aka "character set", "charset",
"character map", or "code page").</p>
<p>For example, in ASCII (as well as Latin1, Unicode, and many other character sets):</p>
<ul>
<li>code numbers <code>65D (41H)</code> to <code>90D (5AH)</code> represent <code>'A'</code> to <code>'Z'</code>, respectively.</li>
<li>code numbers <code>97D (61H)</code> to <code>122D (7AH)</code> represent <code>'a'</code> to <code>'z'</code>, respectively.</li>
<li>code numbers <code>48D (30H)</code> to <code>57D (39H)</code> represent <code>'0'</code> to <code>'9'</code>, respectively.</li>
</ul>
<p>It is important to note that the representation scheme must be known
before a binary pattern can be interpreted. E.g., the 8-bit pattern "<code>0100 0010B</code>" could represent anything under the sun, known only to the person who encoded it.</p>
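<p>For instance, the same 8-bit pattern can be read as a number or as a character. The Java sketch below (illustrative names) interprets <code>0100 0010B</code> both ways; Java's <code>char</code> uses Unicode, whose first 128 code numbers coincide with ASCII.</p>

```java
class InterpretDemo {
    // Reads the 8-bit pattern as an unsigned integer.
    static int asUnsigned(byte b) {
        return b & 0xFF;
    }

    // Reads the 8-bit pattern as a character code number.
    static char asCharacter(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte pattern = 0b0100_0010;
        System.out.println(asUnsigned(pattern));  // prints 66
        System.out.println(asCharacter(pattern)); // prints B
        System.out.println((int) 'A');            // prints 65
    }
}
```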
<p>The most commonly-used character encoding schemes are: 7-bit ASCII
(ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for Western European
characters, and Unicode (ISO/IEC 10646) for internationalization (i18n).</p>
<p>A 7-bit encoding scheme (such as ASCII) can represent 128 characters
and symbols. An 8-bit character encoding scheme (such as Latin-x) can
represent 256 characters and symbols; whereas a 16-bit encoding scheme
(such as Unicode UCS-2) can represent 65,536 characters and symbols.</p>
<h4 id="ASCII">5.1 7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)<a id="zz-5.1"></a></h4>
<ul>
<li>ASCII (American Standard Code for Information Interchange) is one of the earlier character coding schemes.</li>
<li>ASCII was originally a 7-bit code. It has been extended to 8-bit to
better utilize the 8-bit computer memory organization. (The 8th-bit was
originally used for <em>parity check</em> in the early computers.)</li>
<li>Code numbers <code>32D (20H)</code> to <code>126D (7EH)</code> are printable (displayable) characters as tabulated:
<table class="table-zebra font-code" style="width:60%;text-align:center">
<tbody><tr>
<th>Hex</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
<tr>
<th>2</th>
<td>SP</td><td>!</td><td>"</td>
<td>#</td><td>$</td><td>%</td><td>&amp;</td><td>'</td><td>(</td><td>)</td><td>*</td><td>+</td><td>,</td><td>-</td><td>.</td><td>/</td>
</tr>
<tr>
<th>3</th>
<td>0</td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td><td>8</td><td>9</td><td>:</td><td>;</td><td>&lt;</td><td>=</td><td>&gt;</td>
<td>?</td>
</tr>
<tr>
<th>4</th>
<td>@</td><td>A</td><td>B</td><td>C</td><td>D</td><td>E</td><td>F</td><td>G</td><td>H</td><td>I</td><td>J</td><td>K</td><td>L</td><td>M</td><td>N</td><td>O</td>
</tr>
<tr>
<th>5</th>
<td>P</td><td>Q</td><td>R</td><td>S</td><td>T</td><td>U</td><td>V</td><td>W</td><td>X</td><td>Y</td><td>Z</td><td>[</td><td>\</td><td>]</td><td>^</td><td>_</td>
</tr>
<tr>
<th>6</th>
<td>`</td><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td><td>f</td><td>g</td><td>h</td><td>i</td><td>j</td><td>k</td><td>l</td><td>m</td><td>n</td><td>o</td>
</tr>
<tr>
<th>7</th>
<td>p</td><td>q</td><td>r</td><td>s</td><td>t</td><td>u</td><td>v</td><td>w</td><td>x</td><td>y</td><td>z</td><td>{</td><td>|</td><td>}</td><td>~</td><td class="tr-alt"> </td>
</tr>
</tbody></table>
<ul>
<li>Code number <code>32D (20H)</code> is the <em>blank</em> or <em>space</em> character.</li>
<li><code>'0'</code> to <code>'9'</code>: <code>30H-39H (0011 0000B to 0011 1001B)</code> or <code>(0011 xxxxB</code> where <code>xxxx</code> is the equivalent integer value<code>)</code></li>
<li><code>'A'</code> to <code>'Z'</code>: <code>41H-5AH (0100 0001B to 0101 1010B)</code> or <code>(010x xxxxB)</code>. <code>'A'</code> to <code>'Z'</code> are continuous without gap.</li>
<li><code>'a'</code> to <code>'z'</code>: <code>61H-7AH (0110 0001B to 0111 1010B)</code> or <code>(011x xxxxB)</code>. <code>'a'</code> to <code>'z'</code>
are also continuous without gap. However, there is a gap between
uppercase and lowercase letters. To convert between upper and lowercase,
flip the value of bit-5.</li>
</ul>
</li>
<li>Code numbers <code>0D (00H)</code> to <code>31D (1FH)</code>, and <code>127D (7FH)</code>
are special control characters, which are non-printable
(non-displayable), as tabulated below. Many of these characters were
used in the early days for transmission control (e.g., STX, ETX) and
printer control (e.g., Form-Feed), which are now obsolete. The remaining
meaningful codes today are:
<ul>
<li><code>09H</code> for Tab (<code>'\t'</code>).</li>
<li><code>0AH</code> for Line-Feed or newline (LF or <code>'\n'</code>) and <code>0DH</code> for Carriage-Return (CR or <code>'\r'</code>), which are used as <em>line delimiter</em> (aka <em>line separator</em>, <em>end-of-line</em>) for text files. There is unfortunately no standard for line delimiter: Unixes and Mac use <code>0AH</code> (LF or "<code>\n</code>"), while Windows uses <code>0D0AH</code> (CR+LF or "<code>\r\n</code>"). Programming languages such as C/C++/Java (which were created on Unix) use <code>0AH</code> (LF or "<code>\n</code>").</li>
<li>In programming languages such as C/C++/Java, line-feed (<code>0AH</code>) is denoted as <code>'\n'</code>, carriage-return (<code>0DH</code>) as <code>'\r'</code>, tab (<code>09H</code>) as <code>'\t'</code>.</li>
</ul>
</li>
</ul>
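<p>The bit patterns of the printable characters described above lend themselves to simple bit tricks. The Java sketch below (hypothetical helper names) flips bit-5 to switch the case of a letter, and masks the low 4 bits to obtain the value of a digit character. Both helpers assume their input is in the stated range and do no validation.</p>

```java
class AsciiBits {
    // Switches between upper and lowercase by flipping bit-5 (value 20H).
    // Only meaningful for 'A'-'Z' and 'a'-'z'.
    static char flipCase(char letter) {
        return (char) (letter ^ 0x20);
    }

    // Returns the integer value of a digit character (0011 xxxxB -> xxxx).
    // Only meaningful for '0'-'9'.
    static int digitValue(char digit) {
        return digit & 0x0F;
    }

    public static void main(String[] args) {
        System.out.println(flipCase('A'));   // prints a
        System.out.println(flipCase('z'));   // prints Z
        System.out.println(digitValue('7')); // prints 7
    }
}
```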
<table class="table-zebra font-code" style="width:60%">
<colgroup><col class="tr-alt">
<col class="tr-alt">
<col>
<col>
<col class="tr-alt">
<col class="tr-alt">
<col>
<col>
</colgroup><tbody><tr>
<th>DEC</th><th>HEX</th><th colspan="2">Meaning</th><th>DEC</th><th>HEX</th><th colspan="2">Meaning</th>
</tr>
<tr>
<td>0</td><td>00</td><td>NUL</td><td>Null</td><td>17</td><td>11</td><td>DC1</td><td>Device Control 1</td>
</tr>
<tr>
<td>1</td><td>01</td><td>SOH</td><td>Start of Heading</td><td>18</td><td>12</td><td>DC2</td><td>Device Control 2</td>
</tr>
<tr>
<td>2</td><td>02</td><td>STX</td><td>Start of Text</td><td>19</td><td>13</td><td>DC3</td><td>Device Control 3</td>
</tr>
<tr>
<td>3</td><td>03</td><td>ETX</td><td>End of Text</td><td>20</td><td>14</td><td>DC4</td><td>Device Control 4</td>
</tr>
<tr>
<td>4</td><td>04</td><td>EOT</td><td>End of Transmission</td><td>21</td><td>15</td><td>NAK</td><td>Negative Ack.</td>
</tr>
<tr>
<td>5</td><td>05</td><td>ENQ</td><td>Enquiry</td><td>22</td><td>16</td><td>SYN</td><td>Sync. Idle</td>
</tr>
<tr>
<td>6</td><td>06</td><td>ACK</td><td>Acknowledgment</td><td>23</td><td>17</td><td>ETB</td><td>End of Transmission Block</td>
</tr>
<tr>
<td>7</td><td>07</td><td>BEL</td><td>Bell</td><td>24</td><td>18</td><td>CAN</td><td>Cancel</td>
</tr>
<tr>
<td>8</td><td>08</td><td>BS</td>
<td>Back Space <code>'\b'</code></td>
<td>25</td><td>19</td><td>EM</td><td>End of Medium</td>
</tr>
<tr>
<td><strong>9</strong></td><td><strong>09</strong></td><td><strong>HT</strong></td><td><strong>Horizontal Tab <code>'\t'</code></strong></td><td>26</td><td>1A</td><td>SUB</td><td>Substitute</td>
</tr>
<tr>
<td><strong>10</strong></td><td><strong>0A</strong></td><td><strong>LF</strong></td><td><strong>Line Feed <code>'\n'</code></strong></td><td>27</td><td>1B</td><td>ESC</td><td>Escape</td>
</tr>
<tr>
<td>11</td><td>0B</td><td>VT</td><td>Vertical Tab</td><td>28</td><td>1C</td><td>IS4</td><td>File Separator</td>
</tr>
<tr>
<td>12</td><td>0C</td><td>FF</td><td>Form Feed <code>'\f'</code></td><td>29</td><td>1D</td><td>IS3</td><td>Group Separator</td>
</tr>
<tr>
<td><strong>13</strong></td>
<td><strong>0D</strong></td>
<td><strong>CR</strong></td>
<td><strong>Carriage Return <code>'\r'</code></strong></td>
<td>30</td><td>1E</td><td>IS2</td><td>Record Separator</td>
</tr>
<tr>
<td>14</td><td>0E</td><td>SO</td><td>Shift Out</td><td>31</td><td>1F</td><td>IS1</td><td>Unit Separator</td>
</tr>
<tr>
<td>15</td><td>0F</td><td>SI</td><td>Shift In</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>16</td><td>10</td><td>DLE</td><td>Datalink Escape</td>
<td>127</td>
<td>7F</td>
<td>DEL</td>
<td>Delete</td>
</tr>
</tbody></table>
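<p>Because text files may use either LF or CR+LF as the line delimiter, programs that read text often accept both. A minimal Java sketch follows; the helper <code>lines</code> is an assumption for illustration, and the JDK method <code>System.lineSeparator()</code> reports the delimiter of the platform the program runs on.</p>

```java
class LineSplit {
    // Splits text on LF (0AH) or CR+LF (0DH 0AH).
    static String[] lines(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        String mixed = "first\r\nsecond\nthird";
        System.out.println(lines(mixed).length); // prints 3
        // Length of the platform delimiter: 1 for LF, 2 for CR+LF.
        System.out.println(System.lineSeparator().length());
    }
}
```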
<h4>5.2 8-bit Latin-1 (aka ISO/IEC 8859-1)<a id="zz-5.2"></a></h4>
<p>ISO/IEC-8859 is a <em>collection</em> of 8-bit character encoding standards for the western languages.</p>
<p>ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the
most commonly-used encoding scheme for Western European languages. It
has 191 printable characters from the Latin script, which covers
languages like English, German, Italian, Portuguese and Spanish. Latin-1
is backward compatible with the 7-bit US-ASCII code. That is, the
first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the
same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned.
Code numbers 160 (A0H) to 255 (FFH) are given as follows:</p>
<table class="table-zebra font-code" style="width:60%;text-align:center">
<tbody><tr>
<th>Hex</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
<tr>
<th>A</th>
<td>NBSP</td>
<td>¡</td><td>¢</td>
<td>£</td><td>¤</td><td>¥</td><td>¦</td><td>§</td><td>¨</td><td>©</td><td>ª</td><td>«</td><td>¬</td><td>SHY</td><td>®</td><td>¯</td>
</tr>
<tr>
<th>B</th>
<td>°</td><td>±</td><td>²</td><td>³</td><td>´</td><td>µ</td><td>¶</td><td>·</td><td>¸</td><td>¹</td><td>º</td><td>»</td><td>¼</td><td>½</td><td>¾</td>
<td>¿</td>
</tr>
<tr>
<th>C</th>
<td>À</td><td>Á</td><td>Â</td><td>Ã</td><td>Ä</td><td>Å</td><td>Æ</td><td>Ç</td><td>È</td><td>É</td><td>Ê</td><td>Ë</td><td>Ì</td><td>Í</td><td>Î</td><td>Ï</td>
</tr>
<tr>
<th>D</th>
<td>Ð</td><td>Ñ</td><td>Ò</td><td>Ó</td><td>Ô</td><td>Õ</td><td>Ö</td><td>×</td><td>Ø</td><td>Ù</td><td>Ú</td><td>Û</td><td>Ü</td><td>Ý</td><td>Þ</td><td>ß</td>
</tr>
<tr>
<th>E</th>
<td>à</td><td>á</td><td>â</td><td>ã</td><td>ä</td><td>å</td><td>æ</td><td>ç</td><td>è</td><td>é</td><td>ê</td><td>ë</td><td>ì</td><td>í</td><td>î</td><td>ï</td>
</tr>
<tr>
<th>F</th>
<td>ð</td><td>ñ</td><td>ò</td><td>ó</td><td>ô</td><td>õ</td><td>ö</td><td>÷</td><td>ø</td><td>ù</td><td>ú</td><td>û</td><td>ü</td><td>ý</td><td>þ</td><td>ÿ</td>
</tr>
</tbody></table>
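<p>The code numbers in the table can be looked up programmatically. The Java sketch below uses the JDK constant <code>StandardCharsets.ISO_8859_1</code>; the helper <code>codeNumber</code> is illustrative and assumes the character exists in Latin-1.</p>

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Returns the Latin-1 code number (0-255) of a character,
    // assuming the character is part of Latin-1.
    static int codeNumber(char c) {
        byte[] encoded = String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1);
        return encoded[0] & 0xFF;
    }

    public static void main(String[] args) {
        System.out.printf("%02XH%n", codeNumber('A'));      // prints 41H (same as US-ASCII)
        System.out.printf("%02XH%n", codeNumber('\u00E9')); // prints E9H (e with acute accent)
    }
}
```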
<p>ISO/IEC-8859 has 16 parts. Besides the most commonly-used Part 1,
Part 2 is meant for Central European (Polish, Czech, Hungarian, etc.),
Part 3 for South European (Turkish, etc.), Part 4 for North European
(Estonian, Latvian, etc.), Part 5 for Cyrillic, Part 6 for Arabic, Part 7
for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic,
Part 11 for Thai, Part 12 was abandoned, Part 13 for Baltic Rim, Part 14
for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern
European.</p>
<h4>5.3 Other 8-bit Extension of US-ASCII (ASCII Extensions)<a id="zz-5.3"></a></h4>
<p>Besides the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each other.</p>
<p><strong>ANSI </strong>(American National Standards Institute) (aka <strong>Windows-1252</strong>,
or Windows Codepage 1252): for Latin alphabets used in the legacy
DOS/Windows systems. It is a superset of ISO-8859-1 with code numbers
128 (80H) to 159 (9FH) assigned to displayable characters, such as
"smart" single-quotes and double-quotes. A common problem in web
browsers is that all the quotes and apostrophes (produced by "smart
quotes" in some Microsoft software) are replaced with question marks or
some strange symbols. This is because the document is labeled as
ISO-8859-1 (instead of Windows-1252), where these code numbers are
undefined. Most modern browsers and e-mail clients treat charset
ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling.</p>
<table class="table-zebra font-code" style="width:60%;text-align:center">
<tbody><tr>
<th>Hex</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
<tr>
<th>8</th>
<td>€</td>
<td> </td>
<td>‚</td>
<td>ƒ</td>
<td>„</td>
<td>…</td>
<td>†</td>
<td>‡</td>
<td>ˆ</td>
<td>‰</td>
<td>Š</td>
<td>‹</td>
<td>Œ</td>
<td> </td>
<td>Ž</td>
<td> </td>
</tr>
<tr>
<th>9</th>
<td> </td>
<td>‘</td>
<td>’</td>
<td>“</td>
<td>”</td>
<td>•</td>
<td>–</td>
<td>—</td>
<td> </td>
<td>™</td>
<td>š</td>
<td>›</td>
<td>œ</td>
<td> </td>
<td>ž</td>
<td>Ÿ</td>
</tr>
</tbody></table>
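<p>The mislabeling problem can be reproduced with a short sketch. It decodes the single byte <code>93H</code> (a left "smart" double-quote in the table above) under both charsets. The class and method names are made up for illustration, and the sketch assumes that your JDK supports the charset name <code>"windows-1252"</code>.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SmartQuoteSketch {
   // Decode the bytes with the given charset and return the first Unicode number.
   public static int decodeFirst(byte[] bytes, Charset charset) {
      return new String(bytes, charset).codePointAt(0);
   }

   public static void main(String[] args) {
      byte[] data = {(byte) 0x93};   // one byte taken from the 80H-9FH range

      // Assumption: "windows-1252" is available in this JDK.
      Charset cp1252 = Charset.forName("windows-1252");

      // Decoded as Windows-1252: the left double quotation mark
      System.out.printf("U+%04X%n", decodeFirst(data, cp1252));                       // U+201C
      // Decoded as ISO-8859-1: an invisible control code, hence the "strange symbols"
      System.out.printf("U+%04X%n", decodeFirst(data, StandardCharsets.ISO_8859_1));  // U+0093
   }
}
```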
<p><strong>EBCDIC</strong> (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.</p>
<h4>5.4 Unicode (aka ISO/IEC 10646 Universal Character Set)<a id="zz-5.4"></a></h4>
<p>Before Unicode, no single character encoding scheme could represent
characters in all languages. For example, Western European languages use several
encoding schemes (in the ISO-8859-x family). Even a single language
like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many
encoding schemes conflict with each other, i.e., the same code
number is assigned to different characters.</p>
<p>Unicode aims to provide a standard character encoding scheme, which
is universal, efficient, uniform and unambiguous. The Unicode standard is
maintained by a non-profit organization called the Unicode Consortium (@
<a href="http://www.unicode.org/">www.unicode.org</a>). The Unicode character set is also standardized as ISO/IEC 10646.</p>
<p>Unicode is backward compatible with the 7-bit US-ASCII and 8-bit
Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as
US-ASCII; and the first 256 characters are the same as Latin-1.</p>
<p>Unicode originally used 16 bits (called UCS-2 or Universal Character
Set - 2 byte), which can represent up to 65,536 characters. It has since
been expanded beyond 16 bits, and currently stands at 21 bits. The
range of the legal codes in ISO/IEC 10646 is now from U+0000 to
U+10FFFF (21 bits, or 1,114,112 code numbers), covering all current
and ancient historical scripts. The original 16-bit range of U+0000 to
U+FFFF (65,536 characters) is known as the <em>Basic Multilingual Plane</em> (BMP), covering all the major languages in use currently. The characters outside the BMP are called <em>Supplementary Characters</em>, which are not frequently used.</p>
<p>Unicode has two encoding schemes:</p>
<ul>
<li><strong>UCS-2</strong> (Universal Character Set - 2 Byte): Uses 2
bytes (16 bits), covering 65,536 characters in the BMP. BMP is
sufficient for most of the applications. UCS-2 is now obsolete.</li>
<li><strong>UCS-4</strong> (Universal Character Set - 4 Byte): Uses 4 bytes (32 bits), covering BMP and the supplementary characters.</li>
</ul>
<img class="image-center" src="A%20Tutorial%20on%20Data%20Representation%20-%20Integers,%20Floating-point%20numbers,%20and%20characters_files/DataRep_Unicode.png">
<h4>5.5 UTF-8 (Unicode Transformation Format - 8-bit)<a id="zz-5.5"></a></h4>
<p>The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if the
document contains mainly ASCII characters, because each character
occupies two (UCS-2) or four (UCS-4) bytes of storage. Variable-length encoding schemes, such as
UTF-8, which uses 1-4 bytes to represent a character, were devised to
improve the efficiency. In UTF-8, the 128 commonly-used US-ASCII
characters use only 1 byte, but some less commonly used characters may
require up to 4 bytes. Overall, the efficiency improves for documents
containing mainly US-ASCII text.</p>
<p>The transformation between Unicode and UTF-8 is as follows:</p>
<table class="table-zebra font-code" style="width:60%">
<tbody><tr>
<th>Bits</th>
<th>Unicode</th>
<th>UTF-8 Code</th>
<th>Bytes</th>
</tr>
<tr>
<td class="text-center">7</td>
<td class="text-right">00000000 0xxxxxxx</td>
<td class="text-right">0xxxxxxx</td>
<td>1 (ASCII)</td>
</tr>
<tr>
<td class="text-center">11</td>
<td class="text-right">00000yyy yyxxxxxx</td>
<td class="text-right">110yyyyy 10xxxxxx</td>
<td>2</td>
</tr>
<tr>
<td class="text-center">16</td>
<td class="text-right">zzzzyyyy yyxxxxxx</td>
<td class="text-right">1110zzzz 10yyyyyy 10xxxxxx</td>
<td>3</td>
</tr>
<tr>
<td class="text-center">21</td>
<td class="text-right">000uuuuu zzzzyyyy yyxxxxxx</td>
<td class="text-right">11110uuu 10uuzzzz 10yyyyyy 10xxxxxx</td>
<td>4</td>
</tr>
</tbody></table>
<p>In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters
are padded with a leading zero, and thus have the same values as in ASCII.
Hence, UTF-8 text that contains only ASCII characters can be read by software that expects ASCII. Unicode numbers
of 128 and above, which are less frequently used, are encoded using
more bytes (2-4 bytes). UTF-8 generally requires less storage and is
compatible with ASCII. The drawback of UTF-8 is that more processing power
is needed to unpack the code due to its variable length. UTF-8 is the most
popular format for Unicode.</p>
<p>Notes:</p>
<ul>
<li>UTF-8 uses 1-3 bytes for the characters in BMP (16-bit), and 4 bytes for supplementary characters outside BMP (21-bit).</li>
<li>The 128 ASCII characters (basic Latin letters, digits, and
punctuation signs) use one byte. Most European and Middle East
characters use a 2-byte sequence, which includes extended Latin letters
(with tilde, macron, acute, grave and other accents), Greek, Armenian,
Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) use
three-byte sequences.</li>
<li>All the bytes, except the 128 ASCII characters, have a leading <code>'1'</code> bit. In other words, the ASCII bytes, with a leading <code>'0'</code> bit, can be identified and decoded easily.</li>
</ul>
<p><strong>Example</strong>: 您好 <code>(Unicode: 60A8H 597DH)</code></p>
<pre class="color-example">Unicode (UCS-2) is 60A8H = 0110 0000 10 101000B
⇒ UTF-8 is 11100110 10000010 10101000B = E6 82 A8H
Unicode (UCS-2) is 597DH = 0101 1001 01 111101B
⇒ UTF-8 is 11100101 10100101 10111101B = E5 A5 BDH</pre>
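<p>The transformation table can be turned into a small hand-written encoder. The following is a minimal sketch for illustration only (the class and method names are made up). In real programs, you would simply call <code>String.getBytes(StandardCharsets.UTF_8)</code>, which is used here as a cross-check.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
   // Encode one Unicode number into UTF-8 bytes, row by row of the table above.
   public static byte[] encode(int cp) {
      if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
         throw new IllegalArgumentException("Not an encodable Unicode number");
      }
      if (cp < 0x80) {            // up to 7 bits:  0xxxxxxx
         return new byte[] {(byte) cp};
      } else if (cp < 0x800) {    // up to 11 bits: 110yyyyy 10xxxxxx
         return new byte[] {
            (byte) (0xC0 | (cp >> 6)),
            (byte) (0x80 | (cp & 0x3F))};
      } else if (cp < 0x10000) {  // up to 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
         return new byte[] {
            (byte) (0xE0 | (cp >> 12)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
      } else {                    // up to 21 bits: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
         return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
      }
   }

   // Format bytes as space-separated 2-digit hex codes.
   public static String toHex(byte[] bytes) {
      StringBuilder sb = new StringBuilder();
      for (byte b : bytes) {
         if (sb.length() > 0) sb.append(' ');
         sb.append(String.format("%02X", b));
      }
      return sb.toString();
   }

   public static void main(String[] args) {
      System.out.println(toHex(encode(0x60A8)));   // E6 82 A8
      System.out.println(toHex(encode(0x597D)));   // E5 A5 BD

      // Cross-check against the JDK's own UTF-8 encoder
      byte[] jdk = new String(Character.toChars(0x60A8)).getBytes(StandardCharsets.UTF_8);
      System.out.println(Arrays.equals(jdk, encode(0x60A8)));   // true
   }
}
```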
<h4>5.6 UTF-16 (Unicode Transformation Format - 16-bit)<a id="zz-5.6"></a></h4>
<p>UTF-16 is a variable-length Unicode character encoding scheme, which
uses either 2 or 4 bytes. UTF-16 is not commonly used for text files, although Java
uses it internally for strings. The transformation table
is as follows:</p>
<table class="table-zebra font-code" style="width:60%">
<tbody><tr>
<th>Unicode</th>
<th>UTF-16 Code</th>
<th>Bytes</th>
</tr>
<tr>
<td class="text-right">xxxxxxxx xxxxxxxx</td>
<td class="text-right">Same as UCS-2 - no encoding</td>
<td>2</td>
</tr>
<tr>
<td class="text-right">000uuuuu zzzzyyyy yyxxxxxx<br>
(uuuuu≠0)</td>
<td class="text-right">110110ww wwzzzzyy 110111yy yyxxxxxx<br>
(wwww = uuuuu - 1)</td>
<td>4</td>
</tr>
</tbody></table>
<p>Take note that for the 65536 characters in the BMP, UTF-16 is the
same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary
characters outside the BMP.</p>
<p>Each supplementary
character requires a pair of 16-bit values, the first from
the high-surrogates range (<code>\uD800-\uDBFF</code>), the second from the low-surrogates range (<code>\uDC00-\uDFFF</code>).</p>
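<p>The surrogate pair can be computed by hand from the table. The sketch below is for illustration only: the class and method names are made up, and U+1F600 is an arbitrary supplementary code number chosen as the test value. Subtracting <code>10000H</code> is equivalent to the <code>wwww = uuuuu - 1</code> rule in the table. The JDK's <code>Character.toChars()</code> serves as a cross-check.</p>

```java
import java.util.Arrays;

public class SurrogateSketch {
   // Split a supplementary code number (U+10000 to U+10FFFF) into a surrogate pair.
   public static char[] toSurrogates(int cp) {
      if (cp < 0x10000 || cp > 0x10FFFF) {
         throw new IllegalArgumentException("Not a supplementary character");
      }
      int v = cp - 0x10000;                        // 20-bit value (wwww = uuuuu - 1)
      char high = (char) (0xD800 | (v >> 10));     // 110110ww wwzzzzyy
      char low  = (char) (0xDC00 | (v & 0x3FF));   // 110111yy yyxxxxxx
      return new char[] {high, low};
   }

   public static void main(String[] args) {
      int cp = 0x1F600;   // arbitrary supplementary character used as an example
      char[] pair = toSurrogates(cp);
      System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);   // D83D DE00

      // Cross-check against the JDK
      System.out.println(Arrays.equals(pair, Character.toChars(cp)));   // true
   }
}
```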
<h4>5.7 UTF-32 (Unicode Transformation Format - 32-bit)<a id="zz-5.7"></a></h4>
<p>Same as UCS-4, which uses 4 bytes for each character - unencoded.</p>
<h4>5.8 Formats of Multi-Byte (e.g., Unicode) Text Files<a id="zz-5.8"></a></h4>
<p><span class="line-heading">Endianness (or byte-order)</span>: For a multi-byte character, you need to take care of the order of the bytes in storage. In <em>big endian</em>, the most significant byte is stored at the memory location with the lowest address (big byte first). In <em>little endian</em>,
the most significant byte is stored at the memory location with the
highest address (little byte first). For example, 您 (with Unicode number
of <code>60A8H</code>) is stored as <code>60 A8</code> in big endian; and stored as <code>A8 60</code> in little endian. Big endian, which produces a more readable hex dump, is often the default for Unicode text files.</p>
<p><span class="line-heading">BOM (Byte Order Mark)</span>: BOM is a special Unicode character having code number of <code>FEFFH</code>, which is used to differentiate big-endian and little-endian. For big-endian, BOM appears as <code>FE FFH</code> in the storage. For little-endian, BOM appears as <code>FF FEH</code>. Unicode reserves both code numbers (<code>FEFFH</code> and <code>FFFEH</code>) so that the BOM cannot clash with another character.</p>
<p>Unicode text files could take on these formats:</p>
<ul>
<li>Big Endian: UCS-2BE, UTF-16BE, UTF-32BE.</li>
<li>Little Endian: UCS-2LE, UTF-16LE, UTF-32LE.</li>
<li>UTF-16 with BOM. The first character of the file is a BOM character,
which specifies the endianness. For big-endian, BOM appears as <code>FE FFH</code> in the storage. For little-endian, BOM appears as <code>FF FEH</code>.</li>
</ul>
<p>UTF-8 is defined as a sequence of single bytes in a fixed order, so
endianness is not an issue and the BOM plays no part.
However, in some systems (in particular Windows), a BOM is added as the
first character in the UTF-8 file as a signature to identify the file
as UTF-8 encoded. The BOM character (<code>FEFFH</code>) is encoded in UTF-8 as <code>EF BB BF</code>.
Adding a BOM as the first character of the file is not recommended, as
it may be incorrectly interpreted by other systems. You can have a UTF-8
file without BOM.</p>
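<p>A program can inspect the first few bytes of a file to look for these signatures. The following is a minimal sketch for illustration (the class and method names are made up). It only checks the three byte patterns described above, ignores UTF-32, and reports "unknown" for a file without BOM.</p>

```java
import java.nio.charset.StandardCharsets;

public class BomSketch {
   // Guess the encoding from the leading bytes, using only the BOM patterns above.
   public static String detect(byte[] head) {
      if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
         return "UTF-8";       // EF BB BF
      }
      if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
         return "UTF-16BE";    // FE FF
      }
      if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
         return "UTF-16LE";    // FF FE
      }
      return "unknown";        // no BOM found
   }

   public static void main(String[] args) {
      // Java's "UTF-16" encoder writes a big-endian BOM followed by 60 A8
      byte[] withBom = "\u60A8".getBytes(StandardCharsets.UTF_16);
      System.out.println(detect(withBom));                                        // UTF-16BE

      // UTF-8 output from Java carries no BOM
      System.out.println(detect("\u60A8".getBytes(StandardCharsets.UTF_8)));      // unknown

      System.out.println(detect(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x48}));   // UTF-8
   }
}
```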
<h4>5.9 Formats of Text Files<a id="zz-5.9"></a></h4>
<p><span class="line-heading">Line Delimiter or End-Of-Line (EOL)</span>:
Sometimes, when you use the Windows NotePad to open a text file
(created in Unix or Mac), all the lines are joined together. This is
because different operating platforms use different characters as the
so-called <em>line delimiter</em> (or <em>end-of-line</em> or EOL). Two non-printable control characters are involved: <code>0AH</code> (Line-Feed or LF) and <code>0DH</code> (Carriage-Return or CR).</p>
<ul>
<li>Windows/DOS uses <code>0D0AH</code> (CR+LF or "<code>\r\n</code>") as EOL.</li>
<li>Unix and Mac OS X use <code>0AH</code> (LF or "<code>\n</code>") only.</li>
<li>Classic Mac OS (before Mac OS X) used <code>0DH</code> (CR or "<code>\r</code>") only.</li>
</ul>
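<p>The difference is easy to see in a hex dump. The sketch below (class and method names are made up for illustration) prints the bytes of a two-line text with Windows line delimiters, and then converts them to the Unix convention.</p>

```java
import java.nio.charset.StandardCharsets;

public class EolSketch {
   // Convert Windows CR+LF line delimiters to Unix LF.
   public static String toUnixEol(String text) {
      return text.replace("\r\n", "\n");
   }

   // Format the ASCII bytes of a string as space-separated hex codes.
   public static String toHex(String text) {
      StringBuilder sb = new StringBuilder();
      for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
         if (sb.length() > 0) sb.append(' ');
         sb.append(String.format("%02X", b));
      }
      return sb.toString();
   }

   public static void main(String[] args) {
      String windowsText = "ab\r\ncd\r\n";
      System.out.println(toHex(windowsText));              // 61 62 0D 0A 63 64 0D 0A
      System.out.println(toHex(toUnixEol(windowsText)));   // 61 62 0A 63 64 0A
   }
}
```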
<p><span class="line-heading">End-of-File (EOF)</span>: [TODO]</p>
<h4>5.10 Windows' CMD Codepage<a id="zz-5.10"></a></h4>
<p>Character encoding scheme (charset) in Windows is called <em>codepage</em>. In CMD shell, you can issue command <code>"chcp"</code> to display the current codepage, or <code>"chcp codepage-number"</code> to change the codepage.</p>
<p>Take note that:</p>
<ul>
<li>The default codepage 437 (used in the original DOS) is an 8-bit character set called <em>Extended ASCII</em>, which is different from Latin-1 for code numbers above 127.</li>
<li>Codepage 1252 (Windows-1252) is not exactly the same as Latin-1.
It assigns code numbers 80H to 9FH to letters and punctuation, such as
smart single-quotes and double-quotes. A common problem in browsers, where
quotes and apostrophes are displayed as question marks or boxes, arises because the
page is actually encoded in Windows-1252 but mislabelled as ISO-8859-1.</li>
<li>For internationalization and Chinese character sets: codepage 65001
for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE,
codepage 936 for Chinese characters in GB2312, codepage 950 for Chinese
characters in Big5.</li>
</ul>
<h4 id="chinese_charset">5.11 Chinese Character Sets<a id="zz-5.11"></a></h4>
<p>Unicode supports all languages, including Asian languages like
Chinese (both simplified and traditional characters), Japanese and
Korean (collectively called CJK). There are more than 20,000 CJK
characters in Unicode. Unicode characters are often encoded in the UTF-8
scheme, which, unfortunately, requires 3 bytes for each CJK character,
instead of 2 bytes in the unencoded UCS-2 (UTF-16).</p>
<p>Worse still, there are also various Chinese character sets, which are not compatible with Unicode:</p>
<ul>
<li>GB2312/GBK: for <em>simplified</em> Chinese characters. GB2312
uses 2 bytes for each Chinese character. The most significant bit (MSB)
of both bytes is set to 1 to co-exist with 7-bit ASCII, which has an MSB of
0. There are about 6700 characters. GBK is an extension of GB2312, which
includes more characters as well as traditional Chinese characters.</li>
<li>BIG5: for <em>traditional</em> Chinese characters. BIG5 also uses 2
bytes for each Chinese character. The most significant bit of the first
byte is set to 1, but the second byte may have an MSB of 0 (e.g., 您 is <code>B1 7AH</code> in BIG5). BIG5 is not compatible with GBK, i.e., the same
code number is assigned to different characters.</li>
</ul>
<p>For example, the world is made more interesting with these many standards:</p>
<table class="table-zebra font-code" style="width:60%">
<tbody><tr>
<th> </th>
<th>Standard</th>
<th>Characters</th>
<th>Codes</th>
</tr>
<tr>
<td rowspan="3">Simplified</td>
<td>GB2312</td>
<td>和谐</td>
<td>BACD D0B3</td>
</tr>
<tr>
<td>UCS-2</td>
<td>和谐</td>
<td>548C 8C10</td>
</tr>
<tr>
<td>UTF-8</td>
<td>和谐</td>
<td>E5928C E8B090</td>
</tr>
<tr>
<td rowspan="3">Traditional</td>
<td>BIG5</td>
<td>和諧</td>
<td>A94D BFD3</td>
</tr>
<tr>
<td>UCS-2</td>
<td>和諧</td>
<td>548C 8AE7</td>
</tr>
<tr>
<td>UTF-8</td>
<td>和諧</td>
<td>E5928C E8ABA7</td>
</tr>
</tbody></table>
<p><span class="line-heading">Notes for Windows' CMD Users</span>: To
display Chinese characters correctly in CMD shell, you need to choose
the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for
Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You
can use command "<code>chcp</code>" to display the current code page and command "<code>chcp <em>codepage_number</em></code>"
to change the codepage. You also have to choose a font that can display
the characters (e.g., Courier New, Consolas or Lucida Console, NOT
Raster font).</p>
<h4>5.12 Collating Sequences (for Ranking Characters)<a id="zz-5.12"></a></h4>
<p>A string consists of a sequence of characters in upper or lower cases, e.g., <code>"apple"</code>, <code>"BOY"</code>, <code>"Cat"</code>.
In sorting or comparing strings, if we order the characters according
to the underlying code numbers (e.g., US-ASCII) character-by-character,
the order for the example would be <code>"BOY"</code>, <code>"Cat"</code>, <code>"apple"</code> because uppercase letters have a smaller code number than lowercase letters. This does not agree with the so-called <em>dictionary order</em>, where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is that <code>"10"</code> (ten) is at times ordered in front of <code>"2"</code> to <code>"9"</code>, because the strings are compared character-by-character.</p>
<p>Hence, in sorting or comparison of strings, a so-called <em>collating sequence</em> (or <em>collation</em>)
is often defined, which specifies the ranks for letters (uppercase,
lowercase), numbers, and special symbols. There are many collating
sequences available. It is entirely up to you to choose a collating
sequence to meet your application's specific requirements. Some <em>case-insensitive dictionary-order collating sequences</em> have the same rank for the same uppercase and lowercase letters, i.e., <code>'A'</code>, <code>'a'</code> ⇒ <code>'B'</code>, <code>'b'</code> ⇒ ... ⇒ <code>'Z'</code>, <code>'z'</code>. Some <em>case-sensitive dictionary-order collating sequences</em> put the uppercase letters before the lowercase letters, i.e., <code>'A'</code> ⇒ <code>'B'</code> ⇒ <code>'C'</code> ... ⇒ <code>'a'</code> ⇒ <code>'b'</code> ⇒ <code>'c'</code> ... Typically, space is ranked before digits <code>'0'</code> to <code>'9'</code>, followed by the letters.</p>
<p>Collating sequence is often language dependent, as different
languages use different sets of characters (e.g., á, é, a, α) with their
own orders.</p>
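<p>The effect of the collating sequence can be observed in Java. This is a minimal sketch (class and method names are made up for illustration) that sorts the example strings first by code numbers and then with the JDK's built-in <code>String.CASE_INSENSITIVE_ORDER</code> comparator. For language-dependent ordering, the JDK also provides <code>java.text.Collator</code>, which is not shown here.</p>

```java
import java.util.Arrays;

public class CollationSketch {
   // Sort by the underlying code numbers (String's natural order).
   public static String[] sortByCode(String[] words) {
      String[] copy = words.clone();
      Arrays.sort(copy);
      return copy;
   }

   // Sort ignoring the difference between uppercase and lowercase letters.
   public static String[] sortIgnoreCase(String[] words) {
      String[] copy = words.clone();
      Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER);
      return copy;
   }

   public static void main(String[] args) {
      String[] words = {"apple", "BOY", "Cat"};
      System.out.println(Arrays.toString(sortByCode(words)));       // [BOY, Cat, apple]
      System.out.println(Arrays.toString(sortIgnoreCase(words)));   // [apple, BOY, Cat]

      // Numeric strings are also compared character-by-character
      String[] numbers = {"9", "10", "2", "1"};
      System.out.println(Arrays.toString(sortByCode(numbers)));     // [1, 10, 2, 9]
   }
}
```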
<h4>5.13 For Java Programmers - <code>java.nio.charset.Charset</code><a id="zz-5.13"></a></h4>
<p>JDK 1.4 introduced a new <code>java.nio.charset</code> package to
support encoding/decoding of characters from the UCS-2 used internally in
Java programs to any supported charset used by external devices.</p>
<p><strong>Example</strong>: The following program encodes some Unicode
text in various encoding schemes, and displays the hex codes of the
encoded byte sequences.</p>
<pre class="color-example">import java.nio.ByteBuffer;
import java.nio.charset.Charset;
public class <strong>TestCharsetEncodeDecode</strong> {
public static void main(String[] args) {
<span class="color-comment">// Try these charsets for encoding</span>
String[] charsetNames = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16",
"UTF-16BE", "UTF-16LE", "GBK", "BIG5"};
String message = "Hi,您好!"; <span class="color-comment">// message with non-ASCII characters</span>
<span class="color-comment">// Print UCS-2 in hex codes</span>
System.out.printf("%10s: ", "UCS-2");
for (int i = 0; i < message.length(); i++) {
System.out.printf("%04X ", (int)message.charAt(i));
}
System.out.println();
for (String charsetName: charsetNames) {
<span class="color-comment">// Get a Charset instance given the charset name string</span>
Charset charset = Charset.forName(charsetName);
System.out.printf("%10s: ", charset.name());
<span class="color-comment">// Encode the Unicode UCS-2 characters into a byte sequence in this charset.</span>
ByteBuffer bb = charset.encode(message);
while (bb.hasRemaining()) {
System.out.printf("%02X ", bb.get()); <span class="color-comment">// Print hex code</span>
}
System.out.println();
bb.rewind();
}
}
}</pre>
<pre class="output"> UCS-2: 0048 0069 002C 60A8 597D 0021 <span class="color-comment">[16-bit fixed-length]</span>
<span class="color-comment">H i , 您 好 !</span>
US-ASCII: 48 69 2C 3F 3F 21 <span class="color-comment">[8-bit fixed-length]</span>
<span class="color-comment">H i , ? ? !</span>
ISO-8859-1: 48 69 2C 3F 3F 21 <span class="color-comment">[8-bit fixed-length]</span>
<span class="color-comment">H i , ? ? !</span>
UTF-8: 48 69 2C <span class="underline">E6 82 A8</span> <span class="underline">E5 A5 BD</span> 21 <span class="color-comment">[1-4 bytes variable-length]</span>
<span class="color-comment">H i , 您 好 !</span>
UTF-16: <span class="underline">FE FF</span> <span class="underline">00 48</span> <span class="underline">00 69</span> <span class="underline">00 2C</span> <span class="underline">60 A8</span> <span class="underline">59 7D</span> <span class="underline">00 21</span> <span class="color-comment">[2-4 bytes variable-length]</span>
<span class="color-comment">BOM H i , 您 好 ! [Byte-Order-Mark indicates Big-Endian]</span>
UTF-16BE: <span class="underline">00 48</span> <span class="underline">00 69</span> <span class="underline">00 2C</span> <span class="underline">60 A8</span> <span class="underline">59 7D</span> <span class="underline">00 21</span> <span class="color-comment">[2-4 bytes variable-length]</span>
<span class="color-comment">H i , 您 好 !</span>
UTF-16LE: <span class="underline">48 00</span> <span class="underline">69 00</span> <span class="underline">2C 00</span> <span class="underline">A8 60</span> <span class="underline">7D 59</span> <span class="underline">21 00</span> <span class="color-comment">[2-4 bytes variable-length]</span>
<span class="color-comment">H i , 您 好 !</span>
GBK: 48 69 2C <span class="underline">C4 FA</span> <span class="underline">BA C3</span> 21 <span class="color-comment">[1-2 bytes variable-length]</span>
<span class="color-comment">H i , 您 好 !</span>
Big5: 48 69 2C <span class="underline">B1 7A</span> <span class="underline">A6 6E</span> 21 <span class="color-comment">[1-2 bytes variable-length]</span>
<span class="color-comment">H i , 您 好 !</span>
</pre>
<h4>5.14 For Java Programmers - <code>char</code> and <code>String</code><a id="zz-5.14"></a></h4>
<p>The <code>char</code> data type is based on the <em>original</em> 16-bit Unicode standard called UCS-2. Unicode has since evolved to 21 bits, with a code range of U+0000 to U+10FFFF.
The set of characters from U+0000 to U+FFFF is known as the <em>Basic Multilingual Plane</em> (<em>BMP</em>). Characters above U+FFFF are called <em>supplementary</em> characters. A 16-bit Java <code>char</code> cannot hold a supplementary character.</p>
<p>Recall that in the UTF-16 encoding scheme, a BMP character uses 2
bytes. It is the same as UCS-2. A supplementary character uses 4 bytes
and requires a pair of 16-bit values, the first from the high-surrogates
range (<code>\uD800-\uDBFF</code>), the second from the low-surrogates range (<code>\uDC00-\uDFFF</code>).</p>
<p>In Java, a <code>String</code> is a sequence of Unicode characters. Java, in fact, uses UTF-16 for <code>String</code> and <code>StringBuffer</code>. For BMP characters, they are the same as UCS-2. For supplementary characters, each character requires a pair of <code>char</code> values.</p>
<p>Java methods that accept a 16-bit <code>char</code> value do not support supplementary characters. Methods that accept a 32-bit <code>int</code> value support all Unicode characters (in the lower 21 bits), including supplementary characters.</p>
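<p>The difference between <code>char</code>-based and <code>int</code>-based methods can be seen in a short sketch. The class name and the sample string are made up for illustration. U+1F600 is an arbitrary supplementary character, built with <code>Character.toChars()</code> to avoid depending on the source file encoding.</p>

```java
public class CodePointSketch {
   // Build a sample string: one BMP character followed by one supplementary character.
   public static String sample() {
      return "A" + new String(Character.toChars(0x1F600));
   }

   public static void main(String[] args) {
      String s = sample();

      // length() counts 16-bit char values, not characters
      System.out.println(s.length());                          // 3
      // codePointCount() counts Unicode characters
      System.out.println(s.codePointCount(0, s.length()));     // 2

      // The char-based view sees only half of the surrogate pair
      System.out.println(Character.isHighSurrogate(s.charAt(1)));   // true
      System.out.println(Character.isLowSurrogate(s.charAt(2)));    // true

      // The int-based view returns the whole supplementary character
      System.out.printf("U+%X%n", s.codePointAt(1));           // U+1F600
   }
}
```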
<p>This is meant to be an academic discussion. I have yet to encounter the use of supplementary characters!</p>
<h4>5.15 Displaying Hex Values & Hex Editors<a id="zz-5.15"></a></h4>
<p>At times, you may need to display the hex values of a file, especially in dealing with Unicode characters. A <em>Hex Editor</em>
is a handy tool that a good programmer should possess in his/her
toolbox. There are many freeware/shareware Hex Editors available. Try
googling "Hex Editor".</p>
<p>I have used the following:</p>
<ul>
<li>NotePad++ with Hex Editor Plug-in: Open-source and free. You can
toggle between Hex view and Normal view by pushing the "H" button.</li>
<li>PSPad: Freeware. You can toggle to Hex view by choosing "View" menu and select "Hex Edit Mode".</li>
<li>TextPad: Shareware without expiration period. To view the Hex value,
you need to "open" the file by choosing the file format of "binary"
(??).</li>
<li>UltraEdit: Shareware, not free, 30-day trial only.</li>
</ul>
<p>Let me know if you have a better choice, which is fast to launch,
easy to use, can toggle between Hex and normal view, free, ....</p>
<p>The following Java program can be used to display hex code for Java Primitives (integer, character and floating-point):</p>
<table class="table-program">
<colgroup><col class="col-program">
</colgroup><tbody>
<tr>
<td>
<pre>public class PrintHexCode {
public static void main(String[] args) {
int i = 12345;
System.out.println("Decimal is " + i); <span class="color-comment">// 12345</span>
System.out.println("Hex is " + Integer.toHexString(i)); <span class="color-comment">// 3039</span>
System.out.println("Binary is " + Integer.toBinaryString(i)); <span class="color-comment">// 11000000111001</span>
System.out.println("Octal is " + Integer.toOctalString(i)); <span class="color-comment">// 30071</span>
System.out.printf("Hex is %x\n", i); <span class="color-comment">// 3039</span>
System.out.printf("Octal is %o\n", i); <span class="color-comment">// 30071</span>
char c = 'a';
System.out.println("Character is " + c); <span class="color-comment">// a</span>
System.out.printf("Character is %c\n", c); <span class="color-comment">// a</span>
System.out.printf("Hex is %x\n", (int)c); <span class="color-comment">// 61</span>
System.out.printf("Decimal is %d\n", (int)c); <span class="color-comment">// 97</span>
float f = 3.5f;
System.out.println("Decimal is " + f); <span class="color-comment">// 3.5</span>
System.out.println(Float.toHexString(f)); <span class="color-comment">// 0x1.cp1 (Fraction=1.c, Exponent=1)</span>
f = -0.75f;
System.out.println("Decimal is " + f); <span class="color-comment">// -0.75</span>
System.out.println(Float.toHexString(f)); <span class="color-comment">// -0x1.8p-1 (F=-1.8, E=-1)</span>
double d = 11.22;
System.out.println("Decimal is " + d); <span class="color-comment">// 11.22</span>
System.out.println(Double.toHexString(d)); <span class="color-comment">// 0x1.670a3d70a3d71p3 (F=1.670a3d70a3d71 E=3)</span>
}
}</pre>
</td>
</tr>
</tbody>
</table>
<p>In Eclipse, you can view the hex code for <em>integer</em> primitive
Java variables in debug mode as follows: In debug perspective,
"Variable" panel ⇒ Select the "menu" (inverted triangle) ⇒ Java ⇒ Java
Preferences... ⇒ Primitive Display Options ⇒ Check "Display hexadecimal
values (byte, short, char, int, long)".</p>
<h3>6. Summary - Why Bother about Data Representation?<a id="zz-6."></a></h3>
<p>Integer number <code>1</code>, floating-point number <code>1.0</code>, character symbol <code>'1'</code>, and string <code>"1"</code> are totally different inside the computer memory. You need to know the difference to write good and high-performance programs.</p>
<ul>
<li>In 8-bit <em>signed integer</em>, integer number <code>1</code> is represented as <code>00000001B</code>.</li>
<li>In 8-bit <em>unsigned integer</em>, integer number <code>1</code> is represented as <code>00000001B</code>.</li>
<li>In 16-bit <em>signed integer</em>, integer number <code>1</code> is represented as <code>00000000 00000001B</code>.</li>
<li>In 32-bit <em>signed integer</em>, integer number <code>1</code> is represented as <code>00000000 </code><code>00000000 </code><code>00000000 00000001B</code>.</li>
<li>In 32-bit <em>floating-point representation</em>, number <code>1.0</code> is represented as <code>0 01111111 0000000 00000000 00000000B</code>, i.e., <code>S=0</code>, <code>E=127</code>, <code>F=0</code>.</li>
<li>In 64-bit <em>floating-point representation</em>, number <code>1.0</code> is represented as <code>0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B</code>, i.e., <code>S=0</code>, <code>E=1023</code>, <code>F=0</code>.</li>
<li>In 8-bit Latin-1, the character symbol <code>'1'</code> is represented as <code>00110001B</code> (or <code>31H</code>).</li>
<li>In 16-bit UCS-2, the character symbol <code>'1'</code> is represented as <code>00000000 00110001B</code>.</li>
<li>In UTF-8, the character symbol <code>'1'</code> is represented as <code>00110001B</code>.</li>
</ul>
<p>If you "add" a 16-bit signed integer <code>1</code> and Latin-1 character <code>'1'</code> or a string <code>"1",</code> you could get a surprise.</p>
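<p>The "surprise" can be tried out in Java, as in the following minimal sketch (class and method names are made up for illustration). Adding a <code>char</code> to an <code>int</code> uses the character's code number, while adding a <code>String</code> performs concatenation.</p>

```java
public class AddSurprise {
   // int + char: the char is promoted to its code number ('1' is 49).
   public static int addChar(int i, char c) {
      return i + c;
   }

   // int + String: the int is converted to text and concatenated.
   public static String addString(int i, String s) {
      return i + s;
   }

   public static void main(String[] args) {
      int i = 1;
      System.out.println(i + 1);               // 2
      System.out.println(addChar(i, '1'));     // 50  (1 + 49)
      System.out.println(addString(i, "1"));   // 11  (concatenation)
      System.out.println(i + 1.0);             // 2.0 (int promoted to double)
   }
}
```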
<h4>6.1 Exercises (Data Representation)<a id="zz-6.1"></a></h4>
<p>For the following 16-bit codes:</p>
<pre class="color-example">0000 0000 0010 1010;
1000 0000 0010 1010;</pre>
<p>Give their values, if they are representing:</p>
<ol>
<li>a 16-bit unsigned integer;</li>
<li>a 16-bit signed integer;</li>
<li>two 8-bit unsigned integers;</li>
<li>two 8-bit signed integers;</li>
<li>a 16-bit Unicode character;</li>
<li>two 8-bit ISO-8859-1 characters.</li>
</ol>
<p>Ans: (1) <code>42</code>, <code>32810</code>; (2) <code>42</code>, <code>-32726</code>; (3) <code>0</code>, <code>42</code>; <code>128</code>, <code>42</code>; (4) <code>0</code>, <code>42</code>; <code>-128</code>, <code>42</code>; (5) <code>'*'</code>; <code>'耪'</code>; (6) <code>NUL</code>, <code>'*'</code>; <code>PAD</code>, <code>'*'</code>.</p>
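<p>The numeric answers (1) to (4) can be checked with a few casts and masks. The sketch below is one possible way to do it, with a made-up class and method name. It takes the 16-bit pattern as an <code>int</code> and reinterprets it.</p>

```java
public class ExerciseCheck {
   // Interpret the lower 16 bits of 'code' in the four numeric ways of the exercise.
   public static String interpret(int code) {
      int unsigned16 = code & 0xFFFF;    // (1) 16-bit unsigned integer
      short signed16 = (short) code;     // (2) 16-bit signed integer (2's complement)
      int high = (code >> 8) & 0xFF;     // (3) upper byte as 8-bit unsigned
      int low = code & 0xFF;             //     lower byte as 8-bit unsigned
      byte signedHigh = (byte) high;     // (4) upper byte as 8-bit signed
      byte signedLow = (byte) low;       //     lower byte as 8-bit signed
      return unsigned16 + " " + signed16 + " " + high + "," + low
            + " " + signedHigh + "," + signedLow;
   }

   public static void main(String[] args) {
      System.out.println(interpret(0x002A));   // 42 42 0,42 0,42
      System.out.println(interpret(0x802A));   // 32810 -32726 128,42 -128,42
   }
}
```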
<p> </p>
<p class="references">REFERENCES & RESOURCES</p>
<ol>
<li>(Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".</li>
<li>(ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992),
"Information technology - 7-bit coded character set for information
interchange".</li>
<li>(Latin-1 Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets -
Part 1: Latin alphabet No. 1".</li>
<li>(Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".</li>
<li>Unicode Consortium @ <a href="http://www.unicode.org/">http://www.unicode.org</a>.</li>
</ol>
</div> <!-- End the content-main division -->
<div id="content-footer">
<p>Last modified: January, 2014</p>
</div>
</div> <!-- End the wrap-inner division -->
<!-- footer filled by JavaScript -->
<div id="footer" class="header-footer"><p>Feedback, comments, corrections, and errata can be sent to Chua Hock-Chuan (ehchua@ntu.edu.sg) | <a href="http://www3.ntu.edu.sg/home/ehchua/programming/index.html">HOME</a></p></div>
</div> <!-- End the wrap-outer division -->
<!-- @@ end change in v1 -->
</body></html>