HVV - 'HeadView Video' format specification
Copyright: Kevin Paul Markwell on Tuesday 13th March 2007.
All rights reserved by the copyright author.
Permissions: By application to the author via kevinmarkwell@gmail.com
Version: 1 LIVE - for submission to the Internet Engineering Task Force (IETF)

Note:
This document has been manually converted from the official format structure laid out by the IETF.
The IETF format structure requires fixed-width 72-character wide text with basic symbols used for diagrams and tables.
For this web page version, I have made some typographical changes to improve readability.
All diagrams and tables remain in their original form for now.

Abstract:
1. This file format is designed primarily for close-up face-to-face video-conferencing via the internet. At the moment, graphic file formats make no specific allowance for the unique characteristics of the human face, so streaming video files of people's faces can be unnecessarily large.
2. Although broadband has allowed for files to be much larger, large files increase load on server resources. Currently, the best still image compression is achieved using JPEG technology, with DiVX MPEG for video. The "HVV" format is designed to control the use of small JPEG files to comprise each camera still-frame, and to inter-mix synchronised audio data to form a video-style file format.

Data structure:
1. To allow for flexible streaming, the data stream is organised into sections called Persecs, Markers, Samplings and Frames. A "Persec" section occurs once for each second of transmission. These contain data about the transmission itself - date and time, the broadcaster's current upload rate, the transmitting URI, and the current second's worth of subtitling. Regardless of the frequency of video and sound capturing, each second's worth of camera recording is separated by a Persec.
2. A section called a "Marker" follows, occurring at the frequency of the capture rate. Each contains a timing serial number so data can be sorted correctly upon reception (see "Marker-serial" below).
3. A section called a "Sampling" follows, containing details of the sound captured for the current second and either the compressed sound data itself or phonemes for use with speech synthesis.
4. A section called a "Frame" follows, containing specially-compacted data from a single camera still. The Frame is last in the sequence for the benefit of a receiver who wants to hear audio only, and is not interested in seeing who is speaking. This format is therefore also suitable for audio-only transmissions such as "internet radio stations".
5. If the frame rate of capture was 15 frames per second, there would be a total of 32 sections of data - 1 Persec, 15 Markers, 1 Sampling, and 15 Frames.
6. Regardless of the size of the captured data, the data is broken up into datagram packets of 10 Kilobytes (10240 bytes=81920 bits). This allows for low-specification computers and/or busy networks to send and receive this data with ease.
7. Below is a complete breakdown of the data structure hierarchy (details are described in the next sections):

 -------------------+-----+--------------------+------------------------
 DATA SECTION'S NAME|BYTES|EXAMPLE LITERAL     |MEANING
                    |USED |AND/OR HEX (hh/&__h)|
 -------------------+-----+--------------------+------------------------
 Persec:
  Persec-Time       | 7   |07 d7 03 0d 10 1d 31|13/03/2007 at 16:29:49
  Persec-Rate       | 7   |00 00 00 00 3f f4 00|4191232 bits per second
  Persec-URI        |>9   |"http://www.a.it:89"|fully-qualified URI
  Persec-Subtitling:
   Subtitle-language| 2   |en                  |ISO-based language code
   Subtitle-words   |>1   |hello there&0h      |words said during second
 ------------------- ----- -------------------- ------------------------
 Marker:
  Marker-label      | 2   |MA                  |"Marker"
  Marker-serial     | 2   |00 0f               |15 frames-per-second
 ------------------- ----- -------------------- ------------------------
 Sampling:
  Audio-format      | 3   |MP3                 |MP3 format indicator
  Audio-bit-rate    | 6   |00 00 00 05 00 00   |320Kbps (40KB/s)
  Audio-data-length | 4   |00 00 a0 00         |40 Kilobytes of data
  Audio-data        |>0   |                    |
 ------------------- ----- -------------------- ------------------------
 Frame:
  Part-code         | 3   |BKG                 |"background"
  Part-resolution   |>4   |,~,d,               |width and height
  Part-depth        | 1   |&30h                |48-bit colour depth
  Part-position     |>4   |,},q,               |x and y co-ordinates
  Part-type-format  | 1   |&52h                |RMC-format indicator
  Part-type-version | 1   |&01h                |version 1
  Part-length       |>3   |,^d,                |24164 bytes=(94*256)+100
  Part-data         |>1   |                    |run-lengths and colours
 -------------------+-----+--------------------+------------------------
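
The section count in item 5 and the packet split in item 6 above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the format does not say whether a short final packet is padded, so the sketch assumes it is sent unpadded.

```python
PACKET_SIZE = 10 * 1024  # 10 Kilobytes = 10240 bytes, as stated in item 6


def sections_per_second(frame_rate: int) -> int:
    """1 Persec + one Marker per frame + 1 Sampling + one Frame per frame."""
    return 1 + frame_rate + 1 + frame_rate


def split_into_packets(stream: bytes, size: int = PACKET_SIZE) -> list[bytes]:
    """Cut a byte stream into packets of at most `size` bytes.

    Assumption: the last packet is left short rather than padded.
    """
    return [stream[i:i + size] for i in range(0, len(stream), size)]


print(sections_per_second(15))                              # 32
print([len(p) for p in split_into_packets(bytes(25000))])   # [10240, 10240, 4520]
```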

Definition of a "Persec":
This is the data section containing the primary transmission information. The receiver's interpretation of this gives key information about the reception and control of the remainder of the data stream. The data is defined as follows:
1. "Persec-time" = seven bytes for the transmission date and time in the format YYYY/MM/DD@HH:MM:SS according to the internet-based Network Time Protocol (NTP, the internet's Universal Time Clock). The 8-byte NTP format can be converted to be used as follows:

Byte number(/s)|Purpose
---------------+--------------------------------------------------------
 1 and 2       |Year (0000-9999) in hexadecimal (&00h &00h to &27h &0fh)
 3             |Month (1-12) in hexadecimal (&01h to &0ch)
 4             |Date (1-31) in hexadecimal (&01h to &1fh)
 5             |Hour (0-23) in hexadecimal (&00h to &17h)
 6             |Minute (0-59) in hexadecimal (&00h to &3bh)
 7             |Second (0-59) in hexadecimal (&00h to &3bh)
---------------+--------------------------------------------------------
2. "Persec-rate" = seven bytes for the transmission's current upload rate in bits per second (values up to 72057594037927935, or nearly 64 Petabits). The use of Petabits per second seems unlikely, but future computer speeds are often underestimated; its use here demonstrates that any value can be represented if the number of bytes used to hold the value is agreed upon.
3. "Persec-URI" = a variable number of bytes for the fully-qualified URI address the transmission originates from, in DNS form rather than using IP addressing, e.g. "http://server.bigcompany.com/people/bill.smith/deskcam:80". The URI could of course be faked; it is up to the receiver software to verify it. The whole URI is surrounded by double-quotation marks.
4. "Persec-subtitling" = per-second subtitling allowing for multiple languages. This consists of:
   A. "Subtitle-language" = two bytes for the internationally-recognised ISO 639 code, or superseding standard, for the subtitle language.
   B. "Subtitle-words" = a variable number of bytes for the current second's subtitles for that language, encoded in double-byte UTF-16 format. To mark the end of the Subtitle-words section, a value of &00h is used.
   C. the Persec-subtitling section then repeats with the next Subtitle-language, until there are no more subtitles encoded.
5. There is no fixed data length for a Persec due to the variable length of Persec-URI and Subtitle-words.
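
As a rough sketch, the seven-byte Persec-time and Persec-rate fields could be packed as below. The function names are hypothetical. The byte order (most significant byte first) is inferred from the examples in the data structure table. Python's datetime cannot represent year 0000, which the format allows, so this sketch covers years 1 to 9999 only.

```python
from datetime import datetime


def encode_persec_time(moment: datetime) -> bytes:
    """Pack a date and time into the seven-byte Persec-time layout."""
    return moment.year.to_bytes(2, "big") + bytes(
        [moment.month, moment.day, moment.hour, moment.minute, moment.second]
    )


def decode_persec_time(raw: bytes) -> datetime:
    """Unpack seven Persec-time bytes back into a datetime."""
    if len(raw) != 7:
        raise ValueError("Persec-time must be exactly seven bytes")
    year = int.from_bytes(raw[0:2], "big")
    return datetime(year, raw[2], raw[3], raw[4], raw[5], raw[6])


def encode_persec_rate(bits_per_second: int) -> bytes:
    """Pack the upload rate into seven bytes, most significant byte first."""
    return bits_per_second.to_bytes(7, "big")


stamp = encode_persec_time(datetime(2007, 3, 13, 16, 29, 49))
print(stamp.hex(" "))                         # 07 d7 03 0d 10 1d 31
print(encode_persec_rate(4191232).hex(" "))   # 00 00 00 00 3f f4 00
```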

Definition of a "Marker":
The Marker data section is defined as follows:
1. "Marker-label" = two bytes for a fixed label indicator ("MA").
2. "Marker-serial" = two bytes for the serial number, to allow for the scope of up to 65535 frames per second.
3. In total the Marker data's size is four bytes.
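
A minimal sketch of building and reading the four-byte Marker follows. The function names are hypothetical, and the byte order of the serial (most significant byte first) is an assumption, since it is not stated above.

```python
import struct

MARKER_LABEL = b"MA"


def encode_marker(serial: int) -> bytes:
    """Build the four-byte Marker: the label "MA" plus a two-byte serial."""
    # ">H" = big-endian unsigned 16-bit value (assumed byte order).
    return MARKER_LABEL + struct.pack(">H", serial)


def decode_marker(raw: bytes) -> int:
    """Return the serial number from a four-byte Marker."""
    if len(raw) != 4 or raw[:2] != MARKER_LABEL:
        raise ValueError("not a Marker section")
    return struct.unpack(">H", raw[2:])[0]


print(encode_marker(15))                  # b'MA\x00\x0f'
print(decode_marker(b"MA\x00\x0f"))       # 15
```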

Definition of a "Sampling":
Samplings can contain either audio or phoneme data, both, or neither.
The audio data is defined as follows:
1. "Audio-format" = three bytes showing one of these literal strings: PCM = uncompressed sound data, ideal for high bandwidth used with older or unknown computers. MP3 = as per the MP3 file specification, allowing for bitrate variation and ID3 tagging.
2. "Audio-bit-rate" = six bytes for the bit-rate in bits per second, allowing for values up to 281474976710655, or nearly 32 Terabytes. Here is an example for 320Kbps (327680 bits per second):

 ----------------------+----+------------------+------------+-----------
 Byte Position         |Byte|Byte column       |Example     |Decimal
                       |num.|multiplier        |for 320Kbps |value
 ---------------------- ---- ------------------ ------------ -----------
 Most significant byte | 1  |    1099511627776 |  00        |     0
                       | 2  |       4294967296 |  00        |     0
                       | 3  |         16777216 |  00        |     0
                       | 4  |            65536 |  05        |327680
                       | 5  |              256 |  00        |     0
 Least significant byte| 6  |                1 |  00        |     0
 ----------------------+----+------------------+------------+-----------
In the case of MP3 data, the bit-rate could be encoded here as well as within the Audio-data itself.
3. "Audio-data-length" = four bytes for the audio data's length in bytes, allowing for values up to 4294967295, or nearly 4 Gigabytes. This data is for only one second's worth of audio recording, so even raw data recorded at CD-transparent quality (192Kbps) would only reach 192Kb (=196608 bits, =24576 bytes, =24 Kilobytes). This length encoding is important because the data could contain any byte value from 0 to 255, so an "ending" byte could not be used.
4. "Audio-data" = the actual data according to the preceding Audio-format and Audio-bit-rate. The decoder should ignore data beyond the described Audio-data-length, because it could easily be misinterpreted. Warnings could of course be used if there were differences between the noted Audio-data-length and the actual length. Phonemes can be used so that a speech-synthesised "voice" can be used to "speak", over-riding the original recorded voice. If this is used, the main reason for transmitting the original audio data as well would be to give the user a choice. Phonemes are recorded as literal bytes to speed up decoding and prevent interpretational difficulties.
5. A Sampling ends with a &00h value.
6. There is no fixed data length for a Sampling due to the variable length of Audio-data.
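
The audio fields of a Sampling could be written and read roughly as follows. This is a sketch under assumptions: the function names are hypothetical, multi-byte values are taken to be most significant byte first, and phoneme data is left out because its layout is not defined above.

```python
def encode_sampling(audio_format: str, bit_rate: int, audio: bytes) -> bytes:
    """Audio-format (3) + Audio-bit-rate (6) + Audio-data-length (4) + data + &00h."""
    fmt = audio_format.encode("ascii")
    if len(fmt) != 3:
        raise ValueError("Audio-format must be three bytes, e.g. PCM or MP3")
    return (
        fmt
        + bit_rate.to_bytes(6, "big")
        + len(audio).to_bytes(4, "big")
        + audio
        + b"\x00"  # a Sampling ends with a &00h value
    )


def decode_sampling(raw: bytes) -> tuple[str, int, bytes]:
    """Return (format, bit rate, audio data); bytes past the length are ignored."""
    fmt = raw[0:3].decode("ascii")
    bit_rate = int.from_bytes(raw[3:9], "big")
    length = int.from_bytes(raw[9:13], "big")
    audio = raw[13:13 + length]
    if len(audio) != length:
        raise ValueError("audio data is shorter than Audio-data-length")
    if raw[13 + length:14 + length] != b"\x00":
        raise ValueError("missing &00h terminator")
    return fmt, bit_rate, audio


raw = encode_sampling("MP3", 327680, bytes(40960))
print(raw[:13].hex(" "))   # 4d 50 33 00 00 00 05 00 00 00 00 a0 00
```

Reading the length first and slicing exactly that many bytes is what lets the audio data contain any byte value, including zero.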

Definition of a Frame:
1. The major difference between the way camera frames are currently stored and this format is the shape. Generally, recording and the resultant frame image have used a ratio of either 4:3 or 16:9, so are rectangular. This format exclusively uses a 3:4 ratio and the resultant frame image is elliptical. This does not rely on the person being recorded to keep their head still, because algorithms are applied to dispose of the extraneous area surrounding the human head captured. Whether or not neck and/or shoulders are included, and the subject's distance from the camera, can be changed at any point during recording. The 3:4 ratio and elliptical format creates a result similar to the traditional framing of portrait photography.
2. Another significant difference using this format is the use of resolution variance. A single frame image can contain differing levels of resolution and colour, creating a collage of different image sizes and colours to make up the image.
3. Streaming is done frame-by-frame to allow for maximum reception advantages. This format does NOT use the principle of grouping frames to compress data. It relies entirely on separate compression for each frame.
4. When people look at a person talking, they usually concentrate on a few features of the person's face. Although people might choose to look at other parts of the head, most movement occurs on these features, so these are generally more interesting. The Frame data has therefore been ordered by data resolution or change frequency, largest first. Each of these facial parts is referred to as a "Part". The order of data storage is as follows:

	mouth (MOU)
	left pupil (PUL)
	right pupil (PUR)
	eye left (EYL)
	eye right (EYR)
	eyebrow left (EBL)
	eyebrow right (EBR)
	nose (NOS)
	forehead (FHD)
	chin (CHI)
	ear left (EAL)
	ear right (EAR)
	neck (NEC)
	hair (HAI)
	face (FAC)
	head (HED)
	background (BKG)
5. The three-letter abbreviation is used as a descriptor for the data. Receiving programs could choose to ignore certain parts of the transmission by ignoring the data stored next to a given descriptor. For example, a low-bandwidth receiving program could just read the data for the "neck" (NEC) once and keep that data on-screen instead of constantly reading new neck data that might have hardly changed. Anything which obscures one of these Parts would be compressed at the same resolution and colour depth as the part or parts it obscures.
6. The hierarchical structure for the Parts is:
        Frame
         |_____background (BKG)
         |_____head (HED)
                |_______hair (HAI)
                |_______ear left (EAL)
                |_______ear right (EAR)
                |_______face (FAC)
                |        |_____forehead (FHD)
                |        |_____eyebrow left (EBL)
                |        |_____eyebrow right (EBR)
                |        |_____eye left (EYL)
                |        |      |____pupil left (PUL)
                |        |
                |        |_____eye right (EYR)
                |        |      |____pupil right (PUR)
                |        |
                |        |_____nose (NOS)
                |        |_____mouth (MOU)
                |        |_____chin (CHI)
                |
                |_______neck (NEC)

7. This structure can be used as a framework for a person's image to be used in a computer-generated model such as in computer games.
8. Each Part's data is structured as follows:
   A. "Part-code" = three bytes for a literal code for the Part in the form "ppp", e.g. "MOU" for the "Mouth" Part.
   B. "Part-resolution" = the resolution of the Part expressed in the form ",www,hhh,". Any number of bytes can be used to represent values because there is comma separation between the width and height; this allows for any level of resolution. An example is ",S,d," (83*100) which could be used for an "Eyebrow left" Part. Each Part can be set to use a different resolution, so the decoding program needs to increase the scale of Parts recorded at lower resolution. For example, if a Head Part (HED) was recorded at a resolution of 300 pixels wide, and a Forehead Part (FHD) was compacted to a resolution of 75 pixels wide, the Forehead will need scaling up by four times so that it is the same width as the Head. Naturally, if the user has set their viewing window to a resolution of just 150 pixels wide, the Head will need scaling down to half-size, and the Forehead will need scaling up to twice the size. The user's resolution should always be used in the calculation for scaling, rather than basing it on the transmitted Frame resolution. This is also quicker and less prone to error.
   C. "Part-depth" = one byte for the Part's colour depth as follows:
                    Number of colours   |Bit   |Name       |
                                        |depth |           |
                    --------------------+------+-----------+
                                      2 |  1   |monochrome |
                                     16 |  4   |rough      |
                                    256 |  8   |low        |
                                 65,536 | 16   |medium     |
                             16,777,216 | 24   |high       |
                          4,294,967,296 | 32   |super      |
                    281,474,976,710,656 | 48   |filmscan   |
                    --------------------+------+-----------+

This byte is encoded literally to speed decoding.
   D. "Part-position" = the Part's 2D co-ordinate positioning in the form ",xxx,yyy,". Any number of bytes can be used to represent values because of comma-delimitation. An example is ",},q," (125,113) which, in a Frame measuring 640 by 480 pixels, could be a "Forehead" position.
   E. "Part-type-format" = one byte for the Part's data format chosen by the encoding software, which can be:
           &4Ah (ASCII "J") for JPEG,
           &50h (ASCII "P") for PNG, or
           &52h (ASCII "R") for RMC - Run-Modal Compression (see below).
   F. "Part-type-version" = one byte for the Part's data format version chosen by the encoding software. For "RMC" the version is 1 for the format defined below, so the encoded byte would be &01h. For "JPEG", the version ranges from 0 to &64h for zero to 100% quality, and &65h to &c8h for JPEG 2000 zero to 100% quality. For "PNG", the version ranges according to the PNG specification.
   G. "Part-length" = the Part's data length in bytes in the form ",lll,". Any number of bytes can be used to represent values because of comma-delimitation.
   H. "Part-data" = the Part's compressed data, in JPEG, PNG or RMC format matching the "Part-type-format" and "Part-type-version".
"JPEG" = Joint Photographic Experts Group:
JPEG compression can be used instead of RMC. This does not offer support for transparency, but creates a generally small amount of data, even at very low-loss levels.
"PNG" = Portable Network Graphics:
PNG can be used instead. This offers support for transparency.
"RMC" = Run-Modal Compression:
This is a simple variant of Run-Length-based bitmap file formats. The colour depth is already encoded in the file, so the data just alternates between a byte for the run-length in pixels (up to 255 pixels wide) and the modal average colour for that section of pixels in the relevant number of bytes. The degree of transparency (opacity/translucency) can be recorded per-pixel using 32-bit colour-depth in the standard ARGB format. For example, if most of 255 pixels is semi-opaque bright purple and 32-bit colour is being used, the four bytes of data in hex would be "7fff00ff", where the starting "7f" means 127 (semi-opacity), and the ending "ff00ff" would be the colour bright purple. The colour of a pixel or pixel group in a 48-bit Part will always use 6 bytes. There is no 40-bit encoding because this is not a standard within computer graphics. The technique for the modal averaging of the pixels and the decision of where the run-lengths should occur is entirely the choice of the programmer who writes the encoder; they can specify the degree of "lossiness" or offer the user options to control this. The decoder program only has to read the colour depth, then loop through the RMC data. To end an RMC data section, a value of zero is used for the run-length in pixels, whereas data in other formats is bounded by the preceding Part-length.
   I. The data stream then continues with either the next Part-code or the next Persec.
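
The RMC loop described above (read a run-length byte, read a colour, stop at a zero run-length) might look like the sketch below. The function names are hypothetical. The encoder shown is only the simplest lossless choice, one run per stretch of identical colours, and does no modal averaging. How colour depths below 8 bits are packed is not stated above, so the sketch handles whole-byte depths only.

```python
def decode_rmc(data: bytes, depth_bits: int) -> tuple[list[bytes], int]:
    """Expand RMC runs into one colour value per pixel.

    Returns the pixels and the number of bytes consumed, terminator included.
    """
    if depth_bits % 8 != 0:
        raise ValueError("this sketch only handles whole-byte colour depths")
    colour_size = depth_bits // 8
    pixels: list[bytes] = []
    pos = 0
    while True:
        if pos >= len(data):
            raise ValueError("missing zero run-length terminator")
        run = data[pos]
        pos += 1
        if run == 0:  # a zero run-length ends the RMC data section
            return pixels, pos
        colour = data[pos:pos + colour_size]
        if len(colour) != colour_size:
            raise ValueError("truncated colour value")
        pos += colour_size
        pixels.extend([colour] * run)


def encode_rmc(pixels: list[bytes]) -> bytes:
    """Lossless runs of identical colours, each capped at 255 pixels."""
    out = bytearray()
    i = 0
    while i < len(pixels):
        run = 1
        while i + run < len(pixels) and run < 255 and pixels[i + run] == pixels[i]:
            run += 1
        out.append(run)
        out += pixels[i]
        i += run
    out.append(0)
    return bytes(out)


row = [b"\xff\xff\x00\xff"] * 300 + [b"\xff\x00\x00\x00"] * 2  # 32-bit ARGB
packed = encode_rmc(row)
print(len(packed))                        # 16
print(decode_rmc(packed, 32)[0] == row)   # True
```

The 300-pixel stretch is split into runs of 255 and 45 because a run-length occupies a single byte.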
The format for a whole Frame (with the first Part and several other Parts in RMC format) is as follows (key below):
 BKG,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 HED,www,hhh,c,xxx,yyy,fvdddddddddddd0
 HAI,www,hhh,c,xxx,yyy,fvdddddddddddddd0
 EAL,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 EAR,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 FAC,www,hhh,c,xxx,yyy,fvdddddddddddddd0
 FHD,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 EBL,www,hhh,c,xxx,yyy,fvdddddddddddddddddddd0
 EBR,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 EYL,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 PUL,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 EYR,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 PUR,www,hhh,c,xxx,yyy,fvdddddddddddddd0
 NOS,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 MOU,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 CHI,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
 NEC,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0

BKG / HED / HAI etc. = literal codes for Parts as detailed previously.
www = width in pixels.
hhh = height in pixels.
c = colour bit-depth.
xxx = x co-ordinate position in pixels.
yyy = y co-ordinate position in pixels.
f = format for successive image data.
v = version for format for successive image data.
p = number of pixels in a run-length.
cccc = 32-bit colour code for modal average of pixels' run-length.
ddd = variable-length data for other image data formats.
0 = a zero-value byte to end the Part data.
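
The comma-delimited fields in the layout above could be handled along these lines. This is a sketch with hypothetical names. It assumes each value is written as base-256 digits, most significant first, as in the Part-length example (94*256)+100. It does not address a value byte that happens to equal the comma itself (&2Ch, decimal 44), because the format does not say how that case is resolved.

```python
COMMA = 0x2C  # ASCII ","


def encode_value(value: int) -> bytes:
    """Base-256 digits, most significant first, using at least one byte."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def encode_pair(first: int, second: int) -> bytes:
    """Build a two-value field such as Part-resolution or Part-position."""
    return b"," + encode_value(first) + b"," + encode_value(second) + b","


def decode_values(raw: bytes, start: int, count: int) -> tuple[list[int], int]:
    """Read `count` values beginning at the comma found at index `start`.

    Returns the values and the index just past the closing comma.
    """
    if raw[start] != COMMA:
        raise ValueError("field must start with a comma")
    pos = start + 1
    values = []
    for _ in range(count):
        end = raw.index(COMMA, pos)  # raises ValueError if no closing comma
        values.append(int.from_bytes(raw[pos:end], "big"))
        pos = end + 1
    return values, pos


def parse_part_header(raw: bytes) -> dict:
    """Parse Part-code through Part-type-version, following the layout above."""
    code = raw[0:3].decode("ascii")
    (width, height), pos = decode_values(raw, 3, 2)
    depth = raw[pos]
    (x, y), pos = decode_values(raw, pos + 1, 2)
    return {
        "code": code, "width": width, "height": height, "depth": depth,
        "x": x, "y": y, "format": chr(raw[pos]), "version": raw[pos + 1],
        "data_offset": pos + 2,
    }


header = b"BKG" + encode_pair(126, 100) + b"\x30" + encode_pair(125, 113) + b"R\x01"
print(header)                     # b'BKG,~,d,0,},q,R\x01'
print(parse_part_header(header))
```

Where a Part-length field is present, it can be read the same way with decode_values(raw, offset, 1).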

Time-line for transmission:
The diagrams below show how data is sent for a 12-frame per second transmission. Each diagram shows half of a second. The data sections sent are coded "P" for "Persec", "M" for "Marker", "S" for "Sampling", and "F" for "Frame". The dots show that the sections can take a variable amount of time to transmit, so the receiving program should prepare for each data section at these times. It is up to the program whether or not it should discard data or remember it for delayed playback.

 -------------+---------------------------------------------------------
 Time         |0         1         2         3         4
 in 100ths/sec|01234567890123456789012345678901234567890123456789
 ------------- ---------------------------------------------------------
 Section sent |P.MS.F.  MF.     MF.     MF.      MF.     MF.
 -------------+---------------------------------------------------------


 -------------+---------------------------------------------------------
 Time         |5         6         7         8         9         1second
 in 100ths/sec|012345678901234567890123456789012345678901234567890
 ------------- ---------------------------------------------------------
 Section sent |MF.      MF.     MF.     MF.      MF.     MF.     P.MS.F.
 -------------+---------------------------------------------------------

As can be seen, although a frame occurs about every 0.0833 seconds, the transmission data timing is slightly adjusted to keep synchronisation based on integers. The timing for the above is therefore 0, 0.09, 0.17, 0.25, 0.34, 0.42, then 0.5, 0.59, 0.67, 0.75, 0.84, 0.92, and 1. This keeps the time sent sufficiently accurate for the receiving program to synchronise to this rate. To simplify synchronisation, the data length of each Persec should be kept short so that each Marker can be decoded at regular intervals. For the decoding program, it is simplest to interpret the data as a continuous stream, and create synchronisation based on the stream, rather than attempt to read sections of data at specific time intervals.
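
The integer timings listed above can be reproduced with a short calculation. The rounding rule is not stated in this document; rounding each frame time up to the next whole hundredth of a second is an inference that happens to match the listed values for 12 frames per second. The function name is hypothetical.

```python
def marker_times(frame_rate: int) -> list[int]:
    """Send times in hundredths of a second, one per frame within a second.

    Each exact time i/frame_rate is rounded up to a whole hundredth
    (integer ceiling division, so no floating-point error is involved).
    """
    return [-(-i * 100 // frame_rate) for i in range(frame_rate)]


print(marker_times(12))
# [0, 9, 17, 25, 34, 42, 50, 59, 67, 75, 84, 92]
```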

Final comment:
At the moment, the Frame is defined for people talking close-up, but the principle of separating an image could apply to other subjects. For example, in a high camera view of a sports match the players and ball are generally more interesting than the crowd. The crowd areas could therefore be split into a lower-resolution image without a central area, and the pitch could be kept at a higher resolution to allow for the players and ball movement. With motion-tracking software, the image could be split by player, so a cricket game of 9 fielders, 2 batsmen, 1 umpire, a ball, and 2 wickets could be split into 15 images of higher resolution, and 1 low-resolution image of the pitch.

Citations:
None known.

Acknowledgements:
Conceptual development - Duncan Peters