HVV - 'HeadView Video' format specification
Copyright: Kevin Paul Markwell on Tuesday 13th March 2007.
All rights reserved by the copyright author.
Permissions: By application to the author via
kevinmarkwell@gmail.com
Version: 1 LIVE - for submission to the
Internet Engineering Task Force (IETF)
Note:
This document has been manually converted from the official format structure laid out by the IETF.
The IETF format structure requires fixed-width 72-character wide text with basic symbols used for diagrams and tables.
For this web page version, I have made some typographical changes to improve readability.
All diagrams and tables remain in their original form for now.
Abstract:
1. This file format is designed primarily for close-up face-to-face video-conferencing via the internet. At the moment, graphic file formats make no specific allowance for the unique characteristics of the human face, so streamed video files of people's faces can be unnecessarily large.
2. Although broadband has allowed for files to be much larger, large files increase load on server resources. Currently, the best still image compression is achieved using JPEG technology, with DiVX MPEG for video. The "HVV" format is designed to control the use of small JPEG files to comprise each camera still-frame, and to inter-mix synchronised audio data to form a video-style file format.
Data structure:
1. To allow for flexible streaming, the data stream is organised into sections called Persecs, Markers, Samplings and Frames. A "Persec" section occurs once for each second of transmission. These contain data about the transmission itself - date and time, the broadcaster's current upload rate, the transmitting URI, and the current second's worth of subtitling. Regardless of the frequency of video and sound capturing, each second's worth of camera recording is separated by a Persec.
2. A section called a "Marker" follows, occurring at the frequency of the capture rate. Each contains a timing serial number so data can be sorted correctly upon reception (see "Marker-serial" below).
3. A section called a "Sampling" follows, containing details of the sound captured for the current second and either the compressed sound data itself or phonemes for use with speech synthesis.
4. A section called a "Frame" follows, containing specially-compacted data from a single camera still.
The Frame is last in the sequence for the benefit of a receiver who wants to hear audio only, and is not interested in seeing who is speaking. This format is therefore also suitable for audio-only transmissions such as "internet radio stations".
5. If the frame rate of capture was 15 frames per second, there would be a total of 32 sections of data - 1 Persec, 15 Markers, 1 Sampling, and 15 Frames.
6. Regardless of the size of the captured data, the data is broken up into datagram packets of 10 Kilobytes (10240 bytes=81920 bits). This allows for low-specification computers and/or busy networks to send and receive this data with ease.
7. Below is a complete breakdown of the data structure hierarchy (details are described in the next sections):
-------------------+-----+--------------------+------------------------
DATA SECTION'S NAME|BYTES|EXAMPLE LITERAL |MEANING
|USED |AND/OR HEX (hh/&__h)|
-------------------+-----+--------------------+------------------------
Persec:
Persec-Time | 7 |07 d7 03 0d 10 1d 31|13/03/2007 at 16:29:49
Persec-Rate | 7 |00 00 00 00 3f f4 00|4191232 bits per second
Persec-URI |>9 |"http://www.a.it:89"|fully-qualified URI
Persec-Subtitling:
Subtitle-language| 2 |en |ISO-based language code
Subtitle-words |>1 |hello there&0h |words said during second
------------------- ----- -------------------- ------------------------
Marker:
Marker-label | 2 |MA |"Marker"
Marker-serial | 2 |0fh |15 frames-per-second
------------------- ----- -------------------- ------------------------
Sampling:
Audio-format | 3 |MP3 |MP3 format indicator
Audio-bit-rate | 6 |00 00 00 05 00 00 |320Kbps (40KB/s)
Audio-data-length | 4 |00 00 00 28 |40 Kilobytes of data
Audio-data |>0 | |
------------------- ----- -------------------- ------------------------
Frame:
Part-code | 3 |BKG |"background"
Part-resolution |>4 |,~,d, |width and height
Part-depth | 1 |&30h |48-bit colour depth
Part-position |>4 |,},q, |x and y co-ordinates
Part-type-format | 1 |&52h |RMC-format indicator
Part-type-version | 1 |&01h |version 1
Part-length |>3 |,^d, |24164 bytes=(94*256)+100
Part-data |>1 | |run-lengths and colours
-------------------+-----+--------------------+------------------------
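As a rough illustration of items 5 and 6 above, the following Python sketch lists the sections sent for one second of capture and splits a byte string into 10240-byte packets. The function names are hypothetical. Sending the final packet short (unpadded) is an assumption, since the specification does not say how a remainder is handled.

```python
PACKET_SIZE = 10240  # 10 Kilobytes per datagram packet, as stated in item 6


def sections_for_one_second(frames_per_second):
    """Return the section codes sent during one second of capture.

    P = Persec, M = Marker, S = Sampling, F = Frame. The ordering
    (Persec, Marker, Sampling, Frame, then Marker/Frame pairs) follows
    the transmission time-line later in this document.
    """
    if frames_per_second < 1:
        raise ValueError("at least one frame per second is assumed")
    return ["P", "M", "S", "F"] + ["M", "F"] * (frames_per_second - 1)


def split_into_packets(data, size=PACKET_SIZE):
    """Split a byte string into packets of at most `size` bytes.

    Assumption: the last packet is sent short rather than padded.
    """
    return [data[i:i + size] for i in range(0, len(data), size)]


sections = sections_for_one_second(15)
print(len(sections))                      # 32

packets = split_into_packets(bytes(24164))
print([len(p) for p in packets])          # [10240, 10240, 3684]
```

The 24164-byte input matches the Part-length example in the table, which would therefore span three packets.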
Definition of a "Persec":
This is the data section containing the primary transmission information. The receiver's interpretation of this gives key information about the reception and control of the remainder of the data stream. The data is defined as follows:
1. "Persec-time" = seven bytes for the transmission date and time in the format YYYY/MM/DD@HH:MM:SS according to the internet-based Network Time Protocol (NTP, the internet's Universal Time Clock). The 8-byte NTP format can be converted to be used as follows:
Byte number(/s)|Purpose
---------------+--------------------------------------------------------
1 and 2        |Year (0000-9999) in hexadecimal (&00h &00h to &27h &0fh)
3              |Month (1-12) in hexadecimal (&01h to &0ch)
4              |Date (1-31) in hexadecimal (&01h to &1fh)
5              |Hour (0-23) in hexadecimal (&00h to &17h)
6              |Minute (0-59) in hexadecimal (&00h to &3bh)
7              |Second (0-59) in hexadecimal (&00h to &3bh)
---------------+--------------------------------------------------------
2. "Persec-rate" = seven bytes for the transmission's current upload rate in bits per second (values up to 72057594037927935, or nearly 64 Petabits per second). The use of Petabits per second seems unlikely, but future computer speeds are often underestimated; the seven-byte field is used here to demonstrate that any value could be represented if the number of bytes used to hold the value is agreed upon.
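The two fixed-width Persec fields can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the most-significant-byte-first order is inferred from the examples in the data structure table (07 d7 for the year 2007). The conversion from the 8-byte NTP timestamp is not shown; a ready-made date and time is assumed.

```python
from datetime import datetime


def encode_persec_time(moment):
    """Pack a date and time into the seven Persec-time bytes."""
    return (moment.year.to_bytes(2, "big")
            + bytes([moment.month, moment.day,
                     moment.hour, moment.minute, moment.second]))


def decode_persec_time(raw):
    """Unpack seven Persec-time bytes into a datetime."""
    if len(raw) != 7:
        raise ValueError("Persec-time is exactly seven bytes")
    year = int.from_bytes(raw[0:2], "big")
    month, day, hour, minute, second = raw[2:7]
    return datetime(year, month, day, hour, minute, second)


def encode_persec_rate(bits_per_second):
    """Pack an upload rate into the seven Persec-rate bytes."""
    return bits_per_second.to_bytes(7, "big")


def decode_persec_rate(raw):
    """Unpack the seven Persec-rate bytes into bits per second."""
    return int.from_bytes(raw, "big")


stamp = encode_persec_time(datetime(2007, 3, 13, 16, 29, 49))
print(stamp.hex())                          # 07d7030d101d31
print(encode_persec_rate(4191232).hex())    # 000000003ff400
```

Both printed values agree with the Persec-Time and Persec-Rate examples in the data structure table. Python's `datetime` does not accept the year 0000, so a real decoder would need to treat that value separately.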
Definition of a "Marker":
The Marker data section is defined as follows:
1. "Marker-label" = two bytes for a fixed label indicator ("MA").
2. "Marker-serial" = two bytes for the serial number, to allow for the scope of up to 65535 frames per second.
3. In total the Marker data's size is four bytes.
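A minimal Python sketch of the four-byte Marker follows. The function names are hypothetical, and the byte order of the serial number (most significant byte first) is an assumption, since the definition above does not state it.

```python
MARKER_LABEL = b"MA"  # fixed two-byte label from item 1


def encode_marker(serial):
    """Build the four-byte Marker: the label followed by a two-byte serial."""
    if not 0 <= serial <= 0xFFFF:
        raise ValueError("Marker-serial must fit in two bytes")
    # Assumption: most significant byte first.
    return MARKER_LABEL + serial.to_bytes(2, "big")


def decode_marker(raw):
    """Return the serial number held in a four-byte Marker."""
    if len(raw) != 4 or raw[:2] != MARKER_LABEL:
        raise ValueError("not a Marker section")
    return int.from_bytes(raw[2:], "big")


marker = encode_marker(15)
print(marker.hex())            # 4d41000f
print(decode_marker(marker))   # 15

# A receiver could restore the capture order of out-of-order Markers.
received = [encode_marker(n) for n in (3, 1, 2)]
print(sorted(decode_marker(m) for m in received))   # [1, 2, 3]
```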
Definition of a "Sampling":
Samplings can contain either audio or phoneme data, both, or neither.
The audio data is defined as follows:
1. "Audio-format" = three bytes showing one of these literal strings:
PCM = uncompressed sound data, ideal for high bandwidth used with older or unknown computers.
MP3 = as per the MP3 file specification, allowing for bitrate variation and ID3 tagging.
2. "Audio-bit-rate" = six bytes for the bit-rate in bits per second, allowing for values up to 281474976710655 (nearly 32 Terabytes per second). Here is an example for 320Kbps (327680 bits per second):
----------------------+----+------------------+------------+-----------
Byte Position         |Byte|Byte column       |Example     |Decimal
                      |num.|multiplier        |for 320Kbps |value
---------------------- ---- ------------------ ------------ -----------
Most significant byte | 1  |    1099511627776 | 00         |     0
                      | 2  |       4294967296 | 00         |     0
                      | 3  |         16777216 | 00         |     0
                      | 4  |            65536 | 05         |327680
                      | 5  |              256 | 00         |     0
Least significant byte| 6  |                1 | 00         |     0
----------------------+----+------------------+------------+-----------
In the case of MP3 data, the bit-rate could be encoded here as well as within the Audio-data itself.
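The six-byte Audio-bit-rate encoding can be checked with a short Python sketch. The function names are hypothetical; the most-significant-byte-first order follows the example bytes 00 00 00 05 00 00 in the data structure table.

```python
def encode_audio_bit_rate(bits_per_second):
    """Pack a bit-rate into the six Audio-bit-rate bytes."""
    return bits_per_second.to_bytes(6, "big")


def byte_contributions(raw):
    """Return the decimal value contributed by each byte.

    Each byte is multiplied by its column multiplier, a power of 256,
    with the most significant byte first.
    """
    count = len(raw)
    return [value * 256 ** (count - 1 - index)
            for index, value in enumerate(raw)]


encoded = encode_audio_bit_rate(327680)      # 320Kbps
print(encoded.hex())                         # 000000050000
print(byte_contributions(encoded))           # [0, 0, 0, 327680, 0, 0]
print(sum(byte_contributions(encoded)))      # 327680
```

Only the fourth byte is non-zero, because 327680 is exactly 5 x 65536.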
Definition of a Frame:
1. The major difference between the way camera frames are currently stored and this format is the shape. Recordings and their resultant frame images have generally used a ratio of either 4:3 or 16:9, so they are rectangular. This format exclusively uses a 3:4 ratio and the resultant frame image is elliptical. This does not rely on the person being recorded keeping their head still, because algorithms are applied to dispose of the extraneous area surrounding the captured human head. Whether or not the neck and/or shoulders are included, and the subject's distance from the camera, can be changed at any point during recording. The 3:4 ratio and elliptical format create a result similar to the traditional framing of portrait photography.
2. Another significant difference using this format is the use of resolution variance. A single frame image can contain differing levels of resolution and colour, creating a collage of different image sizes and colours to make up the image.
3. Streaming is done frame-by-frame to allow for maximum reception advantages. This format does NOT use the principle of grouping frames to compress data. It relies entirely on separate compression for each frame.
4. When people look at a person talking, they usually concentrate on a few features of the person's face. Although people might choose to look at other parts of the head, most movement occurs on these features, so these are generally more interesting. The Frame data has therefore been ordered so that the features with the largest data resolution or change frequency come first. Each of these facial parts is referred to as a "Part". The order of data storage is as follows:
mouth (MOU)
left pupil (PUL)
right pupil (PUR)
eye left (EYL)
eye right (EYR)
eyebrow left (EBL)
eyebrow right (EBR)
nose (NOS)
forehead (FHD)
chin (CHI)
ear left (EAL)
ear right (EAR)
neck (NEC)
hair (HAI)
face (FAC)
head (HED)
background (BKG)
5. The three-letter abbreviation is used as a descriptor for the data. Receiving programs could choose to ignore certain parts of the transmission by ignoring the data stored next to a given descriptor. For example, a low-bandwidth receiving program could just read the data for the "neck" (NEC) once and keep that data on-screen instead of constantly reading new neck data that might have hardly changed. Anything which obscures one of these Parts would be compressed at the same resolution and colour depth as the part or parts it obscures.
Frame
|_____background (BKG)
|_____head (HED)
|_______hair (HAI)
|_______ear left (EAL)
|_______ear right (EAR)
|_______face (FAC)
| |_____forehead (FHD)
| |_____eyebrow left (EBL)
| |_____eyebrow right (EBR)
| |_____eye left (EYL)
| | |____pupil left (PUL)
| |
| |_____eye right (EYR)
| | |____pupil right (PUR)
| |
| |_____nose (NOS)
| |_____mouth (MOU)
| |_____chin (CHI)
|
|_______neck (NEC)
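The storage order and the hierarchy above can be held in two small Python structures, which a receiving program might use to ignore selected Parts. This is a sketch only: the names are hypothetical, the parent links are one reading of the diagram (in particular, the neck is taken to sit under the head), and each Part is assumed to have already been parsed into a (descriptor, data) pair.

```python
# Storage order from item 4.
PART_ORDER = ["MOU", "PUL", "PUR", "EYL", "EYR", "EBL", "EBR", "NOS",
              "FHD", "CHI", "EAL", "EAR", "NEC", "HAI", "FAC", "HED", "BKG"]

# Parent of each Part, as read from the hierarchy diagram (an assumption).
PARENT = {
    "BKG": None, "HED": None,
    "HAI": "HED", "EAL": "HED", "EAR": "HED", "FAC": "HED", "NEC": "HED",
    "FHD": "FAC", "EBL": "FAC", "EBR": "FAC", "EYL": "FAC", "EYR": "FAC",
    "NOS": "FAC", "MOU": "FAC", "CHI": "FAC",
    "PUL": "EYL", "PUR": "EYR",
}


def ancestors(code):
    """Return the Parts that contain the given Part, innermost first."""
    chain = []
    parent = PARENT[code]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def keep_wanted(parts, ignored):
    """Drop (descriptor, data) pairs whose descriptor the receiver ignores."""
    return [(code, data) for code, data in parts if code not in ignored]


print(ancestors("PUL"))                                         # ['EYL', 'FAC', 'HED']
print(keep_wanted([("MOU", b"m"), ("NEC", b"n")], {"NEC"}))     # [('MOU', b'm')]
```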
Number of colours |Bit |Name |
|depth | |
--------------------+------+-----------+
2 | 1 |monochrome |
16 | 4 |rough |
256 | 8 |low |
65,536 | 16 |medium |
16,777,216 | 24 |high |
4,294,967,296 | 32 |super |
281,474,976,710,656 | 48 |filmscan |
--------------------+------+-----------+
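Each row of the table above follows from the rule that a depth of n bits gives 2 to the power n colours. The Python sketch below reproduces the table from a Part-depth byte. The names are hypothetical, and rejecting depths that are not listed is an assumption.

```python
DEPTH_NAMES = {
    1: "monochrome",
    4: "rough",
    8: "low",
    16: "medium",
    24: "high",
    32: "super",
    48: "filmscan",
}


def colour_count(bit_depth):
    """Number of distinct colours available at a given bit depth."""
    return 2 ** bit_depth


def describe_depth(depth_byte):
    """Return (name, number of colours) for a Part-depth byte."""
    name = DEPTH_NAMES.get(depth_byte)
    if name is None:
        # Assumption: only the depths listed in the table are valid.
        raise ValueError("unlisted colour depth: %d" % depth_byte)
    return name, colour_count(depth_byte)


print(describe_depth(0x30))   # ('filmscan', 281474976710656)
print(describe_depth(8))      # ('low', 256)
```

The first call uses &30h, the Part-depth example from the data structure table.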
&4Ah (ASCII "J") for JPEG,
&50h (ASCII "P") for PNG, or
&52h (ASCII "R") for RMC - Run-Modal Compression (see below).
F. "Part-type-version" = one byte for the Part's data format version chosen by the encoding software. For "RMC" the version is 1 for the format defined below, so the encoded byte would be &01h. For "JPG", the version ranges from 0 to &64h for zero to 100% quality, and &65h to &c8h for JPEG 2000 zero to 100% quality. For "PNG", the version ranges according to the PNG specification.
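A receiver might interpret the format and version bytes together along the following lines. This Python sketch uses hypothetical names. How &65h to &c8h map onto a JPEG 2000 quality figure is not spelled out above, so the code simply reports the offset from &65h (0 to 99), which is an assumption.

```python
FORMATS = {0x4A: "JPEG", 0x50: "PNG", 0x52: "RMC"}


def describe_part_type(format_byte, version_byte):
    """Return (format name, level) for a Part's format and version bytes."""
    name = FORMATS.get(format_byte)
    if name is None:
        raise ValueError("unknown Part-type-format: %#04x" % format_byte)
    if name == "JPEG":
        if version_byte <= 0x64:
            return "JPEG", version_byte            # quality, 0 to 100
        if version_byte <= 0xC8:
            # Assumption: report the offset within the JPEG 2000 range.
            return "JPEG 2000", version_byte - 0x65
        raise ValueError("JPEG version byte out of range")
    # RMC and PNG: the version byte is passed through unchanged.
    return name, version_byte


print(describe_part_type(0x52, 0x01))   # ('RMC', 1)
print(describe_part_type(0x4A, 0x50))   # ('JPEG', 80)
print(describe_part_type(0x4A, 0x65))   # ('JPEG 2000', 0)
```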
BKG,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 HED,www,hhh,c,xxx,yyy,fvdddddddddddd0 HAI,www,hhh,c,xxx,yyy,fvdddddddddddddd0 EAL,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 EAR,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 FAC,www,hhh,c,xxx,yyy,fvdddddddddddddd0 FHD,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 EBL,www,hhh,c,xxx,yyy,fvdddddddddddddddddddd0 EBR,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 EYL,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 PUL,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 EYR,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 PUR,www,hhh,c,xxx,yyy,fvdddddddddddddd0 NOS,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 MOU,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 CHI,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0 NEC,www,hhh,c,xxx,yyy,fvpccccpccccpccccpcccc0
Time-line for transmission:
The diagrams below show how data is sent for a 12-frame-per-second transmission. Each diagram shows half of a second. The data sections sent are coded "P" for "Persec", "M" for "Marker", "S" for "Sampling", and "F" for "Frame". The dots show that the sections can take a variable amount of time to transmit, so the receiving program should prepare for each data section at these times. It is up to the program whether or not it should discard data or remember it for delayed playback.
--------------+--------------------------------------------------------
Time          |0         1         2         3         4
in 100ths/sec |01234567890123456789012345678901234567890123456789
-------------- --------------------------------------------------------
Section sent  |P.MS.F. MF.      MF.     MF.     MF.      MF.
--------------+--------------------------------------------------------

--------------+--------------------------------------------------------
Time          |5         6         7         8         9         1second
in 100ths/sec |012345678901234567890123456789012345678901234567890
-------------- --------------------------------------------------------
Section sent  |MF.     MF.      MF.     MF.     MF.      MF.     P.MS.F.
--------------+--------------------------------------------------------
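The nominal start times in a time-line of this kind can be estimated with a short Python sketch. It assumes that frames are evenly spaced across the second and that each start time is rounded to the nearest hundredth. Neither assumption is stated in the specification, and the function name is hypothetical.

```python
def nominal_schedule(frames_per_second):
    """Return (start time in 100ths of a second, sections) for one second.

    The first slot carries Persec, Marker, Sampling and Frame ("PMSF");
    every later slot carries a Marker and a Frame ("MF").
    """
    schedule = []
    for frame in range(frames_per_second):
        # Integer form of round(frame * 100 / fps), rounding halves up.
        start = (frame * 200 + frames_per_second) // (2 * frames_per_second)
        sections = "PMSF" if frame == 0 else "MF"
        schedule.append((start, sections))
    return schedule


for start, sections in nominal_schedule(12):
    print(start, sections)
```

For 12 frames per second the slots start at 0, 8, 17, 25, 33, 42, 50, 58, 67, 75, 83 and 92 hundredths of a second, with the next Persec due at 100.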
Final comment:
At the moment, the Frame is defined for people talking close-up, but the principle of separating an image could apply to other subjects. For example, in a high camera view of a sports match the players and ball are generally more interesting than the crowd. The crowd areas could therefore be split into a lower-resolution image without a central area, and the pitch could be kept at a higher resolution to allow for the players and ball movement. With motion-tracking software, the image could be split by player, so a cricket game of 9 fielders, 2 batsmen, 1 umpire, a ball, and 2 wickets could be split into 15 images of higher resolution, and 1 low-resolution image of the pitch.
Citations:
None known.
Acknowledgements:
Conceptual development - Duncan Peters