Run-length encoding of binary sequences followed by two independent compressions

Technical Field
The present invention relates to compression of sequences of symbols - an encoding/decoding method.
Background Art
The following terminology will be used:
• "codec/coder" : method of encoding/decoding information aiming at shortening
• If $S$ is a sequence of symbols from an alphabet $A$, then the "finite scheme of sequence $S$" is $FS(S) = (A, P(S))$, where $P(S)$ is the set of probabilities of occurrence of each symbol from $A$ in $S$.
• If $S$ is a sequence of symbols from an alphabet $A$, then the "entropy of sequence $S$", or $H(S)$, is the entropy of $FS(S) = (A, P(S))$. [4, §2.3]
• If $S$ is a sequence of symbols from an alphabet $A$, then an "entropy codec/entropy coder" is a method and/or means to encode a symbol $a$ from the sequence $S$, using only $FS(S)$, with an average size close to $-\log_2(p(a))$ bits. As usual, $p(a)$ is the probability of occurrence of $a$ in $S$ (for example, Huffman coding [4, §2.1] or arithmetic coding [4, §4]; US patent 4,122,440).
• If $S$ is a sequence of symbols from an alphabet $A$, then $|S|$ denotes the number of letters from $A$ in $S$.
$|S| \cdot H(S)$ is the size of $S$ compressed with an entropy coder. Run-length encoding is old and simple, but used only in very special cases. For example, see [4, page 2, Pattern-Finding Approaches]: "For instance, fax machines send simple black and white images. These are easily compressed with a solution known as run length encoding, which counts the number of times a black or white pixel is repeated. ..." and "... Run length encoding is the most common example of solutions based on identifying patterns. In most of the cases, patterns are just too complicated for a computer to find regularly." O'Brien found and patented RLE followed by LZ77 (US patent 4,988,998, "Data compression system for successively applying at least two data compression methods to an input data stream"). See [5, §3.3] for the usage of RLE in JPEG.
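For illustration only (not part of the claimed method), the quantities $H(S)$ and $|S| \cdot H(S)$ can be computed as in the following sketch; the function name is ours:

```python
import math
from collections import Counter

def entropy(seq):
    """Entropy H(S) in bits per symbol, computed from the finite scheme FS(S)."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

S = "AABABBBAAB"
print(entropy(S))           # H(S), bits per symbol
print(len(S) * entropy(S))  # |S|*H(S): the best size a pure entropy coder can reach
```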
The enclosed method uses RLE to construct a better entropy compression system, as compared to existing entropy coders. This means that if $S$ is a sequence of symbols from an alphabet $A$ ($A$ and $S$ fix the statistical structure, i.e. the finite scheme), then the length of $S$ encoded with the enclosed method is less than or equal to $|S| \cdot H(S)$ (the best possible achievement of a pure entropy coder).
Disclosure of Invention
Let two sets of numbers $\{c_k\}_{k \ge 1}$ and $\{d_l\}_{l \ge 1}$ be given, where $c_k \ge 0$ and $d_l \ge 0$.
The following denotations will be used:
• $I = \sum_k k \, c_k$ and $O = \sum_l l \, d_l$
• $C = \sum_k c_k$ and $D = \sum_l d_l$
• $p_k = c_k / C$ and $q_l = d_l / D$
If a binary source $BS$ is given, then $c_k$, $d_l$, $C$, $D$, ... can be interpreted as:
• $c_k$ is the number of 1-runs with size $k$
• $d_l$ is the number of 0-runs with size $l$
• $I$ is the number of ones in $BS$
• $O$ is the number of zeros in $BS$
• $C$ is the number of runs with ones
• $D$ is the number of runs with zeros
• $c_k + d_k$ is the number of runs with size $k$
After every run with ones there is a run with zeros (and vice versa), except for the last run in the sequence, so $|C - D| \le 1$ is always true. If $C$ and $D$ are both even (or both odd), then $C = D$. If $C$ is even and $D$ is odd (or vice versa), then $|C - D| = 1$. But in the latter case the last run (a few bits) can be skipped (stored and loaded separately). So, from now on we will assume that $C$ is equal to $D$.
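A minimal sketch of the run counting just described (the helper name is ours); it checks that $|C - D| \le 1$ always holds:

```python
from itertools import groupby

def run_counts(bits):
    """Count the 1-runs (C) and 0-runs (D) of a binary sequence."""
    runs = [(bit, len(list(group))) for bit, group in groupby(bits)]
    C = sum(1 for bit, _ in runs if bit == 1)
    D = sum(1 for bit, _ in runs if bit == 0)
    return C, D

C, D = run_counts([0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0])
assert abs(C - D) <= 1
# If C != D, the last run is skipped (stored separately), so C == D can be assumed.
```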
Let the output sequence of RLE applied on $BS = \{b_1, b_2, b_3, \ldots, b_{|BS|}\}$ be denoted with $RLS$. If $b_1 = 0$ then $RLS$ is $0, x_0, y_0, x_1, y_1, \ldots$. If $b_1 = 1$ then $RLS$ is $1, y_0, x_0, y_1, x_1, \ldots$. Here $x_i$ is the length of the $i$-th run with zeros in $BS$ and $y_j$ is the length of the $j$-th run with ones in $BS$. $RLS$ without the first symbol (which is $b_1$) can be divided into two sequences $RLS_0$ and $RLS_1$, where:
- $RLS_0 = x_0, x_1, \ldots$, where $x_i$ is the length of the $i$-th run with zeros in $BS$.
- $RLS_1 = y_0, y_1, \ldots$, where $y_j$ is the length of the $j$-th run with ones in $BS$.
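The construction of $RLS_0$ and $RLS_1$ can be sketched as follows (a simplified software model; the function names are ours):

```python
from itertools import groupby

def rls_split(bits):
    """Split a binary sequence into (b1, RLS0, RLS1): the first bit plus the
    lengths of its 0-runs and of its 1-runs."""
    rls0, rls1 = [], []
    for bit, group in groupby(bits):
        (rls1 if bit == 1 else rls0).append(len(list(group)))
    return bits[0], rls0, rls1

def rls_join(first_bit, rls0, rls1):
    """Inverse of rls_split: rebuild the original binary sequence."""
    out, bit, i = [], first_bit, [0, 0]
    while i[0] < len(rls0) or i[1] < len(rls1):
        lengths = rls1 if bit == 1 else rls0
        out += [bit] * lengths[i[bit]]
        i[bit] += 1
        bit ^= 1  # runs of zeros and ones alternate
    return out

bits = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0]
assert rls_join(*rls_split(bits)) == bits
```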
The next statements are obvious:
• $p_k = c_k / C$ is the probability of occurrence of 1-runs with size $k$ in $RLS_1$
• $q_l = d_l / D$ is the probability of occurrence of 0-runs with size $l$ in $RLS_0$
• $(c_k + d_k) / (C + D)$ is the probability of occurrence of runs with size $k$ in $RLS$
Three finite schemes can be formed:
• $FS(RLS_1) = (\{k\}, \{p_k\})$
• $FS(RLS_0) = (\{l\}, \{q_l\})$
• $FS(RLS) = (\{k\}, \{(c_k + d_k) / (C + D)\})$
Example 1: ... The size in bits of $RLS$ is 21 (without the skipped last 0). The entropy is $H(BS) = 0.99836$. The overall size of $BS$ encoded with an entropy coder is ..., and the sizes of encoded $RLS_0$ and $RLS_1$ are ...
Example 1 is a regular case and is the object of the invention, as indicated in claim 1: to compress binary sequences better than other entropy coders do. The BS22RLS and RLS22RLS inequalities, which will be proven below, explain the reasons.
Lemma: Let $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$ be arbitrary positive numbers with $\sum_i x_i = \sum_i y_i$. Then $-\sum_i x_i \log_2 x_i \le -\sum_i x_i \log_2 y_i$, with equality if and only if $x_i = y_i$ for all $i$. The lemma was proven in [3, Lemma 1.4.1, page 16].
BS22RLS Inequality: $C \cdot H(RLS_1) + D \cdot H(RLS_0) \le |BS| \cdot H(BS)$, where $p = I / |BS|$, $q = O / |BS|$ and $H(BS) = -p \log_2 p - q \log_2 q$.
Proof: Let us use the lemma two times, having in mind that $\sum_k p_k = \sum_k q \, p^{k-1} = 1$ and $\sum_l q_l = \sum_l p \, q^{l-1} = 1$:
1) Substitute $x_k = p_k$ and $y_k = q \, p^{k-1}$: then $H(RLS_1) \le -\sum_k p_k \log_2 (q \, p^{k-1})$, i.e. $C \cdot H(RLS_1) \le -C \log_2 q - (I - C) \log_2 p$.
2) Substitute $x_l = q_l$ and $y_l = p \, q^{l-1}$: then $H(RLS_0) \le -\sum_l q_l \log_2 (p \, q^{l-1})$, i.e. $D \cdot H(RLS_0) \le -D \log_2 p - (O - D) \log_2 q$.
Summing 1) and 2), and because $C = D$: $C \cdot H(RLS_1) + D \cdot H(RLS_0) \le -I \log_2 p - O \log_2 q = |BS| \cdot H(BS)$.
The BS22RLS inequality is proved.
Equality in BS22RLS is reached when $p = q = 0.5$. It is a corollary of [3, Lemma 1.4.1].
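The BS22RLS inequality, as reconstructed above, can be checked numerically. The following sketch (our code; the sequence is padded so that $C = D$ holds without skipping a run) compares both sides on a random binary source:

```python
import math, random
from collections import Counter
from itertools import groupby

def entropy(seq):
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

random.seed(1)
# Starting with 0 and ending with 1 forces C == D, since run types alternate.
bs = [0] + [1 if random.random() < 0.7 else 0 for _ in range(100_000)] + [1]

rls0 = [len(list(g)) for bit, g in groupby(bs) if bit == 0]
rls1 = [len(list(g)) for bit, g in groupby(bs) if bit == 1]
C, D = len(rls1), len(rls0)

lhs = C * entropy(rls1) + D * entropy(rls0)  # size after two independent compressions
rhs = len(bs) * entropy(bs)                  # |BS| * H(BS): pure entropy coder bound
assert lhs <= rhs + 1e-9
```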
If a binary source $BS$ is given, then symbolize ...
Using BS22RLS, it is possible to design multiple hybrid methods.
The following RLS22RLS inequality explains why it is better to use two independent compressions ($RLS_0$ and $RLS_1$) than just one ($RLS$):
$C \cdot H(RLS_1) + D \cdot H(RLS_0) \le (C + D) \cdot H(RLS)$
Proof: Because the function $x \log_2(x)$ is continuous and convex [2, page 4 or page 6], for every $k$:
$\frac{c_k + d_k}{C + D} \log_2 \frac{c_k + d_k}{C + D} \le \frac{C}{C + D} \cdot \frac{c_k}{C} \log_2 \frac{c_k}{C} + \frac{D}{C + D} \cdot \frac{d_k}{D} \log_2 \frac{d_k}{D}$
Multiplying by $-(C + D)$ and summing over $k$ gives the RLS22RLS inequality.
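The RLS22RLS inequality can likewise be verified numerically (our sketch; it holds for any binary sequence, since it relies only on the convexity argument above):

```python
import math
from collections import Counter
from itertools import groupby

def entropy(seq):
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

bs = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
runs = [(bit, len(list(g))) for bit, g in groupby(bs)]
rls  = [length for _, length in runs]                 # one merged finite scheme
rls0 = [length for bit, length in runs if bit == 0]
rls1 = [length for bit, length in runs if bit == 1]
C, D = len(rls1), len(rls0)

split  = C * entropy(rls1) + D * entropy(rls0)  # two independent compressions
merged = (C + D) * entropy(rls)                 # a single compression of RLS
assert split <= merged + 1e-9
```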
Brief Description of Drawings
An example for carrying out the invention is shown in the attached drawings and is described in detail as follows:
Fig. 1 shows a simplified hardware implementation of an encoder according to claim 1. It consists of:
• "RLE" - run length encoder, its input receives the bits of a binary sequence and its 170 outputs are 0 or 1 runs.
• "Switch" - It has one input and two outputs ("RunO" and "Runl "). The input is going directly to the active output. The output "RunO" can be activated by activating "Select run 0". The output "Runl " can be activated by activating "Select run 1 ". It is necessary to activate an output before starting the encoding process. The first bit of
175 encoded binary sequence can be used to activate an output. The first bit is needed for initialization of the decoder as well.
• "Entropy coder" (e.g. Huffinan[4, §2.1] or Arithmetic [4, §4]) consists of two finite schemes ("FSO" and "FSl") and only one of them is active at a given moment. The "Entropy coder" encodes the input symbol depending on the active finite scheme.
180 "FSO" can be activated by activating "Select FSO". "FSl" can be activated by activating "Select FSl". Activation of an finite scheme must be done before receiving a symbol. "FSO" and/or "FSl" are found earlier or are updated after every symbol (adaptive compression^, §5]).
• Line "Run" is used to move current run from "RLE" to the "Switch"
185 • Line "RunO" is used to move runs with zeros from "Switch" to the "Entropy coder".
The line is responsible to activate the "Select FSO" and "Select run 1 " before sending the run to the "Entropy coder".
• Line "Runl " is used to move runs with ones from "Switch" to the "Entropy coder". Also the line is responsible to activate the "Select FSl" and "Select run 0" before
190 sending the run to the "Entropy coder".
• "First bit": The first bit of encoded binary sequence and it is used to initialize the device.
• " S " : the input of the device.
• "E": the output of the device. 195
Fig. 2 shows a simplified hardware implementation of a decoder according to claim 1. It consists of:
• "RLE" - run length decoder, its input receives 0 or 1 runs. Its outputs are bits of a 200 binary sequence.
• "Switch" - It has one input and two outputs ("RunO" and "Runl "). The input is going
directly to the active output. The output "RunO" can be activated by activating "Select run 0". The output "Runl" can be activated by activating "Select run 1". It is necessary to activate an output before starting the decoding process. The first bit of 205 decoded sequence can be used to activate an output.
• "Entropy coder" (for example Huffman[4, §2.1] or Arithmetic [4, §4]) - consists of two finite schemes ("FSO" and "FSl") and only one of them is active at a given moment. The "Entropy coder" decodes the input symbol depending on the active finite scheme. "FSO" can be activated by activating "Select FSO". "FSl" can be
210 activated by activating "Select FSl ". Activation of an finite scheme must be done before receiving a symbol. "FSO" and/or "FSl" are found earlier or are updated after every symbol (adaptive compression^, §5]).
• Line "Run" is used to move current run from active output of "Switch" to "RLE".
• Line "RunO" is used to move runs with zeros from "Switch" to the "Run". The line is 215 responsible to activate the "Select FSO" and "Select run 1 " before sending the run to the "Run".
• Line "Runl " is used to move runs with ones from "Switch" to the "Run". The line is responsible to activate the "Select FSl" and "Select run 0" before sending the run to the "Run".
220 • "First bit": The first bit of encoded binary sequence and it is used to initialize the device.
• "S": the input of the device.
• "E": the output of the device.
The required explanation of the invention by means of two drawings is attached.
Modes for Carrying Out the Invention (Advanced Entropy Coders)
An advantageous embodiment of the invention is indicated in claim 2. The further development according to claim 2: it is possible to compress any sequence better than other entropy coders do by compressing three sequences, one of which is binary.
Let a sequence $S$ of source symbols be given. The number of its symbols is $|S|$ and the symbols are from an alphabet $A$. Let $S = \{s_1, s_2, \ldots, s_{|S|}\}$; then the set $S_B = \{b_1, b_2, \ldots, b_{|S|}\}$ will denote the sequence of first bits¹, where $b_i$ is the first bit of $s_i$. $S_B$ can be seen as a binary source and as a random variable associated with $S$. Substitute $(X, Y)$ with $S$ and $X$ with $S_B$ in the main equation for conditional uncertainty $H(X, Y) = H(X) + H(Y|X)$ [3, Theorem 1.4.4]. Then $H(S) = H(S_B) + H(Y \mid S_B)$, where:
• $p$ is the probability of 1 in $S_B$
¹ Can be the last bit or some other bit.
• If $S_1$ is the sequence of all elements from $S$ starting with 1, and $S_0$ is the sequence of all elements from $S$ starting with 0, then the next equality is true (base equality 1): $H(S) = H(S_B) + p \cdot H(S_1) + (1 - p) \cdot H(S_0)$.
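Assuming base equality 1 has the conditional-entropy form reconstructed above, it can be verified on any concrete sequence. The following sketch (our code, using the data of Example 2 below) checks it:

```python
import math
from collections import Counter

def entropy(seq):
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

S   = [1, 1, 3, 2, 1, 0, 2, 3, 1, 2]    # 2-bit symbols from A = {0, 1, 2, 3}
S_B = [s >> 1 for s in S]                # first (most significant) bits
S_1 = [s & 1 for s in S if s >> 1 == 1]  # second bits of symbols starting with 1
S_0 = [s & 1 for s in S if s >> 1 == 0]  # second bits of symbols starting with 0

p = sum(S_B) / len(S_B)
lhs = entropy(S)
rhs = entropy(S_B) + p * entropy(S_1) + (1 - p) * entropy(S_0)
assert abs(lhs - rhs) < 1e-9
```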
Example 2: If $A = \{0, 1, 2, 3\}$ and $S = \{1, 1, 3, 2, 1, 0, 2, 3, 1, 2\}$, then:
$S_B = \{0, 0, 1, 1, 0, 0, 1, 1, 0, 1\}$ - the first bits of the binary representation of the elements of $S$.
$S_1 = \{1, 0, 0, 1, 0\}$ - the second bits of the binary representation of the elements of $S$, but only if the first bit is 1.
$S_0 = \{1, 1, 1, 0, 1\}$ - the second bits of the binary representation of the elements of $S$, but only if the first bit is 0.
Now $p = 0.5$, $p_0 = 0.1$, $p_1 = 0.4$, $p_2 = 0.3$, $p_3 = 0.2$.
Some calculations follow.
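These are recomputed here from the example data (our arithmetic); they verify base equality 1:

```latex
\begin{align*}
H(S)   &= -(0.1\log_2 0.1 + 0.4\log_2 0.4 + 0.3\log_2 0.3 + 0.2\log_2 0.2) \approx 1.84644\\
H(S_B) &= -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1\\
H(S_1) &= -(0.4\log_2 0.4 + 0.6\log_2 0.6) \approx 0.97095\\
H(S_0) &= -(0.8\log_2 0.8 + 0.2\log_2 0.2) \approx 0.72193\\
H(S_B) &+ p\,H(S_1) + (1-p)\,H(S_0) = 1 + 0.5\cdot 0.97095 + 0.5\cdot 0.72193 \approx 1.84644 = H(S)
\end{align*}
```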
An advantageous embodiment of the invention is indicated in claim 3. The further development according to claim 3: it is possible to compress any sequence better than other entropy coders do by compressing several binary sequences.
Proof: Base equality 1 can be applied to $S_1$ and $S_0$ also, and so on.
Proof: Because ... for every $u$.
In Example 2, $S_1$ and $S_0$ can be compressed further with BS22RLS, but there is no further compression for $S_B$ because $p = q = 0.5$.
Industrial Applicability
The invention can be used in digital communication, digital television, digital photography, and computers - especially in JPEG and MPEG: "The JPEG algorithm, for instance, can use either Huffman coding or arithmetic coding to compress the coefficients" [4, §4.4]; "Lossy JPEG compression can be described in six main steps: ... 5. Run length coding - in order to make the best possible use of the long series of zeros ... 6. Variable length coding (Huffman coding) ..." [5, §3.3].
References:
[1] Claude E. Shannon, Warren Weaver, The Mathematical Theory of Communication, University of Illinois Press (1998)
[2] A. I. Khinchin, Mathematical Foundations of Information Theory, Dover Publications, Inc., New York (1957)
[3] Robert B. Ash, Information Theory, Dover Publications, Inc., New York (1990)
[4] Peter Wayner, Compression Algorithms for Real Programmers, Morgan Kaufmann / Academic Press, a Harcourt Science and Technology Company (2000)
[5] H. Benoit, Digital Television: MPEG-1, MPEG-2 and Principles of the DVB System, Focal Press (Second edition 2002)