【数据压缩】第二类词典编码LZW

Apr. 22, 2022

LZW简介

标题第二类词典编码算法的想法是企图从输入的数据中创建一个“短语词典”，这种短语可以是任意字符的组合，编码数据过程中当遇到已经在词典中出现的短语时，编码器就输出这个词典中的短语的“索引号”，而不是短语本身。

不再设置搜索缓冲求
编解码器应同步建立词典。

LZW压缩编码后，不需要传输词典，解码器可以在解码过程中自行建立词典。LZW的词典以所有单个字符形成的词典（ASCII码和对应的0~255的索引号）为基础，在这个基础上根据输入逐渐拼接出每一条短语。

LZW编码算法的思想

编码时，有一个前缀串变量P和一个单字符变量C，

词典初始化未包含所有的单字符，当前前缀串为空
将下一个字符读入C
检查P+C是否在词典中。如果在，将C添加到P，然后回到2；如果不在，将输出与P对应的码字（短语索引），将P+C放到词典中，然后令P=C，也就是在P中只保留刚读入的单字符，回到2.

编解码示例

假设要编码的句子是abbababac。

编码

Step1

a b b a b a b a c

初始化P=""，为空，初始词典为ASCII码表。

索引	短语
0	...
65	A
...	...
97	a
98	b
99	c
...	...
...	...

Step2

↓
a b b a b a b a c

读入第一个字符到C，此时P=""，C = 'a'。 P+C是否在词典中？ P+C="a"，在词典中，对应索引号为97。令P = P+C，得到P="a"。

Step3

a ↓
a b b a b a b a c
↓
97

读入下一个字符到C，此时P="a"，C='b'。 P+C是否在词典中？ P+C="ab"，不在词典中，这一个新的短语

输出P对应的码字，此时P="a"，对应的索引号为97，输出这个97
将P+C="ab"放入词典中，索引为256。
令P=C，即P="b"。

此时的词典：

索引	短语
0	...
65	A
...	...
97	a
98	b
99	c
...	...
...	...
256	ab

Step4

  b ↓
a b b a b a b a c
↓ ↓
. 98

读入下一个字符到C，此时P="b"，C='b'。 P+C是否在词典中？ P+C="bb"，不在词典中，这一个新的短语

输出P对应的码字，此时P="b"，对应的索引号为98，输出这个98
将P+C="bb"放入词典中，索引为257。
令P=C，即P="b"。

此时的词典：

索引	短语
0	...
65	A
...	...
97	a
98	b
99	c
...	...
...	...
256	ab
257	bb

Step5

    b ↓
a b b a b a b a c
↓ ↓ ↓
. . 98

读入下一个字符到C，此时P="b"，C='a'。 P+C是否在词典中？ P+C="ba"，不在词典中，这是一个新的短语

输出P对应的码字，此时P="b"，对应的索引号为98，输出这个98
将P+C="ba"放入词典中，索引为258。
令P=C，即P="a"。

此时的词典：

索引	短语
0	...
65	A
...	...
97	a
98	b
99	c
...	...
...	...
256	ab
257	bb
258	ba

Step6

      a ↓
a b b a b a b a c
↓ ↓ ↓ 
. . .

读入下一个字符到C，此时P="a"，C='b'。 P+C是否在词典中？ P+C="ab"，在词典中

令P=P+C，得P="ab"。

Step7

      a b ↓
a b b a b a b a c
↓ ↓ ↓  ↓
. . . 256

读入下一个字符到C，此时P="ab"，C='a'。 P+C是否在词典中？ P+C="aba"，不在词典中，这是一个新的短语

输出P对应的码字，此时P="ab"，对应的索引号为256
将P+C="aba"放入词典中，索引为259。
令P=C，即P="a"。

索引	短语
0	...
65	A
...	...
97	a
98	b
99	c
...	...
...	...
256	ab
257	bb
258	ba
259	aba

Step8

          a ↓
a b b a b a b a c
↓ ↓ ↓  ↓
. . .  .

读下一个字符到C，此时P="a"，C='b'

P+C是否在词典中？P+C="ab"，在词典中，索引为256

令P=P+C = "ab"

Step9

          a b ↓
a b b a b a b a c
↓ ↓ ↓  ↓
. . .  .

读下一个字符到C，此时P="ab"，C='a'

P+C是否在词典中？P+C="aba"，在词典中，索引为260。

Step10

          a b a ↓
a b b a b a b a c
↓ ↓ ↓  ↓    ↓
. . .  .   259

读下一个字符到C，此时P="aba"，C='c'。

P+C是否在词典中？P+C="abac"，不在词典中。

输出P对应得码字，P="aba"，输出259
将P+C="abac"放入字典中，索引为260
令P=C="c"

索引	短语
0	...
65	A
...	...
97	a
98	b
99	c
...	...
...	...
256	ab
257	bb
258	ba
259	aba
260	abac

Step11

                c ↓
a b b a b a b a c
↓ ↓ ↓  ↓    ↓   ↓
. . .  .    .   99

没有更多字符了，输出P，此时P='c'，对应索引号99。

最终得到的码字序列为97,98,98,256,259,99，如果需要9bit来存放每个码字，那么总的大小为$9 \rm bit \times 6=54bit$，原文本$\rm 8bit \times 9 = 72bit$，比压缩前少了18比特。

解码

97 98 98 256 259 99

Step1

首先，前3个码字都在字典中（目前只有ASCII码表的部分）查得到，直接把它们找出来。

97 98 98 256 259 99
↓  ↓  ↓
a  b  b

Step2

根据已有的文本"abb"，按照前文中的编码算法建立字典：

索引	短语
...	...
256	ab
257	bb

Step3

读入下一个码字256，256在字典中已经有了，是"ab"

97 98 98 256 259 99
↓  ↓  ↓   ↓
a  b  b   ab

到此已经解码出的文本是"abbab"，继续Step2中的编码，继续生成字典：

索引	短语
...	...
256	ab
257	bb
258	ba

注意在这里生成以上字典后，解码出的文本并没有用尽，在把"ba"放入字典后，P="a"，C="b"，按照编码方法，应该令P=P+C="ab"，"ab"存在于字典中，那么就下一个字符到C，但是下一个字符还没有解码出，我们将它设为C='?'。

因此下一个短语为"ab?"，假设它在词典中不存在，那么就应该将P+C="ab?"添加到字典中

索引	短语
...	...
256	ab
257	bb
258	ba
259	ab?

添加到字典中后，按照编码方法，令P=C="?"。

Step4

而由上一个码字得到的文本中，假设了下一个码字的第一个字母为'?'。

下一个码字是259，259对应字典中为"ab?"，因此下一个码字的第一个字母是'a'，所以'?'='a'。

     256 259
      ↑   ↑
a b b ab ???....
      ab ? = 259
         ab? = 259

如上，可以求得'?'=a，

索引	短语
...	...
256	ab
257	bb
258	ba
259	aba

如上可以总结规律

初始化，码字（索引号）pW=0，cW=0
从文件中读第一个码字到cW，第一个码字可以从字典中直接解出短语dict[cW]。解出后令pW = cW
读下一个码字到cW
如果cW在字典内
1. 从字典中取出短语，并输出到解压文件流
2. 令P = dict[pW]，C= dict[cW][0]（dict[cW][0]指字典中cW码字对应的短语的第一个字符），P+C放入到字典中
3. 令pW = cW
如果cW不在字典内
1. 令P = dict[pW],c=dict[pW][0]
2. 将P+C输出到字符流
3. 将P+C输出到字典
重复3. ~ 5.

如此可以解出剩下的文本。

数字查找树

在具体的实现中，字典使用的数据结构叫”数字查找树“。

简而言之，在建立词典过程中，新短语一定是已有的短语加一个字符，那么就把这个字符接到已有的那条短语后面。

数字查找树的根节点下的第一级子节点是ASCII码0~255项。

如果要插入'ab'，就把'b'挂在'a'下面，得到a → b。

如果再要插入'abc'，就把'c'挂在'a-b'下面，得到a → b → c。

如果再要插入'abd'，就把'd'挂在'a-b'下面，如此：

a → b → c
     ↘
       d

如果让b来记录它的子节点（c、d），b需要维护一个不断增长的数组，会比较麻烦，所以实际实现中，让b的子节点以链表的形式连接在一起，b只需要知道链表的第一个是谁，就可以找到它的所有子节点。

（横向为上下级关系，纵向为同级关系。把d称为c的邻居）

a → b → c
        ↓
        d

再插入abcd：

a → b → c → d
        ↓
        d

再插入abe：

a → b → c → d
        ↓
        d
        ↓
        e

....

数字查找树与LZW的字典建立过程十分契合。

数字查找树代码实现

节点的结构体表示：

struct {
  int suffix;
  int parent, firstchild, nextsibling;
} node

suffix是指当前节点指代的字符
parent是指上一个字符，也就是上级节点。
fistchild是指节点的下一个字符
nextsibling是指节点的邻居字符，借助nextsibling可以遍历这个树杈上的所有子节点。

数字查找树的初始化，首先要把ASCII的256个字符放进去，作为第一级子节点。

void InitDictionary( void){
    int i;
    for( i=0; i<256; i++){
        dictionary[i].suffix = i;   // 每一个ASCII字符
        dictionary[i].parent = -1; // 因为是第一级节点，所以没有上级节点
        dictionary[i].firstchild = -1; // 还没有添加子节点
        dictionary[i].nextsibling = i+1; // 右侧相邻的子节点
    }
    dictionary[255].nextsibling = -1; // 最后一个ASCII字符，第一级的最后一个节点，没有下一个相邻的节点了
    next_code = 256; // 新节点会一次放到dictionary列表的后面
}

向数字查找树中添加新节点：

void AddToDictionary( int character, int string_code){  // 把character挂到string_code后面
    int firstsibling, nextsibling;                     
    if( 0>string_code) return;   //  string_code不合法
    dictionary[next_code].suffix = character;       // 把新节点放入词典中，
    dictionary[next_code].parent = string_code;     // 父节点是string_code
    dictionary[next_code].nextsibling = -1;         // 先不考虑邻居
    dictionary[next_code].firstchild = -1;          // 没有子节点
    firstsibling = dictionary[string_code].firstchild;   // 查找string_code下已经挂了哪些节点，要把character放在他们后面
    if( -1<firstsibling){    // the parent has child   string_code下还有其他节点，找到当前的最后一个节点
        nextsibling = firstsibling;   // 从第一个结点开始，找它的邻居
        while( -1<dictionary[nextsibling].nextsibling )   // 一直找到最后一个邻居
            nextsibling = dictionary[nextsibling].nextsibling;
        dictionary[nextsibling].nextsibling = next_code;
    }else{// no child before, modify it to be the first
        dictionary[string_code].firstchild = next_code; // string_code下没有其他节点，把自己作为string_code下的第一个子节点
    }
    next_code ++;
}

查找某个短语是否在词典里：

// 本质上是在查找string_code下有没有character，两个字符之间的关系。通过不断调用，可以查出一整个短语在词典中是否存在
int InDictionary( int character, int string_code){ // character子节点和string_code上级节点
    int sibling;
    if( 0>string_code) return character;
sibling = dictionary[string_code].firstchild; // 从string_code下的第一个子节点开始找，遍历他的所有邻居节点
while( -1&lt;sibling){
    if( character == dictionary[sibling].suffix) return sibling; // 找到了
    sibling = dictionary[sibling].nextsibling; // 没找到，去找下一个
}
return -1;  // 没找到，string_code下没有character

}

文件输入与输出

将文件读写封装为按比特流输入输出。

按比特流输出输入文件在数据压缩算法中是比较实用的。

#include <stdlib.h>
#include <stdio.h>
#include "bitio.h"
BITFILE *OpenBitFileInput( char *filename){ // 打开文件比特流输入句柄
    BITFILE *bf;
    bf = (BITFILE *)malloc( sizeof(BITFILE));
    if( NULL == bf) return NULL;
    if( NULL == filename)   bf->fp = stdin;
    else bf->fp = fopen( filename, "rb");
    if( NULL == bf->fp) return NULL;
    bf->mask = 0x80;
    bf->rack = 0;
    return bf;
}
BITFILE *OpenBitFileOutput( char *filename){ // 打开文件比特流输出句柄
BITFILE *bf;
bf = (BITFILE *)malloc( sizeof(BITFILE));
if( NULL == bf) return NULL;
if( NULL == filename)   bf->fp = stdout;
else bf->fp = fopen( filename, "wb");
if( NULL == bf->fp) return NULL;
bf->mask = 0x80;
bf->rack = 0;
return bf;
}
void CloseBitFileInput( BITFILE *bf){ // 关闭文件比特流输入句柄
fclose( bf->fp);
free( bf);
}
void CloseBitFileOutput( BITFILE *bf){// 关闭文件比特流输出句柄
// Output the remaining bits
if( 0x80 != bf->mask) fputc( bf->rack, bf->fp);
fclose( bf->fp);
free( bf);
}
int BitInput( BITFILE *bf){// 读取1bit
int value;
if( 0x80 == bf-&gt;mask){
    bf-&gt;rack = fgetc( bf-&gt;fp);
    if( EOF == bf-&gt;rack){
        fprintf(stderr, &quot;Read after the end of file reached\n&quot;);
        exit( -1);
    }
}
value = bf-&gt;mask &amp; bf-&gt;rack;
bf-&gt;mask &gt;&gt;= 1;
if( 0==bf-&gt;mask) bf-&gt;mask = 0x80;
return( (0==value)?0:1);

}
unsigned long BitsInput( BITFILE *bf, int count){ // 读取许多bit
unsigned long mask;
unsigned long value;
mask = 1L << (count-1);
value = 0L;
while( 0!=mask){
if( 1 == BitInput( bf))
value |= mask;
mask >>= 1;
}
return value;
}
void BitOutput( BITFILE *bf, int bit){ // 输出1bit
if( 0 != bit) bf->rack |= bf->mask;
bf->mask >>= 1;
if( 0 == bf->mask){  // eight bits in rack
fputc( bf->rack, bf->fp);
bf->rack = 0;
bf->mask = 0x80;
}
}
void BitsOutput( BITFILE *bf, unsigned long code, int count){ // 输出许多bit
unsigned long mask;
mask = 1L &lt;&lt; (count-1);
while( 0 != mask){
    BitOutput( bf, (int)(0==(code&amp;mask)?0:1));
    mask &gt;&gt;= 1;
}

}

LZW代码实现

编码

void LZWEncode( FILE *fp, BITFILE *bf){
    int character;
    int string_code;
    int index;
    unsigned long file_length;
fseek( fp, 0, SEEK_END); // 来到文件尾
file_length = ftell( fp); // 根据文件尾得到文件长度
fseek( fp, 0, SEEK_SET); // 回到文件头
BitsOutput( bf, file_length, 4*8); // 将源文件长度输出到压缩结果的前4字节。
InitDictionary();   // 初始化词典
string_code = -1;   // 初始化string_code，即P，P一开始是空的
while( EOF!=(character=fgetc( fp))){ // 从文件中读取一个字符C
    index = InDictionary( character, string_code);  // 判断P+C是否在字典里
    if( 0&lt;=index){   // string+character in dictionary P+C在字典里，所在位置为index
        string_code = index;    // P = P + C
    }else{  // string+character not in dictionary P + C不在字典里

        output( bf, string_code);   // 将P的索引输出到压缩结果中
        if( MAX_CODE &gt; next_code){   // free space in dictionary 字典是否已满
            // add string+character to dictionary
            AddToDictionary( character, string_code);   // 将P+C放入字典
        }
        string_code = character;    // 令P = C
    }
}
output( bf, string_code);

}

解码

int DecodeString( int start, int code){     // 从索引号解码字符串 start：修改d_stack的起始点，code： 码字
    //需填充
    int i = start;
    int string_code = code;
    while(string_code>=0){
        d_stack[i] = dictionary[string_code].suffix;
        string_code = dictionary[string_code].parent;
        i++;
    }
    //d_stack[i] = dictionary[string_code].suffix;
    //i++;
    return i;
}
void WriteTo(char *dst, int *src, int size) // WriteTo函数是反向写入的，因为DecodeString中存放d_stack也是反着的。
{
int t = size;
for(t=size; t>=1;t–){
dst[size - t] = (char ) src[t-1];
}
}
void LZWDecode( BITFILE *bf, FILE *fp){
//需填充
int character;
int string_code;
int index;
unsigned long file_length;
char * text, * start;
fseek( fp, 0, SEEK_SET);
file_length = BitsInput(bf, 8 * 4); // 读取文件大小
printf("File length: %ld\r\n", file_length);
text = (unsigned char *) malloc(file_length);   // 按文件大小分配
start = text;   // 输出缓存开头
memset(text, 0x00, file_length + 1);    // 初始化输出缓存
InitDictionary();   // 初始化字典
char * end = text + file_length;    // 计算输出缓存结尾
int cW, pW;     // 声明cW和pW
int count;      // 声明count，每次解码得到的字符数
cW = input(bf);     // 读入第一个码字
*text = dictionary[cW].suffix;  // 得到第一个码字的字符
text++;
pW = cW;        // pW = cW
while(end-text&gt;0){
    cW = input(bf);     // 读入一个码字
    if(cW&lt;next_code){        // 码字在字典内
        count = DecodeString(0, cW);    // 码字解码为字符串
    }else {     // 码字不在字典内
        count = DecodeString(1, pW);  // 这里将Start设为1，是为了不覆写d_stack的第一个字符，
                                      // 即`P = dict[pW]`,`C=dict[pW][0]`
                                      // 而此时P+C=d_stack
    }
    AddToDictionary(d_stack[count-1], pW); // P+C输出到字典
    pW = cW;
    WriteTo(text, d_stack, count); // P+C输出到文件
    text += count;
    //while( 0&lt;count--) printf(&quot;%c&quot;, (char)(d_stack[count]
}
fwrite(start, 1, file_length, fp);

}

压缩效果

测试了纯文本文件

纯文本（本文）压缩：17636Byte->12212Byte，压缩率69.2%
Word文档压缩：61780Byte->83166Byte，压缩率134%。。。。
PDF文档压缩：1233370Byte->1412280Byte，压缩率114%
BMP图像压缩:

174134Byte->75004Byte，压缩率43%
MP4视频压缩：3075295Byte->3760296Byte，压缩率122.7%

综上，对于原始的文件，如纯文本、BMP图像，都有压缩效果

对于已经采用了一些压缩处理的文件，如Word（本质为文本和图片的Zip打包），MP4（使用H264压缩），PDF，均出现文件大小反而增加的情况。